UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Learning image-based localization Meng, Lili 2017

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2017_november_meng_lili.pdf [ 66.45MB ]
Metadata
JSON: 24-1.0355764.json
JSON-LD: 24-1.0355764-ld.json
RDF/XML (Pretty): 24-1.0355764-rdf.xml
RDF/JSON: 24-1.0355764-rdf.json
Turtle: 24-1.0355764-turtle.txt
N-Triples: 24-1.0355764-rdf-ntriples.txt
Original Record: 24-1.0355764-source.json
Full Text
24-1.0355764-fulltext.txt
Citation
24-1.0355764.ris

Full Text

Learning Image-based LocalizationbyLili MengB.Sc., University of Science and Technology Beijing, 2008M.Sc., University of Science and Technology Beijing, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Mechanical Engineering)The University of British Columbia(Vancouver)September 2017c© Lili Meng, 2017AbstractImage-based localization plays a vital role in many tasks of robotics and computervision, such as global localization, recovery from tracking failure, and loop closuredetection. Recent methods based on regression forests for camera relocalizationdirectly predict 3D world locations for 2D image locations to guide camera poseoptimization. During training, each tree greedily splits the samples to minimizethe spatial variance. This thesis develops techniques to improve the performancecamera pose estimation based on regression forests method and extends its appli-cation domains. First, random features and sparse features are combined so thatthe new method only requires an RGB image in the testing. After that, a label-freesample-balanced objective is developed to encourage equal numbers of samplesin the left and right sub-trees, and a novel backtracking scheme is developed toremedy the incorrect 2D-3D correspondence in the leaf nodes caused by greedysplitting. Furthermore, the methods based on regression forests are extended to uselocal features in both training and test stages for outdoor applications, eliminatingtheir dependence on depth images. Finally, a new camera relocalization method isdeveloped using both points and lines. Experimental results on publicly availableindoor and outdoor datasets demonstrate the efficacy of the developed approaches,showing superior or on-par accuracy with several state-of-the-art baselines.Moreover, an integrated software and hardware system is presented for mo-bile robot autonomous navigation in uneven and unstructured indoor environments.This modular and reusable software framework incorporates capabilities of percep-tion and autonomous navigation. The system is evaluated are in both simulationand real-world experiments, demonstrating the efficacy and efficiency of the devel-oped system.iiLay SummaryImage-based localization plays a vital role in many robotics and computer visiontasks. Deep learning research has attempted to overcome the challenges of usingdepth images for camera relocalization, making it applicable for outdoor scenes.The accuracy level, however, was low compared to other methods. The main con-tributions of this work resulted in increased prediction accuracy with reduced train-ing time and eliminating the dependence on depth images which broadened theapplication to outdoor scenes.Robots are operating in shared spaces indoors with people more and more.However, robots still face challenges operating in uneven and unstructured envi-ronments, such as those designed for wheelchair mobility. The other contributionof this thesis is an integrated software and hardware system for autonomous mobilerobot navigation in uneven and unstructured environment. The thesis details the re-sults of extensive experiments in both simulation and real-world, demonstrating theefficacy and efficiency of the system.iiiPrefaceThis dissertation is based on the research work conducted in collaboration withmultiple researchers at The University of British Columbia under the guidance andsupervision of Prof. Clarence W. de Silva.• Chapter 3 addresses the problem of estimating camera pose relative to aknown scene, given a single RGB image. This chapter is based the fol-lowing published work: Lili Meng, Jianhui Chen, Frederick Tung, JamesJ. Little, and Clarence W. de Silva. ”Exploiting Random RGB and SparseFeatures for Camera Pose Estimation.” In 27th British Machine Vision Con-ference(BMVC), 2016.• Chapter 4 addresses the problem of camera pose learning based on back-tracking regression forests. This chapter is based on the accepted publi-cation: Lili Meng, Jianhui Chen, Frederick Tung, James J. Little, JulienValentin and Clarence W. de Silva. ”Backtracking Regression Forests forAccurate Camera Relocalization.” In IEEE/RSJ Intl. Conf. on IntelligentRobots and Systems(IROS), 2017.• Chapter 5 addresses exploiting points and lines in regression forests forRGB-D camera relocalization. This chapter is based on the following un-der reviewed submission: Lili Meng, Frederick Tung, James J. Little, JulienValentin and Clarence W. de Silva. ” Exploiting Points and Lines in Regres-sion Forests for RGB-D Camera Relocalization.” IEEE International Con-ference on Robotics and Automation (ICRA), 2018.• Chapter 6 addresses the problem of mobile autonomous navigation in unevenivand unstructured world with localization learning and other methods. Thischapter is based on the following accepted publication: Chaoqun Wang∗, LiliMeng∗, Sizhen She, Ian M. Mitchell, Teng Li, Frederick Tung, Weiwei Wan,Max. Q.-H. Meng, and Clarence W. de Silva. ”Autonomous Mobile RobotNavigation in Uneven and Unstructured Indoor Environments.” In IEEE/RSJIntl. Conf. on Intelligent Robots and Systems(IROS), 2017. (∗Indicatesequal contribution)For the BMVC, IROS on camera relocalization, and ICRA paper/submission,the author identified the problem, formulated the solution, designed and imple-mented most of the experiments. For the IROS paper on autonomous navigation,the author is equally contributed. For the BMVC paper, the author implemented thecamera relocalization using a single RGB image while the implementation on thesports camera calibration was mostly done by Jianhui Chen, which is not includedin this thesis. For the IROS backtracking regression forests and journal paper, Jian-hui Chen contributed ideas in discussion and provided suggestions on the coding,while the author implemented all the experiments. For all the BMVC, IROS pa-per and ICRA submission on camera relocalization, Dr. Frederick Tung and Prof.James J. Little helped on ideas discussion and the manuscripts edition. For the au-tonomous navigation paper, Chaoqun Wang helped on the system experiment, anddesigned the global path planning algorithm, which is not included in the thesis.Sizhen She helps on the platform setup at the preliminary stage for the Pioneerrobot which is not included in thesis. Teng Li helped part of the robot autonomousnavigation environment setup in the Industrial Automation Lab, especially in theperiod lab relocation in the ICICS building.Prof. Clarence De Silva contributed in at all stages of the projects includingresearch proposal, problem identification, experimental equipment, infrastructureand environments developments, research funding, supervision, guidance, thesiswriting and so on. Prof. James J. Little contributed ideas and provided feedback forthe camera localization learning. Prof. Ian M. Mitchell contributed ideas, discus-sions and manuscript revision on the autonomous navigation IROS paper. WeiweiWan contributed the discussion and manuscript revision on the IROS autonomousnavigation paper. Julien Valentin helped on the discussion on running speed ofvregression forests and suggested to add other experiments in IROS backtrackingregression forests paper and ICRA submission.viTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxivAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxvi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Research motivation . . . . . . . . . . . . . . . . . . . . . . . . 11.1.1 Image-based localization methods . . . . . . . . . . . . . 11.1.2 Mobile robot autonomous navigation . . . . . . . . . . . 31.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.1.1 Image-based camera relocalization . . . . . . . . . . . . . 6vii2.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.1 Decision tree . . . . . . . . . . . . . . . . . . . . . . . . 102.2.2 The randomness model . . . . . . . . . . . . . . . . . . . 132.2.3 Random forest ensemble . . . . . . . . . . . . . . . . . . 142.3 Camera pose estimation . . . . . . . . . . . . . . . . . . . . . . . 142.3.1 RGB camera pose estimation . . . . . . . . . . . . . . . . 142.3.2 RGB-D camera pose estimation . . . . . . . . . . . . . . 153 Image-based localization using regression forests and keyframe poserefinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.2.1 Random RGB features and labels . . . . . . . . . . . . . 183.2.2 Random forests for 2D-3D correspondence regression . . 203.2.3 Pose refinement . . . . . . . . . . . . . . . . . . . . . . . 233.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.3.1 Camera relocalization on Microsoft 7 Scenes dataset . . . 253.3.2 Camera relocalization on the Stanford 4 Scenes dataset . . 273.3.3 Implementation details . . . . . . . . . . . . . . . . . . . 293.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 Image-based localization using backtracking regression forests . . . 324.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344.2.1 Image features . . . . . . . . . . . . . . . . . . . . . . . 354.2.2 Scene coordinate labels . . . . . . . . . . . . . . . . . . . 374.2.3 Backtracking regression forest training . . . . . . . . . . 384.2.4 Backtracking in regression forests prediction . . . . . . . 394.2.5 Camera pose optimization . . . . . . . . . . . . . . . . . 404.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414.3.1 Indoor camera relocalization . . . . . . . . . . . . . . . . 424.3.2 Outdoor camera relocalization . . . . . . . . . . . . . . . 534.3.3 Implementation details . . . . . . . . . . . . . . . . . . . 56viii4.3.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 574.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585 Exploiting points and lines in regression forests for RGB-D camerarelocalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615.3 Problem setup and method overview . . . . . . . . . . . . . . . . 625.4 Regression forest with point and line features . . . . . . . . . . . 625.4.1 Points sampling and scene coordinate labels . . . . . . . . 625.4.2 Regression forest training . . . . . . . . . . . . . . . . . 655.4.3 Weak learner model . . . . . . . . . . . . . . . . . . . . 655.4.4 Training objective . . . . . . . . . . . . . . . . . . . . . 655.4.5 Regression forest prediction . . . . . . . . . . . . . . . . 665.5 Camera pose optimization . . . . . . . . . . . . . . . . . . . . . 665.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675.6.1 Evaluations on Stanford 4 Scenes dataset . . . . . . . . . 675.6.2 Evaluations on Microsoft 7 Scenes dataset . . . . . . . . . 695.6.3 Evaluations on TUM dynamic dataset . . . . . . . . . . . 735.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766 Mobile robot autonomous navigation in uneven and unstructuredenvironments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 776.2 Hardware and software platform . . . . . . . . . . . . . . . . . . 786.3 Environment representation . . . . . . . . . . . . . . . . . . . . . 806.3.1 3D SLAM . . . . . . . . . . . . . . . . . . . . . . . . . 806.3.2 Multilayer maps and traversable map . . . . . . . . . . . 846.4 Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856.4.1 Regression forests based global localization . . . . . . . . 866.4.2 Local localization . . . . . . . . . . . . . . . . . . . . . . 866.5 Planning and execution . . . . . . . . . . . . . . . . . . . . . . . 866.5.1 Global and local planning . . . . . . . . . . . . . . . . . 86ix6.5.2 Plan execution . . . . . . . . . . . . . . . . . . . . . . . 876.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 876.6.1 Simulation experiments . . . . . . . . . . . . . . . . . . 876.6.2 Real-world experiments . . . . . . . . . . . . . . . . . . 906.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917 Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . 927.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927.2 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . 937.2.1 Speed up image based localization . . . . . . . . . . . . . 937.2.2 Transfer learning . . . . . . . . . . . . . . . . . . . . . . 937.2.3 Active image-based localization learning for autonomousrobot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96xList of TablesTable 3.1 Relocalization results for the 7 Scenes dataset. Median per-formance is shown for the proposed method on all scenes againstthree state-of-the-art methods: SCRF [93], PoseNet [53] andBayesian PoseNet [51]. . . . . . . . . . . . . . . . . . . . . . 27Table 3.2 Camera relocalization results for 4 Scenes dataset. The per-centage of correct frames is given (within 5cm translational and5◦ angular error), and median of the proposed method on 4Scenes dataset against three state-of-the-art methods: ORB+PnP,SIFT+PnP, and Bayesian PoseNet[51]. The best performance ishighlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30Table 4.1 Image features. The present method uses random features [93],SIFT features [64] and Walsh-Hadamard transform (WHT) fea-tures [43] according to different scenarios. The choice of fea-tures considers robustness and computational efficiency. . . . . 35Table 4.2 Camera relocalization results for the indoor dataset. Thepercentage of correct frames (within 5cm translational and 5◦angular error) of the developed method is shown on 4 Scenesdataset against four state-of-the-art methods: ORB+PnP, SIFT+PnP,Random+SIFT [70], MNG [103]. The best performance is high-lighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42xiTable 4.3 Camera relocalization results for the 7 Scenes dataset .Cor-rect percentage performance is shown for the developed methodon all scenes against three state-of-the-art methods: Sparse base-line [93], SCRF [93], MutliOutput [38]. The correct percentageshow the test frames within 5cm translational and 5◦ angularerror. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48Table 4.4 Median camera relocalization performance for the 7 Scenesdataset . Median performance for the present method on allscenes is shown against three state-of-the-art methods: PoseNet[53], Bayesian [51], SCRF [93]. . . . . . . . . . . . . . . . . 49Table 4.5 Camera relocalization results for the TUM dataset . The cor-rect percentage performance, median performance, and RMSEof ATE are presented. . . . . . . . . . . . . . . . . . . . . . . 51Table 4.6 Camera relocalization results for the outdoor CambridgeLandmarks Dataset. The median performance for the devel-oped method is shown against five state-of-the-art methods: Ac-tive Search without prioritization (w/o) and with prioritization(w/) [87], PoseNet [53], Bayesian PoseNet [51], CNN+LSTM[108], and PoseNet+Geometric loss[52]. . . . . . . . . . . . . 55Table 5.1 Camera relocalization results for the indoor dataset. Thepercentage of correct frames (within 5cm translational and 5◦angular error) of the developed method is shown using 4 Scenesdataset against four state-of-the-art methods: ORB+PnP, SIFT+PnP,Random+SIFT [70], MNG [103]. The best performance is high-lighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68Table 5.2 Relocalization results for the 7 Scenes dataset. Test framessatisfying the error metric (within 5cm translational and 5◦ an-gular error) are shown for the present method on all scenesagainst five strong state-of-the-art methods: SCRF [93], Multi[38],BTBRF, Uncertainty [102], AutoConext [6]. The best perfor-mance is highlighted. . . . . . . . . . . . . . . . . . . . . . . 72xiiTable 5.3 Relocalization results for the 7 Scenes dataset. Median per-formance for the present method is shown on all scenes againstfive state-of-the-art methods: PoseNet [53], Bayesian BayesianPoseNet [51], Active Search without prioritization [87], SCRF[93], BTBRF. The best performance is highlighted. . . . . . . . 72Table 5.4 Camera relocalization results for the TUM dataset . Correctpercentage performance, median performance, RMSE of ATEare shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75xiiiList of FiguresFigure 2.1 Decision tree example. Best viewed in color. (a) A tree isa set of split nodes and leaf nodes organized in a hierarchicalway. Split nodes are denoted with blue circles and leaf nodeswith green circles. (b) Each split node of a decision tree is asplit (or test) function to be applied to the incoming data. Eachleaf node stores the final answer (predictor). An example ispresented here on using a decision tree to classify whether animage represents an indoor scene or an outdoor scene. . . . . 11Figure 2.2 Decision tree test and training process Best viewed in color.(a) In the test, a split node applies a test to the input data v andsends it to the appropriate child. The process is proceeded untila leaf node is reached (red path). (b) Training a decision tree issending the entire training set S0 into the tree and optimizingthe parameters of the split nodes so as to optimize a chosenenergy function. . . . . . . . . . . . . . . . . . . . . . . . . . 11xivFigure 3.1 Camera pose estimation pipeline. During training, the sceneinformation is encoded in a random forest. In the testing stage,an initial camera pose is estimated using the predictions fromthe random forest with real-time response. Then, the initialcamera pose is used to query a nearest neighbor (NN) imagefrom keyframes. Finally, the camera pose is refined by sparsefeature matching between the test image and the NN image.In our method, the labels can constitute any information (e.g.,scene coordinate positions) associated with pixel locations. . . 17Figure 3.2 Random RGB pixel comparison features. The red star rep-resents the pixel p being computed. The distance between thered star and the blue circle represents the pixel offset as definedin Eq. 3.2. In (a), the two example features at image positionp1 and p2 give a large color difference response. In (b), thetwo features at the same image locations in a different imagegive a much smaller response compared with (a). . . . . . . . 19Figure 3.3 Training random forests for 2D-3D coordinate correspon-dence regression. A set St0 of labeled sample pixels (p,m) israndomly chosen from the entire set S0 = [S10, · · · ,ST0 ] for eachtree, where m is the 3D world coordinate label of pixel p. Thetree grows recursively from the root node to the leaf node. Thegoal is to optimize the parameters of the tree split nodes. Herespatial-variance is used to represent entropy. . . . . . . . . . 21Figure 3.4 Test phase of random forests for 2D-3D correspondence re-gression. In the testing stage, at each split node i, the fea-ture fφi is compared with the feature response τi to determinewhether to go to the left or the right child node. A particular in-put may go along different paths in different trees as indicatedby the red arrows. . . . . . . . . . . . . . . . . . . . . . . . . 22Figure 3.5 Microsoft 7 Scenes dataset. . . . . . . . . . . . . . . . . . . 26xvFigure 3.6 Pixel-wise prediction error distribution from the regres-sion forests. Heat maps show the prediction error distributiondirectly from the regression forests on the Chess and Headsscenes. Large errors occur on black screens and other texture-less regions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27Figure 3.7 4 Scenes dataset example images[103]. Each RGB-D imageis accompanied with a camera pose. Each sequence also pro-vides a textured 3D model (not used in the present thesis). . . 28Figure 4.1 Depth-adaptive random RGB pixel comparison features.The red star represents the pixel p being computed. The dis-tance between red star and blue circle represents the offset pix-els θD(p) as defined in Eq. 4.2. In (a), the two example featuresat image position p1 and p2 give a large color difference re-sponse. In (b), the two features at the same image locations ina different image give a much smaller response compared with(a). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36Figure 4.2 Training random forests for 2D-3D coordinate correspon-dence regression. A set St0 of labeled sample pixels (p,m) israndomly chosen from the entire set S0 = [S10, · · · ,ST0 ] for eachtree, where m is the 3D world coordinate label of pixel p. Thetree grows recursively from the root node to the leaf node. Thegoal is to optimize the parameters of the tree split nodes. Thetraining objective 1 is the sample-balanced objective, and thetraining objective 2 is the spatial-variance objective. . . . . . 37Figure 4.3 Decision tree using two split objectives. Best viewed in color.The split nodes are illustrated as the pie charts, which show thepercentage of samples in the left and right sub-trees. In thisfive-level tree, the split nodes of the first two levels are split us-ing the sample-balanced objective. While the rest of levels aresplit using the unbalanced objectives (e.g., the spatial-varianceobjective). Details are in Sect. 4.2.3. . . . . . . . . . . . . . . 40xviFigure 4.4 Test phase of backtracking regression forests In the test,at each split node i, the feature fφi is compared with featureresponse τi to determine whether to go to the left or to theright child node. The red arrows represent the prediction pro-cess without backtracking while the purple arrows representthe backtracking process. A particular input may go along dif-ferent paths in different trees as indicated by the red and purplearrows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41Figure 4.5 Impact of the sample-balanced objective and backtrack-ing on prediction accuracy. These figures show the accu-mulated percentage of predictions within a sequence of inlierthresholds. The proposed method with the sample-balancedobjective (red lines) consistently has a higher percentage ofinliers compared with the unbalanced objective (blue lines).Backtracking (green lines) further improves prediction accu-racy. Max number of backtracking leaves is 16 here. . . . . . 44Figure 4.6 Qualitative results for indoor dataset (from office2/gates381).Best viewed in color. The ground truth is in red and the presentestimated camera pose is in green. (a) camera trajectories. (b)several evenly sampled camera frusta are shown for visualiza-tion. the present method produces accurate camera locationsand orientations. Note: the 3D model is only for visualizationpurposes and it is not used for the present camera relocaliza-tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44Figure 4.7 Camera relocaliztion accuracy vs. number of tree levelsusing the sample-balanced objective. When the number oflevels using sample-balanced objective increases, the averageperformance (dashed line) increases. . . . . . . . . . . . . . 45Figure 4.8 Camera relocalization accuracy vs. backtracking leaf nodenumbers. The camera relocalization performance increaseswith more backtracking leaf nodes though eventually levelsout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46xviiFigure 4.9 Split ratio distribution. The heat map shows the split ratiodistribution as a function of tree depth (level). The data isfrom all split nodes in a sequence. In the first 8 levels, the dis-tribution is concentrated around 0.5 as a result of the sample-balanced objective. From level 8 to level 25, the distribution isalmost uniform. . . . . . . . . . . . . . . . . . . . . . . . . 47Figure 4.10 Camera relocalization accuracy VS number of trees . . . 47Figure 4.11 Example sequence in TUM dynamic dataset. . . . . . . . . 50Figure 4.12 Quantitative results on TUM dynamic dataset. . . . . . . . 52Figure 4.13 Failure cases on TUM dynamic dataset. . . . . . . . . . . 53Figure 4.14 Overview of the present framework of SuperSIFT for cam-era relocalization (a) Visual structure from motion [115] isemployed to autonomously generate training data (local fea-tures and their corresponding 3D point positions in the worldcoordinate). During training, regression forest is used to learnthe correspondence between local features and 3D world co-ordinates. (b) At test time, local features are extracted fromquery image, and then their 3D correspondences are estimatedfrom regression forests with backtracking. Finally the camerapose is estimated using PnP solver and RANSAC. . . . . . . 54Figure 4.15 Qualitative results for the outdoor dataset Cambridge Land-marks, King’s College. Best viewed in color. Camera frustaoverlaid on the 3D scene. The camera poses are evenly sam-pled every ten frames for visualization. Camera frusta withhollow green frames show ground truth camera poses, whilered ones with light opacity show the present estimated cameraposes. The estimated camera poses of the developed methodare very close to ground truth in spite of partial occlusion,moving objects, motion blur, large illumination and pose changes. 56Figure 4.16 Effect of backtracking leaf node numbers on camera relocal-ization accuracy in Cambridge landmarks dataset. . . . . . . 57xviiiFigure 5.1 Line segment example. (a) original RGB image (b) with LSDline features. In scenes with little texture and repetitive pat-terns which are typical in indoor environments, line featuresare more robust. . . . . . . . . . . . . . . . . . . . . . . . . . 61Figure 5.2 Depth corruption and discontinuity on line segments. (a)LSD line segments overlaid on original RGB image (b) trun-cated depth map. Effective depth information is not alwaysavailable for 2D line segments in the corresponding RGB im-age, such as the wrong depth values shown on the desk and theglass corridor areas. . . . . . . . . . . . . . . . . . . . . . . . 63Figure 5.3 3D line estimation based on sampling points. Within a pin-hole camera model, the 2D image points are evenly sampledon a 2D image line and then back-projected on the scene co-ordinate to be 3D scene points. These 3D scene points containoutliers which could be removed by RANSAC to fit a 3D linein scene coordinate. . . . . . . . . . . . . . . . . . . . . . . 64Figure 5.4 Qualitative results on Stan f ord 4 Scenes dataset. . . . . . 69Figure 5.5 Large error example on Stan f ord 4 Scenes dataset Apt2 Luke.70Figure 5.6 Qualitative results for the Heads scene in Microso f t 7 Scenesdataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71Figure 5.7 Large error examples on Microso f t 7 Scenes dataset. . . . 73Figure 5.8 Qualitative results on TUM dynamic dataset. . . . . . . . . 74Figure 5.9 Large error examples on TUM dynamic dataset. . . . . . 75Figure 6.1 The robot is navigating up the slope to the goal at the higherplatform. In the presence of staircases and slope, the robotbuilds a 3D representation of the environment for the traversablemap, and then the robot can navigate through the slope andavoid the staircases to reach the goal efficiently and safely. . . 79xixFigure 6.2 Robot hardware platform. (a) Segway robot in a sloped area.The robot base is segway RMP100 with custom installed cast-ers for safety and onboard battery pack for providing power tosensors. (b) Xtion Pro Live RGB-D camera is capable of pro-viding 30Hz RGB and depth images, with 640x480 resolutionand 58 HFV. (c) Hokuyo UTM-30LX laser scanner with range10m to 30m, and 270◦ area scanning range for localization. . . 80Figure 6.3 High-level system architecture. The robot first builds a 3DOctoMap representation for uneven environment with the present3D SLAM using wheel odometry, 2D laser and RGB-D data.Multi-layer maps from OctoMap are used for generating thetraversable map, which serves as the input for autonomousnavigation. The robot employs a variable step size RRT ap-proach for global planning, adaptive Monte Carlo localizationmethod to localize itself, and elastic bands method as the localplanner to gap the global planning and real-time sensor-basedrobot control. . . . . . . . . . . . . . . . . . . . . . . . . . . 81Figure 6.4 3D environment representation of simulated world. (a) Sim-ulated environment model in Gazebo (b) 3D OctoMap environ-ment built by our 3D SLAM using wheel odometry, a 2D laserscanner and an RGB-D sensor. . . . . . . . . . . . . . . . . . 82Figure 6.5 Octomap representation for f r1/room of TUM RGB-D SLAMbenchmark [98] with visual SLAM. (a) the sparse map fromoriginal ORB-SLAM [72], map points (black, red), keyframes(blue). (b) OctoMap representation . . . . . . . . . . . . . . 83Figure 6.6 Generation of traversable map from multilayer maps forthe Gazebo Caffe environment. (a) slope and staircase visu-alization with occupied voxels, (b) multi-layer maps and traversablemap. In the traversable map, the staircases are occupied space,while the slope area except the edge is free space, which is safefor robot to navigate. For details please refer to Sec. 6.3.2. . . 84xxFigure 6.7 Autonomous navigation in the simulated environment. Thefirst row shows images of autonomous navigation in Gazebo,and the second row shows the Rviz views of the process. . . . 88Figure 6.8 Real environment. (a) a photo of the real environment. (b)3D representation of the environment with OctoMap. Onlyoccupied voxels are shown for visualization. . . . . . . . . . . 89Figure 6.9 Multilayer maps and traversable map for real environment.(a)-(d) multiple projected layers from OctoMap, (e) the traversablemap. The staircases and slope edge are occupied while theslope is free space. . . . . . . . . . . . . . . . . . . . . . . . 89Figure 6.10 Robot autonomous navigation example in real environ-ment. The first row shows images of robot autonomous nav-igation, and the second row shows screenshots of Rviz viewsof the process. . . . . . . . . . . . . . . . . . . . . . . . . . 89Figure 6.11 Dynamic obstacle avoidance. (a) a dynamic obstacle is ap-proaching the robot. (b) The human suddenly blocks the wayin front of the robot. (c) The robot changes direction to avoidthe human. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90xxiNomenclature(µ,Σ) mean and covarianceyˆp estimated world coordinate associated to image location pΛ(S) covariance matrix of the labels in Sp a 2D location in image coordinatesPc a 3D point in camera coordinatesPw a 3D point in world coordinatest translation vectorv a feature vectorD(p) depth in image location pI(p,c) pixel lookup at location p, channel cK camera matrixR rotation matrixΘ space of split parametersθ j split parameter in tree node j{p,mp} a 2D image location and its associated 3D world locationf (.) a regression functionxxiih(p;θ) decision tree binary feature indication functionH(S) entropy of examples in set SI j Information in tree node jj tree node indexS j set of examples in tree node jSLj set of examples in the left child of tree node jSRj set of examples in the right child of tree node jxxiiiList of AbbreviationsAMCL adaptive Monte Carlo localizationAR augmented realityATE absolute trajectory errorBRF balanced regression forestsBTRF backtracking regression forestsCNN convolutional neural networkDOF degree of freedomEPnP efficient perspective-n point (camera pose estimation)GPS global positioning systemLSD Line Segment DetectorLSTM long short-term memoryNN nearest neighborORB Oriented FAST and rotated BRIEFPnP perspecative-n point (camera pose estimation)PLForests point line regression forestsRANSAC ransom sample consensusRFs random forestsRMSE root mean squared errorROS robot operating systemSCRF scene coordinate regression forestSfM structure from motionSIFT scale invariant feature transformationSLAM simultaneous localization and mappingSURF speeded up robust featuresxxivVR virtual realityWHT Walsh-Hadamard transformxxvAcknowledgmentsIt would be an impossible task to list everyone who has contributed to my researchand study during my PhD, but I am sincerely grateful to everyone who has helpedme in various of ways.I am very fortunate to be in The Institute for Computing, Information and Cogni-tive Systems (ICICS) which is a multidisciplinary research institute that promotescollaborative research in advanced technologies systems. I am exposed both tomechatronics and computer science here, and most importantly, the great people inboth fields.I’d like to express my sincere gratitude to my supervisor Prof. Clarence de Silvafor providing me the research freedom and supporting my research in various ofways all these years in UBC. I’d like to express my deep gratitude to Prof. JimLittle who provides many delightful encouragements, insightful views and inspira-tional discussions not only for my doctoral research but also my view on robotics.I’d like to thank Prof. Bob Woodham who introduced me to the field of ComputerVision in his vivid computer vision course with candies (students who ask and an-swer questions in his class could get candies!). I’d like to thank Prof. Ian Mitchellfor introducing me to the computational robotics field, for his kind support andnumerous of discussions on various of ideas in robotics.I would like to express my sincere gratitude to Jianhui Chen and Frederich Tungon the numerous of inspirational and fruitful discussions, patient coding reviews,detailed paper writing suggestions and warm friendship. I would like to expressmy gratitude to Chaoqun Wang for a great collaborator on autonomous navigationand I learned a lot of path planning from him. Thanks Sizhen She a lot for helpingsetup the robot platform together. Thanks a lot Victor Gan, Neil Traft and AnkurxxviGupta on the discussion of various questions on robotics with long distance run-ning. Julieta Martinez and Alireza Shafaei, thank you so much for your insightfulviews and helpful discussion on many computer vision problems.Thanks Ji Zhang and Weiwei Wan for contributing fruitful discussion and pro-viding valuable suggestions on visual odometry and paper writing.I would like to express my sincere gratitude to Yu Du, Yunfei Zhang, Teng Li,Min Xia, Yanjun Wang, Haoxiang Lang Muhammad Tufail, Jiahong Chen, SimonGong and all the labmates for the great time and various of support.Lastly, and most importantly, I wish to acknowledge my gratitude for my fam-ily and Jesus Christ. I deeply thank my family for their unconditional love, pa-tience, and support. Lord Jesus, thank you so much for coming into my life andbeing in me, for lighting my life and career, for filling me with your love, peaceand grace, and for bringing all these amazing people in my life!xxviiChapter 1IntroductionWe keep moving forward, opening new doors, and doing new things,because we’re curious and curiosity keeps leading us down newpaths. — Walt Disney1.1 Research motivation1.1.1 Image-based localization methodsWith the advance of machine learning methods, the field of robotics is undergoinga transformation from sensing and executing fixed tasks in fixed environments toexhibiting increased adaptability for autonomous operation in dynamic, unknown,unstructured and flexible environments. In order to operate autonomously in un-structured environments, the robot must be capable of interacting with its environ-ment in an intelligent way, which implies that an autonomous robot must be ableto correctly and comprehensively sense the environment and then perform actionsbased on the sensed information. Sensing is the preliminary step for all tasks ina sense-plan-act scheme. Localization, which is the problem of determining thelocation and orientation of a robot relative to a known environment, is a basic per-ceptual problem for mobile robots [101].Image-based localization (also referred to as camera relocalization in literature)is one of the fundamental problems in robotics, computer vision and virtual reality.1It is the problem of estimating the position and orientation of the camera relativea known scene, given a single RGB/RGB-D image. The global positioning system(GPS), which is widely used in outdoor environments, is not available in indoorenvironments and with the presence of skyscrapers and other interfering structuresin outdoor environments. Moreover, the GPS may have approximately 2−10 me-ters of error. As a result, it poses serious problem for autonomous robots, suchas mobile robots, self-driving cars and unmanned aerial vehicles. Light-weightand affordable visual sensors are an appealing alternative to GPS and other expen-sive sensors. Recent consumer robotics products such as iRobot 980 and Dyson360 eyes are already equipped with visual simultaneous localization and mapping(SLAM) techniques, enabling them to effectively navigate in complex environ-ments. Meanwhile, in order to insert virtual objects in an image sequence (i.e., inaugmented reality applications), camera poses have to be estimated in a visuallyacceptable way.Camera relocalization with a single RGB-D image or RGB image belongs tothe subject of global localization in robotics research and development. However,global localization may also use other sensors such as laser or sonar. Local andglobal localization are two types of localization [101]. Local localization concernsestimation of the relative robot pose from a previous state. It aims to compensatefor robot motion noise from sensors (e.g., wheel odometry). In global localization,the initial pose and previous motion information are unknown. Global localiza-tion is more difficult than local localization as there is no previous pose/motioninformation available and the environment can be extensive and have repeatableobjects. It plays a critical role as it can help solve the kidnapped robot problemand allow the robot to recover from severe pose errors, which is essential for trulyautonomous robots.The image-based camera localization with machine learning methods is an at-tractive field for advancing the state-of-the-art of intelligent robotics because of thelow cost of vision sensors and scalability of machine learning methods. Randomforests (RFs) are popular and successful machine learning method because of theirspeed, robustness and learning abilities [21]. RFs have been used for camera relo-calization by Shotton et al.[93]. In Scene Coordinate Regression Forests (SCRF),a regression forest is employed, which is trained to have capabilities of predicting2any pixel’s correspondence to its 3D point position in the scene’s world coordi-nate. Then the camera pose is estimated using a robust optimization scheme withpredicted world points. Methods based on deep learning also have been used incamera re-localization. For example, PoseNet [53] trains a convolutional neuralnetwork to regress the 6-DOF camera pose from a single RGB image in real-time.However, the accuracy is much worse than the methods based on random forests.Therefore, this thesis focuses on a machine learning method random forests forimage-based camera relocalization.1.1.2 Mobile robot autonomous navigationWith a rapidly aging population and a shortage of workforce, autonomous robotssuch as self-driving cars, autonomous delivery robots, smart wheelchairs, and homeassistant robots play a more an more important role in our life. According to theUNs population projections, the older population (aged 65 and over) in 2050 willbe about 16.7 percent of the total population in many parts of the world, which ismore than double that of 2015 [42]. The “old-age dependency ratio”, which is theratio of old people to those of working age, will grow even faster. In 2015 the worldhad 15 persons aged 65 and over for every 100 adults between the ages of 25 and64. By 2050 that number is projected to have risen to 30 [42]. A significant por-tion of the aging population is expected to need physical and cognitive assistance.Confidence in the ability to undertake various tasks is core to one’s psychologicalfunctioning [69], and a greater sense of control of life can be positively correlatedwith a reduced mortality rate. Yet, the shortage of space and staff in nursing homesand other care facilities is already causing problems. Therefore, there exists anurgent need to develop robot-assisted systems. A great deal of attention and re-search is directed to assistive systems aimed at promoting aging-in-place, therebyfacilitating living independently as long as possible.Significant progress has been achieved in recent decades in advancing the state-of-the-art of mobile robot technologies. These robots are operating more and morein unknown and unstructured environments, which requires a high degree of flex-ibility, perception, motion and control. Companies such as Google and Uber aredeveloping advanced self-driving cars and expecting to present them to market in3the next few years. Various mobile robots are roaming in factories and warehousesto automate the production lines and inventory, saving workers from walking dailymarathons [25]. Robot vacuum cleaners such as Roomba and Dyson360 eyes aremoving around in the house to help clean the floors. Personal robots such as PR2[46, 67] and Care-O-bot [82] have demonstrated the ability to perform a variety ofintegrated tasks such as long-distance navigation and complex manipulation.Mobile robots that work safely and autonomously in uneven and unstructuredenvironments still pose great challenges, such as navigation through sloped areasinstead of staircases. However, little work has focused on an integrated system ofautonomous navigation in sloped and unstructured indoor areas, which are com-mon in many modern buildings.In this backdrop, the present thesis focuses on an integrated software and hard-ware system for autonomous mobile robot navigation in indoor environments thatare designed for and shared with people. It is believed that a functioning robot sys-tem of this type will inspire and benefit technology development and the generalpublic.1.2 ContributionsThe main contributions of this thesis can be summarized as follows:• Three camera relocalization methods are developed in this thesis. The firstone is to use just RGB images in the testing stage to estimate camera pose.The second one contains a sample balanced decision-tree objective functionand a backtracking search scheme. The third one employs both point andline features to estimate the camera pose. The state-of-the-art performancesof the developed methods are demonstrated on publicly available camerarelocalization benchmarks against several strong baselines.• For autonomous navigation, an integrated software and hardware architec-ture for autonomous mobile robot navigation in 3D uneven and unstructuredindoor environments is presented. The integrated system is evaluated both insimulation and on real scenarios, demonstrating the efficacy of our methods,providing some insight for more autonomous mobile robots and wheelchairsworking around us.41.3 Thesis outlineThe thesis is organized into seven chapters. It starts with an introduction to theimportance of camera relocalization and motivation for image-based localizationlearning methods. Chapter 2 introduces related work and background on the re-gression forests and camera pose estimation that are used in our work. Chapter 3presents the proposed camera relocalization method, which only uses a single RGBimage in testing. Chapter 4 discusses the problem of accurate camera relocalizationwith the developed sample-balance scheme and backtracking technique. Chapter5 presents the developed method that uses both point and line features in camerapose estimation. Chapter 6 presents an integrated system for autonomous robotnavigation. Chapter 7 provides conclusions on the topic of image-based localiza-tion learning, and presents several promising avenues for possible future work.5Chapter 2BackgroundIf I have seen farther it is by standing on the shoulders of Giants.— Isaac NewtonThis chapter presents an overview of previous work on image-based camerare-localization and autonomous navigation of mobile robots. Moreover, somebackground is provided of the random forests method and camera pose estima-tion which represents the foundation for the localization method that is developedin the present work.2.1 Related work2.1.1 Image-based camera relocalizationImage-based camera relocalization has been widely studied in the context of largescale global localization [53, 75], recovery from tracking failure [33, 54], loopclosure detection in visual SLAM [111], global localization in mobile robotics[23, 89], and sports camera calibration [17]. Moreover, place recognition[4, 30,99, 100] could be seen as a special case of camera relocalization with the focuson recognizing whether the places have been visited or not, without providing anaccurate 6D pose estimation. Approaches based on local features, keyframes, ran-dom forests and deep learning are four categories of camera pose estimation. Othersuccessful variants and hybrid methods also [83] exist.6Local feature based approachesLocal feature based methods [75, 89] usually match the descriptors extracted fromthe incoming frame and the descriptors stored in the database. Then the combina-tion of the perspective-three-point [31] and RANSAC [28] is usually employed todetermine camera poses. The local features (e.g. commonly used SIFT [64] andSURF [5]), can represent the image local properties, so they are more robust toviewpoint changes as long as a sufficient number of keypoints can be recognized.However, these methods have to store a large database of descriptors and requireefficient approximate nearest neighbor search methods for feature matching. Muchrecent work has focused on efficiency [87], scalability [88], and learning of featuredetection, description, and matching [60].Keyframe based approachesMethods based on keyframes [27, 32] hypothesize an approximate camera poseby computing whole-image similarity between a query image and keyframes. Forexample, randomized ferns encode an RGB-D image as a string of binary codes.They have been used to recover from tracking failure [33] and detect loop closurein SLAM [111]. Place recognition [30, 100] uses keyframe based camera relo-calization [72, 110]. First, the incoming frame is encoded by bag of visual words[96], and the keyframe database is queried for the most similar keyframe [72]. Sec-ond, the sparse features are extracted from the incoming frame and the most similarkeyframe in the database, and the camera pose is estimated through the 2D-3D cor-respondence. However, these methods provide inaccurate matches when the queryframe is significantly different from the stored keyframes.Random forests based approachesApproaches based on random forests for image-based camera relocalization havegained interest since the introduction of scene coordinate regression forests (SCRF)[93]. Random forests are used as regression for camera relocalization so we alsorefer to them as regression forests. These approaches first employ a regressionforest to learn an estimation of each 2D image pixel’s corresponding 3D points inthe scene’s world coordinates with training RGB-D images and their corresponding7ground truth poses. Then camera pose optimization is conducted by an adaptedversion of preemptive RANSAC. The features in random forests are based on fastrandom pixel comparisons. A similar technique has been developed by Lepetit andFua [60] for keypoint recognition and image registration. These simple and fastpixel comparison features suffice to work for detection and tracking even underlarge perspective and scale variations.Random forests based methods do not need to compute ad hoc descriptors andsearch for nearest-neighbors, which are time-consuming steps common in localfeature based and key-frame based methods by shifting much computational bur-den to the training stage. They have been used in many applications. For example,Shotton et al.[94] use the random pixel comparison feature from the depth imagein the random forests for per-pixel classification with the application of classifyingwhich body parts the pixel belongs. Both [60] and [94] use synthetic images underdifferent views for data augmentation in the training set. In contrast to [60, 94]using random features in random forests as a classification, scene coordinate re-gression forests (SCRF) formulate the 2D-3D correspondence search problem as aregression problem which is solved by random forests. Because the environmentgenerally has repeated objects, such as similar chairs in an office room, the randomforests have multi-outputs from an input, resulting in ambiguities. To solve thisproblem, Guzman et al.[38] proposed a hybrid discriminative-generative learningarchitecture to choose the optimal camera poses in SCRF. To improve the camerarelocation accuracy, Valentin et al.[102] exploit uncertainty from regression forestsby using a Gaussian mixture model in leaf nodes.However, these methods need RGB-D images for both training and test, con-straining their applications. [6, 70, 103] extended the random forest based meth-ods to use RGB images at test time, while depth images are still needed to get theground truth labels. Valentin et al.[103] proposed multiple accurate initial solu-tions and defined a navigational structure over the solution space that can be usedfor efficient gradient-free local search. However, their method needs an explicit 3Dmodel as the auxiliary of the RGB image at the testing stage. We have proposeda method that only require RGB pixel comparison features in [70], eliminating thedependency on the depth image at test time. Furthermore, the method integratesrandom RGB features and sparse feature matching in an efficient and accurate way,8allowing the method to be applied to highly dynamic scenes. Cavallari et al.[14]have extended SCRF for online camera relocalization by adapting a pre-trainedforest to a new scene on the fly.Deep learning based approachesDeep learning [59] has led to rapid progress and impressive performance on a va-riety of computer vision tasks, such as visual recognition [41, 95] and object de-tection [80]. There is emerging work [51, 53, 108] for camera re-localization.For example, PoseNet [53] trains a convolutional neural network to regress the6 degree-of-freedom (DOF) camera pose from a single RGB image in real time.Bayesian PoseNet [51] applies an uncertainty framework to the PoseNet by aver-aging Monte Carlo dropout samples from the posterior Bernoulli distribution ofthe Bayesian ConvNet’s weights. Besides the camera pose, it also estimates themodel’s relocalization uncertainty. [116] uses CNNs to regress pixels world coor-dinates, and it has larger camera relocalization error but better performance on pre-dicting world coordinates of pixels compared to SCRF, which indicates that betterprediction on world coordinate does not necessarily lead to better final camera poseestimation due to the outlier removal capability of camera pose optimation usingPreemptive RANSAC. [62] presents a dual-stream CNN with both color imagesand depth images as the network inputs, but still has almost an order of magnitudeof error compared to methods based on regression forests. To improve the PoseNetaccuracy, geometric loss functions is explored in [52] and LSTM units are used onthe CNN output to capture the contextual information. However, these methodsbased on deep learning have much lower overall accuracy compared with methodsbased on random forests in indoor scenes and methods based on local features inoutdoor scenes. Moreover, high-end GPU dependency and high power consump-tion make these deep learning based methods less favorable for energy sensitiveapplications such as mobile robots.2.2 Random forestsRandom forests (RFs) are one of the most popular machine learning tools becauseof their speed, robustness and generalization [9, 19]. RFs are broadly applied in9many domains, including natural language processing tasks such as language mod-eling [117] and financial tasks such as stock price prediction [77] and computervision tasks such as human pose recognition [94]. A random forest is an ensembleof decision trees. In a random forest, a fraction of data and a particular number offeatures are selected at random to train a decision tree. Now a single decision treeis discussed, followed by the tree ensemble method.2.2.1 Decision treeA decision tree is a binary tree data structure made of a collection of nodes orga-nized in a hierarchical way as shown in Fig. 2.1 (a). The decision tree solves acomplex problem by running a sequence of simpler tests. For a given input object,a decision tree estimates an unknown property of the object by asking successivequestions about its known properties. Which question to ask next depends on theanswer of the previous question and this relationship is represented graphically asa path through the tree [19]. The final answer is stored in the leaf node which is theterminal node of the path that the object follows. Consider the simple example ofdeciding whether a photo represents an indoor scene or an outdoor scene as illus-trated in Fig. 2.1 (b). Suppose that only the image pixel information is available.Then, the questions can start from whether the top of the image is blue or not. Ifthe answer is yes, then that part might be the sky, and the whole image data is thensent to the right child. Based on this answer, another question can be asked, for in-stance if the bottom part is green. If it is, it might be the grass, which indicates thatthe photo has high probability of showing outdoor scene. Successive questions andanswers are followed up to the leaf node. The more questions asked and answered,the higher the confidence on the final prediction.Data Points and Features: In decision trees, a generic data point is denoted bya vector v = (x1,x2, ...,xd) ∈ Rd , where the components xi represent an individualmeasurable attribute of a data point, called feature. The number of features dependson the property of the data point. In theory, the dimensionality of d can be verylarge, or even infinite. In practice, only a small important portion of d is used asneeded.10trueIs top part blue?truetruetruefalsefalsefalsefalsetruefalseIs bottom !part green?Is bottom !part blue?OutdoorIndoorS split node leaf node (a) (b)Figure 2.1: Decision tree example. Best viewed in color. (a) A tree is a set ofsplit nodes and leaf nodes organized in a hierarchical way. Split nodesare denoted with blue circles and leaf nodes with green circles. (b) Eachsplit node of a decision tree is a split (or test) function to be appliedto the incoming data. Each leaf node stores the final answer (predic-tor). An example is presented here on using a decision tree to classifywhether an image represents an indoor scene or an outdoor scene.split node leaf node 01 23 4 5 6node 0node 1 node 2S2 = SR0S1 = SL0S0VTest Training (a) (b)SL1 SR1 SL2 SR2Figure 2.2: Decision tree test and training process Best viewed in color.(a) In the test, a split node applies a test to the input data v and sendsit to the appropriate child. The process is proceeded until a leaf nodeis reached (red path). (b) Training a decision tree is sending the entiretraining set S0 into the tree and optimizing the parameters of the splitnodes so as to optimize a chosen energy function.11Training Points and Training Sets: A training point is a data point with a knownground truth label, and these feature-label correspondence can be used to computethe tree parameters. A training set S0 is a collection of different training data pointsas shown in Fig. 2.2 (b).Training a Decision Tree: The training entails finding the parameters of splitfunctions stored in the split nodes by optimizing a specified objective functiongiven the feature-label correspondences. The process proceeds in a greedy manner.At each node j, depending on the subset of the incoming training set S, the functionthat best splits S j into SLj and SRj is learned. Fig. 2.1 (b) shows the training process.S j represents the subset of incoming training data before split, while SLj and SRjrepresent the subset after the split at the node j. SLj = S2 j+1 and S j=SLj ∪ SRj . Theselection of the “best” split parameters can be formulated as the maximization ofan objective function I j at the split node:θ ∗j = argmaxθ∈ΘI j(S j,θ j) (2.1)where Θ represents the space of all split parameters.Given the set S j and the split parameters θ j, the corresponding left and rightsets are uniquely determined as:{SLj = ((v, ·) ∈ S j|h(v,θ j) = 0)SRj = ((v, ·) ∈ S j|h(v,θ j) = 1)(2.2)where (v,) denotes the feature-label pair, and · can represent either continuous la-bels for the regression problem or the discrete labels for the classification problem.The objective function I j is essential in constructing a decision tree that willperform the desired task. In fact, the result of the optimization problem in Eq. 2.1determines the parameters of the split functions, which, in turn, determine the pathfollowed by a data point. The split functions determine the prediction behavior ofa decision tree. Here a generic information gain associated with a tree split nodeis used as the objective function, which is defined as the reduction in uncertaintyachieved by splitting the training data S j that arrives at the node j into multiplechild subsets.12I j = H(S j)− ∑i∈{L,R}|Sij|S jH(Sij) (2.3)where H is the entropy.H(S) =−∑c∈Cp(c)log(p(c) (2.4)where p(c) is calculated as the normalized empirical histogram of labels corre-sponding to the training points in S.The split procedure described above proceeds recursively to all the newly con-structed nodes and the training phase continues until a stopping criterion is met.There are various stopping criteria such as stopping the tree when a maximumnumber of levels has reached or a node contains too few training points. In the leafnode, the statistic of labels are stored to predict the unknown data points. When thetraining phase is done, we can get the split functions associated with each internalnode and label distributions in each leaf node.Testing of a Decision Tree In the testing stage, the new data point v travels thedecision tree from the root node to a particular leaf node using the splitting valuesin the tree (see Fig 2.2 (a) ). The leaf node predicts a label for the input data pointv.2.2.2 The randomness modelRandomness is injected into the decision trees at the training stage, while testing isalmost always considered to be deterministic [19]. Random training set sampling[8, 9] and randomized node optimization [3, 44] are two popular methods for inject-ing randomness. A short introduction is provided here, and detailed explanationsare found in [3, 8, 9, 22, 44, 63] for more detailed explanation. Random training setsampling (e.g. bagging) concerns training each tree in a forest on different trainingsubset, sampled at random from the same labeled database. It can help improvegeneralization. Randomized node optimization concerns selecting a small randomset of split node parameters rather than the entire set. Random training set splitting13and randomized node optimization could also be used together.2.2.3 Random forest ensembleBecause a single tree has weak prediction capabilities, a combination of trees (i.e.,a forest) is used in practice. A random forest F = {F1,F2, ...,TN} is an ensemble ofN randomly and independently trained decision trees. The independence betweenthese trees lead to de-correlation between the individual tree predictions, resultingin better generalization and robustness compared with a single tree. In the testphase, each test point v is simultaneously pushed through all trees until it reachesthe leaf node. Combining all tree predictions into a single forest prediction can bedone by a simple averaging scheme.p(c|v) = 1NN∑k=1pk(c|v) (2.5)where pk(c|v) denotes the posterior distribution of the kth tree.Random forests used as regression is referred as regression forests.2.3 Camera pose estimation2.3.1 RGB camera pose estimationAn RGB camera projects a 3D point Pw = [X ,Y,Z,1]T in world coordinate to a 2Dpoint p = [x,y,w]T in image coordinate. Here homogeneous coordinates are used,which are suitable for projective geometry. The perspective projection is:p = K[R|t]Pw, (2.6)where K is a 3×3 camera matrix, R is a 3×3 rotation matrix and t is a 3×1 trans-lation vector. In the camera relocalization problem, the camera matrix is knownfrom camera calibration and its value is fixed. The goal is to estimate the rigidtransformation [R|t] from 2D-3D correspondences {p↔ P}. Because the freedomof rigid transformation is 6 and each correspondence provides 2 constraints (onein the image X coordinate and the other in the image Y coordinate), at least three14correspondences are required to solve this problem. To robustly solve this prob-lem, Lepetit [61] et al.have developed the EPnP method which requires at least 4correspondences. This method has been implemented in OpenCV [7] library.2.3.2 RGB-D camera pose estimationIn RGB-D cameras, an image location p = [x,y,w] has a depth d in the cameracoordinate. So the back-projection isPc = K−1p,subject to Pcz = d (2.7)where Pc is a 3D point in the camera coordinate. As a result, the camera relocal-ization is to estimate the rigid transform from world coordinates to camera coordi-nates:Pc = [R|t]Pw. (2.8)The constraint is 3D-3D correspondences {Pc↔ P}. Given at least 3 correspon-dences, the Karbsh algorithm [49] can robustly solve this problem with a closedform solution.In practice, the correspondences have outliers because of the errors of mea-surement of sensors and predictions from machine learning method. The RANSACmethod is used to remove these outliers and estimate the camera pose in an iterativeway.15Chapter 3Image-based localization usingregression forests and keyframepose refinementThis chapter addresses the problem of estimating the camera pose relative to aknown scene, given a single incoming RGB image. The recent advances in scenecoordinate regression forests for camera relocalization in RGB-D images are ex-tended to use RGB features, enabling camera relocalization from a single RGBimage. Furthermore, regression forests and sparse feature matching are integratedin an efficient and accurate way. The developed method is evaluated using bothsmall scale and large scale publicly available datasets with challenging cameraposes against several strong baselines. Experimental results demonstrate the ef-ficacy of the developed approach, showing superior or on-par performance withseveral strong baselines.3.1 IntroductionThe present work is mainly inspired by recent advances in the methods based onscene coordinate regression forests (SCRF) [38, 93, 102] for camera relocalization.The present method also benefits from privilege learning [18, 90, 104] in whichadditional information only exists at the training stage. The SCRF-based methods16Figure 3.1: Camera pose estimation pipeline. During training, the sceneinformation is encoded in a random forest. In the testing stage, an initialcamera pose is estimated using the predictions from the random forestwith real-time response. Then, the initial camera pose is used to query anearest neighbor (NN) image from keyframes. Finally, the camera poseis refined by sparse feature matching between the test image and the NNimage. In our method, the labels can constitute any information (e.g.,scene coordinate positions) associated with pixel locations.use an efficient regression forest to guide the camera pose optimization using RGB-D images, achieving high accuracy. In these SCRF-based methods, much of thecomputational burden is shifted to a training phase, while the test phase is veryefficient. The method developed in the present thesis learns from RGB-D imagesduring training but does not require depth images at test time, enabling it to bemore accessible for end users.Fig.3.1 illustrates the present pipeline. A regression forest is trained usingRGB images and pixel-wise labels. In the testing stage, an initial camera poseis estimated using predicted labels from the random forest. The accuracy of thecamera pose is refined by sparse feature matching.The main contributions of this chapter are:• Extend the SCRF-based methods to only use RGB features (without depth)in the testing stage.• Integrate random features and sparse features, ensuring both efficiency andaccuracy.17• Evaluate using publicly available datasets against several strong baselines,showing superior or on-par performance.3.2 MethodIn the initial camera pose estimation, the problem is modeled as a structural regres-sion problem:yˆp = f (I,p|θ) (3.1)where I is an RGB image, p ∈ R2 is a 2D pixel location and θ contains modelparameters. In the training stage, {p,mp} represent the paired training data. Thelabel mp can represent any information associated with that pixel. For example, itis the world coordinate in camera re-localization. In the testing stage, yˆp denotesthe prediction associated with pixel p.Random regression forests [22] are chosen here to implement the regressionfunction as it naturally handles structure regression, while being robust and rea-sonably easy to train.3.2.1 Random RGB features and labelsRandom RGB Features: Here features based on pairwise pixel comparison areused as in [60, 93]. At a given 2D pixel location p of an image, the feature fφ (p)computes the color intensity difference between the pixel p in a random color chan-nel c1 and the pixel with a 2D offset δ in a random color channel c2:fφ (p) = I(p,c1)−I(p+δ ,c2) (3.2)where δ is a 2D offset and I(p,c) represents an RGB pixel lookup in channel c.The φ contains feature response parameters {δ ,c1,c2}. Pixels outside the imageboundary are marked and not used as samples for both training and testing. A sig-nificant difference here with the depth-adaptive feature used by Shotton et al.[93]is that the present feature does not require depth information.Fig. 3.2 illustrates two random RGB features at different pixel locations p1 andp2. Feature fφ1 looks rightwards and Eq.3.2 will give a large response for 3.2(a) andsmall response for 3.2(b) according to different color changes. Feature fφ2 looks18P2 P2(a) (b)1 1P12P12f1(p) f1(p)f2(p) f2(p)Figure 3.2: Random RGB pixel comparison features. The red star repre-sents the pixel p being computed. The distance between the red star andthe blue circle represents the pixel offset as defined in Eq. 3.2. In (a),the two example features at image position p1 and p2 give a large colordifference response. In (b), the two features at the same image locationsin a different image give a much smaller response compared with (a).upwards and Eq.3.2 will give a large response for 3.2(a) and a small response for3.2(b). In this way, the pixel comparison feature distinguishes fφ1 from fφ2 .Individually each feature only provides a weak description about which part ofthe scene the pixel belongs to. But with thousands of such features per image incombination in a random forest, these features are accurate enough to describe allthe trained images. These features are very efficient. Each feature needs only 2arithmetic operations without pre-processing.Labels: For scenarios where RGB-D images are available in the training stage,the training set contains sequences of RGB-D frames with associated known cam-era poses P which includes 3×3 rotation matrix R and 3×1 translation vector Tfrom the camera coordinate to world coordinate as shown in Eq. 3.3.P=[R T0 1]=R11 R12 R13 TxR21 R22 R23 TyR31 R32 R33 Tz0 0 0 1∈ SE(3) (3.3)19The 3D point x in camera coordinate of the corresponding pixel p is computedby back-projecting the depth image pixels:x = xyz= (u− cx)×d/ fx(v− cy)×d/ fyd (3.4)where [u,v]T is the pixel p position in image plane, and [x,y,z]T is the point positionin camera coordinate, [cx,cy]T and [ fx, fy]T are the camera principal point and focallength respectively. di = depthimage[v,u]/ f actor, and depthimage[v,u] is the measureddepth value at image point [v,u]. The factor is RGB-D camera depth factor, usuallyf actor = 5000 for the 16−bit PNG files.The scene’s world coordinate position m of the corresponding pixel p could becomputed by:m = Px (3.5)The associated camera pose P for each RGB-D image in the training data canbe obtained through camera tracking method KinectFusion[73] or visual SLAMmethod [58].3.2.2 Random forests for 2D-3D correspondence regressionA regression forest is an ensemble of T independently trained decision trees. Eachdecision tree is a binary-tree-structured regressor consisting of decision (or split)nodes and prediction (or leaf) nodes. We grow the regression forest using greedyforest training [22].Weak Learner Model: Each split node i represents a ‘weak learner’ parameter-ized by θi = {φi,τi} that splits the data to the left or the right child. In the trainingphase as shown in Fig. 3.3, a set St0 of labeled sample pixels (p,m) is randomlychosen for each tree, where m is the 3D world coordinate label of pixel p. Thetree grows recursively from the root node to the leaf node. At each split node, theparameter θi is sampled from a set of randomly sampled candidates Θi. At eachsplit node i, for the incoming training set Si, samples are evaluated on split nodes2001 2node 0split nodeleaf node…01 2node 0Tree 1 … Tree TSR1Reduce entropySR1SL1SL1SL2SL2SR2SR2S1 = SL0S1 = SL0 S2 = SR0 S2 = SR0S10 = {(p1,m1), ...(pn,mn)} ST0 = {(p1,m1), ...(pn,mn)}Figure 3.3: Training random forests for 2D-3D coordinate correspon-dence regression. A set St0 of labeled sample pixels (p,m) is randomlychosen from the entire set S0 = [S10, · · · ,ST0 ] for each tree, where m is the3D world coordinate label of pixel p. The tree grows recursively fromthe root node to the leaf node. The goal is to optimize the parameters ofthe tree split nodes. Here spatial-variance is used to represent entropy.to learn the split parameter θi that best splits the left child subset SLi and the rightchild subset SRi as follows:h(p;θi) ={0, if fφi(p)≤ τi, then go to the left subset SLi .1, if fφi(p)> τi, then go to the right subset SRi .(3.6)Here, τi is a threshold on random feature fφi(p).At each split node i, for the incoming training set Si, the training compriseslearning of the split parameter θi that best splits the left child subset SLi and the rightchild subset SRi . The selection of the “best” split parameters can be formulated asthe maximization of an information gain Ii:θ ∗i = arg maxθi∈ΘiIi(Si,θi) (3.7)Ii =V (Si)− ∑j∈{L,R}|S ji (θi)||Si| V (Sji (θi)) (3.8)21node 101 2node 0node 2f0(p)  ⌧0 f0(p) > ⌧0f2(p) > ⌧2f2(p)  ⌧2p split nodeleaf node…node 101 2node 0node 2f0(p)  ⌧0 f0(p) > ⌧0pTree 1 … Tree Tf1(p)  ⌧1 f1(p) > ⌧1Figure 3.4: Test phase of random forests for 2D-3D correspondence re-gression. In the testing stage, at each split node i, the feature fφi iscompared with the feature response τi to determine whether to go to theleft or the right child node. A particular input may go along differentpaths in different trees as indicated by the red arrows.V (Si) =1|Si| ∑(p,m)∈Si||m− m¯||22 (3.9)where V (Si) is the spatial variance of the labels in Si, subset Sji is conditioned onthe split parameter θi, and m¯ represents the mean of m in Si.Fig. 3.4 shows the test phrase. At each split node i, the feature fφi is comparedwith the feature response τi to determine whether to go to the left or right childnode. This process proceeds until it reaches the leaf node.Leaf Prediction: Training terminates when a node reaches a maximum depth Dor contains too few examples. In tree t, one leaf node contains a set of sampleswhose distribution is described by the leaf model parameter θ f . During the testingstage, the leaf generates an estimated vector:v∗t = argmaxv p(v|θ f ) (3.10)with22p(v|θ f ) = N(v;µ,Σ) (3.11)where µ,Σ are the mean and covariance of a Gaussian distribution of labels. Be-sides the pixel location in 3D world coordinate, the present method also stores thecolor distribution of samples in the leaf node to reduce color ambiguities.Forest Ensemble: A forest is an ensemble of independently trained decisiontrees. Since the number of training samples and possible split node tests are largein the present work, building the optimal tree quickly can become intractable. In-stead, multiple trees are grown so that each tree yields a different partition of thespace. Furthermore, multiple trees can have better generalization in the test phase.Once the trees are built, the 3D world coordinate prediction in the leaf node that hasthe most similar color distribution with the test pixel is used as the final predictionbecause the present method stores RGB color distribution in leaf nodes.3.2.3 Pose refinementThe downside of the RGB random feature is that it is not naturally scale-invariant.A naive way to solve this problem is to train the random forest using image pyra-mids. However, it was found that the pyramid method was too time-consumingand did not significantly improve the accuracy.Based on this observation, a hybrid pipeline is designed. First, the 2D-3Dcorrespondence from the above regression forests is used to get an initial camerapose P0. Then, accurate SIFT features [64] are used to refine the initial camerapose. The second step requires an image from the training set that shares similarcamera parameters with the test image. Searching for the nearest neighbor (NN)image whose camera is closest to the location of P0 while having an orientationdifference no greater than a predefined threshold τ0 (1/4 of the field of view) isdone.The dimension of the camera pose space (which is ≤ 6) is much smaller thanthe image descriptor space (which can be up to several hundreds [16]). Therefore,the present method is much more efficient than the methods that use bag of words[30, 72]. Once the NN image is found, the camera pose is refined by minimizing the23reprojection error, which is a geometric error corresponding to the image distancebetween a projected point xˆk and a measured one xk:P∗ = argminP∑kd(xk, xˆk)2 = argminP∑kd(xk,PXk)2 (3.12)where P is a 3× 4 matrix including the camera intrinsic and extrinsic parameters,xk and xˆk are the kth observed and projected feature point locations in the image,respectively. Xk = [Xk,Yk,Zk] is the corresponding 3D world coordinate from theNN image.Here, a pinhole camera model is used in which a scene view is formed byprojecting 3D points into the image plane using a perspective transformation [39].Therefore, each projected feature point xˆ can be obtained by:xˆ≈ PX = K[R|T ]X = fx 0 cx0 fy cy0 0 1 R11 R12 R13 TxR21 R22 R23 TyR31 R32 R33 TzXYZ1 (3.13)where X ,Y,Z are the coordinates of a 3D point in the world coordinate space,K is a known camera intrinsic parameters, [R|T ] is a camera matrix of extrinsicparameters in which R is a 3×3 matrix which represents rotation and T is a 1×3matrix which represents translation, (cx,cy) is a principal point that is usually atthe image center, and fx, fy are the focal lengths expressed in pixel units. We use≈ to indicate that the image location x is transformed from a 3× 1 homogeneousvector.Because correspondences contain outliers, Eq.3.12 is optimized using EPnP[61] and RANSAC[28]. The Perspective-n-Point (PnP) is to determine the positionand orientation of a camera given its intrinsic parameters and a set of n correspon-dences between 3D points and their 2D projections. The complexity of EPnP islinear with the number of 2D-3D correspondences .Algorithm 1 briefly summarizes the proposed method step by step.24Algorithm 1 Camera pose estimation from a single RGB imageRequire: A setSI = {I1,I2, · · ·In} of images with pixel-wise labelsRequire: An RGB image IEnsure: The camera pose P of image I1: Si = a set of randomly sampled pixel-wise training samples;2: train a regression forest usingSi; //3: P0 = initial camera pose from the regression forest predictions;4: search nearest neighbor image Inn in SI using P0;5: match I and Inn to obtain 2D-3D correspondences;6: estimate P using the 2D-3D correspondences and solvePnPRansac; // Sec.3.2.37: return P3.3 ExperimentsThe proposed method is evaluated using both a small scale (about 2− 6m3) Mi-crosoft 7 Scenes dataset and a considerably larger (14− 79m3) Stanford 4 Scenesdataset.3.3.1 Camera relocalization on Microsoft 7 Scenes datasetDataset: The 7 Scenes dataset as shown in Fig. 3.5 is from [93] and consists of 7scenes which were recorded with a handheld Kinect RGB-D camera at 640×480resolution. Each scene includes several camera sequences that contain RGB-Dframes together with the corresponding ground-truth camera poses. The groundtruth camera pose is obtained from an implementation of the KinectFusion [73]system. The dataset exhibits shape/color ambiguities, specularities and motionblur, which present great challenges for the proposed RGB-only random features.Baselines and Error Metric: Three strong methods are used as the baselines:SCRF [93], PoseNet [53] and Bayesian PoseNet [51]. SCRF employs a scenecoordinate regression forest to guide camera pose optimization using RANSACwith RGB-D images. PoseNet trains a ConvNet as a pose regressor to estimate the6-DOF pose from a single RGB image. Bayesian PoseNet improved the accuracyof PoseNet through an uncertainty framework by averaging Monte Carlo dropout25(a) (b) (c) (d) (e) (f) (g)Figure 3.5: Example sequence in Microsoft 7 Scenes dataset. (a) Chess,(b) Fire, (c) Heads, (d) Office, (e) Pumpkin, (f) Red Kitchen, (g) Stairs.The upper row shows the 3D reconstructed scenes and the lower rowshows the RGB image example.samples. The same median translational error and rotational error are used as in[51, 53] for fair comparison.Results and Analysis: The main results of the proposed work are given in Table3.1. The proposed method considerably outperforms the PoseNet and BayesianPoseNet on all scenes. It is noted, however, that the PoseNet based methods needonly RGB images for both training and testing. The proposed method is not asaccurate as the SCRF method. However, the proposed approach does not requirethe depth image in the testing stage, which greatly lightens the requirements of endusers.The best performance of the proposed method is found on the Heads Scene.The worst performance is for the Stairs in which the SCRF approach has the sameproblem due to the repetitive properties of the scene. The second worst scene isPumpkin in which there are large uniformly colored planes, such as the cabinet orthe ground. The lack of color distinction degrades the localization capability.To separately evaluate the performance of random RGB features, the pixel-wise prediction error is visualized using heat maps as shown in Fig. 3.6. The erroris the truncated distance between the predicted 3D locations and the ground truth.The typical large error areas are black screens and other texture-less regions, whichare intractable for low-level visual features.The pose refinement step by integrating sparse feature matching proved crucialto achieve good results. With this turned off, our RGB forest achieves only median26Table 3.1: Relocalization results for the 7 Scenes dataset. Median perfor-mance is shown for the proposed method on all scenes against three state-of-the-art methods: SCRF [93], PoseNet [53] and Bayesian PoseNet [51].# Frames Spatial Baselines OursScene Train Test Extent SCRF PoseNet BayesianTraining — — — RGB-D RGB RGB RGB-DTest — — — RGB-D RGB RGB RGBChess 4k 2k 3x2x1m 0.03m, 0.66◦ 0.32m, 8.12◦ 0.37m, 7.24◦ 0.12m, 3.92◦Fire 2k 2k 2.5x1x1m 0.05m, 1.50◦ 0.47m, 14.4◦ 0.43m, 13.7◦ 0.14m, 4.64◦Heads 1k 1k 2x0.5x1m 0.06m, 5.50◦ 0.29m, 12.0◦ 0.31m, 12.0◦ 0.10m, 6.82◦Office 6k 4k 2.5x2x1.5m 0.04m, 0.78◦ 0.48m, 7.68◦ 0.48m, 8.04◦ 0.15m, 4.23◦Pumpkin 4k 2k 2.5x2x1m 0.04m, 0.68◦ 0.47m, 8.42◦ 0.61m, 7.08◦ 0.22m, 5.40◦Red Kitchen 7k 5k 4x3x1.5m 0.04m, 0.76◦ 0.59m, 8.64◦ 0.58m, 7.54◦ 0.14m, 3.71◦Stairs 2k 1k 2.5x2x1.5m 0.32m, 1.32◦ 0.47m, 13.8◦ 0.48m, 13.1◦ 0.30m, 8.08◦Average — — — 0.08m, 1.60◦ 0.44m, 10.4◦ 0.47m, 9.81◦ 0.17m, 5.26◦Figure 3.6: Pixel-wise prediction error distribution from the regressionforests. Heat maps show the prediction error distribution directly fromthe regression forests on the Chess and Heads scenes. Large errorsoccur on black screens and other texture-less regions.localization result 0.21m,6.09◦ for Chess and 0.25m,8.53◦ on Fire, for instance.3.3.2 Camera relocalization on the Stanford 4 Scenes datasetDataset: The 4 Scenes dataset was introduced in [103] to push the boundariesof RGB-D and RGB camera relocalization. It contains two apartment scenes andtwo office scenes which are very common indoor environments. The recordedenvironment is significantly larger than the 7 Scenes dataset [93] (14−79m3 versusabout 2− 6m3). This large environment is more practical for the application ofrobot indoor localization. The scenes were captured by a Structure.io depth sensorwith an iPad RGB camera. Both cameras have been calibrated and temporally27Figure 3.7: 4 Scenes dataset example images[103]. Each RGB-D image isaccompanied with a camera pose. Each sequence also provides a tex-tured 3D model (not used in the present thesis).synchronized. The RGB image sequences were recorded at a resolution of 1296×968 pixels, and the depth resolution is 640×480. The depth image is re-sampled tothe RGB resolution to align the RGB and depth images. The ground truth cameraposes are from BundleFusion [24], which is a real-time globally consistent 3Dreconstruction system.Baselines and Error Metric: Three state-of-the-art methods of camera relocal-ization with a single RGB image are used as the baselines: SIFT+PnP, ORB+PnP,Bayesian PoseNet[51]. The SIFT+PnP and ORB+PnP are based on matching lo-cal features, following mainstream feature-based relocalization approaches such as[72, 112]. SIFT+PnP uses SIFT[64] as features for feature description and match-ing while ORB+PnP uses ORB[84] as features.These implementations are from [103]. As PoseNet [53] and Bayesian PoseNet[51] have similar performance in [51], here only the performance of BayesianPoseNet is shown as the baseline. For error metric, the percentage of “correctframes” is used for SIFT+PnP, ORB+PnP and the proposed method, while the me-dian performance is used for Bayesian PoseNet and our method. The “correctframes” is defined as the camera pose estimation error is within 5cm translationaland 5o rotational error compared with the ground truth. This accuracy is suffi-cient for augmented reality applications such as restarting any good model-based28tracking system [93].Results and Analysis Table 3.2 shows the camera relocalization results for the4 Scenes dataset. From this dataset, it can be seen that the proposed method is al-most one order of magnitude more accurate than the Bayesian PoseNet. Comparedwith ORB+PnP and SIFT+PnP, our method has the best performance on somescenes, such as Apt1/living, Apt2/Kitchen, Office1/Floor5a, while ORB+PnP andSIFT+PnP also have the best performance on some scenes. It shows that the per-formance is highly dependent on the data properties. It is noticed that the proposedmethod has much better accuracy on the 4 Scenes dataset than on the 7 Scenesdataset, although the 4 Scenes dataset has a spatial extent that is an order of mag-nitude larger than those of 7 Scenes. This is probably due to the following reasons.First, the 4 Scenes dataset has much better RGB image quality, including higherresolution and less motion blur. The RGB image quality is critical for the proposedmethod as this method is using RGB pixel comparison feature. Second, the cameraposes in 4 Scenes are not as challenging as 7 Scenes. For instance, the Stairs scenein 7 Scenes dataset contains too many similar and repetitive steps, which increasesthe inherent ambiguity.Although SIFT+PnP and ORB+PnP have some best performance scenes com-pared with the proposed method, the proposed method has much better overall per-formance considering both speed and accuracy. The proposed method has similarefficiency with ORB+PnP but has 6% higher accuracy. Compared with SIFT+PnP,the proposed method is more efficient as it uses 32 dimension SIFT and searchesonly in the camera pose space of 6 dimensions rather than the SIFT feature spaceof 128 dimensions.3.3.3 Implementation detailsThe approach developed in the present work is implemented with C++ using OpenCV[7] and VXL [1] on an Intel 2.3 GHz, 8GB memory Mac System. For the randomforest, the parameter settings are: tree number T = 10 for 7 Scenes and T = 5 for4 Scenes; 500 and 200 training images per tree for 7 Scenes and 4 Scenes datasets,respectively; 5,000 randomly sampled example pixels per training image. The29Table 3.2: Camera relocalization results for 4 Scenes dataset. The percent-age of correct frames is given (within 5cm translational and 5◦ angularerror), and median of the proposed method on 4 Scenes dataset againstthree state-of-the-art methods: ORB+PnP, SIFT+PnP, and BayesianPoseNet[51]. The best performance is highlighted.Frame numbers Spatial Baselines OursSequence training test Extent ORB+PnP SIFT+PnP Bayesian PoseNet Random+SparseTraining — — RGB-D RGB-D RGB RGB-DTest — — RGB RGB RGB RGBKitchen 744 357 33m3 66.39% 71.43% 0.423m, 5.19◦ 0.030m, 1.73◦, 70.3%Living 1035 493 30m3 41.99% 56.19% 0.611m, 3.70◦ 0.040m, 1.56◦, 60.0%Bed 868 244 14m3 71.72% 72.95% 0.626m, 7.68◦ 0.039m, 1.81◦, 65.7%Kitchen 768 230 21m3 63.91% 71.74% 0.175m, 11.92◦ 0.030m, 1.45◦, 76.7%Living 725 359 42m3 45.40% 56.19% 0.294m, 10.10◦ 0.047m, 1.89◦, 52.2%Luke 1370 624 53m3 54.65% 70.99% 0.649m, 7.71◦ 0.056m, 2.35◦, 46.0%Floor5a 1001 497 38m3 28.97% 38.43% 0.437m, 8.43◦ 0.050m, 2.06◦, 49.5%Floor5b 1391 415 79m3 56.87% 45.78% 0.899m, 6.87◦ 0.042m, 1.42◦, 56.4%Gates362 2981 386 29m3 49.48% 67.88% 0.223m, 6.77◦ 0.033m, 1.41◦, 67.7%Gates381 2949 1053 44m3 43.87% 62.77% 0.410m, 10.35◦ 0.044m, 1.91◦, 54.6%Lounge 925 327 38m3 61.16% 58.72% 0.602m, 7.75◦ 0.046m, 1.61◦, 54.0%Manolis 1613 807 50m3 60.10% 72.86% 0.034m, 2.02◦ 0.034m, 1.56◦, 65.1%Average — — — 53.7% 62.2% 0.483m, 8.08◦ 0.041m, 1.73◦, 59.9%maximum tree depth is 16. A modified 32-dimension SIFT feature is used fromthe VLFeat library [105] for sparse feature matching. The SIFT feature computa-tion is still the bottleneck of the current implementation, the frame-rate responsecan be achieved with the GPU SIFT [114]. For the camera pose optimization,off-the-shelf solvePnPRansac is used in OpenCV.In the present work, the PoseNet and Bayesian PoseNet baselines are run usingthe code by their original author on a Linux machine with an Nvidia GeForceGTX670 GPU. The pre-trained weights are used to initialize the network weights.3.4 ConclusionsThis chapter presented a hybrid method using random RGB features and sparsefeatures for camera pose estimation. The method used RGB-only images in the testand achieves near real-time response. The comparisons on the challenging 7 Scenes30and 4 Scenes dataset with several strong baselines demonstrated the efficacy of thedeveloped method, showing comparable results with state-of-the-art methods.Future work may include improving the prediction accuracy using deep fea-tures. It is also planned to investigate the possibility of integrating accurate sparsefeatures with the random features in the training phase so that the prediction accu-racy could be further improved from random regression forests.31Chapter 4Image-based localization usingbacktracking regression forestsThis chapter presents a backtracking technique that improves the accuracy of cam-era relocalization. Methods based on random forests for camera relocalization di-rectly predict 3D world locations for 2D image locations to guide the camera poseoptimization. During training, each tree greedily splits the samples to minimize thespatial variance. However, these greedy splits often produce uneven sub-trees or in-correct 2D-3D correspondences predictions. To address these problems, a sample-balanced objective is proposed to encourage equal numbers of samples in the leftand right sub-trees, and a novel backtracking scheme to remedy the incorrect 2D-3D correspondence in the leaf nodes caused by greedy splitting. Furthermore, theregression forests based methods are extended to use local features in both train-ing and test stages for outdoor applications. Experimental results using publiclyavailable indoor and outdoor datasets demonstrate the efficacy of the developedapproach, showing superior or on-par accuracy with several state-of-the-art base-lines.4.1 IntroductionMethods based on local features and keyframes have been extensively studied andare still very active approaches for image-based camera relocalization [54, 87, 89].32Recent advances in machine learning methods, especially random forests and deeplearning, have led to rapid progress on their application to camera relocalization.Methods based on random forests are among the first machine learning meth-ods for camera relocalization [38, 93, 102]. In these methods, the forest is trainedto directly predict correspondences from any image pixel to points in the scene’s3D world coordinate, which removes the traditional pipeline of feature detection,description and matching. These correspondences are then used for camera poseestimation based on an efficient RANSAC algorithm without an explicit 3D modelof the scene, which is very attractive when a 3D model is not available. Further-more, these methods can localize the camera from a single RGB-D frame withouttracking. The latest work [6, 70] extends these random forests based methods totest with just RGB images. However, depth images are still necessary to get 3Dworld coordinate labels in the training stage.In the training stage of these methods based on random forests, each tree greed-ily splits the samples to minimize the spatial variance. However, these greedysplits usually produce uneven left and right sub-trees, or incorrect 2D-3D cor-respondences in the test. To address this problem, a label-free sample-balancedobjective is proposed here to encourage equal numbers of samples in the left andright sub-trees, and a novel backtracking scheme to remedy the incorrect 2D-3Dcorrespondence in the leaf nodes caused by greedy splitting. The efficacy of the de-veloped methods is demonstrated through evaluations on publicly available indoorand outdoor datasets.Some recent methods based on deep learning [51, 53, 108] overcome the chal-lenges of using depth images for camera relocalization, extending their applicationto outdoor scenes. These methods train a convolutional neural network to regressthe 6-DOF camera pose from a single RGB image in an end-to-end manner in realtime. However, even integrating LSTM units on the CNN output [108], these meth-ods still generate much lower accuracy compared with methods based on randomforests [93] in indoor scenes and methods based on local features [87] in outdoorscenes in general. To eliminate the dependency on depth images while ensuringhigh accuracy, the present method integrates local features in the random forests,broadening the application of random forests methods to outdoor scenes for thefirst time while achieving the best accuracy against several strong state-of-the-art33baselines.To summarize, the main contributions of the work in this chapter are as follows:• The present work proposes a sample-balanced objective that encourages equalnumbers of samples in the sub-trees, increasing prediction accuracy whilereducing training time.• The present work proposes a novel backtracking scheme to remedy incorrect2D-3D correspondence in the leaf nodes caused by greedy splitting, furtherimproving prediction accuracy.• The present work integrates local features in the regression forests, enablingthe use of just RGB images for both training and testing. The elimination ofthe dependence on depth images broadens the present scope of applicationto outdoor scenes.4.2 MethodThe present work models the world coordinate prediction problem as a regressionproblem:mˆp = fr(I,D,p|θ) (4.1)where I is an RGB image, D is an optional depth image, p ∈ R2 is a 2D pixel lo-cation in the image I, mp ∈R3 is the corresponding 3D scene coordinate of p, andθ is the model parameter. In training, {p,mp} are paired training data. In testing,the 3D world location mˆp is predicted by the model θ . Then, the camera param-eters are estimated using predicted image-world correspondences in a RANSACframework.The novelty of the present method lies in both the regression forest trainingand prediction. First, the present work uses a sample-balanced split objective intraining. This objective function encourages equal numbers of samples in the leftand right sub-trees. In practice, it acts as a regularization term in the training andthus improving prediction accuracy. Second, the present work presents a novelbacktracking technique in prediction. The backtracking searches the tree using apriority queue and decreases the chance of falling in a local minimum.34Table 4.1: Image features. The present method uses random features [93],SIFT features [64] and Walsh-Hadamard transform (WHT) features [43]according to different scenarios. The choice of features considers robust-ness and computational efficiency.Feature TypesDataset Split Backtrack Local descriptorRGB-D Random Random WHTRGB SIFT SIFT SIFT4.2.1 Image featuresThe present method employs different image feature fφi(p) that associates withpixel location p as shown in Table 4.1 according to different application scenarios.Image Features on RGB-D ImagesFor indoor scenes, the present method uses random pixel comparison features[93]and Walsh-Hadamard transform (WHT) features [43]. Random features and WHTfeatures do not require expensive feature detection so they can speed up the process.Also they provide a sufficient number of robust features in texture-less areas, whichis especially important for indoor camera relocalization.In the present method, random features are used for splitting decision treesin internal nodes, while WHT features are used to describe local patches Pr(p)centered at the pixel p. The random features as shown in Fig. 4.1 are based onpairwise pixel comparison as in [93, 102]:fφ (p) = I(p,c1)−I(p+ δD(p),c2) (4.2)where δ is a 2D offset and I(p,c) indicates an RGB pixel lookup in channel c.The φ contains feature response parameters {δ ,c1,c2}. The D(p) is the depth atpixel p in image D of the corresponding RGB image I. Pixels with undefined depthand those outside the image boundary are marked and not used as samples for bothtraining and test.35P2 P2(a) (b)P1 P1f1(p) f1(p)f2(p) f2(p)1D(P1)1D(P1)2D(P2)2D(P2)Figure 4.1: Depth-adaptive random RGB pixel comparison features. Thered star represents the pixel p being computed. The distance betweenred star and blue circle represents the offset pixels θD(p) as defined inEq. 4.2. In (a), the two example features at image position p1 and p2give a large color difference response. In (b), the two features at thesame image locations in a different image give a much smaller responsecompared with (a).For WHT features, the Walsh-Hadamard Transform [43] is calculated for allpatches Pr(p) centered at all pixel positions p. We choose the first 20 Walsh-Hadamard projection vectors for each color channel, thus the total WHT featuredimension is 60 for all the three color channels in the present case. The reason ofchoosing the first 20 projection vectors that they correspond to the ones with largestaverage magnitude of responses, while the rest mainly capture very fine details andthus do not influence performance too much. The fixed patch size of 64×64 pixelsis used for the WHT feature.Image Features on RGB imagesFor scenarios where only RGB images are available in both training and test phase,the present method uses SIFT features [64] as local feature descriptors. The 64dimension SIFT features are used to decrease the trained model size as our back-tracking regression forests method stores the mean vector of feature descriptors.36…Tree 1 … Tree TSR1 SR1SL1SL1SL2SL2SR2SR2S1 = SL0S1 = SL0 S2 = SR0 S2 = SR0split node with objective1leaf nodesplit node with objective21 2S10 = {(p1,m1), ...(pn,mn)} ST0 = {(p1,m1), ...(pn,mn)}Figure 4.2: Training random forests for 2D-3D coordinate correspon-dence regression. A set St0 of labeled sample pixels (p,m) is randomlychosen from the entire set S0 = [S10, · · · ,ST0 ] for each tree, where m is the3D world coordinate label of pixel p. The tree grows recursively fromthe root node to the leaf node. The goal is to optimize the parametersof the tree split nodes. The training objective 1 is the sample-balancedobjective, and the training objective 2 is the spatial-variance objective.4.2.2 Scene coordinate labelsThe labels for image features are the pixel p’s corresponding world coordinatepoint position m.Labels for RGB-D ImagesFor scenarios where RGB-D images are available in the training stage, please referto Sec. 3.2.1 on the label details.Labels for RGB ImagesFor scenarios where only RGB images are available in the test stage, for eachfeature point at pixel p, the scene’s world coordinate position m could be obtainedfrom visual structure from motion [115].37Algorithm 2 Building the random forestsInput: Feature-3D dataset D . from Visual Structure from MotionInput: Number of trees NOutput: The random forest data structure1: procedure BUILDRANDOMFOREST(D,N)2: trees←{}3: for i←{1 · · ·N} do4: Di← random subset o f D . e.g., data from 200 frames5: treesi← BUILDTREE(D)6: end for7: return trees8: end procedure9: procedure BUILDTREE(D)10: if |D|== 1 then . or satisfying other criterion11: CREATELEAFNODE(D) . store mean values of labels and features12: else13: Φ← a set of random split dimension and value14: {o∗,φ ∗}← {∞, /0} . φ has splitDim and splitValue15: for each φ in Φ do16: o← objective value by splitting D using φ17: if o< o∗ then18: φ ∗ = φ . update the optimal splitting19: end if20: end for21: {Dl,Dr}← splitted dataset using φ ∗22: Tl ← BUILDTREE(Dl) . left subtree23: Tr← BUILDTREE(Dr) . right subtree24: CREATEINNERNODE(φ ∗,Tl,Tr) . Store splitting parameters25: end if26: end procedure4.2.3 Backtracking regression forest trainingWeak learner modelA regression forest is an ensemble of T independently trained decision trees. Eachtree is a binary tree structure consisting of split nodes and leaf nodes. We use thesame weak learner as in Chapter 3.2.2.38Training objectivesAt each split node i, for the incoming training set Si, the training is to learn thesplit parameter θi that best splits the left child subset SLi and the right child subsetSRi . The selecting “best” split parameters can be formulated as the minimize thetraining objective Qi:θ ∗i = arg minθi∈ΘiQi(Si,θi) (4.3)The present regression forest uses two objectives depending on the levels of the de-cision tree. At upper levels of a decision tree (tree depth is smaller than a thresholdLmax), the present work uses a sample-balanced objective:Qb(Si,θ) =abs(|SLi |− |SRi |)|SLi |+ |SRi |(4.4)where abs(.) is the absolute value operator and |S| represents the size of set S.This objective penalizes uneven splitting. The sample-balanced objective has twoadvantages. First, it makes the training faster as it only counts the number ofsamples in sub-trees. Second, it produces more accurate predictions in practice,which will be shown in the experiments Sec. 4.3. When the tree depth is largerthan a threshold Lmax, the present work uses the spatial-variance objective Eqn.3.8.Fig.4.3 illustrates the idea of using two objectives in a five-level tree. Thefirst two levels use the sample-balanced objective while the next two levels use thespatial-variance objective.At the end of training, all samples reach to the leaf nodes. In a leaf node, thepresent method not only stores a mean vector of 3D positions but also a mean vectorof local patch descriptors. The local patch descriptor will be used to choose theoptimal predictions in a backtracking process which will be described in Sec.4.2.4.4.2.4 Backtracking in regression forests predictionIn the testing phase, a regression tree predicts 3D locations by comparing the fea-ture value and the feature response at the split nodes. Because the comparison isconducted on a single dimension at a time and all these parameters are optimizedgreedily during the training, making mistakes is inevitable.39Figure 4.3: Decision tree using two split objectives. Best viewed in color.The split nodes are illustrated as the pie charts, which show the percent-age of samples in the left and right sub-trees. In this five-level tree, thesplit nodes of the first two levels are split using the sample-balanced ob-jective. While the rest of levels are split using the unbalanced objectives(e.g., the spatial-variance objective). Details are in Sect. 4.2.3.To mitigate this drawback, the present work has developed a backtrackingscheme in the testing phase as shown in Fig. 4.4 to find the optimal predictionwithin time budgets using a priority queue. The priority queue stores the siblingnodes that are not visited along the path when a testing sample travels from the rootnode to the leaf node. The priority queue is ordered by increasing distance to thesplit value of each split node. The backtracking continues until a predefined num-ber Nmax of leaf nodes are visited. Algorithm 3 illustrates the detailed procedure ofthe present backtracking scheme.4.2.5 Camera pose optimizationThe backtracking regression forest described above is capable of predicting the3D world coordinate for any 2D image pixel. The present work uses this 2D−3Dcorrespondence to estimate the camera pose. The problem of estimating the camerapose is formulated as the energy minimization:P∗ = argminPE(P) (4.5)40node 101 2node 0node 2f0(p)  ⌧0 f0(p) > ⌧0f2(p) > ⌧2f2(p)  ⌧2p split nodeleaf node…node 101 2node 0node 2f0(p)  ⌧0 f0(p) > ⌧0pTree 1 … Tree Tf1(p)  ⌧1 f1(p) > ⌧1Figure 4.4: Test phase of backtracking regression forests In the test, ateach split node i, the feature fφi is compared with feature response τito determine whether to go to the left or to the right child node. Thered arrows represent the prediction process without backtracking whilethe purple arrows represent the backtracking process. A particular inputmay go along different paths in different trees as indicated by the redand purple arrows.where P is a camera pose matrix. The energy function can be solved in differentways depending on whether depth information is available in the test phase.When depth images are available (i.e., indoor application), we use the Kabschalgorithm [49] to estimate camera poses. Otherwise, the solvePnPRansac methodfrom OpenCV [7] is used. Because predictions from the regression forests may stillhave large errors, the preemptive RANSAC method [74] is used to remove outliers.4.3 ExperimentsThis section evaluates the present camera relocalization methods using publiclyavailable indoor and outdoor datasets against several strong state-of-the-art meth-ods.41Table 4.2: Camera relocalization results for the indoor dataset. The per-centage of correct frames (within 5cm translational and 5◦ angular error)of the developed method is shown on 4 Scenes dataset against four state-of-the-art methods: ORB+PnP, SIFT+PnP, Random+SIFT [70], MNG[103]. The best performance is highlighted.Spatial Baselines Our ResultsSequence Extent ORB+PnP SIFT+PnP Random+SIFT[70] MNG[103] BTBRF UBRF BRF BTBRFTraining RGB-D RGB-D RGB-D RGB-D RGB-D RGB-D RGB-D RGB-DTest RGB RGB RGB RGB+3D model RGB RGB-D RGB-D RGB-DKitchen 33m3 66.39% 71.43% 70.3% 85.7% 77.1% 82.6% 88.2% 92.7%Living 30m3 41.99% 56.19% 60.0% 71.6% 69.6% 81.7% 90.5% 95.1%Bed 14m3 71.72% 72.95% 65.7% 66.4% 60.8% 71.6% 81.3% 82.8%Kitchen 21m3 63.91% 71.74% 76.7% 76.7% 82.9% 80.0% 85.7% 86.2%Living 42m3 45.40% 56.19% 52.2% 66.6% 62.5% 65.3% 92.3% 99.7%Luke 53m3 54.65% 70.99% 46.0% 83.3% 39.3% 50.5% 71.5% 84.6%Floor5a 38m3 28.97% 38.43% 49.5% 66.2% 50.3% 68.0% 85.7% 89.9%Floor5b 79m3 56.87% 45.78% 56.4% 71.1% 58.5% 83.2% 92.3% 98.9%Gates362 29m3 49.48% 67.88% 67.7% 51.8% 82.1% 88.9% 90.4% 96.7%Gates381 44m3 43.87% 62.77% 54.6% 52.3% 51.4% 73.9% 82.3% 92.9%Lounge 38m3 61.16% 58.72% 54.0% 64.2% 59.6% 74.6% 91.4% 94.8%Manolis 50m3 60.10% 72.86% 65.1% 76.0% 68.5% 87.2% 93.9% 98.0%Average — 53.7% 62.2% 59.9% 69.3% 63.6% 75.6% 87.1% 92.7%4.3.1 Indoor camera relocalizationIndoor camera relocalization on Stanford 4 Scenes DatasetBaselines: The present method uses four state-of-the-art methods as the base-lines: SIFT+PnP, ORB+PnP, Random+SIFT [70], Multiscale Navigation Graph(MNG) [103]. The SIFT+PnP and ORB+PnP are based on matching local features,following mainstream feature-based relocalization approaches such as [72, 112].These methods have to store a large database of descriptors. Random+SIFT [70]uses both random features and sparse features but in two separate steps. The sparsefeature is used as post-processing in this method. The present method is differ-ent from this method by simultaneously using both random feature and sparsefeatures within the framework of regression forests. The multiscale navigationgraph (MNG) method [103] estimates the camera pose by maximizing the photo-consistency between the query frame and the synthesized image which is condi-tioned on the camera pose. In the testing stage, the MNG method requires the42trained model (retrieval forests) and the 3D model of the scene.Error Metric: The percentage of test frames for which the estimated camera poseis essentially ‘correct’ is reported. A pose is correct when it is within 5cm transla-tional error and 5o angular error of the ground truth.Main Results and Analysis: For the indoor dataset, the present work uses ran-dom features and Walsh-Hadamard transform (WHT) features [43]. The obtainedmain camera relocalization results are presented in Table 4.2. The results of ORB+PnP,SIFT+PnP and MNG are from Valentin et al.[103]. Besides the developed fi-nal method BTBRF, presented here are the results of unbalanced regression forest(UBRF) which does not use the sample-balanced objective, and balanced regres-sion forest (BRF) which uses both sample-balanced objective and spatial-varianceobjective in training but does not use backtracking in testing.In this dataset, sparse baselines do a reasonable job and the SIFT+PnP methodis better than the ORB+PnP method. The present method using RGB-only im-ages in the testing stage achieves higher accuracy than SIFT+PnP, ORB+PnP, andRandom+Sparse, and is less accurate than the MNG method. However, the MNGmethod needs a explicit 3D model to render synthetic images to refine the posewhile the present method does not need. Moreover, the MNG method needs alarge number of synthetic images (9 times of the original training images) for dataaugmentation while the present method does not need them. The present methodusing RGB-D images at test time considerably outperforms all the baselines inaccuracy for camera relocalization.To demonstrate that the improvement is indeed from the sample-balanced ob-jective and backtracking, the present work compares the world coordinate pre-diction accuracy from the present UBRF, BRF and BTBRF. Fig. 4.5 shows theaccumulated percentage of predictions within different error thresholds for foursequences. For a particular threshold, the higher the percentage, the more accu-rate the prediction is. The figure clearly shows that the present method with thesample-balanced objective and the backtracking technique is consistently betterthan the method without sample-balanced objective and without backtracking.43Error threshold (m)(a) Apt1/living0 0.1 0.2 0.3 0.4 0.5Percentage00.10.20.30.40.50.60.70.8Un-balancedBalancedBalanced + BacktrackingError threshold (m)(b) Apt2/living0 0.1 0.2 0.3 0.4 0.5Un-balancedBalancedBalanced + BacktrackingError threshold (m)(c) Office1/gates3620 0.1 0.2 0.3 0.4 0.5Un-balancedBalancedBalanced + BacktrackingError threshold (m)(d) Office2/5a0 0.1 0.2 0.3 0.4 0.5Un-balancedBalancedBalanced + BacktrackingFigure 4.5: Impact of the sample-balanced objective and backtrackingon prediction accuracy. These figures show the accumulated percent-age of predictions within a sequence of inlier thresholds. The proposedmethod with the sample-balanced objective (red lines) consistently hasa higher percentage of inliers compared with the unbalanced objective(blue lines). Backtracking (green lines) further improves prediction ac-curacy. Max number of backtracking leaves is 16 here.(a) (b)Figure 4.6: Qualitative results for indoor dataset (from office2/gates381).Best viewed in color. The ground truth is in red and the present es-timated camera pose is in green. (a) camera trajectories. (b) severalevenly sampled camera frusta are shown for visualization. the presentmethod produces accurate camera locations and orientations. Note: the3D model is only for visualization purposes and it is not used for thepresent camera relocalization.44Balance levels0 2 4 6 8 10Percentage0.650.70.750.80.850.90.951Apt1/livingApt2/livingOffice1/gates362Office2/5aMeanFigure 4.7: Camera relocaliztion accuracy vs. number of tree levels us-ing the sample-balanced objective. When the number of levels usingsample-balanced objective increases, the average performance (dashedline) increases.Fig. 4.6 shows examples of qualitative results of the present method. In Fig.4.6 (a), the estimated camera trajectory and the ground truth trajectory are highlyoverlapped in the scene, illustrating the high accuracy of predicted camera loca-tions. In Fig. 4.6 (b), several camera frusta show the accuracy of estimated cameraorientation. These results show that the estimated camera poses are accurate inboth location and orientation.Tree Structure AnalysisThe structure of the developed decision trees is analyzed to explore the influenceof several important parameters.Maximum tree depth of using the sample-balanced objective Lmax : Fig.4.7 shows relocalization accuracy against tree levels with the sample-balancedobjectives. The sample-balanced level is increased from 0 (no sample-balancedlevel) to 10. The mean accuracy gradually increases from around 80% to around95%. Analyzing the four sequences separately, we find that the performance on theapt2/living and offce2/5a sequences improved significantly while the performanceon the other two sequences improve moderately. Please note the accuracy in thisfigure is measured without using backtracking.45Backtracking leaf node numbers0 5 10 15 20 25 30Correct frames percentage0.850.90.951Apt1/livingApt2/livingOffice2/5aOffice1/gates362MeanFigure 4.8: Camera relocalization accuracy vs. backtracking leaf nodenumbers. The camera relocalization performance increases with morebacktracking leaf nodes though eventually levels out.Backtracking number Nmax: Fig. 4.8 shows the camera relocalization accu-racy against different Nmax. The accuracy is significantly improved within the first(about) 10 leaf nodes and saturates after that. Because the processing time linearlyincreases with the backtracking number, a small number of Nmax is preferred.Split Ratio Distribution: In internal nodes, the split ratio measures the balanceof the number of samples. Fig. 4.9 shows an example of split ratio distribution.The tree uses the sample-balanced objectives in the first 8 levels. After level 8,the distribution is close to a uniform distribution. The present method successfullyguides the tree to have a sample-balanced structure at designated levels. Comparedwith the unbalanced tree structure, the present method has relatively shallowertrees. As it has been pointed out, deeper trees lead to over-fitting [20].Tree Numbers: Fig. 4.10 plots regression forests method on apt1/kitchenscene. It illustrates that the camera relocalization performance increases with thenumber of trees while eventually leveling out. Other random forest variants fordifferent tasks [20, 92] also have similar findings. It has been proved that the ran-dom forests have less danger of overfitting when the number of trees increases [9].However, the memory overhead and training time of using multiple trees increase46Figure 4.9: Split ratio distribution. The heat map shows the split ratio dis-tribution as a function of tree depth (level). The data is from all splitnodes in a sequence. In the first 8 levels, the distribution is concentratedaround 0.5 as a result of the sample-balanced objective. From level 8 tolevel 25, the distribution is almost uniform.Figure 4.10: Camera relocalization accuracy VS number of trees47Table 4.3: Camera relocalization results for the 7 Scenes dataset .Correctpercentage performance is shown for the developed method on all scenesagainst three state-of-the-art methods: Sparse baseline [93], SCRF [93],MutliOutput [38]. The correct percentage show the test frames within5cm translational and 5◦ angular error.Baselines OursScene Sparse [93] SCRF [93] MultiOutput[38] BRF BTBRFTraining RGB-D RGB-D RGB-D RGB-D RGB-DTest RGB RGB RGB-D RGB-D RGB-DChess 70.7% 92.6% 96% 97.8% 99.6%Fire 49.9% 82.9% 90% 92.1% 95.2%Heads 67.6% 49.4% 56% 77.3% 90.4%Office 36.6% 74.9% 92% 91.3% 95.9%Pumpkin 21.3% 73.7% 80% 76.5% 75.7%Kitchen 29.8% 71.8% 86% 85.4% 89.4%Stairs 9.2% 27.8% 55% 51.3% 60.8%Average 40.7% 67.6% 79.3% 81.7% 86.7%linearly with the number of trees, so at some point the accuracy increase may notjustify the additional memory used. Fig. 4.10 shows the relocalization accuracyincreases steeply at first several trees but slowly after around 5 trees. Similar trendsare observed on other sequences. As a result, the present work uses 5 trees to strikea trade off between speed and accuracy.Indoor camera relocalization on 7 Scenes DatasetThe developed method is also evaluated using the widely used Microsoft 7 Scenesdataset. Also the present method is compared against a sparse feature based methodreported from [93], SCRF [93] and a multi-output version of SCRF [38] in termsof correct frame percentage. The present method is compared with PoseNet andBayesian PoseNet in terms of median translation error and rotation error.Main results and analysis: The main results and comparisons are presented inTable 4.3. The developed method BTBRF outperforms all the baselines on all48Table 4.4: Median camera relocalization performance for the 7 Scenesdataset . Median performance for the present method on all scenes isshown against three state-of-the-art methods: PoseNet [53], Bayesian[51], SCRF [93].Baselines OursScene PoseNet [53] Bayesian [51] SCRF [93] BRF BTBRFTraining RGB RGB RGB-D RGB-D RGB-DTest RGB RGB RGB-D RGB-D RGB-DChess 0.32m, 8.12◦ 0.37m, 7.24◦ 0.03m, 0.66◦ 0.017m, 0.69◦ 0.015m, 0.59◦Fire 0.47m, 14.4◦ 0.43m, 13.7◦ 0.05m, 1.50◦ 0.018m, 0.99◦ 0.016m, 0.89◦Heads 0.29m, 12.0◦ 0.31m, 12.0◦ 0.06m, 5.50◦ 0.028m, 2.66◦ 0.020m, 1.84◦Office 0.48m, 7.68◦ 0.48m, 8.04◦ 0.04m, 0.78◦ 0.022m, 0.90◦ 0.018m, 0.75◦Pumpkin 0.47m, 8.42◦ 0.61m, 7.08◦ 0.04m, 0.68◦ 0.024m, 0.90◦ 0.023m, 0.84◦Kitchen 0.59m, 8.64◦ 0.58m, 7.54◦ 0.04m, 0.76◦ 0.027m, 1.22◦ 0.025m, 1.02◦Stairs 0.47m, 13.8◦ 0.48m, 13.1◦ 0.32m, 1.32◦ 0.047m, 1.40◦ 0.040m, 1.22◦Average 0.44m, 10.4◦ 0.47m, 9.81◦ 0.08m, 1.60◦ 0.261m, 1.25◦ 0.022m, 1.02◦scenes except the Pumpkin scene in terms of correct percentage. The sparse base-line tends to perform poorly when there exists severe motion blur, or texturelessareas in which situations few features are detected. The SCRF performs betterthan the present BRF but worse than BTBRF. The main reason is that mean shiftis not used, which may improve the leaf prediction. Only the balanced objectivecannot deal with the ambiguities on the leaf prediction but the backtracking schemehelps more. All these three methods perform poorly on the Stairs due to the inher-ently repetitive and ambiguous property of the scene. However, the present methodachieves 35% more in correct percentage accuracy compared with the SCRF, 86%more with AutoContext and 308% more with Spare baseline. The performanceboost may be attributed to the balanced objective and backtracking could help tocorrect the severely wrong prediction.To gain some insight into the translation and rotation error, the median perfor-mance is also presented in Table 4.4. It is can be seen that SCRF and the presentmethods are significantly more accurate than PoseNet and Bayesian PoseNet. Thereason may be that both PoseNet and Bayesian PoseNet consider the holistic im-ages which tends to fail when occlusions occurred. The other reason may be that49(a-1) (b-1) (c-1) (d-1)(a-2) (b-2) (c-2) (d-2)Figure 4.11: Example sequence in TUM dynamic dataset. (a) sitting xyz,(b) sitting rpy, (c) walking halfsphere, (d) walking xyz. The upperrow shows the training image example. The training images also con-tain severe motion blur, occlusion and rotated images. The lower rowshows the training (blue) and test (red) sequences. The camera frus-tums are uniformely sampled in every tenth image.they do not use the depth image and lose more information. The present methodBTBRF has much better performance on translational error but inferior perfor-mance on angular error.Indoor camera relocalization on TUM Dynamic DatasetTUM RGB-D dynamic dataset: TUM RGB-D Dataset [98] is mainly for theevaluation of RGB-D SLAM systems. A large set of image sequences of vari-ous of characteristics (e.g. non-texture, dynamic objects, hand-held SLAM, robotSLAM) from a Microsoft Kinect RGB-D sensor with highly accurate and time-synchronized ground truth camera poses from a motion capture system. The se-quences contain both the color and depth images in image resolution of 640×480.Here, just the dynamic objects dataset is used to complement the previous staticMicroso f t 7 Scenes dataset and Stan f ord 4 Scenes dataset where no dynamic ob-jects existed. This dynamic dataset is very challenging as there are severe occlu-sions and moving dynamic objects in the scene. The training set is from the scenesas listed in Table 4.5 while the test set is from the respective evaluation sequences.50Table 4.5: Camera relocalization results for the TUM dataset . The correctpercentage performance, median performance, and RMSE of ATE arepresented.# Frames BTBRFScene Train Test Correct Median RMSEsitting static 676 715 64.6% 0.015m, 0.99◦ 0.018msitting xyz 1216 829 70.2% 0.029m, 0.72◦ 0.039msitting halfsphere 1069 944 44.4% 0.056m, 1.59◦ 0.046msitting rpy 792 753 74.6% 0.029m, 0.98◦ 0.065mwalking halfsphere 1018 1171 61.7% 0.042m, 1.03◦ 0.085mwalking rpy 864 777 53.7% 0.047m, 1.14◦ 0.551mwalking static 714 785 89.2% 0.019m, 0.49◦ 0.027mwalking xyz 826 897 41.7% 0.048m, 1.24◦ 0.064mAverage — — 62.5% 0.036m, 1.02◦ 0.119mFig. 4.11 show several RGB example for training sequences.Error metric Three error metrics are used for this evaluation: ’Correct frames’which is the percentage of test frames within 5cm and 5◦, median translational andangular error, and root mean squared error (RMSE) for Absolute Trajectory Error(ATE) [98] which is commonly used in many SLAM systems [72, 98]. Unlike theMicroso f t 7 Scenes and Stan f ord 4 Scenes datasets, the training and evaluationdata are in different world coordinate systems. The estimated trajectory is stillin the training data world coordinate. For alignment, TUM dataset benchmarkprovide tools using Horn’s method [45] to align the estimated trajectory P1:n andground truth trajectory Q1:n. The ATE at time step i is computed asFi = Q−1i SPi (4.6)The root mean square error (RMSE) over all time indices of the translational com-ponents is computed as:RMSE(F1:n) = (1nn∑i=1||trans(Fi)||2) 12 (4.7)51(a) (b)(c) (d)Figure 4.12: Quantitative results on TUM dynamic dataset. Here we showtwo good scenes and two bad scenes. (a) sitting xyz, (b) walking xyz,(c) walking halfsphere, (d) walking rpy.However, this ATE could only be used an auxilliary error metric, as it only con-siders the translational error while ignoring rotational errors. The translational androtational error are simultaneously optimized.Main results and analysis The main results are presented in Table 4.5. It showsthe developed method is somewhat robust in front of dynamic obstacles, with62.5% average correct images. Fig. 4.12 (a) and (b) show that the present methodstays close to ground truth trajectory. However, there are failure cases which causethe RMSE to be very large. For instance in walking hal f sphere as shown in Fig.4.12 (c) and (d), it can be seen that several failure images caused large RMSE. Fig.4.13 show the failure image examples. There are many possible reasons for the52(a) (b)Figure 4.13: Failure cases on TUM dynamic dataset. (a) walk-ing halfsphere. Dynamic objects dominate the image and severe mo-tion blur exists. (b) walking rpy. No similar image in the trainingsequence and severe motion blur.relocalization failure, such as severe motion blur, no similar image in the trainingsequence and dominating dynamic objects.4.3.2 Outdoor camera relocalizationFig.4.14 illustrates our framework of camera relocalization for outdoor scenes. Vi-sual structure from motion is leveraged to autonomously generate training data(local features and their corresponding 3D point positions in the world coordinate).During training, regression forest is used to learn the correspondence between lo-cal features and 3D world coordinates. At test time, local features are extractedfrom query image, and then estimate their 3D correspondences from regressionforests with backtracking. Finally the camera pose is estimated using PnP solverand RANSAC.Dataset: The Cambridge Landmarks dataset [53] is used to evaluate the devel-oped method. It consists of data in a large scale outdoor urban environment. Scenesare recorded by a smart phone RGB camera at 1920×1080 resolution. The datasetalso contains the structure-from-motion (SfM) models reconstructed with all im-ages to get the ground truth camera pose. The dataset exhibits motion blur, movingpedestrians and vehicles, and occlusion, which pose great challenges for camera53VisualSfM local features !& 3D pointsSupervised Learning!(random forests)Stree TMonocular!ImagesQuery Image Stree Tlocal feature !detection & extractionRandom forests !with backtracking2D-3D!correspondences!Camera pose estimation!using PnP solver & RANSAC local !features(a)VisualSfM local features !& 3D pointsSupervised Learning!(random forests)Stre  TMonocular!ImagesQuery Image Stree Tlocal feature !detection & extractionRandom forests !with backtracking2D-3D!correspondences!Camera pose estimation!using PnP solver & RANSAC local !features(b)Figure 4.14: Overview of the present framework of SuperSIFT for cam-era relocalization (a) Visual structure from motion [115] is employedto autonomously generate training data (local features and their corre-sponding 3D point positions in the world coordinate). During training,regression forest is used to learn the correspondence between localfeatures and 3D world coordinates. (b) At test time, local features areextracted from query image, and then their 3D correspondences are es-timated from regression forests with backtracking. Finally the camerapose is estimated using PnP solver and RANSAC.relocalization.Baselines: Five state-of-the-art methods are used as our baselines: Active Search[87], PoseNet [53], Bayesian PoseNet [51], CNN+LSTM [108], and PoseNet+Geometricerror[52]. Active Search employs a prioritized matching step that first considersfeatures more likely to yield 2D-to-3D matches and then terminates the correspon-dence search once enough matches have been found. PoseNet trains a ConvNetas a pose regressor to estimate the 6-DOF camera pose from a single RGB im-age. Bayesian PoseNet improved the accuracy of PoseNet through an uncertaintyframework by averaging Monte Carlo dropout samples from the posterior Bernoulli54Table 4.6: Camera relocalization results for the outdoor CambridgeLandmarks Dataset. The median performance for the devel-oped method is shown against five state-of-the-art methods: Ac-tive Search without prioritization (w/o) and with prioritization (w/)[87], PoseNet [53], Bayesian PoseNet [51], CNN+LSTM [108], andPoseNet+Geometric loss[52].Baselines OursScene Spatial Active Search Active Search PoseNet [53] Bayesian CNN+ PoseNet+ BTBRFExtent (w/o) [87] (w/)[87] PoseNet[51] LSTM[108] Geometric loss[52]King’s 140x40m 0.59m, 0.48◦ 0.67m, 0.52◦ 1.92m, 5.40◦ 1.74m, 4.06◦ 0.99m, 3.65◦ 0.88m, 1.04◦ 0.39m, 0.36◦Hospital 50x40m 1.25m, 0.71◦ 1.29m, 0.79◦ 2.31m, 5.38◦ 2.57m, 5.14◦ 1.51m, 4.29◦ 3.20m, 3.29◦ 0.30m, 0.41◦Shop 35x25m 0.18m, 0.65◦ 0.17m, 0.53◦ 1.46m, 8.08◦ 1.25m, 7.54◦ 1.18m, 7.44◦ 0.88m, 3.78◦ 0.15m, 0.31◦St Mary 80x60m 0.26m, 0.50◦ 0.29m, 0.55◦ 2.65m, 8.48◦ 2.11m, 8.38◦ 1.52m, 6.68◦ 1.57m, 3.32◦ 0.20m, 0.40◦Average 0.57m, 0.59◦ 0.61m, 0.60◦ 2.08m, 6.83◦ 1.92m, 6.28◦ 1.30m, 5.52◦ 1.63m, 2.86◦ 0.27m, 0.39◦distribution of the Bayesian ConvNet’s weights. CNN+LSTM [108] uses a CNNto learn feature representations and LSTM units on the CNN output in spatial coor-dinates to capture the contextual information. PoseNet+Geometric error[52] whichis based on PoseNet[53] uses geometric reprojection error instead of the manuallytuned weighted loss as the regression loss to balance the rotational and positionalquantities in a single scalar loss.Error metric The median translational error and rotational error are used hereas in previous work [51–53, 108] for fair comparison.Main results and analysis Table 4.6 shows the median camera relocalizationerrors for the present method and the baseline methods. The results of PoseNetand Bayesian PoseNet are from the original papers. The results of active searchand CNN+LSTM are from [108]. The present method considerably outperforms allthese baselines for all scenes in terms of median translational and rotational errors.It is about an order of magnitude improvement in accuracy compared with PoseNet,five times as accurate as CNN+LSTM and twice as accurate as Active Search.Timings are not directly compared here as all these baselines are implemented ondifferent high-end GPUs while the current implementation is on a single CPU core.To gain insight on how backtracking helps to improve the accuracy, the camera55Figure 4.15: Qualitative results for the outdoor dataset Cambridge Land-marks, King’s College. Best viewed in color. Camera frusta over-laid on the 3D scene. The camera poses are evenly sampled every tenframes for visualization. Camera frusta with hollow green frames showground truth camera poses, while red ones with light opacity show thepresent estimated camera poses. The estimated camera poses of thedeveloped method are very close to ground truth in spite of partialocclusion, moving objects, motion blur, large illumination and posechanges.relocalization accuracy is plotted against backtracking leaf node numbers in Fig.4.16. It shows that the camera relocalization errors significantly decrease withinabout 5 backtracking leaf nodes, which indicates that the relocalization accuracycan be improved by backtracking only a small number of leaf nodes.Fig. 4.15 shows the qualitative results for the King’s College Scene. We uni-formly resampled the camera poses every ten frames to improve visualization.4.3.3 Implementation detailsThe proposed method is implemented with C++ using OpenCV [7] on an Intel3GHz i7 CPU, 16GB memory Mac system. The parameter settings for regressionforest are: tree number T = 5; 500 (indoor) and 300 (outdoor) training images pertree; 5,000 randomly sampled pixels per training image (indoor); the number ofSIFT features per image varies from around 500 to 1,500 (outdoor); the maximumdepth of tree is 25; the maximum backtracking leaves is 16. For the test, the present56Number of backtracking leaf nodes0 5 10 15 20Median translational error (meters)00.20.40.60.811.2KingsOld HospitalShop FacadeSt Marys(a)Number of backtracking leaf nodes0 5 10 15 20Median rotational error (degrees)00.20.40.60.811.21.41.61.8KingsOld HospitalShop FacadeSt Marys(b)Figure 4.16: Effect of backtracking leaf node numbers on camera relo-calization accuracy in Cambridge Landmarks Dataset. (a) mediantranslational error vs. backtracking leaf numbers (b) median rotationalerror vs. backtracking leaf numbers. Both median translational androtational errors decrease with more backtracking leaf nodes thougheventually level out.unoptimized implementation takes approximately 1 second (indoor) and 2 seconds(outdoor) on a single CPU core. Most of the time is on the backtracking and thecamera pose optimization part. It should be noted that the current implementationis not heavily optimized for speed and no GPU is used, which makes it possible tospeed up with GPU implementation. There is empirical evidence supporting thata decent GPU implementation of the forest would not exceed 1ms [91], and evenwith backtracking it still can achieve fast response.4.3.4 LimitationsThe present BTBRF method has two limitations. First, it has to store mean localpatch descriptors in leaf nodes, which doubles the trained model size. Second, thetime complexity of the backtracking is proportional to the value of Nmax, makingthe prediction process slower than the conventional regression forests method. It isimportant to find ways of adaptively setting the value of Nmax to avoid unnecessarybacktracking.574.4 ConclusionsThis chapter developed a sample-balanced objective and a backtracking scheme toimprove the accuracy for camera relocalization. The proposed methods are applica-ble to regression forests both for RGB-D and RGB camera relocalization, showingstate-of-the-art accuracy in publicly available datasets. Furthermore, we integratedlocal features in regression forests, advancing regression forests based methods touse RGB-only images for both training and test, and broadening the application tooutdoor scenes while achieving the best accuracy against several strong state-of-the-art baselines. In the future, the present work may be implemented on a parallelcomputation architecture for higher efficiency.58Algorithm 3 Tree Prediction with BacktrackingInput: A decision tree T , testing sample S, maximum number of leaf to examineNmaxOutput: Predicted label and minimum feature distance1: procedure TREEPREDICTION(T,S,Nmax)2: count← 03: PQ← empty priority queue4: R← /0 . the prediction result5: call TRAVERSETREE(S,T,PQ,R)6: while PQ not empty and count < Nmax do7: T ← top of PQ8: call TRAVERSETREE(S,T,PQ,R)9: end while10: return R11: end procedure12: procedure TRAVERSETREE(S,T,PQ,R)13: if T is a leaf node then14: dist← distance(T. f eature,S. f eature) . e.g., L2 norm15: if R is /0 or dist < R.dist then16: {R.label,R.dist}← {T.label,dist} . update17: end if18: count = count+119: else20: splitDim← T.splitDim21: splitValue← T.splitValue22: if S(splitDim)< splitValue then23: closestNode← T.le f t24: otherNode← T.right25: else26: closestNode← T.right27: otherNode← T.le f t28: end if29: add otherNode to PQ30: call TRAVERSETREE(S,closestNode,PQ,R)31: end if32: end procedure59Chapter 5Exploiting points and lines inregression forests for RGB-Dcamera relocalizationThis chapter presents a camera pose relocalization method that uses both pointsand lines. Previous chapters used Perspective-n-Point (PnP) to estimate the pose ofa calibrated camera from n 3D-to-2D or 3D-to-3D point correspondences. Thereare situations where point correspondences cannot be reliably estimated, for ex-ample in a texture-less office. In such scenarios, one can still exploit alternativegeometric entities, such as lines, yielding the Perspective-n-Line (PnL) algorithms.This chapter presents a method that uses both point and line features to relocalizea RGB-D camera.5.1 IntroductionMost existing camera relocalization approaches are designed with the assumptionthat many point features can be accurately tracked. However, it is not the case inthe real world. Textureless areas are common in human environments and motionblur also often happens when moving the camera too fast. Recent camera relo-calization methods based on deep learning [40, 51–53] are more robust when thesparse features such as SIFT [64] fail in the existence of motion blur or poorly60(a) (b)Figure 5.1: Line segment example. (a) original RGB image (b) with LSDline features. In scenes with little texture and repetitive patterns whichare typical in indoor environments, line features are more robust.textured areas. However, the overall camera relocalization accuracy of these deeplearning methods has almost an order of magnitude in error over the point-basedmethods or random forest based methods in publicly available datasets.To solve this problem, this chapter proposes to use both line and point featuresin regression forests for camera relocalization. In motion blur and textureless areaswhere the points are struggled [85], the line segments provide important geomet-ric information and are more robust features. Furthermore, 3D edges which arecomposed of many line segments are robust to viewpoint changes.5.2 Related workThis section provides a literature review of the related work on the line and/orpoint feature for camera pose estimation. For more generated related work, pleaserefer the reader to Sec. 2.1.1. Line segment detection and exploitation in manycomputer vision tasks such as camera pose estimation [81, 85], stereo analysis[15, 48] and crack detection [66] could date back to three decades [12, 50] and arestill a very active area [36, 68, 85, 107]. Robust gradient orientations of the linesegment rather than robust endpoints or gradient magnitudes play a crucial role inthe line segment literature [11, 50, 85, 107]. Besides line segment detection, thepresent work is highly related to pose estimation using line and/or point features[26, 37, 81, 85].615.3 Problem setup and method overviewThe following three assumptions of the RGB-D camera and input data are made: (i)the camera intrinsics are known; (ii) the RGB and depth frames are synchronized;(iii) the training set contains both RGB-D frames and their corresponding 6 DOFcamera pose encoding the 3D rotation and translation from camera coordinates toworld coordinates.The problem is formulated as: given only a single acquired RGB-D image{I,D}, infer the pose H of an RGB-D camera relative to a known scene.To solve the problem, we propose to exploit both line and point features in un-certainty driven regression forests. Our method consists of two major components.The first component is two regression forests trained using general points and linepoints respectively. These two forests predict general points and line points in thetesting. The second component is a camera pose optimization scheme using bothpoint-to-point constraints and point-on-line constraints.The novelty of our method lies in both the regression forests and camera poseestimation. First, point and line features are integrated in the training, modelingthe uncertainty both in the line and point predictions. Second, the uncertainty ofpoints and lines are simultaneously used to optimize the camera pose.5.4 Regression forest with point and line features5.4.1 Points sampling and scene coordinate labelsTo take advantage of the complementary properties of lines and points, the presentmethod differentiates the points on the line segments and general points.Line Segment Sampling: Directly back-projecting the two endpoints to 3D lineusing the depth information will cause large errors due to discontinuous depth onobject boundaries or lack of depth information as shown in Fig. 5.2. To avoid thisproblem, the Line Segment Detector (LSD) is employed [107] to extract a set of2D line segments L = {l1,l2, · · ·} from image I as shown in Fig. 5.1, and thenuniformly sample points from the line as shown in Fig. 5.3. Using this samplescheme, one could discard the sample points whose depths are unavailable, and62(a) (b)Figure 5.2: Depth corruption and discontinuity on line segments. (a) LSDline segments overlaid on original RGB image (b) truncated depth map.Effective depth information is not always available for 2D line segmentsin the corresponding RGB image, such as the wrong depth values shownon the desk and the glass corridor areas.only back-project the remaining points to the camera coordinate. The 3D back-projected points could contain outliers which could be removed by RANSAC [28]and fit a 3D line.Random points sampling: Besides the points sampled on the line segments, thepresent work also randomly samples general image points.The same method is used to train both point and line point models.Image features Here the proposed method uses the pixel comparison feature [93]that associates with each pixel location p:fφ (p) = I(p,c1)−I(p+ δD(p),c2) (5.1)where δ is a 2D offset and I(p,c) indicates an RGB pixel lookup in channel c.The φ contains feature response parameters {δ ,c1,c2}. The D(p) is the depth atpixel p in image D of the corresponding RGB image I. Pixels with undefined depthand those outside the image boundary are marked and not used as samples for bothtraining and test.63zyxuvimage planecamera optical centerimage pointsscene pointsscene inlier pointsscene outlier points2D line 3D lineFigure 5.3: 3D line estimation based on sampling points. Within a pinholecamera model, the 2D image points are evenly sampled on a 2D imageline and then back-projected on the scene coordinate to be 3D scenepoints. These 3D scene points contain outliers which could be removedby RANSAC to fit a 3D line in scene coordinate.Besides the random pixel comparison feature, we also use Walsh-HadamardTransform (WHT) features [43] to describe the local patch associated with thepixel p.Scene coordinate labels: For both random sampled points and points sampled onthe line segment, the 3D points x in camera coordinate of the corresponding pixelp are computed by back-projecting the depth image pixels. The scene’s worldcoordinate position m of the corresponding pixel p is computed through m = Hx.With the present sampling method, one can train both the point predictionmodel and line prediction model in the same way. An alternative way is to usetwo different models. One model predicts point-to-point correspondences and an-other model directly predicts line-to-line correspondences. The difficult part ofthis method is that there are few robust and efficient representations of lines inthe feature space. Therefore, the proposed method predicts line-point-to-line-pointcorrespondences and employs point-on-line constraint in the camera optimization64process, which greatly simplifies the model learning and prediction process.5.4.2 Regression forest trainingA regression forest is an ensemble of T independently trained decision trees. Eachtree is a binary tree structure consisting of split nodes and leaf nodes.5.4.3 Weak learner modelEach split node i represents a weak learner parameterized by θi = {φi,τi} where φiis one dimension in features and τi is a threshold. The tree grows recursively fromthe root node to the leaf node. At each split node, the parameter θi is sampled froma set of randomly sampled candidates Θi. At each split node i, for the incomingtraining set Si, samples are evaluated on split nodes to learn the split parameter θithat best splits the left child subset SLi and the right child subset SRi as follows:h(p;θi) ={0, if fφi(p)≤ τi, then go to the left subset SLi .1, if fφi(p)> τi, then go to the right subset SRi .(5.2)Here, τi is a threshold on feature fφi(p). Although here we use random pixel com-parison features as in Eq. 5.1, the weak learner model can use other general featuresto adapt application scenarios.5.4.4 Training objectiveIn the training, each split node i uses the randomly generated Θi to greedily opti-mize the parameters θ ∗i that will be used as the weak learner in the test phase bymaximizing the information gain Ii:θ ∗i = arg maxθi∈ΘiIi(Si,θi) (5.3)withIi = E(Si)− ∑j∈{L,R}|S ji (θi)||Si| E(Sji (θi)) (5.4)65where E(Si) is the entropy of the labels in Si, and subset Sji is conditioned on thesplit parameter θi. The present work employs a single full-covariances Gaussianmodel, with the entropy defined as:E(S) =12log((2pie)d |Λ(S)|) (5.5)with dimensionality d = 3 and Λ is the full covariance of the labels in S.At the end of training, all samples reach leaf nodes. In a leaf node, the presentwork uses the mean shift method to estimate a set of modes. Each mode has a meanvector µ and a covariance matrix Λ to described the clustered 3D points. Mean-while, the present method stores a mean vector of local patch descriptors for eachmode. The local patch descriptor will be used to choose the optimal predictions.5.4.5 Regression forest predictionIn testing, we use the backtracking technique in chapter 4 to find the optimal pre-diction within time budgets using a priority queue. In backtracking, the optimalmode has the minimum feature distance from the patch descriptor. Fig. 4.4 showsthe prediction process. To speed up, the maximum backtracking leaf number is setto be 8 instead of 16 as in the previous chapter. The point-on-line segment forestsand the general point forests make point-on-line and general points predictions re-spectively.5.5 Camera pose optimizationOur method optimizes the camera pose using two types of constraints. The firstconstraint is point-to-point correspondence. For each sampled camera coordinatepoint xci , the mode is found in Mi that concurrently best explains the transformedobservation Hxci :(µ∗i ,Σ∗i ) = arg max(µ,Σ)∈MiN (Hxci ;µ,Σ). (5.6)This can be optimized by minimizing the sum of Mahalanobis distances in worldcoordinates:Ep = ∑i∈Ip‖(Hxci −µ∗i )TΣ∗−1xi (Hxci −µ∗i )‖. (5.7)66The second constraint is point-on-line constraint. For each predicted edge pointxwi , the present work transforms its location to the camera coordinate H−1xwi andmeasures the Mahalanobis distance to the associated line Li. This can be optimizedby minimizing:El = ∑i∈Il‖(H−1xwi −Q)TΣ−1i (H−1xwi −Q)‖ (5.8)where Qi is the closest point on Li to the transformed point H−1xwi . The covariancematrix Σi is rotated from the world coordinate to the camera coordinate byΣc = RΣRT (5.9)where R is the rotation matrix in H−1.The proposed method jointly optimizes these two constraints by using the sumover the Mahalanobis distances as the performance index:H∗ = argminH(Ep+El) (5.10)This energy is optimized by a Levenberg-Marquardt optimizer [71] and the resultis used as the refined camera pose hypothesis in the next RANSAC iterations.5.6 ExperimentsThis section evaluates the developed method on three publicly available datasetsfor camera relocalization against several state-of-the-art baselines.5.6.1 Evaluations on Stanford 4 Scenes datasetDataset: See Secection 3.3.2 for the introduction of 4 Scenes dataset for a briefintroduction or [103] for detailed description.Baselines: Please refer Sec. 4.3.1 for details of ORB+PnP, SIFT+PnP, Ran-dom+SIFT [70], MNG [103]. The BTBRF is the present backtracking regressionforest in Section. 4, but only random point features are used.67Table 5.1: Camera relocalization results for the indoor dataset. The per-centage of correct frames (within 5cm translational and 5◦ angular error)of the developed method is shown using 4 Scenes dataset against fourstate-of-the-art methods: ORB+PnP, SIFT+PnP, Random+SIFT [70],MNG [103]. The best performance is highlighted.Spatial Baselines Our ResultsSequence Extent ORB+PnP SIFT+PnP Hybrid[70] MNG[103] BTBRF PLForestKitchen 33m3 66.39% 71.43% 70.3% 85.7% 92.7% 98.9%Living 30m3 41.99% 56.19% 60.0% 71.6% 95.1% 100%Bed 14m3 71.72% 72.95% 65.7% 66.4% 82.8% 99.0%Kitchen 21m3 63.91% 71.74% 76.7% 76.7% 86.2% 99.0%Living 42m3 45.40% 56.19% 52.2% 66.6% 99.7% 100%Luke 53m3 54.65% 70.99% 46.0% 83.3% 84.6% 98.9%Floor5a 38m3 28.97% 38.43% 49.5% 66.2% 89.9% 98.8%Floor5b 79m3 56.87% 45.78% 56.4% 71.1% 98.9% 99.0%Gates362 29m3 49.48% 67.88% 67.7% 51.8% 96.7% 100%Gates381 44m3 43.87% 62.77% 54.6% 52.3% 92.9% 98.8%Lounge 38m3 61.16% 58.72% 54.0% 64.2% 94.8% 99.1%Manolis 50m3 60.10% 72.86% 65.1% 76.0% 98.0% 100%Average — 53.7% 62.2% 59.9% 69.3% 92.7% 99.3%Error metric: The correct percentage is used, which represents the proportion oftest frames within 5cm translational and 5◦angular error as the present error metric.Main results and analysis Table 5.1 shows the main quantitative result on the4 Scenes dataset. The present method PLForest considerably outperforms all thebaselines and achieved the highest accuracy for all the sequences, with the averagecorrect frame percentage of 99.3%. Fig. 5.4 shows some qualitative results of thepresent camera pose estimation. The estimated camera poses including translationsand orientations are very similar to the ground truth camera poses. However, forsome scenes, such as Luke, there still exists still a few large error camera poseestimates. To further investigate the large error, the RGB images and their largeerror poses are shown in Fig. 5.5. From the RGB images, it is seen that onlya very little color information is available from which the present random pixelcomparison features in Eq. 5.1 needed to differentiate each other.68(a) (b)(c) (d)Figure 5.4: Qualitative results on Stan f ord 4 Scenes dataset. Best viewedin color. (a) apt1 living, (b) apt2 living, (c) office1 lounge, (d) of-fice2 5b. The evenly sampled every 20th frames in 3D reconstructedscenes are shown for visualization. The ground truth (hollow red frusta)and the present estimated camera pose (green with light opacity) arevery similar. Please note the 3D scenes are only used for visualizationpurposes and are not used in the present algorithm.5.6.2 Evaluations on Microsoft 7 Scenes datasetThe developed method is also evaluated on the widely used Microsoft 7 Scenesdataset.Dataset Please refer to Sec. 3.3.1 for the introduction of Microsoft 7 Scenesdataset for a brief introduction or [93] for detailed description.69(a) (b)Figure 5.5: Large error examples on Stan f ord 4 Scenes datasetApt2 Luke. (a) an RGB image, (b) ground truth (red) and estimatedcamera pose (blue) in the 3D scene. Dominating white color sheet andsome color information caused large camera pose error.Baselines and error metric: The developed method is compared against a sparsefeature based method reported from [93], SCRF [93], a multi-output version ofSCRF [38], an uncertainty version of SCRF [102] and an autocontext version ofSCRF [6], the present backtracking regression forests in Chapter 4 in terms ofcorrect frame percentage. The presented method is also compared with PoseNetand Bayesian PoseNet in terms of median translation error and rotation error.Main results and analysis: The main results are shown in comparison with strongstate-of-the-art baselines in terms of correct frames in Table 5.2, and in terms ofmedian performance in Table. 5.3. The present method achieves the best averageaccuracy in both criteria. Compared with Stan f ord 4 Scenes dataset, the averageaccuracy on Microso f t 7 Scenes dataset is relatively low. Several reasons mayaccount for this: (i) the training and testing sequences in Stan f ord 4 Scenes arerecorded at the same time as a whole sequence by the same person while the train-ing and test sequences of Microso f t 7 Scenes dataset are recorded by differentusers as different sequences, and split into distinct training and testing sequencesets. (ii) the depth and RGB image have better quality, and have better registrationin Stan f ord4 Scenes. (iii) the scenes in Microso f t 7 Scenes are more challeng-ing due to high ambiguity especially in Stairs scene and severe motion blur. Fig.5.6 show some qualitative results for the Heads scene of the Microso f t 7 Scenes70(a)(c)(d)Figure 5.6: Qualitative results for the Heads scene in Microso f t 7 Scenesdataset. Best viewed in color. (a) Training (blue), test ground truth(red), and test estimated camera poses (green) are evenly sampled forevery 20th images. The present estimated camera pose is similar toground truth in both translation and rotation. The large errors occur inplaces where training poses are very different from test poses. (b) the3D volume scene representation (c) large error image due to its pose isdifferent from training poses, motion blur and texture-less areas.71Table 5.2: Relocalization results for the 7 Scenes dataset. Test frames sat-isfying the error metric (within 5cm translational and 5◦ angular error)are shown for the present method on all scenes against five strong state-of-the-art methods: SCRF [93], Multi[38], BTBRF, Uncertainty [102],AutoConext [6]. The best performance is highlighted.Baselines OursScene Space SCRF[93] Multi[38] BTBRF Uncertainty[102] AutoContext [6] PLForestTraining RGB-D RGB-D RGB-D RGB-D RGB-D RGB-DTest RGB-D RGB-D RGB-D RGB-D RGB-D RGB-DChess 6m3 92.6% 96% 99.6% 99.4% 99.6% 99.5%Fire 2.5m3 82.9% 90% 95.2% 94.6% 94.0% 97.6%Heads 1m3 49.4% 56% 90.4% 95.9% 89.3% 95.5%Office 7.5m3 74.9% 92% 95.9% 97.0% 93.4% 96.2%Pumpkin 5m3 73.7% 80% 75.7% 85.1% 77.6% 81.4%Kitchen 18m3 71.8% 86% 89.4% 89.3% 91.1% 89.3%Stairs 7.5m3 27.8% 55% 60.8% 63.4% 71.7% 72.7%Average — 67.6% 79.3% 86.7% 89.5% 88.1% 90.3%Table 5.3: Relocalization results for the 7 Scenes dataset. Median perfor-mance for the present method is shown on all scenes against five state-of-the-art methods: PoseNet [53], Bayesian Bayesian PoseNet [51], ActiveSearch without prioritization [87], SCRF [93], BTBRF. The best perfor-mance is highlighted.Baselines OursScene PoseNet[53] Bayesian[51] AC[87] SCRF[93] BTBRF PLForestTraining RGB RGB RGB RGB-D RGB-D RGB-DTest RGB RGB RGB RGB-D RGB-D RGB-DChess 0.32m, 8.12◦ 0.37m, 7.24◦ 0.04m, 1.96◦ 0.03m, 0.66◦ 0.015m, 0.59◦ 0.014m, 0.57◦Fire 0.47m, 14.4◦ 0.43m, 13.7◦ 0.03m, 1.53◦ 0.05m, 1.50◦ 0.016m, 0.89◦ 0.009m, 0.48◦Heads 0.29m, 12.0◦ 0.31m, 12.0◦ 0.02m, 1.45◦ 0.06m, 5.50◦ 0.020m, 1.84◦ 0.008m, 0.68◦Office 0.48m, 7.68◦ 0.48m, 8.04◦ 0.09m, 3.61◦ 0.04m, 0.78◦ 0.018m, 0.75◦ 0.017m, 0.73◦Pumpkin 0.47m, 8.42◦ 0.61m, 7.08◦ 0.08m, 3.10◦ 0.04m, 0.68◦ 0.023m, 0.84◦ 0.019m, 0.65◦Kitchen 0.59m, 8.64◦ 0.58m, 7.54◦ 0.07m, 3.37◦ 0.04m, 0.76◦ 0.025m, 1.02◦ 0.025m, 1.02◦Stairs 0.47m, 13.8◦ 0.48m, 13.1◦ 0.03m, 2.22◦ 0.32m, 1.32◦ 0.040m, 1.22◦ 0.027m, 0.71◦Average 0.44m, 10.4◦ 0.47m, 9.81◦ 0.05m, 2.46◦ 0.08m, 1.60◦ 0.022m, 1.02◦ 0.017m, 0.70◦72(a) (b) (c)Figure 5.7: Large error examples for Of f ice scene on Microso f t 7 Scenes.(a) Fire Scene. 3D Volume with training (green) and test (red) cameratrajectory. It is seen that test sequence on the upper right part of thescene is very different from training sequence, which caused the largepose estimation errors for several frames. (b) an RGB image examplecorresponding to the upper right of the camera trajectory in (a). (c)qualitative comparison of ground truth (red) and estimated camera posein similar spatial scale (2m in x axis, 2m in y axis and 1m of y axis) as(a).dataset. The present estimated camera pose is similar to ground truth. The largeerror occur in places where the test poses are very different from training poses.Similar findings are also seen in some other scenes, such as the Fire scene case asshown in Fig. 5.7.5.6.3 Evaluations on TUM dynamic datasetTo demonstrate the efficacy and robustness against dynamic objects, the developedmethod is evaluated using TUM dynamic dataset.Dataset: Section. 4.3.1 gives a brief introduction. Also, [98] provides a detaileddescription.Baselines and error metric: Correct percentage of test frames are used within5cm and 5◦, median translational and angular error, and root mean squared error(RMSE) for Absolute Trajectory Error (ATE).73(a) (b)(c) (d)(e) (f)Figure 5.8: Quantitative results on TUM dynamic dataset. Two goodscenes, two moderate scenes and two bad scenes are shown. The cam-era trajecotry is plotted on XY plane. Ground truth camera trajectory isshown in black line, estimated trajectory is shown in blue line and thedifference is shown in red line. (a) sitting xyz, (b) walking xyz, (c) sit-ting halfsphere, (d) walking halfsphere, (e) sitting rpy, (f) walking rpy74Table 5.4: Camera relocalization results for the TUM dataset . Correctpercentage performance, median performance, RMSE of ATE are shown.BTBRF PLForestScene Correct Median RMSE Correct Median RMSEsitting static 64.6% 0.015m, 0.99◦ 0.018m 72.2% 0.012m, 0.93◦ 0.016msitting xyz 70.2% 0.029m, 0.72◦ 0.039m 74.1% 0.028m, 0.73◦ 0.047msitting halfsphere 44.4% 0.056m, 1.59◦ 0.046m 39.8% 0.061m, 1.76◦ 0.072msitting rpy 74.6% 0.029m, 0.98◦ 0.065m 77.9% 0.026m, 0.94◦ 0.069mwalking halfsphere 61.7% 0.042m, 1.03◦ 0.085m 46.6% 0.052m, 1.26◦ 0.111mwalking rpy 53.7% 0.047m, 1.14◦ 0.551m 53.4% 0.047m, 1.01◦ 0.169mwalking static 89.2% 0.019m, 0.49◦ 0.027m 97.3% 0.017m, 0.35◦ 0.021 mwalking xyz 41.7% 0.048m, 1.24◦ 0.064m 46.6 % 0.047m, 1.22◦ 0.063mAverage 62.5% 0.036m, 1.02◦ 0.119m 63.5% 0.036m, 1.02◦ 0.071m(a) (b)Figure 5.9: Failure cases on TUM dynamic dataset. (a) walk-ing halfsphere. Dynamic objects dominates the image and severe mo-tion blur exists. (b) walking rpy. Large rotation angle changes.75Main results and analysis: Table 5.4 show the main results of the present methodfor TUM Dynamic dataset. Compared with the static Stan f ord 4 Scenes andMicroso f t 7 Scenes, the performance is much lower due to high dynamics, butstill could perform in most cases. The present method is more accurate than BT-BRF in terms of average correct frames and RMSE of ATE, and have the sameaverage median performance. The present method could work satisfactorily withhighly dynamic scenes, but still struggles in some extreme cases. Fig. 5.8 showthe qualitative results on two good, two moderate and two bad sequences. Dy-namic occlusions dominate the scene. They cause too few inliers and thereforelead to failure cases. Moreover, severe motion blur is also quite apparent in dy-namic scenes due to the movement of both camera and objects such as in Fig. 5.9(a). The other important failure case is large rotation angle changes, as shown inFig. 5.9 (b). This is because the present random RGB comparison feature is notinvariant to rotation.5.7 ConclusionsThis chapter presented to exploit both line and point features within the frameworkof uncertainty driven regression forests. We simultaneously consider the point andline predictions in a unified camera pose optimization framework. We extensivelyevaluate the proposed approach in three datasets with different space scale anddynamics. Experimental results demonstrate the efficacy of the developed method,showing superior state-of-the-art or on-par performance. Furthermore, differentfailure cases are thoroughly demonstrated, throwing some light into possible futurework.76Chapter 6Mobile robot autonomousnavigation in uneven andunstructured environmentsThis chapter presents an integrated software and hardware architecture for au-tonomous mobile robot navigation in 3D uneven and unstructured indoor envi-ronment. (Video link: https://www.youtube.com/watch?v=Bh-cHpbn3zEt=10s)6.1 IntroductionAutonomous mobile robot navigation plays a vital role in self-driving cars, ware-house robots, personal assistant robots and smart wheelchairs, especially with inview of the shortage of a trained workforce and an ever-increasing aging popu-lation. Significant progress has been achieved in recent decades advancing thestate-of-the-art of mobile robot technologies. These robots are operating more andmore in unknown and unstructured environments, which requires a high degree offlexibility, perception, motion and control. Companies such as Google and Uberare developing advanced self-driving cars and expecting to move them into themarket in the next few years. Various mobile robots are roaming in factories andwarehouses to automate the production lines and inventory, saving workers fromwalking daily marathons [25]. Robot vacuums such as Roomba and Dyson36077eyes are moving around in the house to help clean the floors. Personal robots suchas PR2 [46, 67] and Care-O-bot [82] have demonstrated the ability to perform avariety of integrated tasks such as long-distance navigation and complex manipu-lation.Mobile robots navigating autonomously and safely in uneven and unstructuredenvironments still face great challenges. Fortunately, more and more environmentsare designed and built for wheelchairs, providing sloped areas for wheeled robotsto navigate through. However, little work focuses on an integrated system of au-tonomous navigation in sloped and unstructured indoor areas, especially narrowsloped areas and cluttered space in many modern buildings. The robots are re-quired to safely navigate in narrow uneven areas such as those shown in Fig. 6.1while avoiding static and dynamic obstacles such as people and pets.This chapter presents an integrated software and hardware framework for au-tonomous mobile robot navigation in uneven and unstructured indoor environmentsthat are designed for and shared with people. Fig. 6.3 shows a high-level system ar-chitecture of the present work. The robot first builds a 3D OctoMap representationfor an uneven environment with the present 3D simultaneous localization and map-ping (SLAM) using wheel odometry, 2D laser and RGB-D data. Then the presentwork projects multi-layer maps from OctoMap to generate the traversable map,which serves as the input for the present path planning and navigation. The robotemploys a variable step size RRT approach for global planning, adaptive MonteCarlo localization method to localize itself, and elastic bands method as the localplanner to close the gap between global path planning and real-time sensor-basedrobot control. The present focus is especially on efficient and robust environmentrepresentation and path planning. It is believed that reliable autonomous naviga-tion in uneven and unstructured environments is not only useful for mobile robotsbut could also provide helpful insight on smart wheelchair design in the future.6.2 Hardware and software platformThe development of a mobile robot system to work around us and assist people isthe long-term goal. The main design principle for this system is that each hardwareand software component could work as both a single module and a part of an inte-78Figure 6.1: The robot is navigating up the slope to the goal at the higherplatform. In the presence of staircases and slope, the robot builds a 3Drepresentation of the environment for the traversable map, and then therobot can navigate through the slope and avoid the staircases to reachthe goal efficiently and safely.grated system. To realize this principle, these hardware components are assembledusing screws and adjustable frames, while the software uses the Robot OperatingSystem (ROS) [78].Fig. 6.2 shows the hardware platform for the present robot. It includes a Seg-way RMP100 mobile base, an ASUS Xtion on the upper base, a Hokuyo UTM-30LX laser scanner mounted on the lower base, and a DELL laptop with Intel Corei7-4510U at 2GHz, 16GB memory (without GPU).The software system is implemented in ROS Indigo release on top of an Ubuntuversion 14.04LTS operating system. The 3D simulation experiments are performedon Gazebo [55]. Fig. 6.3 illustrates a high-level software architecture, and detailedkey components of the software architecture will be described in Sec 6.3 and 6.5.79Figure 6.2: Robot hardware platform. (a) Segway robot in a sloped area.The robot base is segway RMP100 with custom installed casters forsafety and onboard battery pack for providing power to sensors. (b)Xtion Pro Live RGB-D camera is capable of providing 30Hz RGB anddepth images, with 640x480 resolution and 58 HFV. (c) Hokuyo UTM-30LX laser scanner with range 10m to 30m, and 270◦ area scanningrange for localization.6.3 Environment representationA consistent and accurate representation of the environment is a crucial compo-nent of autonomous systems as it serves as input for motion planning to generatecollision-free and optimal motions for the robot.6.3.1 3D SLAM3D SLAM pipelines commonly consist of localization and mapping components.Localization is the process to estimate robot pose, and mapping (or fusion) involvesintegrating new sensor observations into the 3D reconstruction of the environment.For an environment which lacks of feature points (such as SIFT, ORB) the presentwork uses a 2D laser to localize and an RGB-D camera to provide point cloudsfor building 3D maps. For environment which has rich feature points, the presentwork uses RGB-D camera to both localize the robot and build the map.80Figure 6.3: High-level system architecture. The robot first builds a 3DOctoMap representation for uneven environment with the present 3DSLAM using wheel odometry, 2D laser and RGB-D data. Multi-layermaps from OctoMap are used for generating the traversable map, whichserves as the input for autonomous navigation. The robot employs avariable step size RRT approach for global planning, adaptive MonteCarlo localization method to localize itself, and elastic bands method asthe local planner to gap the global planning and real-time sensor-basedrobot control.3D SLAM using wheel odometry, 2D laser and RGB-D cameraTo overcome the challenge that vision-based SLAM is not robust when the envi-ronment lacks local features, the present work employs wheel odometry, a 2D laserand an RGB-D camera concurrently to complement each other.The present 3D SLAM framework builds on top of Karto SLAM [2], a robustmethod containing scan matching, loop detection, Sparse Pose Adjustment [57]as the solver for pose optimization and 2D occupancy grid construction. KartoSLAM takes in data from the laser range finder and wheel odometry. It is the bestperforming ROS-enabled SLAM technique in the real world, being less affected bynoise [86, 106] compared with other 2D SLAM methods, such as gmapping [35],Hector SLAM [56], Lago SLAM [13], and GraphSLAM [101].Instead of using Karto SLAM’s default 2D occupancy map, which we foundcannot represent an uneven environment reliably, the present work builds the en-vironment based on OctoMap [47]. It is a probabilistic, flexible, and compact 3Dmapping method which can represent a free, occupied and unknown environment.At each time step, the algorithm accepts point clouds of the environment from the81(a) (b)Figure 6.4: 3D environment representation of simulated world. (a) Simu-lated environment model in Gazebo (b) 3D OctoMap environment builtby our 3D SLAM using wheel odometry, a 2D laser scanner and anRGB-D sensor.RGB-D sensor and the localization information. Fig. 6.4(b) shows an OctoMaprepresentation of the simulated world generated by our 3D SLAM. Note that freespace is explicitly modeled in the experiment but is not shown in the figure forclarity.3D SLAM using RGB-D cameraFor environments that have rich feature points, the present work uses an architec-ture based on top of a variant of ORB-SLAM [72] and Octomap [47]. In orderto satisfy our need in the subsequent global localization and path planning pro-cedures, the present work has two main differences compared with the originalORB-SLAM: more keyframes and 3D probabilistic OctoMap representation.Different from ORB-SLAM which uses the survival o f the f ittest strategy forselecting keyframes, the present work keeps as many keyframes as possible in orderto obtain more camera poses for the training data in the camera relocalization in theautonomous navigation stage. As for the regression forest based global localizationstage detailed in Sec. 6.4.1, the time spent on global pose optimization in SLAMmay be sacrificed to obtain more keyframe poses for more training data and moreaccurate global localization. Therefore, the present work includes all the edgesprovided by the covisibility graph [97] rather than using the essential graph inORB-SLAM.82(a)(b)Figure 6.5: Octomap representation for f r1/room of TUM RGB-DSLAM benchmark [98] with visual SLAM. (a) the sparse map fromoriginal ORB-SLAM [72], map points (black, red), keyframes (blue).(b) OctoMap representationThe output map from ORB-SLAM is the sparse feature map which is shownin Fig. 6.5 (a). It cannot directly used for path planning and navigation. To over-come this problem, the present work employs the keyframe camera poses describedabove and point cloud as the input for OctoMap to generate 3D environment repre-sentation. The raw point cloud generated from whether from RGB and depth imageor directly from consumer RGB-D camera is very noisy due to varying point den-sities, measurement error, wrong registrations etc, posing great challenges for thesubsequent surface/volume reconstruction stage. Therefore, the present work em-ploys outlier removal based on the distribution of point to neighbors distances. Foreach raw point, the mean distance from a number of its neighbors is computed.It is assumed that the resulted distances are Gaussian distributed, and all pointswhose mean distances are larger than a threshold ( One standard deviation is usedhere) are treated as outliers and removed. Then the inlier point cloud is sent to theOctoMap. The result is shown in 6.5 (b).83(a)(b)Figure 6.6: Generation of traversable map from multilayer maps for theGazebo Caffe environment. (a) slope and staircase visualization withoccupied voxels, (b) multi-layer maps and traversable map. In thetraversable map, the staircases are occupied space, while the slope areaexcept the edge is free space, which is safe for robot to navigate. Fordetails please refer to Sec. 6.3.2.6.3.2 Multilayer maps and traversable mapAfter the environment is represented by OctoMap as mentioned above, the Oc-toMap is cut from bottom to top with several layers depending on the environmentand the required accuracy. Then these multilayers are integrated into a traversablemap, which is safe and efficient for the robot to navigate through. Both the multi-layer maps and the traversable map are represented as 2D grid occupancy maps, inwhich black represents occupied space, white as free space, and gray as unknownspace. A flag F is used to mark the map traversablility and decide whether an area84is traversable:F ={0, if αi ≥ θ , untraversable area1, if αi < θ , slope, traversable area(6.1)where θ represents the angle threshold which can be set according to the robotclimbing capability, and αi is the ith layer gradient:αi = arctandl ji= arctandrv ji(6.2)where d represents the distance between projected layers as shown in Fig. 6.6(a). It can be set to be as small as the resolution of the OctoMap for accuraterepresentation. l ji represents the length of the edge difference between differentlayers, and it can be obtained through the number of voxels v ji in that area andOctoMap resolution r. Take the environment in Fig. 6.4 (a) as an example, thefirst four layers of the 2D map from the OctoMap is shown in Fig. 6.6 (a) andthe left four maps in Fig. 6.6 (b). Then the gradient between layers are checked.For instance, in Fig. 6.6 (a), both α1 and α2 are less than θ , and that area ismarked as a slope while β1 and β2 are larger than θ , And that area is marked asan untraversable area. At the same time, the left and right edges of the slope aremarked as occupied for safety. The integration of all these multi-layer maps willgenerate the traversable map as shown in Fig. 6.6 (b).6.4 LocalizationMobile robot localization comprises the problem of determining the pose of a robotrelative to a known environment [101]. Local and global localization are two typesof localization [101]. The former aims at compensating for robot motion noisewhich is usually small, and it requires the initial location of the robot. While thelatter can localize without any information about the previous motion. In globallocalization, the initial pose and previous motion information are unknown [101].Global localization is more difficult than local localization. However, it plays acritical role as it can help to solve the kidnapped robot problem and allow the robotto recover from severe pose errors, which is essential for truly autonomous robots.856.4.1 Regression forests based global localizationThe present work uses the backtracking regression forests method in Chapter 4 asit is a faster than the method in Chapter 5, and more appropriate for real robot nav-igation which needs fast response. To further increase the speed, the backtrackingnumber 4 is used. As shown in Fig. 4.8, the accuracy increases quickly as thebacktracking leaf node number increases before 4, and slowly thereafter, eventu-ally saturating. In order to communicate with other modules, the present work usesROS subscriber to accept RGB and depth images from RGB-D camera, and ROSpublisher to send the camera pose to the path planning unit.6.4.2 Local localizationThe present robot employs adaptive Monte Carlo localization (AMCL) [29] forlocal localization. AMCL is a method that uses a particle filter to track the pose ofthe robot with a known map. It takes laser scans, the wheel odometry informationand the traversable map, and then outputs pose estimation.6.5 Planning and execution6.5.1 Global and local planningThe traversable map serves as the input for the present planning and navigationframework. The present work applies an efficient variable step size RRT planner[109] as the global planner to construct a plan in a configuration space C to reachthe goal. The global planner generates a list of way-points for the overall optimalpath from the starting position to the goal position through the map, while ignoringthe kinematic and dynamic constraints of the robot. Then the local planner takesinto account the robot kinematic and dynamic constraints, and generates a seriesof feasible local trajectories that can be executed from the current position, whileavoiding obstacles and staying close to the global plan. the present work usesElastic Bands method [79] as our local planner to close the gap between globalpath planning and real-time sensor-based robot control.866.5.2 Plan executionDuring execution, the short and smooth path from the local planner is convertedinto motion commands for the robot mobile base. The local planner computes thevelocities required to reach the next pose along the path and checks for collisionsin the updated local cost-map.6.6 ExperimentsExtensive experiments are conducted to evaluate the developed system both insimulation and real-world, demonstrating its efficacy for real-time mobile robotautonomous navigation in uneven and unstructured environments. Note that thepresent work also focuses on an integrated system besides the component meth-ods.6.6.1 Simulation experimentsEnvironment representation: The simulation is conducted in Caffe using an en-vironment of size 13m x 10m x 3m, which was built in Gazebo. A visualizationis shown in Fig. 4. The simulated robot model is equipped with wheel odometry,a 2D laser range finder and an RGB-D sensor. For this simulated environment,the present work only uses SLAM with a 2D laser and an RGB-D camera as theGazebo environment has very few visual features. The robot is tele-operated tomove around the environment. Fig. 6.4 (b) shows the OctoMap generated by thepresent 3D SLAM with 2D laser and RGB-D camera. The resolution of the Oc-toMap is set as 0.05m to strike a trade-off between speed and accuracy, and thethreshold on the occupancy probability is 0.7. Note that free space is explicitlymodeled in the experiment but is not shown in the figure for clarity. To evaluatethe present 3D SLAM with only the RGB-D camera, the present work uses thestandard TUM RGB-D SLAM benchmark [98]. Fig. 6.5 (b) shows our Octomaprepresentation for the f r1/room scene. For the Gazebo environment, the presentwork chooses four layers of projected maps as shown in the left part of Fig. 6.6 (b).It can be seen that the staircase area shows steeper changes than the sloped area.Therefore, in the generation of the present traversable map, the steeper changes are87(a) (b) (c) (d) (e)Figure 6.7: Autonomous navigation in the simulated environment. Thefirst row shows images of autonomous navigation in Gazebo, and thesecond row shows the Rviz views of the process.treated as untraversable area and the area with slower changes as the slope. Theright part of Fig. 6.6 (b) shows the generated traversable map, which will be usedfor the present robot autonomous navigation.Autonomous navigation For autonomous navigation, the present work uses thetraversable map and sensors data as the input for path planning and localization.Various autonomous navigation experiments are conducted in which the robot startsand reaches goals at different positions. Fig. 6.7 illustrates an example of thepresent robot autonomous robot navigation process in which the robot starts atlower ground and reaches the goal in the higher platform. The blue and greenlines show the exploration process of the global planner. This process ended aftera global path connecting start point and end point is found, as shown by the redline. The local path planner can smooth the global path if it is not optimal, and theoptimal path is marked as a concatenation of green bubbles. Fig.6.7 (b) – Fig.6.7(c) show the re-planning process when an obstacle is found. If the height of a slopeis higher than the laser on the robot, then the robot would expect that there is an ob-stacle ahead. The robot can replan its path when this occurs. Fig.6.7 (d) – Fig.6.7(e) show the re-planning process of the local planner when a real obstacle appears.The robot can avoid the obstacle and reach the target position.88(a) (b)Figure 6.8: Real environment. (a) a photo of the real environment. (b) 3Drepresentation of the environment with OctoMap. Only occupied voxelsare shown for visualization.(a) (b) (c) (d) (e)Figure 6.9: Multilayer maps and traversable map for real environment.(a)-(d) multiple projected layers from OctoMap, (e) the traversable map.The staircases and slope edge are occupied while the slope is free space.(a) (b) (c) (d) (e)Figure 6.10: Robot autonomous navigation example in real environment.The first row shows images of robot autonomous navigation, and thesecond row shows screenshots of Rviz views of the process.89(a) (b) (c) (b) (c)Figure 6.11: Dynamic obstacle avoidance. (a) a dynamic obstacle is ap-proaching the robot. (b) The human suddenly blocks the way in frontof the robot. (c) The robot changes direction to avoid the human.6.6.2 Real-world experimentsExtensive experiments are conducted with a Segway Robot in a real-world envi-ronment of the Industrial Automation Laboratory. Fig. 6.8 shows the present realenvironment with dimension of 11m× 5m× 3m, and the 3D OctoMap generatedfrom the present 3D SLAM with wheel odometry, a 2D laser and an RGB-D data.At this 3D SLAM stage, we teleoperate the present Segway robot to move aroundthe environment. Compared with the simulated environment, the real environmentfeatures challenges such as narrower slopes, different lighting conditions, and glasswindows. The real robot is much more noisy due to long-term wear-and-tear, un-calibrated wheel odometry, and the disturbance of casters. The sensors are alsomore noisy. Therefore, the OctoMap of the real environment and projected multi-layer maps in Fig. 6.9 tend to have some distortions and inconsistent registrationscompared with the simulated environment. The noisy environment, robot and sen-sors demand robustness from the present system. The resolution of the OctoMapis experimentally set to 0.05m to strike a trade-off between speed and accuracy.Four layers of projected maps are used to generate the traversable map in the realenvironment. The traversable map in Fig. 6.9 (e) will serve as the input for pathplanning and localization in autonomous navigation.One important parameter for robot navigating up a slope is the size of the localcostmap [65]. If the local costmap size is set too large, the far away ray sweep ofthe laser scan on the slope will intersect with the higher part of the slope, creating90an obstacle in the local costmap, which will block the robot’s path. It is especiallyobvious in the more noisy real experiment in which the robot is prone to fail in lo-calizing itself than in simulated environment. Facing localization failure, the robotwill initiate the rotation recovery mode, in which the robot rotates and laser-sweepsin a circle. If the local costmap size is too small, the obstacles may not be consid-ered in time, which may cause unsafe collision. The present work experimentallydetermined the costmap of 3 meters width and 4 meters length to work best for theuse in this use case.6.7 ConclusionsThis chapter presented an integrated software and hardware architecture for au-tonomous mobile robot navigation in 3D uneven and unstructured indoor environ-ments. This modular and reusable software framework incorporated capabilities ofperception and navigation. The present work employed an efficient path plannerwhich uses variable step size RRT in 3D environment with an octree-based repre-sentation. The present work generated a safe and feasible 2D map from multi-layermaps of 3D OctoMap for efficient planning and navigation. It is demonstrated andevaluated the developed integrated system in both simulation and real-world exper-iments. Both simulation and real-robot experiments demonstrated the efficacy andefficiency of the present methods, providing some insight for more autonomousmobile robots and wheelchairs working around us. Future work may integratemore accurate global localization and optimized path planning methods into thepresent system. Furthermore, it is desirable to explore the possibility of using 3Dsemantic scene parsing to understand the uneven areas.91Chapter 7Conclusions and future workIt is good to have an end to journey toward; but it is the journey thatmatters, in the end. — Ursula K. Leguin7.1 ConclusionsThis thesis addressed the problem of image based RGB/RGB-D camera relocal-ization, one of the core problems in computer vision and robotics. The thesis pre-sented a number of techniques based on regression forests that could improve theperformance of camera relocalization. Experiments against many strong baselinesusing various of publicly available datasets demonstrated the efficacy of the devel-oped approach, showing superior performance with respect to the state-of-the-art.This thesis also presented an integrated software and hardware architecture forautonomous mobile robot navigation in 3D uneven and unstructured indoor envi-ronments. This modular and reusable software framework incorporated capabilitiesof perception and navigation. The present work employed an efficient path plannerwhich used variable step size RRT in 3D environment with an octree-based rep-resentation. The presented work also generated a safe and feasible 2D map frommulti-layer maps of 3D OctoMap for efficient planning and navigation. The workdemonstrated and evaluated the developed integrated system in both simulation andreal-world experiments. Both simulation and real-robot experiments demonstratedthe efficacy and efficiency of the developed methods, providing some insight for92more autonomous mobile robots and wheelchairs working around us.7.2 Future directionsThis thesis is concluded by proposing several directions for possible future explo-ration.7.2.1 Speed up image based localizationThe speed of the developed method can be improved for real-time robotic appli-cations. The bottleneck is in the tree predictions, which has to backtracking manyleaf nodes. One solution is by using a GPU decision tree implementation [91],which greatly speeds up the prediction through parallel computing. An alternativemethod is to sample a few testing points but still correctly estimate the cameraposes. It requires that the samples points have more precise predictions. Fur-thermore, adaptively selecting the backtracking number will reduce unnecessarybacktracking times and will speed up the prediction process.7.2.2 Transfer learningFor a new environment, the developed method requires an offline training process,which usually costs much time (up to hours). Transfer learning can speed up thetraining process so that the method would become useful. Transfer learning aimsto extract knowledge from one or more source tasks and apply the knowledge toa target task. A major assumption in statistical machine learning is that trainingdata and test data must be in the same feature space, independently and identi-cally distributed. When the distribution changes, most statistical models need tobe rebuilt using newly collected training data. In many real-world applications, itis expensive or impossible to recollect the required training data again and rebuildthe models. Transfer learning can be very beneficial in such cases.An early survey [76] provides a comprehensive overview of transfer learn-ing for classification, regression and clustering. Recently, the transferability ofdeep convolutional neural networks has been successfully demonstrated for objectrecognition [118], place recognition [99] and camera relocalization [53] throughpre-training on large source data and fine-tuning on small target data. However,93it is not adequately studied in the context of random forests, especially for regres-sion. Lee et al.[113] build a prediction model incrementally from a partial decisiontree model. Goussies et al.[34] present a transfer learning decision forests methodto recognize gestures and characters, where random forests are used as classifiers.To the best of the present knowledge, random forests as a regressor for camerarelocalization have not been studied.It is believed that valuable knowledge can be transferred from the source tasksS = {S1,S2, ...SN} to the target task T in the camera relocalization, as it happenswith localization in humans. For instance, it is easier to learn to localize in thenew environment if a similar environment has already been localized. In otherwords, there exists latent information (or experience) that is common in similarenvironment that can be understood as common sense.In the case of transfer learning of regression forests, it is believed that thereare common tree structures and parameters among different scenes, which can beshared. This information can then be included in the process of growing each treeof forest when it encounters new information. The main idea, therefore, is to findthe parameters θ of weak learners for each tree in order to obtain a partition of thefeature space.7.2.3 Active image-based localization learning for autonomous robotCurrently all the methods based on regression forests and deep learning only con-sider the image-based localization re-localization for the passive camera sensor,which exclusively addresses the localization estimation based on an incoming streamof sensor data rather than for the camera sensor mounted on an active robot. Thesemethods assume that neither the camera motion, nor the pointing direction of thecamera sensor can be controlled. Therefore they do not exploit the opportunity tocontrol robot’s active motion during localization. Active localization needs to setthe robot’s motion direction and determine the pointing direction of the sensorsto more efficiently localize the robot [10]. It is a promising research direction formore efficient and more robust camera localization methods. Active localizationfor mobile robot navigation was first studied in [10] on top of Markov localiza-tion using sonar sensor. Active localization has been studied in the spare-feature-94based camera re-localization methods, but has never been studied in the machinelearning-based camera re-localization methods. As in sparse-feature feature basedmethods, the number of sparse features is an important signature to indicate thelocalization quality. When the number of sparse features is small, the robot willchange motion till enough sparse features appear. Here, in the machine learning-based method, it is difficult to have a criterion to indicate the localization qualitywhile the robot is navigating through the environment. It will be important tofind the criteria to decide ”where the robot should move” and ”where the camerashould look” for these machine learning-based camera re-localization methods soas to best localize the robot. In the current study, severe motion blur and repeti-tive environments such as corridors or stairs (as shown in the 7 Scenes dataset inprevious chapters) are very adverse situations for machine learning-based camerare-localization methods.95Bibliography[1] VXL C++ libraries for computer vision research and implementation.http://vxl.sourceforge.net. Accessed: 2016-08-03.[2] Karto SLAM, ROS package. accessed Nov, 2016. [online],wiki.ros.org/slam karto. 2010.[3] Y. Amit and D. Geman. Randomized inquiries about shape: An applicationto handwritten digit recognition. Technical report, CHICAGO UNIV ILDEPT OF STATISTICS, 1994.[4] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. Netvlad: Cnnarchitecture for weakly supervised place recognition. In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition, pages5297–5307, 2016.[5] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features(surf). Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.[6] E. Brachmann, F. Michel, A. Krull, M. Ying Yang, S. Gumhold, et al.Uncertainty-driven 6d pose estimation of objects and scenes from a singlergb image. In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition, pages 3364–3372, 2016.[7] G. Bradski and A. Kaehler. Learning OpenCV: Computer vision with theOpenCV library. ” O’Reilly Media, Inc.”, 2008.[8] L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.[9] L. Breiman. Random forests. Machine learning, 2001.[10] W. Burgard, D. Fox, and S. Thrun. Active mobile robot localization. InIJCAI, pages 1346–1352, 1997.96[11] W. Burgard, A. B. Cremers, D. Fox, D. Ha¨hnel, G. Lakemeyer, D. Schulz,W. Steiner, and S. Thrun. The interactive museum tour-guide robot. InAAAI, 1998.[12] J. B. Burns, A. R. Hanson, and E. M. Riseman. Extracting straight lines.IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI),(4):425–455, 1986.[13] L. Carlone, R. Aragues, J. A. Castellanos, and B. Bona. A linearapproximation for graph-based simultaneous localization and mapping.Robotics: Science and Systems VII, pages 41–48, 2012.[14] T. Cavallari, S. Golodetz, N. A. Lord, J. Valentin, L. Di Stefano, and P. H.Torr. On-the-fly adaptation of regression forests for online camerarelocalisation. 2017.[15] M. Chandraker, J. Lim, and D. Kriegman. Moving in stereo: Efficientstructure and motion using lines. In Proceedings of the IEEE internationalconference on computer vision(ICCV), pages 1741–1748. IEEE, 2009.[16] K. Chatfield, V. S. Lempitsky, A. Vedaldi, and A. Zisserman. The devil isin the details: an evaluation of recent feature encoding methods. In BritishMachine Vision Conference (BMVC), 2011.[17] J. Chen and P. Carr. Mimicking human camera operators. In Applicationsof Computer Vision (WACV), 2015 IEEE Winter Conference on, pages215–222. IEEE, 2015.[18] L. Chen, W. Li, and D. Xu. Recognizing RGB images by learning fromRGB-D data. In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR), pages 1418–1425, 2014.[19] A. Criminisi and J. Shotton. Decision forests for computer vision andmedical image analysis. Springer Science & Business Media, 2013.[20] A. Criminisi, J. Shotton, D. Robertson, and E. Konukoglu. Regressionforests for efficient anatomy detection and localization in ct studies. InInternational MICCAI Workshop on Medical Computer Vision. Springer,2010.[21] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests forclassification, regression, density estimation, manifold learning andsemi-supervised learning. Technical Report MSR-TR-2011-114, MicrosoftResearch, 2011.97[22] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unifiedframework for classification, regression, density estimation, manifoldlearning and semi-supervised learning. Foundations and Trends inComputer Graphics and Vision, 7(2–3):81–227, 2012.[23] M. Cummins and P. Newman. Appearance-only SLAM at large scale withfab-map 2.0. The International Journal of Robotics Research, 30(9):1100–1123, 2011.[24] A. Dai, M. Nießner, M. Zollo¨fer, S. Izadi, and C. Theobalt. Bundlefusion:Real-time globally consistent 3D reconstruction using on-the-fly surfacere-integration. ACM Transactions on Graphics, 2017.[25] R. D’Andrea. Guest editorial: A revolution in the warehouse: Aretrospective on Kiva systems and the grand challenges ahead. IEEETransactions on Automation Science and Engineering, 2012.[26] E. Dubrofsky and R. Woodham. Combining line and pointcorrespondences for homography estimation. Advances in VisualComputing, pages 202–213, 2008.[27] J. Engel, T. Scho¨ps, and D. Cremers. Lsd-slam: Large-scale directmonocular slam. In European Conference on Computer Vision (ECCV),pages 834–849. Springer, 2014.[28] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigmfor model fitting with applications to image analysis and automatedcartography. Communications of the ACM, 1981.[29] D. Fox, W. Burgard, F. Dellaert, and S. Thrun. Monte carlo localization:Efficient position estimation for mobile robots. AAAI/IAAI, 1999(343-349):2–2, 1999.[30] D. Ga´lvez-Lo´pez and J. D. Tardos. Bags of binary words for fast placerecognition in image sequences. IEEE Trans. on Robotics, 2012.[31] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng. Complete solutionclassification for the perspective-three-point problem. IEEE Transactionson Pattern Analysis and Machine Intelligence (PAMI), 25(8):930–943,2003.[32] A. P. Gee and W. W. Mayol-Cuevas. 6D relocalisation for RGBD camerasusing synthetic view regression. In British Machine Vision Conference(BMVC), 2012.98[33] B. Glocker, J. Shotton, A. Criminisi, and S. Izadi. Real-time RGB-Dcamera relocalization via randomized ferns for keyframe encoding.Visualization and Computer Graphics, IEEE Trans. on, 2015.[34] N. A. Goussies, S. Ubalde, and M. Mejail. Transfer learning decisionforests for gesture recognition. Journal of Machine Learning Research,2014.[35] G. Grisetti, C. Stachniss, and W. Burgard. Improved techniques for gridmapping with rao-blackwellized particle filters. IEEE Transactions onRobotics, 23(1):34–46, 2007.[36] R. Grompone von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall. Onstraight line segment detection. Journal of Mathematical Imaging andVision, 32(3):313–347, 2008.[37] A. Gupta, J. J. Little, and R. J. Woodham. Using line and ellipse featuresfor rectification of broadcast hockey video. In Computer and Robot Vision(CRV), Canadian Conf. on, 2011.[38] A. Guzman-Rivera, P. Kohli, B. Glocker, J. Shotton, T. Sharp,A. Fitzgibbon, and S. Izadi. Multi-output learning for camerarelocalization. In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2014.[39] R. Hartley and A. Zisserman. Multiple view geometry in computer vision.Cambridge university press, 2003.[40] F. W. C. Hazirbas, L. L.-T. T. Sattler, S. Hilsenbeck, and D. Cremers.Image-based localization using lstms for structured feature correlation.2017.[41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for imagerecognition. In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR), 2016.[42] W. He, D. Goodkind, and P. Kowal. An aging world: 2015. US CensusBureau, pages 1–165, 2016.[43] Y. Hel-Or and H. Hel-Or. Real-time pattern matching using projectionkernels. IEEE Transactions on Pattern Analysis and Machine Intelligence(PAMI), 27(9):1430–1445, 2005.99[44] T. K. Ho. The random subspace method for constructing decision forests.IEEE transactions on pattern analysis and machine intelligence (PAMI), 20(8):832–844, 1998.[45] B. K. Horn. Closed-form solution of absolute orientation using unitquaternions. JOSA A, 4(4):629–642, 1987.[46] A. Hornung, M. Phillips, E. G. Jones, M. Bennewitz, M. Likhachev, andS. Chitta. Navigation in three-dimensional cluttered environments formobile manipulation. In Robotics and Automation (ICRA), Proceedings ofIEEE International Conference on, 2012.[47] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard.Octomap: An efficient probabilistic 3D mapping framework based onoctrees. Autonomous Robots, 2013.[48] C.-X. Ji and Z.-P. Zhang. Stereo match based on linear feature. In PatternRecognition, 1988., 9th International Conference on, pages 875–878.IEEE, 1988.[49] W. Kabsch. A solution for the best rotation to relate two sets of vectors.Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoreticaland General Crystallography, 32(5):922–923, 1976.[50] P. Kahn, L. Kitchen, and E. M. Riseman. A fast line finder forvision-guided robot navigation. IEEE Transactions on Pattern Analysis andMachine Intelligence (PAMI), 12(11):1098–1102, 1990.[51] A. Kendall and R. Cipolla. Modelling uncertainty in deep learning forcamera relocalization. In Robotics and Automation (ICRA), Proceedings ofIEEE International Conference on, pages 4762–4769. IEEE, 2016.[52] A. Kendall and R. Cipolla. Geometric loss functions for camera poseregression with deep learning. arXiv preprint arXiv:1704.00390, 2017.[53] A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional networkfor real-time 6-dof camera relocalization. In Proceedings of the IEEEinternational conference on computer vision(ICCV), pages 2938–2946,2015.[54] G. Klein and D. Murray. Parallel tracking and mapping for small ARworkspaces. In Mixed and Augmented Reality., 2007.100[55] N. Koenig and A. Howard. Design and use paradigms for gazebo, anopen-source multi-robot simulator. In Intelligent Robots and Systems(IROS), 2004 IEEE/RSJ International Conference on, 2004.[56] S. Kohlbrecher, J. Meyer, T. Graber, K. Petersen, U. Klingauf, and O. vonStryk. Hector open source modules for autonomous mapping andnavigation with rescue robots. In Robot Soccer World Cup. Springer, 2013.[57] K. Konolige, G. Grisetti, R. Ku¨mmerle, W. Burgard, B. Limketkai, andR. Vincent. Efficient sparse pose adjustment for 2D mapping. In IntelligentRobots and Systems (IROS), 2010 IEEE/RASJ International Conference on,2010.[58] M. Labbe and F. Michaud. Appearance-based loop closure detection foronline large-scale and long-term operation. IEEE Transactions onRobotics, 29(3):734–745, 2013.[59] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.[60] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEETransactions on Pattern Analysis and Machine Intelligence (PAMI), 28(9):1465–1479, 2006.[61] V. Lepetit, F. Moreno-Noguer, and P. Fua. Epnp: An accurate o (n) solutionto the pnp problem. International journal of computer vision, 81(2):155,2009.[62] R. Li, Q. Liu, J. Gui, D. Gu, and H. Hu. Indoor relocalization inchallenging environments with dual-stream convolutional neural networks.IEEE Transactions on Automation Science and Engineering, 2017.[63] Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors.Journal of the American Statistical Association, 101(474):578–590, 2006.[64] D. G. Lowe. Distinctive image features from scale-invariant keypoints.International journal of computer vision, 60(2):91–110, 2004.[65] D. V. Lu, D. Hershberger, and W. D. Smart. Layered costmaps forcontext-sensitive navigation. In Intelligent Robots and Systems (IROS),2014 IEEE/RASJ International Conference on, 2014.[66] S. Mahadevan and D. P. Casasent. Detection of triple junction parametersin microscope images. In Proc. SPIE, volume 4387, pages 204–214, 2001.101[67] E. Marder-Eppstein, E. Berger, T. Foote, B. Gerkey, and K. Konolige. Theoffice Marathon. In Robotics and Automation (ICRA), Proceedings of IEEEInternational Conference on, 2010.[68] J. Matas, C. Galambos, and J. Kittler. Robust detection of lines using theprogressive probabilistic hough transform. Computer Vision and ImageUnderstanding (CVIU), 78(1):119–137, 2000.[69] K. M. Mathieson, J. J. Kronenfeld, and V. M. Keith. Maintainingfunctional independence in elderly adults the roles of health status andfinancial resources in predicting home modifications and use of mobilityequipment. The Gerontologist, 42(1):24–31, 2002.[70] L. Meng, J. Chen, F. Tung, J. J. Little, and C. W. de Silva. Exploitingrandom RGB and sparse features for camera pose estimation. In 27thBritish Machine Vision Conference (BMVC), 2016.[71] J. J. More´. The levenberg-marquardt algorithm: implementation andtheory. In Numerical analysis, pages 105–116. Springer, 1978.[72] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. ORB-SLAM: a versatileand accurate monocular SLAM system. IEEE Transactions on Robotics(T-RO), 31(5):1147–1163, 2015.[73] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J.Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion:Real-time dense surface mapping and tracking. In Mixed and augmentedreality (ISMAR), IEEE international symp. on, 2011.[74] D. Niste´r. Preemptive RANSAC for live structure and motion estimation.Machine Vision and Applications, 2005.[75] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. InProceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR), volume 2, pages 2161–2168. IEEE, 2006.[76] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Trans. onKnowledge and Data Engineering, 2010.[77] J. Patel, S. Shah, P. Thakkar, and K. Kotecha. Predicting stock and stockprice index movement using trend deterministic data preparation andmachine learning techniques. Expert Systems with Applications, 42(1):259–268, 2015.102[78] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler,and A. Y. Ng. ROS: an open-source robot operating system. In Roboticsand Automation (ICRA) workshop on open source software, Proceedings ofIEEE International Conference on, 2009.[79] S. Quinlan and O. Khatib. Elastic bands: Connecting path planning andcontrol. In Robotics and Automation (ICRA), Proceedings of IEEEInternational Conference on, 1993.[80] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once:Unified, real-time object detection. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), 2016.[81] H. Rehbinder and B. K. Ghosh. Pose estimation using line-based dynamicvision and inertial sensors. IEEE Transactions on Automatic Control, 48(2):186–199, 2003.[82] U. Reiser, T. Jacobs, G. Arbeiter, C. Parlitz, and K. Dautenhahn.Care-o-bot R© 3–vision of a robot butler. In Your virtual butler. 2013.[83] A. Rubio, M. Villamizar, L. Ferraz, A. Penate-Sanchez, A. Ramisa,E. Simo-Serra, A. Sanfeliu, and F. Moreno-Noguer. Efficient monocularpose estimation for complex 3d models. In Robotics and Automation(ICRA), Proceedings of IEEE International Conference on, pages1397–1402. IEEE, 2015.[84] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: An efficientalternative to sift or surf. In Proceedings of the IEEE internationalconference on computer vision(ICCV), pages 2564–2571. IEEE, 2011.[85] Y. Salau¨n, R. Marlet, and P. Monasse. Robust and accurate line-and/orpoint-based pose estimation without manhattan assumptions. In EuropeanConference on Computer Vision (ECCV), pages 801–818. Springer, 2016.[86] J. M. Santos, D. Portugal, and R. P. Rocha. An evaluation of 2D slamtechniques available in robot operating system. In SSRR, 2013.[87] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritizedmatching for large-scale image-based localization. IEEE Transactions onPattern Analysis and Machine Intelligence (PAMI), 2016.[88] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition.In Proceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR), 2007.103[89] S. Se, D. G. Lowe, and J. J. Little. Vision-based global localization andmapping for mobile robots. IEEE Transactions on Robotics (T-RO), 21(3):364–375, 2005.[90] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learning to rank usingprivileged information. In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV), pages 825–832, 2013.[91] T. Sharp. Implementing decision trees and forests on a gpu. In Europeanconference on computer vision, pages 595–608. Springer, 2008.[92] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for imagecategorization and segmentation. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR). IEEE, 2008.[93] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon.Scene coordinate regression forests for camera relocalization in rgb-dimages. In Proceedings of the IEEE Conference on Computer Vision andPattern Recognition (CVPR), pages 2930–2937, 2013.[94] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake,M. Cook, and R. Moore. Real-time human pose recognition in parts fromsingle depth images. Communications of the ACM, 56(1):116–124, 2013.[95] K. Simonyan and A. Zisserman. Very deep convolutional networks forlarge-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.[96] J. Sivic, A. Zisserman, et al. Video google: A text retrieval approach toobject matching in videos. In Computer Vision (ICCV), 2003 IEEEInternational Conference on, volume 2, pages 1470–1477, 2003.[97] H. Strasdat, A. J. Davison, J. M. Montiel, and K. Konolige. Double windowoptimisation for constant time visual slam. In Computer Vision (ICCV),2011 IEEE International Conference on, pages 2352–2359. IEEE, 2011.[98] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. Abenchmark for the evaluation of rgb-d slam systems. In Intelligent Robotsand Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE,2012.[99] N. Su¨nderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford. On theperformance of convnet features for place recognition. In IntelligentRobots and Systems (IROS), 2015 IEEE/RSJ International Conference on,pages 4297–4304. IEEE, 2015.104[100] N. Sunderhauf, S. Shirazi, A. Jacobson, F. Dayoub, E. Pepperell,B. Upcroft, and M. Milford. Place recognition with convnet landmarks:Viewpoint-robust, condition-robust, training-free. Proceedings of Robotics:Science and Systems XII, 2015.[101] S. Thrun, W. Burgard, and D. Fox. Probabilistic robotics. MIT press, 2005.[102] J. Valentin, M. Nießner, J. Shotton, A. Fitzgibbon, S. Izadi, and P. H. Torr.Exploiting uncertainty in regression forests for accurate camerarelocalization. In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR), pages 4400–4408, 2015.[103] J. Valentin, A. Dai, M. Nießner, P. Kohli, P. Torr, S. Izadi, and C. Keskin.Learning to navigate the energy landscape. In International Conf. on 3DVision (3DV), 2016.[104] V. Vapnik and A. Vashist. A new learning paradigm: Learning usingprivileged information. Neural networks, 22(5):544–557, 2009.[105] A. Vedaldi and B. Fulkerson. Vlfeat: An open and portable library ofcomputer vision algorithms. In ACM international conf. on Multimedia,2010.[106] R. Vincent, B. Limketkai, and M. Eriksen. Comparison of indoor robotlocalization techniques in the absence of GPS. In SPIE Defense, Security,and Sensing, 2010.[107] R. G. von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall. Lsd: A fastline segment detector with a false detection control. IEEE Transactions onPattern Analysis and Machine Intelligence (PAMI), 32(4):722–732, 2010.[108] F. Walch, C. Hazirbas, L. Leal-Taixe´, T. Sattler, S. Hilsenbeck, andD. Cremers. Image-based localization with spatial LSTMs. arXiv preprintarXiv:1611.07890, 2016.[109] C. Wang, L. Meng, S. She, I. M. Mitchell, T. Li, F. Tung, W. Wan, M. Q. H.Meng, and C. de Silva. Autonomous mobile robot navigation in unevenand unstructured indoor environments. In Intelligent Robots and Systems(IROS), 2017 IEEE/RSJ International Conference on, 2017.[110] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, andJ. McDonald. Real-time large-scale dense rgb-d slam with volumetricfusion. The International Journal of Robotics Research (IJRR), 34(4-5):598–626, 2015.105[111] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J.Davison. Elasticfusion: Dense slam without a pose graph. In Robotics:science and systems (RSS), volume 11, 2015.[112] B. Williams, G. Klein, and I. Reid. Automatic relocalization and loopclosing for real-time monocular slam. IEEE Transactions on PatternAnalysis and Machine Intelligence (PAMI), 33(9):1699–1712, 2011.[113] J. won Lee and C. Giraud-Carrier. Transfer learning in decision trees. InNeural Networks, 2007. IJCNN 2007. International Joint Conference on,pages 726–731. IEEE, 2007.[114] C. Wu. SiftGPU: A GPU implementation of scale invariant featuretransform (SIFT). 2007.[115] C. Wu et al. VisualSFM: A visual structure from motion system. 2011.[116] J. Wu, L. Ma, and X. Hu. Predicting world coordinates of pixels in rgbimages using convolutional neural network for camera relocalization. InIntelligent Control and Information Processing (ICICIP), 2016 SeventhInternational Conference on, pages 161–166. IEEE, 2016.[117] P. Xu and F. Jelinek. Random forests in language modeling. In EMNLP,pages 325–332, 2004.[118] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable arefeatures in deep neural networks? In Advances in neural informationprocessing systems (NIPS), pages 3320–3328, 2014.106

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0355764/manifest

Comment

Related Items