
Towards Real-Time Tissue Surface Tracking with a Surface-Based Extended Kalman Filter for Robotic-Assisted Minimally Invasive Surgery

by

Weiqi Wang

B.A.Sc., University of British Columbia, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

June 2014

© Weiqi Wang, 2014

Abstract

The use of registered intra-operative to pre-operative imaging has been proposed for many medical interventions, with the goal of providing more informed guidance to the physician. The registration may be difficult to carry out in real time. Therefore, it is often necessary to track the motion of the anatomy of interest in order to maintain a registration.

In this work, a surface-based Extended Kalman Filter (EKF) framework is proposed to track a tissue surface based on temporal correspondences of 3D features extracted from the tissue surface. Specifically, an initial 3D surface feature map is generated from stereo-matched Scale Invariant Feature Transform (SIFT) feature pairs extracted from the targeted surface. For each consecutive frame, the proposed EKF framework is used to provide 2D temporal matching guidance in both stereo channels for each feature in the surface map. The 2D feature matching is carried out based on the Binary Robust Independent Elementary Feature (BRIEF) descriptor. If the temporal match is successful in both stereo channels, the stereo feature pair can be used to reconstruct the feature location in 3D. The newly measured 3D locations drive the EKF update to simultaneously estimate the current camera motion states and the feature locations of the 3D surface map. The framework is validated on an ex vivo porcine tissue surface and on an in vivo prostate surface during a da Vinci radical prostatectomy. The peak and mean fiducial errors are 2.5 mm and 1.6 mm respectively.

Compared to other methods, the surface-based EKF framework can provide reliable 2D feature matching guidance for each feature in the 3D surface map. This maintains a chance to relocate a feature that was lost for a significant period of time. Such a surface-based framework provides persistent feature tracking over time, which is crucial to drift-free surface tracking. With implementation on a Graphics Processing Unit (GPU), real-time performance is achieved.

Preface

In Chapter 3, the implementation of the EKF was modified from the software developed by Davison et al. [14]. The author of this thesis altered the software in order to integrate the feature tracking scheme into the EKF framework. The original motion model and Jacobian computations [14] were also modified. In Chapter 4, the GPU-based 2D Gaussian blurring kernel used in the SIFT implementation was modified from [40]. The rest of the surface tracking scheme was entirely implemented by the author of this thesis. The validation experiments in Chapter 6 were designed by the author of this thesis. The in vivo prostate experiment was conducted together with Omid Mohareri. The author of this thesis processed the experiment data. In Chapter 5, the 2D calibration software used was developed in [6]. The patient study used in this work was approved by the research ethics board under the UBC CREB number H11-02267. The name of the study is Intra-operative TRUS guidance for RALRP.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Background and Motivation
  1.2 Thesis Objectives
  1.3 Thesis Outline
2 Related Work
  2.1 Surface Tracking Schemes
  2.2 Feature Tracking Schemes
    2.2.1 Continuous Tracking Schemes
    2.2.2 Wide Baseline Feature Matching Schemes
    2.2.3 Motion Compensated Feature Patch Matching
3 Surface Tracking Method
  3.1 EKF Formulation
    3.1.1 EKF State Vector and Covariance Matrix
    3.1.2 Motion Model
    3.1.3 EKF Prediction
    3.1.4 EKF Update
    3.1.5 Adding Features into the EKF 3D Surface Map
  3.2 EKF Initialization
  3.3 Feature Measurement
    3.3.1 SIFT Feature Detection
  3.4 Outlier Rejection
  3.5 Feature Management
    3.5.1 Feature Management Scheme by Davison et al. [14]
    3.5.2 Feature Management with Nonpersistent Feature Tracking
    3.5.3 Proposed Feature Management Scheme for an Explored Scene
4 SIFT Feature Detector and Its Implementation on GPU
  4.1 Building Scale Space
    4.1.1 Methods
    4.1.2 GPU Implementation
  4.2 Feature Detection
    4.2.1 Finding Local Extrema
    4.2.2 Sub-sample Interpolation
    4.2.3 Edge Effect Elimination
  4.3 Feature Results
5 Stereo Calibration
  5.1 Camera Model
  5.2 Related Work
  5.3 2D Calibration Procedure
    5.3.1 Calibration Checkerboard
    5.3.2 Video Streaming Setup
    5.3.3 Calibration Image Acquisition
    5.3.4 Calibration Image Processing
    5.3.5 Calibration Quality Measurement
  5.4 Focus Change Recovery
  5.5 Calibration Depth
6 Results and Discussion
  6.1 Simulation Results
    6.1.1 Synthetic Data Generation
    6.1.2 Simulation Results with Different Motion Parameters
    6.1.3 Simulating the Drift Effect Due to Feature Turnover
    6.1.4 Simulation Results with a Dynamic Motion Model
  6.2 Ex Vivo Experiment Setup
  6.3 In Vivo Experiment Setup
  6.4 Results and Discussion
  6.5 Timing Results
7 Conclusion
  7.1 Contribution
  7.2 Comparisons to the State of the Art
  7.3 Future Work
    7.3.1 Affine Invariance Feature Matching Scheme
    7.3.2 Feature Management in 3D Object Tracking
    7.3.3 Surface Tracking Failure Recovery
Bibliography
A EKF Jacobians
  A.1 EKF Prediction
  A.2 EKF Update
  A.3 Adding Features
B Other Related Hardware
  B.1 The Quadro Digital Video Pipeline
  B.2 3D TV
C Stereo Triangulation and Rectification
  C.1 Stereo Triangulation
  C.2 Rectification
D Shi and Tomasi Interest Point Detector

List of Tables

Table 5.1  Intrinsic parameters to be determined for a single camera model.
Table 5.2  Intrinsic calibration results for both cameras in the stereo endoscope.
Table 5.3  Extrinsic results describing the position of the right camera with respect to the left camera.
Table 6.1  Simulation results with different δa and δα. ∆ = 0.
Table 6.2  Simulation results with different ∆'s. δa = 100 and δα = 100.
Table 6.3  Simulation results with the constant velocity model and the dynamic motion model. In the simulation, ∆ = 0. The simulations are carried out for two motion parameter configurations, δa = 25, δα = 0.5 and δa = 100, δα = 100.

List of Figures

Figure 3.1  The overall work flow of the EKF framework.
Figure 3.2  (a) The moving camera tracking configuration. In this configuration, the motion of the camera is restricted by the insertion point on the patient body. Given a tracked surface, only the features on the edge of the surface (shown in red) can experience significant affine transformation when the camera view moves to the other side of the surface. (b) The moving tissue tracking configuration. In this configuration, all the features on the tissue object surface can have significant affine transformation due to object motion. In both figures, different instances of object and camera are distinguished by solid and dashed lines. Features that can have significant affine transformation are masked in red.
Figure 3.3  A schematic illustration of the EKF states and motion model. In the schematic, z_i is the newly acquired feature measurement for the ith feature in the 3D surface map. h_i is the transformed location of the ith feature. The transformation is based on the EKF predicted motion states and y_i, the 3D location of the ith feature in the object coordinates. The ỹ_i's are the innovations which drive the EKF update.
Figure 3.4  The ROI selected by the user describing the prostate surface to track.
Figure 3.5  A schematic illustration of the stereo geometry in a rectified coordinate system. (a) The XZ plane view of the coordinate system. (b) The YZ plane view of the coordinate system. X_2d^l and X_2d^r are the rectified image coordinates on the X axis. C_X^l and C_X^r are the X components of the rectified principal axes. A 3D point P is projected into the left and right camera views at p_l and p_r respectively. O_l and O_r are the camera locations. f is the rectified focal length. X_3d^l, Y_3d^l, and Z_3d^l are the 3D coordinates of P in the rectified coordinate system of the left camera. In (b), the Y component of the principal axis, C_Y, and the camera location O are the same for both cameras, due to rectification.
Figure 3.6  Feature information for the first frame. (a-b) Stereo matched SIFT feature pairs detected from the stereo frames; (c) 3D point locations reconstructed based on stereo matched 2D feature pairs; (d) the variance estimated in mm². In (d), the red, blue, and red curves correspond to the variances on the x, y, and z axes of the camera coordinates respectively.
Figure 3.7  The bounding box for scale space construction is shown. The predicted location for each 3D surface feature and its search region is shown as well.
Figure 3.8  The mask for each scale space octave. Such masks allow us to terminate feature extraction at an early stage.
Figure 3.9  The saturation map of the image shown in Figure 3.7.
Figure 3.10  The saturation mask generated based on the saturation map shown in Figure 3.9. The threshold selected is 0.2.
Figure 3.11  A figure showing the features extracted from a partial nephrectomy video frame. In this case, the objective is to track the kidney surface. As a result, the features extracted from the background (shown in red) should not be added into the surface map. The green features are relevant and should be added.
Figure 4.1  The overall workflow of building an octave. The pre-blurred version of the original image or a scale level from the previous octave can be fed into the first level of the octave. In this case, we extract features from 3 scale levels per octave. Therefore, 6 scale levels and 5 DoG levels are constructed.
Figure 4.2  The scale levels for all 6 octaves. The top 2 rows show the scale levels for the first octave. The images for the second octave are shown in the third row, with the rest of the octave images shown in the last row.
Figure 4.3  The DoG functions for all 6 octaves. The top 2 rows show the images for the first octave. The images for the second octave are shown in the third row, with the rest of the octave images shown in the last row.
Figure 4.4  The features generated by our implementation (blue crosses) are compared to the features produced by the original SIFT implementation by Lowe [26] (green dots).
Figure 4.5  The vlSIFT features (blue crosses) are compared to the original SIFT features (green dots). They are significantly different.
Figure 5.1  A schematic of the pinhole camera model.
Figure 5.2  An instance of the robot arm grabbing the calibration checkerboard. The checkerboard shall be properly placed so that the view of the corners is not affected by the specular reflection shown at the bottom of the figure.
Figure 5.3  A typical set of calibration images.
Figure 5.4  The checkerboard can be carried with a third hand and moved to different poses with respect to the endoscope.
Figure 5.5  Reprojected pixel error distribution for a typical calibration set.
Figure 5.6  Reprojected pixel error distribution for a case where the saddle point extraction quality is very poor for one of the calibration images. In the figure, the green points that are far away from the rest of the points are extracted from the same calibration image. These points should be excluded from the current calibration iteration and refined later.
Figure 5.7  Rectification error for a good calibration set. The rectification errors are distributed evenly around 0. The feature matches with big rectification errors can be considered as outliers. A threshold should be determined based on the specific data.
Figure 5.8  For calibration images taken from our capture device, the rectification errors seem to have offsets along with a greater scattering range. The rectification errors do not seem to be reproducible from case to case.
Figure 5.9  The mean and standard deviation of reconstructed distances between adjacent saddle points. (a) mean (b) standard deviation. The green curve shows the reconstructed close point clouds using calibration results at far depth. The red curve shows the reconstructed far point clouds using calibration results at far depth. The black curve shows the reconstructed far point clouds using calibration results at close depth. The blue curve shows the reconstructed close point clouds using calibration results at close depth.
Figure 5.10  The calibration results for left eye focal length.
Figure 5.11  The calibration results for right eye focal length.
Figure 5.12  The calibration results for left eye optical center.
Figure 5.13  The calibration results for right eye optical center.
Figure 5.14  The calibration results for left eye distortion parameters.
Figure 5.15  The calibration results for right eye distortion parameters.
Figure 5.16  The results for the translation vector between stereo cameras.
Figure 5.17  The calibration results for axis angles between stereo cameras.
Figure 5.18  Left eye focal lengths at different depths.
Figure 5.19  Right eye focal lengths at different depths.
Figure 5.20  Left eye optical centers at different depths.
Figure 5.21  Right eye optical centers at different depths.
Figure 5.22  Left eye distortion parameters at different depths.
Figure 5.23  Right eye distortion parameters at different depths.
Figure 5.24  Translation vector between stereo cameras at different depths.
Figure 5.25  Axis angles between stereo cameras at different depths.
Figure 6.1  The simulation results for linear components of the motion states. In the figure, the red curve shows the ground truth. The green curve shows a case where δa = 0.5 mm/s², δα = 0.05 rad/s². The blue curve shows a case where δa = 100 mm/s², δα = 100 rad/s².
Figure 6.2  The simulation results for angular components of the motion states. In the figure, the red curve shows the ground truth. The green curve shows a case where δa = 0.5 mm/s², δα = 0.05 rad/s². The blue curve shows a case where δa = 100 mm/s², δα = 100 rad/s².
Figure 6.3  The simulation results for linear velocity components of the motion states. In the figure, the red curve shows the ground truth. The green curve shows a case where δa = 0.5 mm/s², δα = 0.05 rad/s². The blue curve shows a case where δa = 100 mm/s², δα = 100 rad/s².
Figure 6.4  The simulation results for angular velocity components of the motion states. In the figure, the red curve shows the ground truth. The green curve shows a case where δa = 0.5 mm/s², δα = 0.05 rad/s². The blue curve shows a case where δa = 100 mm/s², δα = 100 rad/s².
Figure 6.5  The simulation results show the effect of drifting due to feature turnover in the surface map. In the figure, the green plots show the tracking errors with feature turnover. The blue plots show the tracking errors without feature turnover. Figure (a) shows translation errors in mm while Figure (b) shows the angle errors in radians. The turnover rate is 5 features every 40 frames.
Figure 6.6  The velocity simulation results for the dynamic motion model. (a) The linear velocity results. (b) The angular velocity results. In the figure, the red curve indicates the ground truth. The green curve is the dynamic motion model results. The blue curve is the constant motion model results. The simulation is carried out using the following parameter configuration: δa = 25 mm/s², δα = 0.5 rad/s², ∆ = 0.
Figure 6.7  (a) The validation fiducial and tracking error at each frame for the ex vivo experiment. (b) The clipper corner used for validation and tracking errors at specific time instances for the in vivo human prostate data. The missing data point of the red curve in plot (d) is caused by instrument occlusion. At that time instance, the view of the fiducial is blocked by the instrument, so that it is not detectable.
Figure 6.8  Experiment results. The ex vivo results are shown in the left column and the in vivo results are shown in the right column. Graphs (a-b) show the number of features matched in every frame. (c-d) show the camera position estimation in mm. (e-f) show the camera axis angle estimation in degrees. As shown in the figures, fewer features are tracked when the surface travels to the edge of the camera view. In that case, part of the surface region is invisible while the illumination condition in the region also varies considerably.
Figure 6.9  The 3D feature tracking map. In the figure, if the corresponding surface feature is matched, the 3D feature locations are marked in blue. In case the feature matching fails, the 3D locations are marked in red. The camera location is at (0,0,0)ᵀ, and the optical axis is shown as the green line.
Figure 6.10  A figure showing the average residual error among the matched surface features at each time instance.
Figure 6.11  A figure showing the feature matching performance when the surface travels to the edge of the camera view. The top figure shows the reference surface with initial features. The blue features are not matched while the green ones are matched. The bottom figure shows the feature tracking performance when the surface travels to the edge of the camera view.
Figure 6.12  A figure showing the feature tracking performance for all tracked features. In the figure, every mark indicates that the feature indexed by the y value is matched at the time instance given by the x value. As shown in the figure, feature tracking can fail for up to 10 seconds. Yet, the framework is able to relocate features that have been lost for a while.
Figure 6.13  A figure showing how feature matching reacts to instrument occlusion. As shown in the figure, the system is sufficiently robust to reject potential mismatches due to instrument occlusion.
Figure 6.14  A histogram of the matching rates for all tracked features.
Figure 6.15  A breakdown of the execution time consumed by each component. As shown in the figure, most of the execution time is spent on building octaves and the EKF update. Currently, the EKF prediction and update are carried out on the CPU. The salient feature extraction and matching are carried out on the GPU. The execution time is measured in ms. In the figure, the timing for the ex vivo and in vivo data sets is shown in red and blue respectively. The ROI for the in vivo data is greater, resulting in more features in the surface map. That leads to a longer execution time.
Figure C.1  A schematic of the triangulation geometry.

Glossary

EKF    Extended Kalman Filter
SIFT   Scale Invariant Feature Transform
BRIEF  Binary Robust Independent Elementary Feature
GPU    Graphics Processing Unit
AR     augmented reality
MIS    minimally invasive surgery
CT     Computed Tomography
MRI    Magnetic Resonance Imaging
SLAM   Simultaneous Localization and Mapping
DOG    Difference of Gaussian
HSV    Hue Saturation Value
RMS    Root Mean Square
SSD    Sum of Squared Differences

Acknowledgments

Firstly, I would like to thank my supervisor, Tim Salcudean, for his guidance, patience, and support. I would also like to thank Dr. David Lowe for his insight into the problem, without which this work would not have been possible.

I am very grateful to the colleagues at the Robotics and Control lab. Special thanks to Omid Mohareri, Julio Lobo, and Caitlin Schneider for spending their time helping me with my experiments at VGH.
Also, I would like to thank Omid Mohareri, Hussam Al-deen Ashab, Julio Lobo, Jeff Abeysekera, Siavash Khallaghi, Saman Nouranian, Ramin S. Sahebjavaher, and Philip Edgcumbe for their moral support in the lab.

Most importantly, I would like to thank my parents, who have always been on my side and supported me through the tough days.

Chapter 1
Introduction

1.1 Background and Motivation

Compared to traditional open procedures, minimally invasive surgery (MIS) requires smaller incisions on the patient body. That leads to less blood loss, quicker patient recovery, and a shorter hospital stay. The da Vinci surgical robot has been introduced to facilitate minimally invasive surgery. In order to provide surgeons with a superior 3D view of the surgical scene, a stereo endoscope is inserted into the patient body. Such a camera configuration enables 3D reconstruction of the tissue surface, which offers an excellent platform to enable augmented reality (AR) during minimally invasive surgery.

During minimally invasive surgery, augmented reality involves overlaying medical images onto the surgeon's view of the tissue [12] [47] [22]. These medical images can be Computed Tomography (CT), Magnetic Resonance Imaging (MRI), or ultrasound images that are acquired preoperatively or intraoperatively. The image data provides detailed information on the anatomical structures around the surgical region. In the case of radical prostatectomy, augmented reality allows the surgeon to locate cancer, anatomical boundaries, and critical nerves before dissecting the prostate from its surrounding anatomy. Such anatomy localization helps preserve the neurovascular bundle and sphincter, which ultimately improves post-surgery sexual function and urinary control. With improved awareness of the surrounding anatomy, potential intraoperative injury such as rectal injury is also reduced [47]. In the case of partial nephrectomy, AR has two potential applications [22]. Firstly, AR can be used to locate major vessels and renal vasculature near the tumor. This helps the surgeon reduce blood loss during resection. Secondly, surgeons can use AR to identify the tumor boundaries, which helps complete tumor removal and maximizes nephron preservation.

To enable augmented reality in MIS, a medical image volume of the target anatomy has to be initially registered onto the tissue surface. Su et al. [45] render the CT volume of the kidney on the screen and visually find the best fit between the CT volume and the tissue surface. Pratt et al. [41] locate a 3D point in both the da Vinci camera coordinates and the 3D CT volume coordinates. The translational registration between the two coordinate systems is done simply by aligning the two points. The next stage of the registration requires human interaction to align the coordinates in three rotation angles. To facilitate the process, an intuitive rolling ball interface is provided to the user. Adebar et al. [1] propose an ultrasound-to-camera registration technique using a surgical instrument which can be used inside the patient body. With the instrument placed on the tissue surface, fiducials on one side of the instrument can be identified in the ultrasound image, while other fiducials on the other side of the instrument can be detected by the stereo camera. Knowing the geometry of the instrument, a registration can be established between the ultrasound coordinates and the camera coordinates. Based on the registration, the volume data in the ultrasound coordinates can be mapped into the camera coordinates. Based on the initial registration, the medical image volume can be rendered and overlaid onto the tissue surface [41].

The focus of this work lies in maintaining the initial registration. The goal is to keep the medical image volume tagged onto the tissue surface until resection starts. The medical image to tissue surface registration can be maintained by temporally tracking the tissue surface motion. Such surface tracking can be carried out based on temporal correspondence of the salient features extracted from the surface [38], [52], [19].

Temporal feature correspondences can be established by matching the features in consecutive frames based on the 2D patches centered at the feature locations. The feature tracking conditions in MIS have been discussed in detail by Giannarou et al. [17]. We list the challenges as follows:

• As the endoscope camera moves with respect to the tissue surface, the visual appearance of the patch varies due to changing view angles. As a result, the patch tagged to each feature is subject to translation, rotation, affine, and scaling transformations.

• The only illumination source in the environment is attached to the endoscope. Under this setup, relative motion between the camera and the tissue surface can vary the illumination conditions on the feature patches.

• Tracked features can be invisible in the camera view at times. Specifically, specular reflectance and surgical instruments can occlude the tracked features. In some cases, the tracked features can move out of the camera view or to the other side of a 3D object. Since the feature is not visible, it is not possible to track the feature.

• Free-form deformation can also change the appearance of the feature patch.

Due to the challenging tracking conditions in MIS, it is difficult to always track surface features persistently. Non-persistent feature tracking is likely to cause surface tracking to drift [17]. It is therefore necessary to attempt failure recovery.

1.2 Thesis Objectives

The objective of this work is to develop a robust surface tracking scheme which can be used for augmented reality during da Vinci radical prostatectomy and partial nephrectomy. The framework can potentially be used in other da Vinci based surgeries as well. To enable such a framework, the following requirements have to be met:

• To enable augmented reality, the registration between the tissue surface and the medical image volume has to be updated in real time. As a result, a real-time surface tracking scheme is expected.

• In order to maintain the surface registration for long periods of time, the surface tracking scheme is expected to be drift-free.

• During MIS, challenging visual tracking conditions can significantly affect surface tracking quality. Therefore, the surface tracking scheme is expected to be reasonably robust to these negative effects. The surface tracking scheme is also expected to be properly validated.

1.3 Thesis Outline

Chapter 2 provides a review of existing work on tissue surface tracking during MIS. Since the overall performance of the surface tracking scheme, in timing and accuracy, highly depends on the feature tracking scheme used, a review of related feature tracking schemes is presented as well.

In Chapter 3, each component of the surface tracking scheme is discussed in detail. Specifically, a surface-based Extended Kalman Filter (EKF) framework is used to estimate the surface motion based on temporal correspondence of a 3D feature cloud extracted from the surface.
In addition, we also discuss how the surface-based EKF framework can provide consistent guidance for temporal feature matching, so as to boost overall tracking accuracy.

The validation of the surface tracking scheme is presented in Chapter 6. The scheme is validated on both ex vivo porcine data and in vivo human prostate data taken during a da Vinci radical prostatectomy.

To enable real-time performance, a large portion of the surface tracking framework is implemented on a modern GPU. In Chapter 4, the SIFT feature detector and its implementation are discussed in detail.

To enable 3D surface tracking, it is necessary to calibrate the stereo endoscope. In Chapter 5, a brief review of the existing calibration techniques is presented along with a description of the calibration framework used in the project.

Chapter 2
Related Work

This chapter reviews the existing work on tissue surface tracking during MIS. Since the overall performance of the surface tracking scheme, in timing and accuracy, highly depends on the feature tracking scheme used, it is necessary to review the work on feature tracking as well.

2.1 Surface Tracking Schemes

Surgical Navigation Work by Mountney et al. [38]

Following the Simultaneous Localization and Mapping (SLAM) work by Davison et al. [14], Mountney et al. [38] performed surgical navigation using a stereo endoscope. Based on a rigid surface model, an Extended Kalman Filter (EKF) was used to simultaneously update the motion states of the camera and the feature locations of a 3D surface feature map. The update was driven by repetitive measurements of the 3D surface map features in the consecutive frames. Meanwhile, the EKF framework also kept track of the camera motion state uncertainties, individual feature uncertainties, cross-covariances between the motion state and features, and cross-covariances between individual features.

The 3D feature map was the initial 3D feature cloud extracted from the first video frame. Specifically, in the left image, the Shi and Tomasi [44] operator was used to detect features in the scale space constructed by the Difference of Gaussian (DOG) function [26]. Based on stereo geometry, the stereo feature correspondence in the right image was found by searching the epipolar line. The feature matching function used was the normalized sum-of-squared difference. The feature locations detected in the left and right images were used to triangulate the 3D feature locations.
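For rectified stereo images, the triangulation step referred to above (and used again for the surface map in Chapter 3 and Appendix C) reduces to a disparity-to-depth relation. The following is a minimal sketch of that relation only, not the implementation used in [38] or in this thesis; the focal length, baseline, and principal-point values passed in the usage example are hypothetical placeholders for calibration results.

```python
import numpy as np

def triangulate_rectified(xl, xr, y, f, baseline, cx_l, cx_r, cy):
    """Triangulate a 3D point from a rectified stereo match.

    xl, xr   : column coordinates of the match in the left/right image
    y        : shared row coordinate (images are rectified)
    f        : rectified focal length in pixels
    baseline : distance between the two camera centres (mm)
    cx_l, cx_r, cy : principal point components after rectification
    Returns the 3D point in the left-camera coordinate frame (mm).
    """
    disparity = (xl - cx_l) - (xr - cx_r)   # pixel disparity
    Z = f * baseline / disparity            # depth from disparity
    X = (xl - cx_l) * Z / f
    Y = (y - cy) * Z / f
    return np.array([X, Y, Z])

# Hypothetical values: a match at columns 612/585 on row 300, f = 1000 px,
# a 5 mm baseline, and principal points at (640, 360).
p = triangulate_rectified(612, 585, 300, f=1000.0, baseline=5.0,
                          cx_l=640.0, cx_r=640.0, cy=360.0)
```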
To temporally track the features in the 3D feature map, an active search feature tracking scheme was used to temporally match 2D features in the consecutive video sequences. Such 2D temporal feature tracking was carried out in both left and right video frames. If a match could be found in both stereo channels, the stereo matched feature pairs were used to triangulate the 3D feature location. In each stereo eye, feature matching was carried out by an active search scheme which found the match by searching a small region around a predicted 2D location. Specifically, based on the camera motion state predicted by the EKF, the 3D location and uncertainty stored in the 3D feature map can be transformed into the current time instance. Based on camera geometry, the 3D information was projected into both left and right images, so as to define a predicted search region around a predicted feature location. The search region in the left image was found first, and was later used to constrain the search region determination in the right image through stereo geometry. For each pixel in the predicted search region, matching was done based on a 25×25 image patch centered at the pixel. Specifically, the normalized sum-of-squared difference was used to measure patch similarity with respect to the reference patch tagged to each feature in the 3D feature map.

The 3D temporal feature correspondences were used to update the camera motion state and 3D feature map, along with the associated covariance matrix. Specifically, for each successful temporal match, the distance between the projected 3D location and the newly measured 3D location drove the EKF update. The projection was based on the predicted motion state and the feature locations stored in the 3D feature map.

The major drawback of the work by Mountney et al. [38] is that the framework lacks a reliable outlier rejection mechanism, which reduces the duration of continuous tracking. In [10], Civera et al. provide a detailed discussion of the outlier rejection problem in EKF based SLAM systems. In [17], Giannarou et al. also discuss the necessity of having an outlier rejection mechanism.
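The active search described above boils down to scanning a small window around the predicted 2D location and scoring each candidate 25×25 patch against the reference template. Below is a minimal sketch of such a search using a zero-mean, unit-norm form of the normalized sum-of-squared difference. It is not the implementation of [38]; the search radius is a placeholder that would in practice be derived from the projected innovation covariance, and an acceptance threshold on the returned score would still be needed.

```python
import numpy as np

def normalize(patch):
    """Zero-mean, unit-norm version of a patch (flattened)."""
    p = patch.astype(np.float64).ravel()
    p -= p.mean()
    n = np.linalg.norm(p)
    return p / n if n > 0 else p

def active_search(image, template, center, radius, half=12):
    """Scan a (2*radius+1)^2 window around the predicted location `center`
    and return the pixel whose (2*half+1)^2 patch best matches `template`
    under normalized SSD. half=12 gives the 25x25 patches used in [38]."""
    t = normalize(template)
    best = (None, None, np.inf)
    r0, c0 = center
    for r in range(r0 - radius, r0 + radius + 1):
        for c in range(c0 - radius, c0 + radius + 1):
            patch = image[r - half:r + half + 1, c - half:c + half + 1]
            if patch.shape != template.shape:
                continue  # candidate window falls outside the image
            score = np.sum((normalize(patch) - t) ** 2)
            if score < best[2]:
                best = (r, c, score)
    return best
```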
Tissue Surface Tracking Work by Yip et al. [52]

Another full surface tracking work was carried out by Yip et al. [52]. In this framework, 2D salient features were tracked in time by repeatedly detecting the same features in consecutive frames. This is different from the active search feature tracking scheme used by Mountney et al. [38], which exhaustively searches an image region for matching pixels.

As in the work by Mountney et al. [38], Yip et al. [52] initialized a 3D object feature list based on salient features extracted from the stereo images. Specifically, the STAR feature detector [2] was used to extract features from both left and right images. Stereo feature matching was carried out based on feature descriptors built from the patches centered at the feature locations. In this case, the Binary Robust Independent Elementary Feature (BRIEF) descriptor [8] was used. Based on the 2D feature locations of the stereo matched feature pairs, the initial 3D object feature locations were reconstructed by stereo triangulation. For each feature in the object list, the corresponding 2D feature location in the left image and its BRIEF descriptor vector were saved for temporal matching purposes. The 3D location was also saved for 3D registration purposes.

In the work by Yip et al. [52], 3D feature tracking in time was done by 2D temporal feature tracking in the left video channel only. For each incoming new left video frame, the STAR features were extracted from the image and the BRIEF descriptor was constructed for each extracted feature. Based on the BRIEF descriptor, the newly extracted features were matched to the object feature list initialized as discussed above. The temporal feature match was constrained by the last known 2D feature location stored in the object feature list. Specifically, the distances between the newly measured 2D locations and the last known 2D feature locations were restricted within a threshold. A feature match was discarded if the feature traveled a suspiciously large distance with respect to the last known 2D location in a short period of time. Temporally matched features in the left frame were then matched to the features extracted from the right frame, so as to be reconstructed in 3D. If a feature in the object feature list had a temporal match in the left frame and could be reconstructed in 3D, a 3D feature match was found. In this case, the last known 2D feature location and the BRIEF descriptor were updated. Based on the temporal correspondence of 3D features, Singular Value Decomposition (SVD) was used to find the rigid surface motion.
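Recovering a rigid motion from temporally corresponded 3D features, as in the last step above, is a standard SVD-based least-squares fit. A minimal Kabsch-style sketch of that step follows, written against hypothetical (3, N) arrays of corresponded points; it is an illustration of the general technique, not the implementation of [52].

```python
import numpy as np

def rigid_transform_3d(P, Q):
    """Least-squares rigid transform (R, t) such that Q ~ R @ P + t.

    P, Q : (3, N) arrays of corresponding 3D feature locations in the
           reference frame and the current frame.
    """
    p_mean = P.mean(axis=1, keepdims=True)
    q_mean = Q.mean(axis=1, keepdims=True)
    H = (P - p_mean) @ (Q - q_mean).T              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                             # guard against reflections
    t = q_mean - R @ p_mean
    return R, t
```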
The major drawback of the work by Yip et al. [52] is that the feature tracking scheme used is not robust to momentary loss of tracking, which reduces the temporal persistency of the feature tracking scheme. Momentary loss of tracking can be caused by failure of feature descriptor matching, as a result of image noise, sudden image appearance change, or occlusion. Due to loss of tracking for a few frames, the feature can travel a large distance with respect to its last known 2D location. As a result, feature matching is more likely to fail due to not fulfilling the temporal 2D motion constraint. Loosening such a temporal 2D motion constraint can make the system more vulnerable to outliers and increase the computation burden. In case the lost feature never gets close to its last known 2D location, it is permanently lost. In the presence of a feature list management scheme, discussed later, the lost feature may be permanently deleted. Davison et al. [14] and Mountney et al. [38] solve this problem by predicting search regions based on the camera motion states and the feature locations stored in the 3D feature map. In that case, as long as a sufficient number of other features is matched at each frame to keep track of the camera motion states, a lost feature can be found even after a long period of time, given that the lost feature has not yet been deleted by the feature management scheme.

In [52], Yip et al. adopt the feature list management scheme initially proposed by Davison et al. [14], in order to maintain the number of features stored in the object feature list. In this case, momentary loss of tracking can reduce the percentage of frames in which a feature match is found. That can lead to permanent removal of a feature from the object feature list. In [53], Yip acknowledges the gradual loss of features from the object list. To ensure a sufficient number of features to maintain registration, new 3D features are extracted from the consecutive stereo video frames and added to the object list. To add a new 3D feature, the camera motion state at frame n, T_n, is used to project the 3D location of the new feature back to frame 0, where the object feature list is initialized. Specifically, X_0 = T_n^{-1} · X_n, where X_0 and X_n are the 3D feature locations in frame 0 and frame n respectively. Such projection allows the error in the motion states to leak into X_0. As tracking continues, existing features are gradually deleted from the object list and new features are added into the list. Eventually, all the 3D features saved in the object list are corrupted with motion state error to some extent. In that case, the entire tracking gradually drifts away from the ground truth.

2.2 Feature Tracking Schemes

This section discusses other 2D feature tracking schemes developed for MIS. These schemes can be classified into two categories: 1) continuous tracking - the schemes in this category rely on the assumption that the patch appearance does not change dramatically in a short period of time; 2) wide baseline feature matching - the schemes in this category can handle significant change in patch appearance, which can be used to find lost features.

2.2.1 Continuous Tracking Schemes

Similar to the feature tracking scheme used by Yip et al. [52], continuous feature tracking schemes are not robust to momentary tracking failure. In case the tracking fails for a few frames, the scheme cannot provide reliable 2D spatial guidance for feature matching. As a result, the tracked feature is permanently discarded. As tracking continues, the initial features in the object list are likely to be gradually lost. New features can be added into the list in order to continue surface tracking. Yet, new features have to be projected into frame 0 using the motion state. As a result, the projected 3D locations are corrupted with motion state errors. Using these projected locations for surface tracking introduces drift into the system.

Probabilistic Tracking of Affine Regions [17]

Inspired by the affine-invariant feature detector work by Mikolajczyk et al. [33], Giannarou et al. [17] used a single-feature EKF to track affine-invariant features in MIS video sequences. In this work, Giannarou et al. detected the affine-invariant features using the scale adapted Harris detector [33]. In the neighborhood of each detected feature, the second moment matrix was computed, describing the local gradient distribution. An elliptical affine region was defined based on the second moment matrix, with the two eigenvalues of the matrix being the major and minor axes of the ellipse, indicating the two principal axes of pixel intensity change. An EKF was designed to track each feature in consecutive frames. For each feature, the 2D location, ellipse tip, angle between the major and horizontal axes, angular velocity, and ratio between major and minor axes were set as the states of the EKF. To track the feature in a new frame, a new affine-invariant feature was identified around the location predicted by the EKF and a new elliptical affine region was also computed. Feature matching was done by measuring the overlap area between the EKF predicted and newly measured affine regions.

In order to boost the temporal persistency of the continuous tracking scheme, Giannarou et al. [17] incorporated spatial context information to improve the EKF prediction quality and reject measurement outliers. Moreover, attempts were made to recover from momentary tracking failure by using the spatial context information. If the lost feature could not be recovered for a number of consecutive frames, it was permanently discarded. As a result, the method is not capable of recovering a long-term lost feature.
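The elliptical affine regions used above are derived from the second moment (structure tensor) matrix of the patch. A minimal sketch of that computation is shown below for illustration only; it omits the Gaussian weighting and scale adaptation of [33], and, following the convention in the text, treats the two eigenvalues as the major and minor axes of the elliptical region.

```python
import numpy as np

def second_moment_ellipse(patch):
    """Second moment matrix of an image patch and the axes of the
    corresponding elliptical affine region.

    Returns (eigenvalues, eigenvectors): the eigenvalues serve as the
    lengths of the principal axes of intensity change, the eigenvectors
    give their directions.
    """
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)                 # image gradients
    # Accumulate the 2x2 structure tensor over the patch.
    M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    evals, evecs = np.linalg.eigh(M)
    return evals, evecs
```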
Online Learning Scheme [36] [35]

Mountney and Yang [36] proposed an on-line learning framework to classify tissue surface patches into two categories, matched and unmatched. In this work, the 2D initial features were generated using DOG [26] and the Shi and Tomasi operator [44]. For a couple of frames right after the initial frame, the initial reference features were tracked using the Lucas-Kanade (LK) tracker [27]. The tracking results were used to train a decision tree that was used for classification. Specifically, the matched patches were treated as one category, while some unmatched patches nearby were selected as the training set for the other category. In order to propagate through the decision tree, a sequence of patch tests was constructed. Each test was a comparison between a selected pixel pair within the patch centered at the pixel under classification. The learning objective was to find the optimal tree and test sequence that minimize the entropy of the system, so that the test sequence provided the most information. After the tree and test sequence were established, pixels in the search region in the consecutive frames were classified using the decision tree based on the test sequence results. As an extension of this work, Mountney [35] added synthetic patches into the learning framework. As a result, the system became invariant to various rigid motion transformations.

In [17], Giannarou et al. showed that the two methods, [17] and [35], have similar temporal persistency. In the case of a long-lost feature, the appearance of the feature patch can be significantly different from that of any patch used for training. As a result, the tracking can still fail. The framework can serve as a good enhancement to other feature tracking schemes. Specifically, the tracking results from the alternative tracker can provide good tracked patches to train the system. This can potentially improve overall tracking performance. In case of tracking failure for the learning system, the alternative tracker can always be swapped back in.

Image Warping Based Techniques

For cardiac surgery, local patch deformation is extensive, which poses a greater challenge to surface tracking. To temporally track surface features, Richa et al. [42] use thin-plate spline based image warping to recover such deformation. The execution time of such a method depends on the number of iterations required for the image warping error to converge. For large inter-frame feature motion, more iterations are required, leading to a slow system. In our application, the patch deformation during prostatectomy and partial nephrectomy is significantly less compared to the case of cardiac surgery.

2.2.2 Wide Baseline Feature Matching Schemes

Wide baseline feature matching schemes have been designed by the computer vision community to normalize the significant patch appearance change due to camera angle changes. Such schemes are suitable for relocating a feature that has been lost for a while. Specifically, due to tracking failure, there can be a significant appearance difference between the feature patch at the current time instance and its corresponding last known instance. It is essential to normalize the potential transformations before patch matching.

SIFT [26] and ASIFT [34]

Yip et al. [52] compared the Scale Invariant Feature Transform (SIFT) feature tracking scheme [26] with other schemes, including SURF [3], BRIEF [8], and STAR [2]. It was shown that SIFT provides more robust tracking results. The SIFT detector extracts features at high-gradient locations in image scale space. Feature locations are distinctive regions that can be tracked over time. The image scale space is built using the DOG function. The SIFT descriptor is computed based on the distribution of pixel gradient orientations within the patch centered at each feature. The gradient orientation of the entire patch is computed and the gradient orientation of each pixel is rotated with respect to the orientation of the patch. In this sense, the descriptor stays invariant to in-plane rotation. Since the descriptor is also constructed based on scale space, SIFT is also invariant to scale change due to translation in depth. While it is scale and in-plane rotation invariant, an affine transformation can still cause the SIFT descriptor to fail. Morel and Yu [34] have developed the ASIFT algorithm to cope with affine transformations. The ASIFT algorithm generates synthetic images at a finite set of out-of-plane rotation angles. The standard SIFT feature matching scheme can then be applied to all synthetic and original images.
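The DoG scale space underlying SIFT (revisited for the GPU implementation in Chapter 4) can be sketched in a few lines. This is a simplification of Lowe's incremental-blur scheme [26]: each level is blurred directly from the octave's base image, the next octave would be started from a downsampled level, and the parameters are placeholders. With s = 3 it produces the 6 scale levels and 5 DoG levels per octave mentioned in Figure 4.1.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, s=3, sigma0=1.6):
    """One octave of a difference-of-Gaussian scale space.

    s      : number of scales per octave at which extrema are sought
    sigma0 : base blur of the octave
    Returns (gaussians, dogs); feature candidates are the local extrema of
    `dogs` across both space and scale.
    """
    k = 2.0 ** (1.0 / s)
    gaussians = [gaussian_filter(image.astype(np.float64), sigma0 * k ** i)
                 for i in range(s + 3)]
    dogs = [g1 - g0 for g0, g1 in zip(gaussians[:-1], gaussians[1:])]
    return gaussians, dogs
```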
Affine Invariant Features

In a scale invariant feature detector such as SIFT, the scale change is assumed to be isotropic. The assumption holds if the feature patch moves in the depth direction, resulting in a feature size that changes uniformly across the patch. An out-of-plane rotation is likely to result in a scale change that is anisotropic across the patch. To cope with this problem, a variety of affine region detectors have been developed. Mikolajczyk et al. [33] compared six such affine invariant detectors. Among those detectors, Maximally Stable Extremal Regions (MSER) [28] are shown to perform better than the other detectors for most test cases. It is also shown that the Harris affine detector [32] and the Hessian affine detector [33] can detect more regions compared to the other detectors.

Wide baseline matching between affine invariant features is based on an elliptical affine region defined for each extracted feature. Specifically, the major and minor axes of the ellipse are set to the two eigenvalues of the second moment matrix, computed based on the image gradient of the patch centered at the feature location. Between matching frames, two affine regions tagged to the same feature can have different orientations and sizes, due to the inter-frame motion of the camera. To match the affine regions, an orthogonal transformation is found to transform the two affine regions into one normalized set of coordinates, where the two ellipses are projected to two circles. In the normalized coordinates, the features can be matched by taking the SIFT approach to normalize in-plane rotation. Finding the orthogonal transformation is an iterative process, which can potentially be time consuming in case many candidates need to be checked. It is not clear whether such a matching scheme can be realized in real time.

2.2.3 Motion Compensated Feature Patch Matching

Along with the SLAM framework, Davison et al. [14] also propose a feature tracking scheme which incorporates the camera motion state to compensate for the patch appearance change due to camera view angle change. In this work, the location and pose of a moving camera are determined based on a video sequence of a rigid environment. For each feature patch, a 3D orientation vector normal to the patch plane is constructed. As a result, with respect to the camera location and pose, the original image patch tagged to the tracked feature can be projected into the current view. The projected template is then used to perform a normalized cross-correlation search within the search region, so as to find the feature match in the current frame.

The drawback of such a scheme is that it relies on predicting the feature locations with respect to the camera. To compensate for that, a greater search area can always be implemented to improve the tolerance to prediction error. The method certainly works better when a rigid object is tracked. If the object to be tracked is not rigid, features can still be projected onto the current time instance based on the deformation estimation and the camera motion state.
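Projecting a reference template into the current view, as in the motion-compensated scheme above, amounts to warping through the planar homography induced by the patch plane and the inter-frame camera motion. A minimal sketch is given below under the conventions stated in the comments; it is not Davison's implementation, and in practice only the pixels of the template region, not the whole image, would be mapped.

```python
import numpy as np

def patch_homography(K, R, t, n, d):
    """Homography mapping reference-image pixels of a planar patch into the
    current image.

    Assumptions of this sketch: the patch lies on the plane n . X = d in the
    reference camera frame, and the current camera pose satisfies
    X_cur = R @ X_ref + t. K is the 3x3 intrinsic matrix.
    """
    return K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)

def map_pixel(H, u, v):
    """Apply a homography to one pixel coordinate (u, v)."""
    x = H @ np.array([u, v, 1.0])
    return x[:2] / x[2]
```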
In summary, a feature matching scheme finds feature matches based on appearance similarity of the patch tagged to each feature. A spatial guidance mechanism should be in place to predict the 2D location of the match, leading to a small search range. With a smaller search range, real-time matching speed can be achieved while the scheme is less vulnerable to matching outliers.

Due to camera view angle changes, illumination condition changes, and surgical instrument occlusions, feature tracking can fail momentarily. In a continuous feature matching scheme, both the patch matching and spatial guidance mechanisms are unsuitable for failure recovery. Specifically, in case a feature just moves out of a surgical instrument occlusion, the patch appearance can be significantly different from its last matched instance, while the feature can travel a large 2D distance with respect to its last known location. The difference in patch appearance can fail a similarity matching scheme such as those used by Yip et al. [52] and Giannarou et al. [17], because neither of them can normalize large-scale image transformations. In terms of spatial guidance, the last known 2D location used in [52] and the single-feature-based EKF used in [17] cannot estimate the potential 2D matching location in case the feature has not been successfully matched for a few frames.

To overcome these drawbacks, we use the surface-based EKF proposed by Davison [14]. In this framework, as long as the surface registration is maintained, a reliable estimate of the 2D feature location can be acquired for each feature in the 3D surface map, even if the feature has not been successfully matched for a while. In our feature matching scheme, we do not update our descriptor over time. Such a scheme is not suitable for tracking recovery for a feature that has been lost for a long time. Our current tracking configuration restricts the relative motion between the camera and the tissue to be tracked. As a result, even a simple descriptor such as the BRIEF descriptor can have a reasonable feature matching rate.

Chapter 3
Surface Tracking Method

This chapter describes the surface tracking method used and its implementation. To track a prostate surface, the DOG function is used to build the scale space within a region of interest, from which the SIFT feature detector is used to extract salient locations. The SIFT features extracted from the left and right images are stereo matched and reconstructed in 3D via triangulation. This allows us to have a 3D feature map tagged to the tracked surface. An EKF framework similar to the one used by Mountney et al. [38] and Davison et al. [14] is used to predict the feature locations in the current frames, so as to guide temporal feature matching. To make the EKF framework more suitable for MIS applications, the feature management scheme used is modified so that no feature is deleted from the surface map. This increases the chance of finding a long-lost feature. If 2D temporal matching is successful in both stereo channels, the matched feature pairs can be used to reconstruct the feature locations in 3D, which drive the EKF to update the 3D feature locations of the surface feature map and the camera motion state. The overall system work flow is presented in Figure 3.1.

Figure 3.1: The overall work flow of the EKF framework.

We use the BRIEF descriptor [8] to match newly extracted SIFT features to the features stored in the 3D surface map. The BRIEF descriptor describes the patch by a series of intensity comparisons between different pixels within the patch. The pixels to be compared are selected beforehand following an isotropic Gaussian distribution centered at the detected feature location. Specifically, the pixels closer to the feature location are more likely to be selected as pixels to be compared. The descriptor matching can be done by counting how many comparison results are the same between the descriptor pairs. In case the comparison results are saved as bits, BRIEF matching can be done by computing the Hamming distance between the two descriptors instead of the L2 norm or normalized cross-correlation. The Hamming distance measures the similarity of two bit vectors by counting the number of corresponding bits that are different. Specifically, the distance can be computed by applying a bit-wise XOR operator to the vector pair and counting the number of 1 bits in the XOR result vector. For the counting part, a modern processor is likely to provide a POPCNT instruction to count the number of 1 bits in a given integer. Using such a machine-level optimized instruction can dramatically reduce the matching execution time. The major drawback of the BRIEF descriptor is its lack of a mechanism to handle any rotation or scale change.
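A minimal sketch of the descriptor construction and Hamming matching just described is shown below. The pattern size (256 pairs), Gaussian spread, and clipping bound are placeholder choices, the patch is assumed to be pre-smoothed and fully inside the image, and NumPy's bit operations stand in for the bit-wise XOR/POPCNT pair mentioned in the text; the GPU implementation used in this work differs in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
# Sampling pattern: 256 pixel pairs drawn once from an isotropic Gaussian
# centred on the feature location; the same pattern is reused for every
# descriptor so that bits are comparable.
PATTERN = np.clip(rng.normal(0.0, 4.0, size=(256, 4)).round().astype(int),
                  -12, 12)

def brief(image, r, c):
    """256-bit BRIEF descriptor of the patch centred at (r, c), packed
    into 32 bytes. The patch is assumed to lie entirely inside `image`."""
    bits = [1 if image[r + dr1, c + dc1] < image[r + dr2, c + dc2] else 0
            for dr1, dc1, dr2, dc2 in PATTERN]
    return np.packbits(np.array(bits, dtype=np.uint8))

def hamming(d1, d2):
    """Hamming distance between two packed descriptors: bit-wise XOR
    followed by a population count."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())
```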
In both figures, different instances ofobject and camera are distinguished by solid and dash lines. Featuresthat can have significant affine transformation are masked in red.183.1 EKF Formulation3.1.1 EKF State Vector and Covariance MatrixAn EKF is used to simultaneously update the camera motion state and 3D locationsof a surface feature map extracted from the tracked surface. The state vector of theEKF are defined as:x =xvy1y2..., xv =rqvω. (3.1)The motion state vector (xv) consist of translation (r), its velocity (v), rotation ex-pressed in quaternion (q), and its angular velocity (ω). The motion states are de-fined to transform points from the camera coordinates to the object coordinates. yiis the 3D location of the ith feature in the object frame. The system is illustratedin Figure 3.3. The EKF framework proposed estimates the rigid transformationbetween two coordinate systems. It disregards which coordinate system is movingin the real world coordinate system.The covariance matrix of the EKF is define as:P =Pxx Pxy1 Pxy2 · · ·Py1x Py1y1 Py1y2 · · ·Py2x Py2y1 Py2y2 · · ·.......... . .. (3.2)Pxx is the covariance matrix of the motion states while Pyiyi is the covariance matrixof the ith feature in the 3D surface map. Pxyi is the cross-covariance matrix betweenthe motion states and the ith feature in the 3D surface map. Pyiy j is the cross-covariance matrix between the ith and jth features.3.1.2 Motion ModelThe motion model used in this work is the constant velocity model proposed byDavison [14]. At each time instance, unknown linear and angular accelerations are19𝑟𝑞𝑣𝜔 𝑦𝑖ℎ𝑖𝑧𝑖 𝑦𝑖camera 𝑥𝑦 𝑧object 𝑥𝑜𝑦𝑜 𝑧𝑜Figure 3.3: A schematic illustration of the EKF states and motion model. Inthe schematics, zi is the newly acquired feature measurement for the ithfeature in the 3D surface map. hi is the transformed location of the ithfeature. The transformation is based on the EKF predicted motion statesand yi, the 3D location of the ith feature in the object coordinates. y˜i’sare the innovations which drive the EKF update.applied on the velocities. The accelerations are modeled to have zero mean andGaussian distribution with experimentally determined uncertainties. Specifically,the acceleration vector n can be expressed as:n =[VΩ]=[a∆tα∆t]. (3.3)where ∆t is the time interval between adjacent video frames, a is the linear accel-eration and α is the angular acceleration. Here, the undetermined acceleration, n,can also be treated as velocity noise. After all, n is modeled as the process noise inthe EKF formulation. The EKF formulation can be written as:xv(t) = fv(xv(t−1),n), zi = hi(xv(t−1),yi,vi). (3.4)In equation 3.4, yi is the 3D location of ith feature in the object coordinates. Givena rigid object, yi is not affected by the motion model while it is constantly refinedin the EKF update step. zi is the newly measured 3D feature location in the camera20coordinates at time instance t. hi is the transformed 3D feature location in thecamera coordinates. The transformation is based on the motion states and yi. vi isthe observation noise.3.1.3 EKF PredictionBased on the constant velocity model, the motion state vector can be predicted as:xv(t|t−1) = fv(xv(t−1|t−1),n) =r+(v+V )∆tq×q((ω+Ω)∆t)v+Vω+Ω. (3.5)Throughout this thesis, the subscript t − 1|t − 1 stands for the final estimation attime instance t−1. t|t−1 implies the EKF prediction estimation at time instance tand subscript t indicates the final EKF estimation at time instance t. 
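Before the individual terms of equation 3.5 are unpacked, the prediction of the motion state can be sketched as follows. This is a minimal illustration using plain structs, not the actual state-vector layout adapted from [14]; the feature entries yi are simply copied through, as noted below.

// Sketch of the constant velocity prediction in equation (3.5).
#include <cmath>

struct Vec3 { double x, y, z; };
struct Quat { double w, x, y, z; };                          // (w, x, y, z)
struct MotionState { Vec3 r; Quat q; Vec3 v; Vec3 w; };      // r, q, v, omega

Quat fromAxisAngle(const Vec3& theta) {                      // q((omega + Omega) * dt)
    double angle = std::sqrt(theta.x * theta.x + theta.y * theta.y + theta.z * theta.z);
    if (angle < 1e-12) return {1.0, 0.0, 0.0, 0.0};
    double s = std::sin(angle / 2.0) / angle;
    return {std::cos(angle / 2.0), theta.x * s, theta.y * s, theta.z * s};
}

Quat multiply(const Quat& a, const Quat& b) {                // Hamilton product a x b
    return {a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
            a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
            a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
            a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w};
}

// fv(xv, n): V and Omega are the zero-mean acceleration terms of equation
// (3.3); the mean prediction simply passes V = Omega = 0.
MotionState predict(const MotionState& xv, const Vec3& V, const Vec3& Om, double dt) {
    MotionState out;
    out.v = {xv.v.x + V.x, xv.v.y + V.y, xv.v.z + V.z};
    out.w = {xv.w.x + Om.x, xv.w.y + Om.y, xv.w.z + Om.z};
    out.r = {xv.r.x + out.v.x * dt, xv.r.y + out.v.y * dt, xv.r.z + out.v.z * dt};
    out.q = multiply(xv.q, fromAxisAngle({out.w.x * dt, out.w.y * dt, out.w.z * dt}));
    return out;
}

The same function is linearized to obtain the Jacobians used in the covariance prediction that follows.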
In equation 3.5,q((ω+Ω)∆t) stands for the quaternion converted from the axis angle (ω+Ω)∆t.The quaternion multiplication is carried out by multiplying the equivalent 4× 4matrices.The feature locations in the state vector, yi’s, do not change in the EKF predic-tion step. In this case, given N features stored in the 3D surface map, the processfunction can be written as:xt|t−1 = F(xt−1|t−1,n) =fv(xv(t−1|t−1),n)y0(t−1|t−1)...y(N−1)(t−1|t−1). (3.6)The covariance estimation in the prediction step can be expressed as:Pt|t−1 =∂F∂x Pt−1|t−1∂F∂xT+Q. (3.7)21Q is modeled as the covariance matrix of the process error as:Q =∂F∂n Pn∂F∂nT, Pn =[(δa∆t)2 00 (δα∆t)2]. (3.8)The process function Jacobians can be computed as:∂F∂x =[∂ fv∂xv 00 I], and∂F∂n =[∂ fv∂n 00 0]. (3.9)In order to reduce computation complexity, equation 3.7 can be directly computedas:Pt|t−1 =∂ fv∂xv Pxx(t−1|t−1)∂ fv∂xvT+Q ∂ fv∂xv Pxy1(t−1|t−1)∂ fv∂xv Pxy2(t−1|t−1) · · ·Py1x(t−1|t−1)∂ fv∂xvTPy1y1(t−1|t−1) Py1y2(t−1|t−1) · · ·Py2x(t−1|t−1)∂ fv∂xvTPy2y1(t−1|t−1) Py2y2(t−1|t−1) · · ·.......... . ..(3.10)The derivations of the Jacobians, ∂ fv∂xv and∂ fv∂n , are shown in the Appendix A.1.In equation 3.8, δ 2a and δ 2α are experimentally determined based on the tradeoffbetween motion smoothness and sensitivity to feature measurements. Specifically,a greater Pn allows rapid acceleration driven by the feature measurements whilesmaller Pn enforces a smooth motion model with limited velocity variation betweenadjacent time instances. The selection of δa and δα is discussed in Section 6.1.2.3.1.4 EKF UpdateWith newly acquired feature measurements at time instance t, the innovation, y˜i,for the ith feature in the 3D surface map can be computed as:y˜i = zi−hi. (3.11)For the ith feature in the 3D surface map, hi can be computed as:hi = R¯(yi− r). (3.12)22R¯ is the rotation matrix converted from the inverse of the quaternion,q, stored inthe EKF predicted motion state vector, xv(t|t−1). r can be directly acquired fromxv(t|t−1). Since r and q are defined to transform points from the camera coordinatesto the object coordinates, the transformation is inverted in equation 3.12.Based on the entire innovation vector at time instance t as y˜t , the final EKFupdate can be carried out as:xt|t = xt|t−1 +Wt y˜t , (3.13)Pt|t = Pt|t−1−WtStWTt . (3.14)The Kalman gain Wt can be computed as:Wt = Pt|t−1∂h∂xT(x(t|t−1))S−1t . (3.15)St is the innovation covariance matrix,St =∂h∂x (x(t|t−1))Pt|t−1∂h∂x (x(t|t−1))T +Rt , (3.16)where Rt is the measurement covariance matrix, the diagonal of which are thevariances of the newly acquired features. The variance of the feature measurementcan be carried out based on equation (3.33). St is computed using only featureswith successful measurement at current frame where ∂h∂x can be computed as∂h∂x =∂h∂xv (y0)∂h∂yi (y0) · · · 0 0 · · · 0∂h∂xv (y1) 0 · · ·∂h∂yi (y1) 0 · · · 0.....................∂h∂xv (yk) 0 · · · · · · · · · · · ·∂h∂yi (yk−1), (3.17)where k is the number of successful measurements in the frame. In this case, ∂h∂x isa 3k× (13+3N) matrix. The details of computing ∂h∂xv (yi) and∂h∂yi (yi) are shown inAppendix A.2.After the EKF update, we re-normalize the quaternion in the motion state to23correct the potential numerical error induced during the EKF propagation.The nor-malization can be done by:qnorm =q‖q‖. (3.18)The covariance matrix has to be adjusted according to the normalization. 
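The renormalization itself is a single division; what makes the step non-trivial is the covariance adjustment described next, which uses the Jacobian of q/||q||. A minimal sketch, with the Jacobian written in its standard closed form (the full derivation used in this work is in Appendix A.2):

// Sketch of the quaternion renormalization in equation (3.18) and the
// Jacobian d(q/||q||)/dq = (I*||q||^2 - q*q^T) / ||q||^3 used to adjust P.
#include <array>
#include <cmath>

using Quat = std::array<double, 4>;                          // (w, x, y, z)

Quat renormalize(const Quat& q) {
    double n = std::sqrt(q[0]*q[0] + q[1]*q[1] + q[2]*q[2] + q[3]*q[3]);
    return {q[0] / n, q[1] / n, q[2] / n, q[3] / n};
}

std::array<std::array<double, 4>, 4> renormJacobian(const Quat& q) {
    double n2 = q[0]*q[0] + q[1]*q[1] + q[2]*q[2] + q[3]*q[3];
    double n3 = n2 * std::sqrt(n2);
    std::array<std::array<double, 4>, 4> J{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            J[i][j] = ((i == j ? n2 : 0.0) - q[i] * q[j]) / n3;
    return J;
}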
Since thenormalization only affects the motion state, only the motion state related terms inthe covariance matrix are modified as:Pxx =∂xv(norm)∂xv·Pxx(old) · (∂xv(norm)∂xv)T , (3.19)andPxyi =∂xv(norm)∂xv·Pxyi(old). (3.20)∂xv(norm)∂xv can be computed as :∂xv(norm)∂xv=I∂qnorm∂qI (3.21)The detail information on computing ∂qnorm∂q is presented in Appendix A.2. Finally,the covariance matrix is symmetrized as:Pf inal =12(Pnorm +PTnorm). (3.22)3.1.5 Adding Features into the EKF 3D Surface MapFor 3D object tracking, it is necessary to explore new surface of the object and addnew features in the 3D surface feature map. Consider adding L features into theEKF frame containing N features, for jth feature to add, the back-transformed 3Dfeature location y j can be computed as:y j =R · z j + r, (3.23)24where z j is the current measurement of the jth feature. r and R can be acquiredfrom the camera location (r) and pose quaternion (q) respectively. Based on thefinal EKF estimation of the state vector and covariance matrix for existing features,the new covariance matrix can be generated as:Padd = M ·B ·MT . (3.24)M contains the related Jacobians as:M =[I JxyJyx Jyy], (3.25)I is a N×N identity matrix. Jxy contains the Jacobians of the back-transformedfeature locations to add with respect to the camera motion state,Jxy =[∂y0∂xv · · ·∂yL−1∂xv]. (3.26)Jyy contains the Jacobians of the back-transformed feature locations to add withrespect to the measured locations in the current frame,Jyy =∂y0∂ z0 · · · 0... ∂y1∂ z1.... . .0 · · · ∂yL−1∂ zL−1. (3.27)B can be defined as:B =[P 00 Rt]. (3.28)25P is the current final estimated covariance matrix. Rt contains the uncertainties ofthe new features to add:Rt =Rt(0) · · · 0... Rt(1).... . .0 · · · Rt(L−1), (3.29)where Rt( j) is a diagonal matrix defined based on the uncertainties of jth feature toadd. Rt( j) can be estimated based on equation (3.33). The computation of Jacobians∂y j∂xv and∂y j∂ z j are shown in Appendix A.3.3.2 EKF InitializationWe start the tracking by building the 3D probabilistic feature map tagged to thesurface. The feature map contains the 3D features extracted from the tissue surfacealong with their 3D uncertainties.Firstly, the user is prompted to select a region of interest (ROI) by clickinga series of points defining a 2D contour. A typical ROI is shown in Figure 3.4.SIFT features are then detected from both stereo images within the ROI. At eachdetected feature location, both of the BRIEF and SIFT descriptors are constructed.The BRIEF descriptor is used for temporal feature matching. Being a more robustdescriptor, the SIFT descriptor is used for initial stereo matching, which is per-formed only once, at the very first frame. Given a matched stereo feature pair, weapply two layers of outlier rejection mechanisms to ensure the correctness of stereomatching:• As suggested by Lowe [26], for a good feature match, the ratio between thesmallest Euclidean descriptor distance and the second smallest shall be lessthan a threshold. In [26], Lowe suggests a threshold of 0.8. Since stereomatching quality has a big impact on the overall tracking performance, thethreshold is set to 0.6 to ensure outlier rejection while potentially sacrificingmore inliers.• We reconstruct the stereo matched feature pairs in 3D using the method in-26Figure 3.4: The ROI selected by the user describing the prostate surface totrack.cluded in [6]. A description of the method is given in AppendixC. 
We as-sume the depth estimations shall be similar within the neighborhood. As aresult, for each 3D reconstructed feature in the camera coordinates, we findthe median depth for 3D features within a 5 mm radius in xy direction (no-tation defined in Figure 3.3). In case the difference between the median ofdepth and the feature depth is greater than 5 mm, we reject the feature. Theparameters are suitable at a depth of 10 cm.The stereo matched 2D feature pairs are shown in Figure 3.6a and Figure 3.6b.Given the 3D locations of the feature cloud reconstructed from the stereo matchedfeature pairs (Figure 3.6c), we also estimate the uncertainty of the 3D measure-ment, as required to build the probabilistic feature map. The 3D reconstructionerror can be estimated based on 2D locations of the stereo matched feature pairin rectified coordinates (Figure 3.5). We rectify each extracted feature using themethod included in [6]. A description of the method is included in Appendix C.The 3D reconstruction error is mainly on the z axis of the camera coordinatesystem, which is a result of error in the disparity measurement on x axis [7]. In arectified coordinate system, the epipolar lines of the stereo views are aligned in x27௥ܱଶܺௗ௥- ܥ௑௥௟ܱଶܺௗ௟- ܥ௑௟ܫ𝑧 𝑧ܺ ܻܺ ܻimage planeܲ݌௟݌௥݂ଷܼௗ௟ଷܺௗ௟(a)𝑧ܺimage planeܲଷܼௗ௟ଷܻௗ௟െܥ௒ܱ ܻ݌݂(b)Figure 3.5: A schematic illustration of the stereo geometry in a rectified co-ordinate system. (a) The XZ plane view of the coordinate system. (b)The Y Z plane view of the coordinate system. X l2d and Xr2d are the recti-fied image coordinates on X axis. ClX and CrX are the X components ofthe rectified principal axes. A 3D point P is projected into the left andright camera views at pl and pr respectively. Ol and Or are the cameralocations. f is the rectified focus length. X l3d , Yl3d , and Zl3d are the 3Dcoordinates of P in the rectified coordinate system of the left camera. In(b), the Y component of the principal axis, CY , and camera location Oare the same for both cameras, due to rectification.28direction in the camera coordinate system. As a result, a simpler 3D reconstructionformulation can be derived while error relationship can be revealed. In Figure 3.5,a top view is given to emphasize the geometry in the XZ plane. The rectificationmethod used normalizes the focal length differences between the two cameras andthe distortion effects on each of the cameras. As shown in Figure 3.5, the 3D featurelocation P can be estimated based on 2D feature locations, pl and pr, detected inthe left and right camera views respectively. Specifically, based on the similartriangles, the depth of P, Zl3d , can be written as:Zl3d =f Id, where d = (X l2d−ClX)+(CrX −Xr2d). (3.30)d is generally referred as disparity. X l2d and Xr2d are the rectified image coordinateson the X axis. Clx and Crx are X components of the principal axes for the left andright cameras respectively. I is the stereo base between the cameras. f is the focallength in the rectified coordinates. Following the equation (3.30), the X componentof the 3D location of P is:X l3d =(X l2d−ClX)Zl3df=(X l2d−ClX)Id. (3.31)As shown in Figure 3.5b, the Y component of the 3D location of P is:Y l3d =(Y l2d−CY )Zl3df=(X l2d−ClX)Id. (3.32)In equation (3.31), (3.32), and (3.30), the variables are X l2d , Yl2d , and d. Based onuncertainty propagation, the uncertainty of the 3D measurement in rectified coor-dinates can be estimated knowing the uncertainties of the variables. 
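Before the propagation is written out in closed form, the triangulation of equations (3.30)-(3.32) and the first-order propagation of the 2D detection variance can be sketched numerically as follows. The calibration numbers in main() are placeholders of roughly the same order as the endoscope calibration in Chapter 5, not measured values.

// Numeric sketch of rectified triangulation with first-order uncertainty
// propagation; all lengths in mm, image coordinates in pixels.
#include <cstdio>

struct Point3 { double x, y, z; };
struct Var3   { double vx, vy, vz; };

// f: rectified focal length [px], I: stereo baseline [mm],
// clx, crx, cy: rectified principal point components [px].
struct RectifiedRig { double f, I, clx, crx, cy; };

void triangulate(const RectifiedRig& rig,
                 double xl, double yl, double xr,     // rectified 2D coordinates
                 double var2d,                        // 2D detection variance [px^2]
                 Point3& p, Var3& v) {
    double d = (xl - rig.clx) + (rig.crx - xr);       // disparity, equation (3.30)
    p.z = rig.f * rig.I / d;
    p.x = (xl - rig.clx) * rig.I / d;
    p.y = (yl - rig.cy)  * rig.I / d;

    double varD = 2.0 * var2d;                        // disparity combines two detections
    double d2 = d * d, d4 = d2 * d2;
    double I2 = rig.I * rig.I;
    v.vx = I2 * var2d / d2 + I2 * (xl - rig.clx) * (xl - rig.clx) * varD / d4;
    v.vy = I2 * var2d / d2 + I2 * (yl - rig.cy)  * (yl - rig.cy)  * varD / d4;
    v.vz = rig.f * rig.f * I2 * varD / d4;
}

int main() {
    RectifiedRig rig{1570.0, 5.4, 640.0, 640.0, 512.0};   // placeholder calibration
    Point3 p; Var3 v;
    triangulate(rig, 700.0, 400.0, 615.0, 0.5, p, v);
    std::printf("P = (%.1f, %.1f, %.1f) mm, var = (%.3f, %.3f, %.3f) mm^2\n",
                p.x, p.y, p.z, v.vx, v.vy, v.vz);
    return 0;
}

At these placeholder values the depth standard deviation comes out on the order of 1 mm at a working distance of roughly 10 cm, which is of the same order as the values quoted below.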
Specifically,the 3D measurement uncertainty can be written as:δ 2X =I2δ 2X2dd2+I2(X l2d−ClX)2δ 2dd4,δ 2Y =I2δ 2Y2dd2+I2(Y l2d−CY )2δ 2dd4,δ 2Z =f 2I2δ 2dd4.(3.33)29Since we do not expect measurement bias in X and Y directions, δX2d and δY2d arethe same while δd is 2δX2d . We set δ 2X2d to 0.5, respecting the 2D measurementuncertainty for the SIFT detector. This 2D uncertainty corresponds to a 3D mea-surement error of roughly 1 mm at 8 cm, shown in Figure 3.6d. In the future, adetail measurement of 3D reconstruction error of SIFT feature shall be carried outto determine δ 2X2d more precisely. The rest of the terms in equation 3.33 can beacquired from camera calibration and image rectification. The uncertainties are es-timated in the left rectified coordinates. An extra rotation is required to transformthe uncertainties to the left camera coordinates, where 3D probabilistic surface mapis generated.In equation 3.1, the initial value of yi is set to the 3D location of the ith feature.The Pyiyi terms in equation 3.2 are initially set based on the 3D feature uncertainties.Specifically, the diagonal terms of the Pyiyi’s are set to the x, y, and z componentsof the 3D uncertainty of the ith feature. The rest of the terms in Pyiyi’s are set tozero.Initially, the camera location (r) and its uncertainty are set to (0,0,0)T and(0,0,0)T respectively. The initial quaternion (q) and its uncertainty are set to(1,0,0,0)T and (0,0,0)T . The linear velocity (v) and its uncertainty are set to(10−14,10−14,10−14)T and (0,0,0)T . The angular velocity (ω) and its uncertaintyare set to (10−14,10−14,10−14)T and (0,0,0)T . The initial velocity terms are setto non-zero to avoid zero denominators while computing the initial system Jaco-bians. The detail derivation of Jacobian is included in Appendix A.1. As shown inequation (3.10), a Q term is added onto the covariance matrix of the motion statein the EKF prediction step. As shown in the Section 6.1.2, the initial motion statecovariance has slim effect on the tracking quality.3.3 Feature MeasurementFor each stereo channel, within a search region around the predicted 2D featurelocations, we use a GPU based SIFT detector to extract features. The GPU imple-mentation details of the SIFT detector are discussed in Chapter 4. The detectedSIFT features are temporally matched to the features in the surface map. For a fea-ture in the 3D surface map, if temporal matches are found in both stereo channels,30(a) (b)−20 0 20 6080 100−1001020z (mm)x (mm)y (mm)(c)0 20 40 6000.511.522.5feature indexvariance (mm2 )  (d)Figure 3.6: Feature information for the first frame. (a-b) Stereo matchedSIFT feature pairs detected from the stereo frames; (c) 3D point lo-cations reconstructed based on stereo matched 2D feature pairs; (d) thevariance estimated in mm2. In (d), the red, blue, and red curves cor-respond to the variances on x, y, and z axes of the camera coordinatesrespectively.31the 2D locations of the stereo feature pairs are used to reconstructed the featurelocation in 3D. Meanwhile, the 2D locations of the stereo feature pairs are alsorectified to compute the 3D measurement uncertainties. To achieve real time per-formance, the feature measurement scheme is implemented on a modern GPU.3.3.1 SIFT Feature DetectionSelect Bounding Box for Scale Space ConstructionBuilding the scale space for SIFT feature extraction is the most computation con-suming part of the framework. 
To reduce the computation, we define a region ofinterest based on predicted locations of the features in the 3D surface map. Specif-ically, a bounding box is defined to contain all the transformed features with 100pixel margins on each boundary side. The transformation is based on the EKFpredicted motion state and the stored 3D feature locations in the EKF state vector.In case the EKF motion prediction is sufficient close to the actual estimation, thismargin guarantees that the search region is contained in the bounding box for eachpredicted 3D surface map feature. In addition, the margin is big enough to ensurethat the boundary effect of the Gaussian blur does not impact the feature extractionin high scale octaves. The bounding box for a typical frame is shown in Figure 3.7.Region MaskingSearch Region Masking. In each stereo channel, SIFT features are only extractedwithin search regions around the predicted 2D feature locations, as shown in Figure3.7. To terminate the irrelevant features in an early stage, we mask the searchregions in the DOG functions for each octave. The masking is shown in Figure3.8.Saturation Masking. Image saturation due to poor illumination condition canpose feature extraction problem, since the saturation regions generally do not con-tain any information. Even within the regions near the saturation spots, the fea-ture extracted are not reliable, because the tagged patches are likely to experience32Figure 3.7: The boxing box for scale space construction is shown. The pre-dicted location for each 3D surface feature and its search region isshown as well.Figure 3.8: The mask for each scale space octave. Such masks allow us toterminate feature extraction in early stage.33Figure 3.9: The saturation map the image shown in Figure 3.7.Figure 3.10: The saturation mask generated based on the saturation mapshown in Figure 3.9. The threshold selected is 0.2.34significant appearance changes. In this case, saturation regions are masked outbefore feature extraction. As suggested in [49], saturation regions can be elimi-nated by applying a threshold on the saturation components of the Hue SaturationValue (HSV) color map. For the image shown in Figure 3.7, the saturation map isshown in Figure 3.9 while the masked image is shown in figure 3.10.3.4 Outlier RejectionAn outlier in 2D feature matching can have considerable impact on the overalltracking. Generally, the 2D distance between the outlier and the correct match isnot great, due to the search region we enforce. However, due to the narrow stereobase of the endoscope, a small 2D error can be magnified to cause a large error in3D triangulation. As a result, an outlier rejection should be in place to enhance thetracking robustness.To suppress feature measurement outliers, a threshold is set on the absolutevalue of the EKF innovation, shown in equation 3.11. The outlier rejection mecha-nism is based on the assumption that the measured feature locations are consistentwith the EKF predicted locations. In case of an outlier, the measurement zi falls faraway from the predicted location hi. As a result, setting a threshold on the abso-lute value of the EKF innovation can boost the tracking robustness. Since the EKFtracking is driven by the innovation, limiting the terms in the innovation furtherenforces the smoothness of the estimation. The threshold is currently set to 1.5mm, assuming the 3D reconstruction error at 10cm is 1 mm.In [11], the 1-point RANSAC scheme is proposed to rejection outlier in theEKF filtering used in SLAM framework. 
The framework is applied to MIS appli-cation in [18]. The outlier rejection issue is generally solved by RANSAC. In theRANSAC framework, small randomly selected subsets of measurements are usedto generate hypotheses which are validated based on agreements of other measure-ments. In case a sufficient number of measurements agree with a hypothesis, thehypothesis is chosen as the final result with the measurement set as inlier. The1-point RANSAC scheme uses single measurement EKF update for hypothesis se-lection. Specifically, for each hypothesis selection trail, a single measurement israndomly selected to update the EKF state vector, so as to provide a hypothesis.35Such hypothesis selection terminates until a reasonable number of measurementsagree with the selected hypothesis. The selected measurements are treated as initialinliers, zin1. The rest of the measurements are treated as initial outliers, zout , whichwill be further tested. The initial inliers are used to drive the first iteration of theEKF update. The result is then used to test the measurements in zout . Specifically,if a measurement agree with the first iteration result, the measurement is treatedas an inlier and added to secondary inlier list, zin2. The final EKF update result iscarried out based on zin2 and the first iteration EKF update result.In our current surface tracking scheme, in order to reject outliers and ensuresmooth tracking, only the low innovation measurements are used for the EKFupdate with the high innovation measurements discarded. In [11], Civera et al.proposes a method which allows high innovation measurements to potentially con-tribute to the EKF update as well. Yet, it is not clear whether the 1-point RANSACscheme outperforms our method or not. With the capture frequency of the videosequence, there might be only few high innovation measurements. In the future, weshould compare our outlier rejection scheme with the 1-point RANSAC scheme.3.5 Feature Management3.5.1 Feature Management Scheme by Davison et al. [14]To allow the tracker to explore new scenes, Davison et al. [14] has a feature man-agement scheme which dynamically adds to the feature map. Specifically, theframework detects a new feature using the Shi and Tomasi [44] interest point op-erator. The operator is applied on a 80× 60 box placed randomly in the image.A description of the operator is given in AppendixD. The constraints are: 1) thebox does not overlap with the predicted search ranges of the projected features; 2)the newly detected feature is likely to stay in the camera view for the next a fewframes.In the work by Davison et al. [14], a feature is deleted from the feature map ifno feature match can be found in the predicted search region for a large proportionof the previous attempts. In case the feature is not visible in the current cameraview, it does not count as a measurement attempt. The feature deletion allows36the framework to eliminate unstable features, which adds computation complexitywithout contributing to system accuracy.In [14], Davison et al. dynamically adds and drops features based on match-ing rate of each feature. The purpose of such a “natural selection” scheme is toensure that the number of matched features is above a threshold, so as to maintainrobust surface tracking for each time instance. In [14], the threshold is set to 12in order to balance the tradeoff between tracking robustness and execution speed.The outcome of the feature management scheme is a static set of stable features,which does not change in time. 
The stable features in the feature map ensures thatminimum matched feature threshold is met for each time instance. As a result, nofeature adding is attempted. On the other hand, stable features will not be deleted.3.5.2 Feature Management with Nonpersistent Feature TrackingLong term stable features can be rare due to bad tracking conditions. The badtracking conditions are caused by constant camera view angle changes, leading torotation and affine transformation on the feature patches. Also, the illuminationconditions in the surgical scene leads to constant change of illumination profile onthe feature images. Moreover, specular reflectance and instruction occlusion canentirely block feature matching for hundreds of frames. With the challenging track-ing condition, the most successful recent work has shown to be able to match 50%of its initial features tracked at frame 250 [17]. This was done on a single channelwhile worse result can be expected for 3D tracking. Because feature matches havebe found in both stereo channels for 3D reconstruction.With insufficient stable features to track, the feature management scheme byDavison et al. [14] leads to constant feature turnovers in the feature map. Specif-ically, in case features can not be consistently matched, the features in the surfacemap are gradually deleted by the feature management scheme. To keep sufficientnumber of features in the map, new features are extracted from the same scene tocompensate the feature loss due to nonpersistent feature tracking. Yet, the newlyadded features eventually get deleted again by the feature management scheme. Inmost cases, the feature management scheme is constantly adding and deleting theexactly same set of features. Such constant feature turnover in the feature map not37only increases computation complexity but also causes tracking to drift.In the context of the feature management scheme by Davison [14], nonper-sistent feature tracking can also increase execution time. Specifically, more fea-tures have to be stored in the feature map to ensure the desired amount of featuresthat are matched for each time instance, which leads to greater EKF update time.Moreover, the feature turnover mentioned above also cause considerable overheadassociated with feature adding. As shown in section 3.1.5, adding feature involvesmatrix generation and multiplication, leading to considerable overhead. The situa-tion deteriorates as the number of features in the surface map increases.It is also not desirable to have a system subject to drift while we are aimingfor a tracking system that can last for thousands of frames. The drift is resultedfrom corrupted back-transformed feature locations in frame 0. Specifically, foreach newly extracted feature in frame n, the measured 3D location and uncertaintyare projected back to frame 0, so the feature can be added into the surface featuremap. The projection is done based on the camera motion states at frame n, whichalways have tracking error. Such projection allows the motion state error to leakinto the 3D feature locations stored in the surface map. With the initial featuresreplaced with newly extracted features, eventually, all the feature locations storedin the map are corrupted with motion error. As the feature turnover goes on, theoverall tracking error accumulates. 
A simulation is conducted in Section 6.1.3.3.5.3 Proposed Feature Management Scheme for a Explored SceneAs discussed in Section 6.4, our feature tracking is sufficiently persistent to benefitfrom a dynamic feature management scheme as proposed by Davison [14]. Yet,to enable dynamic feature adding, an object ROI has to be defined for each videoframe. Specifically, as shown in Figure 3.11, the feature extracted from the back-ground should not be added to the surface map. In future, we should develop a wayto dynamically identify the object ROI. Currently, we suggest that the best way toachieve stable long term tracking is by neither adding or deleting features. Such ascheme allows the framework to find long term lost features under the guidance ofthe EKF prediction, which is the key to long term drift-free tracking. Specifically,given the poor tracking condition, feature tracking is likely to be momentarily lost.38Figure 3.11: A figure showing the features extracted from a partial nephrec-tomy video frame. In this case, the objective is to track the kidney sur-face. As a result, the features extracted from the background (shownin red) should not be added into the surface map. The green featuresare relevant and should be added.In order to recover from such momentary loss of tracking, it is not desirable topermanently delete momentarily lost features. The surface registration allows theEKF framework to predict the 2D location for each feature in the surface map.Based on the prediction, there is a reasonable chance to match a lost feature againusing a guided wide baseline feature match scheme.In the work by Giannarou [17], spatial context information is used to guidethe recovery from momentary loss of tracking. Specifically, the historical motioncorrelations between the lost feature and its nearby tracked features can be used topredict the current feature states for the lost feature. Based on the predicted featurestates, the framework attempts to find the lost feature. In our rigid surface basedEFK scheme, the entire tracked surface can be treated as a spatial context. In adeformable environment, as long as the surface registration is done based on 3Dpoint correspondences, the 3D features in the surface map can still be transformedinto current frame based on current deformation registration. In that case, current392D location of a lost feature can still be predicted. As a result, recovery is possible.For an explored scene, adding features to the 3D surface map is not necessary.In the work by Davison [14], 4 initial features are defined to start the tracker. Astracking continues, more features are added into the map until the features in themap stabilize, as discussed in 3.5.1. In our case, feature extraction is done on theentire visible surface at once, which generally offers sufficient features for robustsurface tracking.40Chapter 4SIFT Feature Detector and ItsImplementation on GPUDetecting the SIFT features is the most computation consuming part of the frame-work. As a result, we place a lot of optimization efforts on this section of thesoftware. There are existing open source SIFT implementations such as SiftGPU[51] and cudaSIFT [5]. However, SiftGPU is essentially a GPU implementationof vlSIFT [48], which produces problematic results, as discussed in this chapter.cudaSIFT does not provide sub-sample interpolated feature locations and its fea-ture location result has not been thoroughly validated. 
Both implementations arestructured in a closed manner, which does not allow easy access to intermediateresults. Therefore, these implementations do not allow the aggressive optimizationmentioned in chapter 3. The SIFT implementation discussed in this chapter is verysimilar to the SIFT implementation by Hess[21], but executed more efficiently ona GPU.4.1 Building Scale Space4.1.1 MethodsThe scale space is a family of Gaussian blurred images generated from the originalimage. Each image in the scale space, which is also called scale level, can be41generated by convolving the original image with a Gaussian kernel of scale δ ,G(x,y,δ ) = 12piδ 2 e−(x2+y2)2δ2 . (4.1)Lowe [26] proposes to extract salient features at extrema of the DoG function com-puted by subtracting 2 adjacent scale levels in the scale space. The DoG function isan approximation of the scale-normalized Laplacian of Gaussian, which providestrue scale invariance [25] and stable [31] feature detection. This approximationrequires a constant ratio between the scales of adjacent scale levels. Assuming thescale for the previous level to be δ , the scale of the current level can be kδ .The scale space is built using a cascade filtering approach, which dramaticallyreduces the computation complexity. Specifically, each level of the scale space isincrementally blurred from the previous scale level, instead of from the originalimage. The benefit of such approach is to avoid a large Gaussian kernel radius dueto large Gaussian function variance required to blur the image. The variance of theGaussian function used for each blurring is evaluated based on the scale values ofthe source level and destination level. Assuming that the scale value for the sourceand destination images are δs and δd respectively, the variance of the Gaussianfunction δg can be computed as:σg =√σ2d −σ2s . (4.2)Since the Gaussian function is approximated in a discrete manner, the kernel radiushas to be increased as the scale value increases. In practice, the kernel radius isset to be greater than 4σg, which is sufficient to approximate the actual Gaussiankernel. By building the scale space incrementally, the required variance for eachscale level is reduced. Therefore, a smaller kernel radius can be used.As suggested in [26], the scale space is divided into octaves. Assuming theoctave index and scale level index to be i and s respectively, the scale value foreach scale level in the scale space can be expressed as:δ = δ0 ·2i+s/S, (4.3)where S is the total number of sampled scale levels in one octave. As shown in42equation 4.3, the (S + 1)th scale level can be passed onto the next octave as thefirst scale level, assuming both o and s start from zero. In addition, from the firstscale level of the octave to the (S+1)th scale level, the scale value is increased bya factor of 2. Since the image sharpness has been reduced by a factor of 2, we canre-sample the scale level by taking every other pixels. This decreases the imagesize of the entire octave by a half, which dramatically reduces the computation.Lowe [26] also suggests a range of suitable parameters based on experimentalresults. As already suggested in equation 4.3, the factor between the scales ofadjacent scale levels is set to 1/S. The number of sampled scale levels in eachoctave, S, is suggested to be 3. Therefore, k = 21/3. As shown in figure 4.1, toextract features on S scale levels, S+ 2 DoG levels have to be constructed, whichrequires S+ 3 scale levels. 
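For these parameter choices, the per-octave blur schedule follows directly from equations (4.2) and (4.3). The sketch below assumes sigma0 = 1.6 and S = 3 and prints the incremental blur and kernel radius for each level; it illustrates the schedule only and is not the GPU code.

// Per-octave blur schedule from equations (4.2)-(4.3), assuming sigma0 = 1.6
// and S = 3. The relative schedule repeats in every octave, since each octave
// starts from the (S+1)-th level of the previous one after down-sampling by 2.
#include <cmath>
#include <cstdio>

int main() {
    const double sigma0 = 1.6;
    const int S = 3;                                   // sampled levels per octave
    const double k = std::pow(2.0, 1.0 / S);

    for (int s = 1; s < S + 3; ++s) {                  // S + 3 = 6 scale levels
        double sigmaSrc = sigma0 * std::pow(k, s - 1); // scale of the source level
        double sigmaDst = sigma0 * std::pow(k, s);     // scale of the destination level
        double sigmaG   = std::sqrt(sigmaDst * sigmaDst - sigmaSrc * sigmaSrc);
        int radius = static_cast<int>(std::ceil(4.0 * sigmaG));
        std::printf("level %d: sigma %.3f, incremental blur %.3f, kernel radius %d\n",
                    s, sigmaDst, sigmaG, radius);
    }
    return 0;
}

The reported radii stay between roughly 5 and 13 pixels, which is what makes the incremental, per-octave construction much cheaper than blurring every level directly from the original image.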
In case we extract features on 3 scale levels, 6 scalelevels are required for each octave. For each image, we will build a scale spacewith a total of 6 octaves. As suggested in [26], in order to get more stable features,the image size of the first octave is doubled with respect to the size of the originalimage. We expand the image by linearly interpolating extra pixels between existingpixels. Assuming the original image has a scale of 0.5, the scale of the new enlargedimage is 1.0. Since Lowe [26] suggest a initial scale, δ0, to be 1.6, pre-blurring hasbe carried on the enlarged image before the incremental scale space construction.4.1.2 GPU ImplementationConstructing scale space is one of the major computation bottlenecks of the wholepipeline, the building block of which is the Gaussian blurry kernel. Although the2-D convolution kernel function is a massively parallelizable algorithm, sophisti-cate memory management and low level optimization are involved to achieve bestperformance[40]. Fortunately, NVIDIA has provided an existing implementationas part of the CUDA SDK example. Nevertheless, the CUDA kernel function onlyworks for 3072×3072 images with a the convolution kernel size up to 17×17. Tobe integrated into our framework, the blurring function has to allow the user tochange the Gaussian kernel size with respect to the blurring variance required. Inaddition, the image size has to be flexible as well.43original image in grey scale up sample kernel Gaussian blur kernel 1st DOG level 2nd DOG level 3rd DOG level 4th DOG level 5th DOG level subtract image kernel 4th scale level down sample kernel 1st scale level for next octave 1st scale level (s = 0, ࣌ൌ૚Ǥ૟)  2nd scale level (s = 1, ࣌ൌ૛Ǥ૙૚૟) 3rd scale level (s = 2, ࣌ൌ૛Ǥ૞૝) 4th scale level (s = 3, ࣌ൌ૜Ǥ૛) 5th scale level (s = 4, ࣌ൌ૝Ǥ૙૜૛) 6th scale level (s = 5, ࣌ൌ૞Ǥ૙ૡ) Gaussian blur kernel Gaussian blur kernel Gaussian blur kernel Gaussian blur kernel Gaussian blur kernel subtract image kernel subtract image kernel subtract image kernel subtract image kernel scale level passed from last octave Figure 4.1: This figure shows the overall workflow of building a octave. The pre-blurred version of the original imageor a scale level from previous octave can be fed into the first level of the octave. In this case, we extract featuresfrom 3 scale level per octave. Therefore, 6 scale levels and 5 DoG levels are constructed.44It is easy to modify the image size by re-arranging the threads and block man-agement. However, altering the Gaussian kernel size involves some optimizationefforts, so as to maintain the good performance. Specifically, the compiler level op-timization technique, loop unrolling, requires the Gaussian kernel size to be knownat the compile time. In that case, the kernel radius is passed as a template, whichallows kernels with different sizes to be used in a Gaussian blur function on theGPU.Loop unrolling is used to optimize the instruction throughput. Considering thebasic loop involved in convolution function:for(int k = -KERNEL_RADIUS;k <= KERNEL_RADIUS;k ++)sum += data[smemPos + k]* d_Kernel[KERNEL_RADIUS - k];.Listing 4.1: This list shows the basic loop involved in the convolution func-tion.In each iteration of the loop, the accumulation of the element multiplication takesa few instructions while loop branching instructions are also involved to keep theloop executing properly. 
Compared to the amount of the computation involved ineach iteration, the overhead resulting from control instructions significantly capsthe overall performance. As a result, loop unrolling is performed at compile timeto unwind the loop so as to reduce run time. Specifically, control instructionsare eliminated while computation instructions are re-arranged in proper order bythe compiler. The loop unrolling technique is enabled by placing #pragma unrollbefore the loop affected. With unrolling of the innermost convolution loop (Listing4.1), the overall performance is more than doubled.Since loop unrolling is a complier level technique, the length of the loop de-fined by the Gaussian kernel radius has to be known at compile time as well. TheKERNEL RADIUS symbol in Listing4.1 has to be a pre-defined constants. If thekernel radius is passed as an function parameter, the overall performance deterio-rates by a factor of 3 compared to the implementation with the kernel radius fixed.The issue is resolved using a C++ template. Instead of passing the kernel radiusdirectly to the GPU function, a switch statement is implemented in a wrapper,which launches the GPU kernel function with the kernel radii defined as templates.45In this configuration, the compile time is increased, since the same GPU functionhas to be compiled multiple times. Nevertheless, the execution time is the same asin the case in which the blurring function has a fixed kernel radius.In figure 4.2 and 4.3, the scale levels and DoG functions for all 6 octaves areshown. The original image was taken during an invivo prostatectomy surgery.4.2 Feature Detection4.2.1 Finding Local ExtremaOnce the scale space is constructed, DoG images are computed by using subtractimage kernel. The features are extracted by searching the DoG pyramid for maximaand minima in 3D scale space. Feature extraction takes place on the 2nd, 3rd, and4th DoG levels. Extracting features on each DoG level requires adjacent upper andlower DoG levels. In the worst case, a pixel is compared to all its 26 neighbors in3D DoG functions. In most cases, a pixel is eliminated after a few comparisons.The middle DoG images are divided into 16× 16 blocks with a GPU threadassigned to each pixel. The thread checks if the pixel is larger or smaller than all its8 neighbors on the middle DoG level. Since the pixels in the middle DoG imagesare repeatedly accessed by threads assigned to adjacent pixels, the middle DoGimages are fetched into shared memory for repeated low latency access. In case apixel passes the test on the middle DoG image, the assigned thread is allowed tomove on to upper and lower DoG images checking if the pixel is a local extrema in3D neighborhood. Accessing upper and lower images involves scattered memoryaccess, which is not an efficient access pattern. However, since the majority of thepixels fail the test on the middle DoG images, the scattered access will not hurtoverall performance too much.For each feature pixel passing the extrema test, the assigned thread has to writethe feature location and scale into a feature list. The feature pixels are scatter-ing over a 2D image while concurrent and time-slicing threads are allocated tothe pixels. In this configuration, it is challenging to organize the feature infor-mation into a list, due to interaction and racing between threads. In some earlyimplementation[4, 39, 50], the feature information is stored back to a 2D image46which is fetched back to CPU. 
The 2D feature information is arranged into a fea-ture list and fetched back to GPU for further processing. Nevertheless, data com-muting between CPU and GPU involves low bandwidth and high latency buses.As a result, one should avoid fetching data back and forth between 2 processors.A GPU based SURF implementation[13] pre-processes the 2D feature informa-tion, which reduces the data size fetched back to CPU. In this implementation, theatomic feature on modern GPU is used to generate the feature list natively on theGPU.A global counter residing on the GPU memory keeps track of the number offeatures in the list. In case one thread detects a feature, the thread atomically ac-cesses the global counter and increases the counter by 1. The atomic incrementfunction built in CUDA also returns the old value of the counter. The old countervalue can index the memory location for the feature extracted by the current ac-cessing thread. Since atomic increment has been done on the counter, the nextthread has another memory location indexed by the counter. In this configuration,the interaction and racing between different threads are eliminated.4.2.2 Sub-sample InterpolationFor each local extrema in 3D DoG function, sub-sample interpolation can be achievedbased Taylor expansion around the extrema pixel [26],D(x) = D+∂DT∂x +12xT∂ 2D∂ 2x , (4.4)where D is the DoG function and x = (x,y,σ)T is a small offset in scale space. Thesub-sample interpolation location is computed by taking the derivative of the DoGfunction with respect to x and setting it to zero, resulting inxˆ =−∂ 2D∂ 2x−1 ∂D∂x , (4.5)47the second order partial derivative is the 3×3 Hessian matrix as:∂ 2D∂ 2x =∂ 2D∂x2∂ 2D∂x∂y∂ 2D∂x∂ s∂ 2D∂y∂x∂ 2D∂y2∂ 2D∂y∂ s∂ 2D∂ s∂x∂ 2D∂ s∂y∂ 2D∂ s2 , (4.6)while∂D∂x =(∂D∂x∂D∂y∂D∂ s). (4.7)In equation 4.5, the inversion of the Hessian matrix can be carried out in close-formfrom Cramer’s rule as:A−1 =1det(A)DyyDss−DysDys −(DxyDss−DysDxs) DxxDys−DyyDxs−(DxyDss−DysDxs DxxDss−DxsDxs −(DxxDys−DxyDxs)DxxDys−DyyDxs −(DxxDys−DxyDxs) DxxDyy−DxyDxy .(4.8)The DoG value at sub-sample location xˆ can then be expressed as:D(xˆ) = D+12∂D∂xTxˆ. (4.9)A threshold can be placed on the magnitude of the final interpolated DoG value toensure feature quality. Another approach is to sort all the features extracted fromthe image base on the magnitudes of their interpolated DoG values. A number offeatures with greater magnitudes can be selected while the rest can be discarded.The stopping condition for the interpolation is met when all components of xˆare less than 0.5. If any component of xˆ is greater than 0.5, the DoG extrema iscloser to one of the neighbors. As a result, the feature location is updated to theneighbor suggested by xˆ and then we can make another interpolation attempt. If theupdated location goes of out the image boundary, the interpolation stops. For eachDoG extrema, we make 5 attempts interpolating the sub-sample location based onequation 4.5.484.2.3 Edge Effect EliminationIt has been pointed out that the DoG function also reacts to edges. Features de-tected on edges are not stable, because they can easily drift along the edges be-tween frames. Lowe[26] suggests to detect edge features by investigating the ratiobetween eigenvalues of the following 2D Hessian matrix:H =(∂ 2D∂x∂x∂ 2D∂x∂y∂ 2D∂x∂y∂ 2D∂y∂x). (4.10)The trace and determinant of H are:Tr(H) =∂ 2D∂x∂x +∂ 2D∂y∂y = α+β , (4.11)Det(H) =∂ 2D∂x∂x∂ 2D∂y∂y − (∂ 2D∂x∂y)2 = αβ , (4.12)where α is the eigenvalue with greater magnitude. 
If r is the ratio between thegreater eigenvalue and smaller eigenvalues, thenTr(H)2Det(H)=(α+β )2αβ =(rβ +β )2rβ 2 =(r+1)2r. (4.13)(r+1)2r increases with increasing ratio between eigenvalues. Therefore, one can usethis quantity to control the ratio by applying a threshold. In practice, the thresholdis set to be 10, as suggested by Lowe [26].4.3 Feature ResultsIn this section, the 2D feature locations generated for a test image were tested.Our sift implementation generates very similar result to the result by Hess[21].The differences are caused by different methods used to invert Hessian matrix inequation 4.5. As shown in figure 4.4, our SIFT features are reasonably close tothe features generated by Lowe [26]. The vlSIFT [48] features are compared toLowe’s features in figure 4.5. There are significant differences between the featuresgenerated from vlSIFT and original SIFT implementation. The parameters used for49our implementation and vlSIFT are the same. The pre-compiled demo program ofLowe’s original SIFT implementation does not allow parameter tuning. Since wecan not set the DoG threshold for feature extraction, we can not control the numberof features extracted from a given image.50Figure 4.2: The scale levels for all 6 octaves. The top 2 rows show the scale levels for the first octave. The images forthe second octave are shown in the third row with the rest the octave images shown in the last row.51Figure 4.3: The DoG functions for all 6 octaves. The top 2 rows show the images for the first octave. The images forthe second octave are shown in the third row with the rest the octave images shown in the last row.52Figure 4.4: The features generated by our implementation (blue cross) arecompared to the features produces by original SIFT implementation byLowe [26] (green dots).53Figure 4.5: The vlSIFT features (blue cross) are compared to the originalSIFT features (green dots). They are significantly different.54Chapter 5Stereo Calibration5.1 Camera ModelEach camera of the stereo endoscope is calibrated according to the pinhole cameramodel,suv1= PXYZ1, where P = A[R|t],A =fx 0 cx0 fy cy0 0 1 .(5.1)In equation 5.1, P is the camera projection matrix. s is a scale factor while [R|t] isjoint rotation-translation matrix of the extrinsic parameters, describing the transfor-mation between the camera coordinates and the world coordinates. In the intrinsicmatrix A, fx and fy are the focal lengths and (cx,cy) is the optical center. (X ,Y,Z)are the coordinates of a 3D point in world coordinate space while (u,v) are thecoordinates of the projection point on the image plane. A schematics of the pinhole camera model is provided in Figure 5.1. Radial and tangential distortions are55ሺܿ௫ǡܿ௬ሻܻܼܺሺݑǡ𝑣ሻcamera i mage planeoptical axisFigure 5.1: A schematics of the pin hole camera model.modeled as:x′ = x(1+ k1r2 + k2r4)+ [2p1xy+ p2(r2 +2x2)], (5.2)andy′ = x(1+ k1r2 + k2r4)+ [p1(r2 +2y2)+2p2xy], (5.3)where x, y, and r can be defined as:x =XZ, y =YZ, and r = x2 + y2. (5.4)k1 and k2 are the tangential distortion coefficients, p1 and p2 are the radial distortioncoefficients. 
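Combining the distortion terms above with the pixel mapping given next in equation (5.5), the forward projection can be sketched as follows. The coefficient roles in the comments follow Table 5.1, and the numbers in main() are placeholders of the same order of magnitude as the calibration in Table 5.2, not a measured result.

// Sketch of the forward projection with the radial-tangential distortion model.
#include <cstdio>

// fx, fy, cx, cy: pinhole intrinsics; k1, k2: radial and p1, p2: tangential
// distortion coefficients (cf. Table 5.1).
struct Intrinsics { double fx, fy, cx, cy, k1, k2, p1, p2; };

void project(const Intrinsics& A, double X, double Y, double Z, double& u, double& v) {
    double x = X / Z, y = Y / Z;                      // normalized camera coordinates
    double r2 = x * x + y * y;
    double radial = 1.0 + A.k1 * r2 + A.k2 * r2 * r2;
    double xd = x * radial + 2.0 * A.p1 * x * y + A.p2 * (r2 + 2.0 * x * x);
    double yd = y * radial + A.p1 * (r2 + 2.0 * y * y) + 2.0 * A.p2 * x * y;
    u = A.fx * xd + A.cx;                             // pixel coordinates, equation (5.5)
    v = A.fy * yd + A.cy;
}

int main() {
    Intrinsics A{1570.0, 1570.0, 660.0, 530.0, -0.3, 0.5, 0.0, 0.0};
    double u, v;
    project(A, 10.0, -5.0, 100.0, u, v);              // a point about 10 cm in front
    std::printf("u = %.1f px, v = %.1f px\n", u, v);
    return 0;
}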
Finally, the 2D image coordinates can be written as:u = fxx′+ cx,y = fyy′+ cy.(5.5)In summary , the unknown parameters to be determined by the calibration processfor a single camera are listed in table 5.156Table 5.1: Intrinsic parameters to be determined for a single camera model.fx focal length in x axisfy focal length in y axiscx optical center in x axiscy optical center in y axisk1 first order radial distortion coefficientk2 second order radial distortion coefficientp1 tangential distortion coefficient 1p2 tangential distortion coefficient 25.2 Related WorkIn chapter 2 of [29], an overview of the existing camera calibration techniques ispresented. As shown in equation 5.1, the goal of calibration is to solve intrinsicparameters (A) and extrinsic parameters ([R|t]) based on 3D points in space andtheir corresponding 2D projections in the image plane. Specifically, given the 3Dpoints and their corresponding 2D image projections, [X ,Y,Z]T and [u,v]T can bedefined respectively. An analytical method can be used to roughly estimate A and[R|t]. The estimation can be refined using non-linear optimization.Based on the reference object used in the calibration, Zhang [29] classifies theexisting camera calibration techniques into 4 categories as 3D, 2D, 1D, and 0Dcalibration.• The 3D calibration requires a 3D reference object. A good example of sucha reference object is presented by Faugeras [15]. In this work, the referenceobject consists of two planes which are perpendicular to each other. On thesurface of each plane, easily detectable corners are printed as calibrationpatterns. Alternatively, in the work by Tsai [46], two calibration images arecaptured on a calibration plane at 2 different known positions in space. Sincethe capture locations are known, the corners detected at 2 capture instancescan be treated as corners extracted from an actual 3D object.• The 2D calibration requires a planar reference object printed with easily de-tectable features. In the work by Zhang [54], the camera calibration is doneby viewing a planar checkerboard from different angles.57• The 1D calibration requires 3 or more collinear easily detectable points withknown distance between each other [55]. Calibration can be done by movingthe reference object around either one of the points.• The 0D calibration refers to calibration without reference object. This typeof calibration can be done based on point correspondences observed from atleast 3 different views [16] [20].As the dimension of the calibration reference object decreases, there are less con-straints placed on the intrinsic and extrinsic parameters, which increases the prob-lem difficulty and reduces calibration accuracy. For different calibration tech-niques, different analytical and optimization methods are used to solve for thecamera parameters.Although 3D calibration provides the best accuracy, it is relatively hard to makethe reference object. In the case of the 3D reference object used in [15], the twoplanes have to be strictly perpendicular to each other in order to provide the bestresult. Otherwise, the 3D points fed into [X ,Y,Z]T are not accurate. Compared tothe 3D reference object, the 2D checkerboard is much easier to make. Therefore,the method described in [54] has become a widely used technique.The camera has to be re-calibrated in case the camera intrinsic parameters havechanged. In our application, this could happen when the surgeon changes the fo-cus of the camera. 
To re-calibrate the camera without pulling the camera out ofthe patient body, it is preferred that no reference object is needed. However, 0Dcalibration generally do not provide good calibration accuracy. As shown in [30],the calibration error with respect to 3D calibration results can reach 27%. Theprojection 2D error is reported to be 0.68 pixels.5.3 2D Calibration ProcedureWe use the camera calibration toolbox for Matlab [6] to calibrate the camera. Thetoolbox is an open source implementation of Zhang’s method [54].58Figure 5.2: An instance of the robot arm grabbing the calibration checker-board. The checkerboard shall be properly placed so that the view ofthe corners is not affected by the specular reflection shown at the bot-tom of the figure.5.3.1 Calibration CheckerboardThe reference object is a checkerboard as shown in figure5.2. The calibration pat-tern is printed using a regular printer and pasted onto a 3 by 3 inch piece of glass.The piece can be acquired at Crown Glass on Broadway. On the checkerboard, thecorners formed by 2 white and 2 black squares can be detected by a corner detectorin the image. During the calibration process, the corners are assumed to be copla-nar. As a result, bending or denting the surface of the calibration board introduceserror in the calibration results. Glass, having a flat surface, is an ideal material fora calibration checkerboard.5.3.2 Video Streaming SetupThe stereo videos can be streamed to a PC and saved onto the hard drive for laterprocessing. Here, proper configurations are required on both the robot and thecomputer sides.59On the daVinci Si robot, the video output option can be selected as: VideoSettings→ Video output→ DVI-I(1280×1024). In addition, to avoid artifacts inthe calibration images, it is recommended that the edge enhancement feature isdisabled. The feature can be find as: Video Settings→ Edge Enhancement.A PC is linked to the Si robot to capture stereo video sequences via two framegrabbers. The capture software, IdealDemo, is available on the desktop of the PC.A document describing the software and the manufacturer information has beenuploaded to the unfuddle repository. In the document, only usage of the daVinci Srobot is discussed. For each frame grabber, a .chp file is required to provide inputvideo information. The proper files are with the Si robot in the VGH basement andare named:• LowerSlot 075787 LeftEye FirstBoard si.chp• UpperSlot 075792 RightEye SecondBoard si.chpSince the PC is also used to capture videos from the da Vinci S model in the OR,the software could use wrong .chp files. In that case, only a portion of the video se-quence is saved, which has a significant impact on the calibration quality. A directconsequence is an increase in uncertainty values reported by the calibration soft-ware. A typical set of uncertainty values are shown in Table 5.2 and 5.3. Similarlythe .chp file used for the S robot are:• LowerSlot 075787 LeftEye FirstBoard original.chp• UpperSlot 075792 RightEye SecondBoard original.chpThere is a live display option window showing the images captured by one ofthe frame grabbers in real time. In fact, this option can cause delays between the2 frame grabbers. As a result, the stereo images are not synchronized anymore.Generally, we capture test videos to ensure the setup is correct. 
After the videocapture is done, we review the videos using video player to ensure capture quality.5.3.3 Calibration Image AcquisitionTo achieve the best calibration accuracy, we suggest that the following guidelinesbe followed:60• The number of images taken should be greater than 16. As suggested byZhang [54], the error tends not to reduce much after 16 images. 20 imagesare usually taken in a regular calibration routine, in case some of the imagesare of poor quality.• As suggested by Zhang [54], if 2 images are taken while the checkerboardundergoes a pure translation, the second image does not contribute to overallaccuracy. As a result, all images should be taken at different view angles.• In a regular calibration routine, it is recommended to cover a wide rangeof angles around all 3 axes. One can started with placing the checkerboardparallel to the camera image plane, followed by extreme angles around eachaxis. The extreme angles can be -60 and 60 degree. Rest of the images canbe taken at angles within those bounds.• Illumination can cause part of the image to be saturated. One should avoidsaturation to occur in the effective checkerboard region.• The checkerboard in a depth that allows the checkerboard to occupy a largeportion of the image.A typical placing of the checkerboard is shown in figure 5.3. The checkerboardcan be carried with a da Vinci robot tool or a third hand as shown in figure 5.4.5.3.4 Calibration Image ProcessingA detail tutorial is available online at http://www.vision.caltech.edu/bouguetj/calibdoc/htmls/example.html. In summary, the user is prompted to define four cornersof the effective checkerboard region. The saddles points of the checkerboard areextracted for each checkerboard pose. The 2D locations are fed into a closed formsolver followed by a non-linear optimizer. The goal is to optimize the intrinsicparameters listed in Table 5.2 by minimizing the projected 2D pixel error for eachextracted saddle point. The calibrated parameters from the first calibration iterationcan be used to project the saddle point locations in the images, which can be used torefine the saddle point location extraction. The new refined saddle point locationsare then fed into the next calibration iteration. This iterative process ends until61Figure 5.3: A typical set of calibration images.Figure 5.4: The checkerboard can be carried with a third hand and moved todifferent poses with respect to the endoscope.62Table 5.2: Intrinsic calibration results for both cameras in the stereo endo-scope.Left Rightfx 1.57×103±1.85 fx 1.57×103±1.84fy 1.57×103±1.92 fy 1.57×103±1.92cx 6.68×102±3.65 cx 8.07×102±3.90cy 5.30×102±3.31 cy 5.28×102±3.27k1 −3.17×10−1±1.19×10−2 k1 −3.89×10−1±9.38×10−3k2 5.19×10−1±8.73×10−2 k2 6.11×10−1±5.79×10−2p1 4.01×10−3±3.90×10−4 p1 −1.89×10−3±3.50×10−4p2 3.60×10−4±3.50×10−4 p2 1.17×103±3.4×10−4the calibration results converge. For stereo calibration, the final refined saddlepoint locations and intrinsic parameters of each individual camera are fed into anon-linear optimizer, so as to globally optimize the intrinsic parameters for bothcameras and the transformation between them.5.3.5 Calibration Quality MeasurementMeasurement UncertaintyA typical measurement of intrinsic parameters for both cameras on the stereo en-doscope along with uncertainties are presented in Table 5.2. The position of theright camera with respect to the left camera is listed in Table 5.3. 
The results are intended to give the reader an idea of the ratio between the uncertainties and the measurements.

Reprojected Error

Based on the camera model and the calibration results, the saddle point locations can be reprojected back into the calibration images. A typical reprojected pixel error distribution for a single camera calibration is shown in Figure 5.5. If the reprojected errors are relatively large for certain images (Figure 5.6), these images should be excluded from the current calibration iteration, and the calibration should be carried out with the rest of the calibration images.

Figure 5.5: Reprojected pixel error distribution for a typical calibration set.

Figure 5.6: Reprojected pixel error distribution for a case where the saddle point extraction quality is very poor for one of the calibration images. In the figure, the green points that are far away from the rest of the points are extracted from the same calibration image. These points should be excluded from the current calibration iteration and refined later.

Table 5.3: Extrinsic results describing the position of the right camera with respect to the left camera.

  Rotation vector                 [6.70×10^-3   1.26×10^-2   -5.88×10^-3]
  Rotation vector uncertainty     [2.74×10^-3   3.17×10^-3    1.60×10^-4]
  Translation vector              [-5.44        2.10×10^-1    1.01×10^-1]
  Translation vector uncertainty  [2.16×10^-2   2.13×10^-2    1.13×10^-1]

Rectification Error

Based on the calibration results, the stereo images can be rectified such that the epipolar lines are aligned with the horizontal axis. After rectification, matched features extracted from the stereo images should have similar y values, so for each matched feature pair the rectification error can be defined as the difference in y values. For a good calibration set, the rectification error is shown in Figure 5.7.

For the images taken from the Dell computer with the AccuStream 75 frame grabbers, the rectification errors for two different calibration sets are shown in Figure 5.8. The images were taken on April 4th, 2014. The rectification errors appear to have offsets along with greater scattering ranges, and they are not reproducible from case to case. The cause of this issue may be a hardware problem in the capture device, since the issue occurs with both da Vinci robots at VGH. The problem is unlikely to be caused by the image processing procedures, because the same procedure applied to the demo calibration images provided in [6] produces correct rectification errors. Despite the offset in the rectification error, the 3D reconstruction error is shown to be around 0.35 mm for the extracted saddle points at 10 cm, as shown in Figure 5.9b.

Figure 5.7: Rectification error for a good calibration set. The rectification errors are distributed evenly around 0. Feature matches with large rectification errors can be considered outliers; a threshold should be determined based on the specific data.

Figure 5.8: For calibration images taken from our capture device, the rectification errors have offsets along with a greater scattering range. The rectification errors are not reproducible from case to case.
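The rectification error measurement can be made concrete with the sketch below: a stereo pair is rectified with the calibration results, features are matched across the pair, and the per-match difference in y is reported. This is a minimal sketch assuming the intrinsics K1, K2, distortion vectors d1, d2, and extrinsics R, T are available as NumPy arrays; ORB is used here only because it ships with OpenCV, as a stand-in for the SIFT/BRIEF matching used elsewhere in this work.

# Minimal sketch (assumed inputs): measure the rectification error as the
# difference in y between matched features in a rectified stereo pair.
import cv2
import numpy as np

def rectification_error(img_l, img_r, K1, d1, K2, d2, R, T):
    size = img_l.shape[1], img_l.shape[0]
    # Rectifying rotations and new projection matrices from the calibration.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, size, R, T)
    m1l, m2l = cv2.initUndistortRectifyMap(K1, d1, R1, P1, size, cv2.CV_32FC1)
    m1r, m2r = cv2.initUndistortRectifyMap(K2, d2, R2, P2, size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, m1l, m2l, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, m1r, m2r, cv2.INTER_LINEAR)

    # Detect and match features in the rectified pair.
    orb = cv2.ORB_create(1000)
    kp_l, des_l = orb.detectAndCompute(rect_l, None)
    kp_r, des_r = orb.detectAndCompute(rect_r, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_l, des_r)

    # Rectification error: y difference of each matched pair (ideally near 0).
    dy = np.array([kp_l[m.queryIdx].pt[1] - kp_r[m.trainIdx].pt[1] for m in matches])
    return dy

# Example: dy = rectification_error(left, right, K1, d1, K2, d2, R, T)
#          print(dy.mean(), dy.std())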
5.4 Focus Change Recovery

For a typical surgery, the stereo endoscope is calibrated before insertion into the patient body. Before starting the surgery, the surgeon adjusts the focus of the camera to obtain the best possible view of the surgical field. Such a focus change can alter the physical properties of the camera, which leads to variations in the camera parameters. In this section, we investigate the feasibility of recovering the focus at which the calibration took place. If the focus recovery is successful, the preoperative calibration results can provide better 3D reconstruction results, which ultimately leads to better surface tracking results.

The following experiment procedure is designed to investigate the feasibility:

• The camera is originally focused at an object placed at 10 cm.
• A set of 10 individual calibrations is performed.
• A series of de-focus and focus procedures is performed to simulate the effect of focus adjustment by the surgeon.
• The camera is re-focused at 10 cm.
• Another set of 10 individual calibrations is performed.

20 images are used in each individual calibration. To speed up data acquisition, for each calibration set, 10 images are taken at around one pose, while 20 different poses are chosen to provide 20 images for each individual calibration. The two sets of calibration results are compared against each other to show the effect of the focus change. Since there are 10 measurements in each set, the effect of the focus change can be separated from the measurement uncertainty of the calibration technique.

Both sets of calibration results are shown in Figures 5.10−5.17. The two sets of calibration results are shown in blue and red respectively. The horizontal lines indicate the average measurement of each calibration set.

As shown in the figures, the variations between the 2 sets of calibration results are reasonably small compared to the measurement uncertainty of the calibration technique. For the focal length (fc) and optical center (cc) measurements, the variation between the 2 calibration sets is less than 0.5%, which is statistically insignificant. In terms of radial distortion, a slightly greater variation is observed at the second order coefficient kc2 for the left image; the variation is less than 2%. The x component of the translation vector between the stereo cameras (T1) varies by 1% between the 2 calibration sets. The rest of the terms in the calibration results are known to be close to zero and are not significant.

In a real surgical scenario, the preoperative calibration can always provide a rough 3D location of an object by reconstructing the 2D features extracted from the object surface. This estimate can help place the object at the focus depth at which the preoperative calibration took place. By re-focusing the camera at the object depth, the focus can be recovered to within 2% of the original values.

5.5 Calibration Depth

In this section, an experiment is designed to investigate the effect of the calibration depth. In this experiment, 10 sets of individual calibrations are performed at a depth of 10 cm. To investigate the depth effect, another 10 calibration sets are performed at a depth of 15 cm. The comparison of the calibration results is presented in Figures 5.18−5.25. As shown in the figures, the variation between the calibration results at different depths is considerable. Compared to the results taken at the closer depth, the calibration results at the far depth are inconsistent.

To further investigate the calibration accuracy, we use one of the calibration sets as a test set.
The saddle points extracted in the test set are reconstructed in 3D based on the 9 calibration results measured at each depth. In this case, for each depth, we have one test set and 9 sets of calibration results. Given the geometry of the checkerboard used, the distances between adjacent saddle points are exactly 5 mm. In the experiment, the accuracy of each calibration is measured based on how well this distance is recovered. Specifically, the accuracy is characterized by the mean and standard deviation of the recovered distances between adjacent saddle points.

As shown in Figure 5.9, the calibration results acquired at 10 cm can reconstruct test points at both 10 and 15 cm. The standard deviation is expected to increase with depth, due to the stereo geometry. The calibration results acquired at 15 cm generate poor reconstruction results for the test points acquired at 10 cm.

In conclusion, calibration should be performed at a depth relatively close to the camera. Such a configuration can produce a set of calibration results that works for all depths. Calibrations performed at a greater depth can produce inconsistent and unreliable results. One possible reason for the poor calibration results at greater depth is that the 2D pixel error is not sensitive to parameter variations at greater depth. Specifically, the calibration solver minimizes the pixel error to optimize the camera parameters. At a greater depth, variation in the parameters induces relatively smaller pixel errors. As a result, the solver is not effective.

Figure 5.9: The mean and standard deviation of the reconstructed distances between adjacent saddle points. (a) mean (b) standard deviation. The green curve shows the reconstructed close point clouds using calibration results at far depth. The red curve shows the reconstructed far point clouds using calibration results at far depth. The black curve shows the reconstructed far point clouds using calibration results at close depth. The blue curve shows the reconstructed close point clouds using calibration results at close depth.
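The accuracy measure just described can be computed directly from the reconstructed saddle points. The sketch below assumes the triangulated corners are available as a grid of 3D points ordered row by row (the grid shape is an assumption) and reports the mean and standard deviation of the recovered adjacent-corner distances, which should be close to the true 5 mm spacing.

# Sketch: characterize calibration accuracy by the recovered distances between
# adjacent checkerboard saddle points (true spacing: 5 mm). `points3d` is the
# triangulated grid of corners, shape (rows, cols, 3), in mm; the grid layout
# is an assumption for illustration.
import numpy as np

def adjacent_distance_stats(points3d):
    # Distances between horizontal neighbours and vertical neighbours.
    d_h = np.linalg.norm(points3d[:, 1:, :] - points3d[:, :-1, :], axis=2)
    d_v = np.linalg.norm(points3d[1:, :, :] - points3d[:-1, :, :], axis=2)
    d = np.concatenate([d_h.ravel(), d_v.ravel()])
    return d.mean(), d.std()

# Example: mean, std = adjacent_distance_stats(points3d)
# A well-behaved calibration recovers a mean close to 5 mm with a small
# standard deviation (around 0.3 mm at 10 cm in Figure 5.9).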
Figure 5.10: The calibration results for the left eye focal length.

Figure 5.11: The calibration results for the right eye focal length.

Figure 5.12: The calibration results for the left eye optical center.

Figure 5.13: The calibration results for the right eye optical center.

Figure 5.14: The calibration results for the left eye distortion parameters.

Figure 5.15: The calibration results for the right eye distortion parameters.

Figure 5.16: The results for the translation vector between the stereo cameras.

Figure 5.17: The calibration results for the axis angles between the stereo cameras.

Figure 5.18: Left eye focal lengths at different depths.

Figure 5.19: Right eye focal lengths at different depths.

Figure 5.20: Left eye optical centers at different depths.

Figure 5.21: Right eye optical centers at different depths.

Figure 5.22: Left eye distortion parameters at different depths.

Figure 5.23: Right eye distortion parameters at different depths.

Figure 5.24: Translation vector between stereo cameras at different depths.

Figure 5.25: Axis angles between stereo cameras at different depths.

Chapter 6
Results and Discussion

6.1 Simulation Results

The basic functions of the EKF are validated with synthetic motion states and 3D point clouds at different time instances. The 2D feature measurement in each stereo channel and the stereo triangulation are not included in the simulation.

6.1.1 Synthetic Data Generation

In the simulation, an initial point cloud is generated 100 mm away from the stereo endoscope. This is the typical depth from the endoscopic camera to the surgical site. The 3D points in the initial point cloud scatter around (0,0,100)^T with a standard deviation of 30 in all 3 axes. For each consecutive time instance, the initial point cloud is transformed into the current coordinates based on the simulated motion states.
Synthetic errors are generated and added on top of the initial point cloud and all of the point clouds generated for consecutive time instances. The synthetic errors are random variables with variances of 0.3, 0.3, and 1.5 for the x, y, and z axes respectively. The variances are determined based on the uncertainty measurements shown in Figure 3.6d. In this simulation, the measurement is assumed to be successful for all 3D points in the initial point cloud. For the motion states, we constantly vary the linear and angular velocities. The simulated motion states are shown in Figures 6.1 and 6.2. The variation of the linear and angular velocities over time is shown in Figures 6.3 and 6.4. Each component of the linear velocity changes linearly between 6 and −6 mm/s, while each component of the angular velocity varies linearly between 6 and −6 deg/s. The simulation is carried out for 3600 frames, equivalent to one minute of video captured at a rate of 60 Hz.

Figure 6.1: The simulation results for the linear components of the motion states. In the figure, the red curve shows the ground truth. The green curve shows a case where δa = 0.5 mm/s², δα = 0.05 rad/s². The blue curve shows a case where δa = 100 mm/s², δα = 100 rad/s².

Figure 6.2: The simulation results for the angular components of the motion states. In the figure, the red curve shows the ground truth. The green curve shows a case where δa = 0.5 mm/s², δα = 0.05 rad/s². The blue curve shows a case where δa = 100 mm/s², δα = 100 rad/s².

Figure 6.3: The simulation results for the linear velocity components of the motion states. In the figure, the red curve shows the ground truth. The green curve shows a case where δa = 0.5 mm/s², δα = 0.05 rad/s². The blue curve shows a case where δa = 100 mm/s², δα = 100 rad/s².

Figure 6.4: The simulation results for the angular velocity components of the motion states. In the figure, the red curve shows the ground truth. The green curve shows a case where δa = 0.5 mm/s², δα = 0.05 rad/s². The blue curve shows a case where δa = 100 mm/s², δα = 100 rad/s².

6.1.2 Simulation Results with Different Motion Parameters

We conduct simulations with different motion parameters to study the effect of δa and δα, which control the motion smoothness. Meanwhile, we also investigate the impact of the initial covariance settings.

As shown in Table 6.1, if δa and δα are set too small, the system does not have sufficient acceleration to keep up with the motion (Figures 6.1 and 6.2). Greater δa and δα values cause the velocity estimation to oscillate around the ground truth. Even when the velocity estimation is poor, the system can still track the object, although with a higher Root Mean Square (RMS) error due to the estimation oscillation. In this simulation, the terms in the initial motion covariance matrix are set to 0 while the matrix is regulated by the Q term in the EKF prediction step. For tracking tissue, these two parameters should be tuned based on tracking results on large sets of data.

To study the impact of the initial covariance settings, we set the diagonal terms of the motion covariance matrix to ∆. The simulation results are shown in Table 6.2. The simulation is performed while δa and δα are set to 100.
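The synthetic data of Section 6.1.1 can be reproduced in a few lines; the sketch below generates the initial cloud, applies the simulated rigid motion, and adds the axis-dependent measurement noise. It is a minimal sketch: the point count, random seed, and the exact transform convention (taken here to mirror the measurement model h_i = R̄(y_i − r) of Appendix A) are assumptions, and the original simulation code is not reproduced here.

# Minimal sketch of the synthetic data generation in Section 6.1.1.
# Assumed details: 60 Hz, 3600 frames, 20 points, velocities ramped linearly
# between +6 and -6 mm/s (deg/s) as described in the text.
import numpy as np
from scipy.spatial.transform import Rotation as R

rng = np.random.default_rng(0)
n_points, n_frames, dt = 20, 3600, 1.0 / 60.0

# Initial cloud centred 100 mm in front of the camera, sigma = 30 per axis.
cloud0 = np.array([0.0, 0.0, 100.0]) + 30.0 * rng.standard_normal((n_points, 3))

# Measurement noise standard deviations for x, y, z (variances 0.3, 0.3, 1.5).
meas_std = np.sqrt([0.3, 0.3, 1.5])

def velocities(k):
    """Linear ramp of each velocity component between +6 and -6 (mm/s, deg/s)."""
    s = 6.0 - 12.0 * k / (n_frames - 1)
    return np.full(3, s), np.full(3, s)

r_cam = np.zeros(3)              # camera position
R_cam = R.identity()             # camera orientation
measurements = []
for k in range(n_frames):
    v, omega = velocities(k)
    # Integrate the simulated motion states over one frame.
    r_cam = r_cam + v * dt
    R_cam = R.from_rotvec(np.deg2rad(omega) * dt) * R_cam
    # Transform the initial cloud into the current camera coordinates
    # (convention assumed to follow h_i = R_bar (y_i - r)) and add noise.
    cloud_k = R_cam.apply(cloud0 - r_cam)
    cloud_k += meas_std * rng.standard_normal(cloud_k.shape)
    measurements.append(cloud_k)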
The results indicate that the initial covariance setting has little impact on the tracking performance.

Table 6.1: Simulation results with different δa and δα. ∆ = 0.

                       δa=0.5, δα=0.05   δa=25, δα=0.5   δa=100, δα=100   δa=10^6, δα=10^6
  RMS(rx) (mm)         2.85              3.27×10^-1      4.35×10^-1       1.29
  RMS(ry) (mm)         2.97              3.80×10^-1      4.20×10^-1       1.13
  RMS(rz) (mm)         1.73              1.35×10^-1      1.76×10^-1       3.98×10^-1
  RMS(θx) (deg)        1.58              1.97×10^-1      2.39×10^-1       6.02×10^-1
  RMS(θy) (deg)        1.49              1.65×10^-1      2.49×10^-1       6.79×10^-1
  RMS(θz) (deg)        4.97×10^-1        1.54×10^-1      4.63×10^-1       4.92×10^-1
  RMS(vx) (mm/s)       4.12              9.70×10^-1      2.34             7.40×10^2
  RMS(vx) (mm/frame)   6.87×10^-2        1.62×10^-2      3.90×10^-2       1.23
  RMS(vy) (mm/s)       4.30              1.37            2.34             7.43×10^2
  RMS(vy) (mm/frame)   7.17×10^-2        2.29×10^-2      3.90×10^-2       1.24
  RMS(vz) (mm/s)       3.56              7.93×10^-1      1.75             2.37×10^2
  RMS(vz) (mm/frame)   5.93×10^-2        1.32×10^-2      2.92×10^-2       3.95×10^-1
  RMS(ωx) (deg/s)      2.63              9.93×10^-1      7.05             4.00×10^2
  RMS(ωx) (deg/frame)  4.39×10^-2        1.66×10^-2      1.18×10^-1       6.67×10^-1
  RMS(ωy) (deg/s)      1.84              5.96×10^-1      5.92             4.00×10^2
  RMS(ωy) (deg/frame)  3.07×10^-2        9.90×10^-3      9.87×10^-2       6.67×10^-2
  RMS(ωz) (deg/s)      1.85              9.27×10^-1      2.51×10^2        2.95×10^2
  RMS(ωz) (deg/frame)  3.09×10^-2        1.55×10^-2      4.18×10^-1       4.91×10^-1

6.1.3 Simulating the Drift Effect Due to Feature Turnover

A simulation is conducted to show the effect of feature turnover on the overall tracking results. The simulation results are shown in Figures 6.5a and 6.5b. In this simulation, the same configuration is used as described in Section 6.1.1. To simulate the effect of feature turnover, 20 initial features are generated to start the EKF. Every 40 frames, 5 original features are replaced with new features. The simulation lasts for 3600 frames, corresponding to one minute.

Figure 6.5: The simulation results show the effect of drifting due to feature turnover in the surface map. In the figure, the green plots show the tracking errors with feature turnover. The blue plots show the tracking errors without feature turnover. Figure (a) shows the translation errors in mm while Figure (b) shows the angle errors in radians. The turnover rate is 5 features every 40 frames.

6.1.4 Simulation Results with a Dynamic Motion Model

In this section, a different motion model is implemented and tested on simulated data. In this motion model, we damp the velocities by a factor of η. Specifically, the motion states are updated as:

\[
x_v(t|t-1) = f_v\big(x_v(t-1|t-1), n\big) =
\begin{pmatrix}
r(t-1|t-1) + \big(v(t-1|t-1) + V\big)\Delta t \\
q(t-1|t-1) \times q\big((\omega(t-1|t-1) + \Omega)\Delta t\big) \\
(1-\eta)\, v(t-1|t-1) + V \\
(1-\eta)\, \omega(t-1|t-1) + \Omega
\end{pmatrix}. \tag{6.1}
\]

The Jacobian \(\partial f_v / \partial x_v\) can be computed as:

\[
\frac{\partial f_v}{\partial x_v} =
\begin{pmatrix}
I & 0 & (1-\eta) I \Delta t & 0 \\
0 & \frac{\partial q(t|t-1)}{\partial q} & 0 & \frac{\partial q(t|t-1)}{\partial \omega} \\
0 & 0 & (1-\eta) I & 0 \\
0 & 0 & 0 & (1-\eta) I
\end{pmatrix}. \tag{6.2}
\]

Table 6.2: Simulation results with different ∆'s.
δa = 100 and δα = 100.

                       ∆ = 0          ∆ = 1000       ∆ = 10^6
  RMS(rx) (mm)         4.31×10^-1     6.18×10^-1     8.85×10^-1
  RMS(ry) (mm)         4.71×10^-1     5.95×10^-1     7.17×10^-1
  RMS(rz) (mm)         2.01×10^-1     2.05×10^-1     3.06×10^-1
  RMS(θx) (deg)        2.48×10^-1     3.33×10^-1     4.10×10^-1
  RMS(θy) (deg)        2.40×10^-1     4.51×10^-1     4.65×10^-1
  RMS(θz) (deg)        4.43×10^-1     4.88×10^-1     5.27×10^-1
  RMS(vx) (mm/s)       2.36           2.43           2.56
  RMS(vx) (mm/frame)   3.94×10^-2     4.05×10^-2     4.26×10^-2
  RMS(vy) (mm/s)       2.33           2.43           2.62
  RMS(vy) (mm/frame)   3.88×10^-2     4.02×10^-2     4.37×10^-2
  RMS(vz) (mm/s)       1.75           1.80           1.81
  RMS(vz) (mm/frame)   2.92×10^-2     3.00×10^-2     3.01×10^-2
  RMS(ωx) (deg/s)      6.79           6.94           6.73
  RMS(ωx) (deg/frame)  1.13×10^-1     1.16×10^-1     1.12×10^-1
  RMS(ωy) (deg/s)      5.89           5.91           5.94
  RMS(ωy) (deg/frame)  9.81×10^-2     9.85×10^-2     9.89×10^-2
  RMS(ωz) (deg/s)      2.28×10^2      2.51×10^2      2.40×10^2
  RMS(ωz) (deg/frame)  3.80×10^-1     4.18×10^-1     4.00×10^-1

In the simulation, we use η = 0.1, and ∆ is set to 0. Two motion parameter configurations, δa = 25, δα = 0.5 and δa = 100, δα = 100, are simulated. The simulation results are compared to the results generated using the constant velocity model. The simulation results are shown in Table 6.3. The velocity results for the δa = 100, δα = 100 configuration are shown in Figure 6.6. As shown in Table 6.3 and Figure 6.6, the dynamic motion model does not seem to improve the tracking quality.

Figure 6.6: The velocity simulation results for the dynamic motion model. (a) The linear velocity results. (b) The angular velocity results. In the figure, the red curve indicates the ground truth. The green curve is the dynamic motion model results. The blue curve is the constant motion model results. The simulation is carried out using the following parameter configuration: δa = 25 mm/s², δα = 0.5 rad/s², ∆ = 0.

Table 6.3: Simulation results with the constant velocity model and the dynamic motion model. In the simulation, ∆ = 0. The simulations are carried out for two motion parameter configurations, δa = 25, δα = 0.5 and δa = 100, δα = 100.

                       δa=25, δα=0.5   δa=25, δα=0.5   δa=100, δα=100   δa=100, δα=100
                       constant        dynamic         constant         dynamic
  RMS(rx) (mm)         3.27×10^-1      4.39×10^-1      4.35×10^-1       9.05×10^-1
  RMS(ry) (mm)         3.80×10^-1      8.73×10^-1      4.20×10^-1       1.03×10^-1
  RMS(rz) (mm)         1.35×10^-1      2.60×10^-1      1.76×10^-1       2.64×10^-1
  RMS(θx) (deg)        1.97×10^-1      4.98×10^-1      2.39×10^-1       5.54×10^-1
  RMS(θy) (deg)        1.65×10^-1      1.96×10^-1      2.49×10^-1       4.60×10^-1
  RMS(θz) (deg)        1.54×10^-1      2.50×10^-1      4.63×10^-1       4.08×10^-1
  RMS(vx) (mm/s)       9.70×10^-1      1.81            2.34             1.96
  RMS(vx) (mm/frame)   1.62×10^-2      3.01×10^-2      3.90×10^-2       3.26×10^-2
  RMS(vy) (mm/s)       1.37            2.94            2.34             1.97
  RMS(vy) (mm/frame)   2.29×10^-2      4.90×10^-2      3.90×10^-2       3.28×10^-2
  RMS(vz) (mm/s)       7.93×10^-1      1.86            1.75             1.70
  RMS(vz) (mm/frame)   1.32×10^-2      3.11×10^-2      2.92×10^-2       2.84×10^-2
  RMS(ωx) (deg/s)      9.93×10^-1      2.20            7.05             7.02
  RMS(ωx) (deg/frame)  1.66×10^-2      3.67×10^-2      1.18×10^-1       1.17×10^-1
  RMS(ωy) (deg/s)      5.96×10^-1      6.15×10^-1      5.92             5.97
  RMS(ωy) (deg/frame)  9.9×10^-3       1.03×10^-2      9.87×10^-2       9.95×10^-2
  RMS(ωz) (deg/s)      9.27×10^-1      1.96            2.51×10^2        2.51×10^2
  RMS(ωz) (deg/frame)  1.55×10^-2      3.27×10^-2      4.18×10^-1       4.19×10^-1

6.2 Ex Vivo Experiment Setup

The EKF based surface tracking framework is first validated on an 1800-frame ex vivo porcine stereo video sequence captured at 60 Hz. During the video acquisition, a piece of pig intestine is hand-held and moved in front of the stereo endoscope of the da Vinci robot.
The visual appearance of the pig intestine surface is similar to that of the human tissue surface.

To provide a reference for validation, a saddle point is placed on the tissue surface, as shown in Figure 6.7a. The Harris corner detector [33] can be used to locate the saddle point with subpixel accuracy. The corner detector requires a reasonable initial guess. For the first frame, we manually locate the saddle points, which serve as an initial guess for the corner detector. For consecutive frames, we use normalized cross correlation to guide the corner detector. The saddle points are located in both stereo channels and the 3D location of the saddle point can be reconstructed from the 2D locations. The initial 3D location of the saddle point is projected into each frame based on the estimated motion states. For each frame, the transformed location is compared with the 3D saddle point measurement, so as to validate the estimation accuracy. The fiducial is masked out during the initial feature extraction. As a result, no feature related to the fiducial is used for surface tracking.

6.3 In Vivo Experiment Setup

Our in vivo data was acquired from a robot-assisted radical prostatectomy procedure performed at our institution. The patient study was approved by the research ethics board under the UBC CREB number: H11-02267. The name of the study is Intra-operative TRUS guidance for RALRP. Once the anterior surface of the prostate was visible, we asked the surgeon to move the camera around the surgical scene. A clipper was placed on the prostate surface. For validation, at time instances 100 frames apart, we manually identify a sharp corner of the clipper (Figure 6.7c) in both stereo images. This allows us to reconstruct the point at each time instance. To verify our tracking results, we project the initial point to each time instance and compare the projected location with the manually extracted one. The manually extracted results are not ground truth, due to human error and reconstruction error. Nevertheless, they still give us an accurate estimate of the tracking error.

When the surgery starts, the surgeons tend to re-focus the camera to get the best view of the surgical scene. Changing focus affects the camera parameters, which results in errors in the 3D reconstruction. For the in vivo study, we calibrated the stereo endoscope after the surgery was finished. The camera focus was not changed in the interval between the video acquisition and the camera calibration after the procedure. Calibrating the camera after the surgery can also be less stressful, because we do not risk contaminating the camera by inadvertently touching a sterilized part. A further study of the focus changing effect is presented in Section 5.4.

Figure 6.7: (a) The validation fiducial and tracking error at each frame for the ex vivo experiment. (b) The clipper corner used for validation and tracking errors at specific time instances for the in vivo human prostate data. The missing data point of the red curve in plot (d) is caused by instrument occlusion. At that time instance, the view of the fiducial is blocked by the instrument, so that it is not detectable.
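The fiducial localization used for validation can be sketched as follows: a normalized cross correlation template match provides the rough position of the saddle point in each new frame, and the corner location is then refined to sub-pixel accuracy. The sketch below uses OpenCV's template matching and cornerSubPix as stand-ins for the corner detector and correlation guidance described above; the window and search sizes are assumptions, and this is illustrative only.

# Illustrative sketch of the fiducial validation tracking: NCC guidance
# followed by sub-pixel corner refinement (window sizes are assumptions).
import cv2
import numpy as np

def locate_saddle(prev_gray, cur_gray, prev_pt, patch=15, search=40):
    """Track one saddle-point fiducial from the previous frame to the current one."""
    x, y = int(round(prev_pt[0])), int(round(prev_pt[1]))
    tmpl = prev_gray[y - patch:y + patch + 1, x - patch:x + patch + 1]
    x0, y0 = max(0, x - search), max(0, y - search)
    roi = cur_gray[y0:y + search + 1, x0:x + search + 1]

    # Normalized cross correlation gives the initial guess in the current frame.
    score = cv2.matchTemplate(roi, tmpl, cv2.TM_CCORR_NORMED)
    _, _, _, best = cv2.minMaxLoc(score)
    guess = np.array([[x0 + best[0] + patch, y0 + best[1] + patch]], np.float32)

    # Refine to sub-pixel accuracy around the NCC guess.
    term = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 40, 1e-3)
    refined = cv2.cornerSubPix(cur_gray, guess.reshape(-1, 1, 2), (11, 11), (-1, -1), term)
    return tuple(refined.ravel())

# The refined left/right locations can then be triangulated (Appendix C) and
# compared against the transformed initial fiducial location.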
6.4 Results and Discussion

The fiducial errors shown in Figure 6.7 are consistently lower than 2.5 mm for both the in vivo and ex vivo cases. The RMS error is 1.6 mm. As shown in Figure 6.8e, the axis angles reach 15°, which shows the tracker is capable of handling reasonable rotations.

In Figure 6.8, we also report the number of 3D features matched in each frame along with the motion state of the camera, including location and pose. In our experiments, we observe a reduction in the number of matched features when the tracked region moves to the edge of the camera view. One such instance is shown in Figure 6.11. When the surface travels to the edge of the camera view, some of the features cannot be matched, since they are outside the camera view. For some other features, the matching rate drops significantly due to changes in the camera view angle and illumination conditions.

We also record the detailed feature tracking performance for each tracked feature. The results are shown in Figures 6.12, 6.14, and 6.9. As shown in Figure 6.12, feature tracking can fail for up to 10 seconds for some features, due to bad tracking conditions. Yet, the framework is capable of relocating features that have been lost for a long period of time. For instance, the tracking for feature 20 fails at time stamp 18 and the framework relocates the feature at time stamp 28.35, after the feature tracking has been lost for 10.35 seconds. The overall matching rates for all tracked features are shown in Figure 6.14. In Figure 6.9, if the corresponding surface feature is matched, the 3D feature locations are marked in blue. In case the feature matching fails, the 3D locations are marked in red. As shown in the figure, the majority of the matching failures take place when the surface features travel to the edge of the camera view. In Figure 6.9, the camera location is (0,0,0)^T.

To suppress feature measurement outliers, we set a threshold on the innovations. Specifically, in case the innovation magnitude associated with a feature measurement is greater than 1.0 mm, we discard the feature measurement. The outlier rejection mechanism also improves system robustness to instrument occlusions. Instrument occlusion can not only prevent feature matching but also cause feature mismatches. Specifically, a feature in the surface map can be matched to a feature detected on the instrument. With the proposed outlier rejection method, such outliers due to instrument occlusion can be handled well. In case a feature on the instrument is accidentally matched, there will be a sudden depth change, since the matched feature moves from the tissue surface onto the instrument. Such a sudden change leads to a large feature innovation, since the prediction hi is based on the surface registration. As a result, the mismatch is unlikely to affect the tracking. As shown in Figure 6.13, none of the features on the instrument is matched.

A way to verify the tracking quality is to compute the average residual error among the matched features. For each surface feature, the residual error can be computed as the distance between the transformed 3D feature location and the acquired feature measurement. The transformed feature location can be obtained based on the 3D surface feature location in the object coordinates and the motion states. The feature locations and the motion states can be extracted from the newly updated EKF state vector. The average residual error for an in vivo case at each time instance is shown in Figure 6.10.
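The outlier rejection described above amounts to a gate on the EKF innovation. A minimal sketch of that gating step is given below; the 1.0 mm threshold comes from the text, while the function name and array layout are illustrative assumptions, and the actual gating runs as part of the EKF update on the CPU.

# Minimal sketch of innovation gating: a 3D feature measurement is discarded
# when the magnitude of its innovation (measurement minus EKF prediction)
# exceeds the 1.0 mm gate described in the text.
import numpy as np

INNOVATION_GATE_MM = 1.0

def gate_measurements(predicted, measured, matched):
    """Return the indices of matched features whose innovation passes the gate.

    predicted : (N, 3) EKF-predicted 3D feature locations h_i
    measured  : (N, 3) triangulated 3D measurements z_i (rows of unmatched
                features are ignored)
    matched   : (N,) boolean mask of features matched in both stereo channels
    """
    innovation = measured - predicted               # nu_i = z_i - h_i
    magnitude = np.linalg.norm(innovation, axis=1)  # per-feature 3D distance (mm)
    accepted = matched & (magnitude <= INNOVATION_GATE_MM)
    return np.flatnonzero(accepted)

# Only the accepted features contribute rows to the stacked measurement
# Jacobian and to the EKF update; mismatches caused by instrument occlusion
# produce a large depth jump, fail the gate, and are ignored.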
6.5 Timing Results

The surface tracker was tested on a standard PC with an Intel Core i7-3770 CPU at 3.4 GHz. The GPU used is the Nvidia GeForce GTX 670. For the ex vivo data, we initialize 55 features within the ROI. The feature selection is based on a DOG threshold of 0.03. In the future, we shall improve the feature selection based on the feature matching rate. Specifically, features with low matching rates can be eliminated to reduce the computation cost. The execution time is 32.6 ms per frame, corresponding to roughly 30 Hz. In the case of the in vivo study, we track 77 initial features with a frame rate of 23 Hz (42.5 ms per frame). A detailed breakdown of the execution time consumed by each component is shown in Figure 6.15. As shown in the figure, most of the execution time is spent on building the octaves and on the EKF update. Currently, the EKF prediction and update are carried out on the CPU. The salient feature extraction and matching are carried out on the GPU. The EKF update can be further improved by porting the matrix operations onto the GPU. The matrix operations can be found in the CULA library [23].

Figure 6.8: Experiment results. The ex vivo results are shown in the left column and the in vivo results are shown in the right column. Graphs (a-b) show the number of features matched in every frame, (c-d) show the camera position estimation in mm, and (e-f) show the camera axis angle estimation in degrees. As shown in the figures, fewer features are tracked when the surface travels to the edge of the camera view. In that case, part of the surface region is invisible while the illumination condition in the region also varies considerably.

Figure 6.9: The 3D feature tracking map. In the figure, if the corresponding surface feature is matched, the 3D feature locations are marked in blue. In case the feature matching fails, the 3D locations are marked in red. The camera location is at (0,0,0)^T, and the optical axis is shown as the green line.

Figure 6.10: The average residual error among the matched surface features at each time instance.

Figure 6.11: The feature matching performance when the surface travels to the edge of the camera view. The top figure shows the reference surface with the initial features. The blue features are not matched while the green ones are matched. The bottom figure shows the feature tracking performance when the surface travels to the edge of the camera view.

Figure 6.12: The feature tracking performance for all tracked features. In the figure, every mark indicates that the feature indexed by the y value is matched at the time instance given by the x value. As shown in the figure, feature tracking can fail for up to 10 seconds. Yet, the framework is able to relocate features that have been lost for a while.

Figure 6.13: A figure showing how feature matching reacts to instrument occlusion.
As shown in the figure, the system is sufficiently robust to reject potential mismatches due to instrument occlusion.

Figure 6.14: A histogram of the matching rates for all tracked features.

Figure 6.15: A breakdown of the execution time consumed by each component (Kalman prediction; mask generation, octave building, keypoint extraction, BRIEF descriptor computation, matching, and rectification for each of the left and right channels; 3D triangulation and uncertainty measurement; and the Kalman update). As shown in the figure, most of the execution time is spent on building the octaves and on the EKF update. Currently, the EKF prediction and update are carried out on the CPU, while the salient feature extraction and matching are carried out on the GPU. The execution time is measured in ms. In the figure, the timing for the ex vivo and in vivo data sets is shown in red and blue respectively. The ROI for the in vivo data is greater, resulting in more features in the surface map, which leads to a longer execution time.

Chapter 7
Conclusion

7.1 Contribution

To enable augmented reality during minimally invasive radical prostatectomy, an EKF based tissue surface tracking framework is developed to maintain the registration between a medical image volume and the tissue surface over time. In this framework, an initial 3D surface feature map is constructed from stereo matched SIFT feature pairs extracted from the tissue surface being tracked. Specifically, SIFT features are extracted from both stereo channels and matched based on the SIFT descriptor. The stereo matched feature pairs are reconstructed in 3D to generate the initial 3D surface feature map. For each consecutive frame, the surface based EKF predicts the current 3D location of each feature in the 3D surface map. The 3D predicted location can be projected into both stereo channels based on the stereo geometry, so as to guide 2D temporal matching for each corresponding feature in the 3D surface map. Specifically, within a search range around the 2D predicted location, SIFT features are extracted and matched to the initial 2D features based on the BRIEF descriptor. For an initial feature in the 3D surface map, if 2D temporal matches are found in both stereo channels, the stereo feature pair can be used to reconstruct the feature location in 3D, which can be treated as a new 3D measurement of the corresponding feature in the 3D surface map. The differences between the newly measured 3D locations and the EKF predicted locations drive the EKF update to simultaneously estimate the current camera motion states and the feature locations of the 3D surface map. In order to reject potential outliers in the temporal matching, a threshold is applied to the magnitude of the EKF innovation for each matched feature. The surface tracking framework is validated on an ex vivo porcine video sequence and on in vivo prostate sequences acquired during da Vinci radical prostatectomy. For all cases in our test sequences, the fiducial errors are shown to be less than 2.5 mm. The following three major contributions have been made in the course of this work:

• The proposed surface based EKF framework is suitable for MIS applications due to its robustness against momentary feature loss. In MIS, due to constantly changing illumination conditions, instrument occlusion, and smoke, 2D feature tracking schemes cannot consistently track features in time.
In this framework, the feature management scheme does not discard any feature, even in the case where the feature has not been matched for a very long time. For each frame, the framework attempts to locate each feature in the 3D surface map based on the feature prediction provided by the surface based EKF. The framework does not rely on continuous tracking of each feature in the 3D surface map. As long as a sufficient number of features are tracked in the current frame, the surface registration can be maintained and the EKF predictions are sufficiently accurate for locating features that have been lost for a long time.

• The feature detection and matching schemes are implemented on modern GPUs. Our SIFT feature detector allows sub-pixel interpolation and is compared to the original SIFT implementation by Lowe [26]. With the SIFT feature detection, the BRIEF descriptor, and the feature matching implemented on GPUs, the frame rate can reach 30 Hz. In addition, with the intensive computation power provided, a greater search range can be explored around the 2D predicted feature locations. That allows the framework to be more robust to potential temporary estimation errors in the EKF.

• We validate our work on both ex vivo and in vivo video sequences. The tracking accuracy is measured based on the surface fiducial error. We also record the 3D feature tracking persistency of our framework. The validation results show that our framework can persistently track a set of surface features and that the surface tracking does not drift.

7.2 Comparisons to the State of the Art

Compared to the work by Giannarou et al. [17], the surface based EKF can provide better spatial guidance for relocating lost features. In [17], the spatial context information around the tracked feature is used for loss-of-tracking recovery. Such a scheme can work well in the case of short term loss of feature tracking, which can be caused by changes in the camera view angles and illumination conditions. However, in case no neighbouring feature is matched, the scheme cannot provide spatial guidance for recovery. This could be the case if the feature travels outside of the camera view or is occluded by a surgical instrument. In the surface based EKF framework, the entire tracked surface is used to provide spatial guidance for recovery. This enables us to recover features that have been lost for a long period.

In the work by Grasa et al. [19], the same surface based EKF framework is used to perform monoSLAM during laparoscopic surgery. The major differences between the work by Grasa et al. [19] and our work are:

• The feature tracking quality reported in [18] appears to be inferior to the results reported in this work, given similar camera motion. Yet, a detailed comparison should be performed on the same data sets to be certain.

• The outlier rejection mechanisms used are different between this work and the work by Grasa et al. [19]. In [19], the 1-point RANSAC outlier rejection method [11] is used to allow high innovation feature measurements to contribute to the EKF update, while we discard the high innovation measurements. Given a video capture rate of 60 Hz, it is not clear whether the 1-point RANSAC method can offer better performance compared to the outlier rejection method proposed in this work.

• In the work by Grasa et al. [19], the 3D feature locations are estimated based on the Inverse Depth Parametrization method [9].
Specifically, the feature depth is estimated by observing the feature at different camera view angles. Based on the rigid environment assumption, the feature uncertainty in depth reduces as the EKF framework propagates in time. The rigid environment assumption does not hold in case the tracked tissue surface experiences free-form deformation. Using a stereo endoscope can have a clear advantage in a deformable tracking scenario.

In the work carried out by Mountney et al. [37, 38], a surface-based EKF framework has been proposed to dynamically expand the surgical view, in order to facilitate surgical navigation. The major differences between our work and the work by Mountney et al. [37, 38] are:

• The work by Mountney et al. [37, 38] has not been quantitatively validated in the in vivo context. In our work, the surface tracking framework is validated on both ex vivo and in vivo video sequences. We have shown both the tracking accuracy and the 3D feature tracking persistency.

• To eliminate feature mismatches, it is necessary to implement an outlier rejection mechanism, such as the 1-point RANSAC method [11]. In this work, we discard the feature measurements with high innovation. The work by Mountney et al. [37, 38] lacks an outlier rejection method, which potentially reduces the system robustness.

7.3 Future Work

7.3.1 Affine Invariant Feature Matching Scheme

In the proposed surface based EKF framework, the BRIEF descriptor only works in the case that the target surface does not experience too much rotation or affine transformation. For a generalized surface tracking scheme, an affine invariant feature tracking scheme should be put in place. Potential approaches include ASIFT [34] or compensating the patch motion based on the camera motion states [14].

The feature matching scheme required is a wide baseline matching scheme which allows long term feature loss recovery. A wide baseline matching scheme can potentially offer a higher matching rate in case significant appearance differences occur in the feature patches. With a higher feature matching rate, it is more likely that a long term lost feature can be relocated. For the purpose of recovering long term lost features, schemes such as constantly updating descriptors or keeping historical descriptors for matched features are not optimal.

7.3.2 Feature Management in 3D Object Tracking

While tracking a 3D object, it is necessary to distinguish the features tagged on the object surface from those on the surrounding anatomy. Therefore, 2D regions of interest (ROIs) have to be defined for each frame, while only features extracted within the ROIs are considered as object features. This can be done by mapping the 3D object into the camera coordinates based on an initial registration between the medical image frame and the camera frame. Combined with the surface registration and the mapped 3D object in the camera coordinates, 2D ROIs can be defined for consecutive frames by projecting the 3D object boundary into both stereo images. The EKF can be initialized as discussed in Section 3.2, except that the features are extracted from the projected ROIs in both images instead of user selected ROIs in the left image. As discussed before, no feature is deleted from the surface map, since the tracker is expected to be able to recover long term lost features.

In order to allow the 3D object to move freely in the scene, the framework has to be able to detect the presence of unexplored surfaces. Such a condition can be identified as a drop in the explored ratio of the 2D object ROI.
Specifically, the existingfeature bounding box defined in section 3.3.1 can be treated as an explored area.In case only a small portion of the object ROI is explored, new feature extractionhas to be carried out in the unexplored scene and newly extracted features shouldbe added into the surface map as described in 3.1.5. The unexplored area can bedetermined by subtracting the explored area from the object ROI.In some cases, tracking has to rely entirely on the features newly extracted froman object surface that is just explored. In that case, the tracking can potentiallydrift due to back-transformation error. Nevertheless, as soon as the camera viewmoves back to the very original object surface, the initial features can be matchedto eliminate drifting error. The tracking can still fail if the system drift is so bigthat initial feature locations can not be accurately predicted. Yet, this happens veryrare.At the current stage, we still have to further investigate the performance of101BRIEF descriptor under various image transformations. Meanwhile, the initialregistration between the medical image coordinate and the camera coordinate isstill missing, which is required for projection the 3D medical image model into thecamera coordinates. As a result, a full 3D object tracking is expected in the futurework.7.3.3 Surface Tracking Failure RecoveryComplete surface tracking failure is rare. Yet, in case it happens, the failure re-covery can be carried out by matching features extracted from the current frameto features in the 3D surface map. Such feature matching requires wide baselinematching which in turn requires affine invariance. Potential approaches can beASIFT or the learning approach by Grasa [18]. In [18], to recover surface trackingfailures, features from accelerated segment test (FAST) features [43] are detectedfrom the current frame and fed into a randomized tree [24] for classification. Theclassification aims to assign each newly extracted feature to the features in the 3Dsurface map. If classification is successful, a feature match is found. The random-ized tree is trained based on 400 warped versions of the initial feature patches andhistorically matched feature patches.102Bibliography[1] T. K. Adebar, M. C. Yip, S. E. Salcudean, R. N. Rohling, C. Y. Nguan, andS. L. Goldenberg. Registration of 3d ultrasound through an air tissueboundary. Medical Imaging, IEEE Transactions on, 31(11):2133–2142,2012. ISSN 0278-0062. → pages 2[2] M. Agrawal, K. Konolige, and M. Blas. Censure: Center surround extremasfor realtime feature detection and matching. In D. Forsyth, P. Torr, andA. Zisserman, editors, Computer Vision ECCV 2008, volume 5305 ofLecture Notes in Computer Science, pages 102–115. Springer BerlinHeidelberg, 2008. ISBN 978-3-540-88692-1. → pages 7, 12[3] H. Bay, T. Tuytelaars, and L. V. Gool. Surf: Speeded up robust features. InIn ECCV, pages 404–417, 2006. → pages 11[4] M. Bjorkman. A cuda implementation of sift, 2007. URLhttp://www.csc.kth.se/∼celle/. → pages 46[5] M. Bjrkman, N. Bergstrm, and D. Kragic. Detecting, segmenting andtracking unknown objects using multi-label {MRF} inference. ComputerVision and Image Understanding, 118(0):111 – 127, 2014. ISSN 1077-3142.→ pages 41[6] J. Y. Bouguet. Camera calibration toolbox for Matlab, 2008. URLhttp://www.vision.caltech.edu/bouguetj/calib doc/. → pages iv, 27, 58, 65,120[7] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with theOpenCV Library. O’Reilly, Cambridge, MA, 2008. → pages 27[8] M. Calonder, V. Lepetit, C. 
Strecha, and P. Fua. Brief: Binary robustindependent elementary features. In Proceedings of the 11th EuropeanConference on Computer Vision: Part IV, ECCV’10, pages 778–792, Berlin,103Heidelberg, 2010. Springer-Verlag. ISBN 3-642-15560-X,978-3-642-15560-4. → pages 7, 11, 15[9] J. Civera, A. Davison, and J. Montiel. Inverse depth parametrization formonocular slam. Robotics, IEEE Transactions on, 24(5):932–945, Oct 2008.ISSN 1552-3098. → pages 99[10] J. Civera, O. G. Grasa, A. J. Davison, and J. M. M. Montiel. 1pointt ransacfor extended kalman filtering: Application to real-time structure frommotion and visual odometry. J. Field Robot., 27(5):609–631, Sept. 2010.ISSN 1556-4959. → pages 6[11] J. Civera, O. G. Grasa, A. J. Davison, and J. M. M. Montiel. 1-point ransacfor extended kalman filtering: Application to real-time structure frommotion and visual odometry. Journal of Field Robotics, 27(5):609–631,2010. ISSN 1556-4967. → pages 35, 36, 99, 100[12] D. Cohen, E. Mayer, D. Chen, A. Anstee, J. Vale, G.-Z. Yang, A. Darzi, andP. Edwards. Augmented reality image guidance in minimally invasiveprostatectomy. In A. Madabhushi, J. Dowling, P. Yan, A. Fenster,P. Abolmaesumi, and N. Hata, editors, Prostate Cancer Imaging.Computer-Aided Diagnosis, Prognosis, and Intervention, volume 6367 ofLecture Notes in Computer Science, pages 101–110. Springer BerlinHeidelberg, 2010. ISBN 978-3-642-15988-6. → pages 1[13] N. Cornelis and L. Van Gool. Fast scale invariant feature detection andmatching on programmable graphics hardware. In Computer Vision andPattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer SocietyConference on, pages 1 –8, june 2008. → pages 47[14] A. Davison, I. Reid, N. Molton, and O. Stasse. Monoslam: Real-time singlecamera slam. Pattern Analysis and Machine Intelligence, IEEE Transactionson, 29(6):1052–1067, June 2007. ISSN 0162-8828. → pages iv, vi, 5, 8, 13,14, 15, 17, 19, 36, 37, 38, 40, 100[15] O. Faugeras. Three-dimensional Computer Vision: A Geometric Viewpoint.MIT Press, Cambridge, MA, USA, 1993. ISBN 0-262-06158-9. → pages57, 58[16] O. Faugeras, Q.-T. Luong, and T. Papadopoulou. The Geometry of MultipleImages: The Laws That Govern The Formation of Images of A Scene andSome of Their Applications. MIT Press, Cambridge, MA, USA, 2001. ISBN0262062208. → pages 58104[17] S. Giannarou, M. Visentini-Scarzanella, and G.-Z. Yang. Probabilistictracking of affine-invariant anisotropic regions. Pattern Analysis andMachine Intelligence, IEEE Transactions on, 35(1):130–143, Jan 2013.ISSN 0162-8828. → pages 2, 3, 6, 9, 10, 11, 14, 37, 39, 99[18] O. Grasa, J. Civera, and J. M. M. Montiel. Ekf monocular slam withrelocalization for laparoscopic sequences. In Robotics and Automation(ICRA), 2011 IEEE International Conference on, pages 4816–4821, May2011. → pages 35, 99, 102[19] O. Grasa, E. Bernal, S. Casado, I. Gil, and J. Montiel. Visual slam forhandheld monocular endoscope. Medical Imaging, IEEE Transactions on,33(1):135–146, Jan 2014. ISSN 0278-0062. → pages 2, 17, 99[20] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision.Cambridge University Press, New York, NY, USA, 2 edition, 2003. ISBN0521540518. → pages 58[21] R. Hess. An open-source siftlibrary. In Proceedings of the InternationalConference on Multimedia, MM ’10, pages 1493–1496, New York, NY,USA, 2010. ACM. ISBN 978-1-60558-933-6. → pages 41, 49[22] A. Hughes-Hallett, E. K. Mayer, H. J. Marcus, T. P. Cundy, P. J. Pratt, A. W.Darzi, and J. A. Vale. 
Augmented reality partial nephrectomy: Examiningthe current status and future perspectives. Urology, 83(2):266 – 273, 2014.ISSN 0090-4295. → pages 1, 2[23] J. R. Humphrey, D. K. Price, K. E. Spagnoli, A. L. Paolini, and E. J.Kelmelis. CULA: hybrid GPU accelerated linear algebra routines. pages770502–770502–7, 2010. URL http://dx.doi.org/10.1117/12.850538. →pages 90[24] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEETrans. Pattern Anal. Mach. Intell., 28(9):1465–1479, Sept. 2006. ISSN0162-8828. → pages 102[25] T. Lindeberg. Scale-space theory: A basic tool for analysing structures atdifferent scales. Journal of Applied Statistics, 21:224–270, 1994. → pages42[26] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Int.J. Comput. Vision, 60(2):91–110, Nov. 2004. ISSN 0920-5691. → pages xii,6, 10, 11, 26, 42, 43, 47, 49, 53, 98105[27] B. D. Lucas and T. Kanade. An iterative image registration technique withan application to stereo vision. In Proceedings of the 7th International JointConference on Artificial Intelligence - Volume 2, IJCAI’81, pages 674–679,San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc. → pages10[28] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereofrom maximally stable extremal regions. Image and Vision Computing, 22(10):761–767, Sept. 2004. ISSN 02628856. → pages 12[29] G. Medioni and S. B. Kang. Emerging Topics in Computer Vision. PrenticeHall PTR, Upper Saddle River, NJ, USA, 2004. ISBN 0131013661. →pages 57[30] P. R. S. Mendona and R. Cipolla. A simple technique for self-calibration,1999. → pages 58[31] K. Mikolajczyk. Detection of local features invariant to affinetransformations. PhD thesis, Institut National Polytechnique de Grenoble,France, 2002. → pages 42[32] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest pointdetectors. International Journal of Computer Vision, 60(1):63–86, 2004. →pages 12[33] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas,F. Schaffalitzky, T. Kadir, and L. V. Gool. A comparison of affine regiondetectors. Int. J. Comput. Vision, 65(1-2):43–72, Nov. 2005. ISSN0920-5691. → pages 9, 12, 86[34] J. M. Morel and G. Yu. ASIFT: A new framework for fully affine invariantimage comparison. SIAM Journal on Imaging Sciences, 2(2):438–469, 2009.→ pages 11, 12, 100[35] P. Mountney. Simultaneous Localisation and and Mapping For MinimallyInvasive Surgery. PhD thesis, Imperial College London, 2010. → pages 10,11[36] P. Mountney and G.-Z. Yang. Soft tissue tracking for minimally invasivesurgery: Learning local deformation online. In Proceedings of the 11thInternational Conference on Medical Image Computing andComputer-Assisted Intervention, Part II, MICCAI ’08, pages 364–372,Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 978-3-540-85989-5. →pages 10106[37] P. Mountney and G.-Z. Yang. Dynamic view expansion for minimallyinvasive surgery using simultaneous localization and mapping. InEngineering in Medicine and Biology Society, 2009. EMBC 2009. AnnualInternational Conference of the IEEE, pages 1184–1187. IEEE, 2009. →pages 100[38] P. Mountney, D. Stoyanov, A. Davison, and G.-Z. Yang. Simultaneousstereoscope localization and soft-tissue mapping for minimal invasivesurgery. In Medical Image Computing and Computer-AssistedIntervention–MICCAI 2006, pages 347–354. Springer Berlin Heidelberg,2006. → pages 2, 5, 6, 7, 8, 15, 100[39] D. S. Myers, A. I. Gonzales, F. H. Rothganger, and K. W. 
Larson.Implementing wide baseline matching algorithms on a graphics processingunit, 2007. → pages 46[40] V. Podlozhnyuk. Image convolution with cuda. Technical report, 2007. →pages iv, 43[41] P. Pratt, E. Mayer, J. Vale, D. Cohen, E. Edwards, A. Darzi, and G.-Z. Yang.An effective visualisation and registration system for image-guided roboticpartial nephrectomy. Journal of Robotic Surgery, pages 1–9, 2012. → pages2[42] R. Richa, P. Poignet, and C. Liu. Three-dimensional motion tracking forbeating heart surgery using a thin-plate spline deformable model. Int. J. Rob.Res., 29(2-3):218–230, Feb. 2010. ISSN 0278-3649. → pages 11[43] E. Rosten and T. Drummond. Fusing points and lines for high performancetracking. In Computer Vision, 2005. ICCV 2005. Tenth IEEE InternationalConference on, volume 2, pages 1508–1515 Vol. 2, Oct 2005. → pages 102[44] J. Shi and C. Tomasi. Good features to track. In Computer Vision andPattern Recognition, 1994. Proceedings CVPR ’94., 1994 IEEE ComputerSociety Conference on, pages 593–600, 1994. → pages 5, 10, 36, 123[45] L.-M. Su, B. P. Vagvolgyi, R. Agarwal, C. E. Reiley, R. H. Taylor, and G. D.Hager. Augmented reality during robot-assisted laparoscopic partialnephrectomy: Toward real-time 3d-ct to stereoscopic video registration.Urology, 73(4):896 – 900, 2009. ISSN 0090-4295. → pages 2[46] R. Tsai. A versatile camera calibration technique for high-accuracy 3dmachine vision metrology using off-the-shelf tv cameras and lenses.107Robotics and Automation, IEEE Journal of, 3(4):323–344, August 1987.ISSN 0882-4967. → pages 57[47] O. Ukimura and I. S. Gill. Image-fusion, augmented reality, and predictivesurgical navigation. Urologic Clinics of North America, 36(2):115 – 123,2009. ISSN 0094-0143. New Technology in Urologic Surgery. → pages 1, 2[48] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library ofcomputer vision algorithms, 2008. → pages 41, 49[49] C. Wengert, L. Bossard, A. Hberling, C. Baur, G. Szkely, and P. Cattin.Endoscopic navigation for minimally invasive suturing. In N. Ayache,S. Ourselin, and A. Maeder, editors, Medical Image Computing andComputer-Assisted Intervention MICCAI 2007, volume 4792 of LectureNotes in Computer Science, pages 620–627. Springer Berlin Heidelberg,2007. ISBN 978-3-540-75758-0. → pages 35[50] C. Wu. SiftGPU: A GPU implementation of scale invariant featuretransform (SIFT), 2007. URL http://cs.unc.edu/∼ccwu/siftgpu. → pages 46[51] C. Wu. SiftGPU: A GPU implementation of scale invariant featuretransform (SIFT). http://cs.unc.edu/∼ccwu/siftgpu, 2007. → pages 41[52] M. C. Yip, D. G. Lowe, S. E. Salcudean, R. N. Rohling, and C. Y. Nguan.Tissue tracking and registration for image-guided surgery. Medical Imaging,IEEE Transactions on, 31(11):2169–2182, Nov 2012. ISSN 0278-0062. →pages 2, 7, 8, 9, 11, 14[53] M. C. L. Yip. Ultrasound registration and tracking for robot-assistedlaproscopic surgery. Master’s thesis, University of British Columbia, 2011.→ pages 8[54] Z. Zhang. A flexible new technique for camera calibration. Pattern Analysisand Machine Intelligence, IEEE Transactions on, 22(11):1330–1334, Nov2000. ISSN 0162-8828. → pages 57, 58, 61[55] Z. Zhang. Camera calibration with one-dimensional objects. PatternAnalysis and Machine Intelligence, IEEE Transactions on, 26(7):892–899,July 2004. ISSN 0162-8828. 
→ pages 58108Appendix AEKF JacobiansA.1 EKF PredictionIn the EKF prediction stage, the covariance matrix evolves as:P(t|t−1) =∂ fv∂xv Pxx(t−1|t−1)∂ fv∂xvT+Q ∂ fv∂xv Pxy1(t−1|t−1)∂ fv∂xv Pxy2(t−1|t−1) · · ·Py1x(t−1|t−1)∂ fv∂xvTPy1y1(t−1|t−1) Py1y2(t−1|t−1) · · ·Py2x(t−1|t−1)∂ fv∂xvTPy2y1(t−1|t−1) Py2y2(t−1|t−1) · · ·.......... . ..(A.1)The Q is defined as:Q =∂ fv∂n Pn∂ fv∂nT, (A.2)wherePn =[(δa∆t)2 00 (δα∆t)2]. (A.3)The Jacobians to compute here is ∂ fv∂xv and∂ fv∂n . Firstly, the prediction function canbe written as:fv =rnewqnewvnewωnew=r+(v+V )∆tq×q((ω+Ω)∆t)v+Vω+Ω. (A.4)109As a result, ∂ fv∂xv can be written as:∂ fv∂xv=I 0 I∆t 00 ∂qnew∂q 0∂qnew∂ω0 0 I 00 0 0 I. (A.5)The problem is to compute ∂qnew∂q and∂qnew∂ω . As shown in A.4, the term qnew is com-puted by multiplying 2 quaternion. The quaternion multiplication can be done byconverting the quaternions into matrix and multiplying the corresponding matrices.The quaternion matrix conversion can be done as:M =w z −y x−z w x yy −x w z−x −y −z w, i f q =wxyz. (A.6)The quaternion product q3 = q2×q1 can be derived as:q3 =w1w2− x1x2− y1y2− z1z2w1x2 + x1w2− y1z2 + z1y2w1y2 + x1z2 + y1w2− z1x2w1z2− x1y2 + y1x2 + z1w2. (A.7)The Jacodians can be derived as:∂q3∂q1=w2 −x2 −y2 −z2x2 w2 −z2 y2y2 z2 w2 −x2z2 −y2 x2 w2, (A.8)110and∂q3∂q2=w1 −x1 −y1 −z1x1 w1 z1 −y1y1 −z1 w1 x1z1 y1 −x1 w1. (A.9)At this point, ∂qnew∂q can be computed using equationA.9. Regarding to∂qnew∂ω , theJacobian can be computed using chain rule as:∂qnew∂ω =∂qnew∂q(ω∆t) ×∂q(ω∆t)∂ω . (A.10)∂qnew∂q(ω∆t) can be computed using equationA.8. q(ω∆t) is the quaternion representa-tion of the axis angle ω∆t. Converting the axis angle to a quaternion, q(ω∆t) canbe written as:q(ω∆t) =[cos(12‖ω∆t‖)ω∆t‖ω∆t‖sin(12‖ω∆t‖)]. (A.11)The detail computation of ∂q(ω∆t)∂ω is shown in equation A.13.∂ fv∂n can be computedas:∂ fv∂n =I∆t 00 ∂qnew∂ΩI 00 I. (A.12)Here, ∂qnew∂Ω can be computed in the same way as∂qnew∂ω shown in equationA.10.111q(ωR∆t) =−ωRx ∆tsin( 12 ‖ωR‖∆t)2‖ωR‖−ωRy ∆tsin( 12 ‖ωR‖∆t)2‖ωR‖−ωRz ∆tsin( 12 ‖ωR‖∆t)2‖ωR‖ωRx2∆t2‖ωR‖2 cos(12‖ωR‖∆t)+ 1‖ωR‖ (1−ωRx2‖ωR‖2 )sin(12‖ωR‖∆t)ωRx ωRy‖ωR‖2 (∆t2 cos(12‖ωR‖∆t)− 1‖ωR‖ sin(12‖ωR‖∆t)) ωRx ωRz‖ωR‖2 (∆t2 cos(12 ‖ωR‖∆t)− 1‖ωR‖ sin(12‖ωR‖∆t))ωRy ωRx‖ωR‖2 (∆t2 cos(12‖ωR‖∆t)− 1‖ωR‖ sin(12 ‖ωR‖∆t))ωRy2∆t2‖ωR‖2 cos(12‖ωR‖∆t)+ 1‖ωR‖ (1−ωRy2‖ωR‖2 )sin(12 ‖ωR‖∆t)ωRy ωRz‖ωR‖2 (∆t2 cos(12 ‖ωR‖∆t)− 1‖ωR‖ sin(12‖ωR‖∆t))ωRz ωRx‖ωR‖2 (∆t2 cos(12‖ωR‖∆t)− 1‖ωR‖ sin(12 ‖ωR‖∆t))ωRz ωRy‖ωR‖2 (∆t2 cos(12‖ωR‖∆t)− 1‖ωR‖ sin(12‖ωR‖∆t)) ωRz2∆t2‖ωR‖2 cos(12‖ωR‖∆t)+ 1‖ωR‖ (1−ωRz2‖ωR‖2 )sin(12 ‖ωR‖∆t)(A.13)112A.2 EKF UpdateIn the EKF update, the Kalman gain is computed as:Wt = Pt|t−1∂h∂xT(x(t|t−1))S−1t . (A.14)St is the innovation covariance matrix,St =∂h∂x (x(t|t−1))Pt|t−1∂h∂x (x(t|t−1))T +Rt , (A.15)where ∂h∂x can be computed as∂h∂x =∂h∂xv (y0)∂h∂yi (y0) · · · 0 0 · · · 0∂h∂xv (y1) 0 · · ·∂h∂yi (y1) 0 · · · 0.....................∂h∂xv (yk) 0 · · · · · · · · · · · ·∂h∂yi (yk−1), (A.16)where k is the number of successful measurements in the frame. In this case, ∂h∂xis a 3k× (13+ 3n) matrix. n is the total number of features in the surface featuremap. In this case, hi is the 3D feature location predicted based on feature locationin the 3D surface map and camera motion states as:hi = R¯ · (yi− r). 
A.2 EKF Update

In the EKF update, the Kalman gain is computed as:

\[
W_t = P_{t|t-1} \frac{\partial h}{\partial x}^T\!\big(x_{(t|t-1)}\big) S_t^{-1}. \tag{A.14}
\]

S_t is the innovation covariance matrix,

\[
S_t = \frac{\partial h}{\partial x}\big(x_{(t|t-1)}\big) P_{t|t-1} \frac{\partial h}{\partial x}\big(x_{(t|t-1)}\big)^T + R_t, \tag{A.15}
\]

where ∂h/∂x can be computed as

\[
\frac{\partial h}{\partial x} =
\begin{bmatrix}
\frac{\partial h}{\partial x_v}(y_0) & \frac{\partial h}{\partial y_i}(y_0) & \cdots & 0 & 0 & \cdots & 0 \\
\frac{\partial h}{\partial x_v}(y_1) & 0 & \cdots & \frac{\partial h}{\partial y_i}(y_1) & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
\frac{\partial h}{\partial x_v}(y_{k-1}) & 0 & \cdots & \cdots & \cdots & \cdots & \frac{\partial h}{\partial y_i}(y_{k-1})
\end{bmatrix}, \tag{A.16}
\]

where k is the number of successful measurements in the frame. In this case, ∂h/∂x is a 3k × (13 + 3n) matrix, where n is the total number of features in the surface feature map. h_i is the 3D feature location predicted from the feature's location in the 3D surface map and the camera motion states:

\[
h_i = \bar{R} \cdot (y_i - r). \tag{A.17}
\]

R̄ is the rotation matrix computed from the quaternion describing the pose of the camera, and r is the camera location stored in the EKF state vector. y_i is the feature location stored in the 3D feature map. In equation (A.16), ∂h/∂y_i can be computed as:

\[
\frac{\partial h}{\partial y_i} = \bar{R}. \tag{A.18}
\]

Regarding ∂h/∂x_v,

\[
\frac{\partial h}{\partial x_v} = \begin{bmatrix} \frac{\partial h}{\partial r} & \frac{\partial h}{\partial q} \end{bmatrix}, \tag{A.19}
\]

where

\[
\frac{\partial h}{\partial r} = -\bar{R}, \tag{A.20}
\]

and

\[
\frac{\partial h}{\partial q} = \frac{\partial h}{\partial \bar{q}} \frac{\partial \bar{q}}{\partial q}. \tag{A.21}
\]

∂q̄/∂q can be computed as

\[
\frac{\partial \bar{q}}{\partial q} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 \\
0 & 0 & -1 & 0 \\
0 & 0 & 0 & -1
\end{bmatrix}. \tag{A.22}
\]

∂h/∂q̄ can be computed in the following manner. For a quaternion q and its matrix representation R, the Jacobian ∂(Ry)/∂q can be computed by stacking columns as:

\[
\frac{\partial (Ry)}{\partial q} = \begin{bmatrix} v_1 & v_2 & v_3 & v_4 \end{bmatrix},
\]
\[
v_1 = \frac{\partial R}{\partial q_\omega} y =
\begin{bmatrix} 2q_\omega & -2q_z & 2q_y \\ 2q_z & 2q_\omega & -2q_x \\ -2q_y & 2q_x & 2q_\omega \end{bmatrix} y, \qquad
v_2 = \frac{\partial R}{\partial q_x} y =
\begin{bmatrix} 2q_x & 2q_y & 2q_z \\ 2q_y & -2q_x & -2q_\omega \\ 2q_z & 2q_\omega & -2q_x \end{bmatrix} y,
\]
\[
v_3 = \frac{\partial R}{\partial q_y} y =
\begin{bmatrix} -2q_y & 2q_x & 2q_\omega \\ 2q_x & 2q_y & 2q_z \\ -2q_\omega & 2q_z & -2q_y \end{bmatrix} y, \qquad
v_4 = \frac{\partial R}{\partial q_z} y =
\begin{bmatrix} -2q_z & -2q_\omega & 2q_x \\ 2q_\omega & -2q_z & 2q_y \\ 2q_x & 2q_y & 2q_z \end{bmatrix} y. \tag{A.23}
\]

In this case, y is (y_i − r) and R is R̄. The Jacobian follows from the conversion between a quaternion and a rotation matrix:

\[
R =
\begin{bmatrix}
q_\omega^2 + q_x^2 - q_y^2 - q_z^2 & 2(q_x q_y - q_\omega q_z) & 2(q_x q_z + q_\omega q_y) \\
2(q_x q_y + q_\omega q_z) & q_\omega^2 - q_x^2 + q_y^2 - q_z^2 & 2(q_y q_z - q_\omega q_x) \\
2(q_x q_z - q_\omega q_y) & 2(q_y q_z + q_\omega q_x) & q_\omega^2 - q_x^2 - q_y^2 + q_z^2
\end{bmatrix}. \tag{A.24}
\]

A.3 Adding Features

For the i-th feature to add, the back-projected 3D feature location y_i can be computed as:

\[
y_i = R \cdot z_i + t, \tag{A.25}
\]

where z_i is the current measurement of the i-th feature. t and R can be acquired from the camera location (r) and pose quaternion (q), respectively.

The two Jacobians to be computed here are ∂y_i/∂x_v and ∂y_i/∂z_i. Regarding ∂y_i/∂x_v,

\[
\frac{\partial y_i}{\partial x_v} = \begin{bmatrix} I & \frac{\partial (Ry)}{\partial q} \end{bmatrix}. \tag{A.26}
\]

∂(Ry)/∂q can be computed as shown in equation (A.23). ∂y_i/∂z_i can be computed as:

\[
\frac{\partial y_i}{\partial z_i} = R. \tag{A.27}
\]
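To make the measurement and feature-initialization equations concrete, the sketch below implements the quaternion-to-rotation conversion of equation (A.24), the stacked Jacobian of equation (A.23), the measurement prediction of equation (A.17), and the back-projection of equation (A.25) in Python/NumPy. It assumes the stored pose quaternion q rotates camera coordinates into world coordinates, so that R̄ is R(q)^T (the conjugate relation suggested by equations (A.21)–(A.22)); the function names are mine, not those of the thesis software.

import numpy as np

def quat_to_rot(q):
    # Rotation matrix from a unit quaternion (w, x, y, z), equation (A.24).
    w, x, y, z = q
    return np.array([
        [w*w + x*x - y*y - z*z, 2*(x*y - w*z),         2*(x*z + w*y)],
        [2*(x*y + w*z),         w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
        [2*(x*z - w*y),         2*(y*z + w*x),         w*w - x*x - y*y + z*z]])

def dRp_dq(q, p):
    # Jacobian of R(q) @ p with respect to q, stacked column by column
    # as [v1 v2 v3 v4] in equation (A.23).
    w, x, y, z = q
    dR_dw = 2 * np.array([[ w, -z,  y], [ z,  w, -x], [-y,  x,  w]])
    dR_dx = 2 * np.array([[ x,  y,  z], [ y, -x, -w], [ z,  w, -x]])
    dR_dy = 2 * np.array([[-y,  x,  w], [ x,  y,  z], [-w,  z, -y]])
    dR_dz = 2 * np.array([[-z, -w,  x], [ w, -z,  y], [ x,  y,  z]])
    return np.column_stack([dR_dw @ p, dR_dx @ p, dR_dy @ p, dR_dz @ p])

def predict_measurement(q, r, y_i):
    # h_i = R_bar (y_i - r), equation (A.17): the map feature y_i expressed
    # in the camera frame, with R_bar taken as R(q)^T (world-to-camera).
    # Note dh/dy_i = R_bar and dh/dr = -R_bar, equations (A.18) and (A.20).
    R_bar = quat_to_rot(q).T
    return R_bar @ (np.asarray(y_i) - np.asarray(r))

def back_project(q, r, z_i):
    # y_i = R z_i + t, equation (A.25): initialize a new map feature from a
    # camera-frame measurement z_i, with t taken from the camera location r.
    return quat_to_rot(q) @ np.asarray(z_i) + np.asarray(r)

if __name__ == "__main__":
    q = np.array([0.9, 0.1, 0.3, 0.2]); q /= np.linalg.norm(q)
    r = np.array([0.01, -0.02, 0.05])
    z_i = np.array([0.0, 0.0, 0.12])      # a feature 12 cm in front of the camera
    y_i = back_project(q, r, z_i)         # add it to the map...
    print(np.allclose(predict_measurement(q, r, y_i), z_i))  # ...and re-predict it: True

Under this convention the two operations are exact inverses, which is a convenient consistency check when wiring up the update and feature-initialization Jacobians.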
Appendix B

Other Related Hardware

B.1 The Quadro Digital Video Pipeline

The NVIDIA Quadro digital video pipeline can serve as the hardware for augmented reality. Specifically, the stereo video sequences can be streamed into GPU memory through the digital video pipeline. Within the GPU, the video sequences can be processed to solve for the surface registration, and the medical image volume can be overlaid onto the visible tissue surface. The combined image can be streamed back to the da Vinci console via the digital video pipeline.

The digital video pipeline consists of three hardware units: a Quadro GPU, an SDI input card, and an SDI output card. The GPU and the SDI input card each require a dedicated 16-lane PCI Express slot on the motherboard, while the output card requires an 8-lane PCI Express slot. For the digital video pipeline to work, the PC motherboard must have two 16-lane PCI Express slots. Moreover, the two slots have to be connected to the same CPU I/O hub, so that peer-to-peer transactions between the PCI Express slots are enabled. Peer-to-peer transactions between the PCI Express slots allow the video data to be transferred directly from the SDI capture card to GPU memory without the transaction delay between the CPU and the GPU.

Currently, real-time video streaming with no dropped frames can be achieved using the demo application. The full functionality of the digital pipeline requires a software implementation following the SDKs provided by NVIDIA at https://registration.nvidia.com/SDI.aspx.

B.2 3D TV

With the NVIDIA 3D Play software, a GeForce GPU can also be used to enable 3D vision on a 3D TV. We have purchased a 3D TV for this purpose. There are many GTX 580 and GTX 690 GPUs in the lab.

Appendix C

Stereo Triangulation and Rectification

The stereo triangulation and rectification are based on a pinhole camera model. The model can be written as:

\[
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \begin{bmatrix} X \\ Y \\ Z \end{bmatrix},
\quad \text{where } A = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}. \tag{C.1}
\]

s is a scale factor. In the intrinsic matrix A, f_x and f_y are the focal lengths and (c_x, c_y) is the optical center. (X, Y, Z) are the coordinates of a 3D point in the world coordinate space, while (u, v) are the coordinates of the projection point on the image plane. Radial and tangential distortions are modeled as:

\[
x' = x(1 + k_1 r^2 + k_2 r^4) + \big[2 p_1 x y + p_2 (r^2 + 2x^2)\big], \tag{C.2}
\]

and

\[
y' = y(1 + k_1 r^2 + k_2 r^4) + \big[p_1 (r^2 + 2y^2) + 2 p_2 x y\big], \tag{C.3}
\]

where x, y, and r are defined as:

\[
x = \frac{X}{Z}, \qquad y = \frac{Y}{Z}, \qquad \text{and} \qquad r^2 = x^2 + y^2. \tag{C.4}
\]

k_1 and k_2 are the radial distortion coefficients, and p_1 and p_2 are the tangential distortion coefficients. (x, y, 1)^T is the normalized 3D position, n⃗. It is convenient to work with n⃗ if we rewrite equation (C.1) as:

\[
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}. \tag{C.5}
\]

Finally, the 2D image coordinates can be written as:

\[
u = f_x x' + c_x, \qquad v = f_y y' + c_y. \tag{C.6}
\]

Figure C.1: A schematic of the triangulation geometry.

C.1 Stereo Triangulation

The geometry of triangulation is illustrated in Figure C.1. In the figure, a 3D point P is projected into the left and right camera views at p_l and p_r, respectively. O_l and O_r are the camera locations. The goal is to find both O_lP_l and O_rP_r, since the midpoint of P_lP_r is the 3D reconstructed point. We set:

\[
\overrightarrow{O_l P_l} = \alpha \vec{x}_l, \quad \text{and} \quad \overrightarrow{O_r P_r} = \beta \vec{x}_r. \tag{C.7}
\]

Given that P_lP_r is very small, the following geometric relationships can be established:

\[
\alpha\, \vec{x}_l \cdot \vec{x}_l = \vec{x}_l \cdot \vec{t} + \beta\, \vec{x}_l \cdot \vec{x}_r, \qquad
\alpha\, \vec{x}_l \cdot \vec{x}_r = \vec{x}_r \cdot \vec{t} + \beta\, \vec{x}_r \cdot \vec{x}_r. \tag{C.8}
\]

α and β can be solved from this equation set as:

\[
\alpha = \frac{(\vec{x}_r \cdot \vec{x}_r)(\vec{x}_l \cdot \vec{t}) - (\vec{x}_l \cdot \vec{x}_r)(\vec{x}_r \cdot \vec{t})}{D}, \qquad
\beta = \frac{(\vec{x}_l \cdot \vec{x}_r)(\vec{x}_l \cdot \vec{t}) - (\vec{x}_l \cdot \vec{x}_l)(\vec{x}_r \cdot \vec{t})}{D},
\]
\[
D = (\vec{x}_l \cdot \vec{x}_l)(\vec{x}_r \cdot \vec{x}_r) - (\vec{x}_l \cdot \vec{x}_r)^2. \tag{C.9}
\]

x⃗_l and x⃗_r can be defined as:

\[
\vec{x}_l = R \vec{n}_l \quad \text{and} \quad \vec{x}_r = \vec{n}_r. \tag{C.10}
\]

n⃗_l and n⃗_r can be obtained from equation (C.4). R is the rotation matrix between the stereo cameras. Equivalently, we can also transform n⃗_r into the left camera coordinates to solve for α and β. The final reconstruction result can be acquired as:

\[
P = \frac{1}{2}\Big(\alpha \vec{n}_l + \beta R^{-1}(\vec{n}_r - \vec{t})\Big). \tag{C.11}
\]
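A compact numerical version of this midpoint construction is sketched below in Python/NumPy. It is illustrative only: the frame convention is an assumption made for the sake of a self-contained example, with (R, t) taken to map left-camera coordinates to right-camera coordinates (p_r = R p_l + t), and the result expressed in the left camera frame; the closed-form coefficients mirror equations (C.8)–(C.9).

import numpy as np

def midpoint_triangulate(n_l, n_r, R, t):
    # n_l, n_r: normalized image rays (x, y, 1) in the left / right camera.
    # R, t: assumed mapping p_r = R @ p_l + t (left-to-right); returns the
    # reconstructed 3D point in the left camera frame.
    d_l = np.asarray(n_l, dtype=float)          # left ray direction
    d_r = R.T @ np.asarray(n_r, dtype=float)    # right ray direction in the left frame
    c = -R.T @ np.asarray(t, dtype=float)       # right camera centre in the left frame

    # Closed-form solution of the two normal equations, as in (C.8)-(C.9).
    a, b, cc = d_l @ d_l, d_l @ d_r, d_r @ d_r
    D = a * cc - b * b
    alpha = (cc * (d_l @ c) - b * (d_r @ c)) / D
    beta = (b * (d_l @ c) - a * (d_r @ c)) / D

    # Midpoint of the segment joining the two closest points on the rays.
    return 0.5 * (alpha * d_l + (c + beta * d_r))

if __name__ == "__main__":
    R = np.eye(3)
    t = np.array([-0.05, 0.0, 0.0])        # right camera 5 cm to the right of the left
    P = np.array([0.02, -0.01, 0.30])      # ground-truth point in the left frame
    n_l = P / P[2]                         # normalized left ray (x, y, 1)
    P_r = R @ P + t
    n_r = P_r / P_r[2]                     # normalized right ray (x, y, 1)
    print(midpoint_triangulate(n_l, n_r, R, t))  # expected: approximately P

With noise-free, exactly intersecting rays the midpoint coincides with the true point; with noisy stereo matches the two closest ray points separate and the midpoint gives the least-squares compromise described in the text.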
C.2 Rectification

The objective of rectification is to align the stereo images in the row direction, so that corresponding object projections have very similar v values. The rectification method used [6] involves two rotation stages.

First, the cameras are rotated towards each other to align the optical axes. Specifically, given the axis-angle between the stereo cameras, θ⃗, the rotations applied to the cameras can be written as:

\[
R_{l1} = \operatorname{rodrigues}\!\left(\frac{\vec{\theta}}{2}\right), \qquad
R_{r1} = \operatorname{rodrigues}\!\left(-\frac{\vec{\theta}}{2}\right). \tag{C.12}
\]

Second, a rotation is applied to both cameras to achieve row alignment. This can be achieved by rotating the unit vector i⃗ = (1, 0, 0)^T to the direction of t⃗, which is the translation vector between the stereo cameras. Such a rotation can be constructed by first defining a unit vector z⃗_n that is perpendicular to both i⃗ and t⃗:

\[
z = \vec{i} \times \vec{t}, \quad \text{and} \quad z_n = \frac{z}{\|z\|}. \tag{C.13}
\]

The axis-angle required for the second rotation, γ⃗, can be derived as:

\[
\vec{\gamma} = \arccos\!\left(\frac{\vec{i} \cdot \vec{t}}{\|\vec{i}\|\,\|\vec{t}\|}\right) z_n. \tag{C.14}
\]

As a result, the equivalent rotation matrix is:

\[
R_2 = \operatorname{rodrigues}(\vec{\gamma}). \tag{C.15}
\]

During the rectification process, we also compensate for distortion by adjusting the focal length. Assuming the distortion is dominated by first-order radial distortion, the compensated focal length is:

\[
f' = f\left(1 + k_1 \frac{w^2 + h^2}{4 f^2}\right). \tag{C.16}
\]

For simplicity, all four focal lengths of the stereo cameras are set to the smaller compensated f_y. The differences between the original focal lengths of the stereo cameras are within 2%. As a result, such an approximation is good enough for our application, since we only use this for a rough estimation of the reconstruction error. In the case of the optical center, to retain the maximum image view, the four corners of the image boundary in the original coordinates are transformed into the rectified coordinates and projected onto the image plane. The midpoint of the projected boundary points is set to be the new optical center for each camera. With the new focal length and optical center, a new projection matrix A′ can be written as:

\[
A' = \begin{bmatrix} f' & 0 & c'_x \\ 0 & f' & c'_y \\ 0 & 0 & 1 \end{bmatrix}. \tag{C.17}
\]

For each camera, the rectification can be done as:

\[
\vec{x}_{2d}' = A' R_2 R_1 A^{-1} \vec{x}_{2d}. \tag{C.18}
\]

Appendix D

Shi and Tomasi Interest Point Detector

The Shi and Tomasi interest point detector [44] detects interest points based on the structure tensor computed within an image window. Consider an image window centered at (u, v); the Sum of Squared Differences (SSD) for a shift (x, y) can be written as:

\[
SSD(x,y) = \sum_u \sum_v w(u,v)\big(I(u+x, v+y) - I(u,v)\big)^2. \tag{D.1}
\]

I denotes the image intensity within the window and w is a pixel weighting function. I(u + x, v + y) can be approximated by a first-order Taylor expansion as:

\[
I(u+x, v+y) \approx I(u,v) + I_x(u,v)\,x + I_y(u,v)\,y, \tag{D.2}
\]

so that

\[
SSD(x,y) \approx \sum_u \sum_v w(u,v)\big(I_x(u,v)\,x + I_y(u,v)\,y\big)^2. \tag{D.3}
\]

In matrix form,

\[
SSD(x,y) \approx \begin{bmatrix} x & y \end{bmatrix} A \begin{bmatrix} x \\ y \end{bmatrix},
\quad \text{where } A = \sum_u \sum_v w(u,v)
\begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}
= \begin{bmatrix} \langle I_x^2 \rangle & \langle I_x I_y \rangle \\ \langle I_x I_y \rangle & \langle I_y^2 \rangle \end{bmatrix}. \tag{D.4}
\]

A is known as the Harris matrix. The two eigenvalues of A indicate the rate of change in the image. We denote the eigenvalues of A by λ1 and λ2, with λ1 ≥ λ2. In the Shi and Tomasi detector, λ2 has to exceed a threshold for the pixel (u, v) to be considered an interest point. A small λ2 indicates an edge or a featureless image window.
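For illustration, the minimum-eigenvalue score of equation (D.4) can be computed densely over an image in a few lines of Python. This is a generic sketch using SciPy image filters as the gradient and windowing operators, not the detector implementation used in the thesis; the window size and Sobel gradients are arbitrary choices standing in for w(u, v) and I_x, I_y.

import numpy as np
from scipy.ndimage import sobel, uniform_filter

def shi_tomasi_response(image, window=5):
    # Minimum-eigenvalue (Shi and Tomasi) corner response for each pixel:
    # build the 2x2 structure tensor A from windowed sums of gradient
    # products and score each pixel by the smaller eigenvalue of A.
    img = image.astype(float)
    Ix = sobel(img, axis=1)
    Iy = sobel(img, axis=0)

    # Windowed averages <Ix^2>, <Iy^2>, <Ix Iy> of equation (D.4).
    Sxx = uniform_filter(Ix * Ix, size=window)
    Syy = uniform_filter(Iy * Iy, size=window)
    Sxy = uniform_filter(Ix * Iy, size=window)

    # Smaller eigenvalue of [[Sxx, Sxy], [Sxy, Syy]] in closed form.
    trace_half = 0.5 * (Sxx + Syy)
    delta = np.sqrt(0.25 * (Sxx - Syy) ** 2 + Sxy ** 2)
    return trace_half - delta  # lambda_2; threshold this to select interest points

if __name__ == "__main__":
    # A synthetic image with a bright square: the corners should score highest.
    img = np.zeros((64, 64))
    img[20:40, 20:40] = 1.0
    resp = shi_tomasi_response(img, window=5)
    print(np.unravel_index(np.argmax(resp), resp.shape))  # expected: near a square corner

Thresholding this response, followed by non-maximum suppression, yields the "good features to track" of [44]; edges produce one large and one small eigenvalue and are rejected, while flat regions produce two small eigenvalues.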
