Open Collections

UBC Theses and Dissertations

Improving object detection using 3D spatial relationships — Southey, Tristram, 2013

Full Text
Improving Object Detection using 3D Spatial Relationships

by

Tristram Southey

M.Sc., University of British Columbia, 2006

a thesis submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

The Faculty of Graduate and Postdoctoral Studies (Computer Science)

The University of British Columbia (Vancouver)

August 2013

© Tristram Southey, 2013

Abstract

Reliable object detection is one of the most significant hurdles that must be overcome to develop useful household robots. Overall, the goal of this work is to demonstrate how effective 3D qualitative spatial relationships can be for improving object detection. We show that robots can utilize 3D qualitative spatial relationships to improve object detection by differentiating between true and false positive detections.

The main body of the thesis focuses on an approach for improving object detection rates that identifies the most likely subset of 3D detections using seven types of 3D relationships and adjusts detection confidence scores to improve the average precision. These seven 3D qualitative spatial relationships are adapted from 2D qualitative spatial reasoning techniques. We learn a model for identifying the most likely subset using a structured support vector machine [Tsochantaridis et al., 2004] from examples of 3D layouts of objects in offices and kitchens. We produce 3D detections from 2D detections using a fiducial marker and images of a scene, and show our model is successful at significantly improving overall detection rates on real world scenes of both offices and kitchens.

After the real world results, we test our method on synthetic detections where the properties of the 3D detections are controlled. Our approach improves on the model it was based upon, that of [Desai et al., 2009], by utilizing a branch and bound tree search to improve both training and inference. Our model relies on sufficient true positive detections in the training data or good localization of the true positive detections.
Finally, we analyze the cumulative benefits of the spatial relationships and determine that the most effective spatial relationships depend on both the scene type and localization accuracy. We demonstrate that there is no one relationship that is sufficient on its own or always outperforms others, and that a mixture of relationships is always useful.

Preface

This dissertation is an original intellectual product of the author, Tristram Southey. All of the work presented henceforth was conducted in the Laboratory for Computational Intelligence.

A version of Chapter 4 appeared in [Viswanathan et al., 2011]. All stages of this work were performed jointly with Pooja Viswanathan and it is included with her permission. J. J. Little and A. Mackworth were involved in project developments in a supervisory position.

A version of Chapters 7 and 8 appeared in [Southey and Little, 2012]. I was lead researcher for all material in this paper. J. J. Little was involved in project developments in a supervisory position. David Meger provided some software for assistance with the scene structure labelling component.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Spatial Knowledge in Action
  1.2 Why Qualitative 3D Spatial Relationships?
  1.3 Improving Object Detection Using 3D Qualitative Spatial Relationships
    1.3.1 Scene Classification Using Object Detections
    1.3.2 Single Object Classification Through Spatial Context
    1.3.3 Improved Multi-object Detection Using 3D Spatial Relationships
  1.4 Contributions
  1.5 Collaborations
  1.6 Chapter Overview
2 Related Work
  2.1 Boosted Alternating Decision Tree
    2.1.1 Adaboost and Boosted Decision Trees
    2.1.2 Alternating Decision Trees
  2.2 Support Vector Machines
    2.2.1 Linear Support Vector Machines
    2.2.2 Non-linear Support Vector Machines
    2.2.3 Structured Support Vector Machines
3 Context and Object Detection
  3.1 Image-based Object Detection
    3.1.1 Object Detectors
    3.1.2 Discriminatively Trained Deformable Part Model
  3.2 Context in Object Detection
    3.2.1 Contextual Sources
  3.3 Object Detection Using 3D Context
    3.3.1 Context and the Spatial Semantic Hierarchy
    3.3.2 Monocular 3D Context
  3.4 Multi-object Simultaneous Object Detection
    3.4.1 Graphical Models for Contextual Object Detection
    3.4.2 Automatic Place Modeling Using Objects
    3.4.3 Bayesian Compositional Hierarchy
    3.4.4 Discriminative Models for Multi-class Object Layout
4 Scene Classification using Object Detections
  4.1 Scene Classification
  4.2 Automated Scene Labeling
    4.2.1 LabelMe Data Collection
    4.2.2 Count Model
    4.2.3 Useful Objects
    4.2.4 Detector Training
    4.2.5 Scene Labeling Using Boosted Decision Trees
  4.3 Experiments
    4.3.1 Count Model
    4.3.2 Detection
    4.3.3 Scene Classification
  4.4 Influence on Future Work
5 Qualitative Spatial Relationships
  5.1 Qualitative Spatial Reasoning
  5.2 Qualitative vs. Quantitative Relationships
    5.2.1 Advantages of Qualitative Spatial Techniques
    5.2.2 Disadvantages of Qualitative Spatial Techniques
  5.3 Spaces
    5.3.1 Sizes of Space
    5.3.2 Spatial Object Structure
    5.3.3 3D Spatial Representations as 2D
  5.4 Qualitative Spatial Relationships
    5.4.1 Orientation
    5.4.2 Distance
    5.4.3 Topology
    5.4.4 Shape and Scale
  5.5 Proposed Complex Relationships
6 Spatial Object Classification in Virtual Environments
  6.1 Problem Formulation
  6.2 Synthetic Data from Elder Scrolls Oblivion
    6.2.1 Advantages of Using the Oblivion Data Set
    6.2.2 Issues With Using the Oblivion Data Set
  6.3 Qualitative Spatial Relationships
    6.3.1 Distance Relationships
    6.3.2 Direction/Containment Spatial Relationships
  6.4 Relationship Sets
  6.5 Experiments
    6.5.1 Training and Test Data
    6.5.2 Results and Discussion
  6.6 Influence on Later Work
7 Improving Object Detection using 3D Spatial Relationships
  7.1 Overview
  7.2 Hypothesis Boxes
  7.3 Scene Detection
  7.4 Model
  7.5 Inference
    7.5.1 Greedy Forward Search
    7.5.2 Branch and Bound Search
  7.6 Learning
    7.6.1 Problem Formulation
    7.6.2 Structured SVM Weight Training
  7.7 3D Spatial Relationships
    7.7.1 Relationship Overview
    7.7.2 Absolute Distance
8 Object Detection using 3D Spatial Relationships Results
  8.1 3D Data Collection
    8.1.1 Fiducial Marker 3D Data Collection
    8.1.2 Locations & Scenes
  8.2 Experimental Overview
  8.3 Image-based Hypothesis Experiments
    8.3.1 3D Hypothesis Construction
    8.3.2 Instanced Hypothesis Boxes of Image-based Experiments
  8.4 Generated Hypothesis Experiments
  8.5 Simulated Hypothesis Experiments
    8.5.1 Simulated True Positive Detections
    8.5.2 Simulated False Positive Detections
    8.5.3 Simulated Hypothesis Experimental Procedures
    8.5.4 Detection and Localization Error Experiment
    8.5.5 Score Separation Experiment
    8.5.6 Branch and Bound Tree vs. Greedy Search
    8.5.7 Spatial Relationship Comparison
9 Conclusion
  9.1 Conclusions
  9.2 Future Directions
Bibliography

List of Tables

Table 4.1 Results for scene classification with perfect object labels.
Table 4.2 Classification results for scene classification using object detections, Gist and both combined on images acquired by humans.
Table 6.1 Relationship set accuracy comparison.
Table 6.2 Confusion matrix for aggregate distance & direction/containment classifier.
Table 8.1 The improvement in average precision (ΔAP) for objects and overall on 3D detections produced from scene images.

List of Figures

Figure 2.1 An example of a conventional decision tree and the equivalent alternating decision tree. The circular nodes represent decisions. The square nodes in the decision tree represent classification leaf nodes and in the alternating decision trees they represent predicate nodes.
Figure 2.2 A visualization of a 2D SVM classifier. The hyperplane is computed between the solid and dotted line data points.
Figure 2.3 An example of an input x, output y and feature map Ψ(x, y) for a structured SVM designed to determine a Probabilistic Context Free Grammar. Figure reproduced from [Tsochantaridis et al., 2004].
Figure 3.1 Visualizations of the Felzenszwalb et al. object classifier. The images on the left show an expected intensity of gradient in a grid pattern for the entire object. On the right, they show the gradients in the parts model.
Figure 3.2 The graphical model for place recognition using object detections in 3D from [Ranganathan and Dellaert, 2007]. The place label L generates a set of N object detections O. Each object detection has a 3D position T, a shape S and an appearance A. The position and shape produce a set of 3D points. These points are computed from an image containing n features, with each feature having a depth d, a pixel location u and an appearance v. Figure reproduced from [Ranganathan and Dellaert, 2007].
Figure 3.3 The compositional hierarchy of a facade. Triangles indicate aggregate structure. Figure reproduced from [Terzić and Neumann, 2010].
Figure 3.4 The structure of a Bayesian Compositional Hierarchy. The triangles are aggregates defined by bounding box A, parts B1 ... BK and the spatial layout of the parts C. Figure reproduced from [Terzić and Neumann, 2010].
Figure 3.5 The qualitative spatial relationships used by [Desai et al., 2009]. The relationships they used were distance-based (near, far), orientation-based (above, below, next to) and topological (on top). Figure reproduced from [Desai et al., 2009].
Figure 4.1 A kitchen scene from the LabelMe database. The polygons used to segment objects in the scene are shown as colored lines.
Figure 4.2 Counts of the types of objects found in kitchen and office scenes.
Figure 4.3 The precision/recall rates of object detectors. The top row shows 2 of the most successful classifiers and the bottom row shows 2 of the least successful classifiers.
Figure 5.1 Three varieties of point based qualitative orientation systems.
Figure 5.2 A region based orientation system used to describe the projection of axis-aligned bounding boxes for the objects onto each axis of the frame of reference, as shown in 5.2a. Intervals are compared according to Allen's interval algebra, shown in 5.2b.
Figure 5.3 An example of a 2D relative distance relationship CanConnect(X, Y, Y). The reference box Y is shown in blue. The region surrounding Y shows the near/far partition. If X (not shown) overlaps with the region surrounding Y, then X and Y are "near", otherwise they are "far". The first shows the partition region when no rotation of Y is allowed to connect X and Y. The second shows the partition region when any rotation of Y is allowed to connect X and Y.
Figure 5.4 The RCC-8 topological calculus.
Figure 5.5 This figure demonstrates how containment is determined using convex hulls. Figure 5.5a shows 4 objects defined by 2D regions. In Figure 5.5b, each shape has been overlaid by its convex hull to demonstrate containment detection. Object B is contained by object A, or more specifically, the RCC-8 containment relationship between A and B is NTPPi (non-tangential proper part inverse).
Figure 5.6 This figure demonstrates how betweenness is determined using convex hulls. Figure 5.6a shows 4 objects defined by 2D regions. In Figure 5.6b, a convex hull has been overlaid on the combined points of objects A and D to detect what objects share betweenness relationships with them. Objects B and C are both between A and D, with object B having an NTPPi relationship and object C having a PO (partial overlap) relationship.
Figure 6.1 A tavern from Oblivion. The wine rack in the rear of the bar holds about thirty bottles of wine in five different varieties. These were collapsed into a single object type "wineBottle" for classification purposes. The cat-like figure behind the bar is the owner. All game characters were removed from the training and test data.
Figure 6.2 A large and ornate dining room from Oblivion. In the center of the image is a dining table containing food and wine set for ten people. Surrounding the table are chairs, though these are partially obscured by shapes that show how character models would transition from standing to sitting on the chair.
Figure 6.3 A library from the Oblivion data set. The shelves in the back contain books, ornaments and tools. In the foreground there is a table set for two with food and wine.
Figure 7.1 This figure illustrates the box distance relationship. Sp(R) and Sp(T) (not shown) are the longest internal spanning vectors of the reference box R (shown in blue) and target boxes T (shown in red). Since Sp(R) > Sp(T), Sp(R) is used to define the maximum separation between objects that are "close".
Figure 7.2 This figure illustrates the vertical orientation relationship. The reference box R is shown in blue and different target boxes are shown in red with the resulting relationship beside them.
Figure 7.3 This figure illustrates the coplanarity relationship. The reference box R is shown in blue and different target boxes are shown in red with the resulting relationship beside them.
Figure 7.4 This figure illustrates the vertical alignment relationship. The reference box R is shown in blue and different target boxes are shown in red with the resulting relationship beside them.
Figure 7.5 This figure illustrates the vertical alignment relationship. The reference box R is shown in blue and different target boxes are shown in red with the resulting relationship beside them.
Figure 8.1 A sample image of a kitchen taken from IKEA. The object in the foreground is a fiducial marker which allows us to recompute the camera position and integrate multiple images into a 3D model of the object layout.
Figure 8.2 An illustration of good instanced hypothesis boxes from image-based detections for a kitchen scene. In this scene most of the objects had well localized boxes that comprised a likely layout, so most objects overlap an instanced hypothesis box.
Figure 8.3 An illustration of good instanced hypothesis boxes from image-based detections for a kitchen scene. Again, many of the objects in the scene overlap an instanced hypothesis box. There are multiple overlapping boxes included for the faucet because both boxes are well localized enough to be considered correct boxes and our loss function does not penalize additional good boxes.
Figure 8.4 An illustration of good instanced hypothesis boxes from image-based detections for a kitchen scene. This scene has fewer instanced hypothesis boxes but contained an example of stacked ovens being correctly detected and instanced. The localization of the top oven box is poor but it is still instanced because it shares spatial relationships with the bottom oven boxes observed in other stacked ovens in the data set.
Figure 8.5 An illustration of bad instanced hypothesis boxes from image-based detections for a kitchen scene. The model can select multiple poorly localized hypothesis boxes because they constitute a likely layout. Here, the oven, dishwasher and coffee maker boxes were likely selected because the oven and dishwasher tops and the coffee maker bottom are coplanar, a layout common in many scenes.
Figure 8.6 An illustration of bad instanced hypothesis boxes from image-based detections for a kitchen scene. This scene contains two instanced oven hypothesis boxes, side by side, with one poorly localized and one well localized. The extra oven box might have been included because multiple oven boxes near each other are common in stacked ovens and the extra detection is close to a small appliance detection.
Figure 8.7 An illustration of bad instanced hypothesis boxes from image-based detections for a kitchen scene. In this scene, both the oven detection box and the pot detection box are significantly offset from their ground truths. Pots are often observed in the data set on top of ovens, so the model selected the pot and oven boxes that shared that relationship.
Figure 8.8 An illustration of good instanced hypothesis boxes from image-based detections for an office scene. In this scene several objects had well localized boxes that comprised a likely layout, so many objects overlap an instanced hypothesis box.
Figure 8.9 An illustration of good instanced hypothesis boxes from image-based detections for an office scene. This scene is very complicated, with a large number of objects detected and localized accurately. Not every object is well localized; one of the monitors is offset significantly. Also, the two close chairs can lead to extra poorly localized boxes when 2D detections from different chairs are combined together to create a 3D hypothesis box.
Figure 8.10 An illustration of good and bad instanced hypothesis boxes from image-based detections for an office scene. Two parts of the scene contain well localized detections on either side, and extra chair and telephone false positive detections appear in the background.
Figure 8.11 An illustration of good and bad instanced hypothesis boxes from image-based detections for an office scene. Both chairs and one monitor hypothesis were well localized, but two false positive monitors and one chair were added, though they do comprise a reasonable scene layout. The monitor on the left was likely added because there is a mouse hypothesis next to it.
Figure 8.12 An illustration of bad instanced hypothesis boxes from image-based detections for an office scene. The problem in this scene was too many false positive and badly localized boxes. The scene had a horseshoe arrangement of desks with the fiducial marker placed in the middle and many objects close together. This led to many incorrectly localized and false positive boxes because every camera frustum overlapped, meaning many more intersections in the 3D hypothesis box creation.
Figure 8.13 An illustration of bad instanced hypothesis boxes from image-based detections for an office scene. In this scene there simply were not enough images taken and there are few 2D detections. This resulted in few hypothesis boxes, and so only a small number of poorly localized boxes were instanced.
Figure 8.14 Simulation results which vary the number of true positive detections, controlled by Pdet, the probability that each ground truth object has an associated positive detection. Results are shown at different levels of localization error (controlled by σ_box). Kitchen results are in red and offices in blue. Solid lines show average precision before applying our method and dotted lines after. Our model provides a significant improvement with either good localization or many positive detections.
Figure 8.15 Simulation results which vary T, the average difference between the true and false positive scores, shown at different levels of localization error (controlled by σ_box). Kitchen results are shown in red and offices in blue. Solid lines show the average precision before applying our method and dotted lines after. The largest improvement in average precision occurs when T is small, simulating a detector with a poor ability to differentiate true and false positive detections.
Figure 8.16 Simulation results that compare the effectiveness of our branch and bound search against the [Desai et al., 2009] greedy search. We varied the localization error, controlled by σ_box. Branch and bound results are in red and greedy search results in blue. Solid lines show average precision before applying our method and dotted lines after. From these it is clear that the branch and bound approach improvements vary between scene types but did provide consistently better results.
Figure 8.17 Simulation results that show the cumulative benefit of each of the spatial relationships. The relationships were chosen by greedily selecting the one that led to the greatest improvement at each step. We varied the localization error, controlled by σ_box, and ran the experiment on both offices and kitchens. The order in which the features were applied is shown in the table embedded in each graph.
Chapter 1

Introduction

A better understanding of the structure of human environments and the ability to make predictions about object type based on that structure could be useful for many applications. With the increasing availability of crowd-sourced object labeling services like Mechanical Turk and 3D shape measuring devices like the Kinect stereoscopic camera, it is likely that we will soon have access to large quantities of accurate quantitative data about the spatial structure and makeup of human environments. At the same time, there is an increasing demand for home-based robots that can assist people with basic, everyday problems like fetching or putting away objects, cleaning houses and reminding people where they left their keys. All of these tasks require some level of knowledge about the structure of human environments.

Overall, the main goal of our work is to demonstrate that it is possible to significantly improve the accuracy of object detection using 3D qualitative spatial relationships. Our approach is based on the fact that the spatial relationships between objects in organized human environments exhibit structure. This predictable structure can be used to improve object detection rates by identifying a set of detections whose 3D layout corresponds to a model of object relationships in that type of scene. For example, in a kitchen you are more likely to find a frying pan on a stove than on a refrigerator. Therefore, objects detected on stoves are more likely to be frying pans than objects detected on top of refrigerators.

Manually identifying all the rules that work together to determine object layout in houses is impractical. The scale of the problem is daunting, given the variety of objects and the multiple locations they can reasonably appear.
Humans would have great difficulty trying to identify all the rules that govern object layout, let alone the probabilities behind these rules that allow us to classify an object based on its context. Learning provides us with an approach that can identify these rules with minimal human intervention and which can be easily updated for new objects and environments. To simplify the learning, we propose using a supervised learning approach that trains from labeled training data. An unsupervised approach that could work with unlabeled data and would require less data preprocessing would be preferable, but it is not unreasonable to require object labels given the difficulty of the problem and the increasing prevalence of labeled data sources such as LabelMe [Russell et al., 2008].

What is needed is a mechanism that would allow a robot to take examples of object layouts in houses and develop a model of the commonalities between these layouts. This model would then allow the robot to make predictions about the type and layout of objects based on input from an object detector. Significantly, the locations of objects in houses are based only partially on the shape of the house. Instead, object locations are predicated on each other. A room is a bedroom primarily because of its contents, and the location of an object in that room will be dependent on the location of other objects.

1.1 Spatial Knowledge in Action

To better understand our goals, let us examine an anecdotal example of how a human might apply spatial knowledge to a classification task. A person is sitting on a chair at a table and in front of them are three objects in a row. The object on one side is a knife and the object on the other is a fork. However, the object in the middle is covered and unknown. Given the scene, what type of object is in the middle?
Most humans in this situation would be able to easily determine that it is most likely a plate.

This may seem like a trivial example, but the classification is quite impressive since a plate would be a difficult object for many visual object classifiers to identify. Plates can vary in color and size, have few distinguishable features in their appearance and look similar to many other circular objects found in human environments. However, a person can identify the unknown object as a plate using only its qualitative position relative to the other surrounding objects, because they have learned about the spatial relationships that exist between objects in human environments and how to apply those relationships to the problem of object classification.

1.2 Why Qualitative 3D Spatial Relationships?

The main difference between our work on using context for improving object detection and other approaches is the use of qualitative 3D spatial relationships to model the spatial layout of objects. Qualitative spatial relationships, such as near, far and above, are terms
Qualitative spatial relationships are used on problems with a spatial element because they simplify learning by translating complex spatial interactions into discrete values, reducing the complexity of the problem.

Most contextual object detection models that use spatial relationships are limited to relationships between detections in a 2D image space. Since 2D relationships are viewpoint-dependent, relations between the same objects can be inconsistent from image to image. Furthermore, 2D spatial relationships are limited in their expressiveness, unable to capture some spatial relationships that are clear in 3D. For example, it is very difficult to reliably determine from a single image whether objects are on the same surface or directly above each other. 3D spatial relationships are viewpoint-independent and can provide richer spatial descriptions on which to base a model. 3D spatial relationships between objects are more consistent across different scenes than 2D spatial relationships because they do not depend on viewpoint. Furthermore, they are more descriptive because they can capture a broader range of object arrangements. Therefore, we anticipate that 3D spatial relationships, if accurately localized, can provide a better basis for performing margin reweighing than 2D relationships.

Historically, 3D relationships have not been used due to a lack of data on the 3D layout of indoor scenes. Computing the 3D layout of objects in scenes is time consuming and difficult. In our work we have both explored using synthetic data sets from video games and produced our own data set of 3D scenes, because there was no existing data set of 3D real world scenes to learn from.

1.3 Improving Object Detection Using 3D Qualitative Spatial Relationships

The following is a set of tasks that could be performed using a model of relative object positions in houses:

1.3.1 Scene Classification Using Object Detections

This is the problem of using object detections to determine the type of a scene.
Scenes are either rooms or portions of a room devoted to a task which have a commonly used semantic label (e.g., office, kitchen, bedroom, bathroom). Since scenes are task related, they rely on the presence of commonly used objects, and many scenes are primarily just a collection of objects in an area. The difference between a bedroom and an office is mostly the contents, not the properties of the room itself. Determining the type of a scene based on raw object detections was important to our work, as it allowed us to use scene-dependent spatial models. We demonstrate that visual object detections can be used to recognize several common types of scenes with sufficient accuracy that we could use scene-dependent spatial models.

1.3.2 Single Object Classification Through Spatial Context

This is the problem of classifying an individual object based on perfect information about the type and location of all surrounding objects. This early work was intended to test the feasibility of using qualitative spatial relationships to classify objects. We used synthetic data from a video game, which contained many realistically structured houses, to provide the training and test data. Our results demonstrate that it is possible to infer the type of many objects using only information about the qualitative relationships with their surroundings.

1.3.3 Improved Multi-object Detection Using 3D Spatial Relationships

The most significant part of this thesis is devoted to the problem of taking the results of a visual object detector applied to multiple views of a scene and using a model of the spatial relationships between objects to reject false positive detections.

Our method identifies the most likely subset of 3D detections using seven types of 3D spatial relationships and adjusts detection confidence scores to improve the average precision. A model is learned using a structured support vector machine (SVM) [Tsochantaridis et al., 2004] from examples of 3D layouts of objects in offices and kitchens.
This approach of using a structured SVM to improve object detection accuracy was applied to images in [Desai et al., 2009].

We describe a technique for generating 3D detections from 2D image-based object detections and demonstrate how our method improves the average precision of these 3D detections. We show that our approach can improve on the detection rates in real world scenes. After that, we test our method on synthetic detections to determine how factors such as localization accuracy, number of detections and detection scores change the effectiveness of 3D spatial relationships for improving object detection rates. We end with an analysis of the relative performance of our seven types of spatial relationships.

1.4 Contributions

The following are the main contributions of this thesis in the order they are presented:

Scene Classification using Object Detections: We present a novel technique for performing scene classification using the results of an image-based object detector. We discuss the problems associated with determining the correct types of object to use for classifying scenes and our approach, which uses alternating boosted decision trees.

Qualitative Spatial Relationships in 3D: We present a detailed analysis of qualitative spatial relationships and the motivations behind their use. We describe the major types of qualitative spatial relationships commonly used and the techniques used to perform the measurements and partitioning necessary to decompose quantitative spatial relationships into qualitative ones.
We focus on the problems of translating qualitative spatial techniques from the 2D spaces where they are normally used into 3D. We then describe in detail seven qualitative spatial relationships that we believe provide a broad coverage of qualitative spatial techniques appropriate to our problem.

3D Qualitative Spatial Relationships and Single Object Classification: We present an analysis of the potential for 3D qualitative spatial relationships to be used to perform classification. This approach uses alternating boosted decision trees as the learning mechanism and relies on synthetic training and test data from the video game Elder Scrolls 3. We demonstrate that classification with moderate accuracy is feasible using only information about surrounding objects. This success shows that spatial relationships might be very useful for improving object detection rates by removing false positive detections. This section also includes an analysis of the use of synthetic data for spatial problems such as these, with a description of the benefits, problems and essential elements required when using a video game for this kind of problem.

3D Qualitative Spatial Relationships for Improving Multi-object Detection: We present an approach for taking multiple images of a scene, producing 3D object detections, identifying a subset of detections that comprise a likely scene according to a learned model and then adjusting the detection scores of all 3D detections to remove false positive detections. We demonstrate our approach on real world scenes from IKEA and our university department.

The technique we use is based on that described in [Desai et al., 2009] for improving detection rates in 2D but differs from and improves on their work in several ways.
Firstly, we produce 3D detections, use 3D spatial relationships and use 3D training data. Secondly, we use a superior inference technique based on a branch and bound tree search to identify better true positive subsets; this technique improves on the learning stage as well. Thirdly, we perform a much broader analysis of the properties of the object detections which lead to significant improvements from the use of this approach. We used synthetically generated scenes based on real world data to examine the role of detection localization accuracy, number of detections per scene, scene type and the average difference in score between true and false positive detections. Finally, we analyze the seven 3D qualitative spatial relationships to see how they perform relative to each other and discuss which relationships are effective under different circumstances.

1.5 Collaborations

During work on this thesis I collaborated on research with two individuals. The work described in Chapter 4 on Scene Classification using Object Detection was performed collaboratively with Pooja Viswanathan. We both contributed to the research, coding, experimentation and paper writing. In Chapter 8 on Object Detection using 3D Spatial Relationships, I extended a code base created by David Meger for labeling objects in scenes in 3D.

1.6 Chapter Overview

The following is a list of the remaining chapters in this thesis and a brief description of their role:

Chapter 2: Related Work This chapter provides an overview of two learning techniques used in later chapters: alternating boosted decision trees and support vector machines.

Chapter 3: Context and Object Detection This chapter begins with an overview of the core concepts behind object detection and the Deformable Parts Model [Felzenszwalb et al., 2008], the main visual object detection technique used in this thesis. We discuss the many ways that context is used for learning problems, the ways it can influence a scene and the major sources of contextual information.
The chapter ends with a broad examination of the approaches for applying context to multi-object and 3D object detection problems.

Chapter 4: Scene Classification using Object Detections This chapter covers our approach to performing scene classification using the results of a visual object detector.

Chapter 5: Qualitative Spatial Relationships This chapter provides an overview of qualitative spatial relationships, the motivation behind their use and the core concepts related to them. We cover the four major types of qualitative spatial relationships, the major approaches to measurement and partitioning and how to translate them into 3D.

Chapter 6: Spatial Object Classification in Virtual Environments This chapter covers our early work on performing classification of a single object given perfect information about its surroundings. This chapter also covers the advantages and problems involved in using synthetic data from video games for learning object spatial relationships.

Chapter 7: Improving Object Detection using 3D Spatial Relationships This chapter describes our approach to improving object detection accuracy by removing false positive detections using a learned model of expected 3D spatial relationships between objects in indoor scenes. This chapter also covers the seven qualitative spatial relationships we use with this approach.

Chapter 8: Object Detection using 3D Spatial Relationships Results This chapter describes the experimental results of the model from the previous chapters applied to real world scenes. We describe our approach for producing 3D detections from 2D detections in multiple images, and our data collection and annotation techniques for producing our 3D training and test data. We give results of our model applied to object detections computed on these real world scenes. We then describe our approach for producing synthetic detections with controllable statistical properties from these real world scenes.
We describe experiments which examine how the properties of the hypothesis boxes used for training and testing affect the final improvement from our techniques. We end with an analysis of the individual benefits from each of the seven qualitative relationships used.

Chapter 9: Conclusion and Future Direction This chapter gives an overview of the conclusions we reached from this thesis. It ends by describing future directions for this research.

Chapter 2

Related Work

This chapter contains related research relevant to this thesis, with a focus on the learning techniques we will use. In the next chapter we will examine the role of context in object detection and research in that area.

The goal of the first section of this chapter is to acquaint the reader with the principles and techniques involved in applying boosted alternating decision trees, which we used in our early work in Chapter 6. Next we examine support vector machines (SVMs) and provide a closer examination of structured support vector machines [Tsochantaridis et al., 2004], sufficient to explain how they are used with our model for improving 3D detection rates in Chapter 7.

2.1 Boosted Alternating Decision Tree

Decision trees are a hierarchical model for supervised learning that identifies local regions in the input space using a sequence of recursive rule-based divisions of the data [Quinlan, 1986]. The general structure of decision trees has internal decision nodes that implement test functions with discrete results and terminal leaves that identify the classifier output. Classification generally works by starting at the root and recursively applying decision node test functions until a leaf node is reached. Most decision tree learning approaches are greedy, finding an optimal split according to an impurity measure. A split is "more pure"
if the rule on which the data is split divides the data into two sets containing samples that belong to the same class, with splits containing a similar number of samples in each set preferred over unbalanced splits.

2.1.1 Adaboost and Boosted Decision Trees

Boosted decision trees [Dietterich, 2000] are a class of decision tree learning algorithms that apply Adaboost learning techniques to the iterative construction of decision trees. Boosting is a learning technique that determines how to combine and weight many weak classifiers to produce a single strong classifier. Adaboost (adaptive boosting) by [Freund and Schapire, 1999] was an improvement on earlier boosting techniques that built classifiers iteratively and adapted them based on the performance of previous iterations. The Adaboost algorithm focuses on so-called "hard to solve" samples from the training distribution by increasing weights on samples incorrectly labeled by previous iterations and decreasing weights on those that were successfully classified. As a learning approach it is distinct from the underlying model it is constructing and so can be applied to many different types of models. In our work, we concentrate on boosted decision trees. Adaboost has seen application in a variety of research topics such as face recognition [Viola and Jones, 2004] and text classification [Schapire and Singer, 2000].

Boosted decision trees have several useful properties that make them appropriate for our work:

• They provide a classification score and therefore allow for a reject option.

• They require no knowledge about the properties of the weak classifiers used and can be combined with any weak classifier that is more accurate than a random guess, allowing us to test a wide variety of spatial relationships.

• They are not as prone to over-fitting as other learning techniques such as maximum entropy models.

•
They have no parameters to tune except for the number of rounds they run for, and can usually just be run until the test accuracy plateaus [Freund and Schapire, 1999].

• They come with statistical guarantees about the rate of convergence given sufficient data and a set of moderately accurate weak hypotheses [Freund and Schapire, 1999].

Adaboost can be a poor choice as a learning technique if the training data is small or noisy, or if there exists a set of data points that no weak hypothesis can split with even low purity. Outliers that cannot be classified successfully end up with a large weight, and the model can become biased towards classifying these excessively hard points. There are variations on Adaboost that compensate for this, such as Logitboost [Friedman et al., 2000], which limits the maximum weight of outliers, and Brownboost [Freund, 2001], which aggressively down-weights points that are determined to be too hard to classify. Adaboost approaches such as these might be necessary for our work on qualitative spatial classification, as some objects in an environment are found outside their normal locations and are very difficult to classify.

2.1.2 Alternating Decision Trees

Alternating decision trees (ADTrees) [Freund and Mason, 1999] are a variation of binary classification decision tree that have certain useful properties. They allow for an equally expressive but exponentially more compact representation of any decision tree than the conventional representation shown in Figure 2.1. ADTrees are also considered easier to interpret by a human observer than full decision trees, making it simpler to determine the reasoning behind a decision. The resulting classifier has similar accuracy to other decision tree learning techniques such as C4.5 [Quinlan, 1993] but allows for a simpler, more compact tree representation. ADTrees also provide a classification score, a measure of confidence in the resulting classification.
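The Adaboost reweighting scheme described in Section 2.1.1 can be sketched as follows. This is a generic illustration of the standard exponential update (misclassified samples gain weight, correctly classified samples lose it), not the exact variant used in our experiments:

```python
import math

def adaboost_round(weights, predictions, labels):
    """One Adaboost reweighting step. predictions and labels are in
    {-1, +1}; weights are positive reals. Assumes the weak classifier's
    weighted error is strictly between 0 and 1."""
    total = sum(weights)
    # Weighted error of the weak classifier on the current distribution.
    err = sum(w for w, p, y in zip(weights, predictions, labels)
              if p != y) / total
    # Classifier weight: large when the weak classifier is accurate.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Exponential reweighting: e^{+alpha} for mistakes, e^{-alpha} for
    # correct predictions, then renormalise to a distribution.
    new_weights = [w * math.exp(-alpha * p * y)
                   for w, p, y in zip(weights, predictions, labels)]
    z = sum(new_weights)
    return [w / z for w in new_weights], alpha
```

After one round on four equally weighted samples where a single sample is misclassified, that sample carries half of the total weight, which is exactly the "focus on hard samples" behaviour (and, on noisy data, the outlier-domination problem) described above.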
Classification scores are not as useful as a probabilistic output, because they cannot be compared between different types of models, but they can be used as the basis of a reject option.

Structure

An ADTree is a binary classifier that uses a cumulative scoring value both to determine classification and to provide a classification score. The tree is made up of two types of nodes: decision nodes and predicate nodes. Decision nodes contain a binary decision as a predicate condition (e.g., X above Table). Predicate nodes contain a classification value (possibly negative) that either raises or lowers the likelihood of the classification. Decision nodes are connected to two predicate nodes, for the positive and negative cases. Predicate nodes can then connect to any number of decision nodes or none. Figure 2.1 shows a conventional decision tree and an equivalent alternating decision tree.

Classification

Like other decision trees, classification in ADTrees begins at the root node and follows the tree structure as dictated by the decision nodes. Unlike regular decision trees, where the output is a binary value at the leaf node, ADTrees produce a classification score based on the entire path used to traverse the tree. A running sum over all encountered classification nodes is maintained and the cumulative score is kept along the path from root to leaf.

Figure 2.1: An example of a conventional decision tree (a) and the equivalent alternating decision tree (b). The circular nodes represent decisions.
The square nodes in the decision tree represent classification leaf nodes, and in the alternating decision tree they represent predicate nodes.

The sign of the cumulative score is the classification and the magnitude is a measure of classification confidence.

Training

Major improvements to the accuracy of decision tree classifiers can be achieved by training using boosting, and Adaboost has been shown to be an effective basis for improving alternating decision tree classifier construction [Freund and Mason, 1999]. The following is a brief overview of an approach for boosted learning of alternating decision trees, included to provide an impression of both how ADTrees are trained and how Adaboost works. This algorithm creates a binary classifier, but a multiclass classifier can be produced by training N trees for N classes using a one-vs-all training approach and selecting the classification from the tree with the greatest score.

Let S = {(x_1, y_1), ..., (x_n, y_n)} be a set of training data where each x_i is a feature vector and y_i ∈ {−1, 1} is its label. C is a set of conditions, binary tests for the presence of a single feature in an instance x. A precondition is a set of conditions needed to reach a specific predicate node. A rule, denoted r, is a combination of the precondition necessary to reach a decision node and the condition at that node. The decision tree is a set of decision node rules R. P is a set of preconditions associated with the possible locations where a new rule can be added.

Initially, P_1 = {T}, where T is a constant true predicate representing the root node. Each training point has an associated positive weight w_{i,t}, the weight for training point i on training round t. Initially, w_{i,0} = 1 for all training points i. P_t and R_t correspond to the two sets at boosting iteration t.
W_+(c) and W_−(c) represent the sums of the weights of the training points that satisfy condition c and have labels y = +1 and y = −1 respectively.

Initialize R_1 with a rule whose precondition and condition are both equal to T and with

r_1(x) = (1/2) ln( W_+(T) / W_−(T) )    (2.1)

For t = 1, 2, ..., N:

1. For each precondition c_1 ∈ P_t and each condition c_2 ∈ C, find the impurity measure

Z_t(c_1, c_2) = 2( sqrt( W_+(c_1 ∧ c_2) W_−(c_1 ∧ c_2) ) + sqrt( W_+(c_1 ∧ ¬c_2) W_−(c_1 ∧ ¬c_2) ) ) + W_+(¬c_1) + W_−(¬c_1)    (2.2)

2. Choose the c_1 and c_2 that minimize Z_t(c_1, c_2). Construct a new rule r_t with precondition c_1, condition c_2 and prediction values

r_t(x) = (1/2) ln( W_+(c_1 ∧ c_2) / W_−(c_1 ∧ c_2) ) if c_2 holds, and (1/2) ln( W_+(c_1 ∧ ¬c_2) / W_−(c_1 ∧ ¬c_2) ) if ¬c_2 holds    (2.3)

3. Set P_{t+1} = P_t ∪ { c_1 ∧ c_2, c_1 ∧ ¬c_2 }

4. Update the training sample weights: w_{i,t+1} = w_{i,t} e^{−r_t(x_i) y_i}

The classification and score for a sample x are found by comparing x against all of the rules R_{t+1}, which is equivalent to traversing the tree:

class(x) = sign(score(x))    (2.4)

score(x) = Σ_{t=1}^{T} r_t(x)    (2.5)

Determining an appropriate number of training rounds T before training is not generally possible, but since over-training is unlikely, training can continue until the test error asymptotes.

In our final experiments, we used a structured SVM learning algorithm rather than a boosted decision tree because our goal was to learn a set of weights associated with spatial relationships between true positive detections. This was not a classification problem in the context of decision trees, so they were not an appropriate choice.

2.2 Support Vector Machines

Support vector machines (SVMs) are a class of machine learning algorithms introduced by [Cortes and Vapnik, 1995] and used for classification and regression analysis using pattern recognition. Given a set of training data S = {(x_1, y_1), ..., (x_n, y_n)}, where each x_i is a vector in a D-dimensional linear vector space and y_i ∈ {−1, 1}, SVMs identify a hyperplane in this space that can separate the input data into two categories.

SVMs have several useful properties that make them widely used in machine learning.

•
SVMs can be applied to a wide variety of problem types with differing input and output spaces.

• The optimization problem that SVMs present is typically convex, so finding a unique global optimum can be assured.

• SVMs are efficient to train because the optimization problem required is sparse with respect to the training data.

• Both linear and non-linear solutions can be found using the "kernel trick", a technique to compute the inner product between the high-dimensional features corresponding to two points without explicitly computing the high-dimensional features.

• SVMs can handle training data with misclassified data points or other outliers using a soft-margin formulation.

Often input data needs to be mapped to a more expressive hypothesis space than its raw representation, so variables or features are computed from the training data inputs and a hyperplane is computed which divides the categories in the feature space. The set of features from an example is called a feature vector, and the vectors that lie close to the hyperplane are called support vectors.

Figure 2.2: A visualization of a 2D SVM classifier. The hyperplane is computed between the solid and dotted line data points.

2.2.1 Linear Support Vector Machines

A hyperplane can be defined by the equation ⟨w, x⟩ + b = 0 where (w, b) ∈ R^M × R. In an SVM, the hyperplane is optimized to maximize the distance from the plane to the support vectors of both categories. This distance is called the margin and, generally, the larger the margin between the support vectors and the hyperplane, the lower the error in classification. The following overview of SVM classification is based on [Smola and Schölkopf, 2004].

The margin is defined as:

γ_i = y_i( ⟨w, x_i⟩ + b )    (2.6)

and γ_i > 0 indicates a correct classification. See Figure 2.2 for a visualization of the SVM hyperplane and margin.

The geometric margin is a normalized version of the margin used to measure the Euclidean distance of the margin,

γ^g_i = y_i( ⟨w, x_i⟩ + b ) / ‖w‖
(2.7)

We can pose the computation of the SVM as an optimization problem where the objective is to maximize the geometric margin:

max_{w,b} γ^g    s.t.  y_i( ⟨w, x_i⟩ + b ) / ‖w‖ ≥ γ^g,  i = 1, ..., l    (2.8)

The geometric margin is equal to 1/‖w‖ (proof omitted), so the optimization problem can be reformulated as

min_{w,b} (1/2) ⟨w, w⟩    s.t.  y_i( ⟨w, x_i⟩ + b ) ≥ 1,  i = 1, ..., l    (2.9)

This optimization problem assumes that the data are linearly separable and there is no misclassified training data, but that limitation will be dealt with later. In order to optimize this equation and employ the kernel trick later, it is necessary to give the problem a dual representation. In many optimization problems it is useful to convert the primal (original) representation of a problem into a Lagrangian dual problem, because the solution to the dual problem provides a lower bound on the solution of the primal problem. To find the dual representation, we first need the Lagrangian primal representation.

For an optimization problem of the form

min_X f(X)    s.t.  g_i(X) ≤ 0, i = 1, ..., j;   h_i(X) = 0, i = 1, ..., k    (2.10)

the Lagrangian primal form is

L(X, α, β) = f(X) + α g(X) + β h(X)    (2.11)

where the α_i and β_i are called Lagrange multipliers.

Using this definition, the Lagrangian primal form of Equation 2.9 is

L(w, b, α) = (1/2) ⟨w, w⟩ − Σ_{i=1}^{l} α_i [ y_i( ⟨w, x_i⟩ + b ) − 1 ]    (2.12)

Using the Kuhn-Tucker conditions [Kuhn and Tucker, 1951], the dual Lagrangian problem can be determined from the primal. The dual problem takes the form of the quadratic optimization

max_α f(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j ⟨x_i, x_j⟩    s.t.  Σ_{i=1}^{l} y_i α_i = 0;   α_i ≥ 0, i = 1, ..., l    (2.13)

Once the maximizing α has been computed, the max-margin hyperplane determined by (w*, b*) can be computed by

w* = Σ_{i=1}^{l} y_i α_i x_i,    b* = −(1/2) [ max_{y_i=−1}( ⟨w*, x_i⟩ ) + min_{y_i=1}( ⟨w*, x_i⟩ ) ]    (2.14)

2.2.2 Non-linear Support Vector Machines

Often training examples are not linearly separable in the initial problem space, so SVMs map the input data points into a higher dimensional space within which they are linearly separable.
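Returning briefly to the linear case, the recovery of (w*, b*) from a solved dual in Equation 2.14 can be checked by hand on a toy problem. The two-point data set and its multipliers below are invented for illustration:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recover_primal(alphas, xs, ys):
    """Recover (w*, b*) from a solved dual as in Equation 2.14:
    w* = sum_i y_i alpha_i x_i, with b* placing the hyperplane midway
    between the closest positive and negative projections."""
    dim = len(xs[0])
    w = [sum(y * a * x[d] for a, x, y in zip(alphas, xs, ys))
         for d in range(dim)]
    b = -0.5 * (max(dot(w, x) for x, y in zip(xs, ys) if y == -1)
                + min(dot(w, x) for x, y in zip(xs, ys) if y == +1))
    return w, b

def margin(w, b, x, y):
    """The margin of Equation 2.6; positive iff x is correctly classified."""
    return y * (dot(w, x) + b)
```

For the points (1, 0) labeled +1 and (−1, 0) labeled −1 with α = (0.5, 0.5), this recovers w* = (1, 0) and b* = 0, and both points sit exactly at margin 1, consistent with the canonical constraint in Equation 2.9.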
However, in a higher dimensional feature space, computing the SVM is more computationally expensive. SVMs use a technique called the kernel trick [Aizerman et al., 1964] to avoid this problem, which allows the algorithm to fit a maximum margin hyperplane without needing to compute the higher dimensional features explicitly. For a full examination of non-linear support vector machines and the application of the kernel trick, see [Burges, 1998].

2.2.3 Structured Support Vector Machines

Our model in Chapter 7 uses a technique called structured support vector machines [Tsochantaridis et al., 2004] to learn a set of weights associated with inter-object spatial relationships. Structured output prediction is an area of machine learning concerned with the prediction of complex, structured outputs. Structured SVMs learn a discriminative function from X to discrete outputs Y, where Y is a complex output space that could contain sequences, strings, labeled trees, graphs, etc. In regular multi-class SVMs, the output space is a set of labels Y = {1, ..., K}. Naively, a regular multi-class SVM could be used to produce complex outputs by treating each possible state as a label. However, this approach would result in a very large and complex output space, and computing a SVM in that space is infeasible. Structured SVMs solve this problem by taking advantage of dependencies within Y and a max-margin algorithm to learn the weights w.

Figure 2.3: An example of an input x, output y and feature map Ψ(x, y) for a structured SVM designed to determine a Probabilistic Context Free Grammar. Figure reproduced from [Tsochantaridis et al., 2004].

Problem Formulation

Structured SVMs learn a discriminant function F : X × Y → R over input/output pairs. Prediction is performed by maximizing F over the response (output) variable for an input x. The hypothesis f is then

f(x; w) = arg max_{y ∈ Y} F(x, y; w)    (2.15)

where w is a parameter weight vector.
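Prediction in this framework is an explicit search over the output space. A toy sketch, assuming the common linear form F(x, y; w) = ⟨w, Ψ(x, y)⟩ and a small enumerable set of candidate outputs; the feature map and weights below are invented for illustration:

```python
def predict(x, w, candidates, psi):
    """Equation 2.15 as explicit search: return the candidate output y
    maximizing F(x, y; w) = <w, psi(x, y)> (linear form assumed)."""
    return max(candidates,
               key=lambda y: sum(wi * fi for wi, fi in zip(w, psi(x, y))))

# Invented toy problem: label a number as 'small' or 'large' using one
# feature block per candidate label (a multiclass-style joint feature map).
def psi(x, y):
    blocks = {"small": [x, 1.0, 0.0, 0.0],
              "large": [0.0, 0.0, x, 1.0]}
    return blocks[y]

w = [-1.0, 2.0, 1.0, -2.0]  # hypothetical learned weights
```

With these weights, predict(5.0, w, ["small", "large"], psi) selects "large" while predict(1.0, ...) selects "small". For genuinely structured Y (trees, detection subsets), the same argmax must be computed by a specialized search rather than enumeration, which is exactly where inference cost arises.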
F is assumed to be linear in Ψ(x, y), a combined feature representation of inputs and outputs:

F(x, y; w) = ⟨w, Ψ(x, y)⟩    (2.16)

where Ψ(x, y) will have different forms depending on the problem space. The structure of Ψ(x, y) has a significant impact on the complexity of the problem. In Figure 2.3 we provide an example from [Tsochantaridis et al., 2004] of the X, Y and Ψ(x, y) for the problem of learning a Probabilistic Context Free Grammar from natural language processing using a structured SVM. In this example, the problem is determining a parse tree y for each sentence x using a set of grammar rules g. Ψ(x, y) for this problem is a histogram counting how often each grammar rule occurs in tree y for sentence x.

Using structured SVMs requires Δ(y, ȳ), a loss function that quantifies the difference between a prediction ȳ and the true output y. Training any SVM requires a loss function, but using a simple zero-one loss function for comparing structured outputs typically does not work well. A loss function is required that can capture whether an output is close to correct or completely wrong.

Given a problem space and a loss function where Δ(y, ȳ) > 0 for y ≠ ȳ and Δ(y, y) = 0, we can formulate the problem of zero training error as a set of non-linear constraints

∀i :  max_{y ∈ Y \ y_i} { ⟨w, Ψ(x_i, y)⟩ } < ⟨w, Ψ(x_i, y_i)⟩    (2.17)

These in turn can be rewritten as a set of linear constraints

∀i, ∀y ∈ Y \ y_i :  ⟨w, δΨ_i(y)⟩ > 0    (2.18)

where δΨ_i(y) = Ψ(x_i, y_i) − Ψ(x_i, y).

Given that there can be many possible solutions w* satisfying the constraints in Equation 2.18, Tsochantaridis et al. propose selecting the w with ‖w‖ ≤ 1 for which the score of the correct label y_i is most different from that of the next closest output. This step incorporates the max-margin principle from general SVMs into the problem of producing a structured output. The resulting formulation is the following:

SVM_0 :  min_w (1/2) ‖w‖²    s.t.  ∀i, ∀y ∈ Y \ y_i :  ⟨w, δΨ_i(y)⟩ ≥
1    (2.19)

This formulation does not allow for training error, so slack variables are introduced to optimize for a soft-margin criterion. Also, Equation 2.19 is for the zero-one classification case and does not incorporate the loss function Δ(y, ȳ). There are two approaches to incorporating slack variables and Δ(y, ȳ). The first is slack rescaling, which follows the model proposed by [Crammer and Singer, 2002], where a slack variable is introduced for every non-linear constraint and then rescaled according to the loss from each constraint:

SVM^Δs :  min_{w,ξ} (1/2) ‖w‖² + (C/n) Σ_{i=1}^{n} ξ_i    s.t.  ∀i : ξ_i ≥ 0;  ∀i, ∀y ∈ Y \ y_i :  ⟨w, δΨ_i(y)⟩ ≥ 1 − ξ_i / Δ(y_i, y)    (2.20)

The second approach is margin rescaling by [Taskar et al., 2003], which adjusts the margin constraints. The margin rescaling formulation of the structured SVM problem is

SVM^Δm :  min_{w,ξ} (1/2) ‖w‖² + (C/n) Σ_{i=1}^{n} ξ_i    s.t.  ∀i : ξ_i ≥ 0;  ∀i, ∀y ∈ Y \ y_i :  ⟨w, δΨ_i(y)⟩ ≥ Δ(y_i, y) − ξ_i    (2.21)

Learning

The potentially very large number of constraints associated with solving a structured SVM presents a challenge. The algorithm presented in [Tsochantaridis et al., 2004] to solve this problem is called the cutting plane method; it iteratively finds a small set of constraints S_i for each training example. Constraints are added by repeatedly identifying the most violated constraints for each training example under the intermediate solutions and adding them to S. The intermediate solution weights w are determined from α_{iy}, the Lagrange multipliers enforcing the margin constraints from S, using standard quadratic programming techniques.

The following is the pseudocode for the cutting plane method used in both [Tsochantaridis et al., 2004] and the StructSVM toolkit [Vedaldi, 2011] for computing a structured SVM:

Algorithm 1 Algorithm for computing a structured SVM for both slack and margin rescaling

Precondition: ((x_1, y_1), ..., (x_n, y_n)), C

1: S_i ← ∅ for all i = 1, ..., n
2: repeat
3:   for i = 1, ..., n do
4:     w ← Σ_j Σ_{ȳ ∈ S_j} α_{jȳ} δΨ_j(ȳ)
5:     if using slack rescaling then
6:       SVM^Δs :  H(y) = (1 −
⟨δΨ_i(y), w⟩) Δ(y_i, y)
7:     else
8:       SVM^Δm :  H(y) = Δ(y_i, y) − ⟨δΨ_i(y), w⟩
9:     end if
10:    Compute ŷ = arg max_{y ∈ Y} H(y)
11:    Compute ξ_i = max(0, max_{y ∈ S_i} H(y))
12:    if H(ŷ) > ξ_i then
13:      S_i ← S_i ∪ {ŷ}
14:      α_S ← optimize the dual over S, S = ∪_i S_i
15:    end if
16:   end for
17: until no S_i has changed over an entire iteration

The algorithms for slack and margin rescaling are the same except at step 5, where the appropriate cost function is selected. In step 10, the algorithm determines the most violated constraint, with output ŷ. Efficiently determining the most violated constraint is potentially very computationally intensive and, since it is based on the cost function H(y), different approaches are necessary for the slack and margin rescaling solutions.

Implementation

Tsochantaridis et al. provide code for learning structured SVMs in a toolkit called StructSVM [Vedaldi, 2011]. This toolkit allows for the implementation of a structured SVM given a problem formulation defined by three functions. First is a loss function Δ(y, ȳ), which should encode loss in a more sophisticated manner than simple zero-one loss. The second is a feature mapping function Ψ(x, y), which maps the combined input and output space onto a joint feature space. Last is the maximization step performed in step 10 of the pseudocode algorithm, which finds the most violated constraints.

Chapter 3

Context and Object Detection

The overall goal of this chapter is to ground our work in the existing literature of object detection using context. This chapter describes a number of approaches to performing object detection with the aid of context and contrasts them with our work. We have divided the chapter into two sections. First we describe a number of approaches that explicitly use 3D contextual information, either about the structure of the scene or the layout of the objects, to perform object detection or similar tasks. The second section describes a number of 2D
The second section describes a number of 2D object detectors that perform joint detection of objects in the scene using context.

3.1 Image-based Object Detection

An object is a physical "thing" in an environment with a semantic label (i.e., a word which places it in a category). Objects can be separable from the environment or permanently attached elements. Semantic labels provide a way of clustering different objects into a category or class (things with the same label) and a way of decomposing a scene into objects (things with different class labels). Each object has an associated region, the pixels in an image that correspond to that object.

A class is an abstract grouping of individuals, other classes or both, also referred to as a category, sort, or type. The generic term "Object" is often used as the root of a taxonomy, a hierarchy where each level describes a logical grouping of entities based on shared physical properties, common uses, semantic usage, or other similarity metrics. Belonging to the object class entails that objects exhibit certain properties, such as occupying a region of space. Taxonomic hierarchies are based on an "is a" relationship (e.g., X is a Y if X is either equal to or a child node of Y). These groupings serve both to define the properties of an object and to differentiate it from other objects.

Object recognition is the general problem in computer vision of detecting objects or object classes in an image. It is a term used to describe a number of different types of problems. Object recognition problems focus on the task of identifying one or more target object classes. Object Classification is the binary classification problem of determining whether an image or part of an image contains a target object. If there are multiple target object types, then the problem is called Object Categorization. Object Localization is the process of determining regions within an image for each occurrence of the target object.
Object Detection is the problem of both classifying/categorizing and localizing one or more target object types in an image.

3.1.1 Object Detectors

Most detectors produce two outputs for a positive object detection: a region and a confidence score. Object regions are the results of object localization, generally represented as rectangular detection boxes or pixel-based image masks. The confidence score associated with a detection is a measure of how closely the object's appearance in the image matches the model used for detection. Depending on the model, this score may be derived from a margin of classification for support vector machines (SVMs) or a probability for probabilistic models. Other models simply have the property that there is a monotonic relationship between score and the likelihood that a detection is correct. Object detectors designed for a broad class of objects with significant variance in appearance (e.g., humans, chairs, cars, etc.) are called generic object detectors.

Most object detectors use image features as the basis for the model. Image features represent a portion of an image (corresponding to an object) in a manner which is invariant to changes in illumination, orientation and viewpoint. This invariance is important if an object detector model is to generalize from its training data to test data. Examples of successful image features include the Scale Invariant Feature Transform (SIFT) [Lowe, 2004] and the histogram of oriented gradients [Dalal and Triggs, 2005]. These features are based on image gradients, but image features have also been based on the silhouette or contours of objects [Shotton et al., 2008], [Helmer and Lowe, 2010].
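The detector outputs described above can be captured in a small sketch. The `Detection` container, intersection-over-union helper and score threshold below are illustrative assumptions rather than the interface of any particular detector; the point is simply that a detection pairs a rectangular region with a confidence score, and that the monotonic score/correctness relationship makes ranking and thresholding by score meaningful.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    label: str                               # semantic class label
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    score: float                             # detector confidence

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def keep_confident(dets: List[Detection], threshold: float) -> List[Detection]:
    # Because the score is monotonically related to the likelihood of a correct
    # detection, ranking and thresholding by score trades recall for precision.
    return sorted([d for d in dets if d.score >= threshold],
                  key=lambda d: d.score, reverse=True)
```

Overlap measures like `iou` are the usual way a detection box is later matched against ground truth when evaluating a detector.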
For a broad survey of image features and a comparison of different object detection approaches, see [Zhang et al., 2007].

In this work, we use object detectors designed to recognize man-made objects in indoor scenes, in contrast to detectors designed to identify natural objects (e.g., trees, animals, etc.) or background scene elements (e.g., roads, sky, grass, etc.). Most object detectors represent the objects as a collection of image features, with either a sparse or dense representation.

Part-and-shape object detectors break down the object detection problem into multiple detection problems for constituent object parts and model the spatial distribution of those parts. These models are technically more challenging but have shown great success at object detection [Fergus et al., 2005], [Felzenszwalb and Huttenlocher, 2005], [Felzenszwalb et al., 2008].

Sparse representation object detectors are the most widely used approach to object detection [Zhang et al., 2007] and work by identifying a small set of consistent visual patterns in the target object and representing objects as a collection of image features. Sparse object detectors are computationally efficient because they function in a sparse feature space rather than trying to model the dense pixel information. Many early approaches to sparse object detection used orderless bag-of-features models [Grauman and Darrell, 2007; Zhang et al., 2007], which have origins in text recognition. In this model, image features are first extracted from object regions. From these, a "codebook"
of highly discriminatory features is learned, and histograms or other counts of these features are used for object detection. This technique can be improved using spatial pyramids, where features are computed over image cells defined by a recursive multi-level image decomposition [Lazebnik et al., 2006]. Orderless bag-of-features models are limited in their ability to perform object detection by the lack of structural information about the layout of features and have generally been more useful for image classification.

Dense representation object detectors densely model the appearance of objects. Detection can be performed at the pixel level (e.g., template matching) or through densely computed image features [Dalal and Triggs, 2005]. For most generic object detection, using a dense rigid template to model the appearance of an object is ineffective because appearances vary too much within the class. However, combining rigidly computed feature-based matching with part-and-shape based detection has led to very effective generic object detectors [Felzenszwalb et al., 2008], [Bourdev et al., 2010].

3.1.2 Discriminatively Trained Deformable Part Model

In our work we use an image-based object detector to produce our 3D object detections. The discriminatively trained deformable part model (DPM) [Felzenszwalb et al., 2008] is a highly effective object classifier which has shown significant success in the Pascal Visual Object Classes Challenge [Everingham et al., 2010]. It is a part-and-shape based model that uses a coarse global template that matches the entire object and multiple higher resolution part templates.

The underlying feature used for the templates is the histogram of oriented gradients (HOG) [Dalal and Triggs, 2005]. HOG decomposes an image region into a grid of cells, computes a histogram of image gradients within each cell, normalizes those histograms and returns a feature for each cell.
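The per-cell computation can be sketched as follows. This is a deliberately simplified version (per-cell L2 normalization only; the full descriptor of Dalal and Triggs also normalizes over overlapping blocks of cells and interpolates votes between neighboring bins), and the function name and parameters are our own.

```python
import numpy as np

def hog_features(img, cell=8, bins=9):
    """Minimal HOG sketch: per-cell orientation histograms of image gradients."""
    gy, gx = np.gradient(img.astype(float))        # gradients along rows, cols
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)
    H, W = img.shape
    rows, cols = H // cell, W // cell
    feats = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            m = mag[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell].ravel()
            a = ang[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell].ravel()
            # vote each pixel's magnitude into its orientation bin
            idx = np.minimum((a / (180.0 / bins)).astype(int), bins - 1)
            for i, w in zip(idx, m):
                feats[r, c, i] += w
            n = np.linalg.norm(feats[r, c])
            if n > 0:
                feats[r, c] /= n                   # per-cell L2 normalization
    return feats
```

A vertical intensity edge, for instance, produces purely horizontal gradients and so concentrates its votes in the 0-degree orientation bin of the cells it crosses.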
The HOG features are computed over an image pyramid to capture both coarse and fine level object features. The HOG features are matched against the object model using filters which specify weights for subwindows of the HOG pyramid. The score of a filter is the dot product of the weight vector and the features in the subwindow of the HOG pyramid. An object model then consists of a root filter F0 which captures the entire object and n part filters (P1, . . . , Pn) where Pi = (Fi, vi, si, ai, bi). Here Fi is a part filter, vi and si are the center and size of a box of possible part positions relative to the root, and ai and bi specify the coefficients of a quadratic function which measures the score of the placement of Pi.

The weights used for scoring are learned from a training set of images annotated with the type and bounding box for each object type being detected. The task is binary classification given an input set of labeled examples D = ((x1, y1), . . . , (xn, yn)), where xi is a HOG pyramid H(xi) together with a set of valid locations for a root filter Z(xi), and yi ∈ {−1, 1} is a label. Positive and negative training sets are constructed from this data, with the negative set containing only images where there are no instances of the classification object. The underlying model is called latent SVM, a reformulation of multi-instance SVM [Andrews et al., 2002] in terms of latent variables. The approach alternates between fixing latent values for positive examples and optimizing the SVM objective function.

The approach commonly used with the DPM is to learn multiple object classifiers, each of which is designed to capture the object type either from a different viewpoint or to capture significant variations in structure within the class.
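The scoring of a placement combines the root filter response with, for each part, the best part response discounted by its quadratic displacement cost. The toy sketch below does a brute-force placement search over precomputed response maps; the variable names are ours, and the real implementation searches the whole HOG pyramid efficiently with generalized distance transforms rather than this exhaustive loop.

```python
import numpy as np

def deformation_cost(dx, dy, a, b):
    # quadratic penalty for displacing a part from its anchor position
    return a[0] * dx + a[1] * dy + b[0] * dx * dx + b[1] * dy * dy

def dpm_score(root_score, part_scores, anchors, a, b):
    """Toy DPM score at one root location: root filter response plus, for each
    part, the best filter response minus that part's displacement cost."""
    total = root_score
    for resp, (ax, ay), ai, bi in zip(part_scores, anchors, a, b):
        h, w = resp.shape
        best = -np.inf
        for y in range(h):
            for x in range(w):
                best = max(best, resp[y, x] - deformation_cost(x - ax, y - ay, ai, bi))
        total += best
    return total
```

A part whose response peaks exactly at its anchor contributes its full score; a peak one cell away is discounted by the quadratic cost, which is how the model trades part evidence against deformation.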
This is done by sorting and then splitting the training data set according to the ratio of the height and width of the object boxes. Figure 3.1 shows a visualization of the two models learned for a bowl classifier.

Figure 3.1: Visualizations of the Felzenszwalb et al. object classifier: (a) first component of the bowl model (corresponding to a sideways view); (b) first component of the parts model of the bowl; (c) second component of the bowl model (corresponding to a top-down view); (d) second component of the parts model of the bowl. The images on the left show an expected intensity of gradient in a grid pattern for the entire object. On the right, they show the gradients in the parts model.

Detection is performed using a scanning window approach. The score for a detection is derived from the score of the overall template matched against the window and the sum of all the part templates minus a deformation cost. The scores are computed by the dot product of the histogram of gradients and a learned set of weights.

3.2 Context in Object Detection

While it is agreed upon in both the psychological and computer vision communities that context is valuable for object recognition, defining context is a challenge. Often it is left undefined and implicitly refers to whatever effects the researcher is examining that influence the objects in a scene or image. Context was defined by [Strat, 1993] as "any and all information that may influence the way a scene and the objects within it are perceived". This seems an overly broad definition, as it would include object appearance in context, so we define context as "all information that affects the appearance or layout of a scene or the objects it contains".

[Divvala et al., 2009] published a broad study of the effects of context on object detection in images and created a taxonomy of contextual effects and types of contextual information which we use as the basis for discussing context. We have expanded this to consider context in 3D environments as well as images.
The following is a list of the effects that context has on objects which influence this thesis.

• Object Presence: Context affects whether an object is present within a given area. In order to determine object presence, the area of consideration needs to be defined. For images the area is defined by the boundaries of the image. In 3D the area needs to be defined and could, for example, be a room, a building, or a semantically defined area like a kitchen or office. Information about the type of scene provides a strong prior on the types of objects likely to be found. The presence or absence of other objects in a scene or image also has a contextual effect on object presence (e.g., toilet presence is correlated with bathtubs but not with microwaves).

• Object Position: Context affects where objects are located. Object positions are influenced by many factors such as the type of scene, the structure of the immovable scene elements (e.g., walls, floors, work surfaces, etc.), the position of other objects and the areas where activities are typically performed. In this thesis, we are particularly concerned with the way object positions are influenced by other object positions. As a source of contextual information, object positions provide insight into both the immovable elements of the scene and the areas where activities are performed.

• Object Appearance: Context affects the way an object's appearance changes in different scenes. Appearance is influenced by factors such as illumination, weather and the viewpoints from which information about the scene is acquired. Context can also influence the physical structure (and therefore appearance) of a class of objects in different scene types. For example, a chair in an office environment is physically very different than one in a kitchen or on an airplane.

Our work is primarily concerned with the effects of context on object position, but object presence and appearance are both factors that we must consider.
In object recognition challenges like the PASCAL Challenge [Everingham et al., 2010], object presence can have a significant effect because the objects in the challenge come from many different types of scenes [Desai et al., 2009] and for each scene there are few objects likely to be found. However, in the model we present in Chapter 7 we already know the scene and are only searching for objects commonly found in that scene type. Not being able to use object presence as a factor in improving object detection accuracy makes our work more challenging, but our goal is to demonstrate the effects of object position, not object presence.

We primarily considered the effect of object appearance in our selection of training data. For example, when training our chair classifier, we used the query "office chair" rather than simply "chair" in order to get images more like the objects in our training and test data. Other contextual appearance effects, like illumination, also affect the objects in our scenes. There are significant differences in lighting, and therefore appearance, between objects in real world scenes and objects in commercial product shots. To accommodate this, when selecting images for training our object detectors, we ensured that a variety of different sources with different illumination conditions were used. This leads to an object detector able to accommodate different lighting conditions.

3.2.1 Contextual Sources

[Divvala et al., 2009] also created a taxonomy of contextual information sources. We found their taxonomy very broad and complete, but there is no clear motivation for the division of their categories.
Some categories were based on the cause of the context information (e.g., weather, culture, 3D scene structure, etc.) while others were based on how that contextual information is manifested in the image (e.g., local pixel effects, 2D global statistics, etc.). The following list is a modified version of their taxonomy we use in this thesis, where all the categories are based on the cause of the context.

• Scene Context is contextual information based on analysis of the scene as a whole. The type of scene is a strong contextual information source. Context can be provided by low-level global image statistics of the scene, often referred to in the computer vision literature as gist [Oliva and Torralba, 2001], [Russell et al., 2007], which learns contextual effects without explicitly classifying the scene. Other approaches use context derived from an explicitly determined scene type or the use of keywords associated with the scene or image [Li and Li, 2010], [Oliva and Torralba, 2001], [Carolina Galleguillos and Lanckriet, 2010]. Scene context also includes context provided by activities present in the scene.

• 3D Geometric Context is contextual information based on the 3D scene structure. This includes detection of supporting surfaces, contact points, visual occlusions and alignment with walls or other scene structures [Hoiem et al., 2006].

• Object Context is contextual information based on object-object interactions. This includes object presence and object position effects in 2D [Galleguillos and Belongie, 2010], [Carolina Galleguillos and Lanckriet, 2010], [Galleguillos et al., 2008], [Rimey and Brown, 1994], [Neumann and Möller, 2007] or 3D [Hoiem et al., 2007], [Anguelov et al., 2005]. Object context in 3D is the main source of context we examine in this thesis and will be discussed in greater detail in Chapter 5.

• Photographic Context is contextual information based on an image's intrinsic properties and how it is acquired.
This includes the camera's height, focal length, orientation, and other internal properties. It also includes cultural effects which influence how the operator targets the camera, such as the common position and pose of people in images, the effects of framing [Simon and Seitz, 2008], and the selection of culturally appropriate subject matter. Photographic context also includes effects resulting from how a robot acquires images [Meger et al., 2008]. Some of these photographic contextual effects also influence 3D scenes that are acquired from images or from scanning devices that are targeted like a camera.

• Environmental Context is contextual information based on the outdoor environment of the scene. This includes the effects of lighting (both natural and artificial) [Lalonde et al., 2008], weather [Narasimhan and Nayar, 2002], and terrain types [Hays and Efros, 2008]. It is similar to scene context but the effects come from sources a long way away from the scene (sun, clouds, distant terrain, etc.).

• Temporal Context is contextual information based on temporal information about the scene. Temporal information about the events before and after the image can be determined from a video or a set of sequential images [Liu et al., 2008].

In order to focus the taxonomy on only contextual sources and incorporate 3D scenes as well as images, we made the following changes from the original taxonomy in [Divvala et al., 2009]. We removed "Local Pixel Context", which combined effects from many sources that manifested locally in the image. We combined "2D Scene Gist Context" and "Semantic Context" into the broader category of "Scene Context" since both are broad analyses of the scene as a whole. We combined "Illumination", "Weather" and "Geographic" context into the broader category of "Environmental" context because of potential overlaps in contextual sources (e.g., is smog weather or geography, and is the effect of city street lights illumination or geography). We also combined "Cultural"
context with "Photogrammetric" context into the "Photographic" category because it is unclear whether effects like framing or intentionally blurred backgrounds in photos have photogrammetric or cultural origins.

Generally, all of these contextual sources have some influence on all the contextual effects: object presence, position and appearance. They are complementary sources of information, though sometimes one dominates another. For example, from the objects in a scene, it is often possible to determine the scene type, as we will discuss in Chapter 4. Our work focuses on object context, but object context can provide 3D geometric context because from the relative positions of objects in a scene it is possible to infer scene structure.

3.3 Object Detection Using 3D Context

The most significant element of our work that differentiates it from previous work on context in object detection is the use of 3D data and spatial relationships. We are not the first to explore how 3D data can effectively be used to improve object classification. This section covers several different examples of research that have employed context in 3D environments for object detection and how they are applicable to or affect the work presented in this thesis.

3.3.1 Context and the Spatial Semantic Hierarchy

The work of Kuipers et al. has long focused on the problem of a robot embedded in a rich 3D environment, attempting to bootstrap knowledge about places, objects and actions. [Kuipers et al., 2000] introduced the concepts of the Spatial Semantic Hierarchy, a model with multiple interacting representations of spatial knowledge for robots. The model combined qualitative and quantitative representations of space. Of particular interest to us are the upper-level, abstracted layers that decomposed space along high-level concepts of paths, places, and regions.
In [Kuipers et al., 2000], objects had a limited role in the spatial semantic hierarchy model, but they expanded on this in their work on autonomous object discovery [Modayil and Kuipers, 2004]. Their approach separated objects from their surroundings, using 3D geometric contextual information both to discover new objects and to produce object detection models from 3D range data. They combined concepts from their spatial semantic hierarchy and object discovery into the Object Semantic Hierarchy [Xu and Kuipers, 2010]. This multilayer approach discovers objects by modeling the contextual surroundings, identifying new 2D views of objects and using them to produce 3D models of objects.

The work of Kuipers et al. demonstrates well how embedded robots can reason about their surroundings, learn object models in both 2D and 3D and use those models to automatically discover structure and objects. Their work on object discovery was influential in how we think about leveraging context both for object discovery and for identifying new information about known objects. They use context to improve object detection by performing better object-background segmentation and using this information to create better models. Our approach differs significantly from theirs in our greater emphasis on spatial relationships, our use of web-acquired training data for object recognition and less emphasis on the specifics of robotic implementation.

3.3.2 Monocular 3D Context

The authors of the study on context we used as the basis of Section 3.2 [Divvala et al., 2009] have demonstrated an approach to object detection that creates a coarse 3D model of the scene computed from a single image from an uncalibrated camera and uses the 3D contextual information to improve object detection [Hoiem et al., 2006]. Their approach computes object spatial properties (height and scene depth) using camera height and the horizon line. Using an image-based object detector, they detected object candidates in the image.
They then used the estimated heights to recompute the candidates' likelihoods using a probabilistic framework. We adapt this approach of computing detections using standard object recognition techniques, augmenting them with 3D information and adjusting them to remove false positives. Their work is restricted by using only a single image, which meant the 3D information was limited and they were only able to detect objects on the ground plane. They also did not use any object-object relationships, likely because they were only applying their detector to a small number of object classes (cars and pedestrians).

Subsequent work by them, [Hedau et al., 2009], has significantly improved their ability to determine the 3D structure of the scene in terms of object positions. They demonstrated an impressive level of accuracy in computing the structure of fixed elements of a scene and approximating the 3D shape of the scene objects. This work has progressed to allow them to compute the support relationships between objects [Silberman et al., 2012]. Others have had similar success at determining the 3D layout of scenes from single images. [Lee et al., 2010] and [Tsai et al., 2011] have demonstrated that a robot can, with a single image, extract information about the 3D structure of the fixed structural elements of the scene. [Fouhey et al., 2012] demonstrates how human pose estimation can be used to further improve the accuracy of single image 3D geometric reconstruction by relating poses to underlying 3D scene structures (e.g., a seated human is likely on a surface with a predictable height).

Single image 3D reconstruction has bearing on our research, as our techniques for using 3D spatial relationships could be combined with this approach to perform improved object detection in 3D using a single image. However, it is unclear, given the prevalence and
However, it is unclear, given the prevalence anddiminishing cost of 3D sensors, how necessary determining 3D structure from single imageswill be for robots.3.4 Multi-object Simultaneous Object DetectionThis section covers techniques that employ context in the simultaneous detection of multipleobjects simultaneously. These techniques generally work by combining both object detec-tion results with contextual information between detection to improve the overall objectdetection results. [Hedau et al., 2009]313.4.1 Graphical Models for Contextual Object DetectionIn our work, we needed to decide on an underlying model for improving object detectionusing spatial relationships. Probabilistic graphical models are a widely used AI techniqueand are an efficient and straightforward way of specifying complex probabilistic relationshipsbetween variables and can be used to find the probability of an event given varying typesof evidence [Torralba et al., 2010]. This section examines some examples of how graphicalmodels are used in 3D contextual object detection.Some of the earliest work that employed graphical models to apply context to object classifi-cation was [Rimey and Brown, 1994] which used selective perception for scene classification.They used a Bayesian network model of a tabletop to determine where a camera shouldfocus to classify the scene efficiently. Their scene was very simple, a table set for a tea partywith a small range of object types, and the 2D spatial relationships between the objectswere provided, not learned.More commonly used graphical models for incorporating context into object detection in-clude Markov random fields (MRFs) or conditional random fields (CRFs), both types ofdiscriminative undirected probabilistic graphical models. For further details on MRFs,CRFs and their use in object detection see [Li, 2009]. 
Both MRFs and CRFs are often used in vision tasks where the nodes represent values with a spatial arrangement, such as pixels, image regions, or mesh points, and allow for the simultaneous detection of different object classes. In low level approaches, the nodes are typically only locally connected to their immediate neighbors, so it is difficult for contextual information to propagate effectively. This limitation can be overcome by adding edges between more distant nodes or including additional nodes to represent higher level concepts like entire objects [Shotton et al., 2009; Torralba et al., 2004]. [Winn and Shotton, 2006] use a shape-and-part based detection approach, with one layer of nodes representing object parts and another connected layer for detecting entire objects. Their approach also incorporates some 3D information by including edges that model the occlusion and overlap of objects in the scene.

One of the 3D object detection problems where graphical models are most often employed is mesh segmentation, the problem of detecting objects or scene structures in a triangle mesh or 3D point cloud. These meshes or point clouds can be generated by laser range finders or computed with stereo techniques, and the 3D points can include intensity or color information. For mesh segmentation, MRF nodes often correspond to mesh points, with node potentials computed from a local 3D shape descriptor at that point (e.g., spin images [Johnson and Hebert, 1999] and shape context [Belongie et al., 2002]). The edge weights represent interactions between local mesh points and might include distance or difference in surface normal. The use of 3D contextual information can include height from ground and global scene shape statistics [Kalogerakis et al., 2010].

3.4.2 Automatic Place Modeling Using Objects

[Ranganathan and Dellaert, 2007], which demonstrated an approach for performing automatic place modeling with a robot, has been influential on this thesis.
Their problem is, given an image and an associated depth map of a scene, how a robot can infer a label for the scene and label the types of observed objects and their 3D locations. They used a generative graphical model for this problem, learned from supervised training data. This resembles the problem we solve in Chapter 7, the detection of multiple objects based on their appearance and spatial layout, but their emphasis is on the easier problem of place recognition rather than object detection. In our work we are trying to identify the types of objects based on our estimates of their 3D locations, assuming the scene label is already known. Figure 3.2 shows their graphical model, which connects pixel level information on appearance and depth to place categorization. [Ranganathan and Dellaert, 2007] used a Markov random field over color/depth pixels with edges connecting to their closest neighbors in order to spatially cluster features and learn a correspondence from pixel features to object types.

One of the most novel elements of this work is the use of 3D object position for both object detection and place categorization. To describe the spatial relationships of objects in the scene, they used a variant of the constellation model, a commonly used model in part-and-shape object detectors that encodes the distribution of parts (or in this case objects) in terms of their positions relative to a base location. Using the constellation model required the assumption that all object positions were conditionally independent of each other. This differs significantly from our work, where we are trying to demonstrate that object-object relationships are very effective for improving object detection. Their reason for assuming
Their reason for assumingconditional independence of object position is that a graphical model which consideredobject-object relationships in 3D would would be too complex to produce a viable model.Their concern is using object-object spatial relationships would require a fully connectedgraph of objects (i.e., the type and position of all objects would influence all others) whichdoes not scale. Though they stated that this was preliminary work and that they hopedto refine it further, the lack of subsequent work on this model suggests that they mayhave encountered problems extending it. Instead, in their later work they moved away fromclassifying objects and instead towards global image features [Ranganathan and Lim, 2011].Though clearly ambitious and impressive work, the limitations they discussed are a con-tributing factor to why we decided against using a graphical model for this thesis and instead33Figure 3.2: The graphical model for place recognition using object detections in 3D from[Ranganathan and Dellaert, 2007]. The place label L generates a set of N object detectionsO. Each object detection has a 3D position T , a shape S and an appearance A. Theposition and shape produce a set of 3D points ?3D. These points are computed from animage containing n features with each feature having a depth d, a pixel location u and anappearance v. Figure reproduced from [Ranganathan and Dellaert, 2007]opted for the structured SVM approach for improve object detection that we will discuss inthe next section. Their representation of object spatial relationships is simple and they feltthat a more complex model would have been infeasible. In our work, we wanted to be ableto test a large number of spatial relationships to examine their comparative effectiveness,without needing to be overly concerned about model complexity. 
For this reason, we decided to avoid the use of graphical models and instead focused on the work of [Desai et al., 2009] and their use of structured SVMs to improve object detection accuracy.

3.4.3 Bayesian Compositional Hierarchy

In many contextual detection problems, the distribution of objects has a hierarchical structure. Consider the example of tables in a restaurant. At the high level, the layout of the tables can have a predictable structure; at a lower level, the layout of chairs and place settings relative to each table has structure; and at the lowest level, the arrangement of utensils in the place settings has structure. While it would be technically possible to use a model which compared the layout of every utensil to every other in the scene, a more logical approach is to break the scene up into components in a hierarchy and learn relationships at each level of the hierarchy. This is the approach used in the Bayesian Compositional Hierarchy (BCH) [Neumann, 2008], which is a probabilistic scene model with the structure from an object-centric representation of a compositional hierarchy, similar to those used in formal logic ontologies (see Figure 3.3).

Figure 3.3: The compositional hierarchy of a facade. Triangles indicate aggregate structure. Figure reproduced from [Terzić and Neumann, 2010].

The hierarchical structure of the BCH allows strong mutual dependencies between objects to be confined to specific hierarchical groups of scene objects (or aggregates). Each aggregate is modeled by a joint probability distribution P(A, B1 . . . BK, C) where A is a description of the aggregate, B1 . . . BK are parts of the aggregate and C is the spatial layout of the parts. Each part B is itself modeled by an aggregate of lower level elements, unless it is the lowest level element, in which case it is the result from an object detector (see Figure 3.4). For
For example, for a Balcony aggregate, A would be the bounding box for the whole balcony, B1, B2, B3 would be parts with bounding boxes for Door, Window and Railing aggregates, and C would be the distances between the parts. The advantage of the BCH formulation is that a joint probability distribution can be computed for each aggregate, represented as a multivariate Gaussian, and belief updates can be performed by propagation through the tree structure with a closed-form solution.

In [Kreutzmann et al., 2009], the authors propose an incremental approach to performing detection of building facades using the BCH. When performing inference using this model, it is valuable to be able to consider multiple classifications of aggregates, since a single misclassified low-level element can propagate up and result in many misclassifications. Inference is performed using a technique called beam search, which maintains multiple partial labellings of scene parts (which they call the "beam"). This approach maintains a set of partial solutions, ranked by their overall probability, and adds labels with high probability, discarding models without any additional high probability labels.

Figure 3.4: The structure of a Bayesian Compositional Hierarchy. The triangles are aggregates defined by bounding box A, parts B1 ... BK and spatial layout of the parts C. Figure reproduced from [Terzić and Neumann, 2010].

Building facades are an interesting testbed for contextual object detection because of their visual and spatial properties. The components of the facade are very hard to classify using traditional sparse object detection. Structures such as windows, doors, balconies, vents, rails, etc., are all basically rectangular, homogeneously colored and have little texture. They do, however, have very specific spatial structure, dictated by factors such as the number of floors of the building, the presence of the ground, the presence of a street, etc.
They are perfect for testing an approach like the compositional hierarchy because there are many objects to detect and these objects are grouped together into larger structures.

We considered a probabilistic approach like the BCH for our model of object relationships but decided against it for a few reasons. While the scenes we are working with have some hierarchical structure (e.g., tables in offices and counters in kitchens), it is very limited, and detecting enough objects to build up this structure is beyond the scope of our work. We were concerned that, without this hierarchical structure and the strong object-object influences within aggregates, their model would not be effective on the types of indoor scenes we are interested in. Also, [Kreutzmann et al., 2009] uses only very simple spatial relationships, not the broad range we wanted to try, and we were concerned that the multivariate Gaussian BCH model would not fully exploit the large number of weakly interacting spatial relationships in our work. If in our future work we handle larger numbers of objects per scene or across multiple rooms, then a hierarchical approach would be more useful.

3.4.4 Discriminative Models for Multi-class Object Layout

[Desai et al., 2009] provides the basis for our work in Chapters 7 and 8. Their goal is to use context, as defined by object-object co-occurrence and spatial relationships, to improve object detection rates. Specifically, they aim to learn the difference between the types of spatial relationships found between pairs of true detections and the types of relationships found between true-false or false-false pairs. Rather than incorporate context into the object detection problem, they apply their model after object detection has been performed to remove false positive detections. A key difference between their approach and many others is that it does not focus simply on the spatial relationships between the ground truth objects.
Instead, they learn the spatial relationships that are informative for identifying the difference between true positive and false positive detections.

In this section we give only a brief overview of their model, since it is described in more detail in Chapter 7, where we describe how we apply their model to our work. The input to their model is X = {x_i : i = 1 ... N}, a set of N object detections with an associated type and score computed from an image. Let Y = {y_i : i = 1 ... N} be a label vector which indicates whether a detection is a true or false positive.

In [Desai et al., 2009] the stated goal is to determine the correct labeling Y for a given X, but they do not actually test this in their experiments. Instead, they adjust the scores associated with all detections in X according to their 2D spatial relationships with all true positive detections in Y. This has the effect of decreasing false positive detection scores and increasing true positive scores. The likelihood of Y given an image X is evaluated using a scoring function

S(X, Y) = Σ_{i,j} w_{y_i,y_j}^T d_{ij} + Σ_i w_{y_i}^T x_i        (3.1)

where w_{y_i,y_j} is a pairwise weight vector that adjusts the score based on the likelihood that two objects of type y_i and y_j sharing the spatial relationship d_{ij} are true positives, and w_{y_i}^T x_i represents the local detector score associated with detection x_i for object type y_i. The spatial relationships d_{ij} that they use are fairly simple 2D qualitative spatial relationships and are illustrated in Figure 3.5. Inference is performed by computing arg max_Y S(X, Y) to determine the best labeling of detections Y. They use a simple greedy search approach to compute arg max_Y S(X, Y) and then adjust all detections in X using the weights w_{y_i,y_j} and w_{y_i} according to their spatial relationships relative to the true positive detections determined by Y.

Figure 3.5: The qualitative spatial relationships used by [Desai et al., 2009]. The relationships they used were distance-based (near, far), orientation-based (above, below, next to) and topological (on top). Figure reproduced from [Desai et al., 2009].

The model is trained with a set of images with true and false positive detections. The weights are learned such that Y for each image closely matches the correct labeling of all detections as true or false. A more complete description of this method can be found in Section 7.6. A structured SVM is used to compute the weights. Desai et al. use the PASCAL challenge data set of 20,000 images, covering 20 categories of objects, for training and testing. They show moderately successful results, with an overall improvement of 3% in the average precision of the detections after applying their model, with larger improvements in individual classes. Some classes decrease in average precision, but this likely just means that their model has not learned any effective relationships for improving detection accuracy.

Our work extends that of Desai et al. by moving to a 3D data set and 3D spatial relationships. We provide a more careful and thorough examination and selection of the 3D spatial relationships used, drawing upon the field of qualitative spatial reasoning to identify good features for use on 3D scenes, as we discuss in Chapter 5. We also modified the inference and learning techniques used in [Desai et al., 2009] to use a branch and bound tree search, which led to significant improvements in average precision.
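To make the model concrete, here is a minimal sketch of the scoring function in Equation 3.1 and the greedy inference used by [Desai et al., 2009]. The detection types, feature encodings and weights below are illustrative assumptions: each local term is reduced to a scalar weight times the detector score, and each pairwise spatial feature d_ij is reduced to a single scalar.

```python
from collections import defaultdict

def score(kept, types, scores, w_local, w_pair, d):
    """Equation 3.1, simplified to scalars: local detector terms for the
    kept detections plus pairwise terms weighted by the spatial feature
    d[(i, j)] shared by each kept pair."""
    s = sum(w_local[types[i]] * scores[i] for i in kept)
    s += sum(w_pair[(types[i], types[j])] * d[(i, j)]
             for i in kept for j in kept if i != j)
    return s

def greedy_inference(n, types, scores, w_local, w_pair, d):
    """Greedily add the detection whose inclusion most increases S(X, Y),
    stopping when no addition yields a positive gain."""
    kept, base = set(), 0.0
    while True:
        gains = [(score(kept | {i}, types, scores, w_local, w_pair, d) - base, i)
                 for i in range(n) if i not in kept]
        best_gain, best_i = max(gains, default=(0.0, None))
        if best_i is None or best_gain <= 0:
            break
        kept.add(best_i)
        base += best_gain
    return kept

# Toy example: a monitor and keyboard whose spatial relationship supports
# both, plus a spurious low-scoring detection with no contextual support.
types = ["monitor", "keyboard", "phantom"]
scores = [1.0, 0.8, 0.2]
w_local = {"monitor": 1.0, "keyboard": 1.0, "phantom": -1.0}
w_pair = defaultdict(float, {("monitor", "keyboard"): 0.5,
                             ("keyboard", "monitor"): 0.5})
d = {(i, j): 1.0 for i in range(3) for j in range(3) if i != j}
print(greedy_inference(3, types, scores, w_local, w_pair, d))  # {0, 1}
```

The mutually supporting pair is kept while the unsupported detection is rejected, which is exactly the behavior the pairwise term is intended to produce.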
Finally, in Chapter 8 we perform a number of experiments that examine how the properties of the detections used in training and testing affect the overall results.

We believe our work better demonstrates the value of spatial relationships for improving object detection than [Desai et al., 2009] because of their choice of training data. The PASCAL challenge 2007, which provided the training and test data used in [Desai et al., 2009], encompasses 20 classes of objects, covering a very wide variety of object types. This broad range of objects comes from a correspondingly large number of types of scenes. As a result, co-occurrence of object types becomes a very effective way of improving detection rates. For example, if a car is detected in an image, it is unlikely that a chair will also occur and therefore all chair detections can have correspondingly decreased scores. From their work, it is unclear how much benefit they got from the spatial relationships. In our work, we restrict ourselves to training and performing detection in only the same type of scene, so co-occurrence provides little information and instead we rely much more on spatial relationships.

Chapter 4

Scene Classification using Object Detections

Scene classification interests us because it allows a robot to select a spatial model to aid in object detection that is customized to a specific type of scene. Rooms are physical structures defined by geometric properties of the environment, usually walls and doors, while scenes are a semantic labeling of an area. A room may contain multiple scenes. For example, an open-plan condo might contain a kitchen, dining room and living room in a single large open area. Scenes are defined by the objects they contain and the set of related tasks that occur within them [Southey and Little, 2006]. Scene classification should therefore be possible by detecting objects in the scene and using them to infer the scene type.
The other major approach to scene classification has been to recognize scenes as a whole, identifying consistent visual elements across entire scene images, an approach that works well when scenes are visually distinctive. Techniques such as Gist [Oliva and Torralba, 2001; Torralba et al., 2003] capture global properties of an image and use them to classify scenes. Our work predates many other, more object-focused approaches to scene recognition, as we discuss further in Section 4.4.

In this chapter, we demonstrate that object detections, unaided by our qualitative spatial model, can be used to perform scene classification in images. We also show that combining detected objects with global properties of the image can further enhance performance. We present a novel method of scene classification that uses object detections to perform scene labeling. We believe object-based scene classification is more effective for indoor scenes and generalizable to a large number of previously unseen indoor environments.

In our work, object occurrence information from the LabelMe database is used to identify classes of useful objects for detector training. We train these detectors automatically using the DPM object detector, discussed in Section 3.1.2. We then train an Alternating Boosted Decision Tree that uses detection scores to predict the scene type. We compare this method to using Gist alone on an indoor dataset. We then present another method that combines Gist as well as object detection scores, and show that the two types of cues, when combined, lead to better performance than when used alone. The work presented in this chapter was jointly performed with Pooja Viswanathan [Viswanathan et al., 2011] and is included with her permission.

4.1 Scene Classification

Scene classification using semantic labels such as "kitchen", "bathroom", etc.
and vision- or shape-based methods has been an active area of research [Ersi and Tsotsos, 2012; Pronobis et al., 2010; Quattoni and Torralba, 2009; Wu et al., 2009]. However, most of these methods rely on the global properties of images. Some compute local image properties using feature-based methods but do not explicitly attempt object recognition. There have been significant improvements in object detection for robots [Meger et al., 2008] and the success of the embodied object recognition scenario presents the opportunity to leverage object detection for higher-level environment understanding.

We conducted early work demonstrating that recognized objects and their locations can be used to automatically label scenes in the environment through the use of annotated databases, as demonstrated by the spatial-semantic modeling system [Viswanathan et al., 2009]. In this work we were able to label 2D regions of the map with the correct scene type using object positions in the map. This system, however, assumed the ability to recognize objects perfectly and did not address the recognition problem. Furthermore, the underlying classification mechanism was a simple probabilistic model which did not fully take advantage of the available information provided by the object detections.

Labeling areas of a 2D map, such as one captured with Simultaneous Localization and Mapping, with descriptive tags has most commonly been performed in topological mapping. In work by [Ranganathan and Dellaert, 2007], graph-like maps are constructed where nodes are classified using visual object recognition. Kröse et al. have developed a series of practical systems [Kröse et al., 2007; Spexard et al., 2006] in which the visual similarity between images is used to cluster regions in the environment. Scene labels for the clusters, however, are provided by a human through speech.

For classifying images of scenes, [Oliva and Torralba, 2001; Torralba et al., 2003] use global properties of a scene (Gist).
[Pronobis et al., 2006, 2008, 2010] combine Composed Receptive Field Histograms, SIFT and laser data to perform scene classification in indoor environments under different illumination conditions. Local regions are used to infer an intermediate "theme" of an image in [Li and Perona, 2005] to aid in scene classification. Several other context- and region-based approaches have been implemented, and can be found in [Bosch et al., 2007].

The authors of [Quattoni and Torralba, 2009] find that most methods that achieve state-of-the-art performance in classification of outdoor scenes perform significantly worse on indoor scenes. They observe that some indoor scenes are better described by the objects in them and thus combine global and local properties (Gist and a spatial pyramid of visual words) to achieve increased performance. However, the reported multi-class average precision rates for the indoor dataset are still low.

Object-based methods have also been used for scene classification, as in [Vasudevan and Siegwart, 2008], where functional regions of the environment are labeled based on object occurrences. However, the main drawback of this method was that it was trained on specific instances of objects and tested on the same objects under different viewpoints and lighting conditions. It remains a challenge to determine which objects are strong cues for scene classification. In addition, generic object class recognition has been a challenging task in computer vision research.

4.2 Automated Scene Labeling

We developed a system to categorize scenes based on object detectors learned from LabelMe images. Our system is composed of four components. First, we perform automated data collection from LabelMe, thus facilitating the collection of training images used to recognize a large number of object categories. We compute a Count Model that represents the number of times an object is observed in each scene type in the LabelMe data, based on user-provided text labels.
We then use images from LabelMe to train windowed object detectors for the most frequently occurring objects. Finally, we use an Alternating Boosted Decision Tree (ADTree) to predict the most likely scene type given the detected objects in a scene. Furthermore, we show that enhanced performance can be achieved by incorporating global cues (such as Gist) into our framework.

4.2.1 LabelMe Data Collection

LabelMe [Russell et al., 2008] is an online database of user-annotated images. In LabelMe, the user can annotate an object in an image by outlining a region of the image using a polygon and giving that region a label. The entire scene can also have a description contained in the filename. Figure 4.1 shows a kitchen scene from LabelMe with several labeled objects. We use LabelMe in two ways. The technique we use for training object detectors requires tight bounding boxes, which we can acquire using the LabelMe polygons. Also, our Count Model is computed using the correspondence between the labels of objects in an image and the scene name descriptor found in the image filename. In creating this model, we do not directly analyze the images in the dataset, and instead focus on the textual annotations in each image.

Figure 4.1: A kitchen scene from the LabelMe database. The polygons used to segment objects in the scene are shown as colored lines.

4.2.2 Count Model

In order to perform scene classification based on objects, we first need to learn a model of the types of objects and the number of their occurrences in each scene type. We obtain this information from the LabelMe database by querying for scenes and recording the number of annotated occurrences of each object in the scene, as in [Vasudevan and Siegwart, 2008] and [Viswanathan et al., 2009].

The counts table ct_p(o) contains the number of times object o occurs in images of scene type p.
If the number of images of scene type p is n_p, the likelihood of observing object o in scene p is computed as

P(o|p) = ct_p(o) / n_p        (4.1)

We refer to this likelihood as the Count Model, which is used to inform detector training and the learning of the Scene Model described below. We used only objects that appeared at least 10 times in the scenes.

4.2.3 Useful Objects

The most useful objects for the scene labeling task are the ones that provide the greatest information gain. The information gain IG for an object o can be computed as in [Yang and Pedersen, 1997]:

IG(o) = -Σ_i P(p_i) log P(p_i) + P(o) Σ_i P(p_i|o) log P(p_i|o) + P(¬o) Σ_i P(p_i|¬o) log P(p_i|¬o)        (4.2)

where i ranges over the scene types p_i. Upon analyzing the relationship between information gain and the frequency of occurrence of objects, we noticed a positive correlation between information gain and frequency, as in [Yang and Pedersen, 1997]. In the domain of scene labeling, the most informative objects are often the most frequently occurring. This is because there is relatively little crossover between the types of objects in scenes. In addition, since most LabelMe scenes contain objects in realistic home settings, objects that have high counts in the learned model are more likely to be present in the intended test environment (homes). Thus, it is sufficient to determine the most frequently occurring objects in each scene type to train object detectors. A histogram of these objects for some scene types can be found in the Experiments section.

4.2.4 Detector Training

For object detection, we used the system created by [Felzenszwalb et al., 2008] discussed in Section 3.1.2. Their approach must be trained on images of the target objects with accurate bounding boxes, making many conventional data sources unusable.
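The Count Model of Equation 4.1 and the information gain of Equation 4.2 can be sketched as follows. The toy annotations stand in for LabelMe's (scene label, object labels) pairs and are purely illustrative.

```python
import math
from collections import Counter

def count_model(annotations):
    """Equation 4.1: P(o|p) = ct_p(o) / n_p, where ct_p(o) counts the
    occurrences of object o in images of scene type p and n_p counts
    the images of scene type p."""
    n_p, ct = Counter(), Counter()
    for scene, objects in annotations:
        n_p[scene] += 1
        for o in objects:
            ct[(scene, o)] += 1
    return {(p, o): c / n_p[p] for (p, o), c in ct.items()}

def information_gain(annotations, o):
    """Equation 4.2 (as in Yang and Pedersen, 1997): the prior scene
    entropy minus the expected entropy after observing whether o occurs."""
    def neg_entropy(subset):
        counts = Counter(scene for scene, _ in subset)
        return sum((c / len(subset)) * math.log(c / len(subset))
                   for c in counts.values())
    with_o = [a for a in annotations if o in a[1]]
    without_o = [a for a in annotations if o not in a[1]]
    ig = -neg_entropy(annotations)
    for subset in (with_o, without_o):
        if subset:
            ig += (len(subset) / len(annotations)) * neg_entropy(subset)
    return ig

annotations = [("kitchen", {"stove", "sink"}), ("kitchen", {"stove"}),
               ("office", {"monitor"}), ("office", {"monitor", "desk"})]
model = count_model(annotations)
# 'stove' occurs once per kitchen image and never in offices, so it is
# both frequent in kitchens and maximally informative on this toy data.
print(model[("kitchen", "stove")])  # 1.0
```

On this data the information gain of "stove" equals the full prior entropy over the two scene types, matching the intuition that perfectly scene-specific objects are the most useful cues.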
A database of labeled images is needed to train the object detector and in this earlier work we used the LabelMe database [Russell et al., 2008]: a free online data source which provides a large number of human-labeled images, including indoor scenes suitable for scene labeling and object recognition. We trained object detectors for a subset of the most frequently occurring objects based on the object counts. A total of 61 objects were used (corresponding to approximately 15 objects in each scene type). The precision-recall rates for a few categories, as well as visualizations of a detector model, can be found in the Experiments section.

In our later work on improving object detections using 3D spatial relationships, discussed in Chapter 8, we stopped using LabelMe for providing training images for the object detector. We found that the use of user-designated labels led to poorly labeled images and multiple names for the same type of object. Also, there were insufficient examples of some types of objects and bounding polygons were inconsistently drawn. At the time we were working on the material presented here, ImageNet [Deng et al., 2009], which we use for our object detection in Chapter 8, did not provide bounding boxes for objects in images and was therefore not usable with the DPM object detector.

4.2.5 Scene Labeling Using Boosted Decision Trees

A decision tree is a hierarchical model for supervised learning that identifies local regions in the input space using a sequence of recursive splits, as described in Section 2.1.1. Significantly, boosted decision trees require no knowledge about the properties of the weak classifiers used and can be combined with any weak classifier that is more accurate than a random guess, allowing us to test a wide variety of features. Also, they are not prone to over-fitting and have no parameters to tune except for the number of rounds they are trained for; they can usually simply be trained until the test accuracy plateaus.
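As a concrete illustration of boosting over weak classifiers, here is a minimal AdaBoost sketch using one-feature threshold stumps as the weak learners. This is a simplified stand-in for the ADTrees we actually used, and the detector-score features are invented for the example.

```python
import math

def best_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, sign) stump with the
    lowest weighted error on examples X with labels y in {-1, +1}."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[f] >= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best

def train_adaboost(X, y, rounds=10):
    """AdaBoost: reweight examples toward those the current ensemble gets
    wrong, and add the stump that best handles the reweighted data."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, f, t, sign = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, sign))
        preds = [sign if row[f] >= t else -sign for row in X]
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    s = sum(a * (sign if x[f] >= t else -sign) for a, f, t, sign in ensemble)
    return 1 if s > 0 else -1

# Toy "is this a kitchen?" data: feature 0 = top stove-detector score,
# feature 1 = top monitor-detector score (both invented).
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, -1, -1]
model = train_adaboost(X, y)
print(predict(model, [0.85, 0.15]))  # 1  (kitchen)
```

A one-vs-all scene classifier then trains one such ensemble per scene type and assigns the label whose ensemble produces the highest summed vote.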
To achieve multiclass classification, we trained a decision tree for each scene type using a one-vs-all approach. We describe the inputs to the boosted decision trees in Section 4.3, since they vary between experiments.

4.3 Experiments

In this work, we attempt to classify kitchens, offices, bathrooms and bedrooms. However, due to the automated nature of our system, we could learn models for other types of scenes, including living rooms and dining rooms, by simply querying LabelMe for more scenes. We did not do so because there was an insufficient number of examples of these types of scenes in LabelMe to determine good Count Models, and it is difficult to train object detectors for the objects in these scenes from the relatively small number of examples in the LabelMe dataset.

Figure 4.2: Counts of the types of objects found in (a) kitchen and (b) office scenes.

4.3.1 Count Model

Figure 4.2 shows the object counts learned for kitchens and offices. We display the 15 most frequently occurring objects in each scene type. As seen in the figures, some of the objects have unusable labels due to ambiguous user entries. We thus select a limited number of the objects, and show later on that these are in fact sufficient for the task of scene classification.

4.3.2 Detection

We trained object detectors using at most 200 positive examples and 1000 negative examples for each class. We set the number of components of the mixture model, n, based on the size of the training data for each class (classes with few training examples were trained with 1 component, while classes with more training data were trained with up to 3 components). Thus, training examples are split into n components based on the aspect ratio of the bounding boxes they contain.
DPMs are trained on each component individually and merged together to form the final model.

In order to produce precision-recall and average precision rates for each category, we validated the models on images of LabelMe objects that were not used in training. We used loosely cropped versions of these images to prevent unannotated true positive examples in an image from being detected as false positives. Figure 4.3 shows some of the most and least successful detection results. As seen, objects that are usually partially obscured by other objects (furniture such as desks and tables) tend to perform the worst. Training images for these classes mostly contain views of cluttered table/desk tops. Alternate views of furniture can be gathered by using other Internet sources, which would provide additional structure (e.g., table legs) for use in training detectors.

Figure 4.3: The precision/recall rates of object detectors for (a) mouse, (b) bowl, (c) table and (d) whiteboard. The top row shows 2 of the most successful classifiers and the bottom row shows 2 of the least successful classifiers.

4.3.3 Scene Classification

We designed experiments to test scene classification in three different scenarios. In the first experiment, we classify scenes based on perfect labels of all annotated objects. In the second, we classify full images that depict a scene containing different types and numbers of objects, using real detection results.

Table 4.1: Results for scene classification with perfect object labels.

Scene      Prec.  Recall
Bathroom   0.96   0.88
Bedroom    0.92   0.92
Kitchen    0.95   0.93
Office     0.98   1.0
Average    0.95   0.93

Scene Classification with Perfect Labels

In this experiment, we want to determine the performance of the scene classification system using ADTrees if all objects can be recognized perfectly. We run 10-fold cross-validation on all examples of bedrooms (37), bathrooms (52), offices (647) and kitchens (190). The number of examples for each scene type is indicated in parentheses.
Only objects that occur in at least 10 images are used as features in the ADTrees. In this experiment, inputs are binary, indicating the presence or absence of an object. As seen in Table 4.1, our scene classification algorithm produces assignments that closely match the ground-truth scene labels for all scene types. Bathrooms produce the lowest recall rate due to the limited number of example images currently in LabelMe and the sparsity of objects in these scenes. This demonstrates that the objects present in a scene are very useful in classifying it.

Scene Classification with Object Detections

Our second experiment determines scene labels using real object detections in the image. These images only contain a portion of the scene, and can contain few to many objects. The highest SVM detector scores, produced by running each learned object detector on the image, are fed as input to the BDT, which then infers the scene label. We found that multiple detections per object class resulted in negligible improvement, and thus used only the top detections.

We expanded the indoor dataset in [Quattoni and Torralba, 2009] with 50 more images for each scene type acquired from an online image photo collection to minimize overlap between the Torralba dataset (which uses LabelMe as a data source) and the images used in detector training. We compare our object-centric method with the technique in [Oliva and Torralba, 2001], which uses Gist for scene classification using an SVM. We perform 10-fold cross-validation on the entire dataset for both methods. Results of scene classification on the combined indoor dataset are shown in Table 4.2, in columns 1 and 2. Given the difficulty of the task, our model performs extremely well at distinguishing between the various scene types based on a relatively small set of trained object detectors. This demonstrates that object detections produced by DPMs under our automated training method are reliable for the task of scene classification. We also notice that our object-centric method is comparable to Gist, outperforming it in most cases.

Table 4.2: Classification results for scene classification using object detections, Gist and both combined on images acquired by humans.

Scene      Object          Gist            Object+Gist
           Prec.  Recall   Prec.  Recall   Prec.  Recall
Bathroom   0.56   0.58     0.63   0.56     0.68   0.63
Bedroom    0.57   0.53     0.53   0.53     0.57   0.64
Kitchen    0.61   0.62     0.53   0.57     0.64   0.67
Office     0.58   0.60     0.53   0.52     0.66   0.60
Average    0.58   0.58     0.56   0.55     0.63   0.64

Finally, we show the results of combining object detections and Gist in the third column. The input features to the BDT in this experiment are the highest SVM object detector scores as well as the SVM output from Gist. We report the average precision and recall rates. We see that, due to the complementary strengths of the methods, combining them results in enhanced performance for all scene types in the indoor dataset.

4.4 Influence on Future Work

This chapter demonstrated a system that can perform scene classification using object detections on an indoor dataset. Our object-centric method outperforms Gist on the indoor dataset. The combination of object detections and Gist leads to enhanced performance. We have shown that, with state-of-the-art object detectors trained with large, freely available data sources like LabelMe, we can effectively both detect and classify a wide variety of objects in realistic indoor images.

Further work on classifying scenes using object detections in both indoor and outdoor scenes has been pursued by [Li et al., 2010]. Like us, they combined object detections with global scene descriptors, but included many more types of object detections. They also included 2D spatial information between detections and achieved very impressive classification rates on 9 types of LabelMe scenes (approximately 70%).
Their work showed that this object-centric approach is applicable to a broad variety of scene types.

Given the success of scene recognition from a single image, we can infer that it is possible to determine scene type with good accuracy given multiple images from different angles. With classification rates similar to our own or those demonstrated in [Li et al., 2010], a simple voting process over all images of the scene would yield very high confidence when many scene images are available. Therefore, we assume in Chapter 7 that scene classification is possible and only apply a spatial model trained for the correct scene type.

Chapter 5

Qualitative Spatial Relationships

This chapter will discuss the use of qualitative spatial relationships for describing the relative spatial properties of objects in human environments. It covers several key concepts relating to qualitative spatial relationships and their origins in qualitative spatial reasoning, and ends with an overview of the four major types of qualitative spatial relationships that are used as the basis for our experiments in Chapters 6 and 8.

5.1 Qualitative Spatial Reasoning

A qualitative representation is one which, as opposed to a quantitative representation, "makes only as many distinctions as necessary to identify objects, events, situations, etc. in a given context" [Hernández and Zimmermann, 1994]. Qualitative representations embrace the principle of parsimony: that a model should use the simplest representation necessary for a given task. Ideally, such a representation would be based on human behavior, cognitive models and communication, since these provide insight into how humans simplify their mental representations of their environment.
A representation should both simplify and function well in a wide variety of situations.

Much of the research presented here is drawn from qualitative spatial reasoning (QSR), a branch of reasoning that deals with finding qualitative descriptions of structures in space and then reasoning about their physical properties relative to each other. Qualitative representations are based on the application of a calculus of predicates: boolean-valued functions that partition quantitative values. We are primarily interested in the qualitative spatial relationships used in qualitative spatial reasoning rather than the logical elements of QSR. Qualitative spatial relationships express continuous quantitative spatial information using discrete symbols that simplify reasoning about spatial concepts [Cohn and Hazarika, 2001]. Coming from the field of logic, qualitative spatial relationships provide us with a complete and consistent approach to representing space.

The key problem in QSR lies in finding meaningful, relevant ways of quantifying continuous relationships so that they capture so-called "common sense" information about the environments humans inhabit and are appropriate to a problem or situation. We believe that, provided with a spatial model based on qualitative spatial relationships, a computer or a robot can interpret and conceptualize the physical world in a manner more closely resembling a human's, and make predictions and take actions better suited to its task.

Qualitative reasoning allows for inference in the absence of complete knowledge, not by treating the information in a probabilistic manner and modeling the uncertainty, but by purposefully grouping together values that are conceptually similar. QSR allows a robot to consider its surroundings in terms of human concepts such as "near", "on top of", "inside", etc. While this information is contained in the quantitative spatial data, identifying and extracting it is part of the role of QSR.
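As a minimal illustration of extracting such qualitative relations from quantitative data (the thresholds and predicate names here are illustrative, not the relationship definitions used later in this thesis):

```python
import math

def near(a, b, threshold=1.0):
    """Partition continuous distance into a boolean 'near' predicate."""
    return math.dist(a, b) < threshold

def above(a, b):
    """Compare the vertical (z) coordinates of two 3D positions."""
    return a[2] > b[2]

def qualitative_relations(a, b):
    """Describe a relative to b using discrete symbols instead of
    continuous coordinates."""
    rels = []
    if near(a, b):
        rels.append("near")
    rels.append("above" if above(a, b) else "below")
    return rels

cup = (0.1, 0.1, 1.0)     # resting on the table top
table = (0.0, 0.0, 0.9)
print(qualitative_relations(cup, table))  # ['near', 'above']
```

Any cup within the threshold and higher than the table top yields the same discrete description, which is precisely the grouping of conceptually similar situations described above.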
Allowing for uncertainty about spatial information simplifies many spatial problems, since less accurate data is required. The shift from a continuous space to a discrete space decreases the complexity of spatial reasoning and learning by reducing the dimensionality of the problem, as well as changing the type of reasoning required.

Much of the body of QSR literature focuses only on 2D spatial representations. When dealing with indoor environments, 3D representations capture significant relationships between objects. The vertical placement of objects clearly has great significance indoors (e.g., consider the different treatment of food on the floor vs. food on a table). Advanced sensor devices like stereoscopic cameras and laser range finders can provide complex 3D information about scenes. Since 3D spatial relationships are not the norm in QSR, we have had to translate many traditional qualitative spatial concepts from 2D to 3D.

An important concept when discussing qualitative systems is granularity: the degree to which the components of a system are subdivided. A coarsely granular system is one comprised of fewer or larger components than a finely granular system. Granularity is often confused with scale, since a map with a scale of 1:1000 has a coarser granularity than one at 1:100 scale. However, scale is only one dimension on which to consider granularity. Most real-world spaces can be considered at a broad range of granularities. An ideal qualitative spatial relationship should be as coarsely granular as possible while still containing sufficient information for all tasks within that environment.

5.2 Qualitative vs. Quantitative Relationships

Qualitative spatial relationships rely on decomposing and partitioning spatial relationships. Decomposing a quantitative representation of space breaks it down into multiple simplified representations based on some mathematical property of space such as topology, orientation or distance.
These decomposed representations are then partitioned into discrete subsets that capture relevant divisions within that representation. Quantitative relationships can also decompose space to varying degrees but do not usually discretize it using partitions.

5.2.1 Advantages of Qualitative Spatial Techniques

The following is a list of advantages of using qualitative spatial techniques as opposed to quantitative ones:

Complexity reduction: An excess of information can often make learning tasks harder because of the time and difficulty involved in identifying relevant data. By translating spatial data from a continuous, combined multi-dimensional space to a discrete, decomposed space, the use of qualitative spatial relationships reduces the overall information needed to express the relevant properties of an environment [Escrig and Toledo, 1998]. Also, through decomposition, information of different types can be considered separately or in conjunction as necessary [Hernández and Zimmermann, 1994].

Compensate for data inaccuracy: Getting exact quantitative data about the world is often difficult or expensive. The degree to which we are uncertain about the resulting information is often unknown. By using a representation that does not require exact information, we can potentially ignore or mitigate problems resulting from these inaccuracies.

Handling partial and uncertain information: Qualitative spatial approaches are able to handle vague or uncertain information about their environments by intentionally not differentiating between similar situations, using a more coarsely granular approach to identifying qualitative relations [Hernández and Zimmermann, 1994].

Human inadequacy simulation: Our senses are not designed to precisely measure properties of space against abstract, non-present conceptual properties like "a centimeter" or "a gram" [Hernández and Zimmermann, 1994].
Therefore, it seems reasonable that an approach to spatial understanding that aims to function in an environment created by humans should base itself on a model similar to that used by a human. The suggestion that an approach is based on "how humans do it" is potentially risky without adequately exploring the actual mechanisms in the brain. However, humans and other animals are clearly good at reasoning in space without perfect information about spatial properties, which indicates that precise quantitative information is unnecessary [Renz, 2002].

Generalization: Reducing our reliance on exact values and combining like cases together makes seemingly different situations become similar. Approaches that learn rules from simplified examples are more likely to generalize to novel situations [Hernández and Zimmermann, 1994].

5.2.2 Disadvantages of Qualitative Spatial Techniques

Qualitative spatial techniques have some potential drawbacks:

No universal approach: There are many qualitative spatial relationships that have been suggested in the QSR literature that rely on different spatial properties and are applicable to different tasks. Experimentation is often required to determine which approaches are best suited to a task, and communication between different qualitative approaches first requires the definition of an agreed-upon "spatial language" [Escrig and Toledo, 1998].

Varying granularity: Before attempting a task involving qualitative spatial reasoning, it is necessary to identify what granularity of information about the world is required [Escrig and Toledo, 1998].
If the appropriate level of granularity is not obvious, as is likely the case, multiple levels should be examined to approach an optimal solution.

Translational costs: Since most sensors acquire only quantitative information about the world, mechanisms must be in place to transform the data into a qualitative system. The costs of the conversion could be time, computation, or money (if, for example, a human expert is required to perform the translation).

Lack of Bi-Directionality: Moving from a quantitative representation to a qualitative one is typically a well-defined and exact process since the shift reduces the overall information conveyed. However, moving back from a qualitative representation to a quantitative one is more problematic since previously exact values are now undefined [Hernández and Zimmermann, 1994]. For example, a robot might know through QSR that a knife belongs on the table, but to actually place it there it will need exact coordinates in space that a qualitative model cannot provide. In our work, this is not a problem since we do not need to translate any qualitative relationships back into quantitative values.

5.3 Spaces

Before it is possible to discuss the spatial relationships between objects, we must consider the nature of the spaces involved. [Hernández and Zimmermann, 1994] suggests that a cognitive model of space used for qualitative spatial relations should draw on the properties of four types of spaces:

Mathematical spaces that are comprised of mathematical elements interacting according to a set of axioms and which provide the tools that allow spatial concepts to be abstracted, formalized and applied. They provide the mechanisms for defining a mapping from the quantitative representation to a qualitative one. Euclidean spaces are an example of an often appropriate model and have the advantage of being an axiomatization of physical spaces [Hernández and Zimmermann, 1994].

Physical Spaces where objects follow "real world"
physical constraints [Renz, 2002] (often associated with Newtonian physics) such as:

• Physical objects are homogeneous, continuous and finite, which is clearly a useful simplification since object material is not always homogeneous.
• Objects have only positive extension (e.g., objects cannot have a negative width or height).
• Different objects cannot occupy the same space at the same time.
• A specific object exists only once in space.
• To move from one point to another, an object must pass through some space in between.

These constraints help restrict spatial models to situations that have a physical analog.

Psychological Spaces that define a perceived model of actual space. The properties of psychological spaces are based on physical spaces as filtered through a variety of human senses. Mental models of space do not seem to correspond to simple Euclidean spaces [Roberts and Suppes, 1967].

Metaphorical Spaces that apply when spatial concepts are applied to a non-spatial domain but where spatial concepts allow metaphorical insights. For example, the organization of a company might be described spatially, where influence and common goals are mapped onto position and overlap. The application of spatial rules outside the spatial domain allows for an analysis of the inherent properties of those rules [Hernández and Zimmermann, 1994].

In Chapter 7, we discuss our approach for differentiating between true and false positive detections using spatial relationships. In this work, we determine spatial relationships between detections, which are an interesting combination of both a physical and a psychological construct. Detections are based on physical objects: a true positive detection corresponds to an actual object in a scene, and that object should obey all the requirements we gave above for a physical object. However, a detection can also be considered a psychological construct since it is based on an entity's perception of the environment.
In our case the entity is a computer or a robot rather than a human. Psychological constructs do not need to obey the physical rules of an actual object, since detections can overlap, have a non-finite extent, etc. So in considering the spatial relationships between detections, we have to mediate between these two representations and identify relationships that are appropriate to both spaces. For example, true positive detections should not significantly overlap each other, unless we allow for part-based detection, where object parts should overlap with the whole object.

5.3.1 Sizes of Space

[Montello, 1993] proposed that as the size of spaces increases, the reasoning techniques employed by humans operating in them become based on more coarsely granular representations. As humans are able to perceive less of a building, their ability to make precise judgments of quantitative values becomes less accurate, and they shift to a more coarsely granular mental model as environments become larger. To aid in the discussion of qualitative relationships relative to environment size, Montello defined the following four classes of spaces of increasing size:

Figural spaces that are similar in size to the human body and that can be perceived entirely without moving (e.g., a table or a desk).

Vista spaces that are larger than the human body but which can be examined without movement (e.g., an office or a kitchen).

Environmental spaces that require movement to be explored (e.g., a house, a mall, an office building).

Geographical spaces, massive spaces that are likely never completely explored (e.g., a town or a city).

Obviously, there are no hard boundaries between these types of spaces, but if one examines the spatial reasoning techniques used, a shift in the granularity of personal representation happens as humans move from small to large spaces [Renz, 2002].
At larger scales, travel mechanisms change the way people experience space, and cognitive models of distance are more influenced by our perceptions. For example, a path in the park which you always walk through might seem much longer than the road that you drive, even though it is actually much shorter. Distance metrics need to be adjusted to encompass different scales of space. In environmental and geographical spaces it is often possible to simplify 3D space into 2D by ignoring height, since it is orders of magnitude smaller than the other dimensions. QSR research is grounded largely in 2D representations because much of the origin of QSR is in spatial analysis at the map level, where height can be safely ignored.

We are primarily interested in figural and vista [Swadzba and Wachsmuth, 2011] spaces since they describe the majority of human buildings. In vista and figural spaces, humans make relatively accurate estimations of the quantitative spatial properties of objects around them, and a substantial amount of the environment can be observed without moving. The higher accuracy of human reasoning in small environments allows us to use qualitative spatial relationships that can rely on accurate judgments of space and knowledge of the objects in the environment. At these scales, many decisions about object placement are based on physical human properties such as arm's reach, human height, walking distances, etc., so physical properties of humans become a defining characteristic when determining the granularity of spatial relationships.

5.3.2 Spatial Object Structure

Typically, in mathematical spaces, the basic unit of space is the point, and all other spatial constructs (such as lines and regions of space) are defined as sets of points. In QSR, the most commonly used basic unit of space is the spatial region [Vieu, 1997].
Regions define the space occupied by both physical objects and psychological spatial constructs like detections, and it is between regions that we determine spatial relationships. Regions are used rather than points because qualitative spatial relationships are predominantly used to reason about physical objects, and physical objects occupy volumes of space, not points. As [Simons, 1987] said, "No one has ever perceived a point, or will ever do so, whereas people have perceived individuals of finite extent". Lower dimensional units like lines and points do not have physical manifestations and, if necessary for describing mathematical properties, can be defined in terms of degenerate regions.

Given the almost infinite complexity of the region occupied by a real world object, most QSR techniques rely on simplifying transformations to produce a more coarsely granular description of an object's region using its boundaries. There are many quantitative systems for encoding a region, such as triangle meshes, splines, and voxels. These are rarely used in QSR because of the complexity involved in determining relationships between such representations. Commonly used approximations of an object's region include bounding boxes and convex hulls [Clementini and Felice, 1997].

When discussing the spatial properties of objects, it is important to differentiate between object shape and the region that defines an object in a space. An object's shape is independent of its position and orientation, while the region defined by an object is based on its position, orientation, shape and scale. It is common in QSR [Hernández and Zimmermann, 1994] to distinguish between shape and scale. We examine the difference in greater detail in Section 5.4.4.

5.3.3 3D Spatial Representations as 2D

Often in qualitative spatial work, 2D representations of 3D domains are used, which simplifies finding qualitative spatial relationships. For tasks such as map analysis and image understanding, 2D representations are clearly appropriate.
However, both maps and images are 2D projections of 3D environments and involve a loss of information. With city maps, ignoring height usually has minimal impact on the usefulness of qualitative relationships. For computer vision, the loss of information from using images can lead to more significant issues but is necessary due to the nature of cameras and the prevalence of data sets containing only images.

An option we considered for our work but rejected was to use a 2D surface-based qualitative model, a type of augmented 2D representation suggested by Hernández [Hernández and Zimmermann, 1994] that projects 3D spatial data onto parallel 2D layers. The intuition for surface-based models is that most human environments are comprised of flat surfaces and that gravity holds objects on these surfaces. The actual height of an object, therefore, is determined by object size and surface height. Since object relationships are heavily influenced by objects sharing a surface [Hernández and Zimmermann, 1994], the model treats objects on the same surface differently than objects on different surfaces. With the surface-based approach, the space inside a building is treated as a collection of 2D layers corresponding to surfaces. Qualitative spatial relationships between objects are based on the objects' 2D relationships on surfaces, with an additional factor that takes into account whether the objects are on different surfaces.

In the end, we decided to focus our work on full 3D models of scenes because, with the increasing prevalence of 3D sensors, there will likely be more robots in the future using full 3D representations of their environments.
If object detection can be performed in a full 3D representation, then mapping to a 2D surface-based one, at the loss of some potentially valuable relationships, is unnecessary.

5.4 Qualitative Spatial Relationships

Most qualitative spatial relationships are based on one of four simplifying properties of quantitative spatial relationships between objects: distance, orientation, topology and shape [Cohn and Hazarika, 2001; Freeman, 1975]. The shape category here includes both shape and the related but separate property of scale. The four properties can be considered separately or in combination to develop systems for describing qualitative spatial relationships. Determining qualitative spatial relationships involves partitioning these properties into a set of exhaustive, pairwise disjoint relationships. Since the type of qualitative relationship that is appropriate depends on the associated task, there exist many different ways of performing this partitioning.

Our experiments use primarily orientation and distance relationships, but an examination of all four areas is necessary for later discussion of our work and is valuable for understanding why we made our decisions. For each property used in our qualitative spatial relationships, we describe the approaches for considering and quantifying the property and the rationale and techniques used for creating the qualitative partitions.

5.4.1 Orientation

In Euclidean spaces, orientation is expressed using angles, which define the direction of a vector relative to a universal axis. However, more generally, orientation is a triadic relation based on a target object T, a reference object R and a frame of reference FoR.
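For intuition, a coarse orientation relation between 2D points under an extrinsic frame of reference (here, fixed map axes) might be sketched as follows; the four labels and the conic partition are a hypothetical illustration, not the relationships used in this work:

```python
import math

def cardinal(target, reference):
    """Qualitative orientation of target T relative to reference R under
    an extrinsic frame of reference (fixed map axes). Illustrative only:
    partitions the full circle into four 90-degree cones."""
    dx = target[0] - reference[0]
    dy = target[1] - reference[1]
    if dx == 0 and dy == 0:
        return None  # orientation is undefined for coincident points
    angle = math.degrees(math.atan2(dy, dx)) % 360
    labels = ["east", "north", "west", "south"]
    # shift by 45 degrees so each cone is centred on its axis
    return labels[int(((angle + 45) % 360) // 90)]
```

A call such as `cardinal((0, 1), (0, 0))` returns `"north"`; any continuous direction collapses onto one of the four discrete symbols.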
A key difficulty in using orientation as a spatial relationship is identifying a frame of reference that is appropriate to the desired goal and that can be consistently identified in the environment. Finding the right frame of reference is important because any orientation between R and T is possible given the choice of FoR (unless R and T are the same point, when orientation becomes meaningless).

There are three types of frames of reference used when determining orientation, each of which has significantly different interpretations:

Extrinsic frames of reference rely on a single external, invariant set of points as the frame of reference for all comparisons. Examples of extrinsic frames of reference include the traditional axis system used in Euclidean space, a fixed landmark such as the north pole, or a vector like gravity. The difficulty with extrinsic frames of reference is identifying one that is meaningful for the space and task involved [Hernández and Zimmermann, 1994].

For our work, the vector defined by gravitational direction is our main extrinsic frame of reference since it is easy to identify and gravity has a strong influence on scene structure. We also use a frame of reference defined by the walls of the room to determine x and y axes.

Intrinsic frames of reference rely on a physical property of R to define a frame of reference relative to T. To find an intrinsic frame of reference, object R must have recognizable features from its shape or appearance that make it possible to reliably identify enough points on R to create a frame of reference. Since it may be impossible to identify enough points on R to define a sufficiently high dimensional frame of reference, intrinsic orientation approaches should be able to gracefully change their granularity depending on the properties of R.

Since our early work described in Chapter 6 used synthetic data from a game, we had access to intrinsic frames of reference for each object, defined by its model in the game.
Our early work used these intrinsic frames of reference, but they were discarded once we realized that in our later real-world work we would not be estimating object pose.

Deictic orientation uses an embodied agent's point of view in the environment as the basis for defining a frame of reference [Cohn and Hazarika, 2001]. In many human interactions, a person is used as the reference point for comparative orientation, though using deictic frames of reference can be problematic unless both parties agree on which person is the reference (hence the age-old question "Your left or mine?").

In our work, we have not found deictic frames of reference useful since there are no embodied agents other than the robot. Even though it is possible to use a robot's sensors to define a deictic frame of reference, it is of little use in our work unless the robot had been involved in the creation of the environment and its position was known at the time. Deictic frames of reference for a robot are primarily useful when it needs to either understand or describe a spatial concept that is relative to itself (e.g., "Pick up the cup to your left").

An approach we believe would be useful in future work would be for a robot to "hallucinate" an embodied human in a scene where a human is most likely to be, then perform object detection using the resulting relationships. For example, the position of a chair in an office indicates that a human will frequently interact with the environment from that position. Spatial decisions have been made from that viewpoint, making it a useful basis for orientation judgments.

Orientation Between Points

Most approaches to qualitative orientation are based on comparisons between points (typically the centroids of objects T and R) relative to a frame of reference and are designed to work in 2D spaces [Hernández and Zimmermann, 1994].
By judging orientation between centroids, these approaches are invariant to the shape of the objects and, if based on a frame of reference that is extrinsic to the objects, to object orientation.

Possibly the simplest approach for describing orientation between points is that of [Schlieder, 1993] which, in 2D, expresses the relative orientation of T and R given FoR as +, 0 or - for clockwise, collinear and anti-clockwise. Coarsely granular approaches often describe orientation using the cardinal directions. Instead of cardinal directions, "left", "right", "front" and "back" descriptors are sometimes substituted when using intrinsic frames of reference. [Frank, 1991] identifies two basic methods of subdividing space for identifying cardinal orientations, shown in Figures 5.1a and 5.1b, that are widely used and produce either 4 or 8 relations depending on granularity. Another widely used approach was developed by [Freksa, 1992] and is called the "double cross" calculus. It derives an intrinsic frame of reference from R. In 2D, there are 3 axes, one defined between R and T and two more orthogonal to that axis at R and T (see Figure 5.1c), which divide space into 15 regions.

Figure 5.1: Three varieties of point-based qualitative orientation systems: (a) conic cardinal directions; (b) projection cardinal directions; (c) double cross.

Orientation Between Regions in Qualitative Spatial Reasoning

Determining orientation relations based on object regions is a more complex problem than when using points because of the variety of possible object shapes [Hernández and Zimmermann, 1994]. Region-based orientation approaches are not invariant to object position,

Figure 5.2: A region-based orientation system used to describe the projection of the objects' axis-aligned bounding boxes onto each axis of the frame of reference, as shown in 5.2a.
Intervals are compared according to Allen's interval algebra, shown in 5.2b.
orientation or shape, but can capture more complex spatial relationships.

Most region-based qualitative orientation approaches rely on first reducing the shape to some consistent structure such as a convex hull [Barber et al., 1996] or an axis-aligned bounding box. Given an extrinsic frame of reference and a simplified structure, one common approach is to project the bounding boxes of R and T independently onto axes defined by the frame of reference (see Figure 5.2). The relationship between bounding box projections on an axis can then be defined using Allen's interval algebra [Allen, 1983], with one relation per axis of the FoR, giving a total of 13^n relations for an n-dimensional space. The number of states becomes very large in more than one dimension, so it is often necessary to reduce the number of states in the interval calculus. Deciding which states to merge requires careful consideration of the problem and experimentation.

5.4.2 Distance

Distance is a scalar property based on the position, and possibly the shape and orientation, of two objects. When dealing with distance measures, the first question is: between what points should distance be measured? One solution is to determine distance between the region centroids of R and T, an approach that is invariant to object orientation and shape. However, using centroids results in an inaccurate distance if the distances involved are similar to the size of the objects, and using centroids makes it impossible to identify some semantically meaningful relationships (e.g., "touching", "overlapping", etc.).
Otherwise, the shortest vector distance between the surfaces of the two regions can be used, or a reasonable approximation of that distance if the object shapes are highly complex.

When determining distance, there is also the question of whether to use the shortest path between the target and reference objects T and R regardless of the interposition of other objects, or whether to find a path that passes through empty space. Determining the shortest free-space path is a path planning problem, and there are many known solutions that allow for additional variables such as minimum aperture sizes [LaValle, 2006]. We believe using the free-space path instead of the shortest distance is only necessary when the two distances are significantly different, which mostly happens when comparing objects in multiple rooms or in very complex scenes. Another problem with free-space distances is that there needs to be a mechanism for handling situations where there is no path between the objects, such as when the target or reference object is enclosed by another object.

Distance in Qualitative Spatial Reasoning

There are two general approaches to judging distance:

Absolute distance approaches use a consistent set of distance segments for all comparisons. Most absolute distance approaches are based on dividing the distance into sectors and assigning them a qualitative description such as "near" and "far". An issue with absolute distance approaches is that they do not scale across different sizes of space. For example, the meaning of a "near" object is different in vista spaces versus geographic spaces. When using multiple segments, a valuable technique for determining their length is the order of magnitude calculus [Mavrovouniotis and Stephanopoulos, 1990], where comparison values are selected such that each sequential value is many times larger than the previous one.
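A minimal sketch of such an order-of-magnitude partition; the labels and metre thresholds are our own illustrative choices, with each threshold an order of magnitude above the previous one:

```python
# Hypothetical order-of-magnitude distance partition. Labels and
# thresholds are illustrative, not taken from this work.
THRESHOLDS = [(0.1, "touching"), (1.0, "near"), (10.0, "far")]

def qualitative_distance(metres: float) -> str:
    """Map a continuous distance onto one of a few discrete labels."""
    for limit, label in THRESHOLDS:
        if metres < limit:
            return label
    return "distant"
```

Because adjacent thresholds differ by an order of magnitude, summing two distances that share a label usually yields a distance with that same label.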
Using the order of magnitude calculus for describing distances has the desired effect that summing together multiple distances of the same qualitative length will usually result in the same overall distance (e.g., if A and B are near and B and C are near, then A and C will likely be near). In our early work we used absolute distances based on the order of magnitude calculus.

Later, we shifted to absolute distances based on anthropometric measures, distance measures based on average human physical capabilities. For example, the average human reach from a seated position is approximately 50 cm, while the average length of one step is approximately 1 m [Tilley and Wilcox, 2002]. Objects are placed in environments so that they are comfortably available to the average person performing activities, so distance measures based on simple movements can identify objects used for the same task or related tasks. Anthropometric partitions can be selected to resemble an order of magnitude calculus and have most of the same advantages.

Relative distance approaches rely on a comparison to the size of other existing regions. For example, a simple relative distance metric used in QSR is the CanConnect(X, Y, Z) metric [de Laguna, 1922], which is true if region X can join regions or points Y and Z. Relative distance approaches are often used in image-based approaches to object recognition, though they are not typically described as such. For example, [Desai et al., 2009] use a near/far qualitative spatial relationship between detection windows T and R where T is near R if CanConnect(T, R, T) without changing the orientation of T, as shown in Figure 5.3.

We became interested in relative distance approaches that use either the reference or target object as the basis for comparison, since we believed that they could scale between the figural and vista spaces in houses better and could even work in larger spaces. Relative distance relationships also do not require the determination of partitions in advance.
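A rough sketch of the CanConnect(T, R, T) idea for axis-aligned 2D boxes, with no rotation of the connecting copy; this is our own simplification for illustration, not the implementation used in this work:

```python
def can_connect(t_box, r_box):
    """Relative "near" test inspired by CanConnect(T, R, T): T is near R
    if a copy of T's box, kept axis-aligned, can touch both T and R.
    Boxes are (xmin, ymin, xmax, ymax) tuples. Illustrative sketch."""
    tw = t_box[2] - t_box[0]   # width of T
    th = t_box[3] - t_box[1]   # height of T
    # gap between the boxes along each axis (0 if they overlap there)
    gx = max(r_box[0] - t_box[2], t_box[0] - r_box[2], 0)
    gy = max(r_box[1] - t_box[3], t_box[1] - r_box[3], 0)
    # a copy of T adjacent to T extends its reach by (tw, th)
    return gx <= tw and gy <= th
```

Note the asymmetry: because the connecting copy has T's dimensions, a small target must be closer to the reference to count as near.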
Relative distance approaches can also have the interesting effect of providing an asymmetrical distance relationship between objects, such that smaller objects need to be closer to other objects to be considered near to them.

5.4.3 Topology

Topology is a branch of mathematics belonging to the field of geometry that provides a coarsely granular approach to encoding object region structure through the examination of connectedness.

Figure 5.3: An example of a 2D relative distance relationship CanConnect(X, Y, Y). The reference box Y is shown in blue. The region surrounding Y shows the near/far partition. If X (not shown) overlaps with the region surrounding Y, then X and Y are "near"; otherwise they are "far". (a) shows the partition region when no rotation of Y is allowed to connect X and Y; (b) shows the partition region when any rotation of Y is allowed.

Connectedness can be considered as simply the property of two regions being in contact with each other. While QSR does borrow elements from mathematical topology, both point-set and algebraic topology are too abstract to be of practical use in QSR, as they are focused primarily on representation and do not relate to the "common-sense" elements of human spatial reasoning [Cohn and Hazarika, 2001].

Topology in QSR is used as the basis for describing the overlap between pairs of regions X and Y. Cognitive studies indicate that topological relationships are highly important to human spatial cognition and have been shown to be the primary basis on which tasks such as grouping are performed [Renz, 2002]. Topology is also interesting since it is one of the simplest representations of space but still captures significant spatial distinctions.

The RCC-8 Topological Calculus

One commonly accepted approach to describing relative topology in QSR is the RCC-8 (region connection calculus) system [Randell et al., 1992] for defining the relative topologies of two bodies.
RCC-8 describes an exhaustive, continuous, pairwise disjoint set of topological relationships that can exist between two regions X and Y. Figure 5.4 shows a diagram of the following eight states of the calculus and the possible transformations between them:

Figure 5.4: The RCC-8 topological calculus.

• Disconnected (DC)
• Externally Connected (EC)
• Partially Overlapping (PO)
• Equal (EQ)
• Tangential Proper Part (TPP)
• Tangential Proper Part inverse (TPPi)
• Non-tangential Proper Part (NTPP)
• Non-tangential Proper Part inverse (NTPPi)

All of the RCC-8 relationships can be defined in terms of a single primitive relation, the connected relation C(a, b) [Renz, 2002]. An interpretation of C(a, b) is that a and b are connected if and only if their topological closures share a common point. The closure of a region R consists of all points in R plus all the limit points of R. It can be shown that RCC-8 relationships can be applied to Euclidean spaces of any dimensionality [Renz, 2002].

In our work, we found topology less useful than other factors such as distance for determining spatial relationships, because it is impossible for physical objects to overlap, meaning that the only topological relationships between objects are contact relationships (e.g., "touching" or "separate"). However, we determine relationships between object detections, and object detections can be considered more of a psychological construct and can overlap each other. The localization error common to most object detections means that regions are almost never aligned such that they only overlap at their boundaries. This makes many of the states in the RCC-8 topological calculus (e.g., TPP, TPPi, EC and EQ) extremely unlikely to occur.

5.4.4 Shape and Scale

The shape of an object can be defined as all the geometric information about a structure that is invariant to rotation, scale and position.
Object shape provides a more finely granular approach than topology for describing the structure of individual objects. While topology can capture some aspects of structure such as the presence of holes, multiple parts or hollow interiors, it is insufficiently fine-grained and expressive for some QSR applications.

Describing the shape of an object qualitatively has proven to be a difficult problem, and it is not one we are actively pursuing, since we are interested in exploring relative qualitative relationships between objects, not their individual qualitative attributes. Furthermore, the techniques we use for performing initial object detection in 2D in Chapters 7 and 8 prevent us from determining the actual shape of objects.

A separate but related concept is scale or size, which is a measure of the extent of an object in space and can be measured in a number of ways (e.g., axis interval, volume, area, internal span, etc.). A spatial relationship we were planning to use in our final experiments, but eventually could not, is size comparison. Comparisons between object sizes based on shape are invariant to object position and orientation. The size of an object can be measured in terms of volume, surface area, and internal spanning vectors, all of which are potentially meaningful. Being a scalar magnitude, the relative size of two objects T and R can be expressed based on binary magnitude comparisons <, >, = [Hernández and Zimmermann, 1994]. However, like many quantitative spatial properties, precise judgments of object size are difficult for humans to make.
Therefore, when comparing relative sizes, it is useful to employ an order of magnitude comparison [Mavrovouniotis and Stephanopoulos, 1990], which equates object sizes if they are within an order of magnitude of each other [Hernández and Zimmermann, 1994; Mavrovouniotis and Stephanopoulos, 1989].

In our final experimental work in Chapters 7 and 8 we did not use any size comparison relationships because our technique for producing 3D detections gave all object detections of the same type the same size. Our triangulation techniques determined only the centroid of the objects, not their extent in 3D, and all detections of the same type use a bounding box based on the average size of all objects in that class. Since all size comparisons would then equate to just comparisons of object type, the size relationships contained no useful information. However, we believe that size could be a useful relationship in future work if we had a technique for determining bounding box size from detections.

5.5 Proposed Complex Relationships

There were two additional types of relationships, which we call containment and betweenness, that we included initially but decided not to use in our final work for several reasons. In this section we provide an overview of them for use in future work, explain why they were not included and how we replaced them with other relationships. These relationships relied on a mixture of representations and used a convex hull around the object regions to capture a perceptual extended region surrounding the objects, used to make spatial judgments. For example, we would say an apple is "in" a bowl, even though there is no overlap between the actual bowl and apple. We consider a region to be "inside"
the bowl, a region that can be approximated with a convex hull of the bowl. Both containment and betweenness relationships can be determined by examining the overlap of the convex hulls of regions defined by the target and reference object.

Containment

As mentioned in Section 5.4.3, we did not find the relative topology of object regions to be an effective property for finding containment relationships between physical objects because real world objects cannot overlap with each other. Instead, we were interested in a definition of containment able to recognize that a shelf could contain books or a bowl contain fruit, even though none of these objects physically overlap. In these examples, the objects' containment relationships are based on properties of the regions of the involved objects. In the case of the shelf, a book is entirely within a recessed concave region in the shelf's structure. Similarly, the fruit in a bowl is either partially or completely inside a concave region of the bowl. What is required to identify these relationships is an approach that extends the regions defined by the containing objects and then a metric for examining the overlap between this extension and the reference object.

To identify containment relationships, we examine the interaction of the convex hulls of objects T and R [Aiello and van Benthem, 1999], which we refer to as h(T) and h(R). If the objects already use a bounding representation, like an axis-aligned bounding box, the convex hull does not need to be computed. Since we are now comparing the relative topology of two bodies that can actually overlap with each other, we can use the RCC-8 topological calculus, which has been shown to be robust and effective at describing topological relations between regions.
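As an illustrative aside (our own Python sketch, not the thesis implementation), the RCC-8 state of two convex regions approximated by axis-aligned rectangles can be computed with simple interval tests:

```python
# Illustrative sketch: RCC-8 state of two axis-aligned boxes standing in
# for the convex hulls h(T) and h(R). A box is ((xmin, ymin), (xmax, ymax)).
# The function name and representation are ours, chosen for clarity.

def rcc8(a, b):
    """Return the RCC-8 state of box a relative to box b."""
    (axl, ayl), (axh, ayh) = a
    (bxl, byl), (bxh, byh) = b
    # Disconnected: separated along some axis.
    if axh < bxl or bxh < axl or ayh < byl or byh < ayl:
        return "DC"
    # Externally connected: boundaries touch but interiors do not overlap.
    if axh == bxl or bxh == axl or ayh == byl or byh == ayl:
        return "EC"
    if a == b:
        return "EQ"
    a_in_b = axl >= bxl and axh <= bxh and ayl >= byl and ayh <= byh
    b_in_a = bxl >= axl and bxh <= axh and byl >= ayl and byh <= ayh
    boundary_touch = axl == bxl or axh == bxh or ayl == byl or ayh == byh
    if a_in_b:
        return "TPP" if boundary_touch else "NTPP"
    if b_in_a:
        return "TPPi" if boundary_touch else "NTPPi"
    return "PO"
```

The same interval logic extends axis-by-axis to 3D boxes; only the tuple unpacking changes.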
The result of our containment relation is an asymmetric relation contain(T, R) = topology(h(T), h(R)) that returns one of the 8 states of RCC-8 to describe the relative containment of a target object T and reference object R, as shown in Figure 5.5.

Figure 5.5: This figure demonstrates how containment is determined using convex hulls. Figure 5.5a ((a) Basic shapes) shows 4 objects defined by 2D regions. In Figure 5.5b ((b) Individual convex hulls for identifying containment), each shape has been overlaid by its convex hull to demonstrate containment detection. Object B is contained by object A, or more specifically, the RCC-8 containment relationship between A and B is NTPPi (non-tangential proper part inverse).

The contain(T, R) relationship captures the following types of containment relationships between R and T:

Containment Relationship of T to R    Equivalent RCC-8 Relationships of contain(T, R)
Separate                              DC, EC
Overlapping                           PO, EQ
Contains                              TPP, NTPP
Contained by                          TPPi, NTPPi

In the end we did not use the containment relationship for two reasons. Firstly, none of the objects we included in our experiments in Chapter 8 are sufficiently concave that it is possible for other objects to be partially contained inside them. Secondly, objects inside containers are difficult to detect using cameras or 3D sensors. For many container objects, like refrigerators and ovens, objects inside can only be observed when the container is open. The few objects that could open and contain other objects were never open during the data collection and did not contain any of the objects we were trying to detect. In an object detection model where a robot is observing and interacting with an environment for a longer period of time, there would be more opportunities for it to observe opened containers and identify the objects they contain.

Figure 5.6: This figure demonstrates how betweenness is determined using convex hulls. Figure 5.6a ((a) Basic shapes) shows 4 objects defined by 2D regions. In Figure 5.6b ((b) Shared convex hull of A and D for identifying betweenness), a convex hull has been overlaid on the combined points of objects A and D to detect which objects share betweenness relationships with them. Objects B and C are both between A and D, with object B having an NTPPi relationship and object C having a PO (partial overlap) relationship.

Betweenness

The spatial relationship betweenness identifies when a target object T is located in the interval between two reference objects R1 and R2. As a spatial relationship, betweenness can capture significant attributes of highly structured scenes [Aiello and van Benthem, 1999]. As a qualitative spatial relationship, betweenness may be of particular interest to us since some aspects of human environments are highly structured (e.g., books aligned on bookshelves or plates between knives and forks).

An obvious approach to determining betweenness uses the centroid points of the objects as values to a triadic, extrinsic orientation metric. Let X, Y and Z refer to the centroids of objects R1, T and R2 respectively. In the simplest representation, if Y lies on the line between X and Z, then betweenness(X, Y, Z) = true. Since perfect collinearity may not match the desired definition of betweenness, the orientation typically should be relaxed as follows. If ∠(X, Z) is the angle subtended at Y by X and Z, and α is a constant minimum angle, then:

Between(X, Y, Z) = true if |∠(X, Z)| ≥ α, and false otherwise.    (5.1)

A problem with the orientation approach to betweenness is that it relies on object centroids but does not consider the object shape or size, potentially rejecting instances where part of T is between R1 and R2 but not the central mass. Also, the orientation approach requires an arbitrary constant α.
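One reading of the orientation test in Eq. (5.1) — assuming the angle is measured at the target centroid Y between the directions toward X and Z, with centroids assumed distinct — can be sketched in Python (the function names and default α are ours):

```python
import math

# Illustrative sketch of the centroid orientation test for betweenness:
# T is "between" R1 and R2 when the angle subtended at T's centroid by the
# two reference centroids is close to a straight line (>= alpha).

def angle_at(y, x, z):
    """Angle in radians between rays y->x and y->z (2D points, y distinct)."""
    ax, ay = x[0] - y[0], x[1] - y[1]
    bx, by = z[0] - y[0], z[1] - y[1]
    cos = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp for float safety

def between(x, y, z, alpha=math.radians(150)):
    """betweenness(X, Y, Z): X, Y, Z are centroids of R1, T, R2."""
    return angle_at(y, x, z) >= alpha
```

With the default α = 150°, exactly collinear centroids give an angle of 180° and pass the test, while a target well off the R1–R2 axis fails it.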
Orientation as given above also restricts the possible qualitative betweenness relationships to a binary value and fails to differentiate between an object partially between or completely between two other objects.

The extended convex hull approach to determining shape-based betweenness [Aiello and van Benthem, 1999] avoids the previous problems. In this approach, an extended convex hull surrounding both R1 and R2 is determined and compared topologically with T. Figure 5.6 shows an example of how the convex shape-based approach can be applied to determining betweenness in 2D. To determine shape-based betweenness, a convex hull Q is found for all points on both R1 and R2. The output value of the relationship between(R1, R2, T) is the RCC-8 calculus value of the topological relationship between Q and T. As shown in Figure 5.6, the convex hull approach to betweenness incorporates object shape, allows for a more complex set of potential relationships than the triadic orientation approach and avoids the α constant.

A problem with the extended convex hull approach is that anything inside a reference object would also be considered between the reference objects. To deal with this, we suggest that Q̄, the convex hull Q minus the individual hulls surrounding the reference objects R1 and R2, be used instead:

Q̄ = S(R1 ∪ R2) \ (S(R1) ∪ S(R2))    (5.2)

where S(X) is the convex hull of region X.

We did not use the betweenness relation in our work in Chapter 7 because it is not a binary spatial relationship and, given our limited amount of training data, we did not have sufficient examples to learn from. In effect, it was simply too finely granular for our work. To avoid this issue, we created binary relationships which captured similar spatial arrangements. One relationship is "beside", which compared the interval overlap of pairs of objects in the x and y axes defined by the shape of the room.
In effect, this approach replaces the second reference object, which provided the frame of reference, with the shape of the room providing the FoR. It is based on the observation that objects are aligned with the walls of a room. We also added a relationship called "vertical orientation" which identifies when objects are vertically stacked. Both of these relationships are explained in greater detail in Section 7.7.

Chapter 6
Spatial Object Classification in Virtual Environments

This chapter describes our early work on object classification using spatial relationships. The goal of this work was to test whether the spatial relationships between a target unknown object and a surrounding set of known objects could be used to effectively classify the unknown object. Note that this is a classification problem, not a detection one; we are assuming knowledge of the position of the target object and the position and type of all the reference objects. If an object can be identified by other surrounding objects, this would imply that object-object 3D spatial relationships could provide a good basis for performing joint object detection, which we discuss in Chapter 7.

6.1 Problem Formulation

The classification task was to identify a target object T given its spatial relationships with all the other reference objects R1...N in an entire house. The types of all objects except for T were known. In this work we did not have information on the shape of the building itself, only the objects inside it, so we had no way of determining the extent of rooms. Therefore, the spaces we were operating within were entire buildings, not rooms or scenes. The spatial relationships were acquired using information about the position, orientation and shape of T and R1...N. This classification task, then, equates to a robot classifying a single object with perfect information about its size and shape but no evidence as to its type, given perfect information about all other objects in the house.
Clearly, this is an unrealistic problem formulation, but we picked this problem because the performance would provide an upper bound on how well object-object spatial relationships alone could be used for object classification. We used a maximum entropy model and alternating boosted decision trees for the classification task, but in this chapter we will only cover the more successful decision tree results, which outperformed the maximum entropy models in every experiment.

6.2 Synthetic Data from Elder Scrolls Oblivion

In order to create and test this qualitative spatial relationship-based object classifier, we needed a data source with information about the type, shape, position and orientation of all the objects in a set of indoor environments. Acquiring this data through manual measurement was impractical for such early work. We were also concerned about the privacy issues involved in modeling the interiors of people's houses. At the time, using 3D measurement devices such as laser range finders and stereo cameras was deemed too problematic given the difficulty in building up complete 3D models of the environment, segmenting these models into objects and accurately labeling these objects.

For training and test data for our work, we identified a novel source of high quality synthetic data from a commercial video game called The Elder Scrolls IV: Oblivion [Bethesda Softworks, 2006]. Figures 6.1, 6.2 and 6.3 show scenes from Oblivion: a tavern, a dining room and a library. The spatial relationships demonstrated in these scenes show a surprising level of complexity and detail. For example, the wine rack in the back of the tavern contains many different types of wine placed into its lattice structure. The dining room table is set with over 15 types of food, with different arrangements of food on the dinner table and serving plates. Other complex relationships include tools standing upright in barrels, bowls piled with fruits, and books ordered into sets on bookshelves.
These models represent a snapshot of the contents of a house, so there is no temporal aspect to the data they provide.

6.2.1 Advantages of Using the Oblivion Data Set

A number of factors made the data from Oblivion appropriate for our early work. First of all, there is plenty of material. The designers of Oblivion designed their game to be easy to modify and customize, so the data files could be parsed to extract the layout of the entire game world, and there are tools for extracting the object models. Oblivion contains several hundred houses, castles, forts, churches, mines and other types of habitations, though in our work we restricted ourselves to places containing the word "house" in their description. Houses appear in different architectural and ethnic styles ranging from city to city and none of the houses are simply identical copies of each other. The contents of each house are modeled at a very fine level of detail. Tables in the houses are set with meals, shelves are full of books and ornaments, bedrooms are laid out with beds, dressers and clothes. Rooms vary in the degree of tidiness, with some carefully and ornately laid out and others messy or even derelict.

Figure 6.1: A tavern from Oblivion. The wine rack in the rear of the bar holds about thirty bottles of wine in five different varieties. These were collapsed into a single object type "wineBottle" for classification purposes. The cat-like figure behind the bar is the owner. All game characters were removed from the training and test data.

Altogether, there are over 1000 different types of objects modeled for the game. The setting is a medieval fantasy world, so the contents of the houses were antiquated, but the complexity of their environments is substantially beyond what was normal in any other game at the time.

Oblivion is an unusual game since almost all of the objects contained have no purpose but to make the environments seem realistic.
In fact, many of the houses we used would never even be seen from the inside by the player, existing primarily as an interesting backdrop. This emphasis on realism rather than playability is a very important factor in using video game data, since in most games the realism of an area is reduced to meet the artificial requirements of game play. For example, in many games, if the player were just supposed to move from one end of a building to the other quickly, most of the rooms in the building would not be modeled. Even if a room is modeled, most of the contents will be game artifacts, objects that are primarily significant to the game play, like weapons or "power ups", that would be artificially placed and would not appear in the real world. There would only be a small number of background objects to make the environments realistic.

Figure 6.2: A large and ornate dining room from Oblivion. In the center of the image is a dining table containing food and wine set for ten people. Surrounding the table are chairs, though these are partially obscured by shapes that show how character models would transition from standing to sitting on the chair.

Figure 6.3: A library from the Oblivion data set. The shelves in the back contain books, ornaments and tools. In the foreground there is a table set for two with food and wine.

6.2.2 Issues With Using the Oblivion Data Set

There were also some fairly serious limitations of the Oblivion data set. Due to a feature of the game's design, none of the completely enclosing containers (e.g., boxes, chests, or drawers) in Oblivion actually contain objects, but bookshelves, not being enclosed, do. Also, despite the high level of detail, there are fewer objects in an Oblivion house than in a real one. Since this was early work on this problem, we decided it was appropriate to use a simplified data set.
Despite the medieval setting, the objects still have a spatial structure that can be modeled, and the design of the houses is relatively modern, with many having multiple floors, large rooms and windows, and areas like bedrooms, kitchens, and offices.

The bigger problem we had was convincing people that the game's environments were realistic enough to prove that our model had validity in the real world. The main issue that was raised was the possibility that some of the scenes might have been algorithmically generated by a computer with a simple set of spatial relationship rules. If this were true, it would imply that we were really only reverse engineering the rules used to create the environments in the first place. We saw no evidence of this when examining the data, but it remained a concern and was part of the reason we stopped using this data in our later work.

6.3 Qualitative Spatial Relationships

The structure of a qualitative spatial relationship between a target unknown object T and a known reference object R was expressed as T S R, where S is the spatial relationship (e.g., cup leftOf wine). Since this was early work, the relationships we used were ones of our own design, rather than specific relationships from the existing body of qualitative spatial reasoning. We used a very simple decomposition of space proposed by Freeman [Freeman, 1975] that has three types of relationships: distance-based (near, far), direction-based (left of, right of, in front, behind, above, below) and containment-based (inside, outside, surrounding, overlapping).

6.3.1 Distance Relationships

The distance relationships we used were: touching, near (approximately within arm's reach), mid (short walking distance), and far (long walking distance). Our distance metric used the minimum distance D(T, R) between the exteriors of the triangle meshes describing the regions of object T and of object R [Gottschalk et al., 1996]. Each relationship then corresponded to a range of values along this distance.
The range values we used were based on estimations of the conceptually relevant distances.

Touching: Physical contact between objects (touching) is usually caused by gravity and one object supporting the other. Objects typically only touch a small range of other objects, so contact can be a highly descriptive relationship (e.g., food on plates, books on bookshelves and furniture on floors). It was formally defined as

T touching R ⟺ D(T, R) ≤ 0 cm

Near: Objects that share a common purpose, such as eating, cooking or writing, are frequently found within arm's reach of each other, and the near relationship can capture this effect. This range is particularly useful for grouping together objects at the tabletop level. It was formally defined as

T near R ⟺ 0 cm < D(T, R) ≤ 40 cm

Mid: The mid proximity range captures objects that belong to a common task and share either a room or a scene. This range is useful for grouping together larger objects, like chairs and tables, at the room level. It was formally defined as

T mid R ⟺ 40 cm < D(T, R) ≤ 200 cm

Far: This final proximity range relates objects that share the same overall environment (i.e., the same house) but are most likely not functionally associated except at the broadest level. This relationship is useful for extracting data on the overall object population of a house. It was formally defined as

T far R ⟺ D(T, R) > 200 cm

6.3.2 Direction/Containment Spatial Relationships

Since early on we believed that our simple direction and containment relationships were not sufficiently expressive to be useful individually, we aggregated them together to form a joint direction/containment relationship. To capture directional and containment-based spatial relationships, we compared axis-aligned bounding boxes around our target object T and comparison object R.
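Before detailing the containment tests, note that the distance relations of Section 6.3.1 reduce to simple threshold tests on D(T, R); an illustrative sketch (the thresholds follow the definitions above, the function name is ours):

```python
# Illustrative sketch: map the minimum exterior mesh distance D(T, R),
# in centimeters, to the qualitative proximity relations of Section 6.3.1.

def distance_relation(d_cm):
    if d_cm <= 0:
        return "touching"   # physical contact, usually support by gravity
    if d_cm <= 40:
        return "near"       # roughly within arm's reach, tabletop scale
    if d_cm <= 200:
        return "mid"        # short walking distance, room scale
    return "far"            # same building, likely not task-related
```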
For our containment relationships we used a classic simple qualitative set of relationships: inside, overlap, surround, and disjoint [Freeman, 1975]. The disjoint relationship was then subdivided further using directional data. We used an approach shown in Section 5.4.1 that compared the overlap of the object bounding boxes after they were projected onto the vertical axis using a simplified interval calculus. This subdivided the disjoint containment set into above, below and level. We opted not to use intrinsic qualitative spatial relationships (e.g., leftOf, behind, etc.) because, to do so in any real world future work, we would need to be able to reliably determine the orientation of the reference object R if it were to function as a frame of reference for T.

For the complete set of relationships, let t be all points in the bounding box of T and r be all points in the bounding box of R. Let min_{x,y,z}(T) and max_{x,y,z}(T) be the minimum and maximum values in each axis of the bounding box for object T.

T inside R   ⟺ t ⊆ r
T surround R ⟺ t ⊇ r
T overlap R  ⟺ t ∩ r ≠ ∅, t ⊈ r, t ⊉ r
T above R    ⟺ t ∩ r = ∅, min_z(T) > max_z(R)
T below R    ⟺ t ∩ r = ∅, max_z(T) < min_z(R)
T level R    ⟺ t ∩ r = ∅, otherwise

6.4 Relationship Sets

With the simple relationship types described above, we wanted to know how useful they were for classification and how they could be combined to improve classification accuracy. To that end, using our distance relationships and direction & containment relationships, we derived four relationship sets for our experiments:

Distance relationship set (Dist): This set used just the distance relationship.

Direction/Containment relationship set (DirCon): This set used just the direction & containment relationship.

Aggregate relationship set (Aggregate): This relationship set concatenates the Dist and DirCon sets (e.g., T near R, T above R).

Paired relationship set (Paired): This relationship set used the Cartesian product of the two sets, merging relationships to create a single new relationship that combined distance, direction and containment (e.g., T near&above R). Since we are assuming that people use a common mental representation as the basis for their proximity, direction and containment judgments, the use of paired relationships is justified [Duckham et al., 2006]. The paired relationship set was interesting since it directly encoded more complex and specific spatial relationships than the aggregate set, but at the cost of being more finely granular and not allowing the classifiers to combine relationships on their own.

6.5 Experiments

With our existing classifier-based alternating boosted decision trees, we ran some experiments to examine the effectiveness of different groups of qualitative spatial relationships for spatial object classification. Our goals with these early experiments were to determine whether qualitative spatial classification was a tractable problem and which spatial relationships were useful.

6.5.1 Training and Test Data

Our training and test data consisted of 197 houses from Oblivion containing approximately 14,000 objects with class names, 3D shapes, positions and orientations.
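As an aside, the bounding-box tests of Section 6.3.2 reduce to per-axis interval comparisons; an illustrative Python sketch (the function name and box representation are ours):

```python
# Illustrative sketch of the direction/containment tests of Section 6.3.2.
# A 3D axis-aligned box is ((xmin, ymin, zmin), (xmax, ymax, zmax));
# index 2 is the vertical (z) axis.

def dir_con(t, r):
    tmin, tmax = t
    rmin, rmax = r
    disjoint = any(tmax[i] < rmin[i] or rmax[i] < tmin[i] for i in range(3))
    if not disjoint:
        if all(tmin[i] >= rmin[i] and tmax[i] <= rmax[i] for i in range(3)):
            return "inside"     # t is a subset of r
        if all(rmin[i] >= tmin[i] and rmax[i] <= tmax[i] for i in range(3)):
            return "surround"   # t is a superset of r
        return "overlap"
    # Disjoint boxes are subdivided by comparing vertical intervals.
    if tmin[2] > rmax[2]:
        return "above"
    if tmax[2] < rmin[2]:
        return "below"
    return "level"
```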
The shape of each object is from the game's physics engine, which provides simple tight bounding polyhedra. We removed a number of game artifacts and atmospheric objects like cobwebs, ambient light sources, and hidden switches, since they would either not have been placed by humans or were relevant only to the game. Objects were grouped together into classes like book, food, table, and chair, to produce 97 object classes, and from these we selected 17 types of objects to classify. These 17 types were chosen because they appeared frequently, though with radically different frequency from each other, and they cover a variety of tasks and locations in the building.

To encode this data for the decision trees we use a flat, fixed length feature vector. If Z is the number of types of relationships used in the experiment and C the number of classes of objects, the feature vector has Z × C categories. Each feature then describes the relationship between the target object and a type of reference object (e.g., target near book). We include an "absent" relationship for each category to handle situations when there is no reference object of the category's type in the building.
If there are multiple occurrences of the same type of reference object in the building, we used the closest reference object to the target object, under the assumption that closer objects are more significant for classification.

6.5.2 Results and Discussion

Table 6.1 shows the results of our experiments comparing the classification ability of each of the relationship sets. We found our alternating decision tree (ADTree) classifier was substantially more accurate than the maximum entropy model we tried with all the data sets and showed the greatest improvement on the sets that combined distance, direction and containment.

Table 6.1: Relationship set accuracy comparison

Relationship Set        ADTree Accuracy
Direction/Containment   57.56%
Distance                59.97%
Aggregate               68.07%
Paired                  66.09%

Table 6.2: Confusion matrix for aggregate distance & direction/containment classifier. Rows give the correct object type; columns give the classified object type in the order: Plate, Cup, Pitcher, Shelf, Vase, Beer, Bowl, Bed, Bench, Winerack, Book, Table, Candle, Chair, Painting, Food, Wine (maximum value in each row in bold in the original).

Plate:    342 26 16 2 8 0 21 3 0 1 4 15 3 9 0 183 11
Cup:      16 339 21 0 5 12 5 1 1 0 0 2 6 9 0 76 12
Pitcher:  10 51 130 3 6 4 18 5 0 0 1 4 7 30 3 42 14
Shelf:    3 4 2 141 1 1 3 1 5 2 0 32 0 23 3 11 0
Vase:     10 21 24 1 12 2 6 1 1 0 0 2 7 10 0 20 3
Beer:     1 16 1 0 0 45 2 2 0 2 1 1 0 3 3 12 11
Bowl:     44 16 20 1 1 2 78 4 0 0 0 6 2 10 2 58 8
Bed:      1 2 7 5 1 0 1 85 2 0 0 14 2 12 2 7 3
Bench:    3 0 0 2 1 0 0 0 35 0 0 22 2 21 1 4 2
Winerack: 1 1 0 1 0 1 0 0 0 20 0 3 0 0 0 0 9
Book:     3 3 5 0 1 0 3 0 0 0 9 8 0 4 0 7 1
Table:    23 3 3 32 7 0 5 9 9 1 1 307 4 40 4 13 13
Candle:   7 14 12 1 4 1 3 0 2 0 1 7 11 21 1 13 5
Chair:    6 5 18 3 2 0 4 10 3 2 0 21 10 459 0 28 12
Painting: 1 0 2 8 1 0 0 4 0 0 0 2 0 1 48 3 2
Food:     78 43 22 3 1 1 13 2 0 0 0 15 2 20 2 1650 12
Wine:     4 8 4 0 0 4 1 1 1 1 0 1 2 13 2 43 485

The direction/containment relationship set was the least useful relationship set, most likely because it did not differentiate between near and distant objects if they were disjoint.
Distance relationships made for a better classifier, indicating the significance of touch relationships and the importance of being able to differentiate between close and distant objects. The relationship sets that used both direction/containment relationships and distance relationships produced a more effective classifier, showing that both are useful for different classification problems.

Paired relationships were less successful at fusing the distance and direction/containment relationships since, although they allowed for complex dependencies on spatial relationships between the objects such as "above and touching", the decision trees could still recognize those complex relationships by encoding them into the tree structure. To recognize "above and touching" with only the component relationships requires two connected sequential decision nodes, one predicated on "above" and one on "touching". Using the paired relationships limited the types of decision structures the tree could exploit, but the differences were not large. Overall, what was evident was that having many spatial relationships was more important than how they were combined.

Table 6.2 shows the confusion matrix from the aggregate relationship set classifier, with the maximum values for each row in bold. Examining it shows that classified objects were often confused with types of objects found in similar locations. For example, benches were mistaken for chairs and bowls were mistaken for cups or plates. Overall, there is an issue with objects being misclassified as food. Since food is by far the most common object in Oblivion and is found in a wide variety of places, objects that are difficult to classify were identified as food.

6.6 Influence on Later Work

These results showed us that the immediate scene context, as described by qualitative spatial relationships between objects, can provide significant evidence towards an object's type.
It also showed us that the representation of the spatial relationships (combined together or independently represented) did not matter as much as having a variety of relationships. The next step for our work was to solve the more realistic problem of combining spatial relationships with the results of a visual object detector to perform object detection using both visual and spatial information, the problem we address in the next chapter. We also knew that we would need to expand our selection of spatial relationships.

This was also the last time we used our synthetic data from the game Oblivion. Performing visual object detection on the video game objects was problematic because all examples of the same type of object look exactly the same. We had proposed simulating the results of an object detector but decided that simulating a detector on simulated data would not produce convincing results. The formatting of the Oblivion data was also difficult to work with, and designing a system which allowed for switching between real and video game data would have been very time consuming. Therefore, we dropped the Oblivion data and focused on collecting real scene data, as we discuss in Section 8.1.

Chapter 7
Improving Object Detection using 3D Spatial Relationships

In this chapter, we describe our approach to improving object detection accuracy using 3D spatial relationships. Given that object detectors typically produce many false positive detections, we want to differentiate between true and false positive detections using 3D spatial relationships. Our approach employs a model, trained on 3D examples of indoor scenes, to improve the accuracy of multiple types of object detectors applied to a scene, based on the spatial relationships between the detections.

We begin by describing our problem and the 3D object detections that are the input to our approach. The model we employ, which is based on the work of [Desai et al., 2009], was discussed in Section 3.4.4.
In this chapter we describe how we adapt their model to 3D object detection, our improvements to training and inference using branch and bound tree search, and how the parameters of our model are computed using structured support vector machines [Tsochantaridis et al., 2004]. We end the chapter with a detailed description of the 3D spatial relationships we employ.

7.1 Overview

Our work in Chapter 6 described a method for classifying a single object given perfect information about the type and location of surrounding objects. Our new approach, however, is based on object detections and the 3D spatial relationships between them, information that would be available to a robot. Our work is based on a technique for improving object detection accuracy in images by [Desai et al., 2009], which they describe as a method that "simultaneously predicts a set of detections for multiple objects from multiple classes over an entire image". Our work performs a similar function but is applied to 3D detections and uses 3D spatial relationships.

The input to our model is a set of hypothesis boxes. In our work we use the term hypothesis boxes instead of 3D detections. A hypothesis box is a region of 3D space identified as containing an object. We differentiate between the terms because hypothesis boxes could be produced directly by an object detector (either image or shape-based) or constructed from image-based object detections. The output of our model is a new set of confidence scores for the hypothesis boxes such that the scores for true positives increase and false positives decrease. Our approach does not produce new detections, so it cannot improve on missed detections.

The model uses a structured SVM [Tsochantaridis et al., 2004] to learn a set of weights, each associated with a target object type, a reference object type and a qualitative spatial relationship (e.g., pan ABOVE stove).
These weights capture whether the presence of a spatial relationship between two detections of the appropriate types is a good indicator that both detections are likely true positives. The weights are used with a scoring function to identify a subset of detections with spatial relationships that constitutes the most likely layout of detections. This subset of likely detections is used to adjust the scores associated with every 3D hypothesis box based on its 3D spatial relationships to the objects in the subset. In computing the weights, the spatial relationships of the true and false positive hypothesis boxes are used, not the relationships relative to the ground truth positions of the objects. The ground truth positions are only used to identify true hypothesis boxes. Our goal is to differentiate between true and false detections, not between ground truth and false detections.

7.2 Hypothesis Boxes

Each hypothesis box hi has three components hi = {bi, yi, si}, where bi is an axis-aligned bounding box defined by two 3D points, yi ∈ 1...K is a numeric label for K object types and si ∈ R is a detection confidence score. Hypothesis boxes are either produced by an object detector that works on 3D data or derived from the results of an object detector that operates in 2D. In our work, 3D hypothesis boxes are created from multiple 2D detections in images of the same scene using triangulation, as we describe in Section 8.3.1.

We use axis-aligned bounding boxes, which are efficient to compute from multiple 2D detections and allow for a wide range of spatial relationships. Indoor scenes generally have a dominant set of axes at right angles, defined by the surrounding walls, and objects are aligned with these axes. Therefore, our hypothesis boxes are also aligned with the dominant axes in the room. In scenes where the objects are not aligned, axis-aligned hypothesis boxes can become exaggerated in the x and y axes, with minimal effect on the resulting qualitative spatial relationships.
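As a concrete sketch, the triple hi = {bi, yi, si} might be represented as follows (the class and field names are illustrative assumptions, not from our implementation):

```python
from dataclasses import dataclass
from typing import Tuple

Point3 = Tuple[float, float, float]

@dataclass
class HypothesisBox:
    """One 3D detection: axis-aligned box b_i, object type y_i, score s_i."""
    bbox_min: Point3   # minimum corner of the axis-aligned box (metres)
    bbox_max: Point3   # maximum corner of the axis-aligned box (metres)
    label: int         # y_i in 1..K (0 is reserved for the background class)
    score: float       # detector confidence s_i

    def centroid(self) -> Point3:
        """Centre of the box, used e.g. for centroid-distance comparisons."""
        return tuple((lo + hi) / 2.0
                     for lo, hi in zip(self.bbox_min, self.bbox_max))

# A hypothetical "pan" detection sitting on a counter.
pan = HypothesisBox((0.2, 0.0, 0.9), (0.5, 0.3, 1.0), label=3, score=0.72)
```

Any representation with these three components would serve; the two-corner box form is convenient because all of the spatial relationships in Section 7.7 reduce to per-axis comparisons of the corners.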
The box bi could be replaced by any description of 3D space (e.g., triangle mesh, bounding sphere, etc.) which allows for the computation of 3D spatial relationships between hypothesis boxes.

Each hypothesis box has only one associated type and confidence score because that information corresponds to the output of most object detectors. An alternative model we considered first identifies candidate objects in the scene using approaches like table top segmentation, then applies N object detectors to those objects, producing a 3D region with N associated types and scores. The UBC team at the SRVC robot challenge [Meger et al., 2008] used this technique to increase object detection efficiency by allowing the robot to focus on specific areas and acquire high quality images. We did not pursue this approach as it requires the initial detection of candidate objects in the scene, which was outside the scope of this work. The multiple object types and scores per hypothesis box representation used in the SRVC could be captured with our model, simply by having multiple overlapping hypothesis boxes for each type of object.

7.3 Scene Detection

Given the results in scene detection we demonstrated in Chapter 4, it should be possible, given multiple images of a scene, to determine its type with good accuracy. Thus, for this work, we assume that we know the type of scene and only detect objects likely to be found in that environment. Each scene type has its own model of the spatial relationships found between the set of objects that are commonly found in that type of scene. A general model that would encompass all commonly found objects is beyond this work, but is discussed further in Chapter 9.

7.4 Model

Let Hs = {hi : i = 1...N} be the input 3D hypothesis boxes produced for scene S. Let Y = {yi : i = 1...N} be a label vector for all hypothesis boxes in a given scene. A background object label 0 is assigned to hypothesis boxes that are predicted to be false positives.
A hypothesis box with a label other than background is called an instanced hypothesis box.

Hypothesis boxes are never relabeled with a new object type, so yi can only change between one object type and the background label. A detection should be relabeled only if it was incorrectly labeled (i.e., the wrong detector activated on that point). Since detectors do not suppress each other, we rely on the correct object detector to also produce a detection at this location in space that overlaps in 3D with the wrong detection. Our approach then adjusts the scores of both detections to select the correct detection. This means there is no need to relabel detections to anything other than the type of their detector, or as background to indicate that they are false positives. If only incorrect detectors fire on an object, we simply decrease the detection scores of the incorrect detectors to suppress them; there is no true positive to increase.

We use the same scoring function as [Desai et al., 2009], which measures how well a set of labels for hypothesis boxes agrees with our model of expected spatial relationships between objects. Maximizing the scoring function provides the most likely set of labels for all hypothesis boxes given their associated scores, types and spatial relationships. We define the scoring function for a set of labels Y on a set of hypothesis boxes H to be

S(H, Y) = Σ_{i,j} w_{y_i,y_j}^T d_ij + Σ_i w_{y_i}^T h_i    (7.1)

where w_{y_i,y_j} is a weight vector that encodes the value of two objects of type y_i and y_j sharing the spatial relationship d_ij. d_ij is a sparse binary vector that captures multiple spatial relationships. Since our relationships are not necessarily binary, a qualitative spatial relationship with N possible values is mapped onto a binary vector of length N. These binary vectors are concatenated to create d_ij. The qualitative spatial relationships are based on the axis-aligned bounding volumes bi and bj.
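To make Equation 7.1 concrete, the scoring function might be sketched as follows. The containers `pair_w` (the weights w_{a,b}), `local_w` (the two-dimensional local weights w_a) and the callback `rel(i, j)` (the sparse binary vector d_ij) are illustrative assumptions, not our implementation:

```python
def score(labels, scores, pair_w, local_w, rel):
    """Score a labelling Y = `labels` of the hypothesis boxes (Eq. 7.1).

    Boxes with label 0 (background) contribute nothing, so only instanced
    hypotheses interact.
    """
    n = len(labels)
    total = 0.0
    for i in range(n):
        if labels[i] == 0:
            continue
        # Local term: w_{y_i}^T [s_i, 1] -- the appended 1 learns a class bias.
        wa = local_w[labels[i]]
        total += wa[0] * scores[i] + wa[1]
        for j in range(n):
            if j == i or labels[j] == 0:
                continue
            # Pairwise term: w_{y_i, y_j}^T d_ij over the binary relationships.
            d = rel(i, j)
            w = pair_w[(labels[i], labels[j])]
            total += sum(wk * dk for wk, dk in zip(w, d))
    return total

# Tiny illustrative setup: two boxes, classes 1 and 2, one relationship firing.
labels = [1, 2]
scores = [0.9, 0.5]
local_w = {1: (1.0, 0.0), 2: (1.0, 0.1)}
pair_w = {(1, 2): (0.5, 0.0), (2, 1): (0.5, 0.0)}
rel = lambda i, j: (1, 0)
```

With these made-up weights, instancing both boxes scores higher than labelling either as background, which is the effect a trained model should produce for a compatible pair of detections.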
The seven relationships we use are described later in Section 7.7.

w_{y_i} represents a local detector score weight associated with hypothesis box hi for object type y_i. The detector score is made a two-dimensional vector by appending a 1, which allows this weight to learn the bias between the object classes.

For the hypothesis boxes assigned to the background class, the local and pair-wise spatial relationship weights w_{y_i} and w_{y_i,y_j} are 0, effectively eliminating them from the scoring function. This means the model only considers the interactions between hypotheses that are considered to be true positives (instanced hypotheses). This helps both inference and learning by reducing the number of labels that influence the model.

7.5 Inference

In the inference step, the goal is to determine the set of hypothesis boxes that are true positives and then adjust the scores of all boxes based on their spatial relationships to this set. With correctly trained weights, the labels that constitute the most likely scene can be computed as argmax_Y S(H, Y). Repeatedly computing argmax_Y S(H, Y) is also an essential part of the learning algorithm, as we will explain in the next section, so efficiency of computation is essential.

7.5.1 Greedy Forward Search

Since computing argmax_Y S(H, Y) is NP-hard under most conditions, [Desai et al., 2009] use a simple greedy forward search. Let X be a set of 2D windows produced by an object detector on an image. Let I be a set of instanced window-class pairs (i, c) from X and let Y(I) be the associated set of labels, where y_i = c for all instanced window pairs in I and all other windows take the background class y_i = 0. Let S(X, Y) be an equivalent scoring function to ours but applied to 2D windows rather than 3D hypothesis boxes. Let the change in score from instancing window i with class c be

Δ(i, c) = S(X, Y(I ∪ {(i, c)})) − S(X, Y(I))    (7.2)

Initialize I = {}, S = 0 and Δ(i, c) = w_c^T x_i. The following greedy algorithm is then repeated:

1. (i*, c*)
= argmax_{(i,c) ∉ I} Δ(i, c)

2. I = I ∪ {(i*, c*)}

3. S = S + Δ(i*, c*)

4. Δ(i, c) = Δ(i, c) + w_{c*,c}^T d_{i*,i} + w_{c,c*}^T d_{i,i*}

until Δ(i*, c*) < 0 or all windows are instanced. [Desai et al., 2009] report that this approach gave solutions close to optimal on their 2D detections.

In our work on 3D scenes, we found that greedy search failed to produce effective models during training. We determined that, on our data, it frequently does not compute the maximum scoring set of labels when compared with the brute force solution. If the first selected detection is a false positive, it can throw off the entire rest of the search. This is more noticeable in our work since, even if the first selected hypothesis box has a high associated detection score, it may not be well localized in 3D and will not have appropriate spatial relationships with the other objects in the scene.

7.5.2 Branch and Bound Search

We replaced the greedy search approach with a branch and bound tree search, which is slower but which provides an optimal solution given an upper bound on the number of possible true positive detections in the scene. Even if a better solution exists with more true positives, our approach is guaranteed to produce a solution at least as good as the greedy search because we initialize it with the greedy search solution. We treat the problem as a binary tree search where each node at depth d branches on instancing box I_d. Let Y be a set of labels for all hypothesis boxes. We maintain a set of viable candidate nodes C, each of which can be represented as a label and node depth pair (Y, d). For computing the upper bound on a branch, we use:

Upper(H, Y, d) = Σ_{i>d} Σ_{j>d} U(i, j) + S(H, Y)

U(i, j) =
    0                                     if w_{Y_i,Y_j}^T d_ij + w_{Y_i}^T h_i ≤ 0
    w_{Y_i,Y_j}^T d_ij + w_{Y_i}^T h_i    if w_{Y_i,Y_j}^T d_ij + w_{Y_i}^T h_i > 0    (7.3)

Initialize Y_{0...N} = 0, Y_max = Y, d = 0, c = (Y, 0), C = {c}. For each c = (Y, d) in C:

1. d = d + 1

2. Y1 = (Y : Y_d = 0), Y2 = (Y : Y_d = y_d)

3. if S(H, Y1) > S(H, Y_max) then Y_max = Y1

4. if S(H, Y2) > S(H, Y_max) then Y_max = Y2

5.
if Upper(H, Y1, d) > S(H, Y_max) then C = C ∪ Y1

6. if Upper(H, Y2, d) > S(H, Y_max) then C = C ∪ Y2

When computing argmax_Y S(H, Y) in inference, efficiency can be gained by limiting the maximum number of instanced boxes in Y. A scene only contains a relatively small number of detectable objects and, once most of those have been identified, they supply sufficient spatial relationships to adjust the classification scores of all remaining true and false positive boxes.

7.6 Learning

In the training phase, our goal is to learn a set of weights such that computing argmax_Y S(H, Y) will instance only the true positive boxes.

7.6.1 Problem Formulation

Following the same formulation used in [Desai et al., 2009], w_{y_i,y_j}, the spatial relationship weight, and w_{y_i}, the local detector score weight, can be considered as a single weight w by rewriting the scoring function as:

S(H, Y) = Σ_{i,j} w_s^T ψ(y_i, y_j, d_ij) + Σ_i w_a^T φ(h_i, y_i)    (7.4)

where w_s and ψ() are vectors of length DK², w_a and φ() are vectors of length 2K, D is the total number of spatial relationship states and K is the number of object classes. The scoring function can then be rewritten as S(H, Y) = w^T Φ(H, Y), where

w = [w_s; w_a]    Φ(H, Y) = [Σ_{i,j} ψ(y_i, y_j, d_ij); Σ_i φ(h_i, y_i)]    (7.5)

With this rewriting, inference is performed with the operation:

Y* = argmax_Y w^T Φ(H, Y)    (7.6)

For training the weights, we first need a collection of scenes where we have computed true and false positive hypothesis boxes. Given n scenes, we have hypothesis boxes Hn and labels Yn. Next, we want to train a set of weights w such that a new set of hypotheses Hn tends to produce the correct set of labels Y* = Yn. We want a set of weights such that the true labelling Yn scores higher than all other possible labellings Ln of the hypothesis boxes. The problem can be formalized as:

argmin_{w, ξ≥0} w^T w + C Σ_n ξ_n
s.t. ∀n, ∀Ln: w^T ΔΦ(Hn, Yn, Ln) ≥ l(Yn, Ln) − ξ_n    (7.7)

where ΔΦ(Hn, Yn, Ln) = Φ(Hn, Yn) − Φ(Hn, Ln).
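The joint feature map Φ(H, Y) of Equation 7.5 might be sketched as follows, assuming (illustratively) that the concatenated binary relationship vectors have total length D and that `rel(i, j)` returns d_ij; background boxes contribute nothing:

```python
def feature_map(labels, scores, rel, D, K):
    """Joint feature vector Phi(H, Y), so that S(H, Y) = w . Phi(H, Y).

    Layout matches w = [w_s; w_a]: first D*K*K pairwise relationship counts,
    then 2*K local entries ([score sum, count] per class, the count acting as
    the appended bias feature).
    """
    pair_part = [0.0] * (D * K * K)
    local_part = [0.0] * (2 * K)
    n = len(labels)
    for i in range(n):
        yi = labels[i]
        if yi == 0:
            continue                      # background: zero contribution
        local_part[2 * (yi - 1)] += scores[i]
        local_part[2 * (yi - 1) + 1] += 1.0
        for j in range(n):
            yj = labels[j]
            if j == i or yj == 0:
                continue
            d = rel(i, j)                 # sparse binary vector of length D
            base = ((yi - 1) * K + (yj - 1)) * D
            for k, bit in enumerate(d):
                pair_part[base + k] += bit
    return pair_part + local_part         # concatenation, matches w = [w_s; w_a]
```

Inference then amounts to maximizing the dot product of a single weight vector with this feature vector over labellings Y, which is the form the structured SVM optimizes.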
Since not all labellings are equally incorrect, we require a loss function which measures how incorrect a particular labelling Ln is relative to the ground truth; it is used to penalize the slack value ξ_n.

Determining the correctness of a box requires ground truth knowledge of the 3D extent of the objects in the scene. The loss function we employ penalizes incorrect detections based on the degree of their overlap with a true positive. First we decompose the loss function as loss(Y, L) = Σ_{i=1}^N l(y_i, l_i) over the N hypothesis boxes. Then:

l(y_i, l_i) =
    1 : y_i ≠ 0 ∧ l_i ≠ y_i
    1 : l_i ≠ 0 ∧ ¬∃j s.t. [d(i, j) < 30 cm ∧ y_j = l_i]
    0 : otherwise    (7.8)

where d(i, j) is the distance between the centroids of hypothesis boxes i and j. The first case captures missed detections and the second case captures false positives, with a check to ensure that no true detection overlaps the hypothesis box. We use centroid distance rather than overlap, which is more commonly used to determine the success of 2D detectors [Desai et al., 2009], [Everingham et al., 2010]. Due to errors in camera position detection, we found that the hypothesis boxes we produced for small objects rarely overlap with their ground truth boxes, even if the 2D detections that produced them were accurate. Large objects like refrigerators, in contrast, were much more likely to overlap with their ground truth.

7.6.2 Structured SVM Weight Training

Given the problem formulation in Equation 7.7, w can be learned from training data using a structured support vector machine [Tsochantaridis et al., 2004], which we described in Section 2.2.3.

Learning a structured SVM requires three functions: a loss function Δ(y, y*), a feature mapping function Ψ(x, y) and a function for computing the most violated constraint y* = argmax_{y∈Y} H(y). For the loss function we use loss(Y, L) from Equation 7.8. The feature mapping function Ψ(x, y) computes a vector of all qualitative spatial relationships between non-background detections from y in scene x.
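The decomposed loss of Equation 7.8 might be sketched as follows. The helper `has_nearby_true`, which stands in for the 30 cm centroid test against same-class true positives, is an illustrative assumption:

```python
def box_loss(true_label, pred_label, nearby_true):
    """Per-box loss l(y_i, l_i): 1 for a miss or an unmatched false positive.

    `true_label` is y_i, `pred_label` is l_i, and `nearby_true` indicates
    whether a true positive of the predicted class lies within the centroid
    distance threshold (30 cm in our data).
    """
    if true_label != 0 and pred_label != true_label:
        return 1   # a true object's box was missed or mislabeled
    if pred_label != 0 and not nearby_true:
        return 1   # a false positive with no overlapping true detection
    return 0

def scene_loss(true_labels, pred_labels, has_nearby_true):
    """loss(Y, L): sum of per-box losses over all hypothesis boxes."""
    return sum(box_loss(t, p, has_nearby_true(i))
               for i, (t, p) in enumerate(zip(true_labels, pred_labels)))
```

A perfect labelling incurs zero loss, while each missed true positive or stray instanced false positive adds one, which is what scales the slack penalty during training.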
In the computation of the most violated constraint, structured SVMs perform a computation very similar to Equation 7.1 [Desai et al., 2009]. We believe that a significant portion of the improvement provided by our branch and bound algorithm is the result of more accurately identifying these violated constraints. The degree of improvement can be seen in the experiments in Chapter 8.

7.7 3D Spatial Relationships

In our work, we use seven 3D spatial relationships for comparing object positions. Each qualitative relationship describes the position of a target hypothesis box T relative to a single reference hypothesis box R, and the output is an integer value. We restrict ourselves to symmetric relationships, so F(R, T) = F(T, R), as this makes certain computations, like the branch and bound search, simpler.

The scenes we apply our model to are figural in size, using the size terminology from Section 5.3.1, meaning they are similar in size to the human body and can be perceived entirely without moving (e.g., a table or an office). Some of the scenes verge into the vista size, meaning they require a small amount of movement to examine fully, but all are single rooms with no interior dividing walls. All the objects between which relationships are determined are represented by axis-aligned bounding boxes. Despite the fact that we operate in a physical space, hypothesis boxes do not obey physical constraints because they are the product of an object detector. Most noticeably, the hypothesis boxes can overlap each other arbitrarily and do not require physical support in the scene.

As we discussed in Chapter 5, there are four basic types of qualitative spatial relationships: distance, orientation, topology and shape. Each of the relationships we use is based on a single type of spatial relationship. More complex features which combine multiple types of relationships we leave to future work.
While our set of relationships is not exhaustive, we cover many of the qualitative spatial relationships available using axis-aligned bounding boxes.

7.7.1 Relationship Overview

Our two distance relationships are determined using D(R, T), the shortest distance between the boundaries of the target and reference boxes, which is simple to compute for axis-aligned bounding boxes. We use the distance between the boundaries rather than the centroids as it more accurately captures spatial concepts like "touching".

One of the advantages of 3D detections is that accurate distances between detections can be determined. This is especially useful when the objects have predictable distances relative to each other, as the partitions can capture human-centric concepts such as "within arm's reach". In 2D approaches, distances either require estimating the 3D geometry of the scene or rely on absolute measures derived from the image (e.g., pixels or fraction of the overall image).

The two distance relationships are:

• Absolute Distance: an interval-based comparison of the distance based on anthropometric values.

• Bounding Box Distance: a relative comparison of the distance based on the size of the reference hypothesis box.

We have four orientation relationships that use a region-based interval algebra and have a frame of reference defined either by gravity (the z axis) or by the shape of the room (the x and y axes). The x and y axes are aligned with the dominant axes of the room geometry defined by the walls. While technically there are rooms with walls that do not form approximate right angles, we never encountered any in our data and they are relatively rare. In our descriptions below, min_{x,y,z}(h_i) and max_{x,y,z}(h_i) are the minimum and maximum values in each axis of the axis-aligned bounding box for hypothesis i.
∩(b_R, b_T)_{x,y,z} is the overlap in the x, y or z axis of the target and reference boxes, and |b_i|_{x,y,z} is the length of hypothesis box h_i in the x, y or z axis.

The four orientation relationships are:

• Vertical Orientation: a comparison of the vertical partitions.

• Coplanarity: a comparison of the relative positions of the tops and bottoms of the boxes.

• Vertical Alignment: a comparison that tests if the hypotheses form a vertical stack.

• Beside: a comparison of the overlap in the x and y axes that determines if the objects are aligned with the room walls.

The possible range of topological relationships is limited because we are using a bounding volume rather than the actual object shape. Also, since we are considering real, mostly convex, whole objects with no 3D part-based hypothesis boxes, objects would very rarely actually overlap each other. Therefore, we only use one simple topological relationship:

• Overlap: a comparison which tests if the two hypothesis boxes occupy the same space to a significant degree.

We have no shape or size relationships because we have no information about the shape of the objects in the hypothesis boxes. We do not have any useful size information either, because the approach we use for creating the axis-aligned boxes from 2D detections bases the size of the boxes on the type of the object, not the size of the 2D detections. If size is purely based on object type, then size comparisons are simply comparisons of type and not informative spatial relationships.

Below are formal descriptions of the seven spatial relationships we use and our reasoning behind their selection.

7.7.2 Absolute Distance

Our absolute distance relationship uses a consistent set of partitions to compare the distance between objects. The major challenge for absolute distances is determining the partitions into which to divide the distances.
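All of the relationships in this section reduce to two primitives over axis-aligned boxes: the boundary distance D(R, T) and the per-axis overlap ∩. A minimal sketch, with boxes given as (min corner, max corner) triples (illustrative, not our implementation):

```python
import math

def axis_overlap(r, t, axis):
    """Length of the overlap of boxes r and t along one axis (0=x, 1=y, 2=z)."""
    lo = max(r[0][axis], t[0][axis])
    hi = min(r[1][axis], t[1][axis])
    return max(0.0, hi - lo)

def boundary_distance(r, t):
    """Shortest distance D(R, T) between the boundaries of two axis-aligned
    boxes; 0 when they touch or overlap."""
    gaps = []
    for axis in range(3):
        # Gap along one axis is zero if the intervals overlap on that axis.
        gap = max(r[0][axis] - t[1][axis], t[0][axis] - r[1][axis], 0.0)
        gaps.append(gap)
    return math.sqrt(sum(g * g for g in gaps))
```

Both primitives are symmetric in their two box arguments, which is what makes the derived relationships satisfy F(R, T) = F(T, R).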
Terms such as near and far are relative to the objects and the environments, unlike terms such as above and below.

One reason absolute distances are appropriate for our work is that all the environments we are judging are on the same scale, since absolute distance partitions need to be adjusted based on the size of the environment. We use the same partitions for both the kitchen and office scenes: 5 partitions with divisions at 10, 50, 100 and 200 cm. These partitions were chosen using anthropometric measures to capture the concepts of touching, within arm's reach, within one step, and requiring significant movement [Tilley and Wilcox, 2002]. These relationships allow us to identify objects that are organized for different types of tasks. In an office, the structure of the room is centered around the desk and chair; vital objects are placed within arm's reach (e.g., keyboard, mouse) while other less important ones (e.g., other chairs, printer) are further away.

It is possible to learn the partitions from the scenes, but with limited training data it is not feasible to remove a significant number of scenes for learning the partitions. Using n-fold cross validation in our experiments left us with two alternatives: either learn the partitions using all training and test scenes, which would mean training on test data, or learn different partitions based on the training set in each fold, which is impractical. To avoid these problems we opted instead for partitions determined from estimated anthropometric and environmental properties.

Let P_{1...n} be increasing values that divide space into n+1 partitions. Our absolute distance relationship outputs an integer value and is defined as:

AD(R, T) =
    1      if D(R, T) < P_1
    i      if D(R, T) ≥ P_{i−1} ∧ D(R, T) < P_i
    n + 1  if D(R, T) ≥ P_n    (7.9)

Bounding Box Distance

Bounding box distance is a relationship that uses a relative distance measure (a distance measure based on the size of the things being measured).
The motivation for relative distance measures is that, when using descriptive terms such as near and far, the size of the objects in question is important. Two cups might be considered near if within 10 cm of each other, while two buildings might be near if they are within 100 meters. In a kitchen or office, a similar effect may exist between large objects and small objects. Another advantage of relative distance is that there are no partitions to define or learn. One issue with relative distance is that it is very coarsely granular, with only two possible states as we have defined the relationship. While it is possible to include additional partitions (e.g., half or double the size of the reference object), these are not well motivated in the qualitative spatial literature.

In order to avoid odd effects resulting from objects that are much shorter in one axis than the others, we partition the distance between the objects using the longest internal span of either the target or reference object. Using the maximum of the two spans keeps the relationship symmetric.

Let Sp(h_i) be the longest internal span of hypothesis box i. Our box distance relationship outputs a binary value and is defined as:

BD(b_R, b_T) = D(b_R, b_T) < max(Sp(R), Sp(T))    (7.10)

Vertical Orientation

Vertical orientation is an orientation relationship based on comparisons of the overlap of objects in the z axis; it identifies semantic concepts like above, below or level. It relies on an external frame of reference that is defined by gravity and is therefore easy to determine. Vertical orientation is used in 2D contextual work [Desai et al., 2009; Neumann, 2008; Ranganathan and Dellaert, 2007].

Figure 7.1: This figure illustrates the box distance relationship. Sp(R) and Sp(T) (not shown) are the longest internal spanning vectors of the reference box R, shown in blue, and the target box T, shown in red.
Since Sp(R) > Sp(T), Sp(R) is used to define the maximum separation between objects that are "close".

Vertical orientation relationships can be captured with some accuracy in 2D as long as all images are taken from approximately the same height and angle, but this becomes less effective as objects get closer to the photographer and the viewing angle becomes more downward tilted.

Vertical orientation comparisons of objects are effective because gravity is the most powerful factor controlling the layout of scenes. Every object we detect in our work is supported by a structure in the scene. Typically these are either the floor, a horizontal work surface like a desk or counter, or another object, like a pot on a stove. Large objects are usually supported by the floor, while smaller objects are more readily available on work surfaces.

We use a modified version of the Allen interval algebra with a reduced number of states to make the granularity of the relationship more coarse, as shown in Figure 7.2. Specifically, we removed those relationships that are defined by the tops and bottoms of the objects being close and moved them to a separate relationship called Coplanarity, which we discuss next.

Figure 7.2: This figure illustrates the vertical orientation relationship. The reference box R is shown in blue and different target boxes are shown in red with the resulting relationship beside them.

Our vertical orientation relationship has an integer output and is defined as:

VO(b_R, b_T) =
    1 (above)       if max(R)_z < min(T)_z
    2 (semi-above)  if max(R)_z < max(T)_z ∧ max(R)_z > min(T)_z
    3 (semi-below)  if min(R)_z > min(T)_z ∧ min(R)_z < max(T)_z
    4 (below)       if min(R)_z > max(T)_z
    5 (level)       otherwise    (7.11)

Coplanarity

Coplanarity is a relationship similar to the vertical orientation relationship but focuses on whether the tops or bottoms of objects are coplanar. It captures some of the relationships from the Allen interval algebra that were not included in Vertical Orientation.
The relationship identifies objects that are either supporting or supported by other objects. It also identifies objects that are supported by the same scene structure, like the floor or tables.

We wanted to remove the coplanarity relationships from Vertical Orientation as they are more affected by inaccurate localization and we want to be able to analyze their effectiveness separately later. Also, Coplanarity cannot easily be determined from 2D data and has a distinct semantic meaning from concepts such as above and below.

Our coplanarity relationship has an integer output and is defined as:

CO(b_R, b_T) =
    1 (same base)     if |min(T)_z − min(R)_z| < ε
    2 (same top)      if |max(T)_z − max(R)_z| < ε
    3 (supporting)    if |max(T)_z − min(R)_z| < ε
    4 (supported)     if |min(T)_z − max(R)_z| < ε
    5 (non-coplanar)  otherwise    (7.12)

These relationships are illustrated in Figure 7.3. Same base means that the undersides of the two objects are approximately coplanar in the scene and therefore the objects are likely on the same surface. Same top means that the tops of the two objects are approximately coplanar, which occurs when both objects are integrated into a common work surface like a counter. Supporting and supported mean that either the target or reference object is aligned such that one might support the other.

The ε value accommodates inaccurate object localization in 3D (we used a value of 10 cm). Unlike in the other relationships, the coplanarity states are not mutually exclusive, as objects can be both "same base" and "same top". We found this to be rare, so we handled it by using the integer value of the first true case, but another state could be added without significantly changing the results.

It is important to note that Coplanarity is strictly based on the z axis partitions of the boxes. It does not consider the overlap in the x or y axes, so it does not actually detect support relationships. Doing so would have required combining either topological and orientation or distance and orientation relationships.
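The vertical orientation and coplanarity cases (Equations 7.11 and 7.12) might be sketched together as follows, with boxes given as (min corner, max corner) triples and EPS standing in for the 10 cm ε slack (illustrative, not the thesis implementation):

```python
EPS = 0.10  # metres; the epsilon slack for coplanarity comparisons

def vertical_orientation(r, t):
    """Eq. 7.11: 1=above, 2=semi-above, 3=semi-below, 4=below, 5=level."""
    r_min, r_max = r[0][2], r[1][2]
    t_min, t_max = t[0][2], t[1][2]
    if r_max < t_min:
        return 1  # target entirely above the reference
    if t_min < r_max < t_max:
        return 2  # target extends above the reference's top
    if t_min < r_min < t_max:
        return 3  # target extends below the reference's bottom
    if r_min > t_max:
        return 4  # target entirely below the reference
    return 5      # level

def coplanarity(r, t):
    """Eq. 7.12: first true case wins, as in the text."""
    r_min, r_max = r[0][2], r[1][2]
    t_min, t_max = t[0][2], t[1][2]
    if abs(t_min - r_min) < EPS:
        return 1  # same base: likely resting on the same surface
    if abs(t_max - r_max) < EPS:
        return 2  # same top: e.g. appliances built into one counter
    if abs(t_max - r_min) < EPS:
        return 3  # supporting: target's top meets reference's bottom
    if abs(t_min - r_max) < EPS:
        return 4  # supported: target's bottom meets reference's top
    return 5      # non-coplanar
```

For example, a pot box resting just above a stove box comes out "above" and "supported", which is the combination the model can learn to reward.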
So, objects that are in a supporting relationship might be distant from each other.

Figure 7.3: This figure illustrates the coplanarity relationship. The reference box R is shown in blue and different target boxes are shown in red with the resulting relationship beside them.

Vertical Alignment

The vertical alignment relation is an orientation relationship that is used to identify objects directly over or under each other. Since the actual relative elevations of objects are captured by the vertical orientation relationship, vertical alignment is based on the overlap in both the x and y axes, as shown in Figure 7.4. This relationship is included to aid in determining the support relationships between objects, which cannot be completely captured by the Coplanarity relationship.

Figure 7.4: This figure illustrates the vertical alignment relationship. The reference box R is shown in blue and different target boxes are shown in red with the resulting relationship beside them.

Our vertical alignment relationship is:

VA(b_R, b_T) =
    1  if ∩(b_R, b_T)_x / min(|R|_x, |T|_x) > τ_VA ∧ ∩(b_R, b_T)_y / min(|R|_y, |T|_y) > τ_VA
    0  otherwise    (7.13)

where τ_VA is a variable defining the necessary percentage of overlap required (we used τ_VA = 0.5).

Figure 7.5: This figure illustrates the beside relationship. The reference box R is shown in blue and different target boxes are shown in red with the resulting relationship beside them.

Beside

The beside relationship is an orientation relation based on the target and reference objects' overlap in either the x or y axis, as shown in Figure 7.5. This novel relationship is based on our observation that, in indoor environments, objects are often aligned relative to a major axis in the environment. For example, in a kitchen a dishwasher, sink and microwave are either on or integrated into the same counter. This alignment can be for many practical reasons.
Work surfaces and large objects are typically rectangular, and therefore alignment with the walls minimizes wasted space. Also, keeping objects against the walls leaves the center of a room open for movement.

We use a constant τ_BE, which is the required percentage of overlap in each axis, to identify only boxes with a significant degree of overlap.

Our beside relationship has a binary output and is defined by:

BE(b_R, b_T) =
    1  if ∩(b_R, b_T)_x / min(|R|_x, |T|_x) > τ_BE ∨ ∩(b_R, b_T)_y / min(|R|_y, |T|_y) > τ_BE
    0  otherwise    (7.14)

where τ_BE is a variable defining the necessary percentage of overlap required (we used τ_BE = 0.5).

Overlap

This simple topological relationship identifies when two detections occupy roughly the same space in 3D. It is primarily useful for handling mislabeled detections using an effect similar to non-maximal suppression [Neubeck and Van Gool, 2006], allowing an instanced detection to suppress another overlapping detection. It does not need the complex topological states found in the RCC-8 topological description, instead using a simple binary "overlap" or "disjoint" relation. We use the constant τ_BE, which is the required percentage of overlap in each axis, to identify only boxes with a significant degree of overlap.

Our overlap relationship has a binary output and is defined by:

OV(b_R, b_T) =
    1  if ∩(b_R, b_T)_x / min(|R|_x, |T|_x) > τ_BE ∧ ∩(b_R, b_T)_y / min(|R|_y, |T|_y) > τ_BE ∧ ∩(b_R, b_T)_z / min(|R|_z, |T|_z) > τ_BE
    0  otherwise    (7.15)

Chapter 8

Object Detection using 3D Spatial Relationships Results

In this chapter, we present experimental results for improving object detection scores using 3D spatial relationships and the model described in Chapter 7. First, we present our method for acquiring training and test data on real world scenes. We describe our approach for computing 3D hypothesis boxes using 2D object detections from scene images and our model's effectiveness at improving detection accuracy.
We present experiments demonstrating our work on real-world data by applying our method to 3D hypothesis boxes created from 2D object detections.

After the real world experiments, we explore how our model would perform with different initial object detection results, using simulation experiments where hypothesis boxes are generated from ground truth data (rather than from image data of the scenes). These experiments demonstrate how the properties of the hypothesis boxes (e.g., score, number of true/false detections, localization accuracy, etc.) affect the detection accuracy improvement from applying our model. Finally, we end with an analysis of the individual and cumulative effectiveness of the spatial relationships described in Section 7.7 when combined with our model on simulated scenes.

8.1 3D Data Collection

Acquiring 3D data to train and test our model was a significant challenge. In Chapter 6 we used synthetic indoor scenes from a video game. While this was a good starting point for our work, it was difficult to prove that the object layouts were realistic and that our approach would work outside the scope of a game. Further progress required us to demonstrate our work on real world data.

Our video game data consisted of models of entire houses, but for our real world data we decided to restrict ourselves to collecting images from individual rooms. This choice was based on the practical difficulties of collecting and integrating 3D views from different rooms into a single coherent model of object positions. Also, there is little evidence to suggest that humans make use of spatial relationships between objects in different rooms. This choice also allowed us to focus our model and choice of object detectors on a single type of scene at a time.

8.1.1 Fiducial Marker 3D Data Collection

The data for our experiments requires two elements.
First is the ground truth 3D layout of objects in a scene, which is used to determine the localization accuracy of the hypothesis boxes and, in our simulation experiments, to generate hypothesis boxes. Second is the appearance of objects in the scene, from which we produce 2D detections using an image-based object detector. These 2D detections are then used to triangulate the position of 3D hypothesis boxes in the scene for our real world experiments. In the real-world experiments we use image-based object detections but, alternatively, our experiments could be reproduced with range data and a shape-based object detector.

Both 3D ground truth positions and 3D hypotheses can be computed from images of a scene, as long as the camera position for each image is known, the images cover a variety of angles and there are multiple images of each object of interest.

Our data collection technique is based on the work of [Meger et al., 2011] for collecting 3D spatial data using a fiducial marker placed in the middle of each scene. A fiducial marker is a structure in an imaged space that is used as a point of reference for measuring distances, positions and scales. In our case, the fiducial used by [Meger et al., 2011] is a cube with a 3 by 3 grid of ARTag markers [Fiala, 2005], bitonal planar patterns that encode a unique ID. The resulting grid of recognizable points allows us to determine camera position and scene scale relative to the marker.

Ground Truth 3D Object Data

This section describes how to compute G_s = {g_i : i = 1 . . . Z_s}, the ground truth boxes that describe the position and extent of each object in scene s. Each ground truth box is a pair g = {t, l}, where t = (t^min_xyz, t^max_xyz) is an axis-aligned bounding box defined by two 3D points and l ∈ {1 . . . K} is a numeric label for the K object types we are detecting in the scene.

Figure 8.1: A sample image of a kitchen taken from IKEA.
The object in the foreground is a fiducial marker, which allows us to recompute the camera position and integrate multiple images into a 3D model of the object layout.

To determine g_i for each object, we photographed the scene from many angles, acquiring multiple images of each commonly found object and the fiducial marker (see Figure 8.1). Using the marker, we computed the camera position for each image. To compute the ground truth positions, we manually labeled 3D axis-aligned bounding boxes for commonly found objects in each scene using software developed by [Meger et al., 2011]. The labeling was performed by first manually identifying the center of each object in 3 or more images. We triangulated the 3D position of the centroid of the object from these points using singular value decomposition. Then an axis-aligned bounding box was projected onto each image and manually resized until it covered the object in each image. The resulting 3D boxes were checked using a 3D visualizer to confirm that their layout corresponded to the scene.

The x and y axes of the scene are determined by the orientation of the fiducial marker, which was placed to be aligned with the dominant axes of the room defined by the walls. While we did this manually during data collection, determining this basic scene structure can be done automatically and from a single image [Lee et al., 2010; Tsai et al., 2011]. We did not use a robot for data collection as we were limited by what we could easily transport, but our 3D data collection technique could be replicated by a robot equipped with a camera and performing simultaneous localization and mapping (SLAM) [Montemerlo et al., 2003].

8.1.2 Locations & Scenes

There are obvious privacy issues involved in collecting data on the interior of houses. Additionally, houses typically contain only one example of most scene types. Finding enough volunteers and moving equipment to enough locations to acquire a broad dataset was infeasible if we used real houses.
Therefore, we collected data in locations where we could capture many examples of the same type of scene. The first location was our department, which provided many offices, kitchens and bathrooms. The second source of scenes was two IKEA stores that maintain many prefabricated scenes to demonstrate different selections of products. These scenes were ideal for data collection as they are realistic, show their contents well and are divided clearly into scene types (kitchens, offices, bedrooms, living rooms and bathrooms).

After data collection, we focused on kitchens and offices because we had the most examples and the scenes contained objects we determined were detectable by an image-based object detector with enough accuracy to provide a basis on which 3D qualitative spatial relationships could improve. Both bedrooms and living rooms contained few objects that could be recognized by our detector. Objects like couches and coffee tables simply vary too much in appearance and lack strong defining visual characteristics. Finally, we had to discard bathrooms despite being able to detect some common bathroom objects reliably. In public bathrooms our data collection technique was ineffective because the bathrooms are designed to obscure views and hide areas from the users. In private bathrooms the area was often too small to image effectively with our approach, since we need multiple shots of each object and the marker from different angles.

Our data set consists of 29 examples of kitchens and 27 examples of offices. For both scene types we selected a set of commonly found objects and labeled their positions in 3D. In the offices we selected keyboards, monitors, computer mice, chairs, telephones and printers. In the kitchens we selected sinks, faucets, dishwashers, ovens, microwaves, refrigerators, pans and small appliances. The small appliance category is an aggregation of electric kettles, toasters and coffee makers into a single category.
With only 6-8 examples each of kettles, toasters and coffee makers across all kitchens, there were insufficient examples for learning. However, we observed they were roughly the same size and occupy similar locations in the scene, so we aggregated them into the more general category of small appliance.

8.2 Experimental Overview

We evaluate our method experimentally on both simulated and real hypothesis boxes. We first present results on hypothesis boxes produced from image-based object detections from scene images. Then, our simulated hypothesis experiments use hypothesis boxes generated from ground truth data (rather than from image data of the scenes) and demonstrate how the properties of the hypothesis boxes affect the results of our method.

In all our experiments, we judge success by improvements in the average precision (AP) of all 3D hypotheses after we apply our method. AP is commonly used for comparing improvements in precision and recall [Everingham et al., 2010]. We compute AP by sampling the monotonically decreasing precision/recall curve at a fixed set of uniformly-spaced recall values (0, 0.1, 0.2, . . . , 1).

8.3 Image-based Hypothesis Experiments

The real world image-based experiments are designed to demonstrate the effectiveness of our model on 3D detections that are computed from actual 2D detections from images. The technique we use for determining 3D detections is just one of many possible approaches and does not take advantage of the 3D sensors that would likely be available on many robots. We were restricted to the data we could collect (i.e., images with a fiducial marker).

8.3.1 3D Hypothesis Construction

We use a technique for constructing 3D hypothesis boxes that combines multiple 2D image detection windows across multiple images of a scene.
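Two pieces of this construction, triangulating one 3D point from several views by singular value decomposition and the quality-threshold-style grouping of detections, can be sketched as follows. This is a simplified sketch under assumed conventions (idealized 3x4 projection matrices, a caller-supplied cluster-to-detection distance), not the code used in this work:

```python
import numpy as np

def triangulate(projections, points_2d):
    """Linear (DLT) triangulation of one 3D point from two or more views.
    projections: 3x4 camera matrices; points_2d: matching (u, v) image points."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # The homogeneous solution is the right singular vector associated with
    # the smallest singular value.
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]

def qt_cluster(detections, dist, threshold):
    """Quality-threshold-style grouping: grow a cluster around each seed by
    repeatedly adding the nearest detection until `dist` exceeds `threshold`,
    keep the largest resulting cluster, and repeat on the remainder."""
    remaining = list(detections)
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            cluster = [seed]
            candidates = [d for d in remaining if d is not seed]
            while candidates:
                nearest = min(candidates, key=lambda d: dist(cluster, d))
                if dist(cluster, nearest) > threshold:
                    break
                cluster.append(nearest)
                candidates.remove(nearest)
            if len(cluster) > len(best):
                best = cluster
        clusters.append(best)
        remaining = [d for d in remaining if all(d is not b for b in best)]
    return clusters
```

In our experiments the clustering distance is the 3D triangulation error of the candidate grouping; here `dist` is left as an arbitrary callable.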
To produce our detections, we use the Deformable Parts Model detector [Felzenszwalb et al., 2008] trained with object images acquired from ImageNet [Deng et al., 2009].

The goal of this step is to construct a corresponding set of 3D hypothesis boxes H = {h_1 . . . h_N}. Given a set of 2D detections corresponding to the same object and using the camera positions we computed using the fiducial markers, we can triangulate the position of the object in 3D using the center point of each window. We use singular value decomposition to perform triangulation [Siciliano and Khatib, 2008]. To find the detections that correspond to the same object, we cluster detections using quality threshold clustering (QTC) [Jin and Han, 2010]. Because of occlusion, errors in detection and non-overlapping images, not every object in the scene will have a detection in each image. QTC selects a candidate 2D detection and iteratively adds the closest detection using a distance metric until the distance to the closest detection exceeds a threshold. Our distance metric is the error of triangulation in 3D.

Table 8.1: The improvement in average precision (ΔAP) for objects and overall on 3D detections produced from scene images.

Office     Ini. AP  Final AP   ΔAP  |  Kitchen          Ini. AP  Final AP   ΔAP
Keyboard      0.64      0.97  0.33  |  Sink                0.28      0.53  0.25
Monitor       0.67      0.60 -0.07  |  Faucet              0.68      0.64 -0.04
Lamp          0.45      0.62  0.17  |  Dishwasher          0.37      0.65  0.28
Mouse         0.80      0.61 -0.19  |  Oven                0.77      0.81  0.04
Chair         0.55      0.57  0.02  |  Microwave           0.65      0.74  0.09
Telephone     0.22      0.51  0.29  |  Refrigerator        0.35      0.24 -0.11
                                    |  Pot                 0.43      0.47  0.05
                                    |  Small Appliance     0.64      0.56 -0.08
Overall       0.54      0.67  0.13  |  Overall             0.48      0.56  0.08

To produce an axis-aligned box for each hypothesis, we use an average shape template based on the average dimensions of each object type in the training data. There are six possible poses for an axis-aligned box given a centroid.
However, we assume all objects have an identifiable top and bottom (e.g., microwaves are always found vertically aligned in the same direction), which reduces the possible poses to two. We reduce this to one pose by making the x and y axes of the template equal length. The template for an object type has a z length equal to the average height of ground truth boxes of that type. The width and length are made equal and set to the average length of the other two dimensions. Given the typical error in triangulation, we found the effect on the resulting spatial relationships from making the length and width equal was minimal. Finally, we give each detection a confidence score equal to the average score of all 2D detections used to create it.

Table 8.1 presents the overall and per object improvement in average precision after applying our model for both types of scenes. On both types of scenes we saw an improvement in the overall average precision almost double that observed in [Desai et al., 2009], though of course the data sets are different. Also, like them, we observed variation in the degree of improvement from class to class, with results on some classes becoming worse. As we will discuss in our next section on synthetic detection experiments, this tendency to improve on some classes and not others is caused by lower numbers of true positive detections in the training and test data and inaccurate localization in the detectors. In our work, image-based detections produced hypothesis boxes for approximately 60-70% of the objects and many were poorly localized. Since the goal of this work is to demonstrate how effectively 3D spatial relationships can improve object detections, we move on to our synthetic detection experiments.

8.3.2 Instanced Hypothesis Boxes of Image-based Experiments

This section provides examples of which 3D hypothesis boxes were instanced in some of our real world experiments.
Instanced hypothesis boxes are the high confidence subset of boxes that are used to adjust the confidence scores of all boxes to remove false positives. Instanced boxes comprise both high confidence boxes and those that constitute a layout consistent with our model (as selected by the branch and bound tree search). A good set of instanced hypothesis boxes should be well localized.

Figures 8.2 through 8.13 show both good and bad examples of instanced hypothesis boxes and demonstrate both the benefits and potential problems of using 3D spatial relationships for improving object detection. These are discussed in greater detail in the captions of each figure.

It is important to remember that these configurations of boxes are used to adjust the final scores of all detections; they are not the end result of our approach. Even if they are based on false positive detections, as long as they constitute a likely configuration, they can still be used for adjusting the scores of other detections. If a hypothesis box is the result of a misclassification but is still well localized, it can still be useful for adjusting the confidence scores of the other boxes. For example, if a stove was misclassified as a dishwasher but still well localized, that box could still provide valuable spatial information about the structure of the room and counter. Similarly, a poorly localized hypothesis box might still be useful because some of our spatial relationships, like distance, vertical orientation and beside, are coarse enough to be tolerant of poorly localized detections.

8.4 Generated Hypothesis Experiments

Our image-based hypothesis experiments demonstrate a fully functioning system for performing object detection and localization in 3D, with spatial relationships used to improve detection accuracy.
However, there are many approaches to the creation of 3D hypotheses:other 2D detectors that could be used for triangulation, different techniques for producing3D boxes for 2D and a whole range of object detection techniques that perform objectdetection on 3D data. How might other 3D detectors utilize 3D spatial relationships toimprove their detection rates?109Figure 8.2: An illustration of good instanced hypothesis boxes from image-based detectionsfor a kitchen scene. In this scene most of the objects had well localized boxes that compriseda likely layout so most objects overlap an instanced hypothesis box.Figure 8.3: An illustration of good instanced hypothesis boxes from image-based detectionsfor a kitchen scene. Again, many of the objects in the scene overlap an instanced hypothesisbox. There are multiple overlapping boxes included for the faucet because both boxes arewell localized enough to be considered correct boxes and our loss function does not penalizeadditional good boxes.110Figure 8.4: An illustration of good instanced hypothesis boxes from image-based detectionsfor a kitchen scene. This scene has fewer instanced hypothesis boxes but contained anexample of stacked ovens being correctly detected and instanced. The localization of the topoven box is poor but it is still instanced because it shares spatial relationships with the bottomoven boxes observed in other stacked ovens in the data set.Figure 8.5: An illustration of bad instanced hypothesis boxes from image-based detectionsfor a kitchen scene. The model can select multiple poorly localized hypothesis boxes becausethey constitute a likely layout. Here, the oven, dishwasher and coffee maker boxes werelikely selected because the oven and dishwasher top and coffee maker bottom are coplanar,a layout common in many scenes.111Figure 8.6: An illustration of bad instanced hypothesis boxes from image-based detectionsfor a kitchen scene. 
This scene contains two instanced oven hypothesis boxes, side by side, with one poorly localized and one well localized. The extra oven box might have been included because multiple oven boxes near each other are common in stacked ovens and the extra detection is close to a small appliance detection.

Figure 8.7: An illustration of bad instanced hypothesis boxes from image-based detections for a kitchen scene. In this scene, both the oven detection box and the pot detection box are significantly offset from their ground truths. Pots are often observed in the data set on top of ovens, so the model selected the pot and oven boxes that shared that relationship.

Figure 8.8: An illustration of good instanced hypothesis boxes from image-based detections for an office scene. In this scene several objects had well localized boxes that comprised a likely layout, so many objects overlap an instanced hypothesis box.

Figure 8.9: An illustration of good instanced hypothesis boxes from image-based detections for an office scene. This scene is very complicated, with a large number of objects detected and localized accurately. Not every object is well localized; one of the monitors is offset significantly. Also, the two close chairs can lead to extra poorly localized boxes when 2D detections from different chairs are combined to create a 3D hypothesis box.

Figure 8.10: An illustration of good and bad instanced hypothesis boxes from image-based detections for an office scene. Two parts of the scene contain well localized detections on either side, and extra chair and telephone false positive detections appear in the background.

Figure 8.11: An illustration of good and bad instanced hypothesis boxes from image-based detections for an office scene. Both chairs and one monitor hypothesis were well localized, but two false positive monitors and one chair were added, though they do comprise a reasonable scene layout.
The monitor on the left was likely added because there is a mouse hypothesis next to it.

Figure 8.12: An illustration of bad instanced hypothesis boxes from image-based detections for an office scene. The problem in this scene was too many false positive and badly localized boxes. The scene had a horseshoe arrangement of desks with the fiducial marker placed in the middle and many objects close together. This led to many incorrectly localized and false positive boxes because every camera frustum overlapped, meaning many more intersections in the 3D hypothesis box creation.

Figure 8.13: An illustration of bad instanced hypothesis boxes from image-based detections for an office scene. In this scene there simply were not enough images taken and there are few 2D detections. This resulted in few hypothesis boxes, and so only a small number of poorly localized boxes were instanced.

To test this question and determine the effectiveness of our model for improving detection rates, we perform a broad range of experiments which simulate the creation of 3D boxes. In the broader scope of this thesis, these experiments allow us to examine how the properties of the hypothesis boxes affect the overall effectiveness of our approach. They provide an upper bound for how effective 3D spatial relationships could be at improving detection accuracy. Furthermore, they identify how the properties of the hypothesis boxes affect detection rates, allowing future work to optimize their inputs for our model.

8.5 Simulated Hypothesis Experiments

In our simulated hypothesis experiments we simulate the results of an object detector and produce the 3D hypothesis boxes algorithmically for both training and test data. Simulated hypotheses are created from ground truth detections by shifting the position to simulate localization error and providing a detection score from a distribution. Different techniques are used to create true and false positive detection positions and scores.
Simulated experiments examine how the properties of the hypothesis boxes affect our approach's effectiveness. The properties examined are:

- Classification Score
- Number of true detections
- Localization error
- Branch and Bound vs. Greedy Search

We also present experimental results comparing the effectiveness of the individual 3D spatial relationships we described in Section 7.7 on different types of scenes and varying levels of localization accuracy.

8.5.1 Simulated True Positive Detections

Simulated true positive detections correspond to a detector identifying an object in the scene correctly. The resulting score and localization will vary, but we constrain the localization error such that the detection is correct according to the loss function (7.8). Given a ground truth detection box g_i = {t_i, l_i}, a true positive hypothesis box h_i = {b_i, y_i, s_i} is defined by:

b_i = {t^min_i + (D_x, D_y, D_z), t^max_i + (D_x, D_y, D_z)}
      s.t. sqrt(D_x^2 + D_y^2 + D_z^2) < 30cm, where D = N(0, σ_box)
y_i = l_i
s_i = N(−0.5, 0.25) + T    (8.1)

N(0, σ_box) is a normal distribution and σ_box is used to control the degree of localization accuracy. Without actual image detections to provide a detection score for the hypothesis boxes, we generate scores from a distribution. N(−0.5, 0.25) is a normal distribution and T is a constant we add to positive detections to control the average difference between true and false positive detections. The values (−0.5, 0.25) were chosen to resemble the distribution of scores with the 3D object detector we describe in Section 8.3.1. Changing T allows us to simulate detectors with varying ability to differentiate true positives and false positives.

8.5.2 Simulated False Positive Detections

Simulated false positive detections come in two types, based on our observations of how false detections are produced in 2D image detectors and our results when creating 3D hypothesis boxes from image detections.
False positives of the first type are similar to false positive detections generated by background objects or hypothesis boxes with extremely high localization error. These detections generally appear to be randomly positioned in the scene. False positives of the second type are the result of detector mislabeling. These detections correspond to a detector firing on an object of the wrong type, which happens for objects with similar appearances such as microwaves and ovens. Both types of false positives are created from a reference ground truth detection g_i = {t_i, l_i}.

Background False Positive:

b_i = {t^min_i + (U_x, U_y, U_z), t^max_i + (U_x, U_y, U_z)}, where U = U(0, σ_scene)
y_i = l_i
s_i = N(0.5, 0.5)

Mislabeled False Positive:

b_i = {t^min_i + (D_x, D_y, D_z), t^max_i + (D_x, D_y, D_z)}
      s.t. sqrt(D_x^2 + D_y^2 + D_z^2) < 30cm, where D = N(0, σ_box)
y_i = l_rand, s.t. l_rand ≠ l_i
s_i = N(0.5, 0.5)    (8.2)

U(0, σ_scene) is a uniform distribution, where σ_scene is an approximate value for the average size of all scenes. l_rand is a random integer from (1 . . . K).

8.5.3 Simulated Hypothesis Experimental Procedures

We control the number of true detections per scene with P_det, the probability that each ground truth object has an associated true detection. We generate F_back background false positive detections and F_mis mislabeled false positive detections for each ground truth object.

For each experiment we produce ten different layouts of the hypothesis boxes for each scene. We train and test our method using 5-fold cross validation across the scenes. Results shown are the average precision of the true hypothesis boxes before and after using spatial relationships to modify detection scores. We average the AP over the ten layouts. In each experiment we vary a set of parameters.
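The generators of Eqs. 8.1 and 8.2 above can be sketched as follows. This is an illustrative sketch (function names are ours, units are in cm, and the score distributions are written exactly as in the equations):

```python
import random

def jitter_box(t_min, t_max, sigma_box, max_shift=30.0):
    """Shift a ground truth box by a Gaussian offset, rejection-sampled so the
    total displacement stays under 30cm, as in Eq. 8.1."""
    while True:
        d = [random.gauss(0.0, sigma_box) for _ in range(3)]
        if sum(x * x for x in d) ** 0.5 < max_shift:
            return ([a + s for a, s in zip(t_min, d)],
                    [a + s for a, s in zip(t_max, d)])

def true_positive(gt_box, label, sigma_box, T):
    """Eq. 8.1: jittered box, correct label, score N(-0.5, 0.25) + T."""
    box = jitter_box(gt_box[0], gt_box[1], sigma_box)
    return box, label, random.gauss(-0.5, 0.25) + T

def background_false_positive(gt_box, label, sigma_scene):
    """Eq. 8.2, first form: uniform offset anywhere in the scene."""
    u = [random.uniform(0.0, s) for s in sigma_scene]
    box = ([a + s for a, s in zip(gt_box[0], u)],
           [a + s for a, s in zip(gt_box[1], u)])
    return box, label, random.gauss(0.5, 0.5)

def mislabeled_false_positive(gt_box, label, sigma_box, n_classes):
    """Eq. 8.2, second form: a well-placed box with a wrong random label."""
    box = jitter_box(gt_box[0], gt_box[1], sigma_box)
    wrong = random.choice([k for k in range(1, n_classes + 1) if k != label])
    return box, wrong, random.gauss(0.5, 0.5)
```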
Unless otherwise indicated, each experiment uses the following constants: σ_box = 20cm, T = 0.2, σ_scene = {300, 300, 200}, P_det = 1, F_back = 2 and F_mis =

8.5.4 Detection and Localization Error Experiment

This experiment is designed to determine the combined effects of detection and localization error. Detection errors occur when the object detector does not produce a hypothesis box for an object in the scene. Having enough objects detected in both training and test data is important because each additional good hypothesis box provides information on spatial relationships to all other boxes. This means the number of informative spatial relationships grows exponentially in the number of detection boxes. Localization error occurs when an object is detected in the scene but the hypothesis box does not completely overlap the ground truth. Some spatial relationships, especially vertical alignment and coplanarity, are sensitive to poor localization but can be highly informative.

Figure 8.14 shows how the average precision improvement increases as a function of decreasing σ_box and an increasing number of true detections (controlled by P_det). As σ_box increases, the negative effects of having fewer detections grow. Without localization error, we see a very large improvement from our method, but perfect localization is unlikely in real world results. However, our method works well with either poorly localized detections or missing detections; the combination can cause our method to underperform.

Clearly offices perform better with decreasing object counts and this could be because spatial

Figure 8.14: Simulation results which vary the number of true positive detections, controlled by P_det, the probability that each ground truth object has an associated positive detection. Results are shown at different levels of localization error (controlled by σ_box; the four panels show σ_box = 0, 10, 20 and 30).
Kitchen results are in red and offices in blue. Solid lines show average precision before applying our method and dotted lines after. Our model provides a significant improvement with either good localization or many positive detections.

Figure 8.15: Simulation results which vary T, the average difference between the true and false positive scores, shown at different levels of localization error (controlled by σ_box; the panels show σ_box = 0 and σ_box = 20). Kitchen results are shown in red and offices in blue. Solid lines show the average precision before applying our method and dotted lines after. The largest improvement in average precision occurs when T is small, simulating a detector with a poor ability to differentiate true and false positive detections.

relationships between office objects are more consistent. However, we believe that offices perform better as the rate of detection decreases because there are on average twice as many objects in office scenes, even though there are fewer classes of objects being detected. Many of the offices have multiple chairs, monitors, keyboards, etc., while most of the objects found in the kitchens occur only once in a scene. Overall, having enough true positive detections is more important than having them accurately localized, as our 3D spatial relationships can handle poor localization, but without true positives there are not enough relationships for learning.

8.5.5 Score Separation Experiment

This experiment is designed to test how the difference in average scores between true and false positive detections affects the improvement in average precision. The difference in scores is controlled by the variable T. Large values of T simulate an object detector that is effective at differentiating between true and false detections.
Since we found localization accuracy had such a significant effect in the previous experiments, we ran this experiment varying both T and σ_box.

Figure 8.15 shows that the improvement in average precision from our method increases as the difference in score between true detections and false positive detections decreases. The improvement in AP increases because our method works best when there is confusion between true and false positives. As the difference in scores decreases, our method's performance remains consistent while the average precision before our method is applied decreases.

This ability to function well regardless of score difference has several advantages. Firstly, we are tolerant of object classifiers that either do not produce a margin of classification or ones where that margin is not very informative. Secondly, we did not need a more elaborate technique for combining scores from 2D detections into 3D. As we described in Section 8.3.1, we use the average of the 2D detection scores for the hypothesis box score. We considered using a more complex technique, such as learning an SVM that would recompute the score based on the 2D detection scores, the number of detections used, the error in triangulation and other scene properties. However, it is clear that accurate initial confidence scores are unnecessary for our approach to be effective.

8.5.6 Branch and Bound Tree vs. Greedy Search

As we discussed in Sections 7.5.2 and 7.6.2, we implemented a branch and bound tree search approach for both inference and structured SVM training. This approach replaced the greedy search based approach used in [Desai et al., 2009]. This experiment is designed to examine the improvement in average precision we derived from using a branch and bound tree search instead of the simpler greedy search approach.

Figure 8.16 shows how our branch and bound search outperforms the greedy search approach. The effect is greater when there is more error in localization.
This is likely because branch and bound is less likely to be misled by a poorly localized, highly scored detection, which the greedy search approach can fixate on early, while branch and bound finds a more optimal overall solution.

The most evident result is that branch and bound also performed more consistently across scene types, with the greedy approach performing poorly on the kitchen data when localization or the number of positive detections is low. These negative results for greedy search arise either because the spatial relationships in the kitchens are harder to identify (possibly because they are less structured than the offices) or because there are fewer objects, and therefore fewer relationships between positive detections, for learning. Either way, branch and bound search seems better able to make use of the available training data and produces a better model.

Figure 8.16: Simulation results that compare the effectiveness of our branch and bound search against the [Desai et al., 2009] greedy search. We varied the localization error, controlled by σ_box; the panels show offices and kitchens with σ_box = 0 and σ_box = 30. Branch and bound results are in red and greedy search results in blue. Solid lines show average precision before applying our method and dotted lines after. From these it is clear that the improvements from the branch and bound approach vary between scene types, but it did provide consistently better results.

8.5.7 Spatial Relationship Comparison

This experiment is designed to rank the effectiveness of the seven spatial relationships we proposed in Section 7.7. We would like to know which relationships are useful in which scenarios. Is there a smaller set of relationships that work as well or better on all scenes?
How does the ranking of useful relationships change as positional error increases?

We compare the effectiveness of each relationship on both office and kitchen scenes at two levels of localization accuracy, good accuracy (σ_box = 0cm) and poor accuracy (σ_box = 30cm), to examine how the best spatial features change when accuracy decreases. We used a greedy approach to ranking the relationships. We performed the experiments by generating 10 versions of each scene and applying each relationship independently. We then took the relationship that led to the highest increase in detection accuracy and repeated the experiment with the combination of that best relationship and each remaining one. We repeated this procedure, adding a new relationship at each step, until all relationships were used.

Figure 8.17 shows the resulting cumulative benefit from each relationship in both types of scenes. It is evident that there is a plateau effect in most types of scene and little benefit from more than 4 types of relationships. However, it is also evident that the types of relationships best suited to each scenario vary depending on both scene type and localization accuracy. In examining these graphs, it is important to remember that small ranking differences might mean little.

Clearly no single relationship outperforms all others and a combination of relationships is necessary. There does appear to be a slight overall decrease in performance from using all relationships together. This is likely because the last few added relationships contribute nothing useful for improvement but allow over-fitted relationship weights to be learned that cause poor performance. With more training data, this would likely not happen, as low weights for bad features would be learned.
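The greedy ranking procedure described above can be sketched as follows, with `evaluate` standing in for a full train/test cycle that returns the average precision of a relationship subset (the names here are ours, not the experiment code):

```python
def greedy_relationship_ranking(relationships, evaluate):
    """Forward selection: at each step add the relationship whose inclusion
    gives the highest AP, recording the cumulative (relationship, AP) pairs."""
    chosen, history = [], []
    remaining = list(relationships)
    while remaining:
        best = max(remaining, key=lambda r: evaluate(chosen + [r]))
        chosen.append(best)
        remaining.remove(best)
        history.append((best, evaluate(chosen)))
    return history
```

The returned history corresponds to one curve in Figure 8.17: each entry is the next relationship added and the detection accuracy achieved by the cumulative set.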
Fortunately, the overall degradation from using all relationships is minimal and likely outweighed by the usefulness of not having to adapt the extracted features to the scene or localization accuracy.

There are no consistently useless relationships and none that clearly dominate. It does appear that the combination of at least one distance-based and one vertical-orientation-based relationship is necessary for good results. For distance-based relationships, the absolute distances based on human metrics that we used are more useful than relative bounding box distances, but they combine effectively. When localization is poor, fewer features are useful and the combination of absolute distance with coplanarity seems to provide most of the improvement.

Figure 8.17: Simulation results that show the cumulative benefit of each of the spatial relationships, with panels for (a) offices with σ_box = 0, (b) offices with σ_box = 30, (c) kitchens with σ_box = 0 and (d) kitchens with σ_box = 30. The relationships were chosen by greedily selecting the one that led to the greatest improvement at each step. We varied the localization error, controlled by σ_box, and ran the experiment on both offices and kitchens. The order in which the features were applied is shown in the table embedded in each graph.

With vertical orientation relationships, both vertical orientation and coplanarity were useful when localization is good. Besideness was useful in kitchens, likely because counters and cabinets keep objects aligned with each other relative to the walls, something which is not necessarily true in offices. The greater use of overhead spaces and objects embedded into the counters meant that vertical orientation was more useful in kitchens than in offices.

Chapter 9: Conclusion

This thesis has examined the effectiveness of 3D spatial relationships for performing or improving object detection and classification.
This chapter will discuss our conclusions on this subject and the future directions this work could follow.

9.1 Conclusions

Here is a breakdown of our conclusions by chapter:

In Chapter 4 we determined that object detections could provide a good basis for performing scene classification. We believe this is because objects are the basis on which humans create and classify scenes. This scene classification can be used to inform a scene-specific spatial model for improving object detections.

In Chapter 5 we identified the qualitative spatial reasoning literature as a good source of analysis on qualitative spatial relationships. We utilized this research and determined that seven qualitative spatial relationships could cover most of the possible useful binary relationships available when comparing axis-aligned bounding box object detections.

In Chapter 6 we demonstrated that, even using simple qualitative spatial relationships, it is possible to perform object classification using only information about an object's spatial relationships relative to known objects. This indicates that 3D qualitative spatial relationships are usefully informative about object type and that they can likely be combined with other information, such as detector results, to improve object detection accuracy.

In Chapters 7 and 8 we demonstrated a model for removing false positive 3D detections using a structured support vector machine. We provided a technique for producing 3D detections from 2D detections using a fiducial marker and demonstrated that our approach was successful at significantly improving overall detection rates on real world scenes of both offices and kitchens.

We showed our approach improved on the model it was based upon, that of [Desai et al., 2009], by utilizing a branch and bound tree search to improve both training and inference.
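As a rough illustration of the kind of model summarized above, the sketch below scores a candidate subset of 3D detections by adding each detection's confidence to weighted pairwise spatial-relationship features, then searches for the highest-scoring subset. The names (`near`, `score_subset`, `best_subset`), the single "near" relationship, the hand-set weights, and the brute-force enumeration are all simplifying assumptions; the thesis model uses seven relationship types, weights learned with a structured support vector machine, and a branch and bound tree search.

```python
from itertools import chain, combinations

def near(a, b, thresh=1.0):
    """Illustrative pairwise relationship: 1.0 if the detection
    centroids lie within thresh metres of each other, else 0.0."""
    dist = sum((p - q) ** 2 for p, q in zip(a["centroid"], b["centroid"])) ** 0.5
    return 1.0 if dist <= thresh else 0.0

def score_subset(subset, pair_weights):
    """Sum of unary detection confidences plus weighted pairwise
    relationship features over every pair kept in the subset."""
    unary = sum(d["score"] for d in subset)
    pairwise = sum(
        pair_weights.get(tuple(sorted((a["label"], b["label"]))), 0.0) * near(a, b)
        for a, b in combinations(subset, 2))
    return unary + pairwise

def best_subset(detections, pair_weights):
    """Brute-force search over all subsets; fine for a handful of
    detections, but replaced by a branch and bound search in practice."""
    subsets = chain.from_iterable(
        combinations(detections, k) for k in range(len(detections) + 1))
    return max(subsets, key=lambda s: score_subset(s, pair_weights))
```

A spurious detection with a low (here, negative, margin-style) confidence and no supporting spatial relationships lowers the subset score, so the search drops it.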
The success of our model depended on either having sufficient true positive detections in the training data or having well-localized true positive detections. Our results showed we could effectively handle poor separation between the detection confidence scores of true and false positive detections. Finally, we analyzed the cumulative benefits of the spatial relationships and determined that the most effective spatial relationships depend on both the scene type and localization accuracy. There was no "magic bullet" relationship that was either sufficient on its own or always outperformed the others, and a mixture of relationships is always useful. However, in general, it was always useful to have a distance-based relationship and one that measured the vertical orientation of the objects. Using all relationships is a good option, but a small improvement might be attained by selecting a subset of our relationships appropriate to the localization accuracy and scene type.

Our work also better demonstrated the effectiveness of spatial relationships for improving object detection than that of [Desai et al., 2009] because we removed the factor of co-occurrence between object types. The training and test data they used had 20 classes of object to detect, and this broad range of objects came from a correspondingly large number of types of scenes. Therefore, in [Desai et al., 2009], co-occurrence of object types was a very effective way of improving detection rates. For example, if a car was detected in an image, it is unlikely that a chair would also occur, and therefore all chair detections can have correspondingly decreased scores. It is unclear how much of the improvement observed in [Desai et al., 2009] was from spatial relationships and how much was from co-occurrence. In our work, we restricted ourselves to single types of scenes and objects that frequently appear in those scenes, so co-occurrence provided little information.
This meant our work better demonstrated the effectiveness of spatial relationships by themselves.

In conclusion, we have demonstrated that 3D qualitative spatial relationships can provide an effective basis for improving object detections by removing false positive detections. As long as a robot can identify the scene type, has a mechanism for determining the 3D position of object detections, and there is data for constructing a model appropriate for that type of scene, our approach should improve the overall accuracy of detection. With the increasing prevalence and decreasing price of 3D sensors, we believe that both the 3D detections and the data will become increasingly available and that 3D spatial relationships should be actively explored.

9.2 Future Directions

The use of 3D qualitative spatial relationships for improving object detection is an area of research still in its infancy, and we believe there are many directions that could be taken to improve upon our work.

Robot Implementation
It would be valuable to demonstrate this work on an actual robot, functioning in real-time in an indoor environment. This goal is mostly a matter of implementation, as there are no major limiting factors preventing it from being achieved. The main missing component is an implementation of an exploration technique which can produce 3D localized detections, as was demonstrated in [Meger et al., 2008]. We opted not to pursue this because implementation of our approach on a robot would have taken a significant amount of time without significantly changing our results.

Improved 3D Object Detection Localization
One improvement we believe would be highly effective at improving on our work is a better technique for determining the position, size and extent of the 3D detections, and a representation that could utilize this data. Our approach was limited by the fact that we relied on only images of the scene and often had only two or three detections from which to determine the object position.
As such, we were limited in how much information we could reliably determine about the object's position, and opted to determine only the position of the centroid and infer the size of the object detection from the average sizes previously observed in the training data.

We believe that if we were using an actual robot in the environment, equipped with a camera to determine the visual detection position and a scanning 3D range finder to determine the size and extent of the object, there would be several advantages. Firstly, we have shown that better localization significantly improves our overall accuracy. A 3D range finder should allow the robot to much better approximate the location of many objects. Secondly, by moving from axis-aligned bounding boxes to a representation of the size and shape of the object, new spatial relationships such as size comparison and relative topology become available.

Non-scene Specific Model
Our final model for improving object detection accuracy was scene-specific and, in this thesis, we have not explored the options for using a non-scene-specific model. Our motivation for using a scene-specific model was twofold. Firstly, our data sets consisted only of individual scenes, and there were no examples which contained both offices and kitchens. Secondly, it simplified the overall learning problem by only learning relationships between objects found in the same environments. We decided this simplification was necessary because of the limited available training data, and demonstrated it was reasonable by showing that scenes could be classified with good accuracy based on initial object detections.

However, buildings are not simply collections of scenes, and there are many environments that don't have an obvious scene label. We believe that our approach would scale up to a scene-independent model if we had more available training data.
Our evidence for this is that the model in [Desai et al., 2009] was not scene dependent and was able to handle learning a much larger number of object type pairs than we attempted. One advantage of the scene-dependent model is obviously that it can learn relationships that are scene dependent (e.g., coffee cups in kitchens are found in different locations than coffee cups in offices). So future work should explore ways of mediating between scene-dependent and scene-independent models and assessing which is appropriate for a given situation.

Expanded Scene and Object Categories
It would be useful to demonstrate this work on a larger collection of scene types and with more object types. Specifically, it would be interesting to explore less structured scenes like living rooms or large scenes like shops. More scene types would give us a better insight into the value of different types of spatial relationships and different representations of object detections. More object types, if they can be detected, should hopefully lead to improved results, as we have demonstrated that having more true positive detections improves our overall results.

Learn the Qualitative Spatial Relationships
This thesis used spatial relationships that were derived from the qualitative spatial reasoning literature and other work on context-aided computer vision. As such, the spatial relationships were created from human expert experience rather than learned directly from observations of the scenes. An interesting future direction for this work would be to learn qualitative spatial relationships from quantitative spatial information about scenes. We suspect that learning qualitative spatial relationships would require significantly more 3D scene data than we were able to produce for this thesis.

Bibliography

M. Aiello and J. van Benthem. Logical Patterns in Space. ILLC Scientific Publications. Institute for Logic, Language and Computation, 1999.

M. A. Aizerman, E. A. Braverman, and L. Rozonoer.
Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, pages 821–837, 1964.

J. F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983.

S. Andrews, T. Hofmann, and I. Tsochantaridis. Multiple instance learning with generalized support vector machines. In American Association for Artificial Intelligence, pages 943–944, 2002.

D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A. Ng. Discriminative learning of Markov random fields for segmentation of 3D range data. In Computer Vision and Pattern Recognition, 2005.

C. B. Barber, D. P. Dobkin, and H. Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22(4):469–483, 1996.

S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002.

Bethesda Softworks. Elder Scrolls 4: Oblivion. [DVD-ROM], 2006.

A. Bosch, X. Muñoz, and R. Martí. Review: Which is the best way to organize/classify images by content? Image and Vision Computing, 25(6):778–791, 2007.

L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In European Conference on Computer Vision, 2010.

C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

C. Galleguillos, B. McFee, S. Belongie, and G. R. G. Lanckriet. Multi-class object localization by combining local contextual interactions. In Computer Vision and Pattern Recognition, pages 113–120, 2010.

E. Clementini and P. D. Felice. Approximate topological relations. International Journal of Approximate Reasoning, 16(2):173–204, 1997.

A. G. Cohn and S. M. Hazarika. Qualitative spatial representation and reasoning: An overview. Fundamenta Informaticae, 46:1–29, 2001.

C. Cortes and V. Vapnik.
Support-vector networks. Machine Learning, 20(3):273–297, 1995.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, Mar. 2002.

N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, pages 886–893, 2005.

T. de Laguna. Point, line, and surface, as sets of solids. The Journal of Philosophy, 19:449–461, 1922.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pages 248–255, 2009.

C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In International Conference on Computer Vision, pages 229–236, 2009.

T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.

S. K. Divvala, D. Hoiem, J. Hays, A. A. Efros, and M. Hebert. An empirical study of context in object detection. In Computer Vision and Pattern Recognition, pages 1271–1278, 2009.

M. Duckham, J. Lingham, K. T. Mason, and M. F. Worboys. Qualitative reasoning about consistency in geographic information. Information Sciences, 176(6):601–627, 2006.

E. F. Ersi and J. K. Tsotsos. Visual place categorization in indoor environments. In Canadian Conference on Computer and Robot Vision, pages 448–453, 2012.

T. Escrig and F. Toledo. Qualitative Spatial Reasoning Theory and Practice: Application to Robot Navigation. Frontiers in Artificial Intelligence and Applications, 47. IOS Press, 1998.

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.

P. F. Felzenszwalb and D. P. Huttenlocher.
Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, 2005.

P. F. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008.

R. Fergus, F. Li, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In International Conference on Computer Vision, 2005.

M. Fiala. ARTag, a fiducial marker system using digital techniques. In Computer Vision and Pattern Recognition, pages 590–596, 2005.

D. F. Fouhey, V. Delaitre, A. Gupta, A. A. Efros, I. Laptev, and J. Sivic. People watching: Human actions as a cue for single-view geometry. In European Conference on Computer Vision, pages 732–745, 2012.

A. U. Frank. Qualitative spatial reasoning about cardinal directions. In International Joint Conference on Artificial Intelligence, pages 157–167, 1991.

J. Freeman. The modeling of spatial relations. Computer Graphics and Image Processing, 4:156–171, 1975.

C. Freksa. Using orientation information for qualitative spatial reasoning. In Spatio-Temporal Reasoning, pages 162–178, 1992.

Y. Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293–318, 2001.

Y. Freund and L. Mason. The alternating decision tree learning algorithm. In International Conference on Machine Learning, pages 124–133, 1999.

Y. Freund and R. E. Schapire. A brief introduction to boosting. In International Joint Conference on Artificial Intelligence, pages 1401–1406, 1999.

J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–374, 2000.

C. Galleguillos and S. Belongie. Context based object categorization: A critical survey. Computer Vision and Image Understanding, 114(6):712–722, 2010.

C. Galleguillos, A. Rabinovich, and S. Belongie. Object categorization using co-occurrence, location and appearance. In Computer Vision and Pattern Recognition, 2008.

S.
Gottschalk, M. C. Lin, and D. Manocha. OBBTree: A hierarchical structure for rapid interference detection. In SIGGRAPH, pages 171–180, 1996.

K. Grauman and T. Darrell. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8:725–760, 2007.

J. Hays and A. A. Efros. IM2GPS: Estimating geographic information from a single image. In Computer Vision and Pattern Recognition, pages 1–8, 2008.

V. Hedau, D. Hoiem, and D. A. Forsyth. Recovering the spatial layout of cluttered rooms. In International Conference on Computer Vision, pages 1849–1856, 2009.

S. Helmer and D. Lowe. Using stereo for object recognition. In International Conference of Robotics and Automation, pages 3121–3127, 2010.

D. Hernández and K. Zimmermann. Qualitative Representation of Spatial Knowledge. Lecture Notes in Artificial Intelligence. Springer, 1994.

D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. In Computer Vision and Pattern Recognition, pages 2137–2144, 2006.

D. Hoiem, C. Rother, and J. Winn. 3D LayoutCRF for multi-view object class recognition and segmentation. In Computer Vision and Pattern Recognition, June 2007.

X. Jin and J. Han. Quality threshold clustering. In Encyclopedia of Machine Learning, page 820. Springer, 2010.

A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433–449, 1999.

E. Kalogerakis, A. Hertzmann, and K. Singh. Learning 3D mesh segmentation and labeling. ACM Transactions on Graphics, 29(3), 2010.

A. Kreutzmann, K. Terzić, and B. Neumann. Context-aware classification for incremental scene interpretation. In Workshop on Use of Context in Vision Processing, pages 1–6, 2009.

B. Kröse, O. Booij, and Z. Zivkovic. A geometrically constrained image similarity measure for visual mapping, localization and navigation. In European Conference on Mobile Robots, pages 168–174, Freiburg, Germany, 2007.

H. W.
Kuhn and A. W. Tucker. Nonlinear programming. In Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–492, 1951.

B. Kuipers, R. Browning, B. Gribble, M. Hewett, and E. Remolina. The spatial semantic hierarchy. Artificial Intelligence, 119:191–233, 2000.

J.-F. Lalonde, S. G. Narasimhan, and A. A. Efros. What does the sky tell us about the camera. In European Conference of Computer Vision, pages 354–367, 2008.

S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006.

S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, pages 2169–2178, 2006.

D. C. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In Advances in Neural Information Processing Systems, pages 1288–1296, 2010.

F.-F. Li and L.-J. Li. What, where and who? Telling the story of an image by activity classification, scene recognition and object categorization. In Computer Vision: Detection, Recognition and Reconstruction, volume 285 of Studies in Computational Intelligence, pages 157–171. Springer, 2010.

F.-F. Li and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, pages 524–531, 2005.

L.-J. Li, H. Su, Y. Lim, and F.-F. Li. Objects as attributes for scene classification. In European Conference of Computer Vision, Workshop on Parts and Attributes, pages 57–69, 2010.

S. Z. Li. Markov Random Field Modeling in Image Analysis. Springer Publishing Company, Incorporated, 3rd edition, 2009.

C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT Flow: Dense correspondence across different scenes. In European Conference on Computer Vision, pages 28–42, 2008.

D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.

M. Mavrovouniotis and G. Stephanopoulos.
Formal order-of-magnitude reasoning in process engineering. Readings in Qualitative Reasoning about Physical Systems, pages 323–336, 1990.

M. L. Mavrovouniotis and G. Stephanopoulos. Order-of-magnitude reasoning with O[M]. AI in Engineering, 4(3):106–114, 1989.

D. Meger, P.-E. Forssén, K. Lai, S. Helmer, S. McCann, T. Southey, M. Baumann, J. J. Little, and D. G. Lowe. Curious George: An attentive semantic robot. Robotics and Autonomous Systems, 56(6):503–511, June 2008.

D. Meger, C. Wojek, B. Schiele, and J. J. Little. Explicit occlusion reasoning for 3D object detection. In British Machine Vision Conference, 2011.

J. Modayil and B. Kuipers. Bootstrap learning for object discovery. In International Conference on Intelligent Robots and Systems, 2004.

D. R. Montello. Scale and multiple psychologies of space. In Conference on Spatial Information Theory, pages 312–321, 1993.

M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In International Joint Conference on Artificial Intelligence, 2003.

S. G. Narasimhan and S. K. Nayar. Vision and the atmosphere. International Journal of Computer Vision, 48(3):233–254, 2002.

A. Neubeck and L. Van Gool. Efficient non-maximum suppression. In International Conference on Pattern Recognition, pages 850–855, 2006.

B. Neumann. Bayesian compositional hierarchies: a probabilistic structure for scene interpretation. In Logic and Probability for Scene Interpretation, 2008.

B. Neumann and R. Möller. On scene interpretation with description logics. Image and Vision Computing, Special Issue on Cognitive Vision, 26(1):82–101, 2007.

A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.

A. Pronobis, B. Caputo, P. Jensfelt, and H. I. Christensen. A discriminative approach to robust visual place recognition.
In International Conference on Intelligent Robots and Systems, pages 3829–3836, 2006.

A. Pronobis, O. M. Mozos, and B. Caputo. SVM-based discriminative accumulation scheme for place recognition. In International Conference on Robotics and Automation, pages 522–529, 2008.

A. Pronobis, O. M. Mozos, B. Caputo, and P. Jensfelt. Multi-modal semantic place classification. The International Journal of Robotics Research, 29:298–320, 2010.

A. Quattoni and A. B. Torralba. Recognizing indoor scenes. In Computer Vision and Pattern Recognition, pages 413–420, 2009.

J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Mateo, CA, 1993.

D. A. Randell, Z. Cui, and A. G. Cohn. A spatial logic based on regions and connection. In Knowledge Representation, pages 165–176, 1992.

A. Ranganathan and F. Dellaert. Semantic modeling of places using objects. In Robotics: Science and Systems, 2007.

A. Ranganathan and J. Lim. Visual place categorization in maps. In International Conference on Intelligent Robots and Systems, pages 3982–3989, 2011.

J. Renz. Qualitative Spatial Reasoning with Topological Information. Lecture Notes in Computer Science. Springer-Verlag, 2002.

R. Rimey and C. Brown. Control of selective perception using Bayes nets and decision theory. International Journal of Computer Vision, 12:173–207, 1994.

F. Roberts and P. Suppes. Some problems in the geometry of visual perception. Synthese, 17(1):173–201, 1967.

B. Russell, A. Torralba, K. Murphy, and W. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77:157–173, 2008.

B. C. Russell, A. Torralba, C. Liu, R. Fergus, and W. T. Freeman. Object recognition by scene alignment. In Advances in Neural Information Processing Systems, 2007.

R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39:135–168, 2000.

C. Schlieder.
Representing visible locations for qualitative navigation. In Qualitative Reasoning and Decision Technologies, pages 523–532. CIMNE, 1993.

J. Shotton, A. Blake, and R. Cipolla. Multiscale categorical object recognition using contour fragments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7):1270–1281, July 2008.

J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–23, 2009.

B. Siciliano and O. Khatib. Springer Handbook of Robotics. Springer, Berlin, 2008.

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In European Conference of Computer Vision, pages 746–760, 2012.

I. Simon and S. M. Seitz. Scene segmentation using the wisdom of crowds. In European Conference of Computer Vision, pages 541–553, 2008.

P. Simons. Parts: A Study in Ontology. Routledge, 1987.

A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.

T. Southey and J. J. Little. Object discovery through motion, appearance and shape. In American Association for Artificial Intelligence, Workshop on Cognitive Robotics. AAAI Press, 2006.

T. Southey and J. J. Little. 3D spatial relationships for improving object detection. In International Conference on Robots and Automation, 2012.

T. Spexard, S. Li, B. Wrede, J. Fritsch, G. Sagerer, O. Booij, Z. Zivkovic, B. Terwijn, and B. Kröse. BIRON, where are you? Enabling a robot to learn new places in a real home environment by integrating spoken dialog and visual localization. In International Conference on Intelligent Robots and Systems, 2006.

T. Strat. Employing contextual information in computer vision. In DARPA Image Understanding Workshop, pages 217–229, 1993.

A. Swadzba and S. Wachsmuth. Aligned scene modeling of a robot's vista space: an evaluation.
In AAAI Workshop on Language-Action Tools for Cognitive Artificial Agents: Integrating Vision, Action and Language, pages 30–35, 2011.

B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems. MIT Press, 2003.

K. Terzić and B. Neumann. Context-based probabilistic scene interpretation. In International Conference on Artificial Intelligence in Theory and Practice, pages 155–164, 2010.

A. R. Tilley and S. B. Wilcox. The Measure of Man and Woman: Human Factors in Design. Wiley, New York, 2002.

A. Torralba, K. Murphy, W. T. Freeman, and M. Rubin. Context-based vision system for place and object recognition. In International Conference on Computer Vision, pages 273–280, 2003.

A. Torralba, K. Murphy, and W. T. Freeman. Contextual models for object detection using boosted random fields. In Advances in Neural Information Processing Systems, 2004.

A. Torralba, K. P. Murphy, and W. T. Freeman. Using the forest to see the trees: exploiting context for visual object detection and localization. Communications of the ACM, 53(3):107–114, 2010.

G. Tsai, C. Xu, J. Liu, and B. Kuipers. Real-time indoor scene understanding using Bayesian filtering with motion cues. In International Conference on Computer Vision, pages 121–128, 2011.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning, 2004.

S. Vasudevan and R. Siegwart. A Bayesian space conceptualization and place classification for semantic maps in mobile robotics. Robotics and Autonomous Systems, 56(6):522–537, June 2008.

A. Vedaldi. A MATLAB wrapper of SVMstruct, 2011.

L. Vieu. Spatial representation and reasoning in AI. In Spatial and Temporal Reasoning, pages 5–41, 1997.

P. Viola and M. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

P. Viswanathan, D. Meger, T. Southey, J. J. Little, and A. Mackworth.
Automated spatial-semantic modeling with applications to place labeling and informed search. In Canadian Conference on Computer and Robot Vision, pages 284–291, 2009.

P. Viswanathan, T. Southey, J. J. Little, and A. Mackworth. Place classification using visual object categorization and global information. In Canadian Conference on Computer and Robot Vision, pages 1–7, 2011.

J. Winn and J. Shotton. The layout consistent random field for recognizing and segmenting partially occluded objects. In Computer Vision and Pattern Recognition, pages 37–44, 2006.

J. Wu, H. I. Christensen, and J. M. Rehg. Visual place categorization: Problem, dataset, and algorithm. In International Conference on Intelligent Robots and Systems, pages 4763–4770, 2009.

C. Xu and B. Kuipers. Towards the object semantic hierarchy. In International Conference on Development and Learning, 2010.

Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning, pages 412–420, 1997.

J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213–238, 2007.

