Towards large-scale nonparametric scene parsing ofimages and videobyFrederick TungA THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University Of British Columbia(Vancouver)March 2017c© Frederick Tung, 2017AbstractIn computer vision, scene parsing is the problem of labelling every pixel in animage or video with its semantic category. Its goal is a complete and consistent se-mantic interpretation of the structure of the real world scene. Scene parsing formsa core component in many emerging technologies such as self-driving vehicles andprosthetic vision, and also informs complementary computer vision tasks such asdepth estimation.This thesis presents a novel nonparametric scene parsing framework for im-ages and video. In contrast to conventional practice, our scene parsing frameworkis built on nonparametric search-based label transfer instead of discriminative clas-sification. We formulate exemplar-based scene parsing for both 2D (from images)and 3D (from video), and demonstrate accurate labelling on standard benchmarks.Since our framework is nonparametric, it is easily extensible to new categories andexamples as the database grows.Nonparametric scene parsing is computationally demanding at test time, andrequires methods for searching large collections of data that are time and mem-ory efficient. This thesis also presents two novel binary encoding algorithms forlarge-scale approximate nearest neighbor search: the bank of random rotations isdata independent and does not require training, while the supervised sparse pro-jections algorithm targets efficient search of high-dimensional labelled data. Weevaluate these algorithms on standard retrieval benchmarks, and then demonstratetheir integration into our nonparametric scene parsing framework. Using 256-bitcodes, binary encoding reduces search times by an order of magnitude and mem-ory requirements by three orders of magnitude, while maintaining a mean per-classaccuracy within 1% on the 3D scene parsing task.iiPrefaceThis thesis presents original and independent research conducted under the guid-ance of my supervisor, Dr. James J. Little. Parts of this thesis have been publishedas follows.Chapter 2. An early version of the nonparametric 2D scene parsing algorithmCollageParsing was published in F. Tung and J. J. Little, CollageParsing: Nonpara-metric scene parsing by adaptive overlapping windows, European Conference onComputer Vision (ECCV), 2014 [130]. The 2D scene parsing work was later ex-tended to the journal article F. Tung and J. J. Little, Scene parsing by nonparametriclabel transfer of content-adaptive windows, Computer Vision and Image Under-standing (CVIU), 143:191-200, 2016 [132]. I conceived the initial idea, developedthe algorithm, and designed and executed the experiments.Chapter 3. A version of the nonparametric 3D scene parsing algorithm MF3Dis presently in peer review. I conceived the idea of this extension, developed thealgorithm, and designed and executed the experiments.Chapter 4. The bank of random rotations algorithm is my individual contribu-tion to a collaborative publication F. Tung, J. Martinez, H. H. Hoos, and J. J. Little,Bank of quantization models: A data-specific approach to learning binary codesfor large-scale retrieval applications, IEEE Winter Conference on Applications ofComputer Vision (WACV), 2015 [133]. I conceived the initial idea, developed thealgorithm, and designed and executed the experiments described in this thesis. J.Martinez contributed a training-based extension to the algorithm and additional ex-periments, which are not included in this thesis. Dr. Holger H. Hoos and Dr. JamesJ. Little contributed to the writing. The supervised sparse projections algorithmwill be published in F. Tung and J. J. Little, SSP: Supervised Sparse Projectionsiiifor large-scale retrieval in high dimensions, Asian Conference on Computer Vision(ACCV), 2016 [131]. I conceived the idea, developed the algorithm, and designedand executed the experiments.Dr. James J. Little contributed to the planning and writing of all of the abovepublications, and provided valuable overall guidance.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Scene parsing: task and motivation . . . . . . . . . . . . . . . . . 21.2 The role of big data . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Nonparametric scene parsing . . . . . . . . . . . . . . . . . . . . 51.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Nonparametric 2D Scene Parsing of Images . . . . . . . . . . . . . . 92.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.1.1 Markov random fields in scene parsing . . . . . . . . . . 102.1.2 Nonparametric scene parsing . . . . . . . . . . . . . . . . 112.1.3 Parametric scene parsing . . . . . . . . . . . . . . . . . . 122.1.4 Segmentation methods with overlapping regions . . . . . 132.1.5 Scene collage . . . . . . . . . . . . . . . . . . . . . . . . 142.1.6 Object proposals . . . . . . . . . . . . . . . . . . . . . . 14v2.2 Algorithm description: CollageParsing . . . . . . . . . . . . . . . 172.2.1 Markov random field model . . . . . . . . . . . . . . . . 172.2.2 Forming the retrieval set . . . . . . . . . . . . . . . . . . 182.2.3 Computing object proposal windows . . . . . . . . . . . . 192.2.4 Computing unary potentials . . . . . . . . . . . . . . . . 202.2.5 Performing MRF inference . . . . . . . . . . . . . . . . . 222.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 Nonparametric 3D Scene Parsing of Video . . . . . . . . . . . . . . 323.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.2 Algorithm description: MF3D . . . . . . . . . . . . . . . . . . . 353.2.1 Preliminaries: 3D reconstruction . . . . . . . . . . . . . . 373.2.2 3D CRF model . . . . . . . . . . . . . . . . . . . . . . . 393.2.3 Unary potentials update by label transfer . . . . . . . . . 413.2.4 Computation time . . . . . . . . . . . . . . . . . . . . . 463.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474 Scaling Nonparametric Scene Parsing: Methods for Large-Scale Search 524.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.2 Bank of random rotations . . . . . . . . . . . . . . . . . . . . . . 554.2.1 Algorithm description . . . . . . . . . . . . . . . . . . . 554.2.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . 584.3 Supervised sparse projections . . . . . . . . . . . . . . . . . . . . 634.3.1 Algorithm description . . . . . . . . . . . . . . . . . . . 644.3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . 674.4 Integration with 3D scene parsing . . . . . . . . . . . . . . . . . 765 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84viList of TablesTable 2.1 Overall per-pixel and average per-class labelling accuracy onthe SIFT Flow dataset [82] . . . . . . . . . . . . . . . . . . . 25Table 2.2 Overall per-pixel and average per-class labelling accuracy onthe LM+SUN dataset [127] . . . . . . . . . . . . . . . . . . . 28Table 3.1 Comparison to state-of-the-art methods for 3D scene parsing ofvideo on the KITTI labelled benchmark . . . . . . . . . . . . . 48Table 4.1 MF3D with binary encoding on KITTI labelled benchmark us-ing 256-bit binary codes . . . . . . . . . . . . . . . . . . . . . 77viiList of FiguresFigure 1.1 Sample output of our nonparametric scene parsing algorithmfor images (Section 2) on the SIFT Flow benchmark [82]. . . 2Figure 1.2 Sample output of our nonparametric 3D scene parsing algo-rithm for video (Section 3) to a depth of 25 metres on theKITTI labelled benchmark [38]. . . . . . . . . . . . . . . . . 6Figure 2.1 High-level overview of the CollageParsing scene parsing pipeline.CollageParsing constructs a Markov random field over imagepixels and performs nonparametric label transfer at the levelof object proposals. Object proposals are designed to preserveobjects instead of fragmenting them. Each of the four compo-nents of the pipeline (blue boxes) is described in Sections 2.2.2to 2.2.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17Figure 2.2 Examples of image windows with high ‘objectness’, as com-puted using Zitnick and Dolla´r’s Edge Boxes method [151].In each image, windows with the top ten objectness scores areshown. Images are from the SIFT Flow dataset [82]. . . . . . 19Figure 2.3 Visualization of the computation of the unary potential ψi bylabel transfer. The contribution of one object proposal win-dow in the query image is depicted. The steps are explained inSection 2.2.4. . . . . . . . . . . . . . . . . . . . . . . . . . . 22viiiFigure 2.4 Visualization of the unary potential for three sample query im-ages. Top row, from left to right: building, car, plant, sidewalk,and sky categories. Middle row, from left to right: building,car, fence, person, and road categories. Bottom row, from leftto right: building, car, road, sky, and tree categories. . . . . . 23Figure 2.5 Frequency counts of the semantic categories in the SIFT Flowdataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24Figure 2.6 Effect of varying the retrieval set size K, the MRF smoothingparameter λ, and γ on overall per-pixel accuracy and averageper-class accuracy on the SIFT Flow dataset. . . . . . . . . . 26Figure 2.7 Examples of scene parsing results with and without the spa-tial refinement post-processing step. From left to right: queryimage, labelling without post-processing, labelling with post-processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27Figure 2.8 Examples of scene parsing results on the SIFT Flow dataset.From left to right: query image, ground truth labelling, Su-perParsing [127] predicted labelling, CollageParsing predictedlabelling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29Figure 2.9 Varying γ to trade off overall per-pixel accuracy and averageper-class accuracy on the LM+SUN dataset . . . . . . . . . . 30Figure 2.10 Examples of indoor scene parsing results on the LM+SUNdataset. Top left: query image; top right: ground truth la-belling; bottom left: SuperParsing [127] predicted labelling;bottom right: CollageParsing predicted labelling. . . . . . . . 31Figure 3.1 Overview of the MF3D processing pipeline for 3D scene pars-ing. MF3D takes as input a pair of stereo RGB images, fromwhich depth and the current camera pose with respect to the3D reconstruction are estimated (Section 3.2.1). A label trans-fer operation is performed in image space and projected intothe 3D reconstruction as a unary potentials update. A 3D CRF,defined over visible voxels, is solved to obtain a semantic la-belling of the scene in 3D (Sections 3.2.2 and 3.2.3). . . . . . 36ixFigure 3.2 The depth feature descriptor fd of a window is an nd-dimensionalvector, in which each entry is determined by comparing the es-timated depth at two locations (r1, c1) and (r2, c2). These lo-cations are normalized window coordinates, i.e. 0 ≤ r1, c1, r2, c2 ≤1. If the first location is at a greater depth, the entry in fd is setto +1; otherwise, the entry is set to -1. . . . . . . . . . . . . . 42Figure 3.3 Visualization of image-space label transfer for a single regionproposal Wj . . . . . . . . . . . . . . . . . . . . . . . . . . . 44Figure 3.4 Visualization of the image-space unary potential update for asample image. Higher saturation represents higher values of ψ∗i,l. 45Figure 3.5 Visualization of the 3D CRF unary potentials for a sample im-age to a maximum depth of 25 meters. Voxel unary potentialsare accumulated over multiple frames in an online manner as arolling average. Higher saturation represents higher values ofψi,l. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46Figure 3.6 Qualitative 3D scene parsing results on the KITTI benchmark.Left column: RGB frame and labelling result from the cam-era pose. Right column: 3D renderings of the labelling resultfrom different virtual viewpoints. Labelling results are to amaximum depth of 25 meters. . . . . . . . . . . . . . . . . . 49Figure 3.7 Failure cases on the KITTI benchmark (see Section 3.3 for dis-cussion). Labelling results are to a maximum depth of 25 meters. 50Figure 3.8 Sensitivity experiments on the KITTI labelled benchmark (seetext for discussion). . . . . . . . . . . . . . . . . . . . . . . . 50Figure 4.1 Toy example illustrating neighbor retrieval under two differentsets of hashing hyperplanes. The hyperplanes in (a) map Aand B to the same binary code, and C to a different binarycode. The hyperplanes in (b) map B and C to the same code,and A to a different code. . . . . . . . . . . . . . . . . . . . . 55Figure 4.2 Toy example illustrating a query with a bank of random rotations 57xFigure 4.3 Retrieval performance on SIFT1M unsupervised retrieval bench-mark with 32-bit, 64-bit, and 128-bit binary codes (SH: Spec-tral hashing, ITQ: Iterative quantization, KMH: K-means hash-ing, BankRR: Bank of random rotations). . . . . . . . . . . . 59Figure 4.4 Retrieval performance on GIST1M unsupervised retrieval bench-mark with 32-bit, 64-bit, and 128-bit binary codes (SH: Spec-tral hashing, ITQ: Iterative quantization, KMH: K-means hash-ing, BankRR: Bank of random rotations). . . . . . . . . . . . 60Figure 4.5 Retrieval performance on CIFAR-10 unsupervised retrieval bench-mark with 32-bit, 64-bit, and 128-bit binary codes (SH: Spec-tral hashing, ITQ: Iterative quantization, KMH: K-means hash-ing, BankRR: Bank of random rotations). . . . . . . . . . . . 61Figure 4.6 Qualitative retrieval results of the bank of random rotationson the CIFAR-10 dataset. From left to right: query image;feature-space (Gist) neighbors; 32-bit code neighbors; 128-bitcode neighbors. . . . . . . . . . . . . . . . . . . . . . . . . . 62Figure 4.7 Sample results. Retrieval accuracy versus encoding time onthe SUN RGB-D benchmark [122] using 4096-dimensionalconvolutional neural network features. Binary encoding meth-ods: supervised discrete hashing (SDH) [114]; iterative quan-tization with canonical correlation analysis (ITQ-CCA) [43];sparse projections (SP) [143]; locality-sensitive hashing (LSH)[18, 39]. See Section 4.3.2 for complete results. . . . . . . . . 63Figure 4.8 Experimental results on the SUN RGB-D dataset [122] us-ing 4096-dimensional convolutional neural network features(Places-CNN [150]) and binary code lengths b = 32, 64, 96,and 128. Results are averaged over ten trials. Binary encodingmethods: supervised sparse projections (SSP); supervised dis-crete hashing (SDH) [114]; iterative quantization with canon-ical correlation analysis (ITQ-CCA) [43]; sparse projections(SP) [143]; locality-sensitive hashing (LSH) [18, 39]. . . . . . 70xiFigure 4.9 Experimental results on the AwA dataset [79] using 4096-dimensionalconvolutional neural network features (VGG-19 [117]) and bi-nary code lengths b = 32, 64, 96, and 128. Results are av-eraged over ten trials. Binary encoding methods: supervisedsparse projections (SSP); supervised discrete hashing (SDH)[114]; iterative quantization with canonical correlation analy-sis (ITQ-CCA) [43]; sparse projections (SP) [143]; locality-sensitive hashing (LSH) [18, 39]. . . . . . . . . . . . . . . . . 72Figure 4.10 Experimental results on the ImageNet dataset [111] using 4096-dimensional convolutional neural network features (VGG CNN-M [19]) and binary code lengths b = 32, 64, 96, and 128. Re-sults are averaged over ten trials. Binary encoding methods:supervised sparse projections (SSP); supervised discrete hash-ing (SDH) [114]; iterative quantization with canonical correla-tion analysis (ITQ-CCA) [43]; sparse projections (SP) [143];locality-sensitive hashing (LSH) [18, 39]. . . . . . . . . . . . 73Figure 4.11 Retrieval accuracy versus percentage of non-zeros in the SSPprojection matrix P, for SUN RGB-D at b = 64 bits. Thesparsity of P is determined by the parameter τ in Eq. 4.4. Weset τ to obtain approximately 20% non-zeros in P in all otherexperiments. . . . . . . . . . . . . . . . . . . . . . . . . . . 74Figure 4.12 Qualitative results on the SUN RGB-D dataset at b = 64 bits.The left column shows query images. The right column showsten of the top retrieved neighbors (sampled from multi-wayties) using SSP. The results indicate that SSP reliably preservessemantic neighbors in the learned binary codes. . . . . . . . . 75Figure 4.13 MF3D with binary encoding on KITTI labelled benchmark:mean per-class accuracy versus binary code length for locality-sensitive hashing (LSH), bank of random rotations (BRR), andsupervised sparse projections (SSP). . . . . . . . . . . . . . . 78xiiFigure 4.14 Qualitative 3D scene parsing results on the KITTI benchmarkwith binary encoding. Left column: RGB frame and labellingresult of MF3D with full dimensional descriptors. Top right:Labelling result of MF3D with bank of random rotations. Bot-tom right: Labelling result of MF3D with supervised sparseprojections. 256-bit binary codes are used and labelling resultsare to a maximum depth of 25 meters. . . . . . . . . . . . . . 80xiiiAcknowledgmentsThis research was funded by a postgraduate scholarship from the Natural Sci-ences and Engineering Research Council of Canada and by a University of BritishColumbia Four Year Doctoral Fellowship.xivChapter 1IntroductionHuman vision is fast and versatile. We rapidly recognize familiar faces and objects,and process complex scenes around us at a glance. Giving machines the samecapacity to interpret still images and moving sequences is an ambitious endeavour,yet one that has seen much progress in recent years. Computer vision as a fieldencompasses a broad range of tasks, from low-level image acquisition to high-level semantic recognition. Low-level computer vision is concerned with basicimage tasks such as image filtering [48, 148], edge or contour detection [2, 25],and 3D reconstruction [89, 121]. High-level vision algorithms analyze images andattempt to draw meaningful inferences about the real-world scenes. The goal ofhigh-level vision is to bridge the semantic gap [120]: the disconnect between theraw pixel data and high-level semantics.High-level semantic inferences about the scene imaged by a camera can bemade at different levels of abstraction. A traditional level of scene inference isscene categorization, or naming [63, 119]. Scene categorization is concerned withdetermining the general semantic category of a scene (e.g. forest, concert hall),which can be useful for contextual object detection or image retrieval. Instead of abasic category name, some applications such as guided image search [68] performinference at the level of nameable attributes (e.g. cluttered, rugged) [29, 79, 101].Scene parsing performs high-level semantic inference at the pixel level [82, 127]:the objective of scene parsing is to assign a semantic category label to every pixelin an image or video.11 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 buildingcarfencemountainpersonroadsidewalkskyunlabelled (ground truth only)Figure 1.1: Sample output of our nonparametric scene parsing algorithm forimages (Section 2) on the SIFT Flow benchmark [82].1.1 Scene parsing: task and motivationScene parsing is a fundamental problem in computer vision. It is concerned withconstructing a complete and consistent semantic interpretation of the structure ofthe real-world scene. In other words, we wish to explain the entire real-worldscene in terms of its semantic components. In contrast to object categorization[11, 99, 102], scene parsing does not assume the image contains a single dom-inant object, as in product images [9] for example. Similar to object detection[31, 41, 107], scene parsing is concerned with identifying spatially well-defined,foreground objects of interest such as vehicles, but unlike object detection, sceneparsing is also concerned with recognizing amorphous background categories suchas grass or pavement. While the output of an object detection algorithm is typicallya set of bounding boxes around detected objects, the output of scene parsing is adense pixelwise labelling of the scene. Fig. 1.1 shows the 2D scene parsing outputfor a sample image.Scene parsing enables many practical applications. It is a core component inemerging technologies such as prosthetic vision [54] and self-driving vehicles andmobile robots [113, 135, 138]. Scene parsing is also used to inform complementarycomputer vision tasks such as depth estimation [81], 3D scene reconstruction [46,76], and urban modelling [22, 88].Many factors make scene parsing a challenging task. Some challenges arerelated to imaging: important components of a scene may be occluded, imaged un-der non-uniform lighting conditions, or appear at different orientations and scales.2However, another important factor is scalability: scene parsing is intrinsically chal-lenging due to the breadth of possible foreground and background categories. Theoperating range of categories depends on the application. For example, standardscene parsing benchmarks for self-driving in urban environments focus on a smallset of relevant categories such as road, building, tree, sidewalk, fence, car, andpedestrian [15, 113, 135]; in practice, the complexity of the self-driving task mayrequire finer-grained categories such as police officers directing traffic. For imageor video search, a broad vocabulary of dozens or hundreds of categories may berequired. In addition, each of these categories can exhibit significant intra-categoryvariation in visual appearance.1.2 The role of big dataIn the past several years, “big data” has transformed how we approach high-levelcomputer vision problems. The emergence and ongoing development of today’sleading architecture for many recognition tasks, deep neural networks [72, 107],has been enabled by the creation of massive labelled databases such as ImageNet[111]. With unprecedented amounts of labelled training examples available fortasks ranging from image classification [111] to 3D reconstruction [122], increas-ingly complex models have been trained to achieve exciting new recognition results[51].Most of the contemporary research in solving high-level vision problems hasadopted a strategy of improving and applying more sophisticated classificationmodels. For example, most methods for scene parsing learn domain-specific clas-sification models to discriminate between different categories of pixels or voxels[28, 77, 113, 135, 138]. A model may be trained to distinguish a tree voxel froma voxel of a car, road, building, or any other category in a pre-defined set of cat-egories. Such methods are said to be parametric: they involve constructing com-pact models of the categories of interest, in which the parameters of the models arelearned from training examples. Inference on new inputs is performed by evaluat-ing them against the learned models. While there has been much work in leveragingbig data for training powerful parametric models, another complementary way toleverage big data for computer vision has been largely underexplored. Nonpara-3metric methods are data-driven instead of model-driven: inference on new inputsis performed by directly comparing them to similar examples in the data. Lever-aging big data for search allows for the development of powerful exemplar-basedalgorithms for visual recognition.Why might a nonparametric, search-based approach to scene parsing be useful?Nonparametric approaches are database driven and do not require expensive train-ing of category models. They can be easily extended to new semantic categoriesas the database grows: a new semantic category can be added to the database with-out having to learn how to distinguish it from all previously learned categories.Consequently, nonparametric scene parsing is scalable to an online growth in thedatabase. This scalability can be useful in embedded applications, such as the addi-tion of new mapped environments to a self-driving vehicle, as well as in web-scaleapplications, such as the addition of new user contributions to an open-universedatabase (e.g. LabelMe [112]). An open-universe database is one in which userscan continually add new images, annotations, or other content.From a theory perspective, there are several reasons why nonparametric ap-proaches are worth studying. Stone [123] proved that, in finite dimensions, the k-nearest neighbour classifier is universally consistent: as the number of data pointsincreases, its risk converges to the Bayes risk, for all probability distributions onthe joint input-output space. Hence, nonparametric approaches are sufficient tosolve any learning problem given enough data. Fixed-size deep neural networksdo not share a similar theoretical guarantee, though they can achieve good prac-tical performance with less data by exploiting the problem structure. In addition,nonparametric approaches are easy to interpret: it is straightforward to trace the ef-fects of changes in the input domain to the output domain. It is also easy to explainwhy nonparametric approaches make the predictions they do. On the other hand,the behaviour of deep neural networks is still not fully understood. For example,Szegedy et al. [125] showed that a small, imperceptible perturbation in the inputto a deep neural network can induce the network to output an arbitrary incorrectclassification label. Such “adversarial examples” can be constructed for practicallyany input, suggesting that deep neural networks have “intrinsic blind spots, whosestructure is connected to the data distribution in a non-obvious way” [125].Nonparametric search and deep learning classification are by no means mu-4tually exclusive; rather, the two approaches are complementary and can often becombined to achieve performance superior to either approach by itself. The novelscene parsing algorithms presented in this thesis leverage the strengths of deeplearning features within a nonparametric search-based framework. In the otherdirection, researchers have leveraged advances in large-scale search to acceleratedeep neural network frameworks [142].1.3 Nonparametric scene parsingScalable nonparametric approaches to classic computer vision problems have beendemonstrated in web-scale studies by Torralba et al. [129] and Hays and Efros[47]. Torralba et al. constructed a database of 80 million tiny (32x32) imagesfrom the web following the WordNet lexical dictionary [30], and demonstratedthat the massive scale of the database enables effective object and scene classifica-tion using simple nonparametric matching techniques, despite the low resolutionof the images. Scene completion, or inpainting, refers to the task of synthesizingmissing parts of an image such that the output image is semantically consistentand visually plausible. Hays and Efros [47] proposed a nonparametric scene com-pletion algorithm that draws on a database of two million unlabelled scenes togenerate potential image completions. Later, Ordonez et al. [98] collected one mil-lion human-captioned photos from the web to develop a nonparametric approachfor automatic image captioning. These studies demonstrate that high-level visiontasks can be approached by combining principled nonparametric search with large,semantically-annotated databases of images.Although scene parsing is a well studied problem in computer vision, it is onlyin recent years that nonparametric solutions have been proposed. Liu et al. [82]were the first to adopt a search-based approach to scene parsing. Given a query im-age, Liu et al.’s ‘label transfer’ algorithm finds the image’s nearest neighbors in adatabase of scenes annotated with semantic labels, warps the neighbors to the queryimage using SIFT Flow [83], and transfers the annotations from the neighbors tothe query image using a Markov random field defined over pixels. A Markov ran-dom field (MRF) is a probabilistic graphical model that is often used to solve dis-crete labelling problems. More recent nonparametric scene parsing algorithms per-5Building Vegetation Car RoadWall/fence Pavement PoleFigure 1.2: Sample output of our nonparametric 3D scene parsing algorithmfor video (Section 3) to a depth of 25 metres on the KITTI labelledbenchmark [38].form label transfer at the level of superpixels [26, 91, 118, 127]. Superpixels offer ahigher level of perceptual abstraction than pixels, enabling more efficient inferencein a random field model. Large, cohesive groups of pixels can be labelled at once.However, while superpixel-based methods tend to perform well on large regions ofbackground categories (sometimes referred to as ‘stuff’ in the literature [52]), theytend to perform less effectively on foreground object categories (‘things’). Thoughbetter than pixels, superpixels are still a relatively low-level unit for label transfer,and they tend to fragment objects.1.4 ContributionsIn this thesis, we first demonstrate that nonparametric scene parsing of 2D im-ages can be improved by performing label transfer at a higher level of percep-tual organization than superpixels. We propose an algorithm called CollageParsingthat performs label transfer at the level of object proposals [1, 151], which un-like superpixels, are designed to preserve entire objects instead of fragmentingthem. Performing label transfer using object proposals enables the construction ofa more effective Markov random field unary potential than previous approaches.The name ‘CollageParsing’ refers to the collage-like unary potential computation,which accumulates evidence from overlapping proposal windows, and derives fromthe name of Tighe and Lazebnik’s landmark ‘SuperParsing’ algorithm [127], whichperforms superpixel-based scene parsing.Furthermore, we propose the first nonparametric method for 3D scene parsing6of video. In 3D scene parsing of video, we are interested in reconstructing a 3Dmodel of the scene from video data, and densely labelling the reconstructed scenewith a set of semantic categories. A densely labelled 3D reconstruction has appli-cations in autonomous navigation [113, 135, 138] as well as in urban modelling[22, 88, 108]. Many applications require the 3D reconstruction and labelling tobe performed online: that is, using only sensory data that has been observed sofar by the platform, as opposed to offline labelling, in which the entire video se-quence may be loaded for post-analysis. We describe a novel method for online 3Dscene parsing of video that is nonparametric and easy to scale. Fig. 1.2 visualizesa sample result.In general, nonparametric algorithms in computer vision are expensive in termsof both memory and test time requirements. Nonparametric approaches are mem-ory intensive due to the need to store the large labelled database in memory, andtime intensive since this large database must be searched at test time. To enable ac-curate nonparametric scene parsing, millions of local feature descriptors may needto be efficiently stored and searched. Hence, an important question demands ourattention: how can we search “big data” efficiently while maintaining economy ofstorage?One highly successful approach to efficient large-scale retrieval is compact bi-nary encoding (or hashing) [43, 74, 95, 114, 143]. The idea is to encode eachdatabase feature vector into a short binary signature (or binary code) using a gen-erated projection matrix. Given a novel feature vector as a query, the query issimilarly projected into a binary code and compared to the binary codes in thedatabase. Binary encoding is time-efficient because binary code distances can becomputed quickly using bit operations. In particular, the Hamming distance be-twen two binary codes can be computed by taking an XOR and counting the ones,which is an operation supported directly in modern hardware. Binary encoding isalso memory-efficient because the generated binary codes are typically at least anorder of magnitude more compact to store than the original features. We proposetwo new methods for large-scale search based on binary encoding. The bank ofrandom rotations method is data-independent and does not require re-training asthe database grows over time. The supervised sparse projections method specif-ically targets search in high dimensions and enables faster encoding at test time.7We demonstrate the effectiveness of these methods on standalone retrieval datasets,and also study their integration into our scene parsing framework.In summary, this thesis makes the following contributions:• A novel nonparametric algorithm for 2D scene parsing of images. In contrastto traditional nonparametric scene parsing methods that perform matchingon the level of pixels or superpixels, we obtain higher labelling accuracy byintegrating information from overlapping object proposals.• A novel nonparametric algorithm for online 3D scene parsing of video. Ourmethod constructs a semantically labelled 3D reconstruction of the environ-ment in an online manner, building on the principles underlying our 2D sceneparsing work.• A demonstration of how deep learning features can be used in a nonpara-metric label transfer setting for the first time, and an experiment providinga positive test case that it is the learned features rather than other aspects ofdeep learning that are most important for scene parsing performance.• Two novel binary encoding algorithms for large-scale nonparametric search.The first algorithm is data-independent and easily scalable to new exam-ples and categories. The second algorithm is supervised and targets high-dimensional feature vectors such as the proposal descriptors in our sceneparsing framework.8Chapter 2Nonparametric 2D Scene Parsingof ImagesIn this chapter, we are interested in the task of assigning a semantic class label toevery pixel in a single RGB image (Fig. 1.1). We would like to explain the entireimage, including both foreground objects of interest and background elements.Foreground objects are typically compact objects with well-defined boundaries,sometimes referred to as ‘things’ in the literature [52]; examples include cars andpeople. Background elements are often amorphous in shape or defined by texture,sometimes referred to as ‘stuff’; examples include grass and mountain.With few exceptions, the dominant paradigm today for approaching recogni-tion problems in computer vision is that of parametric, discriminative classifica-tion (see Section 1.2 for a discussion). For example, we can learn discriminativeclassifiers to recognize airplanes or motorbikes in images, or to distinguish a treepixel from a pixel of a car, road, or building. On the other hand, a nonparamet-ric, search-based approach to scene parsing attempts to label the query image bymatching parts of the image to similar parts in a large database of labelled images.Typically, the problem is formulated within a Markov random field framework inwhich unary potentials are computed by nearest-neighbor search. Intuitively, labelsare ‘transferred’ from exemplars in the database to the query image.State-of-the-art nonparametric scene parsing algorithms perform label transferat the level of pixels or superpixels [26, 82, 91, 118, 127]. Superpixels offer a9higher level of perceptual abstraction than pixels, enabling more efficient inferencein a random field model. Large, cohesive groups of pixels can be labelled at once.However, while superpixel-based methods tend to perform well on large regions ofbackground (‘stuff’) categories, they tend to perform less effectively on foregroundobject (‘thing’) categories. We argue that, though better than pixels, superpixels arestill a relatively low-level unit for nonparametric label transfer. Superpixels tendto fragment objects. Is it possible to perform label transfer at a higher level oforganization?This chapter presents a novel nonparametric algorithm, called CollageParsing,for 2D scene parsing of images. In contrast to conventional nonparametric sceneparsing methods that perform matching on the level of pixels or superpixels, we ob-tain higher labelling accuracy by integrating information from overlapping objectproposals [1, 151]. Object proposals provide a higher level of perceptual organiza-tion than superpixels, and unlike superpixels are designed to preserve entire objectsinstead of fragmenting them. Performing label transfer using object proposals en-ables the construction of a more effective Markov random field unary potential thanprevious approaches. On a standard benchmark [82] consisting of outdoor scenesfrom the LabelMe database [112], CollageParsing obtains state-of-the-art perfor-mance with 15 to 19% higher average per-class accuracy than recent nonparametricscene parsing algorithms.2.1 Related work2.1.1 Markov random fields in scene parsingMarkov random fields (MRFs) have been applied in a broad range of computervision tasks, and have seen widespread adoption among methods for scene parsing.An MRF is a probabilistic graphical model that is often used to solve discretelabelling problems. We can think of an MRF as defining an energy (or cost) overa graph of random variables reflecting how good or bad a particular labelling is.Each random variable can be assigned a particular discrete label. The MRF energyis composed of potential functions defined over cliques of conditionally dependentrandom variables, typically singletons and pairs (unary and pairwise potentials,10respectively). A conditional random field (CRF) is an MRF in which the potentialsare conditioned on observation variables.Scene parsing lends itself naturally to formulation in an MRF framework, withrandom variables corresponding to image pixels or superpixels, and assignments tothose random variables producing an image labelling. Shotton et al.’s TextonBoostframework [115] consists of a CRF with shape-texture, color, and location unarypotentials and contrast-sensitive pairwise potentials. The shape-texture unary po-tential uses a boosted classifier of rectangular filter responses on texton maps.Rabinovich et al. [103] incorporated object class co-occurrences in the pairwisepotential term of a CRF model. Kohli et al. [66] introduced a higher-order poten-tial that encourages pixels within a segment to share the same label. The authorsshowed that approximate inference in a CRF augmented with their higher-orderpotential can be performed efficiently using graph cuts. Ladicky´ et al. [77] pro-posed a hierarchical CRF model in which potentials at multiple spatial scales arecombined; scales in the hierarchy correspond to pixels and superpixels at varyinggranularities. Later, Ladicky´ et al. [78] also introduced a higher-order potentialover label co-occurrences that can be efficiently approximately minimized usinggraph cuts. The potential is designed to be invariant to the number of pixels as-signed to a label and to encourage parsimony. Boix et al. [12] proposed a two-levelhierarchical CRF model in which the global node in the hierarchy can be assigneda set of labels instead of a single label. “Harmony” potentials between local nodesand the global node encourage local nodes to take on labels in the global node’s setof labels. A thorough treatment of MRFs in computer vision can be found in thesurvey of Wang et al. [139].Our proposed CollageParsing algorithm is formulated in an MRF framework,and introduces a novel and effective computation of unary potentials by nonpara-metric label transfer of overlapping object proposals.2.1.2 Nonparametric scene parsingLiu et al. [82] developed the first nonparametric, search-based approach to sceneparsing. Given a query image, Liu et al.’s ‘label transfer’ algorithm finds the im-age’s nearest neighbors in a database of scenes annotated with semantic labels,11warps the neighbors to the query image using SIFT Flow [83], and transfers the an-notations from the neighbors to the query image using an MRF defined over pixels.Tighe and Lazebnik’s landmark SuperParsing algorithm [127] builds an MRF oversuperpixels instead of pixels. Label transfer is performed by matching superpixelsin the query image with similar superpixels in the query’s nearest neighbor images.The nonparametric label transfer can be combined with additional parametric ge-ometry classification (sky, vertical, horizontal) to improve labelling consistency.Since the introduction of SuperParsing, other effective superpixel-based non-parametric scene parsing algorithms have been developed. Eigen and Fergus [26]learned weights for each descriptor in the database in a supervised manner to re-duce the influence of distractor superpixels, and augmented the retrieved set ofneighbor superpixels with examples from rare classes with similar local context.Myeong et al. [91] applied link prediction techniques to superpixels extracted fromthe query image and its nearest neighbors to learn more effective pairwise poten-tials. Singh and Kosˇecka´ [118] applied a locally adaptive nearest neighbor tech-nique to retrieve similar superpixels, and refined the retrieval set of query imageneighbors by comparing spatial pyramids of predicted labels.Instead of computing the set of nearest neighbor images at test time, the Patch-MatchGraph method of Gould and Zhang [44] builds offline a graph of patch cor-respondences across all database images. Patch correspondences are found usingan extended version of the PatchMatch algorithm [8] with additional “move” typesfor directing the local search for correspondences.CollageParsing belongs to this general family of nonparametric scene parsingapproaches, and we compare to this prior body of work in the experiments.2.1.3 Parametric scene parsingBesides the traditional MRF approaches in Section 2.1.1, many recent success-ful methods employ parametric strategies. Farabet et al. [28] developed a sceneparsing algorithm combining several deep learning techniques. Dense multi-scalefeatures are computed at each pixel and input to a trained neural network to obtainfeature maps. Feature maps are aggregated over regions in a hierarchical segmen-tation tree. Regions are classified using a second neural network and pixels are12finally labelled by the ancestor region with the highest purity score. Tighe andLazebnik [128] extended the SuperParsing algorithm [127] with per-exemplar de-tectors (Exemplar-SVMs [86]). A superpixel-based data term is the same as inSuperParsing. A detector-based data term is obtained by running the per-exemplardetectors of class instances found in the retrieval set and accumulating a weightedsum of the detection masks. The final unary potential over pixels is determinedby another SVM trained on the two data terms. Yang et al. [146] also proposeda hybrid parametric superpixel-based algorithm. The retrieved set of neighbor su-perpixels is augmented with a fixed (query-independent) set of examples of rareclasses. Rare class examples are extracted offline in a training phase; a class isconsidered rare if it falls under the 20% ‘tail’ in the overall distribution of classesin the database. Superpixel unary potentials are determined by a combination ofnon-parametric matching and a parametric SVM classifier trained on the rare classexamples. Predicted labels are used to refine the retrieval sets of query imageneighbors and superpixel neighbors in a second pass through the labelling pipeline.While the characteristics and assumptions of parametric approaches are differ-ent from ours (see discussion in Section 1.2), we include parametric baselines inthe experimental comparisons for completeness.2.1.4 Segmentation methods with overlapping regionsPantofaru et al. [100] combined multiple image segmentations to create a mapof intersections-of-regions. Each intersection-of-regions is classified by averagingover the category predictions of a learned region-based classifier on its constituentregions. Li et al. [80] also formed intersections-of-regions (referred to as super-pixels in their formulation) using multiple image segmentations. Intersections-of-regions are aggregated into objects by maximizing a composite likelihood ofpredicted object overlap with segments. In contrast, CollageParsing is formulatedin an MRF framework and operates on high-level object proposals instead of in-tersections of regions generated by low-level image segmentation. CollageParsingis also a nonparametric method, while Pantofaru et al. [100] and Li et al. [80] areparametric methods.Kuettel and Ferrari [73] proposed a figure-ground (foreground-background)13segmentation algorithm that transfers foreground masks from the training imagesby matching object proposals. Matching is independent of the global similarity be-tween the test and training images. The transferred foreground masks are used toderive unary potentials in a CRF. CollageParsing performs multi-class scene pars-ing instead of figure-ground separation, and specifically matches windows fromglobally similar images, which form the retrieval set. The retrieval set adds con-text that is important for multi-class labelling and enables scalable label transferin large databases. The label transfer in CollageParsing is nonparametric, whileKuettel and Ferrari [73] iteratively learns a mixture model of appearance to refinethe figure-ground segmentation.2.1.5 Scene collageIsola and Liu [58] proposed a similarly named ‘scene collage’ model and exploredapplications in image editing, random scene synthesis, and image-to-anaglyph. Incontrast to CollageParsing, the scene collage is an image representation: it repre-sents an image by layers of warped segments from a dictionary.2.1.6 Object proposalsDetecting objects in images can be computationally expensive. A traditional slid-ing window approach moves a window across an image and evaluates a trainedobject classifier within each window [23, 31]. The process is repeated at multi-ple scales and aspect ratios. Unfortunately, the sliding window approach spendsmany computation cycles evaluating windows that have little chance of containingobjects – patches of background texture, for example. As a result, sliding win-dow detectors can be slow, limiting their feasibility for real-time applications orrestricting the complexity of the underlying trained classifier.Object proposal algorithms have been developed to address this inefficiency[1, 3, 16, 20, 27, 70, 87, 136, 151]. Object proposal algorithms attempt to extracta small number of candidate windows (on the order of thousands) to be evaluatedby the trained object classifier. Object proposal algorithms are designed such thatthe proposal windows have high probability of containing objects of any category.Object proposals are a key component in leading large-scale object detection algo-14rithms such as R-CNN [41] and its extensions [40, 107].Alexe et al. [1] first defined the ‘objectness’ of an image window as the like-lihood that the window contains a foreground object of any kind instead of back-ground texture such as grass, sky, or road. Objects often have a closed boundary, acontrasting appearance from surroundings, and/or are unique (salient) in the image.Alexe et al. proposed an objectness measure integrating multiscale saliency, colorcontrast, and superpixels straddling cues. The multiscale saliency cue is a mul-tiscale adaptation of Hou and Zhang’s visual saliency algorithm [56]. The colorcontrast cue measures the difference between the color histograms of the windowand its surrounding rectangular ring. The superpixels straddling cue measures theextent to which superpixels straddle the window (contain pixels both inside andoutside the window). Windows that tightly bound an object are likely to have lowstraddling.Since the introduction of ‘objectness’ many effective methods have been pro-posed for generating object proposals. Van de Sande et al. [136] proposed selectivesearch, in which object proposals are generated using a hierarchical segmentationprocess that greedily merges similar segments until a single segment remains. Allgenerated segments, or their bounding boxes, become candidate windows. Multi-ple segmentations are performed in color spaces with different sensitivities to shad-ows, shading, and highlights. Arbelaez et al.’s multiscale combinatorial grouping[3] also uses hierarchical segmentation. Multiscale combinatorial grouping gen-erates proposals by considering singletons, pairs, triplets, and 4-tuples of regionsgenerated by a multiscale hierarchical segmentation algorithm. Proposals are fur-ther refined using a random forest regressor trained on size, location, shape, andcontour strength features. Manen et al. [87] generated object proposals by growingrandom partial spanning trees on a graph defined over superpixels. Edge weightsencode the probability of two adjacent superpixels belonging to the same object.The trees are grown using a randomized version of Prim’s algorithm that replacesgreedy edge selection with weighted sampling based on edge weights, and adds arandomized stopping condition.Another interesting approach to generating object proposals involves solvingmultiple figure-ground (foreground-background) segmentation problems from dif-ferent initial foreground seeds. Carreira and Sminchisescu [16] performed multiple15foreground-background segmentations from initial seeds on a regular grid usingcolor features. A random forest regressor trained on mid-level appearance featuresis used rank the proposals. Endres and Hoiem [27] selected initial seeds from ahierarchical segmentation and estimated the affinities between superpixels usingmultiple learned classifiers. The classifiers predict whether the region and seedlie on the same object; whether a region lies on the left, right, top, or bottom ofan object; and whether a region is homogeneously foreground or background. Astructural learning method is used to rank the proposals based on overlap and ap-pearance features. Kra¨henbu¨hl and Koltun [70] computed geodesic distances be-tween foreground seeds and background superpixels. Seeds can be chosen heuris-tically or using a learned classifier. Geodesic distances refer to shortest paths overa weighted graph of superpixels, in which edges are weighted by the probability ofboundary between superpixels. Proposals are derived from critical level sets in thegeodesic distance map.Superpixel-based proposal methods are often sensitive to small changes in theimage content [55]. Some approaches forego superpixel segmentation entirely.Cheng et al. [20] developed a fast method for generating object proposals based onimage gradients. Given an image window, gradient norms are aggregated in an 8x8grid and the resulting 64-dimensional feature is scored using a learned linear SVM.The computation can be accelerated by binarizing the linear model and normedgradient features, allowing scores to be evaluated via fast bit operations. Zitnickand Dolla´r [151] introduced the Edge Boxes algorithm, which generates objectproposals using edge cues: the objectness score of a window is determined bythe number and strength of edges wholly contained within the window. A faststructured forest method [25] is used for edge detection.Our CollageParsing algorithm makes use of object proposals, and a key noveltyis the computation of unary potentials by nonparametric label transfer on overlap-ping object proposals. As described in Section 2.2.3, we compute object proposalsusing the Edge Boxes algorithm [151].16DatabaseForm retrieval set using global image featuresGenerate object proposal windowsCompute unary potentials ψ by label transferPerform MRF inferenceQueryFigure 2.1: High-level overview of the CollageParsing scene parsingpipeline. CollageParsing constructs a Markov random field over imagepixels and performs nonparametric label transfer at the level of objectproposals. Object proposals are designed to preserve objects instead offragmenting them. Each of the four components of the pipeline (blueboxes) is described in Sections 2.2.2 to 2.2.5.2.2 Algorithm description: CollageParsingCollageParsing follows a common nonparametric scene parsing pipeline, in whichan MRF is constructed and nonparametric label transfer is used to determine unarypotentials. The CollageParsing pipeline is illustrated in Fig. 2.1. First, a shortlist of the query image’s nearest neighbors in the database is formed using globalimage features (Section 2.2.2). The short list is referred to as the retrieval set.Next, object proposal windows are extracted from the query image (Section 2.2.3).Unary potentials are then computed by matching these windows to similar objectproposal windows in the retrieval set (Section 2.2.4). The unary potentials arecombined with pairwise potentials, and approximate MRF inference is performedto obtain a pixelwise labelling of the image (Section 2.2.5). We start by defining theMRF model (Section 2.2.1) and then proceed to explain each stage of the pipelinein turn.2.2.1 Markov random field modelWe adopt a Markov random field G = (V, E) defined over a dense pixel lattice,where V is a set of nodes and E is a set of undirected edges. Let Xi be the ran-dom variable associated with node i ∈ V . Xi can take on a label from the labelset L = {l1, l2, ..., lk}, corresponding to the vocabulary of semantic category la-17bels. Denote by xi a realization or assignment of random variable Xi, and byx ∈ L|V| the assignments of all random variables; that is, x represents a completelabelling of the image. A clique c is a fully connected set of nodes in G. By theHammersley-Clifford theorem, the probability distribution p(x) over labellings isa Gibbs distribution and can be expressed asp(x) =1Zexp(−∑cψc(xc))(2.1)where ψc(xc) are potential functions defined over cliques and Z is the normalizingfactor or partition function. Let E(x) denote the energy of the MRF, defined asE(x) =∑cψc(xc) (2.2)Then the maximum a posteriori (MAP) labelling is given byx∗ = argmaxx∈L|V|p(x) = argminx∈L|V|E(x) (2.3)MRFs in computer vision are often defined over unary and pairwise cliques becauseof the availability of efficient approximate inference techniques. Following thismodelling practice, we arrive at an MRF energy composed of unary and pairwisepotentials:E(x) =∑i∈Vψi(xi) +∑(i,j)∈Eψij(xi, xj) (2.4)The main novelty of CollageParsing is in the computation of the unary potentialsby nonparametric label transfer of overlapping object proposals, which will be de-scribed in detail in Section 2.2.4.2.2.2 Forming the retrieval setThe standard first step in nonparametric scene parsing is the retrieval of databaseimages that are contextually similar to the query image. In addition to filteringout semantically irrelevant database images that are likely to be unhelpful, a smallretrieval set makes nearest neighbor based label transfer practical on large datasets.18Figure 2.2: Examples of image windows with high ‘objectness’, as computedusing Zitnick and Dolla´r’s Edge Boxes method [151]. In each image,windows with the top ten objectness scores are shown. Images are fromthe SIFT Flow dataset [82].This coarse filtering step is typically performed using fast global image descriptors.CollageParsing compares the query image to the database images using Gist [97]and HOG visual words [82, 145]. The database images are sorted by similarity tothe query image with respect to these two features, and the K best average ranksare selected as the retrieval set.An assumption made in this step is that the retrieval set is sufficient for accurateparsing of the query image. However, the coarse filtering could potentially discardrelevant exemplars, leading to poor matching during label transfer. Rare categoriesare at particular risk of being under-represented in the pool of exemplars. We studythe effect of retrieval set size on accuracy later in the experiments (Section 2.3).2.2.3 Computing object proposal windowsCollageParsing performs label transfer at the level of object proposal windows.Since the original ‘objectness’ work of Alexe et al. [1], many algorithms have beendeveloped for generating object proposals (Section 2.1.6). In this chapter, we adopta state-of-the-art method called Edge Boxes.Introduced by Zitnick and Dolla´r [151], the Edge Boxes algorithm generatesobject window proposals using edge cues alone. The objectness score of a windowis determined by the number and strength of edges wholly contained within thewindow. A fast structured forest method [25] is used for edge detection. Candidate19windows for score evaluation are generated by a sliding-window search over posi-tion, scale, and aspect ratio, with the step size parameterized by the target intersec-tion over union (IoU). High-scoring windows are further refined by greedy iterativesearch. Finally, a non-maximal suppression step parameterized by the target IoU isperformed. We use the publicly available implementation of Edge Boxes with thedefault parameter values recommended in Zitnick and Dolla´r [151]. Fig. 2.2 showsexamples of proposals computed using the Edge Boxes algorithm.2.2.4 Computing unary potentialsCollageParsing computes unary potentials by performing nearest-neighbor match-ing on object proposal windows and transferring the category labels. Comparedwith previous nonparametric scene parsing methods, label transfer is performedat a higher level of perceptual organization designed to preserve the integrity ofobjects.This matching requires a way to compare object proposal windows in the queryimage with object proposal windows in the retrieval set. To this end, we generatedeep convolutional neural network features using the VGG CNN-M-1024 network,which was demonstrated to provide high-quality features for object classificationin the experimental evaluation of Chatfield et al. [19]. To obtain the feature de-scriptor for an object proposal window, we resize the cropped image window tothe canonical size required by the convolutional neural network, perform a forwardpass through the network, and take the 1024-dimensional response. We use Chat-field et al.’s CNN-M-1024 network pre-trained on ImageNet as an off-the-shelffeature extractor, without any fine-tuning on our scene parsing images. We ap-pend the scaled, normalized spatial coordinates of the window centroid to producea final 1026-dimensional feature descriptor f , following the common practice ofspatial coding in object recognition [11, 90, 134], which encourages matches tocome from spatially similar regions in the respective images. To our knowledge,this represents the first use of deep features in a nearest neighbor search setting forscene parsing.Let {W1,W2, ...Wm} denote the set of object proposal windows in the queryimage. For each window, we search for its closest object proposal window in20the retrieval set by matching the region descriptors f . Each object proposal win-dow in the database has an associated ground-truth 2D label field L appropriatelycropped from the labelled database images. The search therefore produces a setof label fields {L1, L2, ..., Lm}. We warp these retrieved label fields to matchthe sizes of the query proposal windows, and denote the resized label fields by{L′1, L′2, ..., L′m}.Next, let {M1,M2, ...,Mm} denote label masks constructed from {L′1, L′2, ...,L′m}. Each mask Mj is the size of the query image and is zero (null) everywhereexcept within the corresponding object proposal window Wj , where it is populatedwith the values of L′j .The unary potential associated with assigning a category label l to pixel i isthen given byψi,l(xi) = −m∑j=1δ[Mj(i) = l]φ(l) (2.5)where δ[·] is the indicator function and Mj(i) is the value of Mj at location i,which can be a category label or null. The term φ(l) is a weight that is inverselyproportional to the frequency of category label l in the retrieval set:φ(l) =1freq(l)γ(2.6)where freq(l) denotes the number of pixels with category label l in the retrievalset. Eq. 2.6 performs a softened inverse document frequency (IDF) re-weightingto account for differences in the frequency of categories. The constant γ controlsthe strength of the penalty given to high frequency categories, and as we show laterin the experiments, influences the tradeoff between overall per-pixel and averageper-class accuracy. After computing ψi,l for all categories and pixels in the queryimage, all values are rescaled to be between -1 and 0.Instead of constructing the label masks {M1,M2, ...,Mm}, the unary potentialcomputation is implemented more efficiently in practice by the accumulation ofunary factors [76]. Each of the m unary factors has a spatial extent over pixelscorresponding to an object proposal window in the query image. The jth factoradds a single vote to each pixel within its spatial extent, for the corresponding21QueryRetrieval setFind best matchwith labellingAccumulated evidence Transfer evidencebuildingcarroadsidewalkFigure 2.3: Visualization of the computation of the unary potential ψi by la-bel transfer. The contribution of one object proposal window in thequery image is depicted. The steps are explained in Section 2.2.4.category label in L′j . Figure 2.3 illustrates matching and label transfer for a singleobject proposal window Wj . Conceptually, the unary potential is accumulated inan overlapping, collage-like manner by transferring the category labels from themost similar object proposal windows in the retrieval set.Fig. 2.4 visualizes the unary potential for three example query images from theSIFT Flow dataset [82] after allm region proposals have been processed. Saturatedcolors represent higher values of ψi,l.2.2.5 Performing MRF inferenceThe unary potentialψi(xi) is combined in (2.4) with a pairwise potentialψij(xi, xj)defined over pairs of adjacent pixels. In general, the pairwise potential in an MRFimposes a smoothness prior in the labelling. We adopt the same pairwise potentialterm as in SuperParsing [127], which is based on the co-occurrences of categorylabels:22Figure 2.4: Visualization of the unary potential for three sample query im-ages. Top row, from left to right: building, car, plant, sidewalk, and skycategories. Middle row, from left to right: building, car, fence, person,and road categories. Bottom row, from left to right: building, car, road,sky, and tree categories.ψij(xi, xj) = −λ log[(P (xi|xj) + P (xj |xi))/2]δ[xi 6= xj ] (2.7)where λ is the MRF smoothing constant. Intuitively, this pairwise potential termbiases the labelling towards category transitions that are more frequently observed.We minimize the MRF energy (2.4) approximately using α/β-swap, a standardgraph cuts technique [13, 14, 67]. A non-metric version of α-expansion [110]could also be applied for faster inference. We report α/β-swap inference here tobe consistent with our journal publication [132], and because we will adopt anentirely different and more powerful random field model and inference algorithmwhen we extend the CollageParsing framework to 3D scene parsing of video in thenext chapter.Finally, the labelling is spatially refined in a post-processing step by aligningthe labels to the query image’s superpixels. All pixels within a superpixel areassigned the most common (mode) label in the superpixel. Following SuperParsing[127], superpixels are extracted using the graph-based method of Felzenszwalb and231001,00010,000100,0001,000,00010,000,000100,000,000skybuildingmountaintreeroadseafieldgrassriverplantcarsandrocksidewalkwindowdesertdoorbridgepersonfencebalconystaircaseawningcrosswalksignstreetlightboatpolebussuncowbirdmoonLabel frequency Figure 2.5: Frequency counts of the semantic categories in the SIFT FlowdatasetHuttenlocher [32]. Post-processing produces more visually pleasing results and hasonly a small effect on labelling accuracy.2.3 ExperimentsWe performed experiments on the SIFT Flow dataset [82], a scene parsing bench-mark consisting of 200 query images and 2,488 database images from LabelMe.Images span a range of outdoor scene types, from natural to urban. Pixels arelabelled with one of 33 semantic categories. The label frequencies are shown inFigure 2.5.Table 2.1 shows the experimental results and a comparison with recent non-parametric and parametric approaches on this dataset. We set the retrieval set sizeK = 400, γ = 0.4, and the MRF smoothing parameter λ to 0.01. We investigatethe effect of these parameters on the algorithm performance later in this section.Both the overall per-pixel accuracy and the average per-class accuracy are reported.Average per-class accuracy is a more reliable measure of how well the algorithmperforms across different categories and not just on the most commonly occurringones.By performing label transfer at the level of object proposal windows, Col-lageParsing obtains state-of-the-art labelling results on the benchmark. An earlyversion of CollageParsing using traditional HOG features and Alexe et al.’s object-24Table 2.1: Overall per-pixel and average per-class labelling accuracy on theSIFT Flow dataset [82]Per-pixel Per-classNonparametric scene parsing methodsLiu et al. [82] 76.7 -Gould and Zhang [44] 65.2 14.9Tighe and Lazebnik [127] 77.0 30.1Myeong et al. [91] 77.1 32.3Eigen and Fergus [26] 77.1 32.5Singh and Kosˇecka´ [118] 79.2 33.8CollageParsing 79.9 49.3Parametric scene parsing methodsTighe and Lazebnik [128] 78.6 39.2Farabet et al. [28], “natural” 78.5 29.6Farabet et al. [28], “balanced” 74.2 46.0Yang et al. [146] 79.8 48.7ness [1] already obtained a 7 to 11% improvement in average per-class accuracyover superpixel based nonparametric approaches [26, 91, 118, 127]. By using off-the-shelf deep features and the Edge Boxes method, the current version of Col-lageParsing realizes gains of 15 to 19% average per-class accuracy in comparisonto recent nonparametric approaches.Labelling performance is also comparable to or better than recent paramet-ric methods, while not requiring expensive model training. For example, on theLM+SUN dataset (45,000 images with 232 semantic labels), just the per-exemplardetector component of Tighe and Lazebnik [128] requires four days on a 512-nodecluster to train. On a dataset of 715 images with eight semantic labels [45], Farabetet al. [28] requires “48h on a regular server” to train. The bulk of our computationtime comes from feature extraction. Once features are computed, matching objectproposal windows and approximately solving the Markov random field by graphcuts takes 10s in Matlab on a desktop. On average, 855 content-adaptive windowsare generated per database image using Edge Boxes, and the nearest-neighbor win-dow matching and label transfer account for 8 of those 10 seconds.250.450.460.470.480.490.500.780.790.800.81100 200 300 400 500 600 700 Average per-class accuracy ▲ Overall per-pixel accuracy ● Retrieval set size K 0.430.440.450.460.470.480.490.500.760.770.780.790.800.81-3 -2.5 -2 -1.5 -1 -0.5 Average per-class accuracy ▲ Overall per-pixel accuracy ● log10 λ 0.430.440.450.460.470.480.490.500.510.780.790.800.810.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46Average per-class accuracy ▲ Overall per-pixel accuracy ● γ Figure 2.6: Effect of varying the retrieval set size K, the MRF smoothingparameter λ, and γ on overall per-pixel accuracy and average per-classaccuracy on the SIFT Flow dataset.Results are comparable to the hybrid parametric approach recently proposedby Yang et al. [146], which achieves a large increase in average per-class accuracy(nearly 18% compared to the authors’ baseline system) by augmenting the retrievalset with examples from rare classes to improve rare class labelling. Interestingly,this idea of ‘rare class expansion’ is complementary to our work and could be aninteresting direction for future investigation. However, from a nonparametric de-sign perspective, ‘rare class expansion’ would make CollageParsing more difficultto scale because as new images or categories are added to the database, the auxil-iary set needs to be re-composed – both the selection of ‘rare’ categories and theexamples themselves.Figure 2.6 shows the effect of varying γ, the retrieval set size K, and the MRFsmoothing parameter λ on the overall per-pixel and average per-class accuracy.Similar to Tighe and Lazebnik [127], we observed that the overall per-pixel accu-racy drops when the retrieval set is too small for sufficient matches, but also when261 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 mountainskytreeunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 fieldskytreeunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 buildingcarfencemountainpersonroadsidewalkskyunlabelled (ground truth only)Figure 2.7: Examples of scene parsing results with and without the spatial re-finement post-processing step. From left to right: query image, labellingwithout post-processing, labelling with post-processing.the retrieval set becomes large, confirming that the retrieval set performs a filteringrole in matching query windows to semantically relevant database images. Theconstant γ controls the strength of the penalty given to high frequency categories.As γ increases, evidence for high frequency categories is discounted more heav-ily, and labelling is biased towards rarer categories. As reflected in the figure, thistends to increase the average per-class accuracy but at the expense of overall per-pixel accuracy. The post-processing step of spatial refinement using superpixelsproduces more visually pleasing results and has only a small effect on accuracy:about 1% overall per-pixel accuracy and 0.1% average per-class accuracy. Fig-ure 2.7 shows some examples of scene parsing results with and without superpixelpost-processing.It can be difficult to interpret small differences in per-pixel accuracy betweenscene parsing alternatives. When ground truth labellings are crowd-sourced froma broad pool of users, with varying experiences and pre-conceptions about whatconsitutes a correct labelling, it may be unclear if such differences are significant.To address this common limitation in scene parsing evaluation, we additionally ex-plored qualitative scene parsing results. Figure 2.8 shows some qualitative scene27Table 2.2: Overall per-pixel and average per-class labelling accuracy on theLM+SUN dataset [127]Per-pixel Per-classNonparametric scene parsing methodsTighe and Lazebnik [127] 54.9 7.1CollageParsing 60.8 19.3Parametric scene parsing methodsTighe and Lazebnik [128] 61.4 15.2Yang et al. [146] 60.6 18.0parsing results, including query images, system predicted labellings for both Su-perParsing [127] and CollageParsing, and ground truth labellings. We found Col-lageParsing to perform well in a wide range of outdoor scenes, from natural (top) tourban (bottom) environments. In comparison to SuperParsing [127], CollagePars-ing more reliably detects smaller and rarer categories, such as the streetlight andsign in the fourth row, the staircase in the fifth row, and the pedestrians and fencein the sixth row.We performed further experiments on a second, larger dataset. The LM+SUNdataset [127] consists of 500 query images and 45,176 database images from theLabelMe [112] and SUN datasets [145]. The dataset is complementary to SIFTFlow in that it contains both indoor and outdoor scenes, while indoor scenes arenot represented in SIFT Flow. LM+SUN is annotated with a broad vocabulary of232 semantic categories.Table 2.2 shows evaluation results and a comparison with recent nonparametricand parametric approaches. Since LM+SUN contains a larger number of imagesand categories, we increased the retrieval set size K to 1600. The trade-off of γis shown in Fig. 2.9. LM+SUN is a challenging benchmark but CollageParsingobtains encouraging results. An improvement of 6% per-pixel and 12% averageper-class accuracy is obtained relative to SuperParsing. Compared with state-of-the-art parametric methods, a modest gain of 1% average per-class accuracy isachieved (or 7% relative). Inference on LM+SUN is more time consuming than281 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 mountainseaskysununlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 mountainskytreeunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.202.60.81 carfencemountainroadskytreeunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 buildingbuscarroadsidewalksignskystreetlighttreeunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 buildingcarpersonplantroadsidewalksignskystaircasestreetlighttreeunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 buildingcarfencemountainpersonroadsidewalkskyunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 buildingcarroadsidewalkskytreeunlabelled (ground truth only)Figure 2.8: Examples of scene parsing results on the SIFT Flow dataset.From left to right: query image, ground truth labelling, SuperParsing[127] predicted labelling, CollageParsing predicted labelling.290.160.170.180.190.200.210.590.600.610.620.10 0.12 0.14 0.16 0.18 0.20 0.22Average per-class accuracy ▲ Overall per-pixel accuracy ● γ Figure 2.9: Varying γ to trade off overall per-pixel accuracy and average per-class accuracy on the LM+SUN dataseton SIFT Flow because of the increased retrieval set size and number of categories.Once features are computed, matching and inference take approximately 3 minutesin Matlab on a desktop, which is about half the runtime of Yang et al. [146] on thisdataset. No model training is required.Figure 2.10 shows additional qualitative results on indoor scenes, which arenot represented in the SIFT Flow dataset. Similar improvements can be seen onthe indoor scenes as on the outdoor scenes presented earlier: in comparison to Su-perParsing, CollageParsing is again more accurate for smaller and rarer categories,as well as overall.301 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 coffee markercounter topcupboarddoorfloorglassmicrowavepotstovetableutensilwallunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 bedbookshelfbottlecarceilingchairfloorpaintingpersonpicturesofatablewallwindowunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 benchcarceilingchairfloorpaintingpersonpicturesnowtablewallwindowunlabelled (ground truth only)1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2−1−0.8−0.6−0.4−0.200.20.40.60.81 benchcarceilingchairfloorpaintingpersonpicturesnowtablewallwindowunlabelled (ground truth only)Figure 2.10: Examples of indoor scene parsing results on the LM+SUNdataset. Top left: query image; top right: ground truth labelling; bot-tom left: SuperParsing [127] predicted labelling; bottom right: Col-lageParsing predicted labelling31Chapter 3Nonparametric 3D Scene Parsingof VideoIn this chapter, we extend CollageParsing to 3D scene parsing of video. We areinterested in semantically reconstructing the 3D scene imaged by a moving camerain an online manner (Fig. 1.2). Our proposed algorithm is online in the sense that ituses only sensory data that has been observed so far by the platform, as opposed tooffline semantic reconstruction, in which the entire video sequence may be loadedfor post-analysis. Online operation is required in many applications, includingself-driving vehicles and mobile robots [113, 135, 138].As in 2D scene parsing, the dominant contemporary paradigm for performing3D semantic reconstruction is that of parametric, discriminative classification (seeSection 1.2 for a discussion). This chapter presents, to our knowledge, the firstmethod to approach 3D semantic reconstruction non-parametrically. Our method,called MF3D (Model-Free 3D), approaches 3D voxel labelling via search-basedlabel transfer. There is no explicit appearance model or trained classifier to decidewhether a 3D voxel should be labelled a particular class versus others. Instead,MF3D searches over a database of labelled examples to determine how to best la-bel a novel 3D voxel. MF3D uses several state-of-the-art techniques to obtain richrepresentations of the local scene structure, and applies an intuitive label transferoperation similar to CollageParsing to accumulate evidence for labels representedin the database. This evidence is combined in a principled way via dense con-32ditional random field (CRF) inference [69] to obtain a dense, online semantic la-belling of the scene in 3D. We demonstrate that our model-free approach enablesaccurate online 3D scene parsing while retaining scalability to new categories.3.1 Related workThe problem of building a dense, semantically-labelled, 3D reconstruction of theenvironment has generated broad research interest. Brostow et al. [15] were the firstto propose using structure from motion to inform scene parsing in video. Structurefrom motion is used to obtain sparse 3D point clouds, from which several simplefeatures are extracted and input to a trained random forest classifier. The authorsdemonstrated that these structure and motion features can be combined with ap-pearance features to obtain more accurate results than either feature type by itself.Sturgess et al. [124] combined Brostow et al.’s structure from motion features withappearance features in a higher-order CRF model to perform scene parsing of roadscenes in video.Instead of using structure from motion as a source of complementary features,other techniques integrate 3D information directly into the scene parsing frame-work. Xiao and Quan [144] developed an offline scene parsing algorithm for videocaptured by a laterally-moving camera (e.g. curb-facing camera on a moving ve-hicle). The authors construct an MRF over superpixels in the video, in whichsuperpixels from different frames are connected by pairwise potentials based onstructure from motion correspondences. Unary potentials are computed using aboosted classifier trained on 2D appearance, height, and 3D geometry features.Floros and Leibe [33] extended a hierarchical CRF scene parsing model [66, 77]to video by introducing new potentials for temporal consistency. 3D scene infor-mation is obtained by a point cloud reconstruction. A higher-order potential isintroduced that encourages projections of the same 3D point to be assigned thesame label. In addition, a 3D unary potential is added. Ha¨ne et al. [46] proposed ajoint 3D reconstruction and scene parsing framework that integrates class-specificgeometric priors based on surface normals. Sengupta et al. [113] obtained 3D se-mantic reconstructions of street scenes by fusing the output of 2D scene parsing[77] with 3D surface reconstruction from stereo images. Valentin et al. [135] de-33veloped a mesh-based method in which depth estimates from RGB-D cameras orstereo pairs are fused into a 3D mesh model and a CRF is defined over the meshfaces. Unary potentials are computed using a cascaded classifier trained on ge-ometry, appearance, and contextual features. The method is offline: the full 3Dmesh model is constructed once and then labelled. Riemenschneider et al. [108]improved the efficiency of offline scene parsing by removing view redundancy andreducing the number of sites requiring unary potential computation. Multi-viewreconstruction is used to obtain a 3D mesh model. For each triangle (face) in themesh, only the single predicted best view is used to compute the unary potential.Triangles are also ranked and unary potentials are computed at only the top tri-angles. Kundu et al. [76] combined sparse visual SLAM and 2D scene parsingin a 3D CRF framework to infer semantically-labelled 3D scene structure with amonocular camera. However, the projected 2D image labellings tend to be some-what blocky or voxelated. Vineet et al. [138] proposed an online 3D scene parsingmethod that combines large-scale 3D reconstruction with a novel handling of mov-ing objects. Voxels corresponding to moving object classes (e.g. cars and people)are allowed to update more quickly. A dense 3D CRF is constructed over voxels inthe current view. Unary potentials are computed by projecting 2D unary potentialsonto the voxels; 2D unary potentials are computed in image space using a randomforest trained on appearance and position features. Martinovic´ et al. [88] parsedbuilding facades in 3D by performing point cloud labelling using a random forest,inferring facade separations based on differences in the label structure, and enforc-ing weak architectural rules such as alignment and co-occurrence (e.g. a balconymust have at least one window above it).Similar to this past work, we perform scene parsing in 3D. However, all of thesemethods for 3D scene parsing involve the training of discriminative classifiers thatdecide whether a pixel or voxel should be labelled a particular category versusthe others in a pre-defined set of categories. For example, a classifier may betrained to distinguish a tree pixel versus a pixel of car, road, or building. To thebest of our knowledge, MF3D is the first method for 3D scene parsing of videothat is nonparametric. Our experiments confirm that a nonparametric search-basedapproach can produce accurate 3D labelling results while retaining extensibility tonovel categories.343D semantic scene parsing is distinct from 3D object detection in that the aimof scene parsing is to produce a fully labelled reconstruction, including both fore-ground objects and background texture categories. However, 3D object detectionmethods may employ similar reconstruction techniques: for example, Tateno et al.[126] proposed an object recognition framework that builds on top of KinectFusionreconstruction [92] and can localize pre-trained, fully 3D objects in clutter.While our focus is on video sequences, 3D semantic scene parsing can alsobe approached by other input modalities. Bla´ha et al. [10] proposed an adaptivevolumetric model for 3D scene parsing at a city scale based on aerial images. Theirvolumetric model is hierarchical and the resolution is refined only near surfaces.Given the point cloud scan of an entire building, Armeni et al. [4] parse the pointcloud into rooms and further parse the rooms into structural elements (e.g. walls,ceiling) and furniture elements.3.2 Algorithm description: MF3DFig. 3.1 illustrates the MF3D processing pipeline. MF3D takes as input a pair ofcalibrated stereo RGB images. From this stereo pair, a disparity map is calculatedand depths are estimated for each imaged scene point. The estimated depths areused to incrementally construct a 3D volumetric representation of the surroundings.The 3D reconstruction is stored and indexed efficiently using voxel hashing. Next,a nonparametric label transfer operation is performed in image space, in which cat-egory labels from a database of labelled 2D images are transferred to the currentimage by nearest neighbor matching. The label evidence is then projected into the3D scene to form the unary potentials of a 3D CRF defined over voxels. The CRFis solved in 3D using an efficient approximate inference algorithm to obtain thecurrent labelling of the 3D scene. The pipeline updates the labelled 3D reconstruc-tion as new observations arrive (in other words, in an online manner), and does notrequire knowledge of the entire video sequence. The following sections describeeach of these components in detail.35RGB frames(stereo pair)Compute camera poseCompute depth mapCompute image-space unary update by label transferUpdate 3D reconstruction (geometry)Update unary potentials in 3D reconstructionSolve 3D CRFFigure 3.1: Overview of the MF3D processing pipeline for 3D scene parsing.MF3D takes as input a pair of stereo RGB images, from which depthand the current camera pose with respect to the 3D reconstruction areestimated (Section 3.2.1). A label transfer operation is performed in im-age space and projected into the 3D reconstruction as a unary potentialsupdate. A 3D CRF, defined over visible voxels, is solved to obtain asemantic labelling of the scene in 3D (Sections 3.2.2 and 3.2.3).363.2.1 Preliminaries: 3D reconstructionThe base representation of our 3D scene model is a sparse volumetric represen-tation in which each allocated voxel stores a truncated signed distance function(TSDF) value [21] indicating the distance to the closest surface. The TSDF isan implicit representation of a 3D surface, which is captured as the zero level set(free space is represented as positive values and occupied space as negative values).Given a camera pose, a view of the model surfaces can be obtained by raycasting.Besides the TSDF value, we also store in each voxel a mean RGB value and a k-dimensional array containing unary potentials (‘evidence’) accumulated so far forthe k semantic categories.Modern 3D reconstruction pipelines commonly consist of tracking and map-ping modules. Tracking involves localizing the camera within the 3D reconstruc-tion, and mapping (or fusion) involves integrating new sensor observations into the3D reconstruction. Klein and Murray [65] were the first to separate tracking andmapping into separate tasks that are run in parallel, in contrast to previous SLAMapproaches. Their Parallel Tracking and Mapping (PTAM) method can rapidlyestimate the pose of a calibrated hand-held camera and build a sparse 3D recon-struction of a small workspace environment. In the tracking task, map points areprojected into the image based on a prior pose estimate given by a motion model,coarse-to-fine matching of features is performed, and the updated camera pose isobtained by iterative minimization of the reprojection error. The mapping task isinitialized using a stereo technique and expanded over time with new keyframes,with bundle adjustment as resources permit. Newcombe et al.’s Dense Trackingand Mapping (DTAM) method [93] performs dense image alignment instead offeature matching, enabling more robust tracking under rapid motion.Newcombe et al.’s KinectFusion method [92] took advantage of the newly in-troduced Kinect depth sensor and contemporary GPU hardware to obtain real-timedense 3D reconstruction of room-sized environments. Kinect sensor observationsare fused into a dense TSDF voxel grid. Tracker drift is significantly reduced byestimating the new camera pose with respect to the complete current reconstructioninstead of the previous frame. The camera pose is estimated using coarse-to-fineiterative closest point (ICP) alignment between the live depth image and the ray-37casted surface prediction.Scaling a 3D reconstruction to large spaces is challenging as the memory re-quirements can quickly become impractical. Nießner et al. [94] extended denseTSDF reconstruction to larger scale environments by introducing voxel hashing(not to be confused with binary hashing methods for efficient search in Section 4).Voxel hashing reduces the memory requirements of the 3D reconstruction: insteadof allocating a regularly spaced voxel grid, voxels near surfaces are allocated asneeded as observations arrive. Empty space remains unallocated. Voxel blocks areindexed by world coordinates in a hash table based data structure.Our 3D reconstruction framework builds on top of InfiniTAM [64], a highlystreamlined implementation of the modern 3D reconstruction pipeline with a denseTSDF 3D reconstruction indexed using a modified version of voxel hashing. Opti-mizations of voxel hashing, fusion, and raycasting allow real-time 3D reconstruc-tion to be achieved on a tablet computer. The authors of InfiniTAM report an orderof magnitude speed-up over KinectFusion [92] and Nießner et al. [94]. We incor-porate modifications to InfiniTAM’s tracking and mapping (fusion) components.Instead of using InfiniTAM’s default iterative closest point (ICP) tracker, whichwe found to be unreliable for outdoor driving sequences, we track the camera poseusing FOVIS [57]. FOVIS is an algorithm for visual odometry. At each time step,the algorithm accepts a greyscale image and a depth map as input, and returnsthe integrated pose estimate. The visual odometry pipeline consists of feature ex-traction, initial rotation estimation, feature matching, inlier detection, and finallymotion estimation. First, FAST features [109] are extracted from the image atmultiple scales. An initial rotation is then estimated to constrain feature match-ing across frames. Next, features are matched and consistent inlier matches aredetected using a greedy maximum-clique technique. Finally, a motion estimate isobtained by iteratively minimizing the feature reprojection error and updating theinlier set.For mapping (fusion) in our modified InfiniTAM implementation, the TSDF,RGB, and unary potential values associated with each voxel are updated as a run-ning average in an online manner as new sensor observations arrive. Observationsat each time step include the RGB image and a disparity map estimated usingstereo. Disparity is converted to depth by z = bf/d, where z is the depth, b is the38stereo baseline, f is the camera’s focal length, and d is the disparity. In particular,we estimate the disparity map from the stereo RGB pair using the Efficient Large-scale Stereo (ELAS) method of Geiger et al. [37]. ELAS is a Bayesian methodfor stereo matching that forms a prior on the disparities by triangulating a set ofrobust point correspondences, referred to as support points. Given a disparity fora pixel in the left (right) image, an observation likelihood is computed for the cor-responding pixel in the right (left) image by comparing fast Sobel-based features.The disparity map is obtained by maximum a-posteriori (MAP) inference at eachpixel, which can be easily parallelized and does not require an expensive globaloptimization.3.2.2 3D CRF modelMF3D constructs a dense conditional random field (CRF) over the 3D reconstruc-tion visible in the current frame. Let G = (V, E) be a graph defined over the visible3D reconstruction, where V is a set of nodes corresponding to the visible voxels andE is a set of undirected edges. Let Xi be the random variable associated with nodei ∈ V . Xi can take on a label from the label set L = {l1, l2, ..., lk}, correspondingto the vocabulary of k semantic category labels. Denote by xi a realization or as-signment of random variable Xi, and by x ∈ L|V| the assignments of all randomvariables; that is, x represents a complete labelling of the image. The energy (orcost) function associated with a particular labelling of the CRF is given byE(x) =∑i∈Vψi(xi) +∑(i,j)∈Eψij(xi, xj) (3.1)where ψi(xi) are unary potentials and ψij(xi, xj) are pairwise potentials, both im-plicitly conditioned on the observed data. The task of inference is to find the la-belling x∗ with the lowest energy:x∗ = argminx∈L|V|E(x) (3.2)Unary potentials represent the costs associated with independently assigning avoxel a particular semantic category label. The conventional practice is to com-pute the unary potentials using a trained discriminative classifier. This classifier39is trained to distinguish between instances of the categories in L. In contrast toconventional practice, MF3D computes the unary potentials via the nonparametric,search-based label transfer mechanism described in the next section.Pairwise potentials represent the costs associated with assigning two connectedvoxels two particular semantic category labels. For example, the cost of assigningthe same semantic label to two voxels that are close in the 3D world and similarin color appearance should be low. Connectivity is defined in terms of the edges Ein graph G and does not necessarily correspond to spatial adjacency. In our model,G is densely connected: that is, there exists a connection between every possiblepairing of voxels in V . This is in contrast to many traditional labelling methodsin which only adjacent neighbours are connected. The dense connectivity enablesencoding of long-range patterns. Kra¨henbu¨hl and Koltun [69] demonstrated thatapproximate inference in a densely connected CRF, in which the pairwise poten-tials are defined by a linear combination of Gaussian kernels, can be performedefficiently by optimizing a mean-field approximation to the CRF distribution. Op-timizing the mean field approximation normally involves iterative message passingthat is quadratic in the number of variables (voxels). Kra¨henbu¨hl and Koltun’s in-sight is that the message passing can be implemented as Gaussian filtering in fea-ture space. By applying a fast approximate high-dimensional filtering technique,inference becomes linear in the number of variables and sublinear in the number ofedges. We adopt Kra¨henbu¨hl and Koltun’s inference method to obtain a labellingover voxels, generalizing their original 2D pairwise appearance kernel to 3D (vox-els that are close in the 3D world and similar in color are encouraged to share thesame label):ψij(xi, xj) = λ exp(−|pi − pj |2σ21− |Ii − Ij |2σ22)δ[xi 6= xj ] (3.3)where λ is the CRF smoothing constant, pi and pj are voxel positions, Ii and Ijare color vectors, σ1 and σ2 are Gaussian bandwidth parameters, and δ[·] is theindicator function.The 3D CRF is approximately solved each time step in an online manner, usingonly sensory data obtained so far, and is constructed only over voxels visible inthe current time step. Unary potentials at each voxel are accumulated online as a40running average over time (Section 3.2.1).3.2.3 Unary potentials update by label transferRecall that label transfer refers to an accumulation of semantic category evidenceby transferring the labels of matching exemplars in a labelled database. This is asearch-based strategy that differs from the common strategy of training a discrim-inative classifier to predict the semantic category label. Its main advantages areflexibility and scalability: the approach is naturally extended to new labelled ex-amples or new categories because no intensive re-learning stage is necessary. Anew category can be added to the database without having to learn how to discrim-inate it from all previously learned categories.In each frame, MF3D computes a unary potential update in image space. Theunary potential update is then projected into the online 3D reconstruction to updatethe unary potentials stored in each voxel. Two questions remain: how can wecompute a unary potential update from a database of labelled examples, and whatis the best way to project the unary update into the 3D reconstruction?In the previous chapter on 2D scene parsing, we demonstrated how using mid-level region proposals (also known as object proposals) for label transfer producesmore accurate labellings than using low-level superpixels. Intuitively, region pro-posals are more likely to preserve object boundaries. For MF3D, we generate a setof region proposals for each frame using a state-of-the-art region proposal network(RPN) [107]. Each region proposal is a rectangular window. The appearance ofeach region proposal is described using a pre-trained VGG-16 network [117] asan off-the-shelf feature extractor, without any fine-tuning for the 3D scene pars-ing task. In particular, an appearance-based feature descriptor fa is obtained byconcatenating the activations of the two fully-connected layers of VGG-16. Thesedescriptors are obtained efficiently using region pooling of the convolutional lay-ers (region pooling evaluates the expensive convolutional layers only once for theimage instead of once for each window [40, 50]).In addition, a depth-based feature descriptor fd is computed for each region pro-posal using the depth map estimated in the 3D reconstruction step (Section 3.2.1).The depth-based descriptor fd is obtained by taking multiple binary comparisons41(0,0)(1,1)(r1=0.3,c1=0.5)(r2=0.5,c2=0.8)Figure 3.2: The depth feature descriptor fd of a window is an nd-dimensionalvector, in which each entry is determined by comparing the estimateddepth at two locations (r1, c1) and (r2, c2). These locations are normal-ized window coordinates, i.e. 0 ≤ r1, c1, r2, c2 ≤ 1. If the first locationis at a greater depth, the entry in fd is set to +1; otherwise, the entry isset to -1.of the estimated depth at random relative locations within the region proposal. LetR = {(r1, c1), (r2, c2)}nd denote a set of nd relative location pairs within theregion proposal. For the jth pair, we compare the estimated depths at the two lo-cations within the region. If the first location has a greater depth estimate, the jthentry of fd is set to +1; otherwise, the jth entry of fd is set to -1. Fig. 3.2 illustratesthis comparison. R is generated once and applied for each region in the databaseand in the queries. Each r1, c1, r2, c2 is drawn uniformly at random from the inter-val [0,1]. The final region descriptor f is a concatenation of fa and fd multiplied bya scaling factor to make the two types of descriptor comparable: f = [fa, αd · fd].We transfer labels from the database in a manner similar to CollageParsing inthe previous chapter. Let {W1,W2, ...Wm} denote the set of region proposals in thecurrent frame. For each region proposal, we search for its closest region proposalin the database by matching the region descriptors f . Each region proposal in thedatabase has an associated ground-truth 2D label field L, appropriately croppedfrom the ground truth label images. The search therefore produces a set of labelfields {L1, L2, ..., Lm}. We warp these retrieved label fields to match the sizes ofthe query region proposals, and denote the resized label fields by {L′1, L′2, ..., L′m}.Let {M1,M2, ...,Mm} denote label masks constructed from {L′1, L′2, ..., L′m}.Each mask Mj is the size of the current frame and is zero (null) everywhere ex-cept within the corresponding region proposal Wj , where it is populated with the42values of L′j . The image-space unary potential update associated with assigning acategory label l to pixel i is then given byψ∗i,l(xi) = −m∑j=1δ[Mj(i) = l]φh(i, l)φsz(j)φidf(l) (3.4)where δ[·] is the indicator function and Mj(i) is the value of Mj at location i,which can be a category label or null. The term φh(i, l) is an importance weightbased on the 3D height of pixel i and the height statistics of category label l in thedatabase:φh(l) =exp(− (h(i)−µh(l))2σ2h (l))if µh(l) < 01 otherwise(3.5)where h(i) denotes the 3D height of pixel i, µh(l) is the mean height of label l, andσh(l) is the variance of the height of label l. The importance weight is defined onlyfor categories with a mean height below the camera (µh(l) < 0) because height es-timates are unreliable for objects extending beyond the upper bounds of the camerafield of view. Intuitively, the term φh(i, l) assigns greater importance to matcheswith heights that are consistent with the database examples. The height means andvariances of a category can be computed in an online manner as a running average,and independently of other categories. Height statistics therefore easily scale tonew labelled examples and new categories.The term φsz(j) assigns higher weight to smaller region proposals, and is de-fined as the inverse of the smaller dimension of the proposal window. This termincreases the influence of compact object matches and suppresses larger image re-gions whose labels may be less precisely matched.Finally, φidf(l) is the inverse document frequency term commonly used in re-trieval:φidf(l) =1freq(l)(3.6)where freq(l) is the number of pixels with label l in the database region propos-als. MF3D does not compute co-occurrence statistics explicitly but captures co-43Wjψ* incrementSearch for closest region proposal in databaseLjFigure 3.3: Visualization of image-space label transfer for a single regionproposal Wj .occurrence patterns implicitly in the window-based label transfer.Fig. 3.3 illustrates matching and label transfer for a single region proposal Wj .Instead of allocating memory for individual masks Mj , in implementation we sim-ply accumulate energies directly in ψ∗i,l. Fig. 3.4 visualizes the image-space unarypotential updates for the same image after all m region proposals have been pro-cessed. Saturated colors represent higher values of ψ∗i,l.We have discussed the question of how we can compute a unary potential up-date from a database of labelled examples, and now arrive at the remaining ques-tion: how can we best project the unary potential update into the 3D reconstruction?A straightforward approach would be to simply project the entire ψ∗i,l into the 3Dreconstruction. However, we observed that this approach can lead to some bleedingof category energies across object boundaries or discontinuities. This is a conse-quence of label transfer being coarsely localized and is a limitation of our systemwhen examples are limited. The spatial characteristics of the transferred evidencedepend on the positions of the objects in the database examples. If the positionsin the database examples do not match the observed frame exactly, there will besome spatial inexactness in the transferred labels. To address this, MF3D performsa segmentation of the frame into superpixels, followed by a winner-take-all pool-ing operation that selects only the most confident category within each superpixel.Each superpixel S selects a single category label ls as the mode of the highest-evidenced labels at each pixel:44Building Vegetation Car RoadWall/fence Pavement PoleFigure 3.4: Visualization of the image-space unary potential update for asample image. Higher saturation represents higher values of ψ∗i,l.ls = modei∈S{argmaxlψ∗i,l(xi)}(3.7)With winner-take-all pooling, the projection of the unary potential update isrestricted to a single category label at each point, and this label is the same acrossall pixels within a superpixel. The unary potential update is zeroed out for all othercategory labels. This pooling operation improves the alignment of transferred evi-dence to the observed frame and reduces the bleeding effects caused by differencesin the positions of the objects in the database examples. Our experiments indicatethat winner-take-all pooling leads to a modest improvement in labelling accuracy,though some bleeding is still noticeable, especially for thin structures such as poles(Fig. 3.7, bottom).Fig. 3.5 shows a visualization of the 3D voxel unary potentials at the frameshown in Fig. 3.4. The voxel unary potentials are accumulated over multiple framesin an online manner as a rolling average.45Building Vegetation Car RoadWall/fence Pavement PoleFigure 3.5: Visualization of the 3D CRF unary potentials for a sample imageto a maximum depth of 25 meters. Voxel unary potentials are accu-mulated over multiple frames in an online manner as a rolling average.Higher saturation represents higher values of ψi,l.3.2.4 Computation timeMost of MF3D’s computation time is spent performing label transfer. As describedso far, MF3D performs an exact nearest neighbour search during label transfer,which is computationally expensive in terms of both time and memory. It is possi-ble to speed up this expensive operation, while at the same time reducing the mem-ory overhead, by integrating efficient methods for large-scale approximate nearestneighbour search. We address this topic in detail in Chapter 4.MF3D’s 3D reconstruction pipeline has been demonstrated to be runnable inreal-time given a modern hardware platform and efficient software implementation[64, 138]. Region proposal generation and description using RPN can performedat 5 fps on a modern GPU [107].463.3 ExperimentsBenchmark details. We benchmarked MF3D on the KITTI labelled dataset [38,113, 135], a standard benchmark for 3D scene parsing of video. We chose KITTIbecause it is the common dataset on which the state-of-the-art 3D scene parsingmethods [113, 135, 138] have all been evaluated. The KITTI labelled dataset isderived from the broader KITTI benchmark for autonomous driving [38] and con-sists of four training video sequences (KITTI sequences 05 to 08), totally spanning5957 stereo pairs, and one independent testing video sequence (KITTI sequence15), totally spanning 1901 stereo pairs. All sequences are captured at 10 framesper second. The four training sequences are captured at a resolution of 1226×370,while the testing sequence is captured at a resolution of 1241 × 376. A subset of45 frames from the training sequences and 25 frames from the testing sequence ismanually annotated by the authors of [113, 135] with semantic ground truth labels.Since we build the 3D semantic reconstruction incrementally in an online manner,we process all frames but evaluate only on the 25 in the test sequence with groundtruth annotations. Labelling accuracy is assessed for seven semantic categories:building, vegetation, car, road, wall/fence, pavement, and pole. Camera calibrationparameters are provided by the authors of KITTI [38].Implementation details. Since the region proposal network generates windowsof limited aspect ratio (1:2, 1:1, 2:1) and some objects of interest are very thin (inparticular, poles), we augment the set of network proposals with Edge Boxes [151]narrower than 1:4. Edge Boxes have been independently shown to be good forlocalization and fast to compute [55]. The feature descriptors for these proposalsare generated the same way as the regular network proposals.Evaluation methodology. Labelling performance is measured by the mean per-class accuracy over the seven classes. Since the ground truth annotations are in2D, the semantically labelled 3D reconstruction is projected onto the camera planefor evaluation. Evaluation is performed on the online 3D reconstruction (i.e. usingonly the frames observed so far). Pixels beyond a maximum depth of 25 meters areignored following Vineet et al. [138].Results and comparison to state-of-the-art. Fig. 3.6 shows qualitative 3D sceneparsing results at frames 3, 36, and 68 in the testing sequence (KITTI sequence47Table 3.1: Comparison to state-of-the-art methods for 3D scene parsing ofvideo on the KITTI labelled benchmark3D Online Model-freeMean per-class accuracyValentin et al. [135] X 67.5Sengupta et al. [113] X 77.2Vineet et al. [138] X X 82.2MF3D X X X 83.515). These results are generated online, using only the frames observed up to thattime. We also render the labelled 3D reconstruction from alternative viewpoints.There are many gaps in the alternative renderings because these are virtual camerasand we have real observations only from the single canonical viewpoint. We cansee that MF3D densely reconstructs the 3D scene around the vehicle with mostlyaccurate semantic labels.Fig. 3.7 shows typical failure cases. The top row shows a reconstruction failureat frame 639 in the testing sequence, in which the FOVIS tracker fails to accuratelyestimate the camera pose. This tracking failure results in local corruption in the3D model. However, the system is able to robustly recover from the failure as itincorporates new sensor observations. The second row shows the labelling resultone second later at frame 649. We can see that there only a few erroneous recon-struction artifacts and the labelling of the scene is mostly accurate. The third rowshows another typical error case. While the winner-take-all pooling reduces thedegree of undesirable ‘bleeding’ during label transfer, some bleeding still occursfor thin structures such as poles. This is apparent in the bottom row, which showsthe reconstruction result at frame 1050.Table 3.1 shows the quantitative results as well as a qualitative comparisonwith three strong baselines for 3D scene parsing in video [113, 135, 138]. MF3Dachieves labelling accuracy comparable to the state-of-the-art method of Vineet etal. [138] and leads the other recent methods of Sengupta et al. [113] and Valentinet al. [135] by several percent.Fig. 3.8 shows the sensitivity of MF3D to the depth feature parameters and48Building Vegetation Car RoadWall/fence Pavement PoleFigure 3.6: Qualitative 3D scene parsing results on the KITTI benchmark.Left column: RGB frame and labelling result from the camera pose.Right column: 3D renderings of the labelling result from different vir-tual viewpoints. Labelling results are to a maximum depth of 25 meters.49Building Vegetation Car RoadWall/fence Pavement PoleFigure 3.7: Failure cases on the KITTI benchmark (see Section 3.3 for dis-cussion). Labelling results are to a maximum depth of 25 meters.0.8290.8300.8310.8320.8330.8340.8350.836-3.5 -3 -2.5 -2 -1.5 -1Mean per-class accuracy log10 αd 0.8290.8300.8310.8320.8330.8340.8350.8368 16 32 64 128 256Mean per-class accuracy Number of random depth pairs nd 0.8000.8100.8200.8300.840500 1000 1500 2000 2500 3000 3500 4000Mean per-class accuracy Number of test region proposals m Figure 3.8: Sensitivity experiments on the KITTI labelled benchmark (seetext for discussion).50the number of region proposals. MF3D is robust to the depth descriptor scalingfactor αd: we set αd = 0.01 and it suffices to be within an order of magnitude. Thedepth features are computed using nd = 32 random depth comparisons. Increasingthe number of comparisons up to 8-fold does not improve results consistently; atany rate, the depth descriptor has only a small effect on labelling accuracy. Wegenerated 1000 region proposals for each database image, and increased this tom = 4000 for the test images. Increasing the number of region proposals at testtime does improve the labelling accuracy but with diminishing returns.Label transfer and discriminative classification. Finally, we completed an ad-ditional experiment to characterize the effective difference between using model-free label transfer and discriminative classification. In this experiment, we replacedthe label transfer in our pipeline with the recent SegNet [7]. SegNet is a fullyconvolutional neural network architecture comprising an encoder network, a cor-responding decoder network, and a pixelwise classification layer. The network isdesigned for 2D scene parsing, and targets road scene understanding in particu-lar. The encoder network follows the VGG-16 network architecture [117]. Thedecoder performs upsampling based on the max-pooling indices computed duringencoding, and further convolves with learned filters.We used the public implementation of SegNet built on the Caffe deep learn-ing framework [61], and fine-tuned it on the KITTI dataset with pre-training onImageNet [111]. To fine-tune the network we used stochastic gradient descentwith a learning rate of 0.01 and momentum of 0.9. Differences in class frequencywere handled by median frequency balancing [7]. Holding the rest of the 3D re-construction pipeline fixed, replacing label transfer with SegNet reduces the meanper-class accuracy on KITTI to 81.8% (original: 83.5%). While SegNet is effec-tive in predicting the single most likely class per pixel, our label transfer betteraccommodates noisy predictions as it makes use of redundancy via overlapping re-gions. These independently matched overlapping regions provide multiple sourcesof evidence for the classes at a given pixel.51Chapter 4Scaling Nonparametric SceneParsing: Methods for Large-ScaleSearchIn the previous chapter, we presented a nonparametric approach to 3D scene pars-ing that differs from conventional practice in its use of search-based label transferinstead of discriminative classification. MF3D achieves competitive labelling ac-curacy and is easily extensible to new categories as no model training is required.However, as described so far, MF3D performs an exact nearest neighbor searchduring label transfer, which is computationally expensive in terms of both time andmemory. We address this point next, and show how unsupervised binary encod-ing (or hashing) can be easily incorporated into our nonparametric scene parsingframework for scalability to larger databases.In this chapter, we propose two novel binary encoding algorithms for large-scale approximate nearest neighbor search. The first algorithm is data-independentand easily scalable to new examples and categories. The second algorithm is super-vised and targets high-dimensional feature vectors such as the proposal descriptorsin our scene parsing framework. We evaluate these methods on standard retrievalbenchmarks, and then demonstrate their effectiveness in the context of 3D sceneparsing on the KITTI labelled dataset.524.1 Related workOne class of methods for performing large-scale approximate nearest neighborsearch is binary encoding, also referred to as hashing (distinct from voxel hash-ing in Section 3.2.1). Binary encoding methods learn a similarity-preserving trans-formation of the feature vectors from their original, real-valued, high-dimensionalfeature space to a low-dimensional binary-valued space. By transforming featurevectors to compact binary codes, significant gains in memory and time efficiencycan be achieved. The Hamming distance between two binary codes can addition-ally be computed by taking an XOR and counting the ones, which is directly sup-ported in modern hardware.A broad range of binary encoding algorithms has been developed for large-scale search, with varying tradeoffs in speed, accuracy, and underlying assump-tions. Locality-sensitive hashing (LSH) [18, 39] is a traditional, data-independentbinary encoding method that uses randomly generated projections. In LSH, theprobability of two feature vectors hashing to the same value is proportional to somemeasure of similarity between the vectors, such as their cosine similarity [18]. LSHoffers theoretical asymptotic guarantees and can be kernelized [62, 75, 104], how-ever a large number of bits (i.e. a large random projection matrix) is typicallyrequired for good performance in practice.This performance limitation motivated the development of binary encodingmethods that rely on training or optimization to learn the transformation. For exam-ple, binary reconstructive embedding [74] uses coordinate descent to minimize theerror between the original pairwise distances and the Hamming pairwise distances.Spectral hashing [141] solves a relaxed version of a balanced graph partitioningproblem to learn efficient codes. Iterative quantization (ITQ) [43] learns a rota-tion of PCA (unsupervised) or CCA (supervised) projected data that minimizes thequantization error, or distortion. The rotation is determined iteratively by alternat-ing between updating the assignments given the current rotation, and updating therotation to minimize the quantization error given the current assignments. Minimalloss hashing [95] applies structured prediction techniques to minimize a hinge-likeerror. Kernel-based supervised hashing [85] sequentially learns a projection in ker-nel space. K-means hashing [49] clusters the data such that the Euclidean distance53between two data points is closely approximated by the Hamming distance be-tween their cluster indices. Graph cuts coding [36] integrates the binary codes intothe optimization as auxiliary variables. Supervised discrete hashing [114] learns aprojection that optimizes the linear classification performance of the binary codes.The authors formulate a discrete optimization problem that is solved iteratively byalternating among three sub-problems: optimizing the hash functions, the linearclassifier, and the binary code assignments, given the other two. We compare toseveral of these methods in the experiments (Section 4.2.2 and Section 4.3.2).Gong et al. [42] showed that binary encoding methods are often infeasible forsearching high-dimensional feature vectors such as VLAD [59] and Fisher Vectors[102]. Their bilinear projections method replaces the usual expensive projectionmatrix with two smaller bilinear projections. Circulant binary embedding [147]learns a projection by a circulant matrix, which can be accelerated using the FastFourier Transform. Xia et al. [143] proposed the unsupervised sparse projection(SP) method for searching high-dimensional feature vectors. The sparse projec-tion is obtained by an iterative, ITQ-like optimization with an additional L0 reg-ularization constraint. SP achieves an order of magnitude faster encoding thanunsupervised dense methods while offering comparable accuracy. In addition, SPproduces much more accurate results than bilinear projections or circulant binaryembedding, while having similar encoding speed. Concurrent with SP, Rastegari etal. [105] proposed sparse binary embedding, which finds a sparse projection ma-trix via a different SVM-based optimization. Kronecker binary embedding [149]computes fast projections using a structured matrix that is the Kronecker productof small orthogonal matrices. Like these methods, our proposed supervised sparseprojections method (Section 4.3) targets high-dimensional feature vectors.Besides binary encoding, approximate nearest neighbor search can also be ap-proached via vector quantization, which encodes data points using learned code-books that can be combined in product [35, 53, 60, 96] or additive [5, 6] forms.Vector quantization algorithms tend to achieve high retrieval accuracy but are slowerthan hashing algorithms. Binary codes can also be used directly in downstreamclustering or classification modules, which is not possible with vector quantizationalgorithms [43].54A C B 0 1 0 1 A C B 0 1 0 1 (a)A C B 0 1 0 1 A C B 0 1 0 1 (b)Figure 4.1: Toy example illustrating neighbor retrieval under two differentsets of hashing hyperplanes. The hyperplanes in (a) map A and B to thesame binary code, and C to a different binary code. The hyperplanes in(b) map B and C to the same code, and A to a different code.4.2 Bank of random rotationsMost modern binary encoding and vector quantization algorithms require training,whether in the form of learning optimal transformations or codebooks. We proposea data-independent binary encoding method that, like traditional LSH, requires notraining and is therefore well-suited for reducing the time and memory require-ments of nonparametric scene parsing.4.2.1 Algorithm descriptionGraphically, we can think of the bits in a binary code as the decisions of a set ofhash functions or hyperplanes in the feature space. A wide variety of techniqueshave been developed to find the best set of hyperplanes (Section 4.1), which maydepend on the application, or more generally, the loss function to be minimized.For example, many binary encoding methods optimize for a transformation thatminimizes the overall quantization error, or distortion [43, 49, 143, 147].We can observe that, for many individual data points, the globally optimaltransformation may not yield good retrieval performance. That is, neighbor re-lationships in the original feature space may not be well preserved. Figure 4.1illustrates with a toy example in a two-dimensional feature space. Figure 4.1(a)shows two hyperplanes determining two bits of the binary code. This transforma-55tion maps A and B to the code 00 and C to the code 01. From the perspective of B,this transformation does not preserve its neighbor relationship: B’s true neighbor,C, is mapped to a different code and appears farther in Hamming space than A,which is mapped to the same code (i.e., the Hamming distance between the binarycodes of A and B is zero). Consider the alternative transformation in Figure 4.1(b).From B’s perspective, this is a better transformation because it maps B and C tothe same binary code and A to a different code.We ask whether a bank of multiple transformations can be more effective thana single globally optimal transformation. For example, instead of learning a trans-formation to minimize the overall quantization error, we consider having each datapoint select from a bank of transformations the one that minimizes its own quan-tization error. This approach removes the need for global optimization or training,and as we will demonstrate later, allows for improved search accuracy at the costof extra encoding overhead at query time.Following binary encoding and vector quantization methods that search for aglobally optimal rotation of the data [35, 43, 96], our method generates a bankof random rotations R = {R1,R2, ...,RK}. The database is encoded as fol-lows. Given a database of n points, we first preprocess the database points byzero-centering and PCA-embedding. Next, for each individual data point x, weselect R∗ ∈ R that satisfiesR∗ = argmaxR∈R‖xR‖1 (4.1)where || · || is the L1 norm. Intuitively, R∗ maps x far from the hyperplanes toreduce the quantization error (e.g. the right transformation in the toy example ofFigure 4.1). We apply the rotation to x and obtain the binary code y by the usualthresholding at zero:y = sgn(xR∗) (4.2)Finally, we store the pair (y, j), where j is the index of R∗ in the bank ofrandom rotations R. That is, j identifies the rotation used to produce the binarycode y. We allocate bits from our budget to index the best rotation (i.e. to store j)56R = { , , } Database ( y1 , ) , ( y2 , ) , ( y3 , ) , ( y4 , ) Query xq yq Compute dhamm(yq , y3) Rotate , sgn Compute dhamm(yq , y2), dhamm(yq , y4) Compute dhamm(yq , y1) yq yq Rotate , sgn Rotate , sgn Figure 4.2: Toy example illustrating a query with a bank of random rotationsfor each individual data point, and therefore incur no additional memory usage forthe database. There is, however, a small memory overhead for storing the bank.Next, we explain how online queries work in the bank of random rotationsframework. Given a (zero-centered, PCA-embedded) query point xq, we computedistances to the database points adaptively, using each database point’s selectedrotation. More specifically, for each rotation R in the bank, we quantize xq withrespect to the rotation R via yq = sgn(xqR), and compute the Hamming distanceof yq to all database binary codes that selectedR as their rotation. Figure 4.2 illus-trates with a toy example. For greater clarity, a single query performs K rotationsand then computes only n Hamming distances – not Kn distances. The number ofHamming-space comparisons is unchanged by the bank of random rotations.Large-scale applications often involve databases that grow over time as userscontinually add images or video. In the proposed bank of random rotations ap-proach, when a new point is added to the database it selects a rotation from thebank independently of all other database points. In contrast to training-based meth-ods, no global optimization problem over the database needs to be re-solved. The57proposed approach therefore offers a particularly scalable solution in large-scalescene parsing applications where the database is expected to grow over time.4.2.2 ExperimentsWe evaluated the retrieval performance of the bank of random rotations on threestandard unsupervised retrieval benchmarks: SIFT1M, GIST1M [60], and CIFAR-10 [71]. The bank of random rotations is compared with three binary encodingbaselines: spectral hashing [141], iterative quantization [43], and k-means hashing[49].Benchmark details. The SIFT1M dataset [60] consists of 1 million SIFT de-scriptors for the database and 10,000 independent descriptors for queries. SIFT1Mwas used by He et al. [49] in the experimental evaluation of k-means hashing.The GIST1M dataset [60] consists of 1 million, 960-dimensional Gist descriptors[97] for the database and 1,000 descriptors for queries. The CIFAR-10 dataset[71] is a 60,000-image, 10-class subset of the Tiny Images dataset [129]. We com-puted 384-dimensional Gist descriptors for each CIFAR-10 image. Gong et al. [43]used a similar 11-class CIFAR dataset in the experimental evaluation of iterativequantization. We followed their methodology of randomly generating five training-testing splits, each containing 1,000 test queries, and averaging over the five trials.Implementation details. We used publicly available implementations of spec-tral hashing, iterative quantization, and k-means hashing. K-means hashing re-quires a product-space decomposition [60] to be tractable. The split into productsubspaces is performed following the split used by He et al. [49] in their experi-ments: 4 bits per subspace for 64-bit and 128-bit codes from SIFT descriptors, and8 bits per subspace for Gist descriptors. Recall that, for each database point, weallocate bits from our budget to store the identifier (index) of the rotation selectedfrom the bank. Experiments on all three datasets use a bank of 256 random rota-tions. For a fair comparison, we adjusted the number of bits in our binary codesaccordingly. For example, for 64-bit codes and a bank of 256 random rotations, wecomputed 56-bit codes since 8 bits are needed to index into the bank.Evaluation methodology. Retrieval performance is illustrated using recall@Ncurves [35, 49, 60, 96], which plot the proportion of true neighbors retrieved in580 5000 100000.40.50.60.70.80.91NRecall@NSIFT1M, 32 bits SH ITQ KMH BankRR (24 bits)0 5000 100000.40.50.60.70.80.91NRecall@NSIFT1M, 64 bits SH ITQ KMH BankRR (56 bits)0 5000 100000.40.50.60.70.80.91NRecall@NSIFT1M, 128 bits SH ITQ KMH BankRR (120 bits)Figure 4.3: Retrieval performance on SIFT1M unsupervised retrieval bench-mark with 32-bit, 64-bit, and 128-bit binary codes (SH: Spectral hash-ing, ITQ: Iterative quantization, KMH: K-means hashing, BankRR:Bank of random rotations).the first N Hamming neighbors. Following the convention of He et al. [49], theground truth is considered to be the query’s 10 nearest Euclidean neighbors in theoriginal feature space.Results. Figures 4.3, 4.4, and 4.5 show the retrieval performance of 32-bit,64-bit, and 128-bit codes on the SIFT1M, GIST1M, and CIFAR-10 benchmarks.Despite not requiring any training or complex sequential optimization, the data-independent bank of random rotations achieves state-of-the-art results on CIFAR-10 and GIST1M. On SIFT1M, results at 64 and 128 bits are on par with k-meanshashing, which has been demonstrated to outperform earlier methods including lo-cality sensitive hashing, PCA hashing, spectral hashing, and minimal loss hashing590 5000 1000000.10.20.30.40.50.60.70.80.91NRecall@NGIST1M, 32 bits SH ITQ KMH BankRR (24 bits)0 5000 1000000.10.20.30.40.50.60.70.80.91NRecall@NGIST1M, 64 bits SH ITQ KMH BankRR (56 bits)0 5000 1000000.10.20.30.40.50.60.70.80.91NRecall@NGIST1M, 128 bits SH ITQ KMH BankRR (120 bits)Figure 4.4: Retrieval performance on GIST1M unsupervised retrieval bench-mark with 32-bit, 64-bit, and 128-bit binary codes (SH: Spectral hash-ing, ITQ: Iterative quantization, KMH: K-means hashing, BankRR:Bank of random rotations).[49]. The bank of random rotations consistently achieves higher retrieval accu-racy than the natural baseline, iterative quantization, which uses a single globallyoptimal rotation.Figure 4.6 shows qualitative results on CIFAR-10. The bank of random ro-tations allows for the efficient retrieval of similar neighbors at a fraction of thememory requirements. Irrelevant images are sometimes retrieved due to the ap-proximation error of binary encoding, especially at low code lengths. However,even the full-dimensional Gist descriptors sometimes retrieve incorrect images.This inexactness is expected because image descriptors only coarsely character-ize the image appearance, and visually similar instances from different semantic600 2500 50000.40.50.60.70.80.91NRecall@NCIFAR−10, 32 bits SH ITQ KMH BankRR (24 bits)0 2500 50000.40.50.60.70.80.91NRecall@NCIFAR−10, 64 bits SH ITQ KMH BankRR (56 bits)0 2500 50000.40.50.60.70.80.91NRecall@NCIFAR−10, 128 bits SH ITQ KMH BankRR (120 bits)Figure 4.5: Retrieval performance on CIFAR-10 unsupervised retrievalbenchmark with 32-bit, 64-bit, and 128-bit binary codes (SH: Spectralhashing, ITQ: Iterative quantization, KMH: K-means hashing, BankRR:Bank of random rotations).categories may be mapped to similar descriptors. Replacing a 384-dimensionalGist descriptor with a 128-bit binary code represents a reduction in memory bytwo orders of magnitude (1,536 or 3,072 bytes to 16 bytes).The bank of random rotations approach provides efficient unsupervised large-scale search without training, making it well suited to our nonparametric sceneparsing framework, as both are designed to be easily extensible to an online growthin the database. In scene parsing we are interested in matching region descriptorsthat have associated semantic labels. A natural question to ask is whether we canachieve even better performance if we leverage this supervisory information. Wenext describe a novel supervised binary encoding method that is particularly ap-61Figure 4.6: Qualitative retrieval results of the bank of random rotations onthe CIFAR-10 dataset. From left to right: query image; feature-space(Gist) neighbors; 32-bit code neighbors; 128-bit code neighbors.6250 100 150 200 250Encoding time(ms)0.10.20.30.40.50.60.7Mean Average Precision (mAP)SUN RGB-D, b = 64 bitsSSP SPITQ-CCA LSHSDHFigure 4.7: Sample results. Retrieval accuracy versus encoding time on theSUN RGB-D benchmark [122] using 4096-dimensional convolutionalneural network features. Binary encoding methods: supervised dis-crete hashing (SDH) [114]; iterative quantization with canonical corre-lation analysis (ITQ-CCA) [43]; sparse projections (SP) [143]; locality-sensitive hashing (LSH) [18, 39]. See Section 4.3.2 for complete results.plicable to searching high-dimensional feature vectors, such as today’s ubiquitousdeep neural network features.4.3 Supervised sparse projectionsWith the growth of big data, computer vision algorithms are relying on increas-ingly complex models and producing increasingly high dimensional features. Fromsecond-order pooling of traditional hand-crafted features [17], to Fisher Vectors[102], to the activations of deep neural networks [19], features with thousands ortens of thousands of dimensions have become the norm. However, high-dimensionalfeatures pose particular challenges for binary encoding algorithms. For exam-ple, traditional algorithms lack an effective regularizer for the learned projections,which sometimes leads to overfitting in high dimensions [143]. Moreover, encod-ing in high dimensions is computationally expensive due to the large projectionmatrix [42, 143].63In this section, we present a novel binary encoding method called SupervisedSparse Projections (SSP), in which the conventional dense projection matrix isreplaced with a discriminatively-trained sparse projection matrix. This design de-cision enables SSP to achieve two to three times faster encoding in comparison toconventional dense methods, while obtaining comparable retrieval accuracy. SSPis also more accurate than unsupervised high-dimensional binary encoding meth-ods while providing similar encoding speeds. Fig. 4.7 shows sample experimentalresults using 4096-dimensional convolutional neural network features.4.3.1 Algorithm descriptionLet X ∈ Rn×d denote a database containing n feature vectors, with each featurevector being d-dimensional. The goal of a binary encoding algorithm is to learna projection matrix P ∈ Rd×b that can be applied to X to produce similarity-preserving binary codes B ∈ {−1, 1}n×b, where b is the number of bits in thebinary codes. In the case of supervised binary encoding, we also have semanticclass labels Y ∈ Rn×C , where C is the number of classes, and yki = 1 if xibelongs to class k and 0 otherwise.Given a novel query xq ∈ Rd, the learned projection P is applied to obtain thecorresponding binary code b ∈ {−1, 1}b:bq = sgn(xqP) (4.3)where sgn(·) denotes the sign function applied to a vector, which returns +1 forpositive values and -1 otherwise. It is also possible to learn and apply the pro-jection to a non-linear mapping of xq: that is, bq = sgn(φ(xq)P) where φ is anon-linear mapping such as an RBF kernel mapping. For ease of formulation andexperimentation we develop the simple linear case in Eq. 4.3.What makes for an effective projection matrix P? A good projection matrixshould preserve semantic relationships, as revealed in the supervisory signal Y. Itshould be generalizable to new, previously unseen queries. Finally, we argue that itshould be fast to apply at test time – an important criterion as today’s increasinglycomplex models produce increasingly high-dimensional features.To obtain an effective projection matrix, the Supervised Sparse Projections64(SSP) method finds an approximate solution to the following optimization prob-lem:minB,W,P||Y −BW||2F + λ||W||2F (4.4)s.t. B = sgn(XP), ||P||1 ≤ τwhereW ∈ Rb×C , || · ||F denotes the Frobenius norm, λ is a regularization param-eter, and τ is a sparsity parameter. The objective (4.4) can be seen as a standard L2classification loss fitting the binary codes B to the supervisory signal Y, with anadditional constraint that B is generated by applying a sparse projection P to thefeature vectors X. We solve this optimization problem approximately following asequential approach. In the first subproblem, we find Bˆ and Wˆ satisfyingminB,W||Y −BW||2F + λ||W||2F (4.5)s.t. B ∈ {−1, 1}n×bIn the second subproblem, given Bˆ from the first subproblem, we solve for theprojection matrix P satisfyingminP||Bˆ−XP||2F (4.6)s.t. ||P||1 ≤ τAn alternative approach would be to incorporate the constraint B = sgn(XP) as apenalty term in a single objective,minB,W,P||Y −BW||2F + λ||W||2F + ν||B−XP||2Fs.t. B ∈ {−1, 1}n×b, ||P||1 ≤ τ,65and then iterate the following until convergence, similar to [114]: (i) fix B, P, andupdate W, (ii) fix W, P, and update B, (iii) fix W, B, and update P. However,the sequential approach is much faster to optimize and produces similar results inpractice. Concretely, on the SUN RGB-D benchmark with b = 32 bits, the sequen-tial approach is over 800% faster than iterating (i),(ii),(iii), with less than 0.1%difference in mAP (averaged over ten trials). Training time is dominated by thesubproblem of solving for P given the sparsity constraint and fixed W and B. Wetherefore optimize the original objective (4.4) by solving subproblems (4.5) and(4.6) sequentially.Our formulation is similar to that of [114], with two main differences: ourmethod computes a sparse projection, which allows for two to three times fasterencoding with comparable accuracy; and our optimization procedure is different.Solving for Bˆ and WˆWe solve for Bˆ and Wˆ in subproblem (4.5) via an iterative alternating optimiza-tion. First, we keep B fixed and update W; then, we keep W fixed and updateB. The update of W is optimal given fixed B, while the update of B given Wis an approximation. These steps are repeated until convergence or a maximumnumber of iterations is reached. The alternating minimization method finds a localoptimum, and so will be sensitive to initialization.(i) Fix B, update W.With B fixed in subproblem (4.5), the problem of solving for W becomesstandard L2-regularized least squares regression (ridge regression), which admitsa closed form solution. In particular, we update W asW = (B>B+ λI)−1B>Y (4.7)(ii) Fix W, update B.With W fixed in subproblem (4.5), the resulting problem for B admits an it-erative solution via the discrete cyclic coordinate descent (DCC) method [114].DCC allows us to solve for B bit by bit – in other words, one column of B at atime. Each column is solved in closed form holding the other columns fixed. Inparticular, with the other columns fixed, the kth bit (column) of B has the optimal66Algorithm 1 Learning Supervised Sparse Projections (SSP)Input: Database feature vectors X ∈ Rn×d; Supervisory labels Y ∈ Rn×C ; λ, τOutput: Projection matrix P ∈ Rd×b; Binary codes B ∈ {−1, 1}n×bInitialize P as a random projection and B as sgn(XP).repeatUpdate W by Eq. 4.7.Update B by discrete cyclic coordinate descent (Eq. 4.8).until converged or maximum number of iterations reached.Solve L1-regularized least squares (lasso) problem (Eq. 4.6) to obtain P.solutionb:k = sgn(q:k −B′W′w>k ) (4.8)where q:k is kth column of Q = YW>, B′ is B excluding column k, wk is thekth row of W, and W′ is W excluding row k. We invite the interested reader torefer to [114] for the complete derivation.Solving for PGiven Bˆ from the previous section, we next solve for P in subproblem (4.6). WithBˆ fixed, we can recognize subproblem (4.6) as an L1-regularized least squares(lasso) problem. The L1 regularizer ||P ||1 ≤ τ induces sparsity in the solution P,where the degree of sparsity is controlled by the user parameter τ .L1-regularized least squares is a well studied problem and we apply one ofthe standard algorithms in the optimization community, SPGL1 [137], to solve forP iteratively. We use the publicly available implementation of SPGL1 with thedefault parameter settings.The entire learning algorithm is summarized in Algorithm 1.4.3.2 ExperimentsWe compared SSP with several representative binary encoding methods:1. Locality-sensitive hashing (LSH) [18, 39] is an unsupervised binary encod-ing method that has been applied in a variety of computer vision tasks, in-67cluding image retrieval, feature matching, and object classification [24].2. Iterative quantization [43] with canonical correlation analysis (ITQ-CCA)and the recently introduced supervised discrete hashing (SDH) method [114]are supervised binary encoding methods.3. Sparse projections (SP) [143] is a state-of-the-art method for fast unsuper-vised high-dimensional binary encoding.Our method can be extended to a non-linear embedding by RBF kernel in thesame way as SDH [114]. For a fair comparison with ITQ-CCA, SP, and LSH, weevaluated all five methods without RBF embedding.We performed experiments on three large vision datasets from different taskdomains:1. The SUN RGB-D dataset [122] is a state-of-the-art benchmark for 3D sceneunderstanding. It consists of 10,335 RGB-D images of indoor scenes andsupports a wide range of scene understanding tasks, including scene cate-gorization, semantic segmentation, object detection, and room layout esti-mation. We followed the recommended protocol for scene categorization,which uses the scene categories having more than 80 images (a total of9,862 images over 21 categories). The SUN RGB-D benchmark providesconvolutional neural network features as a baseline. These convolutionalneural network features have d = 4096 dimensions and are generated us-ing Places-CNN [150], a leading architecture for scene categorization on theSUN database [145]. We conducted ten trials, randomly setting aside 1,000images as queries and using the remainder as the database in each trial.2. The Animals with Attributes (AwA) dataset [79] was originally introducedfor evaluating zero-shot learning. The task in zero-shot learning is to per-form image classification for categories that are described in terms of theirsemantic attributes [29, 79], but which have no training examples. Here weuse AwA simply for multi-class retrieval. The dataset consists of 30,475 im-ages covering 50 animal categories, with a minimum of 92 images in each68category. AwA includes as a baseline 4096-dimensional convolutional neu-ral network features generated using the VGG-19 “very deep” network [117].We conducted ten trials, randomly setting aside 1,000 images as queries andusing the remainder as the database in each trial.3. The ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2012dataset [111] is a standard benchmark for image classification algorithms. Itconsists of 1.2 million images spanning 1,000 semantic categories. We ex-tracted 4096-dimensional convolutional neural network features from eachimage using the VGG CNN-M network [19]. To hold the database in mem-ory for training, we conducted ten trials in which 100 classes were randomlyselected each time. Within the subset of 100 classes, we randomly set aside1,000 images as queries and used the remainder as the database. We sampled10,000 images from the database to train the binary encoding methods.We measure retrieval accuracy and encoding speed for all methods. Retrievalaccuracy is measured by the mean average precision (mAP). With labelled data,we are interested in preserving semantic similarity. The retrieval ground truth forcomputing the mean average precision consists of database examples sharing thesame semantic category label as the query. We compute the encoding time as theprocessor time required for the matrix multiplication bq = sgn(xqP) (Eq. 4.3),where xq is the query vector and the projection matrix P may be sparse or dense,depending on the binary encoding method. In particular, SP and the proposed SSPproduce sparse projections P; LSH, ITQ-CCA, and SDH produce dense projec-tions P. Both sparse and dense matrix-vector multiplications are executed usingthe Intel Math Kernel Library (MKL). All timings are performed on a desktopcomputer with a 3.60 GHz CPU.Fig. 4.8 summarizes the experimental results on SUN RGB-D for b = 32, 64,96, and 128 bits per feature vector. The vertical axis shows the mean average preci-sion; the horizontal axis shows the encoding time in milliseconds for 1,000 queries.Compared to the unsupervised high-dimensional binary encoding method, SP, theproposed SSP obtains substantially higher retrieval accuracy at similar encodingspeed. For example, at b = 64 bits SSP achieves 64.6% mAP compared to 17.4%for SP. These results demonstrate the advantage of taking into consideration the690 50 100 150Encoding time(ms)0.10.20.30.40.50.60.7Mean Average Precision (mAP)SUN RGB-D, b = 32 bits50 100 150 200 250Encoding time(ms)0.10.20.30.40.50.60.7Mean Average Precision (mAP)SUN RGB-D, b = 64 bits50 100 150 200 250 300 350Encoding time(ms)0.10.20.30.40.50.60.7Mean Average Precision (mAP)SUN RGB-D, b = 96 bits100 150 200 250 300 350 400Encoding time(ms)0.10.20.30.40.50.60.7Mean Average Precision (mAP)SUN RGB-D, b = 128 bitsSSP SPITQ-CCA LSHSDHFigure 4.8: Experimental results on the SUN RGB-D dataset [122] using4096-dimensional convolutional neural network features (Places-CNN[150]) and binary code lengths b = 32, 64, 96, and 128. Results are av-eraged over ten trials. Binary encoding methods: supervised sparse pro-jections (SSP); supervised discrete hashing (SDH) [114]; iterative quan-tization with canonical correlation analysis (ITQ-CCA) [43]; sparseprojections (SP) [143]; locality-sensitive hashing (LSH) [18, 39].70supervisory information available in many large-scale databases.Because of the sparsity of the projection matrix, SSP provides significantlyfaster encoding speed than modern dense methods such as LSH, ITQ-CCA, andSDH. At the same time, retrieval accuracy is competitive with the dense methods.SSP achieves higher accuracy than ITQ-CCA and is comparable with SDH acrossall bit lengths. In particular, at b = 32, 64, 96, and 128 bits, SSP is 2.3 to 3.4 timesfaster than SDH to encode, while matching SDH’s accuracy within 1% mAP. SSPis similarly faster than ITQ-CCA to encode, while obtaining 6 to 9% higher mAP.In fact, these experimental results likely underestimate the practical improve-ment in encoding speed. The recent work on SP [143] reports encoding speedsthat are linearly related to the proportion of nonzeros. For example, encoding witha projection matrix with 20% nonzeros is reportedly 5 times faster than encodingwith a dense projection matrix. On the other hand, in Fig. 4.8, the sparse projec-tion matrix for SSP has 20% nonzeros but we observed the encoding to be only2.3 to 3.4 times faster than SDH, for instance. Since the authors of [143] also useMKL for timing evaluation1, this difference may be due to optimizations not fullysupported in our implementation. In summary, comparisons with LSH, ITQ-CCA,and SDH on SUN RGB-D demonstrate that L1 regularization can be used to spar-sify the projection matrix and improve the encoding efficiency of high-dimensionalfeature vectors, with only a minor cost in retrieval accuracy.We observed similar results on the AwA dataset. Fig. 4.9 shows the meanaverage precision versus encoding time for 1,000 queries, for b = 32, 64, 96, and128 bits per feature vector. SSP achieves significantly faster encoding speed thanthe dense methods while providing comparable retrieval accuracy. For example, incomparison with the best performing dense method SDH, SSP is 2.2 to 3.2 timesfaster to encode and within 2% mAP across all bit lengths. SSP is significantlymore accurate than the fast high-dimensional encoding method SP, with a relativeimprovement of 49 to 57% mAP across all bit lengths.Fig. 4.10 summarizes the experimental results on ImageNet (ILSVRC 2012)for b = 32, 64, 96, and 128 bits per feature vector. Experimental results on Ima-geNet follow the general trends observed on the SUN RGB-D and AwA datasets.1Personal communication.710 50 100 150Encoding time(ms)0.10.20.30.40.50.60.70.8Mean Average Precision (mAP)AwA, b = 32 bits50 100 150 200 250Encoding time(ms)0.30.40.50.60.70.80.91Mean Average Precision (mAP)AwA, b = 64 bits50 100 150 200 250Encoding time(ms)0.30.40.50.60.70.80.91Mean Average Precision (mAP)AwA, b = 96 bits100 150 200 250 300Encoding time(ms)0.40.50.60.70.80.91Mean Average Precision (mAP)AwA, b = 128 bitsSSP SPITQ-CCA LSHSDHFigure 4.9: Experimental results on the AwA dataset [79] using 4096-dimensional convolutional neural network features (VGG-19 [117]) andbinary code lengths b = 32, 64, 96, and 128. Results are averagedover ten trials. Binary encoding methods: supervised sparse projections(SSP); supervised discrete hashing (SDH) [114]; iterative quantizationwith canonical correlation analysis (ITQ-CCA) [43]; sparse projections(SP) [143]; locality-sensitive hashing (LSH) [18, 39].720 50 100 150Encoding time(ms)0.10.20.30.40.50.60.7Mean Average Precision (mAP)ImageNet, b = 32 bits50 100 150 200 250Encoding time(ms)0.20.30.40.50.60.70.8Mean Average Precision (mAP)ImageNet, b = 64 bits100 150 200 250 300Encoding time(ms)0.20.30.40.50.60.70.8Mean Average Precision (mAP)ImageNet, b = 96 bits100 200 300 400Encoding time(ms)0.30.40.50.60.70.8Mean Average Precision (mAP)ImageNet, b = 128 bitsSSP SPITQ-CCA LSHSDHFigure 4.10: Experimental results on the ImageNet dataset [111] using 4096-dimensional convolutional neural network features (VGG CNN-M[19]) and binary code lengths b = 32, 64, 96, and 128. Resultsare averaged over ten trials. Binary encoding methods: supervisedsparse projections (SSP); supervised discrete hashing (SDH) [114];iterative quantization with canonical correlation analysis (ITQ-CCA)[43]; sparse projections (SP) [143]; locality-sensitive hashing (LSH)[18, 39].730 0.05 0.1 0.15 0.2 0.25 0.3Proportion of non-zeros in projection P0.570.580.590.60.610.620.630.640.650.66Mean Average Precision (mAP)SSP for different sparsities on SUN RGB-D, b = 64 bitsFigure 4.11: Retrieval accuracy versus percentage of non-zeros in the SSPprojection matrix P, for SUN RGB-D at b = 64 bits. The sparsity ofP is determined by the parameter τ in Eq. 4.4. We set τ to obtainapproximately 20% non-zeros in P in all other experiments.Compared to fast high-dimensional binary encoding with SP, SSP obtains moreaccurate results at similar encoding speeds, with relative improvements of 43 to69% mAP across all bit lengths. Compared to the best performing dense methodSDH, SSP is 2.1 to 2.8 times faster to encode and is within 2% mAP across all bitlengths.The sparsity of the learned projection matrix in SSP can be controlled by settingthe τ parameter. In the preceding experiments we set τ to obtain approximately20% non-zeros in the projection matrix P. Fig. 4.11 shows how retrieval accuracychanges as we vary τ to obtain sparser or denser projection matrices. Experimentswere conducted on the SUN RGB-D dataset with 64-bit codes, averaging over tenrandom trials. A tradeoff between projection sparsity and retrieval accuracy can beseen in the figure. In general, making the projection matrix more sparse enablesfaster encoding but at a cost in retrieval accuracy.On the SUN RGB-D dataset, training requires 5.9 min for b = 32 bits; 10.3 minfor b = 64 bits; 15.9 min for b = 96 bits; and 21.0 min for b = 128 bits, averagingover ten random trials.74Figure 4.12: Qualitative results on the SUN RGB-D dataset at b = 64 bits.The left column shows query images. The right column shows ten ofthe top retrieved neighbors (sampled from multi-way ties) using SSP.The results indicate that SSP reliably preserves semantic neighbors inthe learned binary codes.75Fig. 4.12 shows typical qualitative results on the SUN RGB-D dataset usingSSP with b = 64 bits. SSP reliably preserves the supervisory information in thelearned binary codes. We observed that even in failure cases, SSP returns semanti-cally related neighbors. For example, the ground truth class label of the third queryimage from the top is ‘lab’, but the top retrieved nearest neighbors are ‘office’ im-ages. In summary, SSP allows for scalable, fast, and semantics-preserving retrievalin large labelled databases. Replacing the original d = 4096 dimensional convolu-tional neural network features with 64-bit SSP binary codes results a reduction instorage costs by three orders of magnitude (131,072 or 262,144 bits reduced to 64bits).4.4 Integration with 3D scene parsingHaving presented two novel binary encoding methods with experimental validationon standard retrieval datasets, we next study the integration of these methods intoour MF3D framework for nonparametric 3D scene parsing. We follow the sameexperimental protocol as described in Section 3.3.Implementation details. The bank of random rotations method is unsupervisedand can be applied directly on the region proposal descriptors without training. Wegenerated four random rotations for the bank; an additional overhead of two bitsis required to index the selected rotation per feature vector, and this is reflected inthe slight difference in memory reduction reported later between the two binaryencoding methods.The supervised sparse projections method takes as input both the region pro-posal descriptors and a representation of their associated labels. Recall that eachregion proposal in the database has an associated ground-truth 2D label field L,appropriately cropped from the ground truth label images. We divide each pro-posal window into thirds horizontally and vertically to form a spatial grid. Withineach grid cell, we compute an indicator vector showing the presence or absence ofeach class (i.e. the ith entry is 1 if class i is present, or 0 if class i is absent). Theindicator vectors of the nine grid cells are concatenated to form a vector of length9C, where C is the number of classes, and the concatenated vectors are stacked toform SSP’s supervisory matrix Y ∈ Rn×9C (Section 4.3.1).76Table 4.1: MF3D with binary encoding on KITTI labelled benchmark using256-bit binary codesSearch time (s) Mean per-class accuracy Memory reductionBaseline MF3D 24.4 83.5 –MF3D + BankRR 1.34 82.6 ± 0.4 1,020×MF3D + SSP 1.22 82.2 ± 0.7 1,028×Quantitative results. Table 4.1 summarizes the time and memory efficienciesgained via binary encoding, using both the bank of random rotations and supervisedsparse projections to generate 256-bit binary codes. Results are averaged overfive random trials. Timing is performed in Matlab, single threaded execution ona 3.60 GHz CPU with hardware accelerated popcount for evaluating Hammingdistances. The search time for supervised sparse projections is in fact a slight over-estimate, as the sparse-dense matrix multiplication executes in Matlab no fasterthan the equivalent dense-dense matrix multiplication; however, the search time inboth sparse and dense cases is dominated by the Hamming distance computation(∼ 1.1s on average). The baseline MF3D searches the full-dimensional descriptorsusing a fast exact nearest neighbor search as implemented in the Yael library2.At 256 bits, binary encoding reduces the search time of the m = 4000 regionproposals by an order of magnitude, while obtaining a mean per-class accuracywithin 1% of the full dimensional descriptors. Replacing the full dimensional de-scriptors with 256-bit binary codes enables a reduction in memory costs of threeorders of magnitude.Fig. 4.13 shows the progression of labelling accuracy with respect to the num-ber of bits in the binary code for the bank of random rotations, supervised sparseprojections, and locality-sensitive hashing (LSH). The bank of random rotationsand supervised sparse projections methods obtain comparable labelling results,with the bank of random rotations performing slightly better at longer binary codelengths (128 and 256 bits) and supervised sparse projections performing betterat the shortest code length (32 bits). These results suggest that, for the specific2http://yael.gforge.inria.fr/7732 64 128 256Binary code length (bits)0.50.550.60.650.70.750.80.85Mean per-class accuracyLSH BRR SSPFigure 4.13: MF3D with binary encoding on KITTI labelled benchmark:mean per-class accuracy versus binary code length for locality-sensitive hashing (LSH), bank of random rotations (BRR), and super-vised sparse projections (SSP).purpose of label transfer in MF3D, the additional supervisory signal Y providesadded benefit only at high levels of compression. As the number of bits in the bi-nary code increases, preserving the similarity relationships in the region proposalsfeature space becomes sufficient for accurate label transfer; preserving Y becomesless important. At 64 bits and up, the bank of random rotations is preferred since itslabelling outcomes are as good as the more complex supervised sparse projectionsapproach, while not requiring training. Increasing the number of bits eventuallyproduces diminishing returns, with all three binary encoding methods obtainingmean per-class accuracies within 0.5% by 256 bits.Label transfer is highly parallelizable because each region proposal is matchedand transferred independently. MF3D can therefore take advantage of multipleparallel GPU threads, each processing a separate region proposal. We leave thisimplementation for future work.Qualitative results. Fig. 4.14 presents qualitative 3D scene parsing resultsat frames 3, 36, and 68 in the testing sequence (KITTI sequence 15), followingFig. 3.6. Results are generated online, using only the frames observed up to that78time. In each block, the left column shows the RGB frame and the labelling resultusing full dimensional region descriptors. The top right visualizes the labellingresult using the bank of random rotations, and the bottom right visualizes the la-belling result using supervised sparse projections. 256-bit binary codes are gener-ated for both methods. We observed that label transfer with the binary encodingmethods produces labellings that are largely consistent with those produced bylabel transfer using full dimensional descriptors. Since binary encoding is an ap-proximation, there are some minor but discernable discrepancies in the labellingresults. For example, in frame 3, MF3D with the bank of random rotations mis-labels the tree on the left side as building and pole; MF3D with supervised sparseprojections does not capture the full extent of the pavement in frame 3; and neitherbinary encoding algorithm preserves the partial detection of the vegetation in theleft foreground of frame 36. However, our qualitative results overall indicate that at256 bits, binary encoding introduces a small degree of approximation error whileobtaining a significant improvement in both time and space efficiency.79Building Vegetation Car RoadWall/fence Pavement PoleFigure 4.14: Qualitative 3D scene parsing results on the KITTI benchmarkwith binary encoding. Left column: RGB frame and labelling result ofMF3D with full dimensional descriptors. Top right: Labelling result ofMF3D with bank of random rotations. Bottom right: Labelling resultof MF3D with supervised sparse projections. 256-bit binary codes areused and labelling results are to a maximum depth of 25 meters.80Chapter 5ConclusionComputer vision is about developing systems that can “see” and make sense of acomplex visual world. Vision capabilities are being embedded in an increasinglybroad range of computer systems, from smartphone apps, to self-driving vehicles,to bionics. As our capacity to bridge the semantic gap grows, we will demand morefrom our algorithms. It is no longer enough to detect that an image contains a carinstead of a cat. In scene parsing, we aim to explain or reconstruct the entire scene,identifying all of the objects and background elements of interest.“Big data” has also enabled us to build more sophisticated data-driven sys-tems, with exciting advances in particular in deep neural networks for classifica-tion. These learning algorithms allow us to distinguish, with increasing accuracy,among pixels of a pre-defined set of semantic categories. In this thesis we haveleveraged big data in a much less commonly explored way: for scene parsing bylarge-scale nonparametric search. Leveraging big data for search allows for pow-erful exemplar-based scene parsing that does not require training and can be easilyextended to novel categories as the database grows. In Chapter 2, we presenteda nonparametric algorithm for 2D scene parsing of images. In contrast to tradi-tional nonparametric scene parsing methods that perform label transfer on the levelof pixels or superpixels, CollageParsing integrates information from overlappingobject proposals. In Chapter 3, we developed a nonparametric algorithm for 3Dscene parsing of video. MF3D constructs a semantically labelled 3D reconstruc-tion of the environment in an online manner, using only the frames observed so81far. Our experimental results, both in 2D using images and in 3D using video,show that large-scale nonparametric search can be an effective strategy for sceneparsing, producing accurate dense labellings while retaining extensibility to newcategories and examples.We argued that nonparametric search and parametric classification strategiesare complementary in scene parsing, and we believe their effective combination isa promising direction for future work. Indeed, since the original publication of Col-lageParsing in [130], other authors have explored hybrid nonparametric-parametricapproaches to 2D scene parsing. For example, Liu et al. [84] developed a hybridapproach for parsing humans in images, in which a convolutional neural network istrained to predict the confidence and displacements for matching labelled regionsfrom neighbor database images to the query image. Shuai et al. [116] constructeda graphical model for 2D scene parsing in which unary potentials are computedlocally using a trained convolutional neural network, and global potentials providehigh-level context and are derived non-parametrically from similar exemplars.Another promising direction for future work in scene parsing is the explorationof novel sources of relevant side information and unconventional ways to augmentthe labelled database of examples. Since obtaining fully labelled databases can beexpensive, it is worth investigating how other sources of information can providea useful prior over labellings. For instance, Wang et al. [140] recently leverageda crowd-sourced mapping database called OpenStreetMaps to form a geographicprior for outdoor scene understanding. Given a geo-tagged image and camera pa-rameters, the map is rendered in 3D to provide priors for geometry (depth) andsemantic labels. Besides novel sources for priors, synthetic data augmentationmay be helpful when real-world labelled data is scarce. Along this line of investi-gation, Gaidon et al. [34] recently introduced the Virtual KITTI dataset, consistingof synthetic clones of KITTI sequences photorealistically rendered using the Unitygraphics engine. Their experiments indicate that virtual pre-training improves theperformance of multi-object tracking in real video sequences.Nonparametric scene parsing requires new techniques for searching large la-belled collections of data that are time and memory efficient. In Chapter 4, weproposed two novel binary encoding algorithms for approximate nearest neigh-bor search and demonstrated how these algorithms can make nonparametric scene82parsing more scalable to larger databases. The bank of random rotations algorithmis data-independent and therefore easily scalable to new categories and examples.The supervised sparse projections algorithm specifically targets search in high di-mensions and enables faster encoding at test time. These contributions to approxi-mate nearest neighbor search are general: they are not specific to scene parsing andcan be applied to any unlabelled (bank of random rotations) or labelled (supervisedsparse projections) database.Section 1.2 briefly touched on the point of synergy between large-scale searchand deep neural network classification frameworks. We believe this will be animportant and fruitful avenue for future work. Though the techniques of binaryencoding and deep learning may seem different at first glance, we are just nowstarting to see the exciting potential of their convergence. For example, the re-cently proposed XNOR-Net [106] is a fast and compact approximation to standardconvolutional neural networks for classification in which filters and/or inputs areconstrained to be binary. Binarizing the weights produces a network that requires32 times less memory than a conventional network with single-precision weights,while maintaining comparable classification accuracy. Binarizing the inputs en-ables a speed-up of 58 times relative to standard 3x3 convolution, with some degra-dation in classification accuracy. XNOR-Net aims to enable real-time inference ondevices with limited memory and no GPUs.In closing, this is an exciting time for large-scale, data-driven computer vision.Tasks which had previously seemed far in the future, such as free-form visual ques-tion answering, semantic 3D reconstruction at an urban scale, and self-driving ve-hicles, are becoming more plausible with each year. The success of these projectswill have widespread and lasting impact: on how we work, how we design ourcities, and how we interact with machines.83Bibliography[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of imagewindows. IEEE Transactions on Pattern Analysis and MachineIntelligence, 34(11):2189–2202, 2012. → pages 6, 10, 14, 15, 19, 25[2] P. Arbela´ez, M. Maire, C. Fowlkes, and J. Malik. Contour detection andhierarchical image segmentation. IEEE Transactions on Pattern Analysisand Machine Intelligence, 33(5):898–916, 2011. → pages 1[3] P. Arbela´ez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik.Multiscale combinatorial grouping. In IEEE Conference on ComputerVision and Pattern Recognition, 2014. → pages 14, 15[4] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, andS. Savarese. 3D semantic parsing of large-scale indoor spaces. In IEEEConference on Computer Vision and Pattern Recognition, 2016. → pages35[5] A. Babenko and V. Lempitsky. Additive quantization for extreme vectorcompression. In IEEE Conference on Computer Vision and PatternRecognition, 2014. → pages 54[6] A. Babenko and V. Lempitsky. Tree quantization for large-scale similaritysearch and classification. In IEEE Conference Computer Vision and PatternRecognition, 2015. → pages 54[7] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deepconvolutional encoder-decoder architecture for image segmentation.arXiv:1511.00561, 2015. → pages 51[8] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch:a randomized correspondence algorithm for structural image editing. InACM SIGGRAPH, 2009. → pages 1284[9] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery andcharacterization from noisy web data. In European Conference onComputer Vision, 2010. → pages 2[10] M. Bla´ha, C. Vogel, A. Richard, J. D. Wegner, T. Pock, and K. Schindler.Large-scale semantic 3D reconstruction: an adaptive multi-resolutionmodel for multi-class volumetric labeling. In IEEE Conference onComputer Vision and Pattern Recognition, 2016. → pages 35[11] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighborbased image classification. In IEEE Conference on Computer Vision andPattern Recognition, 2008. → pages 2, 20[12] X. Boix, J. M. Gonfaus, J. van de Weijer, A. D. Bagdanov, J. Serrat, andJ. Gonza`lez. Harmony potentials: fusing global and local scale forsemantic image segmentation. International Journal of Computer Vision,96:83–102, 2012. → pages 11[13] Y. Boykov and V. Kolmogorov. An experimental comparison ofmin-cut/max-flow algorithms for energy minimization in vision. IEEETransactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004. → pages 23[14] Y. Boykov, O. Veksler, and R. Zabih. Efficient approximate energyminimization via graph cuts. IEEE Transactions on Pattern Analysis andMachine Intelligence, 20(12):1222–1239, 2001. → pages 23[15] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation andrecognition using structure from motion point clouds. In EuropeanConference on Computer Vision, 2008. → pages 3, 33[16] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentationusing constrained parametric min-cuts. IEEE Transactions on PatternAnalysis and Machine Intelligence, 34(7):1312–1328, 2012. → pages 14,15[17] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semanticsegmentation with second-order pooling. In European Conference onComputer Vision, 2012. → pages 63[18] M. S. Charikar. Similarity estimation techniques from rounding algorithms.In ACM Symposium on Theory of Computing, 2002. → pages xi, xii, 53,63, 67, 70, 72, 7385[19] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of thedevil in the details: delving deep into convolutional nets. In BritishMachine Vision Conference, 2014. → pages xii, 20, 63, 69, 73[20] M. Cheng, Z. Zhang, W. Lin, and P. Torr. BING: Binarized normedgradients for objectness estimation at 300fps. In IEEE Conference onComputer Vision and Pattern Recognition, 2014. → pages 14, 16[21] B. Curless and M. Levoy. A volumetric method for building complexmodels from range images. In Conference on Computer Graphics andInteractive Techniques (SIGGRAPH), 1996. → pages 37[22] D. Dai, H. Riemenschneider, G. Schmitt, and L. V. Gool. Example-basedfacade texture synthesis. In IEEE International Conference on ComputerVision, 2013. → pages 2, 7[23] N. Dalal and B. Triggs. Histograms of oriented gradients for humandetection. In IEEE Conference on Computer Vision and PatternRecognition, volume 1, pages 886–893, 2005. → pages 14[24] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, andJ. Yagnik. Fast, accurate detection of 100,000 object classes on a singlemachine. In IEEE Conference on Computer Vision and PatternRecognition, 2013. → pages 68[25] P. Dolla´r and C. L. Zitnick. Structured forests for fast edge detection. InIEEE International Conference on Computer Vision, 2013. → pages 1, 16,19[26] D. Eigen and R. Fergus. Nonparametric image parsing using adaptiveneighbor sets. In IEEE Conference on Computer Vision and PatternRecognition, 2012. → pages 6, 9, 12, 25[27] I. Endres and D. Hoiem. Category-independent object proposals withdiverse ranking. IEEE Transactions on Pattern Analysis and MachineIntelligence, 36(2):222–234, 2014. → pages 14, 16[28] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing withmultiscale feature learning, purity trees, and optimal covers. InInternational Conference on Machine Learning, 2012. → pages 3, 12, 25[29] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects bytheir attributes. In IEEE Conference on Computer Vision and PatternRecognition, 2009. → pages 1, 6886[30] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press,1998. → pages 5[31] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Objectdetection with discriminatively trained part-based models. IEEETransactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010. → pages 2, 14[32] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based imagesegmentation. International Journal of Computer Vision, 59(2):167–181,2004. → pages 24[33] G. Floros and B. Leibe. Joint 2D-3D temporally consistent semanticsegmentation of street scenes. In IEEE Conference on Computer Visionand Pattern Recognition, 2012. → pages 33[34] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy formulti-object tracking analysis. In IEEE Conference on Computer Visionand Pattern Recognition, 2016. → pages 82[35] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization forapproximate nearest neighbor search. In IEEE Conference on ComputerVision and Pattern Recognition, 2013. → pages 54, 56, 58[36] T. Ge, K. He, and J. Sun. Graph cuts for supervised binary coding. InEuropean Conference on Computer Vision, 2014. → pages 54[37] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching.In Asian Conference on Computer Vision, 2010. → pages 39[38] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving?the KITTI vision benchmark suite. In IEEE Conference on ComputerVision and Pattern Recognition, 2012. → pages viii, 6, 47[39] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensionsvia hashing. In International Conference on Very Large Data Bases, 1999.→ pages xi, xii, 53, 63, 67, 70, 72, 73[40] R. Girshick. Fast R-CNN. In IEEE International Conference on ComputerVision, 2015. → pages 15, 41[41] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchiesfor accurate object detection and semantic segmentation. In IEEEConference on Computer Vision and Pattern Recognition, 2014. → pages2, 1587[42] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik. Learning binary codesfor high-dimensional data using bilinear projections. In IEEE Conferenceon Computer Vision and Pattern Recognition, 2013. → pages 54, 63[43] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: aProcrustean approach to learning binary codes for large-scale imageretrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence,35(12):2916–2929, 2013. → pages xi, xii, 7, 53, 54, 55, 56, 58, 63, 68, 70,72, 73[44] S. Gould and Y. Zhang. PatchMatchGraph: Building a graph of densepatch correspondences for label transfer. In European Conference onComputer Vision, 2012. → pages 12, 25[45] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometricand semantically consistent regions. In IEEE International Conference onComputer Vision, 2009. → pages 25[46] C. Ha¨ne, C. Zach, A. Cohen, R. Angst, and M. Pollefeys. Joint 3D scenereconstruction and class segmentation. In IEEE Conference on ComputerVision and Pattern Recognition, 2013. → pages 2, 33[47] J. Hays and A. A. Efros. Scene completion using millions of photographs.In ACM SIGGRAPH, 2007. → pages 5[48] K. He, J. Sun, and X. Tang. Guided image filtering. In EuropeanConference on Computer Vision, 2010. → pages 1[49] K. He, F. Wen, and J. Sun. K-means hashing: an affinity-preservingquantization method for learning binary compact codes. In IEEEConference on Computer Vision and Pattern Recognition, 2013. → pages53, 55, 58, 59, 60[50] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deepconvolutional networks for visual recognition. IEEE Transactions onPattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015. →pages 41[51] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for imagerecognition. In IEEE Conference on Computer Vision and PatternRecognition, 2016. → pages 3[52] G. Heitz and D. Koller. Learning spatial context: using stuff to find things.In European Conference on Computer Vision, 2008. → pages 6, 988[53] J.-P. Heo, Z. Lin, and S.-E. Yoon. Distance encoded product quantization.In IEEE Conference on Computer Vision and Pattern Recognition, 2014.→ pages 54[54] L. Horne, J. Alvarez, C. McCarthy, M. Salzmann, and N. Barnes. Semanticlabeling for prosthetic vision. Computer Vision and Image Understanding,in press. → pages 2[55] J. Hosang, R. Benenson, and B. Schiele. How good are detection proposals,really? In British Machine Vision Conference, 2014. → pages 16, 47[56] X. Hou and L. Zhang. Saliency detection: a spectral residual approach. InIEEE Conference on Computer Vision and Pattern Recognition, 2007. →pages 15[57] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, andN. Roy. Visual odometry and mapping for autonomous flight using anRGB-D camera. In International Symposium on Robotics Research, 2011.→ pages 38[58] P. Isola and C. Liu. Scene collaging: analysis and synthesis of naturalimages with semantic layers. In IEEE International Conference onComputer Vision, 2013. → pages 14[59] H. Je´gou, M. Douze, C. Schmid, and P. Pe´rez. Aggregating localdescriptors into a compact image representation. In IEEE Conference onComputer Vision and Pattern Recognition, 2010. → pages 54[60] H. Je´gou, M. Douze, and C. Schmid. Product quantization for nearestneighbor search. IEEE Transactions on Pattern Analysis and MachineIntelligence, 33(1):117–128, 2011. → pages 54, 58[61] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fastfeature embedding. arXiv:1408.5093, 2014. → pages 51[62] K. Jiang, Q. Que, and B. Kulis. Revisiting kernelized locality-sensitivehashing for improved large-scale image retrieval. In IEEE Conference onComputer Vision and Pattern Recognition, 2015. → pages 53[63] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Blocks that shout:distinctive parts for scene classification. In IEEE Conference on ComputerVision and Pattern Recognition, 2013. → pages 189[64] O. Ka¨hler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. Torr, and D. Murray. Veryhigh frame rate volumetric integration of depth images on mobile devices.In International Symposium on Mixed and Augmented Reality, 2015. →pages 38, 46[65] G. Klein and D. Murray. Parallel tracking and mapping for small ARworkspaces. In International Symposium on Mixed and Augmented Reality,2007. → pages 37[66] P. Kohli, L. Ladicky´, and P. H. S. Torr. Robust higher order potentials forenforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009. → pages 11, 33[67] V. Kolmogorov and R. Zabih. What energy functions can be minimized viagraph cuts? IEEE Transactions on Pattern Analysis and MachineIntelligence, 26(2):147–159, 2004. → pages 23[68] A. Kovashka, D. Parikh, and K. Grauman. WhittleSearch: image searchwith relative attribute feedback. In IEEE Conference on Computer Visionand Pattern Recognition, 2012. → pages 1[69] P. Kra¨henbu¨hl and V. Koltun. Efficient inference in fully connected CRFswith Gaussian edge potentials. In Advances in Neural InformationProcessing Systems, 2011. → pages 33, 40[70] P. Kra¨henbu¨hl and V. Koltun. Geodesic object proposals. In EuropeanConference on Computer Vision, 2014. → pages 14, 16[71] A. Krizhevsky. Learning multiple layers of features from tiny images.Technical report, 2009. → pages 58[72] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification withdeep convolutional neural networks. In Advances in Neural InformationProcessing Systems, 2012. → pages 3[73] D. Kuettel and V. Ferrari. Figure-ground segmentation by transferringwindow masks. In IEEE Conference on Computer Vision and PatternRecognition, 2012. → pages 13, 14[74] B. Kulis and T. Darrell. Learning to hash with binary reconstructiveembeddings. In Advances in Neural Information Processing Systems, 2009.→ pages 7, 5390[75] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. IEEETransactions on Pattern Analysis and Machine Intelligence, 34(6):1092–1104, 2012. → pages 53[76] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg. Joint semanticsegmentation and 3D reconstruction from monocular video. In EuropeanConference on Computer Vision, 2014. → pages 2, 21, 34[77] L. Ladicky´, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchicalCRFs for object class image segmentation. In IEEE InternationalConference on Computer Vision, 2009. → pages 3, 11, 33[78] L. Ladicky´, C. Russell, P. Kohli, and P. H. S. Torr. Graph cut basedinference with co-occurrence statistics. In European Conference onComputer Vision, 2010. → pages 11[79] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseenobject classes by between-class attribute transfer. In IEEE Conference onComputer Vision and Pattern Recognition, 2009. → pages xii, 1, 68, 72[80] F. Li, J. Carreira, G. Lebanon, and C. Sminchisescu. Composite statisticalinference for semantic segmentation. In IEEE Conference on ComputerVision and Pattern Recognition, 2013. → pages 13[81] B. Liu, S. Gould, and D. Koller. Single image depth estimation frompredicted semantic labels. In IEEE Conference on Computer Vision andPattern Recognition, 2010. → pages 2[82] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via labeltransfer. IEEE Transactions on Pattern Analysis and Machine Intelligence,33(12):2368–2382, 2011. → pages vii, viii, 1, 2, 5, 9, 10, 11, 19, 22, 24, 25[83] C. Liu, J. Yuen, and A. Torralba. SIFT Flow: dense correspondence acrossscenes and its applications. IEEE Transactions on Pattern Analysis andMachine Intelligence, 33(5):978–994, 2011. → pages 5, 12[84] S. Liu, X. Liang, L. Liu, X. Shen, J. Yang, C. Xu, L. Lin, X. Cao, andS. Yan. Matching-CNN meets KNN: quasi-parametric human parsing. InIEEE Conference on Computer Vision and Pattern Recognition, 2015. →pages 82[85] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashingwith kernels. In IEEE Conference on Computer Vision and PatternRecognition, 2012. → pages 5391[86] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of Exemplar-SVMsfor object detection and beyond. In IEEE International Conference onComputer Vision, pages 89–96, 2011. → pages 13[87] S. Manen, M. Guillaumin, and L. V. Gool. Prime object proposals withrandomized Prim’s algorithm. In IEEE International Conference onComputer Vision, 2013. → pages 14, 15[88] A. Martinovic´, J. Knopp, H. Riemenschneider, and L. V. Gool. 3D all theway: semantic segmentation of urban scenes from start to end in 3D. InIEEE Conference on Computer Vision and Pattern Recognition, 2015. →pages 2, 7, 34[89] K. Matzen and N. Snavely. Scene chronology. In European Conference onComputer Vision, 2014. → pages 1[90] S. McCann and D. G. Lowe. Spatially local coding for object recognition.In Asian Conference on Computer Vision, 2012. → pages 20[91] H. Myeong, J. Y. Chang, and K. M. Lee. Learning object relationships viagraph-based context model. In IEEE Conference on Computer Vision andPattern Recognition, 2012. → pages 6, 9, 12, 25[92] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J.Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion:Real-time dense surface mapping and tracking. In InternationalSymposium on Mixed and Augmented Reality, 2011. → pages 35, 37, 38[93] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Densetracking and mapping in real-time. In IEEE International Conference onComputer Vision, 2011. → pages 37[94] M. Nießner, M. Zollho¨fer, S. Izadi, and M. Stamminger. Real-time 3dreconstruction at scale using voxel hashing. ACM Transactions onGraphics, 32(6):169:1–169:11, 2013. → pages 38[95] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binarycodes. In International Conference in Machine Learning, 2011. → pages7, 53[96] M. Norouzi and D. J. Fleet. Cartesian k-means. In IEEE Conference onComputer Vision and Pattern Recognition, 2013. → pages 54, 56, 5892[97] A. Oliva and A. Torralba. Modeling the shape of the scene: a holisticrepresentation of the spatial envelope. International Journal of ComputerVision, 42(3):145–175, 2001. → pages 19, 58[98] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: describing images using1 million captioned photographs. In Advances in Neural InformationProcessing Systems, 2011. → pages 5[99] V. Ordonez, J. Deng, Y. Choi, A. C. Berg, and T. L. Berg. From large scaleimage categorization to entry-level categories. In IEEE InternationalConference on Computer Vision, 2013. → pages 2[100] C. Pantofaru, C. Schmid, and M. Hebert. Object recognition by integratingmultiple image segmentations. In European Conference on ComputerVision, 2008. → pages 13[101] D. Parikh and K. Grauman. Relative attributes. In IEEE InternationalConference on Computer Vision, 2011. → pages 1[102] F. Perronnin, J. Sa´nchez, and T. Mensink. Improving the Fisher kernel forlarge-scale image classification. In European Conference on ComputerVision, 2010. → pages 2, 54, 63[103] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie.Objects in context. In IEEE International Conference on Computer Vision,2007. → pages 11[104] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes fromshift-invariant kernels. In Advances in Neural Information ProcessingSystems, 2009. → pages 53[105] M. Rastegari, C. Keskin, P. Kohli, and S. Izadi. Computationally boundedretrieval. In IEEE Conference on Computer Vision and PatternRecognition, 2015. → pages 54[106] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net:ImageNet classification using binary convolutional neural networks. InEuropean Conference on Computer Vision, 2016. → pages 83[107] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-timeobject detection with region proposal networks. In Advances in NeuralInformation Processing Systems, 2015. → pages 2, 3, 15, 41, 4693[108] H. Riemenschneider, A. Bo´dis-Szomoru´, J. Weissenberg, and L. V. Gool.Learning where to classify in multi-view semantic segmentation. InEuropean Conference on Computer Vision, 2014. → pages 7, 34[109] E. Rosten and T. Drummond. Machine learning for high-speed cornerdetection. In European Conference on Computer Vision, 2006. → pages 38[110] C. Rother, S. Kumar, V. Kolmogorov, and A. Blake. Digital tapestry. InIEEE Conference on Computer Vision and Pattern Recognition, 2005. →pages 23[111] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei.ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575,2014. → pages xii, 3, 51, 69, 73[112] B. C. Russell, A. Torralba, K. Murphy, and W. T. Freeman. LabelMe: adatabase and web-based tool for image annotation. International Journal ofComputer Vision, 77(1-3):157–173, 2008. → pages 4, 10, 28[113] S. Sengupta, E. Greveson, A. Shahrokni, and P. H. S. Torr. Urban 3Dsemantic modelling using stereo vision. In IEEE International Conferenceon Robotics and Automation, 2013. → pages 2, 3, 7, 32, 33, 47, 48[114] F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete hashing. InIEEE Conference Computer Vision and Pattern Recognition, 2015. →pages xi, xii, 7, 54, 63, 66, 67, 68, 70, 72, 73[115] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Jointappearance, shape and context modeling for multi-class object recognitionand segmentation. In European Conference on Computer Vision, 2006. →pages 11[116] B. Shuai, G. Wang, Z. Zuo, B. Wang, and L. Zhao. Integrating parametricand non-parametric models for scene labeling. In IEEE Conference onComputer Vision and Pattern Recognition, 2015. → pages 82[117] K. Simonyan and A. Zisserman. Very deep convolutional networks forlarge-scale image recognition. In International Conference on LearningRepresentations, 2015. → pages xii, 41, 51, 69, 72[118] G. Singh and J. Kosˇecka´. Nonparametric scene parsing with adaptivefeature relevance and semantic context. In IEEE Conference on ComputerVision and Pattern Recognition, 2013. → pages 6, 9, 12, 2594[119] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-leveldiscriminative patches. In European Conference on Computer Vision,2012. → pages 1[120] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain.Content-based image retrieval at the end of the early years. IEEETransactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000. → pages 1[121] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photocollections in 3D. In ACM SIGGRAPH, 2006. → pages 1[122] S. Song, S. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D sceneunderstanding benchmark suite. In IEEE Conference on Computer Visionand Pattern Recognition, 2015. → pages xi, 3, 63, 68, 70[123] C. J. Stone. Consistent nonparametric regression. The Annals of Statistics,5(4):595–620, 1977. → pages 4[124] P. Sturgess, K. Alahari, L. Ladicky´, and P. H. S. Torr. Combiningappearance and structure from motion features for road sceneunderstanding. In British Machine Vision Conference, 2009. → pages 33[125] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow,and R. Fergus. Intriguing properties of neural networks. In InternationalConference on Learning Representations, 2014. → pages 4[126] K. Tateno, F. Tombari, and N. Navab. When 2.5d is not enough:simultaneous reconstruction, segmentation and recognition on denseSLAM. In IEEE International Conference on Robotics and Automation,2016. → pages 35[127] J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric imageparsing with superpixels. International Journal of Computer Vision, 101(2):329–349, 2013. → pages vii, ix, 1, 6, 9, 12, 13, 22, 23, 25, 26, 28, 29,31[128] J. Tighe and S. Lazebnik. Finding things: image parsing with regions andper-exemplar detectors. In IEEE Conference on Computer Vision andPattern Recognition, 2013. → pages 13, 25, 28[129] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a largedata set for nonparametric object and scene recognition. IEEE Transactions95on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.→ pages 5, 58[130] F. Tung and J. J. Little. CollageParsing: Nonparametric scene parsing byadaptive overlapping windows. In European Conference on ComputerVision, 2014. → pages iii, 82[131] F. Tung and J. J. Little. SSP: Supervised Sparse Projections for large-scaleretrieval in high dimensions. In Asian Conference on Computer Vision,2016. → pages iv[132] F. Tung and J. J. Little. Scene parsing by nonparametric label transfer ofcontent-adaptive windows. Computer Vision and Image Understanding,143:191–200, 2016. → pages iii, 23[133] F. Tung, J. Martinez, H. H. Hoos, and J. J. Little. Bank of quantizationmodels: A data-specific approach to learning binary codes for large-scaleretrieval applications. In IEEE Winter Conference on Applications ofComputer Vision, 2015. → pages iii[134] T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell. The NBNN kernel. InIEEE International Conference on Computer Vision, pages 1824–1831,2011. → pages 20[135] J. P. C. Valentin, S. Sengupta, J. Warrell, A. Shahrokni, and P. H. S. Torr.Mesh based semantic modelling for indoor and outdoor scenes. In IEEEConference on Computer Vision and Pattern Recognition, 2013. → pages2, 3, 7, 32, 33, 47, 48[136] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M.Smeulders. Segmentation as selective search for object recognition. InIEEE International Conference on Computer Vision, 2011. → pages 14, 15[137] E. van den Berg and M. P. Friedlander. Probing the pareto frontier for basispursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912,2008. → pages 67[138] V. Vineet, O. Miksik, M. Lidegaard, M. Nießner, S. Golodetz, V. A.Prisacariu, O. Ka¨hler, D. W. Murray, S. Izadi, P. Pe´rez, and P. H. S. Torr.Incremental dense semantic stereo fusion for large-scale semantic scenereconstruction. In IEEE International Conference on Robotics andAutomation, 2015. → pages 2, 3, 7, 32, 34, 46, 47, 4896[139] C. Wang, N. Komodakis, and N. Paragios. Markov Random Fieldmodeling, inference & learning in computer vision & image understanding:A survey. Computer Vision and Image Understanding, 117(11):1610–1627, 2013. → pages 11[140] S. Wang, S. Fidler, and R. Urtasun. Holistic 3D scene understanding from asingle geo-tagged image. In IEEE Conference on Computer Vision andPattern Recognition, 2015. → pages 82[141] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances inNeural Information Processing Systems, 2008. → pages 53, 58[142] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutionalneural networks for mobile devices. In IEEE Conference on ComputerVision and Pattern Recognition, 2016. → pages 5[143] Y. Xia, K. He, P. Kohli, and J. Sun. Sparse projections for high-dimensionalbinary codes. In IEEE Conference on Computer Vision and PatternRecognition, 2015. → pages xi, xii, 7, 54, 55, 63, 68, 70, 71, 72, 73[144] J. Xiao and L. Quan. Multiple view semantic segmentation for street viewimages. In IEEE International Conference on Computer Vision, 2009. →pages 33[145] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database:large-scale scene recognition from abbey to zoo. In IEEE Conference onComputer Vision and Pattern Recognition, pages 3485–3492, 2010. →pages 19, 28, 68[146] J. Yang, B. Price, S. Cohen, and M. Yang. Context driven scene parsingwith attention to rare classes. In IEEE Conference on Computer Vision andPattern Recognition, 2014. → pages 13, 25, 26, 28, 30[147] F. X. Yu, S. Kumar, Y. Gong, and S.-F. Chang. Circulant binary embedding.In International Conference in Machine Learning, 2014. → pages 54, 55[148] Q. Zhang, X. Shen, L. Xu, and J. Jia. Rolling guidance filter. In EuropeanConference on Computer Vision, 2014. → pages 1[149] X. Zhang, F. X. Yu, R. Guo, S. Kumar, S. Wang, and S.-F. Chang. Fastorthogonal projection based on Kronecker product. In IEEE InternationalConference on Computer Vision, 2015. → pages 5497[150] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deepfeatures for scene recognition using Places database. In Advances inNeural Information Processing Systems, 2014. → pages xi, 68, 70[151] C. L. Zitnick and P. Dolla´r. Edge Boxes: locating object proposals fromedges. In European Conference on Computer Vision, 2014. → pages viii,6, 10, 14, 16, 19, 20, 4798