UBC Theses and Dissertations

Object categorization and localization with spatially localized features McCann, Sancho 2014

Object classification and localization with spatially localized features

by

Sancho McCann

B.C.Sc., University of Manitoba, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

April 2014

© Sancho McCann, 2014

Abstract

Object classification and localization are important components of image understanding. For a computer to interact with our world, it will need to identify the objects in our world. At a more basic level, these tasks are crucial to many practical applications: image organization, visual search, autonomous vehicles, and surveillance. This thesis presents alternatives to the currently popular approaches to object classification and localization, specifically focusing on methods that more tightly integrate location information with the visual features.

We start by improving on Naive Bayes Nearest Neighbor (NBNN), an alternative to the standard bag-of-words/spatial pyramid classification pipeline. This model matches localized spatial-domain features between a test image and the entire training set in order to classify an image as belonging to one of several categories. We improve this method's classification performance and algorithmic complexity.

However, the nature of NBNN results in prohibitive memory requirements on large datasets. This leads to our second contribution: a bag-of-words model based on a clustering of location-augmented features. This is a simple and more flexible approach to modeling location information than the commonly used spatial pyramid. By using location-augmented features, location information is captured simply in the nearest-neighbor coding of the bag-of-words model. This results in a more efficient use of model dimensions than the spatial pyramid and higher classification performance than state-of-the-art alternatives.

Last, we present the design of an object localization system using this high-performance classifier. This design is made more difficult by the fact that our model does not satisfy the assumptions made by recent efficient localization algorithms. Our method uses a Hough transform based on an approximation to our model, followed by a more accurate refinement of classifier scores and bounding boxes. We show its effectiveness on the widely used and practical Daimler monocular pedestrian dataset.

These contributions show that simple location-augmented features, soft nearest-neighbor coding, and linear Support Vector Machines (SVMs) can outperform the long-used and optimized spatial pyramid methods, and that this approach warrants additional research to continue to improve its efficiency.

Preface

Chapters 3 and 4 are based on publications at peer-reviewed academic conferences. Chapter 5 is based on a concurrent submission to a peer-reviewed academic conference. My supervisor, Dr. David G. Lowe, played a major role through planning and discussions, and he assisted in writing, mostly at the editing stage.

Chapter 3

Chapter 3 is based on [67]. I implemented the local naive Bayes nearest neighbor method, conducted all experiments, and wrote the majority of the paper. I published preliminary results of this work on arXiv in 2011, presented it at an ECCV workshop in 2011, and presented the peer-reviewed version at CVPR in 2012.

Chapter 4

Chapter 4 is based on [66]. I implemented spatially local coding, as well as several state-of-the-art spatial pyramid variants for comparison. I conducted all experiments and wrote the majority of the paper. I presented this work at ACCV 2012.

Chapter 5

Chapter 5 is based on a paper submitted for review.
I implemented the framework, conducted all experiments, and wrote the majority of the paper.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Visual recognition tasks
    1.1.1 Object instance recognition
    1.1.2 Classification
    1.1.3 Detection, localization
    1.1.4 Local features
  1.2 Contributions
    1.2.1 Local naive Bayes nearest neighbor
    1.2.2 Spatially local coding (SLC)
    1.2.3 Spatially local coding localization
2 Preliminaries
  2.1 Object instance recognition
  2.2 Classification
  2.3 Object category localization
  2.4 Evaluation
    2.4.1 Performance measures
3 Local Naive Bayes Nearest Neighbor
  3.1 Introduction
  3.2 Relation to previous work
  3.3 Naive Bayes nearest neighbor
  3.4 Towards local NBNN
  3.5 Local naive Bayes nearest neighbor
    3.5.1 Approximate nearest neighbors and complexity
  3.6 Experiments and results
    3.6.1 Tuning local naive Bayes nearest neighbor
    3.6.2 Scaling to many classes
    3.6.3 Comparisons with other methods
  3.7 Chapter summary
4 Spatially Local Coding
  4.1 Introduction
  4.2 Related methods
    4.2.1 The coding/pooling pipeline
    4.2.2 Other spatial models
  4.3 Spatially local coding
    4.3.1 Multi-level spatially local coding
  4.4 Optimal codebook size and model size
  4.5 Efficient codebook construction
  4.6 Experiments
    4.6.1 Notes on timing
  4.7 Chapter summary
5 Localization with Spatially Local Coding
  5.1 Introduction
  5.2 Related work
  5.3 Spatially local coding
  5.4 Detection methods
    5.4.1 Efficient sub-window search
    5.4.2 Hough transform
  5.5 The detection pipeline
    5.5.1 Approximate SLC Hough transform
    5.5.2 Optimizing Hough predictions
    5.5.3 Mining hard negative examples
  5.6 Evaluation
  5.7 Chapter summary
6 Conclusion
  6.1 Discussion
  6.2 Future Work
Bibliography

List of Tables

Table 1.1: A rough taxonomy of visual recognition tasks, depending on their level of object specificity and level of localization within an image.

Table 2.1: Pseudo-code from [58] showing the duality between the sliding window and Hough transform approaches. They each compute the same scores, S(λ), for each hypothesis, λ. The sliding window approach constructs these scores window-by-window. The Hough transform constructs these scores feature-by-feature. Λ is the hypothesis space, F is the feature set, and W(λ, f) is the contribution of a feature to a particular hypothesis score.

Table 3.1: Effect of restricting increments to only the positive increments on a downsampled (128x128) version of the Caltech 101 dataset. The ± shows one standard deviation.

Table 3.2: Our Local Naive Bayes Nearest Neighbor (LNBNN) has consistent improvement over the original NBNN, outperforming all previously published results for NBNN using a single descriptor.
We confirm NBNN outperforms the original spatial pyramid method, but is only competitive with the latest state-of-the-art variant.

Table 4.1: A taxonomy of histogram-based recognition models focusing on how they attend to appearance and spatial locality. Spatially local coding moves spatial locality into the coding step.

Table 4.2: Combining multiple Spatially Local Coding (SLC) codebooks. The experimental setting of these results is explained in detail in Section 4.6. This shows that the combination of multiple codebooks, each with a different λ-weighting, produces higher classification accuracy than any of the codebooks individually.

Table 4.3: Results on Caltech 101 (using 15 and 30 training examples per class) and Caltech 256 (using 30 training examples per class). We compare our SLC method against previously published figures and against re-implementations of the original spatial pyramid and localized soft assignment tested on our feature set. The numbers we report for our re-implementations of previous methods are the best results we could achieve from among a range of codebook dimensionalities. Bold method names show our experiments; the others are from the literature.

List of Figures

Figure 1.1: Understanding the world requires recognizing, classifying, and localizing things. Pedestrian detection and localization is one such application.

Figure 1.2: An example of the object instance recognition problem. In this case the graffiti mural is recognized across images after undergoing an affine transformation.

Figure 1.3: Examples from Caltech 256.

Figure 1.4: Components of the classification/localization pipeline. Chapter 3 describes LNBNN, an improvement to NBNN. Chapter 4 describes SLC, a novel approach to coding in the coding/pooling framework. Chapter 5 describes a localization system for SLC.

Figure 2.1: Semi-local constraints refers to checking that features that are matched based on appearance also match in relative position, rotation, and scale. This greatly reduces the number of spurious matches between the test image and the training database.

Figure 2.2: The range of problems from exact image matching to object salience can be seen as various points on a specificity-invariance spectrum.

Figure 2.3: Visualization of the hierarchical pooling regions and associated histograms in the spatial pyramid model of [56]. (Figure also from [56]; used with permission.)

Figure 2.4: Illustration of the Implicit Shape Model (ISM) of [60]. It is a generative model, and localization uses a probabilistic, generalized Hough transform. (Figure also from [60]; used with permission.)

Figure 3.1: Local NBNN: instead of considering classes individually, we search one merged index.

Figure 3.2: NBNN finds the nearest neighbor from each of the classes (the shapes, in this figure). LNBNN retrieves only the local neighborhood, finding nearest neighbors from only some of the classes. The shaded descriptors are those that would be used for updating the distance totals. We only use the closest member from any class, and don't find an example from each class.

Figure 3.3: The effect of changing k, the number of nearest neighbors retrieved from the merged index. Using only the local neighbors (about 10) results in optimal performance. The absolute performance numbers are lower than in our final results because we extracted fewer SIFT descriptors for this experiment.

Figure 3.4: Comparison of accuracy against computation for NBNN vs. LNBNN. Computation is determined primarily by the number of distance checks (c in this figure) allowed in the approximate nearest neighbor search. For 101 classes, even a single check in each of the 101 indices is more expensive than one search with thousands of checks in the merged index, due to the overhead of traversing each tree. These results were obtained on Caltech 101, using a sparser sampling of descriptors than in our final results.

Figure 3.5: The scaling behaviour of NBNN and LNBNN. We varied the number of categories from 2 up to 256 and plot the run time of the two methods. When classifying 256 categories, our method is 100 times faster than the original.

Figure 4.1: The performance of single-codebook SLC is dependent on the choice of λ. Regardless of the choice of λ, the optimal codebook dimensionality is similar. Based on this observation, we can choose to use the same dimensionality across all of the codebooks in a multi-λ model.

Figure 4.2: Visualization of top-ranked features (hand-picked from among the top ten) for five of the object categories. The top row is the spatial distribution of each codeword and the second row is the average appearance of each codeword (as in Fig. 4.3). The subsequent rows show where occurrences of the codewords are centered in particular training images. Looking down each column shows that these codewords tend to be associated with particular parts or textures.

Figure 4.3: Each pair of rows is a visualization of the top ten features from an object category. The scatter-plots show the spatial distribution of each codeword, while the grey-scale snippets show their average appearance.

Figure 4.4: Effect of dimensionality on SLC performance. We compare the performance of the original spatial pyramid model with the localized soft assignment variant and our SLC model. The localized soft assignment performs better than the original and requires larger codebook sizes for optimal performance. Our SLC model achieves even better performance, but requires even more codebook dimensions. When comparing the model dimensionalities (21× the size of the codebook to account for the spatial pyramid bins, or 4× the size of the codebook to account for our 4-codebook SLC), SLC outperforms the others at the equivalent model dimensionalities.

Figure 4.5: The spatial pyramid model has many bins with relatively low levels of support. Bins are used for appearance/location combinations that occur only rarely in the training data. The support per bin in the SLC model is more normally distributed. If a bin is created, it is because there is support for it in the training data.

Figure 4.6: Codebooks produced by approximate k-means give similar classification accuracies as codebooks produced by the exact k-means algorithm. The construction time for several data points is shown. In this experiment, we clustered 1,000,000 features into 4096 codewords to use in a bag-of-features model. Initialization was with kmeans++. The dataset was Caltech 101, using 15 training images per category. (Dataset details are in Section 4.6.)

Figure 4.7: Our model yields best results with a linear SVM. Only at low dimensionalities on Caltech 256, where the performance of both SVM types is low, does the intersection kernel outperform the linear SVM. At optimal codebook dimensionalities, our model's performance using a linear SVM dominates the performance of the histogram intersection kernel.

Figure 5.1: The 5-D Hough space. For each (a, s) ∈ A × S, we build an x, y grid of Hough bins, each of which tracks per-codeword votes. After max-pooling voting is complete, we collapse each histogram into a Hough score for each bin using Equation 5.1.

Figure 5.2: We are able to maintain high recall while eliminating more than half of the candidate windows by discarding Hough bins receiving negative scores. This experiment runs Algorithm 3 with a varying threshold. (Recall in this experiment only reaches 86% due to our choice of grid resolution and minimum scale.)

Figure 5.3: By focusing only on peaks, we eliminate many candidate locations. Thresholding at zero further reduces the number of candidate windows without reducing recall. This experiment runs Algorithm 4 with a varying threshold.

Figure 5.4: By re-scoring the Hough peaks, and then refining their locations using gradient descent, we retrieve the performance of the sliding window detector while gaining a large speedup. All methods were subject to non-maximum suppression to eliminate overlapping predictions.

Figure 5.5: Three rounds of training with hard negative mining results in a significant improvement in classifier accuracy.

Figure 5.6: Results on the Daimler monocular pedestrians dataset. (The data is taken from [25]. We have extracted the performance curves for the single-feature, non-motion methods that they evaluate. Methods incorporating multiple features and motion cues perform even better.)

Figure 5.7: Examples of typical detections made by our detector on the Daimler monocular pedestrian dataset. We've displayed detections above the threshold associated with the equal error rate. The numbers attached to each detection report the SLC score. True positives are outlined in green. Ground truth annotations are outlined in blue. False positives are outlined in red.
Figure 5.8: A more representative sample of localization performance. These examples were sampled uniformly from the test set. Detections are presented if they are above the threshold that produced the equal error rate. Note that, in this dataset, pedestrians below a certain height are not included in the evaluation protocol, nor are bicyclists or occluded pedestrians.

Glossary

DPM: Deformable Part Model
ESS: Efficient Subwindow Search
FLANN: Fast Library for Approximate Nearest Neighbors
HOG: Histograms of Oriented Gradients
ISM: Implicit Shape Model
LNBNN: Local Naive Bayes Nearest Neighbor
NBNN: Naive Bayes Nearest Neighbor
PRISM: Principled Implicit Shape Model
SIFT: Scale Invariant Feature Transform
SLC: Spatially Local Coding
SPM: Spatial Pyramid Model
SVM: Support Vector Machine
VOC: Visual Object Classes

Acknowledgments

Foremost, I would like to thank my supervisor, Dr. David Lowe, for his patience, encouragement, mentorship, and insight over these seven years.

I thank my committee members Dr. Jim Little and Dr. Nando de Freitas, and other members of the Laboratory for Computational Intelligence, especially Dr. Bob Woodham. You have all provided inspiration and interesting discussion in the courses I've taken, our weekly reading groups, and conference travels.

Thank you to all of my friends at UBC who've come and gone, and the many who are still here, especially UBC Ultimate. You've made this a great place to go to school, for work and for play.

Last, I thank my family, for your love and support throughout my life.

Chapter 1

Introduction

If we want computers and robots to do work for us, in the real world, they will need to understand our world.
This relies in large part on visual understanding. Recognizing objects, classifying and localizing things are three vital components to understanding our world.

1.1 Visual recognition tasks

Roughly, visual recognition tasks can be divided based on their object and localization specificity, as shown in Table 1.1.

Table 1.1: A rough taxonomy of visual recognition tasks, depending on their level of object specificity and level of localization within an image.

                      | Instances                           | Categories
 Whole-image analysis | Near-duplicate detection            | Image/object classification (Section 1.1.2)
 Localization         | Object instance recognition and     | Object category localization, e.g. faces, pedestrians,
                      | registration (Section 1.1.1)        | the VOC detection challenge [30] (Section 1.1.3)

Figure 1.1: Understanding the world requires recognizing, classifying, and localizing things. Pedestrian detection and localization is one such application.

1.1.1 Object instance recognition

Object instance recognition refers to the detection and localization/registration of already-seen things. Examples include landmark recognition [77, 78], object instance detection and retrieval [63], and panorama creation [14]. Production applications of instance recognition systems are used for media fingerprinting for copyright infringement detection [47], Google Goggles [38], and face recognition in photo library management software [36]. This involves matching each visible piece of the object in the test stimulus with a piece of a training instance. Figure 1.2 visualizes this problem. While not a focus of this thesis, this can be seen as simply a more specific variant of object categorization, and the solutions to it have informed much object classification and localization work.

Figure 1.2: An example of the object instance recognition problem. In this case the graffiti mural is recognized across images after undergoing an affine transformation.

1.1.2 Classification

A less specific visual perception task is classification: given an image of an object, identify that object as being a member of one of several categories of interest [76]. The term is also used to describe the more general problem of determining whether or not an object of a particular category is present in an image [21, 30].¹ The former can be seen as a special case of the latter: if you can tell me whether an image contains a dog, you can tell me whether or not an image is of a dog.

In the general sense, classification is multi-label: an image may contain several categories of objects. Instead of an image being of "a dog" or "a cat", this sense of classification determines if an image is of "a scene containing a dog" or "a scene containing a cat". The difference is perhaps only semantic, but while classification of a given object learns about the object, the more general sense of image classification learns about the object and possibly the context in which the object is generally found.

Widely used datasets testing the classification problem include Caltech 101 [32], Caltech 256 [39], and 15 scenes [31, 56]. For example, Caltech 256 tests a system's ability to classify an image into one of 256 diverse object categories: bears, bowling pins, dice, hourglasses, penguins, and skateboards, among other things. (See Figure 1.3.)

¹ Perona [76] calls this detection, a term we reserve to be synonymous with localization.

Figure 1.3: Examples from Caltech 256. (a) Bears, (b) Bowling pins, (c) Dice.

Classification algorithms are often used as a means to an end in localization, but other applications make classification important for its own sake. Classification can be used to organize large collections of personal photos [35, 83] and rank collections for retrieval tasks.
In many applications, regions of interest are determined by motion segmentation [41], visual salience [68], or other means, and those regions are then classified. Examples fitting this pipeline include coffee bean sorting [29], surveillance analytics [41], and handwritten character classification [72]. Classification of an image's scene can also be used to inform segmentation or object detection priors [61, 70].

1.1.3 Detection, localization

The terms detection and localization are not used consistently in the literature. As mentioned in Footnote 1, Perona uses detection to refer to the general version of the classification problem: "[An image] may contain an exemplar of a given category. Is it there?" [76] However, we will use "detection" to be synonymous with "localization" [25, 28, 30, 76]. That is: find the X in this image, and indicate its location, if one exists.

Instead of producing a label (or set of labels) for the image (as in classification), detection/localization produces a (possibly null) set of bounding boxes predicting the presence of the object category of interest. Localization can be seen simply as localized classification: classify each subwindow of an image as either containing or not containing the object category of interest. Some methods literally do this by sliding a classifier over the image in a coarse grid [1, 20, 71]. Others make approximations [53] or use models that are not strictly classifiers [60].

Many applications need a good object category localization algorithm: face localization in digital cameras [52], pedestrian localization in automated driving systems [28], robotic visual search [68], and tracking (for initialization) [64].

1.1.4 Local features

The foundation of the methods presented in this thesis is the local feature. Local features are extracted in the spatial domain from compact regions, each covering a relatively small portion of an image.
These features describe the local texture, gradient information, and shape characteristics of a small region of an image. Examples include SIFT [63], SURF [6], and BRIEF [15]. The presence and spatial layout of particular features provides information regarding the category of object in the image.

1.2 Contributions

The contributions of this thesis are focused on the problems of classification and localization. The models proposed are alternatives to popular contemporaries. The theme of these proposed models is spatially localized visual features that allow more flexible treatment of feature location.

Figure 1.4: Components of the classification/localization pipeline. Chapter 3 describes LNBNN, an improvement to NBNN. Chapter 4 describes SLC, a novel approach to coding in the coding/pooling framework. Chapter 5 describes a localization system for SLC.

1.2.1 Local naive Bayes nearest neighbor

The first model is an improvement to the Naive Bayes Nearest Neighbor (NBNN) method that decreases its runtime and increases its classification accuracy. NBNN [9] is a classification method based on classification by composition: "[determining] to what degree one signal can be composed, as a puzzle, from chunks taken from another signal" [8]. By composing a test image out of patch-based nearest neighbors from the training set, NBNN computes the image-to-category distance from the test image to each of the training categories. It is a very simple method that obtained high performance merely using nearest-neighbor distances and no training phase.

An important component of this nearest-neighbor matching is a location term that encourages matches to correspond not only in appearance, but also in relative location.

Our contribution, Local Naive Bayes Nearest Neighbor (LNBNN), addressed a major efficiency issue with NBNN: it scaled linearly with the number of categories. This is a limitation common to many object classification methods.
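The NBNN decision rule described above can be sketched in a few lines. This is a minimal brute-force illustration, not the thesis implementation (which adds a location term and approximate nearest-neighbor indices); the function name and data layout are ours.

```python
import numpy as np

def nbnn_classify(test_descriptors, class_databases):
    """Classify an image with the NBNN decision rule.

    test_descriptors: (n, d) array of local descriptors from the test image.
    class_databases: dict mapping class label -> (m, d) array of all
        training descriptors from that class.
    Returns the label minimizing the summed nearest-neighbor distances.
    Note the per-class loop: runtime grows linearly with the number of
    classes, which is the inefficiency that LNBNN's single merged index
    is designed to remove.
    """
    totals = {}
    for label, db in class_databases.items():
        total = 0.0
        for d in test_descriptors:
            # Squared distance from d to its nearest neighbor in this class.
            total += np.min(np.sum((db - d) ** 2, axis=1))
        totals[label] = total
    return min(totals, key=totals.get)
```

In practice the brute-force scan over each class database would be replaced by an approximate nearest-neighbor index per class (or, in LNBNN, one merged index over all classes).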
As applications become interested in growing numbers of object categories, it is increasingly important to focus on this aspect of scalability. In Chapter 3, we present an alternative nearest-neighbor look-up strategy that results in scaling that is logarithmic with the number of categories under consideration, and in improved classification performance.

We present an evaluation on two widely used classification datasets, Caltech 101 and Caltech 256, comparing against NBNN and contemporary spatial pyramid methods, showing a 100× speed-up on Caltech 256.

1.2.2 Spatially local coding (SLC)

In Chapter 4, we present an object model that is a half-way point between NBNN and the more traditional spatial pyramid approach to classification, using codewords based on the location-augmented features of NBNN and LNBNN. Traditional codebook-based approaches like bags-of-words and spatial pyramids build dictionaries based solely on patch appearances. Spatially Local Coding (SLC) clusters using location-augmented features. Spatial pyramids aggregate features into histograms in regions arranged in a fixed hierarchical grid. This fixed arrangement is an inefficiency in the spatial pyramid representation.

SLC is a more flexible way of including location information than the fixed grids of the spatial pyramid, and it makes better use of the model dimensions, ensuring that each has support in the training data. It outperforms state-of-the-art spatial pyramid models on classification problems.

1.2.3 Spatially local coding localization

The final contribution is described in Chapter 5. It adapts the SLC model to the object localization problem. Related state-of-the-art efficient search algorithms make certain assumptions about the underlying classifier model that SLC does not satisfy.
Thus, adapting the SLC model to object localization requires innovation to integrate it with efficient search strategies.

We describe an approximate Hough transform to quickly identify regions of the image warranting verification and location refinement using the SLC model, and we evaluate performance on the Daimler monocular pedestrian dataset [28].

Chapter 2

Preliminaries

In this chapter, we review relevant related work in the areas of object instance recognition, object classification, and localization.

2.1 Object instance recognition

Schmid and Mohr [80] provided an early example of the local-feature approach to instance recognition that remains popular today. It is the approach that underlies the contributions of this thesis. They were one of the first to use local features extracted at interest points to represent object instances. Compared to previous work that used global representations (illumination or color histograms, for example), the local-feature representation handled partial visibility. Previous geometric methods had difficulty representing objects that were not well represented by a set of geometric primitives.

They recognized the importance of repeatability in interest point detection, selecting the Harris corner detector [40] for this reason. For retrieval, they used nearest neighbor matches in a database along with semi-local constraints to eliminate false matches.

This was the approach that Lowe [63] adapted to the Scale Invariant Feature Transform (SIFT). Lowe engineered SIFT to be a feature invariant to scale, rotation, illumination change, and small viewpoint changes. This feature, along with alternatives optimized for specific applications (SURF [6], BRIEF [79], ORB [15], and FREAK [2]), has been the basis for most instance recognition work since.

Figure 2.1: Semi-local constraints refers to checking that features that are matched based on appearance also match in relative position, rotation, and scale. This greatly reduces the number of spurious matches between the test image and the training database.

One way of looking at instance recognition and object classification approaches is their handling of the specificity-invariance tradeoff [81]. The work preceding Schmid and Mohr had low specificity and tended not to be invariant to partial visibility. Schmid and Mohr contributed invariance to partial visibility. Schmid and Mohr, and Lowe, increased specificity by using semi-local constraints to ensure that geometry matches between query and retrievals. Lowe added invariance by designing a feature that was robustly detected across scale, rotation, and illumination changes.

Regarding the specificity-invariance trade-off, what is acceptable is very task dependent. Tracking is one application of recognition that requires a very high level of specificity and can accept a low level of invariance. While tracking an object in a cluttered scene, it is important not to lose fixation on a particular object instance, but that object instance is not likely to undergo large changes in scale, rotation, or illumination between frames. Thus, for tracking, a recognition system that is highly specific is acceptable. One example of this is the tracking-learning-detection (TLD) system (a.k.a. Predator) [46]. It returns to a representation based solely on example illuminance, and is based on matching test patches against a set of positive examples extracted during previous tracking frames.

Figure 2.2: The range of problems from exact image matching to object salience can be seen as various points on a specificity-invariance spectrum.

The high levels of object-specificity and viewpoint/illumination-invariance desired by recognition systems have resulted in methods that maintain large datasets of local patches extracted from many training examples varying in viewpoint and lighting conditions.
The size of these datasets poses two problems: large memory requirements, and efficient matching. Memory requirements can be mitigated by quantizing features as in [73]. Efficient matching is handled by approximate nearest neighbor matching schemes as in [5], [78], and [69].

The Fast Library for Approximate Nearest Neighbors (FLANN) [69] is especially relevant as it is a recent meta-method that automatically selects among a range of approximate nearest neighbor algorithms and parameter settings.

2.2 Classification

Classification can be seen as another point on the specificity-invariance spectrum (Figure 2.2). Where instance recognition methods aim to be specific to a particular object, and invariant to appearance changes of that particular instance, object classification aims only to be specific to an object category, and to be invariant to appearance variation within that object category.

Csurka et al. [19] introduced the coding-pooling paradigm that has been the foundation of many successful classification methods since. Their bag-of-words model uses local features, as in recognition, but quantized into relatively few (1000) codewords. These codewords are elements of a dictionary learned by clustering features from a training set. They pool the coded features into a global histogram, which becomes the representation of the image. Their method increases invariance in two ways. First, the quantization of features into a small number of codewords increases invariance to local appearance: all features within the Voronoi region of a dictionary element are considered identical after quantization. Second, the global image pooling of quantized features into a single histogram increases invariance to relative feature location, in this case ignoring it completely.

A variety of machine learning approaches have been applied on top of this representation to complete the classifier framework.
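The coding-pooling idea can be sketched in a few lines. This is an illustrative toy only: toy 2-D descriptors stand in for 128-D SIFT descriptors, and a hand-picked codebook stands in for one learned by clustering a training set.

```python
# Minimal sketch of bag-of-words coding and pooling. The 2-D "descriptors"
# and the tiny hand-picked codebook are illustrative stand-ins; real systems
# use high-dimensional features and a k-means-learned dictionary.

def nearest_codeword(descriptor, codebook):
    """Hard-assign a descriptor to its closest codebook element (Euclidean)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(descriptor, codebook[k]))

def bag_of_words(descriptors, codebook):
    """Pool hard-coded descriptors into one normalized global histogram."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_codeword(d, codebook)] += 1.0
    n = sum(hist)
    return [h / n for h in hist] if n else hist

codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]            # toy "dictionary"
descriptors = [(0.1, 0.0), (0.9, 1.1), (1.1, 0.9), (2.1, -0.1)]
print(bag_of_words(descriptors, codebook))  # → [0.25, 0.5, 0.25]
```

Note how the two invariances described above appear here: all descriptors in a codeword's Voronoi region become the same histogram bin, and the pooling discards feature locations entirely.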
Examples of such machine learning approaches include naive Bayes [19], Support Vector Machine (SVM) [11, 19, 56], randomized decision trees [10], nearest neighbor [22], and SVM combined with k-nearest neighbors (SVM-KNN) [94].

As an example of the blurred lines between specific object recognition and object classification, this codeword approach also appeared in the object instance recognition context when Nistér and Stewénius [73] applied feature coding to landmark recognition. However, their approach used a very large vocabulary of one million codewords in order to maintain the specificity appropriate for instance recognition.

Sampling strategy. Csurka et al. [19] and several other codeword-based recognition experiments (e.g., [32, 83]) used sparse features sampled at interest points. However, Fei-Fei et al. [31] demonstrated that dense sampling (every 10 pixels) results in higher classification performance than random sampling or interest point detections. Interest point sampling yields features at very particular and repeatable parts of objects, provided the objects in fact look the same. For object and scene categorization, with higher intra-class variation, interest point sampling does not provide enough instance-to-instance overlap in the extracted representations. This observation has been confirmed multiple times (for car localization and image classification in [45], and for object and texture classification in [74]). Also consistent with dense sampling resulting in increased performance, Chatfield et al. [16] observed increased performance with increased extraction density.

A deficiency of the bag-of-words model is that the spatial arrangement of features is not represented. While this allows for variation in object appearance within an object class, completely ignoring the spatial arrangement of local features goes too far.

Discriminative models building on the bag-of-words model have focused on building more informative histograms during the pooling phase by using multiple pooling regions.
For example, the spatial pyramid [56] does this by building a whole-image histogram as in bag-of-words, but also a histogram for each quadrant of the image (upper-left, upper-right, lower-left, lower-right), and so on, in a hierarchical manner (Figure 2.3). More recent variants improve the coding, using sparse coding [93], local soft coding [62], or locality-constrained linear coding [91], for example. These all aim to distribute the representation of a local feature across several codebook elements.

Figure 2.3: Visualization of the hierarchical pooling regions and associated histograms in the spatial pyramid model of [56]. (Figure also from [56]; used with permission.)

In the 2012 Visual Object Classes (VOC) Challenge, all competition entrants to the "classification" challenge (detecting the presence of 20 object categories across thousands of natural images mined from Flickr) used variants of the bag-of-words or spatial pyramid models [30].

However, deep learning methods have been demonstrated to be effective at very large scale learning for visual recognition. In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [23], the top performing method in both the classification and localization challenges was a state-of-the-art deep learning method of Krizhevsky et al. [51]. The scale at which these methods are being used is much larger than the scale of training that is the focus of this thesis. For example, [51] uses a model with 60 million parameters, 650,000 neurons, and trains on 1.2 million labeled examples. Le et al. [57] use a model with 1 billion connections and train with 10 million unsupervised examples. Models of this size and representational capacity are extremely susceptible to overfitting when learning with small numbers of training examples as used in this thesis [27]. Even with 1.2 million examples, Krizhevsky et al.
needed to use a regularization method called "dropout" to avoid overfitting behaviour.

There are several other classifiers whose utility is largely due to their application to object localization (e.g., Histograms of Oriented Gradients (HOG) [20], the Implicit Shape Model (ISM) [60]), so we describe those in the next section.

2.3 Object category localization

Localization is the task of finding the location of an example of a particular category in an image, if one exists.

The simplest approach to this task is to treat localization as localized classification: evaluate a classifier at a set of locations in the image and return those locations with a classifier score above a threshold.

Improvements to object localization algorithms come in several forms:

• Improvement to the underlying classifier's accuracy. [11–13, 20, 34, 66, 91]
• Improvement to the underlying classifier's efficiency. [89, 93]
• Improvement to the sliding window search strategy. [53, 89]
• Design of a model specifically intended for integration with a localization algorithm. [58–60]

In a review paper, Andreopoulos and Tsotsos [3] observe that between 1996 and 2004, most localization work used a simple sliding window approach to localization. This highlights the importance of the quality of the underlying classifier to localization performance. Most advances in localization have attacked the problem of improving classification accuracy.

Figure 2.4: Illustration of the ISM of [60]. It is a generative model, and localization uses a probabilistic, generalized Hough transform. (Figure also from [60]; used with permission.)

Cascades. Viola and Jones [89] improved search efficiency by using a cascade of classifiers, allowing computation to be focused on only the most promising regions of the image. The system used simple Haar features that could be quickly computed using integral images.
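The basic localized-classification scan that such cascades accelerate can be sketched as below. This is a hedged illustration, not any particular system: `score_window` is a hypothetical stand-in for a trained classifier, and the toy scoring function simply peaks near one "true" object.

```python
# Sketch of localization as localized classification: slide a fixed-size
# window across the image, score each window with a classifier, and keep
# windows scoring above a threshold. `score_window` is a placeholder for
# any trained classifier (HOG+SVM, bag-of-words+SVM, ...).

def sliding_window_detect(image_w, image_h, win_w, win_h, stride,
                          score_window, threshold):
    detections = []
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            s = score_window(x, y, win_w, win_h)
            if s > threshold:
                detections.append((x, y, win_w, win_h, s))
    return detections

# Toy classifier: scores highest for windows anchored near (40, 20).
toy = lambda x, y, w, h: 1.0 - (abs(x - 40) + abs(y - 20)) / 100.0
hits = sliding_window_detect(100, 60, 20, 20, 10, toy, 0.8)
print(hits)  # the five windows anchored nearest (40, 20) survive
```

The cost of this scan is one classifier evaluation per window, which is exactly what cascades and the branch-and-bound methods described below try to reduce.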
By training boosted classifiers on top of these simple features, classifiers of increasing accuracy and computational demands were learned. These classifiers were used in a cascade: a sliding window applied the cheapest classifier first, and more accurate classifiers were progressively applied to smaller subsets of the image at each stage.

Not all localization methods are based on windowed classifiers. Two examples are the Implicit Shape Model (ISM) of Leibe et al. [60], and the efficient subwindow search of Lampert et al. [53].

ISM [60] is a generative model that uses a probabilistic Hough transform in order to avoid the sliding window approach altogether (see Figure 2.4). In this approach, each extracted local feature casts votes for the possible object locations that are consistent with the feature. The maxima in this voting space correspond to the most likely object locations.

Lampert et al. [53] also sidestep the problem of sliding window efficiency by recasting the search for an optimal subwindow as a branch-and-bound search over subwindow sets. This method is called Efficient Subwindow Search (ESS). The only requirement on the classifier is that one must be able to extract per-codeword additive weights from the classifier decision function.

Sliding window:
    for λ ∈ Λ do
        for f such that W(λ, f) ≠ 0 do
            S(λ) ← S(λ) + W(λ, f)

Hough transform:
    for f ∈ F do
        for λ such that W(λ, f) ≠ 0 do
            S(λ) ← S(λ) + W(λ, f)

Table 2.1: Pseudo-code from [58] showing the duality between the sliding window and Hough transform approaches. They each compute the same scores, S(λ), for each hypothesis, λ. The sliding window approach constructs these scores window-by-window. The Hough transform constructs these scores feature-by-feature. Λ is the hypothesis space, F is the feature set, and W(λ, f) is the contribution of a feature to a particular hypothesis score.
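The duality in Table 2.1 is easy to verify in code: both loop orders accumulate exactly the same hypothesis scores S(λ); only the iteration order differs. The tiny sparse table `W` below is an arbitrary illustrative example.

```python
# Both traversal orders of a sparse contribution table W[(hypothesis, feature)]
# yield identical hypothesis scores S(lambda); see Table 2.1.

from collections import defaultdict

W = {("lam1", "f1"): 0.5, ("lam1", "f2"): 0.25, ("lam2", "f2"): 1.0}
hypotheses = ["lam1", "lam2"]
features = ["f1", "f2"]

def sliding_window_scores():
    S = defaultdict(float)
    for lam in hypotheses:                 # window-by-window
        for f in features:
            S[lam] += W.get((lam, f), 0.0)
    return dict(S)

def hough_scores():
    S = defaultdict(float)
    for f in features:                     # feature-by-feature
        for lam in hypotheses:
            S[lam] += W.get((lam, f), 0.0)
    return dict(S)

assert sliding_window_scores() == hough_scores() == {"lam1": 0.75, "lam2": 1.0}
```

The practical difference is which quantity the inner loop can exploit: windows can be pruned (cascades, branch-and-bound) in the first order, while features vote directly into a score map in the second.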
A specific example of extracting these weights is given in Section 5.5.

The Principled Implicit Shape Model (PRISM) [59] is another way to view the problem of localized classification. Lehmann et al. present the sliding window and Hough transform as duals, maximizing the exact same objective but with different approaches. This, along with the ISM of Leibe et al. [60], motivates our localization framework of Chapter 5.

Feature-centric efficient subwindow search takes the branch-and-bound approach of [53] and applies it with a feature-centric bound (rather than a window-centric bound as in ESS). The difference is that this feature-centric bound optimizes the same objective function as the PRISM model, allows for a more flexible description of feature location distributions (e.g., mixtures of Gaussians), and results in an implementation with more favourable memory tradeoffs in many cases.

The best performing detectors on problems of the scale of the VOC Challenge are based on HOG [20] and Deformable Part Models (DPMs) [34]. The HOG model comprises a fixed grid of rectangular regions, over which gradient strengths are pooled by orientation. An SVM is used to learn which of these features are discriminative for classification.

The DPM uses HOG as the underlying representation for part appearances. Instead of representing an object based on a fixed grid of expected gradients, the DPM comprises several parts, each of which has a HOG-based appearance model and a distribution of locations. This contribution of allowing flexible locations for sub-parts is analogous to our contribution of spatially local coding in Chapter 4. The DPM uses a small number of sub-parts (5 to 6 in [34]) represented by HOG models; our "sub-parts" are thousands of location-augmented SIFT features. The DPM overcomes the rigidity of HOG.
Spatially Local Coding (SLC) overcomes the rigidity of the Spatial Pyramid Model (SPM).

This thesis fits amongst this previous work by focusing on the design of an improved underlying classifier (Chapters 3 and 4), and then on developing a localization framework for the SLC classifier of Chapter 4.

2.4 Evaluation

How should we evaluate a classifier or detector? Dietterich [24] identifies nine relevant questions one could ask regarding a classifier or algorithm's performance. Three of those are relevant to current research in this field:

1. Given two learning algorithms A and B and a large dataset S, which algorithm will produce more accurate classifiers when trained on datasets of a specified size m < |S|?

2. Given two learning algorithms A and B and a small dataset S, which algorithm will produce more accurate classifiers when trained on datasets of the same size as S?

3. Given two learning algorithms A and B and datasets from several domains, which algorithm will produce more accurate classifiers when trained on examples from new domains?

Research has tended to compare the output of a complete training and testing pipeline. It is not very interesting to simply take a single instance of a classifier trained using algorithm A and ask whether or not it performs better or worse than an instance of a classifier trained using algorithm B. What is more important is the pattern that arises when these algorithms are trained repeatedly and the resulting classifiers are compared.

In some settings, the established evaluation protocol does compare the results of individual trained models. For example, in the VOC Challenge [30], competitors upload prediction results obtained from a single classifier. These are compared against the results of individual classifiers uploaded by other competitors.
This is a comparison between classifiers, not a comparison between learning algorithms. These comparisons don't capture the variation introduced by the randomness of the learning procedure.

2.4.1 Performance measures

In a 2-class classification setting (object of interest vs. background), the obvious performance measure is classification accuracy. What percentage of classifications are correct?

In a multi-class classification setting (classifying an image into one of several categories), total classification accuracy again makes sense, if the test set is balanced. If it is not balanced, and some categories are over-represented in the test set, comparisons of total classification accuracy will be biased towards the categories with more test examples. An improvement is to test on a fixed number of examples per category and report the overall error rate. This is not always possible. Again, going back to Caltech 101, some categories simply do not have enough test examples when training with 30 images per category. In this case, one can avoid the category-size bias by averaging across all the per-category accuracy rates, and this is the standard practice [39].

For the localization problem, at the lowest level, one needs to determine whether a prediction is associated with a ground truth object. VOC establishes a measure based on percentage of overlap. It defines the overlap as:

\[ a_{\mathrm{overlap}} = \frac{\mathrm{area}(BB_{\mathrm{prediction}} \cap BB_{\mathrm{truth}})}{\mathrm{area}(BB_{\mathrm{prediction}} \cup BB_{\mathrm{truth}})}, \tag{2.1} \]

where BB_prediction is the predicted bounding box, and BB_truth is the ground truth bounding box.

If a_overlap > 0.5, the prediction is considered a true positive, and the ground truth object is considered detected.

There are three popular performance measures for the localization problem:

• Precision-recall curve (and average precision)
• Recall vs. false positives per window
• Recall vs. false positives per image

The precision-recall method is advocated by the VOC Challenge [30].
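The overlap test of Equation 2.1 can be sketched directly for axis-aligned boxes; the box format `(x1, y1, x2, y2)` here is just an illustrative convention, not a format prescribed by VOC.

```python
# Overlap criterion of Equation 2.1: intersection area over union area,
# for axis-aligned boxes given as (x1, y1, x2, y2).

def overlap(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))   # intersection width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))   # intersection height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

pred, truth = (0, 0, 10, 10), (5, 0, 15, 10)
print(overlap(pred, truth))        # 50/150 ≈ 0.33: below the 0.5 threshold
print(overlap(pred, pred) > 0.5)   # a perfect prediction counts as detected
```

Note that a prediction half-overlapping the truth, as above, scores only 1/3, which is why the 0.5 threshold demands reasonably tight localization.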
The precision-recall protocol involves the system producing a sorted list of predicted bounding boxes across the entire dataset. Each of these is determined to be either a true positive or a false positive (according to Equation 2.1). Cumulative recall is also tracked as each prediction is considered. This allows the reporting of precision and recall at each hypothetical post-prediction threshold:

\[ \mathrm{precision} = \frac{n_{\mathrm{tp}}}{n_{\mathrm{fp}} + n_{\mathrm{tp}}} \tag{2.2} \]

\[ \mathrm{recall} = \frac{n_{\mathrm{tp}}}{n_{\mathrm{fn}} + n_{\mathrm{tp}}} \tag{2.3} \]

where n_tp is the number of true positives, n_fp is the number of false positives, and n_fn is the number of false negatives (missed detections). Precision-recall curves can be summarized in a single number by computing the area under the curve. This is also known as average precision.

Recall vs. false positives per window is a measure advocated by Dalal and Triggs [20] and Enzweiler and Gavrila [28]. Instead of presenting precision as in Equation 2.2, this metric reports recall as compared to the false-positive rate in terms of false positives per window evaluated. This metric has several downsides that make it less useful than the other two alternatives. First, it treats the localization problem as large-scale classification. Second, it cannot be used to evaluate systems that don't treat localization as a large-scale classification problem (e.g., Chapter 5, [60], [59], [54]), as the number of windows evaluated isn't well-defined.

Dollar et al. [25] advocate plotting recall (miss-rate) vs. false positives per image in the context of pedestrian detection, since for these systems the number of false positives per unit of time is an important practical consideration. This metric can be applied fairly to all localization systems regardless of their underlying classifier (or lack thereof).
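The cumulative computation of Equations 2.2 and 2.3 can be sketched as follows; the input is assumed (for illustration) to be the score-sorted list of predictions, each already judged true or false positive by the overlap test.

```python
# Precision and recall (Equations 2.2 and 2.3) accumulated over a
# score-sorted list of predictions. Each entry says whether that prediction
# was a true positive; n_ground_truth is the number of annotated objects,
# so recall = n_tp / (n_fn + n_tp) = n_tp / n_ground_truth.

def precision_recall_curve(is_true_positive, n_ground_truth):
    tp = fp = 0
    curve = []
    for hit in is_true_positive:              # highest-scoring first
        tp += hit
        fp += not hit
        curve.append((tp / (tp + fp),         # precision, Eq. 2.2
                      tp / n_ground_truth))   # recall, Eq. 2.3
    return curve

# Three ground-truth objects; four predictions in decreasing score order.
print(precision_recall_curve([True, True, False, True], 3))
```

Averaging the precision values over the recall axis of such a curve gives the average precision summary mentioned above.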
The curves produced by the false-positives-per-image method can be summarized by a single number by computing the average miss-rate or recall across a prescribed false-positive range.

In this thesis, we use average per-category accuracy rates to report performance on the classification problem (to avoid category bias), and we use miss-rate vs. false positives per image for reporting performance on pedestrian detection (as recommended by Dollar et al. [25]). Additional justification is presented in Chapter 5.

Chapter 3

Local Naive Bayes Nearest Neighbor

3.1 Introduction

As mentioned, a widely used approach to object category recognition has been the bag-of-words method [19] combined with the spatial pyramid match kernel [56]. This approach uses visual feature extraction, quantizes features into a limited set of visual words, and performs classification, often with a support vector machine [45, 54].

In contrast to the bag-of-words method, Boiman et al. [9] introduced a feature-wise nearest neighbor algorithm called Naive Bayes Nearest Neighbor (NBNN). They do not quantize the visual descriptors and instead retain all of the reference descriptors in their original form.

Boiman et al. [9] showed that quantizing descriptors in the bag-of-words model greatly decreases the discriminativity of the data. The bag-of-words model usually reduces the high dimensional feature space to just a few thousand visual words.

Despite the independence assumption of naive Bayes (independence of the descriptors in the query image), Boiman et al. demonstrated state-of-the-art performance on several object recognition datasets, improving upon the commonly used Support Vector Machine (SVM) classifier with a spatial pyramid match kernel. Also, NBNN has been shown to be superior to bag-of-words based methods for domain adaptation [84].

Figure 3.1: Local NBNN: Instead of considering classes individually, we search one merged index. (a) The original NBNN asks, "Does this descriptor look like a keyboard? a car? ... a dog?" (b) Local NBNN asks, "What does this descriptor look like?"

NBNN is a simple algorithm. The task is to determine the most probable class Ĉ of a query image Q. Let d_1, ..., d_n be all the descriptors in the query image. The training data for a class is a collection of descriptors extracted from a set of labelled example images. These are stored in data structures that allow for efficient nearest neighbor searches (the nearest neighbor of descriptor d_i in class C is NN_C(d_i)). The original NBNN is listed as Algorithm 1.

Algorithm 1: NBNN(Q) [9]
Require: A nearest neighbor index for each C, queried using NN_C().
Require: A query image Q, with descriptors d_i.
for all descriptors d_i ∈ Q do
    for all classes C do
        totals[C] ← totals[C] + ‖d_i − NN_C(d_i)‖²
    end for
end for
return argmin_C totals[C]

The contribution of this chapter is a modification to the original NBNN algorithm that increases classification accuracy and provides a significant speed-up when scaling to large numbers of classes. We eliminate the need to search for a nearest neighbor in each of the classes. Instead, we merge the reference datasets together and use an alternative nearest neighbor search strategy in which we only adjust the scores for the classes nearest to any query descriptor. The question becomes, "What does this descriptor look like?", instead of "Does this descriptor look like one from a car? a duck? a face? a plane? ..." Figure 3.1 gives a conceptual visualization.

We also provide the first head-to-head comparison of NBNN-based methods with spatial pyramid methods using a common feature set.
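Algorithm 1 is small enough to sketch end-to-end. Brute-force search over toy 2-D descriptors stands in here for the approximate per-class search structures the thesis uses over dense SIFT; the class names and data are illustrative only.

```python
# Runnable sketch of Algorithm 1 (NBNN) with brute-force nearest-neighbor
# search over toy 2-D descriptors standing in for per-class indexes.

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nbnn_classify(query_descriptors, class_descriptors):
    """class_descriptors: dict mapping class name -> list of descriptors."""
    totals = {c: 0.0 for c in class_descriptors}
    for d in query_descriptors:
        for c, refs in class_descriptors.items():
            totals[c] += min(sqdist(d, r) for r in refs)  # ||d - NN_C(d)||^2
    return min(totals, key=totals.get)   # argmin_C totals[C]

training = {
    "car":  [(0.0, 0.0), (1.0, 0.0)],
    "duck": [(5.0, 5.0), (6.0, 5.0)],
}
print(nbnn_classify([(0.2, 0.1), (0.9, 0.2)], training))  # → car
```

The inner loop makes the scaling problem visible: every query descriptor triggers one nearest-neighbor search per class, which is exactly the cost local NBNN removes.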
Previous work [9, 86] has only provided comparisons with published figures while extracting different feature sets for their experiments.

3.2 Relation to previous work

An obvious issue with the naive Bayes approach is that it makes the unrealistic assumption that image features provide independent evidence for an object category. In defense of the naive Bayes assumption, Domingos and Pazzani [26] demonstrate the applicability of the naive Bayes classifier even in domains where the independence assumption is violated. They show that while the independence assumption does need to hold in order for the naive Bayes classifier to give optimal probability estimates, the classifier can perform well with regard to misclassification rate even when the assumption doesn't hold. They perform extensive evaluations on many real-world datasets that violate the independence assumption and show classification performance on par with or better than other learning methods.

Behmo et al. [7] correct NBNN for the case of unbalanced training sets. They implemented and compared a variant of NBNN that used n one-vs-all binary classifiers, highlighting the effect of unbalanced training data. In the experiments we present, the training sets are approximately balanced, and we compare our results to the original NBNN algorithm. Behmo et al. also point out that a major practical limitation of NBNN is the time that is needed to perform the nearest neighbor search, which is what our work addresses.

The most recent work on NBNN is by Tuytelaars et al. [86]. They use the NBNN response vector (the 'totals' array in Algorithm 1) of a query image as the input features for a kernel SVM. This allows for discriminative training and combination with other complementary features by using multiple kernels. Kernel NBNN gives increased classification accuracy (69.6% vs. 65.5%) over the basic NBNN algorithm.
Our work is complementary to this in that the responses resulting from our local NBNN could also be fed into their second layer of discriminative learning. Due to the poor scaling of the original NBNN algorithm, Tuytelaars et al. had to heavily subsample the query images in order to obtain faster results for their experiments, hampering their absolute performance values. In NBNN, what dominates is the time needed to search for nearest neighbors in each of the object category search structures. Even approximate methods can be slow here and scale linearly with the number of categories.

The method we will introduce is a local nearest neighbor modification to the original NBNN. Other methods taking advantage of local coding include locality-constrained linear coding [91] and early cut-off soft assignment [62]. Both limit themselves to using only the local neighborhood of a descriptor during the coding step. By restricting the coding to use only the local dictionary elements, these methods achieve improvements over their non-local equivalents. The authors hypothesize this is due to the manifold structure of descriptor space, which causes Euclidean distances to give poor estimates of membership in codewords far from the descriptor being coded [62].

The main contribution of Local Naive Bayes Nearest Neighbor (LNBNN) is an improvement to NBNN's classification complexity. LNBNN scales logarithmically in the number of object classes. This sublinear scaling has also been achieved using feature sharing as in [85]. In that case, this was not theoretically guaranteed, but it was empirically observed that the number of features required for a fixed performance level scaled logarithmically with the number of object classes.
Sublinear scaling is also inherent when using hierarchical classification as in [37].

NBNN methods can be compared to the popular spatial pyramid methods [11, 56, 91], which achieve state-of-the-art results on image categorization problems. The original spatial pyramid method used hard codeword assignment and average pooling within each of the hierarchical histogram bins. Today, the best performing variants of the spatial pyramid use local coding methods combined with max pooling [11, 13, 62, 91]. State-of-the-art spatial pyramid methods achieve high accuracy on benchmark datasets, but there has been no head-to-head comparison of NBNN methods against spatial pyramid methods. Previous work has only compared against published figures, but these comparisons are based on different feature sets, which makes it difficult to isolate the contributions of the features from the contributions of the methods.

3.3 Naive Bayes nearest neighbor

To help motivate and justify our modifications to the NBNN algorithm, this section provides an overview of the original derivation [9]. Each image Q is classified as belonging to class Ĉ according to

\[ \hat{C} = \arg\max_C p(C|Q). \tag{3.1} \]

Assuming a uniform prior over classes and applying Bayes' rule,

\[ \hat{C} = \arg\max_C \log p(Q|C). \tag{3.2} \]

The assumption of independence of the descriptors d_i found in image Q gives

\[ \hat{C} = \arg\max_C \left[ \log \prod_{i=1}^{n} p(d_i|C) \right] \tag{3.3} \]
\[ \phantom{\hat{C}} = \arg\max_C \left[ \sum_{i=1}^{n} \log p(d_i|C) \right]. \tag{3.4} \]

Next, approximating p(d_i|C) in Equation 3.4 by a Parzen window estimator with kernel K gives

\[ \hat{p}(d_i|C) = \frac{1}{L} \sum_{j=1}^{L} K(d_i - d_j^C), \tag{3.5} \]

where there are L descriptors in the training set for class C and d_j^C is the j-th nearest descriptor in class C. This can be further approximated by using only the r nearest neighbors as

\[ \hat{p}_r(d_i|C) = \frac{1}{L} \sum_{j=1}^{r} K(d_i - d_j^C), \tag{3.6} \]

and NBNN takes this to the extreme by using only the single nearest neighbor NN_C(d_i):

\[ \hat{p}_1(d_i|C) = \frac{1}{L} K(d_i - NN_C(d_i)). \tag{3.7} \]

Choosing a Gaussian kernel for K and substituting Equation 3.7 (the single nearest neighbor approximation of p(d_i|C)) into Equation 3.4 (the sum of log probabilities) gives:

\[ \hat{C} = \arg\max_C \left[ \sum_{i=1}^{n} \log \frac{1}{L} e^{-\frac{1}{2\sigma^2} \|d_i - NN_C(d_i)\|^2} \right] \tag{3.8} \]
\[ \phantom{\hat{C}} = \arg\min_C \left[ \sum_{i=1}^{n} \|d_i - NN_C(d_i)\|^2 \right]. \tag{3.9} \]

Equation 3.9 is the NBNN classification rule: find the class with the minimum distance from the query image.

3.4 Towards local NBNN

Before introducing LNBNN, we first present some results demonstrating that we can be selective with the updates that we choose to apply for each query descriptor. Recall from Algorithm 1 that for every query descriptor, we update the aggregate distance of the query image to every class. Not all of these updates are necessary. We start by re-casting the NBNN updates as adjustments to the posterior log-odds of each class. Then, we show that only the updates giving positive evidence for a class are necessary.

The effect of each descriptor in a query image Q can be expressed as a log-odds update. This formulation is useful because it allows us to restrict updates to only those classes for which the descriptor gives significant evidence. Let C be some class and C̄ be the set of all other classes.

The odds O_C for class C are given by

\[ O_C = \frac{P(C|Q)}{P(\bar{C}|Q)} \tag{3.10} \]
\[ \phantom{O_C} = \frac{P(Q|C)\,P(C)}{P(Q|\bar{C})\,P(\bar{C})} \tag{3.11} \]
\[ \phantom{O_C} = \prod_{i=1}^{n} \frac{P(d_i|C)}{P(d_i|\bar{C})} \cdot \frac{P(C)}{P(\bar{C})}. \tag{3.12} \]

Taking the log gives:

\[ \log O_C = \underbrace{\sum_{i=1}^{n} \log \frac{P(d_i|C)}{P(d_i|\bar{C})}}_{\text{increment}} + \underbrace{\log \frac{P(C)}{P(\bar{C})}}_{\text{prior}}. \tag{3.13} \]

Equation 3.13 has an intuitive interpretation. The prior odds are P(C)/P(C̄). Each descriptor then contributes an adjustment to the posterior odds: P(d_i|C)/P(d_i|C̄). This allows an alternative classification rule that is expressed in terms of log-odds increments:

\[ \hat{C} = \arg\max_C \left[ \sum_{i=1}^{n} \log \frac{P(d_i|C)}{P(d_i|\bar{C})} + \log \frac{P(C)}{P(\bar{C})} \right], \tag{3.14} \]

where the prior term can be dropped if you assume equal class priors.
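The step from Equation 3.8 to Equation 3.9 is easy to check numerically: with a Gaussian kernel, maximizing the summed log-likelihoods selects the same class as minimizing the summed squared nearest-neighbor distances. The per-class distance lists below are arbitrary illustrative values, not data from the thesis.

```python
# Numerical check that Equations 3.8 and 3.9 agree. Each class scores the
# same n query descriptors, so the constant log(1/L) terms cancel in the
# comparison. The squared distances here are made-up toy values.

import math

sq_dists = {"car": [0.1, 0.3], "duck": [0.2, 0.5]}   # ||d_i - NN_C(d_i)||^2
L, sigma = 10, 1.0

def log_likelihood(c):                               # sum in Equation 3.8
    return sum(math.log((1 / L) * math.exp(-d / (2 * sigma ** 2)))
               for d in sq_dists[c])

argmax_c = max(sq_dists, key=log_likelihood)         # Equation 3.8
argmin_c = min(sq_dists, key=lambda c: sum(sq_dists[c]))  # Equation 3.9
assert argmax_c == argmin_c == "car"
```

The agreement relies on σ and L being shared across classes, which is why the derivation assumes (approximately) balanced training sets.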
The increment term is simple to compute if we leave P(d_i|C) ∝ e^{−‖d_i − NN_C(d_i)‖²} as in the original.

The benefit that comes from this formulation is that every query descriptor does not have to affect every class's posterior log-odds: we can use only the significant log-odds updates. For example, we can decide to only adjust the class posteriors for which the descriptor gives a positive contribution to the sum in Equation 3.14. Table 3.1 shows that this selectivity does not affect classification accuracy.

Method                      Avg. # increments    Accuracy %
Full NBNN                   101                  55.2 ± 0.97
Positive increments only    55.0                 55.6 ± 1.17

Table 3.1: Effect of restricting increments to only the positive increments on a downsampled version (128x128) of the Caltech 101 dataset. The ± shows one standard deviation.

This makes intuitive sense in the presence of background clutter, since no feature's presence should be taken as strong evidence against a particular object's presence.

3.5 Local naive Bayes nearest neighbor

The selectivity introduced in the previous section shows that we do not need to update each class's posterior for each descriptor. This section shows that by focusing on a much smaller, local neighborhood (rather than on a particular log-odds threshold), we can use an alternate search strategy to speed up the algorithm, and also achieve better classification performance by ignoring the distances to classes far from the query descriptor. As mentioned in Section 3.2, previous work has also restricted feature matching to local neighborhoods, with the argument that comparisons of Euclidean distances to far-off descriptors are less meaningful.

Instead of performing a search for a query descriptor's nearest neighbor in each of the classes' reference sets, we search for only the nearest few neighbors in a single, merged dataset comprising all the features from all labelled training data from all classes.
Doing one approximate k-nearest-neighbor search in this large index is much faster than querying each of the classes' approximate-nearest-neighbor search structures. This is a result of the sublinear growth in computation time with respect to index size for approximate nearest neighbor search algorithms, as discussed in Section 3.5.1. This allows the algorithm to scale up to handle many more classes, avoiding a prohibitive increase in runtime.

This is an approximation to the original method. For each test descriptor in a query image, we do not find a nearest neighbor from every class, only a nearest neighbor from classes that are represented in the k nearest descriptors to that test descriptor. We call this Local Naive Bayes Nearest Neighbor (LNBNN), visualized in Figure 3.2.

Figure 3.2: NBNN finds the nearest neighbor from each of the classes (the shapes, in this figure). LNBNN retrieves only the local neighborhood, finding nearest neighbors from only some of the classes. The shaded descriptors are those that would be used for updating the distance totals. We only use the closest member from any class, and don't find an example from each class.

It is important to properly deal with the set of background classes which were not found in the k nearest neighbors of d. To handle the classes that were not found in the k nearest neighbors, we conservatively estimate their distance to be the distance to the (k+1)-st nearest neighbor (see line 3 of Algorithm 2). This can be thought of as an upper bound on the density of background features. In practice, instead of adjusting the distance totals for every class, it is more efficient to only adjust the distances for the relatively few classes that were found in the k nearest neighbors, but discount those adjustments by the distance to background classes (see line 6 of Algorithm 2).
This does not affect the minimum. The LNBNN algorithm is as follows:

Algorithm 2: LOCALNBNN(Q, k)
Require: A nearest neighbor index comprising all descriptors, queried using NN(descriptor, #neighbors).
Require: A class lookup, Class(descriptor), that returns the class of a descriptor.
1: for all descriptors d_i ∈ Q do
2:    {p_1, p_2, ..., p_{k+1}} ← NN(d_i, k+1)
3:    dist_B ← ‖d_i − p_{k+1}‖²
4:    for all categories C found in the k nearest neighbors do
5:       dist_C ← min_{p_j : Class(p_j) = C} ‖d_i − p_j‖²
6:       totals[C] ← totals[C] + dist_C − dist_B
7:    end for
8: end for
9: return argmin_C totals[C]

The update at line 6 matches the increment from Equation 3.13 when we let P(d_i | C) ∝ exp(−‖d_i − NN_C(d_i)‖²) and estimate the background density using the (k+1)-st nearest neighbor. We also experimented with using the (k+b)-th nearest neighbor (for various values of b) to estimate the background density. This performs no better than just using the (k+1)-st nearest neighbor. It seems the most critical element of LNBNN is that we are focusing on the distinctions between the most probable classes and setting them apart from the rest.

3.5.1 Approximate nearest neighbors and complexity

Our algorithm scales with the log of the number of classes rather than linearly in the number of classes. This analysis depends on the nearest neighbor search structure that we use.

For both our implementation of the original NBNN and LNBNN, we use the Fast Library for Approximate Nearest Neighbors (FLANN) [69] to store descriptors in efficient approximate nearest neighbor search structures. FLANN is a library for finding approximate nearest neighbor matches that is able to automatically select from among several algorithms and tune their parameters for a specified accuracy. It makes use of multiple, randomized KD-trees as described by Silpa-Anan and Hartley [82], and is faster than single KD-tree methods like ANN [5] (used by Boiman et al. in the original NBNN) or locality-sensitive hashing methods.
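Algorithm 2 is simple to sketch in code. The following is a minimal illustration that substitutes a brute-force nearest-neighbor search for FLANN's approximate KD-tree index; the function and variable names are our own, and the toy data in the test is hypothetical.

```python
import numpy as np

def local_nbnn(query_descs, ref_descs, ref_classes, k):
    """Classify an image by Local NBNN (Algorithm 2).

    query_descs: (n_q, d) descriptors from the query image.
    ref_descs:   (n_r, d) merged descriptors from all training images.
    ref_classes: (n_r,) class label of each reference descriptor.
    A brute-force search stands in for the approximate k-NN index.
    """
    totals = {}
    for d_i in query_descs:
        dists = np.sum((ref_descs - d_i) ** 2, axis=1)
        nn = np.argsort(dists)[: k + 1]   # the k+1 nearest neighbors (line 2)
        dist_b = dists[nn[k]]             # background estimate: (k+1)-st (line 3)
        for c in set(ref_classes[nn[:k]]):
            # closest member of class c in the local neighborhood (line 5)
            dist_c = min(dists[j] for j in nn[:k] if ref_classes[j] == c)
            # discounted update (line 6)
            totals[c] = totals.get(c, 0.0) + dist_c - dist_b
    return min(totals, key=totals.get)
```

Because dist_C ≤ dist_B, every explicit update is non-positive, so a class never seen in any local neighborhood (whose total implicitly stays at zero) can never win the argmin; this is the discounting described above.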
The computation required and the accuracy of the nearest neighbor search are controlled by the number of leaf nodes checked in the KD-trees.

Following the analysis by Boiman et al. [9], let N_T be the number of training images per class, N_C the number of classes, and N_D the average number of descriptors per image. In the original, each KD-tree contains N_T N_D descriptors, and each of the N_D query descriptors requires an approximate search in each of the N_C KD-tree structures. The accuracy of the approximate search is controlled by the number of distance checks, c. The time complexity for processing one query image under the original algorithm is O(c N_D N_C log(N_T N_D)). In our method, there is a single search structure containing N_D N_C N_T descriptors, in which we search for k nearest neighbors (using c distance checks, where c ≥ k). The time complexity for processing one query image under our method is O(c N_D log(N_C N_T N_D)). The N_C term has moved inside the log term.

3.6 Experiments and results

We show results on both the Caltech 101 and Caltech 256 datasets [32, 39]. Each image is resized so that its longest side is 300 pixels, preserving aspect ratio. We train using 15 and 30 images, common reference points from previously published results. SIFT descriptors [63] are extracted on a dense, multi-scale grid, and we discard descriptors from regions with low contrast. We have attempted to match as closely as possible the extraction parameters used by Boiman et al. [9].¹ We measure performance by the average per-class classification accuracy (the average of the diagonal of the confusion matrix), as suggested by [39]. Further justification for this performance metric (instead of overall accuracy) is given in Section 2.4.

Boiman et al. [9] also introduced an optional parameter, α, that controls the importance given to the location of a descriptor when matching.
That is, they append to each descriptor d its location l in the image, weighted by α: d̃ = (d | αl). For all experiments, we fix α = 1.6, based on coarse tuning on a small subset of Caltech 101.

As discussed, we use FLANN [69] to store the reference descriptors extracted from the labelled images in efficient approximate nearest neighbor search structures.

3.6.1 Tuning local naive Bayes nearest neighbor

Figure 3.3 shows the effect of varying the cut-off, k, that defines the local neighborhood of a descriptor. This experiment shows that using a relatively low value for k improves performance. Using too low a value for k hurts performance, and using a much higher value for k reverts to the performance of the original NBNN.

We also demonstrate that this improved accuracy comes at a significant time savings over the original. Instead of building 101 indices, local NBNN uses a single index comprising all the training data, storing a small amount of extra accessory data: the object class of each descriptor.

We vary the computation afforded to both NBNN and LNBNN, and track the associated classification accuracy. For LNBNN, we do a search for 10 nearest neighbors, which returns an example from approximately 7 of the classes on average. The selection of an appropriately low number of nearest neighbors is important (see Figure 3.3).

To control the computation for each method, we control a parameter of FLANN's approximate nearest neighbor search: the number of leaf nodes checked in the KD-trees. This also determines the accuracy of the approximation. The higher the number of checks, the more expensive the nearest neighbor searches, and the more accurate the nearest neighbors retrieved. While FLANN does allow for auto-tuning the parameters to achieve a particular accuracy setting, we fix the number of randomized KD-trees used by FLANN to 4 so that we can control the computation more directly.

¹ Code is available at for ease of comparison.
This setting achieves good performance with minimal memory use.

Figure 3.4 shows the results of this experiment. There are significant improvements in both classification accuracy and computation time. Looking in each of the 101 separate class indices for just a single nearest neighbor, checking only one leaf node in each of those search structures, was still slower than the localized search in the merged dataset. Even doing 1000, 2000, or 4000 leaf node checks in the merged dataset is still faster.

3.6.2 Scaling to many classes

Figure 3.5 further shows how the computation for these two methods grows as a function of the number of classes in the dataset. As new classes are added in our method, the depth of the randomized KD-tree search structures increases at a log rate. As we increase the number of classes to 256, LNBNN using the merged dataset runs 100 times faster than the original. In the original NBNN, an additional search structure is required for each class, causing its linear growth rate. This requires a best-bin-first traversal of each KD-tree. However, in the case where we query a single search structure for 10-30 nearest neighbors, the best-bin-first traversal from root to leaf happens only once, with the remainder of the nearest neighbors and distance checks being performed by backtracking. The preprocessing time to build the indices is almost identical between the two methods.

Figure 3.3: The effect of changing k, the number of nearest neighbors retrieved from the merged index. Using only the local neighbors (about 10) results in optimal performance. The absolute performance numbers are lower than in our final results because we extracted fewer SIFT descriptors for this experiment.

3.6.3 Comparisons with other methods

Until now, no comparison has been done between NBNN and spatial pyramid methods using the same base feature set. We show those results in Table 3.2.
(Runtime for the original NBNN on Caltech 256 was prohibitive, at approximately 100 seconds per image, so we do not report those results.)

We choose to compare against two spatial pyramid methods. First, the original model introduced by Lazebnik et al. [56]. Second, a recent variant by Liu et al. [62] that takes advantage of local soft assignment in a manner similar to our local cut-off, and that uses max pooling [12] rather than average pooling within each spatial histogram bin. We trained a codebook of size 1024 for each of the training set regimes (Caltech 101 with 15 and 30 training images, Caltech 256 with 15 and 30 training images). Our spatial pyramid was 3 levels (1×1, 2×2, and 4×4 histogram arrangements). For classification, we trained one-vs-all SVMs using the histogram intersection kernel [56] and used a fixed regularization term for all training regimes.

Figure 3.4: Comparison of accuracy against computation for NBNN vs. LNBNN. Computation is determined primarily by the number of distance checks (c in this figure) allowed in the approximate nearest neighbor search. For 101 classes, even a single check in each of the 101 indices is more expensive than one search with thousands of checks in the merged index, due to the overhead of traversing each tree. These results were obtained on Caltech 101, using a sparser sampling of descriptors than in our final results.

We also compare against some previously published figures for NBNN. Notably, LNBNN gives the best performance of any NBNN method to date.

While LNBNN (and NBNN) performs better than the original spatial pyramid model, it does not perform better than the model of Liu et al.
The soft assignment avoids some of the information loss through quantization, and the discriminative training step provides an additional benefit. However, the image-to-class distance that LNBNN uses may allow it to better recognize test examples that are a composition of several training examples, possibly explaining its improvement over spatial pyramid approaches on Caltech 256, a dataset with more intra-class variation. The spatial pyramid pipeline computes image-to-image distances rather than an image-to-class distance.

Figure 3.5: The scaling behaviour of NBNN and LNBNN. We varied the number of categories from 2 up to 256 and plot the run time of the two methods. When classifying 256 categories, our method is 100 times faster than the original.

The recent kernel NBNN of Tuytelaars et al. is a complementary contribution, and we suspect that the combination of LNBNN with the kernel NBNN would lead to even better performance. We hypothesize that this combination would lead to NBNN matching or improving upon the performance of state-of-the-art spatial pyramid methods.

There are other results using a single feature type that have higher published accuracy on these benchmarks. For example, Boureau et al. [13] show 77.1% accuracy on Caltech 101 and 41.7% on Caltech 256 with 30 training images, but they use a macro-feature built on top of SIFT as their base feature, so that is not directly comparable with our feature set.
Combining different feature types together would also yield higher performance, as shown frequently in the literature [9, 86].

Caltech 101                                  | 15 training images | 30 training images
Results from literature
NBNN [9]                                     | 65 ± 1.14          | 70.4
NBNN [86]                                    | 62.7 ± 0.5         | 65.5 ± 1.0
NBNN kernel [86]                             | 61.3 ± 0.2         | 69.6 ± 0.9
Results using our feature set
SPM (hard-assignment, avg.-pooling)²         | 62.5 ± 0.9         | 66.3 ± 2.6
SPM (local soft-assignment, max-pooling)³    | 68.6 ± 0.7         | 76.0 ± 0.9
NBNN (our implementation)                    | 63.2 ± 0.9⁴        | 70.3 ± 0.6
LNBNN                                        | 66.1 ± 1.1         | 71.9 ± 0.6

Caltech 256                                  | 15 training images | 30 training images
Results from literature
NBNN [9]                                     | 30.5¹              | 37
NBNN [86]                                    | -                  | -
NBNN kernel [86]                             | -                  | -
Results using our feature set
SPM (hard-assignment, avg.-pooling)²         | 27.3 ± 0.5         | 33.1 ± 0.5
SPM (local soft-assignment, max-pooling)³    | 33.2 ± 0.8         | 39.5 ± 0.4
NBNN (our implementation)                    | -                  | -
LNBNN                                        | 33.5 ± 0.9         | 40.1 ± 0.1

Table 3.2: Our LNBNN has consistent improvement over the original NBNN, outperforming all previously published results for NBNN using a single descriptor. We confirm NBNN outperforms the original spatial pyramid method, but is only competitive with the latest state-of-the-art variant.
¹ Boiman et al. did not do an experiment with 15 images on this dataset. The 30.5 is an interpolation from their plot, which showed 10 training examples giving 28% and 20 training examples giving 33%.
² The original spatial pyramid match by Lazebnik et al. [56] (re-implementation).
³ A recent variant of the spatial pyramid match from Liu et al. [62] (re-implementation).
⁴ Our experiment using NBNN achieves 63.2 ± 0.9 compared to 65.0 ± 1.14 from [9]. The original implementation is not available, and we have had discussions with the authors to resolve these differences in performance. We attribute the disparity to unresolved differences in parameters of our feature extraction.
This difficulty is also reported in [48].

3.7 Chapter summary

We have demonstrated that LNBNN is a superior alternative to the original NBNN, giving improved classification performance and a greater ability to scale to large numbers of object classes. Classification performance is improved by making adjustments only to the classes found in the local neighborhood comprising the k nearest neighbors. Additionally, it is much faster to search through a merged index for only the closest few neighbors rather than to search for a descriptor's nearest neighbor from each of the object classes.

Our comparison against spatial pyramid methods confirms previous results [9] claiming that NBNN outperforms the early spatial pyramid models. Further, while NBNN is competitive with the recent state-of-the-art variants of the spatial pyramid, additional discriminative training (as in the NBNN kernel of Tuytelaars et al. [86]) may be necessary in order to obtain similar performance.

As new recognition applications such as web search attempt to classify ever larger numbers of visual classes, we can expect scalability with respect to the number of classes to continue to grow in importance. For example, ImageNet [21] is working to obtain labelled training data for each visual concept in the English language. With very large numbers of visual categories, it becomes even more apparent that feature indexing should be used to identify only those categories that contain the most similar features, rather than separately considering the presence of a feature in each known category.

Chapter 4

Spatially Local Coding

4.1 Introduction

The last chapter presented an improvement on NBNN, a non-trained model based on direct feature matching. This chapter instead presents an improvement on the bag-of-words / spatial pyramid model.
These are trained models that learn a fixed-size set of representative features for object appearance.

Spatially Local Coding (SLC) represents location information in the same way as NBNN and LNBNN, but clusters the features to avoid the prohibitive memory usage associated with storing every training feature.

Models based on histograms of visual word frequencies have been quite successful for object recognition, delivering good results on varied datasets. The spatial pyramid was introduced by Lazebnik et al. [56] to allow the representation to account for the spatial distribution of these visual features. In this model, visual features are coded across elements of a visual vocabulary, and these codes are pooled into histograms at several spatial granularities.

Many improvements have been made to the original spatial pyramid model. These include replacing nearest neighbor vector quantization with sparse coding [93] or localized soft assignment [62], and replacing average pooling with max-pooling in each spatial bin [11, 13, 62, 93]. Boureau et al. [11] analyzed the various choices available during the coding and pooling stages: hard assignment vs. soft assignment, average pooling vs. max-pooling, and the linear kernel vs. the histogram intersection kernel.

Method                                  | Appearance locality      | Spatial locality
Bags-of-features [19]                   | Coding                   | -
Spatial pyramid [56]                    | Coding                   | Pooling
Sparse coding [93]                      | Sparse coding            | Pooling
LLC [91]                                | Local coding             | Pooling
"Ask-the-locals" sparse coding [13]     | Sparse coding + Pooling  | Pooling
Spatially local coding (this work)      | Coding                   | Coding

Table 4.1: A taxonomy of histogram-based recognition models focusing on how they attend to appearance and spatial locality. Spatially local coding moves spatial locality into the coding step.

It is useful to view these approaches in the context of how each method attends to spatial locality and appearance locality.
Prior to spatial pyramids, the bag-of-features method [19] enforced appearance locality in the coding step by its nearest-neighbor vector quantization. Using only a whole-image pooling region, it did not enforce any spatial locality. The spatial pyramid introduced spatial locality using a spatially hierarchical pooling stage. The sparse coding spatial pyramid method [93] takes a different approach to appearance locality, one that can select very different bases for patches that are visually similar [91]. Boureau et al. [13] re-instated stricter appearance locality by modifying the pooling stage. Table 4.1 presents a taxonomy of these methods based on how they attend to appearance and spatial locality.

In contrast to the spatial pyramid approaches, we explore moving the task of enforcing spatial locality into the coding step. We first present more detail about the related spatial pyramid methods. Then we present our method, its tuning and implementation details, and finally, experimental results on Caltech 101 and Caltech 256.

4.2 Related methods

4.2.1 The coding/pooling pipeline

As our method relates closely to the previous bag-of-features and spatial pyramid approaches, we will explain our method in the context of these other methods. This will facilitate comparisons with these approaches in Section 4.6.

We adopt the coding/pooling framework of Boureau et al., in which all the bag-of-features and spatial pyramid methods can be seen as various choices for a coding and a pooling step. Given an image I, we first do feature extraction: Φ(I) : I ↦ {(φ_1, x_1, y_1), (φ_2, x_2, y_2), ..., (φ_{n_I}, x_{n_I}, y_{n_I})}. The features φ_i are typically local image descriptors (SIFT, for example), and they vary in number from image to image (n_I).
The pixel location at which feature i is centered is (x_i, y_i).

After we extract features, we code them using some coding function g((φ_i, x_i, y_i)). In previous work, this coding function has included nearest neighbor vector quantization, sparse coding, soft assignment, and localized soft assignment. In bags-of-features and spatial pyramid methods, the coding has been limited to the appearance portion of the descriptor φ_i, such that g((φ_i, x_i, y_i)) = (ĝ(φ_i), x_i, y_i). That is, the extracted descriptor's appearance is converted into a coded version, still associated with the original pixel location.

We'll now give some concrete examples of the appearance coding function ĝ for the coding methods just mentioned. In these examples, assume that we have constructed a dictionary D of k appearance elements. This is often referred to as a codebook, and its elements called codewords.

For nearest neighbor vector quantization, we use a 1-of-k coding:

ĝ(φ_i) = [u_i1, u_i2, ..., u_ik] : u_ij = 1 if j = argmin_a ‖φ_i − D_a‖, and 0 otherwise.   (4.1)

For soft assignment, demonstrated by van Gemert et al. [87], a feature is coded across many codebook elements instead of just one:

ĝ(φ_i) = [u_i1, u_i2, ..., u_ik] : u_ij = exp(−β ‖φ_i − D_j‖²) / Σ_{a=1..k} exp(−β ‖φ_i − D_a‖²)   (4.2)

where β is a parameter controlling how widely the assignment distributes the weight across all the codewords. A small β gives a broad distribution, while a large β gives a peaked distribution, more closely approximating hard assignment.

This is further improved by Liu et al. [62], who use localized soft assignment. Instead of distributing the weight across all codebook elements, they confine the soft assignment to a local neighborhood around the descriptor being coded. Let NN(κ)(φ_i) be the set of κ nearest neighbors to φ_i in D. Then, the localized soft assignment coding is:

ĝ(φ_i) = u_i = [u_i1, u_i2, ..., u_ik] : u_ij = exp(−β d(φ_i, D_j)) / Σ_{a=1..k} exp(−β d(φ_i, D_a))   (4.3)

d(φ_i, D_j) = ‖φ_i − D_j‖² if D_j ∈ NN(κ)(φ_i), and ∞ otherwise.

Sparse coding ([93], [11]) codes a descriptor by using the coefficients of a linear combination of the codewords in D, with a sparsity-promoting l1 norm:

ĝ(φ_i) = u_i = argmin_{u ∈ R^k} ‖φ_i − Du‖²_2 + λ‖u‖_1   (4.4)

Locality-constrained linear coding (LLC) is similar, but adds a penalty for using elements of the codebook that have a large Euclidean distance from the descriptor being coded [91].

After choosing one of the above coding functions for ĝ, we obtain the coding for every descriptor extracted from an image: g((φ_i, x_i, y_i)) = (u_i, x_i, y_i).

The pooling stage combines these coded features within an image region into a histogram. The histogram h_m associated with spatial pooling region S_m is obtained using a pooling function:

h_m = f({u_i | (x_i, y_i) ∈ S_m}).   (4.5)

The structure of the spatial pooling regions S varies between methods. In the bag-of-features approach of Csurka et al. [19], there is just a single spatial pooling region, one comprising the entire image. In the spatial pyramid, there are several spatial pooling regions: one comprising the entire image, and additional regions that split the image into quadrants and even finer subdivisions.

Previous work has chosen between two options for the pooling function: average-pooling and max-pooling. Average-pooling produces a histogram that represents the average value of u within a spatial pooling region. Max-pooling instead takes the maximum value of each dimension of u within a spatial pooling region.

h_m^avg = ( Σ_{i : (x_i, y_i) ∈ S_m} u_i ) / |{i : (x_i, y_i) ∈ S_m}|   (4.6)

h_m^max = [h_m1, h_m2, ..., h_mk], where h_mj = max{u_ij | (x_i, y_i) ∈ S_m}   (4.7)

The final image representation used by the classifier is a concatenation of those histograms h_m:

H = [h_1 h_2 ... h_M]   (4.8)

4.2.2 Other spatial models

We present in Section 4.3 perhaps the simplest approach for adding spatial information to the local features, but first, we outline relevant previous work that has attempted to account for the spatial layout of visual features.

Boiman et al., in their Naive Bayes nearest neighbor work [9], append a weighted location to the feature vectors, which matches our approach. However, in contrast with the coding/pooling pipeline, they do not quantize features using a codebook. They instead maintain an index of all features extracted from the training data.

Zhou et al. [95] use a mixture of Gaussians to model visual appearance, followed by spatially hierarchical pooling of the local features' memberships in each of the mixture's components. This is essentially a variant of the spatial pyramid model where soft codeword assignment is performed via a Gaussian mixture model, and better results have been achieved by spatial pyramids using localized soft assignment coding and max-pooling [62].

Krapac et al. [50] introduced a compact Fisher vector coding to encode the spatial layout of features. They first learn an appearance codebook using k-means or a mixture of Gaussians. Then, for each appearance component, they learn a mixture of Gaussians to represent its spatial distribution. While their representation is more compact, their evaluation shows marginal (if any) improvement over SPM in terms of classification accuracy.

Oliveira et al. introduced a method called sparse spatial coding [75]. In their work, they code a descriptor using sparse codebook elements nearby in descriptor space. In this sense, it is very close to LLC [91]. In contrast to our work, and despite their method's name, sparse spatial coding is local in descriptor space, not pixel space. As a separate contribution, they introduce a new learning stage called Orthogonal Class Learning that results in better performance than the standard SVM classifier.
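As a concrete reference for the coding/pooling pipeline of Section 4.2.1, the following minimal sketch implements hard-assignment coding (Equation 4.1) together with whole-image average and max pooling (Equations 4.6 and 4.7). The toy codebook and descriptors are illustrative assumptions only, not values or code from this thesis.

```python
import numpy as np

def hard_code(phis, D):
    """1-of-k coding (Eq. 4.1): u_ij = 1 iff codeword j is nearest to phi_i."""
    # squared distance between every descriptor and every codeword
    d2 = ((phis[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)
    U = np.zeros_like(d2)
    U[np.arange(len(phis)), d2.argmin(axis=1)] = 1.0
    return U

def avg_pool(U):
    """Average pooling (Eq. 4.6) over a single, whole-image region."""
    return U.mean(axis=0)

def max_pool(U):
    """Max pooling (Eq. 4.7) over a single, whole-image region."""
    return U.max(axis=0)

# A 2-codeword codebook and three 2-D "descriptors":
D = np.array([[0.0, 0.0], [1.0, 1.0]])
phis = np.array([[0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
U = hard_code(phis, D)   # rows: [1,0], [0,1], [0,1]
h_avg = avg_pool(U)      # [1/3, 2/3]
h_max = max_pool(U)      # [1, 1]
```

A spatial pyramid repeats the pooling over each of its spatial bins and concatenates the results (Eq. 4.8); SLC, introduced next, instead keeps the single whole-image region and moves location into the descriptors themselves.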
However, that work is complementary to the work on improving the coding/pooling pipeline. Their results using the standard SVM classifier are inferior to LLC [91], NBNN [9], and sparse coding SPM [93].

4.3 Spatially local coding

Our method differs from the previous coding/pooling methods in that we choose a coding function g that directly handles spatial locality, and we use a single, whole-image pooling region during the pooling stage. Instead of choosing g(φ_i, x_i, y_i) = (ĝ(φ_i), x_i, y_i), we simultaneously code φ_i and the location (x_i, y_i). In the standard models, φ_i = [φ_i1, φ_i2, ..., φ_id]. In spatially local coding, we use a location-augmented descriptor: φ_i^(λ) = [φ_i1, φ_i2, ..., φ_id, λx_i, λy_i], where λ ∈ R is a location weighting factor giving the importance of the location in feature matching.

For example, hard-assignment, nearest neighbor vector quantization becomes:

g(φ_i, x_i, y_i) = u_i : u_ij = 1 if j = argmin_a ‖φ_i^(λ) − D_a^(λ)‖, and 0 otherwise   (4.9)

where we have constructed the codebook D^(λ) by k-means clustering over a set of location-augmented descriptors extracted from the training data.

All of the coding functions presented in Section 4.2 can use the location-augmented descriptors in place of the standard appearance-only descriptors.

In summary, spatially local coding uses location-augmented feature vectors and a single, whole-image spatial pooling region, because the features themselves already carry sufficient spatial information. We have moved the task of maintaining spatial locality into the coding stage. Previously, this has been left to the pooling stage.

One advantage of spatially local coding is that we avoid having to commit to artificial grid boundaries to define the spatial pooling regions. Other authors have had to work around this by experimenting with alternate binning geometries to engineer one that performs well for their problem [55], or by supervised learning of optimal pooling regions from an over-complete set of rectangular bins [44].
Both of these methods keep the responsibility for spatial locality in the pooling stage. Instead, our λ parameter defines a receptive field associated with each codebook element.

4.3.1 Multi-level spatially local coding

The λ parameter plays an important role in our model. If we set λ = 0, we revert to the standard bag-of-features representation. If we set λ to be high, we learn features that are very strictly localized. Figure 4.1 shows that it is possible to determine a somewhat optimal setting for λ (approximately 1.5 for the Caltech 101 dataset). However, the specific λ may depend on the particulars of the dataset. Instead of committing to a single λ, we build several codebooks, each with a different λ, and code image features across all of the codebooks simultaneously. For example, if we use hard coding, each feature would be coded into one codeword in each of the codebooks, using a different λ for each. The codebooks are then concatenated, and max pooling is performed as normal from this point forward. This is similar in spirit to having multiple levels in the spatial pyramid, with each level dividing space into finer and finer regions. Table 4.2 shows that this combination is beneficial.

Figure 4.1: The performance of single-codebook SLC is dependent on the choice of λ. Regardless of the choice of λ, the optimal codebook dimensionality is similar. Based on this observation, we can choose to use the same dimensionality across all of the codebooks in a multi-λ model.

Codebook (8192D)         | Caltech 101 (15 training)
λ = 0.00                 | 52.4 ± 0.4
λ = 0.75                 | 61.1 ± 0.2
λ = 1.50                 | 66.5 ± 0.5
λ = 3.00                 | 64.9 ± 0.3
4-level (linear kernel)  | 68.4 ± 0.2

Table 4.2: Combining multiple SLC codebooks. The experimental setting of these results is explained in detail in Section 4.6. This shows that the combination of multiple codebooks, each with a different λ-weighting, produces higher classification accuracy than any of the codebooks individually.

Figure 4.2 shows the types of features learned by our system. For this visualization, we choose a particular category and determine the per-codeword weights derived from the linear SVM model (the SVM training is described further in Section 4.6). These are the codewords that most signal the presence of the object of interest. We show the distribution of features found in the training images that match the top-weighted codewords. The SVM can choose between unlocalized and highly localized features.

Figure 4.2: Visualization of top-ranked features (hand-picked from among the top ten) for five of the object categories: killer-whale, sunflower, dice, car-side, and zebra. The top row is the spatial distribution of each codeword and the second row is the average appearance of each codeword (as in Fig. 4.3). The subsequent rows show where occurrences of the codewords are centered in particular training images. Looking down each column shows that these codewords tend to be associated with particular parts or textures.

Figure 4.3: Each pair of rows is a visualization of the top ten features from an object category (saturn, dice, and car-side). The scatter-plots show the spatial distribution of each codeword, while the grey-scale snippets show their average appearance.

4.4 Optimal codebook size and model size

In [56], Lazebnik et al. reported that their model reached optimal performance for Caltech 101 with approximately 200-400 codewords. We verify that in Fig. 4.4. We also confirm that localized soft assignment coding [62] achieves superior performance and can effectively make use of more codebook elements. Our model performs even better, but requires more codebook elements (for this figure, we used 4-codebook SLC with λ = 0, 0.75, 1.5, 3.0, localized soft assignment, and max-pooling).

These results may seem counter to those reported by Chatfield et al.
in [16], who report the performance of spatial pyramid methods never decreasing on Caltech 101 as codebook size grew up to 8,000 dimensions. However, those results were shown for a much denser SIFT sampling density than used in our experimental setups. We extract single-scale SIFT every 8 pixels. Chatfield et al. extracted SIFT at 4 scales every 2 pixels, resulting in approximately 64× the descriptors of our experimental setting. In Section 4.6, we show some results at higher extraction densities and note that at higher densities, the optimal codebook dimensionality is higher. We hypothesize that Chatfield et al. had not reached that optimal dimensionality for their extremely dense extraction setting.

Our results match those of Boureau et al. [11], who reported that on Caltech 101, the optimal codebook size is relatively low for hard assignment coding, higher for soft assignment coding, and higher when extraction densities increase. These trends are also observed by van Gemert et al. [87].

Comparing the required codebook size is useful, as it points to a potential increase in the computational cost of the coding phase, but comparing the resulting model size is also useful, as this affects the computational cost of learning and testing. In 3-level spatial pyramid models, the model dimensionality is 21× the size of the codebook. This is due to the 21 spatial pooling regions (1 whole-image region, 2×2 regions at the second level, and 4×4 regions at the third level). Figure 4.4 also compares the methods based on their model dimensionality, not just their codebook size. 4-codebook SLC produces better classification accuracy than a state-of-the-art spatial pyramid at comparable model dimensionalities.

Figure 4.5 highlights a possible reason for the improved performance of SLC compared to spatial pyramid models.
When quantizing on appearance only, and then reserving a bin for that appearance in each of the spatially arranged pooling regions, bins are reserved for appearance-location combinations that do not occur in the training data. Since SLC clusters appearance and location jointly, the resulting cluster centers must be representative of the training data.

Figure 4.4: Effect of dimensionality on SLC performance. We compare the performance of the original spatial pyramid model with the localized soft assignment variant and our SLC model. The localized soft assignment performs better than the original and requires larger codebook sizes for optimal performance. Our SLC model achieves even better performance, but requires even more codebook dimensions. When comparing the model dimensionalities (21× the size of the codebook to account for the spatial pyramid bins, or 4× the size of the codebook to account for our 4-codebook SLC), SLC outperforms the others at the equivalent model dimensionalities.

4.5 Efficient codebook construction

As highlighted in the previous section, the codebook size used by our model is significantly larger than those used by spatial pyramid methods. Efficient codebook construction therefore becomes important. The standard algorithm for k-means takes O(nkd) per iteration, where n is the number of data points (SIFT features) that are clustered into k clusters and d is the data dimensionality.
We used two approximations in our implementation of k-means to achieve efficient clustering.

First, we use FLANN (Fast Library for Approximate Nearest Neighbors) [69] to perform approximate assignment of data points to clusters during each iteration of k-means. FLANN automatically selects and tunes its approximate nearest neighbor search algorithm to the particulars of the dataset in order to achieve a specified accuracy with maximum efficiency. This is similar to the approximate k-means described by Philbin et al. in [78], but with more precise control of the approximation error. Figure 4.6 shows how classification performance and construction time are affected by changing FLANN's approximation accuracy.

[Figure 4.5 plot: histograms of the amount of support per histogram bin (0–20) vs. number of bins (0–4000), for the local soft assignment spatial pyramid model and the spatially local coding bag-of-words model.]

Figure 4.5: The spatial pyramid model has many bins with relatively low levels of support. Bins are used for appearance/location combinations that occur only rarely in the training data. The support per bin in the SLC model is more normally distributed. If a bin is created, it is because there is support for it in the training data.

Second, we have observed a small improvement in classification accuracy by using the kmeans++ [4] initialization method rather than random initialization. However, this initialization method is expensive, so we use a subsampled kmeans++ initialization that does not significantly affect our results. Instead of performing kmeans++ initialization using all descriptors to be clustered, we perform kmeans++ initialization using a random 10% of the descriptors. This gives the benefit of an improved initialization at a fraction of the cost.
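A minimal sketch of the subsampled kmeans++ seeding (the 10% subsampling ratio is from the text; the function name and the guard that keeps the sample at least size k are our assumptions):

```python
import numpy as np

def subsampled_kmeans_pp(points, k, subsample=0.10, seed=0):
    """kmeans++ seeding run on a random fraction of the descriptors."""
    rng = np.random.default_rng(seed)
    n_sample = max(k, int(len(points) * subsample))
    sample = points[rng.choice(len(points), size=n_sample, replace=False)]
    centers = [sample[rng.integers(len(sample))]]
    for _ in range(k - 1):
        # Squared distance from each sampled point to its nearest chosen seed.
        d2 = np.min([((sample - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # kmeans++: draw the next seed with probability proportional to d^2.
        centers.append(sample[rng.choice(len(sample), p=d2 / d2.sum())])
    return np.stack(centers)
```

The full k-means iterations then start from these seeds; only the seeding works on the subsample.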
When using subsampled kmeans++ initialization instead of full kmeans++ initialization, construction time (clustering 1,000,000 descriptors into a 4096-dimensional codebook) with 95% accuracy drops from 37 minutes to 20 minutes. At 80% accuracy, the construction time drops from 23 minutes to 12 minutes. Note that exact k-means using full kmeans++ initialization requires approximately 240 minutes in this setting (see Fig. 4.6). We use 90% target accuracy and subsampled kmeans++ in all of our following experiments.

[Figure 4.6 plot: classification accuracy (0.475–0.510) vs. k-means approximation accuracy (0.0–1.0), with construction times labeled at several data points: 240, 37, 23, 18, 16, and 15 minutes.]

Figure 4.6: Codebooks produced by approximate k-means give similar classification accuracies as codebooks produced by the exact k-means algorithm. The construction time for several data points is shown. In this experiment, we clustered 1,000,000 features into 4096 codewords to use in a bag-of-features model. Initialization was with kmeans++. The dataset was Caltech 101, using 15 training images per category. (Dataset details are in Section 4.6.)

4.6 Experiments

We compare spatially local coding against a variety of state-of-the-art spatial pyramid variants on Caltech 101 [32] and Caltech 256 [39].¹

We resize Caltech 101 and Caltech 256 images to fit inside a 300×300 pixel square. This is consistent with the published results we compare against. We compared methods using 15 and 30 training images per category for Caltech 101, and using 30 training images per category for Caltech 256, all common points of reference from the literature.

We first learn a codebook using a random subsampling of 1,000,000 descriptors from the training set. This is an appearance-only codebook for all of the spatial pyramid methods.

¹Code is available at for ease of comparison.
For our SLC method, we learn multiple spatially local codebooks, with λ = {0, 0.75, 1.50, 3.00}.

Then, we form the spatial pyramid or SLC representations of the training images. For our implementation of the original spatial pyramid, we use nearest-neighbor vector quantization and average pooling in three spatial pyramid levels, as in [56]. For our implementation of the localized soft assignment spatial pyramid, we use soft assignment over the ten nearest codebook elements, and max-pooling in three spatial pyramid levels as in [62]. For our SLC model, we also use soft assignment over the ten nearest codebook elements from each λ-level, and max-pooling over a global spatial pooling region. In [62], they showed results motivating a small neighborhood, with ten neighbors being optimal for Caltech 101. We confirmed that this still holds in our codebooks.

We learn for each class a one-vs-all SVM (using the LibSVM library), selecting the regularization parameter via cross-validation. For each run of the experiment, we record the average class accuracy over the 101 or 256 classes (ignoring the background class as suggested by [39]).

We experimented with both the linear SVM and the histogram intersection kernel SVM, but found that the linear SVM outperforms the histogram intersection kernel for our model (see Fig. 4.7). Thus, in Table 4.3, the results shown for our SLC model are obtained using a linear SVM.

All of our reported numbers are based on 10 repetitions of the experiment, each time selecting a different training/testing split, building new codebooks, and learning new models. We report the mean and standard error of the mean.

The methods we compare against have generally published their results based on single-scale 16×16 SIFT features [63], using the VLFeat implementation [88], extracted every 8 pixels [11, 56, 62, 93]. We perform the majority of our comparisons at this setting as well.
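A sketch of the coding and pooling just described: localized soft assignment over the k = 10 nearest codewords, followed by a single whole-image max-pooling region. The Gaussian-style weighting with smoothing parameter beta follows the usual localized soft assignment formulation; the value of beta here is an illustrative assumption, not a setting from our experiments.

```python
import numpy as np

def localized_soft_assignment(feature, codebook, k=10, beta=10.0):
    """Soft-assign one descriptor to its k nearest codewords (zeros elsewhere)."""
    d2 = ((codebook - feature) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:k]
    # Shift by the smallest distance for numerical stability before exponentiating.
    weights = np.exp(-beta * (d2[nearest] - d2[nearest].min()))
    code = np.zeros(len(codebook))
    code[nearest] = weights / weights.sum()
    return code

def encode_image(features, codebook, k=10):
    """Single whole-image max-pooling region over the per-feature codes."""
    codes = np.stack([localized_soft_assignment(f, codebook, k) for f in features])
    return codes.max(axis=0)
```

Each per-feature code sums to one and is non-zero only on the ten nearest codewords; the pooled vector keeps the strongest response per codeword across the image.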
We re-implement the original spatial pyramid method [56] and the best performing state-of-the-art variant, localized soft assignment [62], so as to provide a head-to-head comparison based on our extracted features. To report the classification accuracy achieved by our re-implementations, we ran them at various dimensionalities, since their performance is dependent on codebook dimensionality, and we report the highest result we could achieve with the other methods.

[Figure 4.7 plot: classification accuracy vs. SLC codebook dimensionality (8192–65536) for linear and intersection kernels, on Caltech 256 (30 training images) and Caltech 101 (15 training images).]

Figure 4.7: Our model yields best results with a linear SVM. Only at low dimensionalities on Caltech 256, where the performance of both SVM types is low, does the intersection kernel outperform the linear SVM. At optimal codebook dimensionalities, our model's performance using a linear SVM dominates the performance of the histogram intersection kernel.

Despite our results on SLC being obtained at higher codebook dimensionalities, this comparison is fair, since the model dimensionalities are comparable (see Fig. 4.4), and performance only deteriorates for the other methods as we increase their codebook size.

Table 4.3 shows that spatially local coding achieves better classification accuracy than all previous alternatives using single-scale SIFT features on Caltech 101 and Caltech 256.

Boureau et al. [13] have reported higher accuracy than any of the other previous methods. However, they achieve these results using a denser SIFT sampling (every 4 pixels), and use a feature based on the concatenation of 4 individual SIFT features within a small region, which they call macrofeatures.
We set these results aside from the rest of the reported numbers to highlight this difference, and run a separate comparison where we also extract SIFT every 4 pixels (however, we don't construct their macrofeatures). Spatially local coding again achieves a higher classification accuracy.

Wang et al. [91] use another slightly different feature set for their experiments on locality constrained linear coding. They extract multiscale features (HOG) every 8 pixels at three different sizes. We again set these reported results aside from the rest and run a separate comparison.

While the above comparisons are sufficient to demonstrate SLC's superior performance as compared to state-of-the-art spatial pyramids, we provide an additional evaluation at an even denser extraction setting, extracting multiscale SIFT every 4 pixels.

These experiments show that spatially local coding outperforms all previous spatial pyramid methods for object recognition on Caltech 101 and Caltech 256. We have shown this in the setting of single-scale and multi-scale SIFT features, at two different extraction densities. In addition, we have provided a detailed survey and comparison of the previously published results, making clear how they are comparable with each other or not.

To the best of our knowledge, any previous work reporting higher accuracy than ours combines together multiple complementary feature types. This is expected, since multiple feature types, while increasing the complexity, provide additional and often complementary cues that improve classification performance.

4.6.1 Notes on timing

Codebook construction scales logarithmically with the number of codebook dimensions, because we use approximate matching at each k-means step.
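Because the λ-level codebooks are independent of one another, their construction parallelizes naturally. A minimal sketch of this (the clustering routine here is a random-selection stand-in, not our FLANN-based k-means, and the 130-dimensional layout of a location-augmented descriptor, 128 SIFT dimensions plus x and y, is an assumption):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def build_codebook(features, k, lam, seed=0):
    """Stand-in for one codebook build: weight the trailing (x, y) columns
    by lambda, then pick k rows at random as 'cluster centers'."""
    rng = np.random.default_rng(seed)
    augmented = np.hstack([features[:, :-2], lam * features[:, -2:]])
    return augmented[rng.choice(len(augmented), size=k, replace=False)]

def build_all_codebooks(features, k, lambdas=(0.0, 0.75, 1.50, 3.00)):
    # Each lambda level is an independent clustering problem, so the
    # builds can run concurrently.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(build_codebook, features, k, lam) for lam in lambdas]
        return [f.result() for f in futures]
```

With a real clustering routine in place of `build_codebook`, the elapsed time is governed by the slowest single build rather than the sum of all four.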
SLC's multiple codebook construction can be multi-threaded, giving elapsed real time for codebook construction that scales logarithmically with the number of codebook dimensions.

Vector quantization also scales logarithmically with the number of codebook dimensions and linearly with the number of codebooks. To code 7680 training images (for Caltech 256), 1024D SPM required 2 minutes. SLC using four 16384D codebooks takes 11 minutes. That is 5.5× the compute time, despite the fact that the model size is 64× as large.

SVM training is the bottleneck regardless of the approach. This time is unaffected by switching from local soft assignment SPM to SLC. Both use soft assignment across a small number of nearest neighbors, giving sparse feature vectors. Gram matrix construction for Caltech 256's 7680 training images took 55 minutes using local soft assignment SPM with 8192D, and 50 minutes under SLC using four 8192D dictionaries. Doubling the SLC dimensionality to 16384D increased the gram matrix construction time to 65 minutes.

4.7 Chapter summary

Spatially local coding is a simpler and more flexible way to enforce spatial locality in histogram-based object models than previous approaches. By simultaneously coding appearance and location, we remove the necessity of a complicated pooling stage to adequately model the spatial locations of visual features.
We have shown that a combination of location-augmented codebooks gives better classification accuracy than spatial pyramid models. SLC represents location information in the same way as NBNN and LNBNN, but clusters to avoid the prohibitive memory usage associated with storing every training feature.

The large codebook dimensionality required by our models potentially poses extra computational cost, but we have shown approximations that speed up the codebook construction process and an evaluation of the effect of those approximations on classification accuracy.

We have presented these results alongside a survey of previously published results and our re-implementations of previous results. To the best of our knowledge, we have shown the highest classification accuracy achieved on Caltech 101 and Caltech 256 among methods using greyscale SIFT features.

                                       Cal 101 (15)   Cal 101 (30)   Cal 256 (30)
Single-scale SIFT features extracted every 8 pixels
Original SPM [56]                      56.4           64.6±0.8       -
Original SPM (ours)                    57.8±0.3       65.2±0.4       30.0±0.4
Localized soft assignment [62]         -              74.21±0.81     -
Localized soft assignment (ours)       66.2±0.4       72.2±0.3       37.2±0.2
Sparse Coding SPM [93]                 67.0±0.45      73.2±0.54      34.02±0.35
Sparse Coding SPM [11]                 -              71.8±1.0       -
SLC [8192d × 4]                        68.4±0.2       75.5±0.4       38.9±0.3
SLC [16384d × 4]                       68.3±0.3       75.7±0.4       40.0±0.2
Multi-scale features extracted every 8 pixels
LLC [91] (HOG)                         65.43          73.44          41.19
Localized soft assignment [62]         -              76.48±0.71     -
SLC [8192d × 4]                        71.4±0.4       78.0±0.4       41.8±0.3
SLC [16384d × 4]                       72.5±0.3       79.2±0.2       43.4±0.2
SLC [32768d × 4]                       70.9±0.4       77.2±0.6       44.3±0.1
SLC [65536d × 4]                       69.6±0.3       77.5±0.4       43.4±0.3
Single-scale SIFT features extracted every 4 pixels
Boureau et al. [13]                    -              77.3±0.6       41.7±0.8
SLC [8192d × 4]                        71.0±0.3       77.6±0.2       41.9±0.2
SLC [16384d × 4]                       71.1±0.3       78.9±0.2       43.5±0.3
SLC [32768d × 4]                       71.6±0.4       79.6±0.8       44.6±0.2
SLC [65536d × 4]                       -              -              45.1±0.2
Multi-scale SIFT features extracted every 4 pixels
SLC [65536d × 4]                       72.7±0.4       81.0±0.2       46.6±0.2

Table 4.3: Results on Caltech 101 (using 15 and 30 training examples per class) and Caltech 256 (using 30 training examples per class). We compare our SLC method against previously published figures and against re-implementations of the original spatial pyramid and localized soft assignment tested on our feature set. The numbers we report for our re-implementations of previous methods are the best results we could achieve from among a range of codebook dimensionalities. Bold method names show our experiments; the others are from the literature.

Chapter 5

Localization with Spatially Local Coding

5.1 Introduction

As shown in Chapter 4, SLC [66] is a viable alternative for including spatial information in object recognition. SLC is a more flexible spatial representation than the fixed grids used in the spatial pyramid models, and it uses a simple, whole-image region during the pooling stage. It outperforms modern variants of the spatial pyramid at equivalent model dimensionalities, and achieved better classification performance than all previous single-feature methods when tested on the Caltech 101 and 256 object recognition datasets [66].

Given SLC's high performance as a classifier, it is natural to adopt it for use as a detector to find the location of objects within images of natural scenes.

The contribution of this chapter is the design and evaluation of an efficient detector for Spatially Local Coding. Given the non-standard way that SLC includes location information, designing this detector is not a straightforward application of an existing detector framework.
We demonstrate the promise of this detector by evaluating on the Daimler monocular pedestrian dataset [28].

We start with a short review of previous detector work, describe the design and optimization of our detector, and finish with evaluation and discussion.

5.2 Related work

There is a large diversity of approaches to object detection, and we review in this section a small sampling to highlight the range of methods that have shown success.

One family of approaches starts with local feature patches such as SIFT [63] and combines them in bags-of-words [19] or spatial pyramids [56] for classification of image sub-windows. These have been applied to detection through the use of the spatial pyramid as an exemplar model [17].

Another line of work starts with larger-scale features representing the overall object appearance or parts. Examples here include histograms of oriented gradients (HOG) [20] and the deformable parts model [34].

Any classifier can be used as a detector by treating the detection problem as localized classification: sliding the classifier across the image at different scales and finding maxima of the classification function. However, this can be expensive, especially as one increases the resolution of the search space. More efficient alternatives have been proposed. Efficient sub-window search [53, 54] avoids exhaustive evaluation of every sub-window by using a branch-and-bound search.

Most related to our proposed method is the implicit shape model (ISM) [60], which uses a generalized Hough transform to vote for object centers. ISM's object model is generative, and its probabilistic Hough voting is part of the model.
In our detector, the Hough voting (see Section 5.5) is only an initial approximation to the SLC model.

Higher order information such as object interrelationships [92] and tracking (in the case of video or image sequences) could also be leveraged to aid in detection performance, but these approaches are orthogonal to basic detector design.

5.3 Spatially local coding

We will briefly review SLC from Chapter 4. SLC uses location-augmented feature vectors, localized soft-assignment coding, and a single, whole-image max-pooling region. SLC moves the task of maintaining spatial locality into the coding stage, whereas previously this had been left for the pooling stage.

We avoid the issue of λ selection by use of the multi-level SLC variant as suggested in Chapter 4.

5.4 Detection methods

This section reviews two methods that seek to avoid having to perform an exhaustive sliding window detection. A 320×240 image "contains more than one billion rectangular sub-images" [53]. As individual evaluations of the SLC classifier are relatively expensive, it is important to avoid unnecessary evaluations.

5.4.1 Efficient sub-window search

Efficient sub-window search (ESS) was presented by Lampert et al. [53, 54] as a way of effectively performing exhaustive search of all image sub-windows in time that is sub-linear relative to the number of possible sub-windows. ESS is a branch-and-bound algorithm that relies on efficiently computing an upper bound for the scores of sets of sub-windows.

The efficiency of the original ESS bound results from a one-time quantization from features into codewords and then accumulation of those codewords' linear SVM weights into integral histograms [53]. This step is not possible with SLC. Without committing to a particular sub-window reference frame, the quantization from feature into an SLC codeword is not defined, because location relative to the sub-window is a component of the feature.

The alternative, feature-centric efficient sub-window search proposed by Lehmann et al.
[58, 59] is also inadequate for SLC because the feature-to-codeword quantization still depends on a commitment to a particular detection window as a reference frame. For each possible sub-window, a different feature-to-codeword mapping takes place, rendering the bound of feature-centric ESS expensive to compute. An approximate bound similar to the approximation we make in Section 5.5 is possible, and we include an implementation with our code, but there are two reasons we decide against this approach. First, as observed by [59], the large number of extracted features results in a computational bottleneck. Second, we have observed that using the approximation from Section 5.5 when computing the bounds results in traversing many unfruitful paths through the branch-and-bound search space.

5.4.2 Hough transform

The detection framework we develop builds upon the generalized Hough transform [58, 60]. While it lacks the theoretical guarantees of the efficient sub-window search, the Hough transform is compatible with approximations to SLC, and can still quickly suggest promising regions of the image over which to focus expensive evaluation of the full SLC model. It handles large numbers of features well (the voting phase scales linearly with the number of features). We are able to greatly speed up detection over the sliding window approach, without sacrificing performance.

Our approach uses a series of approximations and a cascade of thresholds to focus the SLC classifier on only the most promising regions. The design and optimization of this pipeline is described next.

5.5 The detection pipeline

5.5.1 Approximate SLC Hough transform

We follow the approach of Lehmann et al. [58] in considering hypothesis footprints. Each hypothesis stamps out a footprint over which evidence for an object detection is accumulated.
However, instead of accumulating votes hypothesis-by-hypothesis as in a sliding window approach, the Hough transform has each feature cast a weighted vote for (or against) hypotheses consistent (or inconsistent) with that feature's occurrence.

The vote weights associated with each codeword are derived from the linear SVM weights. As in Lampert et al. [54], we re-write the linear SVM decision function as a sum of per-codeword weights. The SVM decision function is f(h) = β + Σ_i α_i ⟨h, h^(i)⟩, where h is the SLC histogram being classified, the h^(i) are the training histograms, and the α_i are the learned per-example SVM weights. We can extract per-codeword weights w_j = Σ_i α_i h_j^(i), and re-write the decision function as:

    f(h) = β + Σ_j h_j · w_j    (5.1)

and drop the bias term β because only the relative scores matter.

Figure 5.1: The 5-D Hough space. For each (a, s) ∈ A×S, we build an x, y grid of Hough bins, each of which tracks per-codeword votes. After max-pooling voting is complete, we collapse each histogram into a Hough score for each bin using Equation 5.1.

Since SLC is based on max-pooling rather than sum-pooling, we need to perform the Hough transform in two phases. The first phase, where most of the work is done, involves building a max-pooling histogram at each bin in the discretized Hough space.

This phase occurs in a 5D Hough space: (s, a, x, y, c), with s being scale, represented by hypothesis window width, a being aspect ratio, x and y being the hypothesis center, and c being the codeword. See Figure 5.1 for a visualization.

To determine the region over which a feature will cast a vote, we use a location distribution for each SLC codeword c: (µ^(c), σ^(c)²) (with µ = (µ_x, µ_y) ∈ [−0.5, 0.5]²). The location distribution is only explicitly used during this approximate Hough transform step and is not part of the final SLC model.
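The per-codeword weight extraction behind Equation 5.1 can be checked on toy data (the dimensions and random values below are hypothetical; in the detector, the α_i and h^(i) come from the trained SVM):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.random((5, 8))       # training histograms h^(i), one per row
alpha = rng.normal(size=5)   # learned per-example SVM weights alpha_i
h = rng.random(8)            # SLC histogram being classified

# w_j = sum_i alpha_i * h^(i)_j  (per-codeword weights)
w = alpha @ H

# Both forms of the decision function agree up to the bias beta.
kernel_form = sum(a * np.dot(h, Hi) for a, Hi in zip(alpha, H))
codeword_form = np.dot(h, w)
assert np.isclose(kernel_form, codeword_form)
```

Once the w_j are precomputed, each feature's contribution to a hypothesis score depends only on which codewords it activates.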
This location distribution is a means of approximating the region over which a given codeword is likely to contribute to the hypothesis footprint.

SLC quantizes using feature location relative to a reference frame. However, there is no such reference frame when quantizing features prior to voting in Hough space. Thus, we make the following approximation. We quantize using localized soft assignment [62] based solely on appearance information (we set λ = 0 in Equation 4.9). This maps a feature to the codewords that it might be matched to under SLC. We use this tentative matching based on appearance along with the learned location distributions to determine the bins in which to cast Hough votes.

Given feature center (f_x, f_y), an appearance-only soft-coding that results in non-zero weight for codeword c, and the learned location distribution for that codeword, (µ^(c), σ^(c)²), we compute for each (a, s) ∈ A×S the Hough bins that this feature should vote in as follows:

    µ̂_x = f_x − µ_x^(c) · s;    µ̂_y = f_y − µ_y^(c) · (s/a)    (5.2)

    σ̂_x = σ_x^(c) · s;    σ̂_y = σ_y^(c) · (s/a)    (5.3)

Equations 5.2 and 5.3 invert the learned location distribution into Hough space at the appropriate scale (width) and aspect ratio. We update the max-pooling histogram for codeword c in all Hough bins within 3σ̂ of µ̂.

After all votes have been cast, we apply Equation 5.1 to each bin to turn the histograms into scores.

5.5.2 Optimizing Hough predictions

In this section, we coarsely optimize parameters of our approximate Hough transform on a subset of the VOC 2012 car category. We train on all non-truncated, non-difficult, non-occluded cars in the 'train' portion of the training set, and test on all images in the 'validation' set that include a car.

Our goal in using the Hough transform and cascade of refinements is to reduce the number of times we need to evaluate using the full model. First, we check that the approximation we use when computing the Hough votes is appropriate.
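The vote-placement step of Equations 5.2 and 5.3 can be sketched as follows (the function and variable names are ours; s is the hypothesis window width and a the aspect ratio, as in the text):

```python
def vote_region(fx, fy, mu_c, sigma_c, s, a):
    """Invert a codeword's learned location distribution into Hough space
    (Equations 5.2 and 5.3) and return the 3-sigma voting extent."""
    mu_hat = (fx - mu_c[0] * s, fy - mu_c[1] * s / a)
    sigma_hat = (sigma_c[0] * s, sigma_c[1] * s / a)
    x_extent = (mu_hat[0] - 3 * sigma_hat[0], mu_hat[0] + 3 * sigma_hat[0])
    y_extent = (mu_hat[1] - 3 * sigma_hat[1], mu_hat[1] + 3 * sigma_hat[1])
    return mu_hat, sigma_hat, (x_extent, y_extent)
```

Votes for codeword c at a given (s, a) then update the max-pooling histograms of every (x, y) Hough bin inside the returned extent.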
Do the Hough scores reflect roughly the regions of the image that are more likely to contain the object of interest? We run a simple thresholding algorithm (Algorithm 3) for this check. Figure 5.2 shows that the Hough scores are meaningful. Almost every Hough bin has a score greater than -1, so the recall when thresholding at -1 is effectively the maximum recall we can achieve with a Hough transform at this bin density. We retain almost 100% of that recall and eliminate 60% of the bins from consideration by thresholding at zero. (Recall achieves a maximum of only 86% in this series of experiments due to our choice of search grid resolution and minimum scale. We make the same choices for the sliding window baseline we compare against later in Figure 5.4.)

[Figure 5.2 plot: recall vs. Hough threshold (−1.0 to 1.0); labels show the number of Hough bins passing the threshold.]

Figure 5.2: We are able to maintain high recall while eliminating more than half of the candidate windows by discarding Hough bins receiving negative scores. This experiment runs Algorithm 3 with a varying threshold. (Recall in this experiment only reaches 86% due to our choice of grid resolution and minimum scale.)

Algorithm 3: Thresholding bins
    Data: 4D Hough map m, threshold t
    results ← {};
    for bin b ∈ m do
        if b.score >= t then
            add b to results;
        end
    end

We can do better. Instead of selecting all Hough bins that pass a threshold, what if we select only local peaks in Hough space that pass the threshold? Figure 5.3 shows the result of running Algorithm 4 with a varying threshold. There are on average 2528 local Hough peaks in each image.

[Figure 5.3 plot: recall vs. Hough threshold (−3.0 to 1.0); labels show the number of Hough peaks passing the threshold.]

Figure 5.3: By focusing only on peaks, we eliminate many candidate locations. Thresholding at zero further reduces the number of candidate windows without reducing recall. This experiment runs Algorithm 4 with a varying threshold.
This results in recall of 81% compared to exhaustive consideration of all candidate windows (86%), as shown at the left extreme of Figure 5.2. Again, by thresholding at zero, we remove even more windows from consideration without losing further recall.

Algorithm 4: Thresholding peaks
    Data: 4D Hough map m, threshold t
    results ← {};
    for bin b ∈ m do
        if b.score >= t and IsLocalMax(b) then
            add b to results;
        end
    end

This is additional evidence that the signal from our Hough transform, even though based on an approximation of our full model, is meaningful. The score well separates candidate regions from background.

Many windows still remain, and many candidates overlap significantly with one another. We perform non-maximum suppression at this point to produce the final predictions based solely on the Hough scores (Algorithm 5). This results in poor detection performance (13% average precision) compared to the sliding window detector.

Algorithm 5: Predict from thresholded peaks
    Data: 4D Hough map m, threshold t
    results ← {};
    for bin b ∈ m do
        if b.score >= t and IsLocalMax(b) then
            add b to results;
        end
    end
    NonMaximumSuppression(results)

The discrepancy is because the Hough scores are only an approximation to the full model. The scores of the candidate windows at this stage are not as accurate as what the full model would provide, and they aren't as precisely localized. Even if we did re-evaluate each of these peaks with the true model (Algorithm 6), the fact that they aren't well-localized means that those scores will still not be as informative, boosting performance to only 19% average precision.
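The NonMaximumSuppression step used in Algorithms 5 and 6 can be sketched as greedy suppression by intersection-over-union (the box representation and the 0.5 overlap threshold are our assumptions, not values from the text):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2, score)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def non_maximum_suppression(boxes, overlap=0.5):
    """Keep boxes in decreasing score order, dropping any that overlap
    an already-kept box by more than the threshold."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= overlap for k in kept):
            kept.append(box)
    return kept
```

For example, of two heavily overlapping candidates, only the higher-scoring one survives, while a distant candidate is retained.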
Figure 5.4 (Hough peaks and Re-scored Hough peaks) shows the performance of these two methods.

Algorithm 6: Predict from re-scored peaks
    Data: 4D Hough map m, threshold t
    results ← {};
    for bin b ∈ m do
        if b.score >= t and IsLocalMax(b) then
            b.score = EvaluateSLC(b);
            if b.score >= t then
                add b to results;
            end
        end
    end
    NonMaximumSuppression(results)

[Figure 5.4 plot: average precision for Sliding Window (~48 hours), Hough peaks (~1 hour), Re-scored Hough peaks (~2 hours total), and Re-scored Hough peaks with 4D refinement (~4 hours total).]

Figure 5.4: By re-scoring the Hough peaks, and then refining their locations using gradient descent, we retrieve the performance of the sliding window detector while gaining a large speedup. All methods were subject to non-maximum suppression to eliminate overlapping predictions.

One last step is necessary to nearly recover the performance of the sliding window approach. After re-scoring the Hough peaks, so that we know their score under the true model, we again discard peaks with negative scores, suppress strong overlaps, and finally refine the remaining peaks using a gradient descent procedure [58]. Our gradient descent uses a finite difference approximation of the gradient, followed by line search. We simultaneously refine the (x, y) location, the aspect ratio, and the scale until we reach a local maximum. This is in contrast to Leibe et al. [60], who are able to use mean-shift [18] to find local maxima directly in the Hough voting space. This method is not applicable in our case because our Hough votes are approximate.
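The refinement procedure can be sketched as finite-difference gradient ascent with a backtracking line search over (x, y, scale, aspect). Here `score_fn` stands in for evaluation of the full SLC model at a hypothesis; the step size and iteration count are illustrative assumptions:

```python
import numpy as np

def refine_hypothesis(hypothesis, score_fn, step=1.0, n_iters=50):
    """Ascend the detector score from a Hough peak to a local maximum."""
    h = np.asarray(hypothesis, dtype=float)
    for _ in range(n_iters):
        # Central finite-difference approximation of the gradient.
        grad = np.zeros_like(h)
        for i in range(len(h)):
            e = np.zeros_like(h)
            e[i] = step
            grad[i] = (score_fn(h + e) - score_fn(h - e)) / (2 * step)
        if np.allclose(grad, 0.0):
            break
        # Backtracking line search along the gradient direction.
        t = 1.0
        while t > 1e-3 and score_fn(h + t * grad) <= score_fn(h):
            t *= 0.5
        if score_fn(h + t * grad) <= score_fn(h):
            break  # no ascent step found; treat as a local maximum
        h = h + t * grad
    return h
```

On a smooth toy score with a single maximum, the procedure walks the hypothesis onto that maximum; the real detector score is only piecewise-smooth, so the procedure stops at a local maximum near the initial peak.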
The true classifier score is not found in the Hough votes and must be obtained by evaluation of the full model at a hypothesized location. As a result, we consider a much finer set of potential candidate windows around the peaks than the sliding window approach does, but only evaluate a small number of windows in total using the full SLC model.

Algorithm 7: Full detection pipeline
    Data: 4D Hough map m, threshold t
    results ← {};
    for bin b ∈ m do
        if b.score >= t and IsLocalMax(b) then
            b.score = EvaluateSLC(b);
            if b.score >= t then
                add b to results;
            end
        end
    end
    NonMaximumSuppression(results);
    for b ∈ results do
        b ← GradientDescentSLC(b);
    end
    NonMaximumSuppression(results)

Now that we have largely recovered the performance of the brute force sliding window detector using a much faster approximation, we turn our attention to the design of the training phase.

5.5.3 Mining hard negative examples

For the above experiments, we used a model that was trained with a single pass through the training set, collecting all non-difficult, non-truncated, non-occluded positive examples and 10× that many negative examples selected randomly from regions of the training data that did not overlap with any positive example.

A technique used by many others in the past [20, 34, 90] is to mine the training data for hard negative examples: negative windows that the classifier erroneously predicts as belonging to the class. Walk et al. [90] observe that without hard negative selection, performance is extremely sensitive to the randomness in selecting examples for the negative training set, and that at least 2 re-training rounds are required in order to reach full performance using HOG + linear SVM. We follow this practice by running the detector over the training set, and adding the highest scoring false positives into our negative training set. Figure 5.5 shows the effect of hard negative mining on the car class. The effect of additional hard negative examples is clear.

[Figure 5.5 plot: precision vs. recall for the detector on VOC 2012 cars; 1 training round (average precision: 0.229422), 3 training rounds (average precision: 0.320923).]

Figure 5.5: Three rounds of training with hard negative mining results in a significant improvement in classifier accuracy.

5.6 Evaluation

We evaluate our detector on the Daimler monocular pedestrian dataset [28], which presents a challenging real-world problem and has been used to test and compare other methods. While the previous chapters evaluated classification algorithms on Caltech 101 and 256, those datasets are not relevant to the localization problem as they contain only a single, usually well-framed object in each image. We select the Daimler pedestrian dataset because it allows us to focus solely on the localization problem, without tangential issues related to widely varying aspect ratios and viewpoints as in the PASCAL VOC datasets.

The Daimler pedestrian training set comprises 15660 pedestrian training examples, each presented in a 48×96 cropped image, and 6744 images containing no pedestrians. We extract multi-scale SIFT and build three SLC dictionaries with λ = {0.0, 1.5, 3.0}. We perform three training rounds. During the first training round, we extract random negative training windows such that the number of negatives is 7× the number of positives (chosen to fit within memory constraints). We train a linear SVM using k-fold cross validation for selection of the hyper-parameters. Before re-training in the next round, we run our detector across all 6744 negative training images to mine for the most difficult (most highly scored) negative examples, and replace the 25% easiest examples from the previous training round with an equivalent number of new difficult examples.

For testing, we followed the evaluation protocol described by Dollár et al. [25]. We evaluate against all fully-visible, labeled pedestrians, ignoring bicyclists, motorcyclists, pedestrian groups, and partially visible pedestrians.
Detections and failed detections of ignored annotations neither count for nor against our detector. As in [25], we standardize the aspect ratio of all ground truth boxes to 0.41 by retaining the annotated height and adjusting the width of the ground truth bounding box to 0.41 times the height. We up-scale the test images by 1.6 to allow detection of the smaller-scale pedestrians.

Enzweiler et al. [28] initially proposed reporting recall vs. false positives per frame. However, Dollár et al. [25] point out that false positives per image is a more useful measure of false positives in this setting. They observe that the difference is in whether one is evaluating the performance of the classifier underlying the detector, or evaluating the performance of the entire detector pipeline as a complete system: “Choices made in converting a binary classifier to a detector, including choices for spatial and scale stride and non-maximal suppression, influence full image performance.” [25]. We follow them in reporting miss rate vs. false positives per image. This is similar to reporting precision vs. recall as in the Visual Object Classes Challenge, but false positives per image is perhaps more important in an automotive pedestrian detection setting. Figure 5.6 shows our results compared against the single-feature, non-motion results reported by [25].

Our detector outperforms HOG [20], histogram intersection kernel SVM [65], and an early version of the deformable parts model [33] throughout a wide range of false-positive rates. The one single-feature method that has an advantage is the latent SVM [34] that learns explicit subparts. State of the art on this dataset is achieved by methods that use multiple features and frame-to-frame tracking cues [25], but that is not relevant to our comparison. The recall (1 − miss rate) of our detector does saturate at higher false positive rates, but Dollár et al. identify the region between 10⁻² and 10⁰ false positives per image as the region most relevant for comparison. This is supported by Hussein et al. [42], who also use a score that focuses more on the region of the curve with low false alarms. They say, “this is useful since in many applications we are more interested in the low false alarm rate range”. Nevertheless, the saturation of recall in our detector is a point for future investigation. We suspect this occurs because some of the true detections are not covered by the initial set of Hough peaks, and our refinement phase is not able to recover from these poor local maxima.

Figure 5.7 shows typical output of our detector. Common error cases include reporting false positives on regions containing vertical structures and merging two pedestrians into a single detection.

5.7 Chapter summary

We've shown that efficient detection and accurate localization can be achieved for SLC through use of a Hough transform approximation, followed by non-maximum suppression and gradient descent refinement. The approach was demonstrated on the widely used Daimler monocular pedestrian dataset, achieving a high level of detection accuracy.

There is considerable scope for further research to extend this approach. One component of the current pipeline needing improvement is the selection of Hough peaks. Initializing the gradient descent procedure from a broader selection of locations would lead to improved recall. There is also considerable scope for expanding the set of features being used and adding location weights to achieve higher detection performance.

Figure 5.6: Results on the Daimler monocular pedestrians dataset: miss rate vs. false positives per image for SLC (this paper), HOG, HIKSVM, LATSVM-V1, and LATSVM-V2. (The data is taken from [25]. We have extracted the performance curves for the single-feature, non-motion methods that they evaluate. Methods incorporating multiple features and motion cues perform even better.)

Figure 5.7: Examples of typical detections made by our detector on the Daimler monocular pedestrian dataset: (a) two successful detections; (b) a success, and a false positive; (c) a missed detection (the child). We've displayed detections above the threshold associated with the equal error rate. The numbers attached to each detection report the SLC score. True positives are outlined in green. Ground truth annotations are outlined in blue. False positives are outlined in red.

Figure 5.8: A more representative sample of localization performance. These examples were sampled uniformly from the test set. Detections are presented if they are above the threshold that produced the equal error rate. Note that, in this dataset, pedestrians below a certain height are not included in the evaluation protocol, nor are bicyclists or occluded pedestrians.

Chapter 6: Conclusion

6.1 Discussion

Object classification and localization are important for a broad set of applications. Classification can be used to organize photo collections, for content-based image retrieval, and as a component of scene understanding. It is also used as a means to an end in localization. Object category localization is crucial to many applications that require some level of understanding of the physical world. Surveillance, face localization on digital cameras, tracking, an automatic car reacting to pedestrians and other objects – all of these benefit from an accurate object category localization system.

The performance of categorization is not yet close to the performance realizable by humans, and as in the past, continued improvement to object classification will yield improved object localization.
However, those improvements only transfer in a practical setting if classification methods can be used efficiently for localization.

This thesis adds to the field by presenting two improvements to object classification – Local Naive Bayes Nearest Neighbor (LNBNN) and Spatially Local Coding (SLC) – and integrates SLC into an object localization framework.

Chapter 3 covered LNBNN, an improvement to Naive Bayes Nearest Neighbor (NBNN). This method avoids quantization of features and performs recognition by composition: is the test image explained by patches from a training class? Our contribution is demonstrating that only features in the local neighborhood are informative for this type of recognition. A change to the nearest-neighbor search strategy that exploited this observation made the scaling logarithmic in the number of object categories, achieving a 100× speedup on Caltech 256 and higher performance than the original NBNN. LNBNN still makes an unfavourable memory trade-off, in that all training features must be maintained. Additionally, it did not outperform some state-of-the-art methods in classification accuracy.

Chapter 4 showed how to use the feature representation from NBNN (a location-augmented feature) in the bag-of-words model as an alternative to the spatial pyramid. This is a more flexible method of incorporating feature-location information, avoiding the commitment to the fixed, hierarchical grids of the spatial pyramid. We showed state-of-the-art performance on the object classification task, outperforming state-of-the-art spatial pyramid methods. We achieved the highest reported performance of any method based on greyscale SIFT features.

Chapter 5 presented the design of a localization framework that incorporates the SLC classifier. We use an approximate Hough transform along with a cascade of thresholds and refinements en route to final predicted locations.
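The location-augmented features just described can be illustrated in a few lines: the appearance descriptor is concatenated with its image position, scaled by a weight (here `lam`), before nearest-neighbor assignment to a dictionary. This is a minimal sketch under assumed inputs, not the thesis code; the toy descriptor, dictionary, and weight values below are illustrative only.

```python
def augment(descriptor, x, y, width, height, lam):
    """Append the feature's normalized (x, y) image position, weighted
    by lam, to its appearance descriptor. With lam = 0 this reduces to
    a standard appearance-only bag-of-words feature; larger lam makes
    dictionary assignments increasingly location-sensitive."""
    return list(descriptor) + [lam * x / width, lam * y / height]

def nearest_word(feature, dictionary):
    """Hard-assignment coding: index of the closest dictionary entry
    under squared Euclidean distance."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(range(len(dictionary)), key=lambda i: dist(feature, dictionary[i]))
```

Because location enters through the augmented descriptor itself, the same appearance pattern can map to different visual words in different image regions, so the resulting histogram captures spatial layout without any fixed grid.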
Evaluation on the Daimler monocular pedestrian dataset demonstrates localization performance competitive with contemporary alternatives.

6.2 Future Work

This thesis leaves open several directions for future research.

Feature substitution or inclusion of additional feature types. The work in this thesis has been based on a single feature type (greyscale SIFT). This allowed us to isolate the effects of improvements to the classification and localization algorithms from feature design. However, feature design is important, and the design decisions of the rest of the pipeline may be altered by improved feature design or use of multiple complementary features. To achieve the highest possible classification and localization accuracy, multiple feature types will need to be included. In the SLC framework, the use of additional feature types will increase computational and memory demands. This could be offset by feature mining or sparse regularization to select a subset of useful features from this larger set. Alternatively, we also expect use of features specifically optimized for the particular recognition task or dataset to yield improved performance.

Further improve localization performance. Our localization experiments can be considered exploratory. They demonstrate one method for using SLC to tackle the localization problem. There are other alternatives. While we decided against feature-centric Efficient Subwindow Search (ESS) due to a mismatch in assumptions, perhaps that can be overcome. Our approximate Hough transform is followed by a simple gradient descent method that is vulnerable to a poor starting location. Any improvement to the proposed locations from the approximate Hough transform, or an improved local search, would yield improved localization accuracy.

Apply the spatial locality principle to other problems. SLC is a high-performing classifier.
While we have demonstrated its application to 2D object category localization, it should be considered for other application areas as well. Extension of SLC to the three spatio-temporal dimensions (as HOG [20] was extended to 3D [49]) could allow its use in action recognition. Extension into three spatial dimensions could allow its use for object classification from 3D data.

Algorithm optimization. An object category localization pipeline has many parameters and design alternatives to be chosen. Even with expert knowledge, this is time-consuming and non-intuitive. The appropriate decisions for these parameters and design choices are often domain- and task-dependent. Automatic algorithm configuration [43] could be useful in this area.

Bibliography

[1] S. Agarwal, A. Awan, and D. Roth, “Learning to detect objects in images via a sparse, part-based representation,” PAMI, vol. 26, no. 11, pp. 1475–90, 2004.
[2] A. Alahi, R. Ortiz, and P. Vandergheynst, “FREAK: Fast Retina Keypoint,” in CVPR, 2012.
[3] A. Andreopoulos and J. K. Tsotsos, “50 years of object recognition: Directions forward,” Computer Vision and Image Understanding, vol. 117, no. 8, pp. 827–891, Aug. 2013.
[4] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007.
[5] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, “An optimal algorithm for approximate nearest neighbor searching in fixed dimensions,” Journal of the ACM, vol. 45, no. 6, pp. 891–923, Nov. 1998.
[6] H. Bay, T. Tuytelaars, and L. V. Gool, “SURF: Speeded up robust features,” in ECCV, 2006.
[7] R. Behmo, P. Marcombes, A. Dalalyan, and V. Prinet, “Towards optimal naive bayes nearest neighbor,” in ECCV, 2010.
[8] O. Boiman, “Inference by Composition,” Ph.D. thesis, The Weizmann Institute of Science, 2009.
[9] O. Boiman, E. Shechtman, and M. Irani, “In defense of nearest-neighbor based image classification,” in CVPR, 2008.
[10] A. Bosch, A. Zisserman, and X. Munoz, “Image classification using random forests and ferns,” in ICCV, 2007.
[11] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning mid-level features for recognition,” in CVPR, 2010.
[12] Y.-L. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in visual recognition,” in ICML, 2010.
[13] Y.-L. Boureau, N. Le Roux, F. Bach, J. Ponce, and Y. LeCun, “Ask the locals: multi-way local pooling for image recognition,” in ICCV, 2011.
[14] M. Brown and D. Lowe, “Recognising panoramas,” in ICCV, 2003.
[15] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary Robust Independent Elementary Features,” in ECCV, 2010.
[16] K. Chatfield, V. Lempitsky, and A. Vedaldi, “The devil is in the details: an evaluation of recent feature encoding methods,” in BMVC, 2011.
[17] O. Chum and A. Zisserman, “An exemplar model for learning object classes,” in CVPR, 2007.
[18] D. Comaniciu and P. Meer, “Mean shift: a robust approach toward feature space analysis,” PAMI, vol. 24, no. 5, pp. 603–619, 2002.
[19] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[20] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
[22] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei, “What does classifying more than 10,000 image categories tell us?” in ECCV, 2010.
[23] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei, “The ImageNet Large Scale Visual Recognition Challenge (ILSVRC),” 2012. [Online].
[24] T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning algorithms,” Journal of Neural Computation, vol. 10, no. 7, pp. 1895–1923, Sep. 1998.
[25] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: an evaluation of the state of the art,” PAMI, vol. 34, no. 4, pp. 743–61, Apr. 2012.
[26] P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Journal of Machine Learning, vol. 29, no. 2, pp. 103–130, 1997.
[27] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,” University of California, Berkeley, Tech. Rep., 2013.
[28] M. Enzweiler and D. M. Gavrila, “Monocular pedestrian detection: survey and experiments,” PAMI, vol. 31, no. 12, pp. 2179–95, Dec. 2009.
[29] I. Ernesto and S. M. William, “Procedure for sorting a granular material and a machine for executing the procedure,” 1989. [Online].
[30] M. Everingham, L. Van-Gool, C. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” 2012. [Online].
[31] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in CVPR, 2005.
[32] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” in Workshop on Generative-Model Based Vision, CVPR, 2004.
[33] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in CVPR, 2008.
[34] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” PAMI, vol. 32, no. 9, pp. 1627–1645, 2009.
[35] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, “Learning object categories from Google's image search,” in ICCV, 2005.
[36] A. C. Gallagher, “Identifying unique objects in multiple image collections,” 2006. [Online].
[37] T. Gao and D. Koller, “Discriminative learning of relaxed hierarchy for large-scale visual recognition,” in ICCV, 2011.
[38] Google Inc., “Google Goggles,” 2013. [Online].
[39] G. Griffin, A. Holub, and P. Perona, “Caltech-256 Object Category Dataset,” California Institute of Technology, Tech. Rep., 2007. [Online].
[40] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proceedings of the Alvey Vision Conference, 1988.
[41] W. Hu, T. Tan, L. Wang, and S. Maybank, “A survey on visual surveillance of object motion and behaviors,” IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), vol. 34, no. 3, pp. 334–352, Aug. 2004.
[42] M. Hussein, F. Porikli, and L. Davis, “A comprehensive evaluation framework and a comparative study for human detectors,” IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 3, pp. 1–10, 2009.
[43] F. Hutter, H. Hoos, K. Leyton-Brown, and T. Stuetzle, “ParamILS: an automatic algorithm configuration framework,” Journal of Artificial Intelligence Research, vol. 36, no. 1, pp. 267–306, 2009.
[44] Y. Jia and C. Huang, “Beyond spatial pyramids: Receptive field learning for pooled image features,” in NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[45] F. Jurie and B. Triggs, “Creating efficient codebooks for visual recognition,” in CVPR, 2005.
[46] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” PAMI, vol. 6, no. 1, pp. 1–14, Dec. 2011.
[47] Y. Ke, R. Sukthankar, and L. Huston, “Efficient near-duplicate detection and sub-image retrieval,” ACM Multimedia, vol. 4, no. 1, 2004.
[48] J. Kim and K. Grauman, “Boundary preserving dense local regions,” in CVPR, 2011.
[49] A. Kläser, M. Marszałek, and C. Schmid, “A spatio-temporal descriptor based on 3D-gradients,” in BMVC, 2008.
[50] J. Krapac, J. Verbeek, and F. Jurie, “Modeling spatial layout with Fisher vectors for image categorization,” in ICCV, 2011.
[51] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
[52] Y. H. Kwon and N. d. V. Lobo, “Face detection using templates,” 1994. [Online].
[53] C. H. Lampert, M. B. Blaschko, and T. Hofmann, “Beyond sliding windows: Object localization by efficient subwindow search,” in CVPR, 2008.
[54] ——, “Efficient subwindow search: a branch and bound framework for object localization,” PAMI, vol. 31, no. 12, pp. 2129–42, Dec. 2009.
[55] I. Laptev, M. Marszalek, and C. Schmid, “Learning realistic human actions from movies,” in CVPR, 2008.
[56] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in CVPR, 2006.
[57] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, “Building high-level features using large scale unsupervised learning,” in ICML, 2012.
[58] A. Lehmann, B. Leibe, and L. van Gool, “PRISM: Principled implicit shape model,” in BMVC, 2009.
[59] A. Lehmann, L. Van Gool, and B. Leibe, “Feature-centric efficient subwindow search,” in CVPR, 2009.
[60] B. Leibe, A. Leonardis, and B. Schiele, “Robust object detection with interleaved categorization and segmentation,” IJCV, vol. 77, no. 1, pp. 259–289, 2008.
[61] L.-J. Li and L. Fei-Fei, “What, where and who? Classifying events by scene and object recognition,” in ICCV, 2007.
[62] L. Liu, L. Wang, and X. Liu, “In defense of soft-assignment coding,” in ICCV, 2011.
[63] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, Nov. 2004.
[64] W.-L. Lu, J.-A. Ting, J. J. Little, and K. P. Murphy, “Learning to track and identify players from broadcast sports videos,” PAMI, vol. 37, no. 7, pp. 1704–1716, 2013.
[65] S. Maji, A. Berg, and J. Malik, “Classification using intersection kernel support vector machines is efficient,” in CVPR, 2008.
[66] S. McCann and D. G. Lowe, “Spatially local coding for object recognition,” in ACCV, 2012.
[67] ——, “Local Naive Bayes Nearest Neighbor for image classification,” in CVPR, 2012.
[68] D. Meger, P.-E. Forssén, K. Lai, S. Helmer, S. McCann, T. Southey, M. Baumann, J. J. Little, and D. G. Lowe, “Curious George: An attentive semantic robot,” Robotics and Autonomous Systems, vol. 56, no. 6, pp. 503–511, 2008.
[69] M. Muja and D. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration,” in VISAPP, 2009.
[70] K. P. Murphy, A. Torralba, and W. T. Freeman, “Using the forest to see the trees: A graphical model relating features, objects, and scenes,” in NIPS, 2003.
[71] J. Mutch and D. G. Lowe, “Multiclass object recognition with sparse, localized features,” in CVPR, 2006.
[72] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[73] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in CVPR, 2006.
[74] E. Nowak, F. Jurie, and B. Triggs, “Sampling strategies for bag-of-features image classification,” in ECCV, 2006.
[75] G. L. Oliveira, E. R. Nascimento, A. W. Vieira, and M. F. Campos, “Sparse spatial coding: A novel approach for efficient and accurate object recognition,” in ICRA, 2012.
[76] P. Perona, “Object Categorization: Computer and Human Vision Perspectives,” S. J. Dickinson, A. Leonardis, B. Schiele, and M. J. Tarr, Eds., 2009, vol. 19, ch. Visual Rec, pp. 55–68.
[77] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in CVPR, 2008.
[78] ——, “Object retrieval with large vocabularies and fast spatial matching,” in CVPR, 2007.
[79] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in ICCV, 2011.
[80] C. Schmid and R. Mohr, “Local greyvalue invariants for image retrieval,” PAMI, vol. 5, no. 19, pp. 530–534, 1997.
[81] T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, and T. Poggio, “A theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex,” MIT CSAIL, Tech. Rep., 2005.
[82] C. Silpa-Anan and R. Hartley, “Optimised KD-trees for fast image descriptor matching,” in CVPR, 2008.
[83] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, “Discovering objects and their location in images,” in ICCV, 2005.
[84] T. Tommasi and B. Caputo, “Frustratingly easy NBNN domain adaptation,” in ICCV, 2013.
[85] A. Torralba, K. P. Murphy, and W. T. Freeman, “Sharing visual features for multiclass and multiview object detection,” PAMI, vol. 29, no. 5, pp. 854–69, 2007.
[86] T. Tuytelaars, M. Fritz, K. Saenko, and T. Darrell, “The NBNN kernel,” in ICCV, 2011.
[87] J. C. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek, “Visual word ambiguity,” PAMI, vol. 32, no. 7, pp. 1271–83, Jul. 2010.
[88] A. Vedaldi and B. Fulkerson, “VLFeat: An open and portable library of computer vision algorithms,” 2008. [Online].
[89] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, 2001.
[90] S. Walk, N. Majer, K. Schindler, and B. Schiele, “New features and insights for pedestrian detection,” in CVPR, 2010.
[91] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in CVPR, 2010.
[92] P. Wohlhart, M. Donoser, P. Roth, and H. Bischof, “Detecting partially occluded objects with an implicit shape model random field,” in ACCV, 2013.
[93] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, 2009.
[94] H. Zhang, A. C. Berg, M. Maire, and J. Malik, “SVM-KNN: Discriminative nearest neighbor classification for visual category recognition,” in CVPR, 2006.
[95] X. Zhou, N. Cui, Z. Li, F. Liang, and T. S. Huang, “Hierarchical Gaussianization for image classification,” in ICCV, 2009.

