Algorithms for large-scale multi-codebook quantizationbyJulieta Martinez-CovarrubiasA THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF SCIENCE(Computer Science)The University of British Columbia(Vancouver)December 2018c© Julieta Martinez-Covarrubias, 2018The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:Algorithms for large-scale multi-codebook quantizationsubmitted by Julieta Martinez-Covarrubias in partial fulfillment of the require-ments for the degree of Doctor of Philosophy in Computer Science.Examining Committee:Prof. James J. Little, Computer Science, University of British ColumbiaCo-supervisorProf. Holger H. Hoos, Computer Science, University of British ColumbiaCo-supervisorProf. Mark Schmidt, Computer Science, University of British ColumbiaSupervisory Committee MemberProf. Michiel van de Panne, Computer Science, University of British ColumbiaUniversity ExaminerProf. Vincent Wong, Electrical Engineering, University of British ColumbiaUniversity ExaminerProf. David J. Fleet, Computer Science, University of TorontoExternal ExamineriiAbstractCombinatorial vector compression is the task of expressing a set of vectors asaccurately as possible in terms of discrete entries in multiple bases. The prob-lem is of interest in the context of large-scale similarity search, as it providesa memory-efficient, yet ready-to-use compact representation of high-dimensionaldata on which vector similarities such as Euclidean distances and dot products canbe efficiently approximated.Combinatorial compression poses a series of challenging optimization prob-lems that are often a barrier to its deployment on very large scale systems (e.g.,of over a billion entries). In this thesis we explore algorithms and optimizationtechniques that make combinatorial compression more accurate and efficient inpractice, and thus provide a practical alternative to current methods for large-scalesimilarity search.iiiLay SummaryImagine someone had a long list of every post office in the country and wanted tofigure out which one to use. If they know the location of their house they couldcalculate the distance to each post office in the list, keep track of the minimumdistance, and easily find their nearest post office. This problem is known as “nearestneighbour search”.Many problems in computer science involve solving this problem many timesper second, in high dimensions, and on very large databases. This thesis introducesa series of algorithmic improvements for systems that perform this task approxi-mately on very large datasets, trading off accuracy, memory usage, and speed.ivPrefaceThe main work presented in this thesis is based on three publications.• A version of Chapter 3 has been published in the Proceedings of the Euro-pean Conference on Computer Vision (ECCV) (Martinez et al., 2016a):J. Martinez, J. Clement, H. H. Hoos, and J. J. Little. Revisitingadditive quantization. In ECCV 2016.and in (Martinez et al., 2016b)J. Martinez, H. H. Hoos, and J. J. Little. Solving multi-codebookquantization in the GPU. In 4th Workshop on Web-scale Visionand Social Media (VSM), ECCV workshops, 2016.The chapter and the CUDA code were both written by Martinez. Clementhelped run experiments and wrote Julia code for the software release relatedto the above publications.• A version of Chapter 4 has been published in the Proceedings of the Euro-pean Conference on Computer Vision (ECCV) (Martinez et al., 2018):J. Martinez, S. Zakhmi, H. H. Hoos, and J. J. Little, LSQ++:Lower running time and higher recall in multi-codebook quan-tization. In ECCV 2018.The chapter was entirely written by Martinez. Zakhmi helped run experi-ments and write Julia code for the new algorithms described in Chapter 4.v• Chapter 6 was written by Martinez, and the work presented there was devel-oped by Martinez and Melody Wong during her internship at the UBC visionlab in the summer of 2017.• Chapters 5-7 contain new material. Hoos, Little and Mark Schmidt haveedited and revised all the chapters in this thesis.viTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Multi-codebook quantization (MCQ) applications in computer vision 21.2 Optimization challenges in MCQ . . . . . . . . . . . . . . . . . . 31.3 Contributions to MCQ outlined in this thesis . . . . . . . . . . . . 41.4 The Rayuela.jl library . . . . . . . . . . . . . . . . . . . . . . . . 52 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.1 Nearest neighbour search (NNS) . . . . . . . . . . . . . . . . . . 62.1.1 Exact NNS algorithms . . . . . . . . . . . . . . . . . . . 72.2 Approximate nearest neighbour search (ANNS) . . . . . . . . . . 82.3 Memory-oblivious ANNS techniques . . . . . . . . . . . . . . . . 82.4 Multi-codebook quantization (MCQ) for approximate nearest neigh-bour search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.4.1 Orthogonal methods . . . . . . . . . . . . . . . . . . . . 11vii2.4.2 Semi-orthogonal methods . . . . . . . . . . . . . . . . . 132.4.3 Non-orthogonal methods . . . . . . . . . . . . . . . . . . 142.4.4 Other MCQ improvements . . . . . . . . . . . . . . . . . 192.4.5 Supervised quantization . . . . . . . . . . . . . . . . . . 202.4.6 Software . . . . . . . . . . . . . . . . . . . . . . . . . . 213 Efficient encoding and sparse codebooks . . . . . . . . . . . . . . . . 223.1 Background and related work . . . . . . . . . . . . . . . . . . . . 233.1.1 The encoding problem in MCQ . . . . . . . . . . . . . . . 233.1.2 Reducing query time in Additive quantization (AQ) . . . . 253.1.3 MAP estimation of MRFs using GPUs . . . . . . . . . . . . 263.1.4 Stochastic local search (SLS) algorithms . . . . . . . . . . 273.2 Iterated local search for MCQ encoding . . . . . . . . . . . . . . . 283.2.1 Local search procedure . . . . . . . . . . . . . . . . . . . 283.2.2 ILS Perturbation . . . . . . . . . . . . . . . . . . . . . . 303.2.3 ILS Initialization . . . . . . . . . . . . . . . . . . . . . . 303.2.4 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . 313.3 Graphics processor unit (GPU) implementation . . . . . . . . . . . 313.4 The advantages of a simple formulation:easy sparse codebooks . . . . . . . . . . . . . . . . . . . . . . . 353.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.5.1 Evaluation protocol . . . . . . . . . . . . . . . . . . . . . 373.5.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . 383.5.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.5.4 Parameter settings . . . . . . . . . . . . . . . . . . . . . 403.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.6.1 Small training/query/base datasets . . . . . . . . . . . . . 423.6.2 Query/base datasets . . . . . . . . . . . . . . . . . . . . . 433.6.3 Very large-scale training/query/base datasets . . . . . . . 443.6.4 Sparse extension . . . . . . . . . . . . . . . . . . . . . . 453.6.5 Encoding speed . . . . . . . . . . . . . . . . . . . . . . . 463.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47viii4 Stochastic relaxations and fast codebook updates . . . . . . . . . . . 484.1 Background and related work . . . . . . . . . . . . . . . . . . . . 494.1.1 Codebook update in MCQ . . . . . . . . . . . . . . . . . 494.1.2 Improving k-means . . . . . . . . . . . . . . . . . . . . . 514.2 Solution methodology . . . . . . . . . . . . . . . . . . . . . . . . 534.2.1 Direct codebook update . . . . . . . . . . . . . . . . . . 534.2.2 Stochastic relaxations (SR) . . . . . . . . . . . . . . . . . 554.2.3 Recap: SR-C and SR-D . . . . . . . . . . . . . . . . . . . 584.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614.4.1 Fast codebook update . . . . . . . . . . . . . . . . . . . . 614.4.2 SR-C and SR-D . . . . . . . . . . . . . . . . . . . . . . . 624.4.3 Comparison against Competitive quantization (COMPQ) . 664.5 Conclusions and future work . . . . . . . . . . . . . . . . . . . . 685 Automatic hyperparameter optimization of MCQ algorithms . . . . 705.1 Background and related work . . . . . . . . . . . . . . . . . . . . 725.1.1 Choosing a hyperparameter optimizer . . . . . . . . . . . 745.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . 745.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755.4 Conclusions and future work . . . . . . . . . . . . . . . . . . . . 786 Deep stochastic local search: learning SLS perturbations with deepreinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . 796.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806.1.1 Stochastic local search (SLS) . . . . . . . . . . . . . . . . 806.1.2 Deep reinforcement learning . . . . . . . . . . . . . . . . 816.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . 836.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . 836.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 866.5 Conclusions and future work . . . . . . . . . . . . . . . . . . . . 86ix7 Rayuela.jl: A package for large-scale similarity search . . . . . . . . 887.1 Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 907.2 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 907.3 Challenges and limitations . . . . . . . . . . . . . . . . . . . . . 917.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 937.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 948 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97xGlossaryANN Approximate nearest neighbourANNS Approximate nearest neighbour searchAPI Application programming interfaceAQ Additive quantizationCKM Cartesian k-meansCG Conjugate gradient, a family of methods for solving large-scalesystems of linear equations.CNN Convolutional neural networkCQ Composite quantizationCOMPQ Competitive quantizationERVQ Enhanced residual vector quantizationEM Expectation-maximization, an optimization approach for problemswith multiple latent variablesFAISS Facebook AI Similarity Search, a library for large-scale approximatenearest neighbour searchFLANN Fast Library for Approximate Nearest Neighbours, a popular libraryfor fast approximate nearest neighbour searchxiFALCONN Fast Lookups of Cosine and Other Nearest Neighbors, a library forapproximate nearest neighbour searchGPU Graphics processor unitHDD Hard-drive diskICM Iterated conditional modes, an approximate method formaximum-a-posteriori (MAP) estimation in Markov random fields(MRF)ILS Iterated local searchILSRVC ImageNet Large-Scale Visual Recognition Challenge, a benchmark toclassify images into 1 000 categories.JKM Joint k-means quantizationLBP Loopy belief propagationL-BFGS Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm, amethod for constrained optimizationLSH Locallity-sentitive hashingLSQ Local search quantization, a multi-codebook quantization methodintroduced in this thesisMAP Maximum-a-posteriori, the most likely state in a Markov randomfield (MRF)MRF Markov random fieldMCQ Multi-codebook quantizationMIPS Maximum inner product searchNNS Nearest neighbour searchOCKM Optimized Cartesian k-meansxiiOPQ Optimized product quantizationOTQ Optimized tree quantizationPCA Principal component analysisPQ Product quantizationRAM Random access memoryRL Reinforcement learningRVQ Residual vector quantizationSAT Boolean satisfiability problemSCQ Sparse composite quantizationSA Simulated annealingSGD Stochatic gradient descentSH Spectral hashingSIMD Single instruction multiple dataSLS Stochastic local searchSOBE Self-organizing binary encodingSMAC Sequential model-based optimization for algorithm configuration, amethod for automatic hyperparameter optimization.SPGL1 Spectral projected gradient, a method for large-scale, `1-regularized,least-squares optimizationSSD Solid-state driveSQ Stacked quantizersSR Stochastic eelaxationxiiiSR-D An extension of simulated annealing applicable to multi-codebookquantization, introduced in this workSR-C An extension of simulated annealing applicable to multi-codebookquantization, introduced in this workSIFT Scale invariant feature transform, a popular local image descriptorTC Transform codingTSP Travelling salesperson problemVQ Vector quantizationxivAcknowledgementsI would like to thank my supervisors, Prof. James J. Little and Prof. Holger H.Hoos, for their advice and support during the past five years. The ideas presentedin this thesis sparked from conversations with them, and the development workwould not have been possible without their support. I also want to thank Prof.Mark Schmidt for his input on this work as a member of my thesis committee, andfor constantly pushing me to learn new topics and improve my presentation skillsin the machine learning reading group at UBC. I am also thankful to Prof. DavidFleet, Prof. Michiel van de Panne, and Prof. Vincent Wong for their thoughtfulcomments on this work.I am also grateful to Javier Romero, Prof. Michael J. Black, and the rest of theteam at the Max Planck Institute for Intelligent Systems, whom I was fortunate towork with during the fall of 2016.I am also indebted to my interns, Joris Clement, Shobhit Zakhmi, and MelodyWong, for trusting me as a mentor and allowing me to learn so much from them.Thanks also to Anahi Molar Cruz, Bita Nejat, Erica Peredo, Jaimie Veale, Jen-nifer Gordon, Sarina McFarland, Sampoorna Biswas, Shay Loo, Shoshana Izsak,and Silvia Picazo Barraga´n. Munich, San Diego, Aotearoa, Mexico City, Eind-hoven, Vancouver and Toronto would not have been the same without you. I loveall of you, and I cannot believe how fortunate I am to have you in my life.Finally, I acknowledge that this thesis was completed while I lived and workedat UBC, which is located in the traditional, ancestral, and unceded territory of theMusqueam people.xvChapter 1IntroductionA notable characteristic of the computational landscape at the turn of the 21st cen-tury is an unprecedented growth in the use and collection of data. Scientific fieldssuch as genetics, bioinformatics, chemistry, astronomy, particle physics and cli-mate science have all benefited from large-scale data collection mechanisms. Atthe same time, large-scale data collection has inspired new research directions inthose areas.In everyday life, a data-collection explosion has appeared in a more familiarform. Consumer-grade cameras (which are often attached to smartphones), cou-pled with the growth in popularity of social media platforms and increased inter-connectivity, have made it very easy for the average consumer in developed nationsto record and store visual narratives at an unprecedented rate. From the perspec-tive of a smartphone owner, taking a picture or recording a short video of a person,event or object that they care about, and sharing this content with their social net-work, is a now a common, easy, and low-overhead task.At the same time, deep learning techniques have, over the past decade, dramat-ically improved the state of the art on a variety of computational perceptual taskssuch as speech and object recognition. Such techniques typically consist of a multi-layered function approximator that produces, as an end result, a high-dimensionalreal-valued vector that is amenable to tasks such as multi-label classification. Sys-tems that perform tasks such as object classification can now be used to extract, ifat a somewhat basic level, semantic information from large collections of images1and video.A problem that occurs in practice is creating an effective tool to search forobjects or events of interest in a large collection of visual data. Imagine, for exam-ple, building a search engine that can be queried with other images, and returns anordered list of semantically similar pictures to the user. While images can be pre-processed by deep architectures, one still needs to build an indexing system that isable to compute similarities in a high-dimensional space and scales graciously interms of memory (because data keeps growing), speed (because users expect fastanswers), and accuracy (because users expect correct answers).The goal of this thesis is to explore algorithms and optimization techniquesthat are useful to build systems for vector similarity search and scale well with datagrowth. We focus on applications in computer vision, but the fundamental problemaddressed is large-scale similarity search, of which maximum inner product search(MIPS) and approximate nearest neighbour search (ANNS) are special cases. Thesetwo problems have multiple applications across computer science.Chapter 2 includes more detail on the history and applications of the problemsin general, and here we include a small introduction in the context of computervision.1.1 Multi-codebook quantization (MCQ) applications incomputer visionMany computer vision applications involve computing the similarity of many high-dimensional, real-valued image representations, in a process known as featurematching. For example, in structure from motion (Snavely et al., 2006), it iscommon to estimate the relative viewpoint of each image in large collections ofphotographs by computing the pairwise similarity of several million scale invari-ant feature transform (SIFT) (Lowe, 2004) descriptors. It is also now common forretrieval and classification datasets to comprise millions of images (Deng et al.,2009; Torralba et al., 2008), and for deep learning pipelines to directly learn rep-resentations tailored for retrieval on millions of images (Gordo et al., 2017). Ina commercial application, the popular reverse image search engine TinEye (https://www.TinEye.com/) currently searches over 27.6 billion images indexed from all2over the Internet. When such large databases of images are used, matching posessignificant computational bottlenecks.In practice, the large-scale matching problem often translates into large-scaleANN search, a problem that has traditionally been addressed with hashing (Gongand Lazebnik, 2011; Weiss et al., 2009). However, a series of methods based onvector quantization have recently demonstrated superior performance and scala-bility, sparking interest from the machine learning, computer vision and multime-dia retrieval communities (Babenko and Lempitsky, 2014a; Ge et al., 2014; Je´gouet al., 2011; Norouzi and Fleet, 2013; Zhang et al., 2014). These methods are allbased on multi-codebook quantization (MCQ), a generalization of k-means clus-tering with cluster centres arising from the sum of entries in multiple codebooks.Other applications of MCQ include large-scale MIPS (Guo et al., 2016; Shrivas-tava and Li, 2014), and the compression of deep neural networks for mobile de-vices (Wu et al., 2016).1.2 Optimization challenges in MCQLike k-means clustering, MCQ is posed as the search for a set of codes and code-books that best approximate a given dataset. Importantly, the error of this approxi-mation provides a lower bound for Euclidean distance and dot-product approxima-tions in ANNS and MIPS applications – and, by extension, in convolution approx-imations, as used by Wu et al. (2016) to compress convolutional neural networks(CNNs). Therefore, finding optimization methods that achieve low-error solutionsis crucial for improving the performance of MCQ applications.Early approaches to MCQ were designed enforcing codebook orthogonality (Geet al., 2014; Je´gou et al., 2011; Norouzi and Fleet, 2013), which greatly simpli-fies the problem at the expense of accuracy. In contrast, more recent methodsuse non-orthogonal, often full-dimensional codebooks (Babenko and Lempitsky,2014a, 2015; Ozan et al., 2016a; Zhang et al., 2014), and tend to be more compu-tationally expensive than their classical counterparts, which in practice limits theirapplicability on very large-scale datasets. Thus, it is also important to find efficientoptimization methods that scale well with dataset size.31.3 Contributions to MCQ outlined in this thesisIn this thesis, we explore algorithms for MCQ and their applications in ANNS. Wemake several contributions to this problem:• Fixing the codebooks in an MCQ problem produces a distribution of small,yet hard combinatorial optimization problems to find the optimal correspond-ing codes. We have noticed that there is a large field of literature with cor-responding well-established techniques for designing algorithms that targetdistributions of NP-hard problems, such as the Boolean satisfiability prob-lem (SAT) instances that arise in particular industrial applications. Inspiredby this area of research, in Chapter 3 we describe an algorithm based on it-erated local search (ILS) for finding these codes in MCQ, and detail a graph-ics processor unit (GPU) implementation that greatly accelerates this part ofMCQ optimization.• If one fixes the codes in an MCQ problem, finding the optimal codebookbecomes a large-scale least squares problem. Previous work has proposedusing large-scale conjugate-gradient solvers for this step (Babenko and Lem-pitsky, 2014a), or used decomposition-based direct methods that, nonethe-less, require expensive matrix multiplications (Zhang et al., 2014). In Chap-ter 4 we exploit the structure of the code matrix and propose a direct least-squares method that can be efficiently computed via Cholesky decompo-sition on a small, symmetric, positive definite matrix. Conveniently, thismatrix can be computed using the codes as indices, i.e., without an explicitmatrix multiplication routine. This results in a method for codebook updatethat is orders of magnitude faster than previous work.• We note that MCQ is much like k-means clustering, except that it is combi-natorial. However, if one finds methods that improve k-means, one might beable to adapt them to improve MCQ as well. In Chapter 4, we introducestochastic relaxations (SR), two methods inspired in simulated annealingused in the 1990s to improve k-means clustering, and adapt them to MCQ,which results in better distance approximations at a negligible increase incomputation.4• Our novel encoding algorithm and our SR methods have a number of hyper-parameters that influence the performance of our algorithms in practice. Weautomatically search for good hyperparameters using an automatic algorithmconfiguration procedure in Chapter 5.We also attempted to use deep reinforcement learning to learn the parametersof a (deep) encoding function for MCQ. However, our attempts did not result in analgorithm with better computation-to-recall trade-offs than those we achieved withILS. Therefore, we consider these experiments to be a negative result. We reportour setup and experimental results in Chapter 6.1.4 The Rayuela.jl libraryUnfortunately, most of the work reviewed in this thesis, which includes baselinesfor our work, has been published without giving access to public code that canbe used to reproduce its results. This, in practice, puts the burden of proof onother researchers, because in order to show that they have made progress on aproblem, they also have to reproduce previous baselines. This is a daunting taskthat consumed a large part of the time we spent working on this problem.Together with this thesis we are releasing Rayuela.jl, a software package forlarge-scale similarity search written in Julia (Bezanson et al., 2014). Rayuela.jlis available under an MIT licence, and can be downloaded at https://github.com/una-dinosauria/Rayuela.jl. Rayuela implements multiple baselines and algorithmsfor MCQ, and is crucial in showing that our contributions result in Pareto improve-ments with respect to previous work. We give further details about the developmentof Rayuela.jl and future work in Chapter 7.Rayuela.jl is our attempt to make this research reproducible and accessible, andto make it easier for others, especially newcomers, to contribute to this field. Wealso release it with the hope of alleviating the irreproducibility cycle that currentlyplagues computer vision research.5Chapter 2BackgroundUsed to be so wilfully obtuseOr is the word abstruse?Semantics like a nooseGet out your dictionaries!—Andrew BirdIn this thesis we focus on techniques for approximate nearest neighbour search(ANNS) that are based on vector quantization (VQ). First, we provide some con-text on approaches for (exact) nearest neighbour search (NNS), and then discussmethods for ANNS. Finally, we focus on VQ techniques for ANNS.2.1 Nearest neighbour search (NNS)Given a dataset X comprised of n entries X = {x1,x2, . . . ,xn},xi ∈ Rd , a queryq ∈ Rd , and a distance function dist : Rd ×Rd → R the nearest neighbour searchproblem is the task of finding an element in NN(q,X) ∈ X such thatNN(q,X) ∈ argminx∈Xdist(q,x), (2.1)usually, dist(x, xˆ) = ‖x− xˆ‖2 (i.e., Euclidean distance). For notational conve-nience, we often represent X ∈ Rd×n as a matrix, where we simply stack all thed-dimensional database vectors xi horizontally.6Running time in nearest neighbour search (NNS)There are two main phases in ANN algorithms where running time is important.Offline time, also called training time, includes the time that is spent preprocessingX . Online time, also called query time, includes the time that, given a query, ittakes to obtain k of its neighbours in a dataset.While training time is less critical than query time, everything else being equalit is clearly desirable to have higher approximation accuracy and shorter trainingtimes. Because in practice we often deal with very large datasets, a difference ofminutes to hours in training time on a small dataset can mean a difference fromdays to months on larger datasets.Query time is more critical, because it is often the interface that the user ob-serves. For example, if our ANNS system forms the backend of an image searchengine or a recommendation system, a difference of milliseconds in response timecould result in a vastly different experience to the end user.2.1.1 Exact NNS algorithmsExhaustive searchA trivial solution consists of exhaustively computing dist(q,xi) for all vectors xiin X , and keeping track of the entry that produces the minimum distance. Thisalgorithm, although exact, is also potentially very inefficient for large n or large d.The research challenge is thus to find ways to preprocess X , such that we can findneighbours efficiently.Very low dimensions (d = {1,2})For the special case when d = 1, sorting X and performing binary search guaranteesfinding the nearest neighbour in O(logn) query time. This technique is typicallyused as a subroutine of sorting algorithms and data structures such as heaps andordered queues. For the special case when d = 2, it is possible to perform exactsearch in O(d · logn) time using Voronoi diagrams (Aurenhammer, 1991).7Higher dimensions (d ≥ 3)Unfortunately, no exact NNS algorithms with O(d · logn) query time complexityare known for dimensionality d ≥ 3.Kd-trees (Bentley, 1975) recursively partition the space across dimensions andreturn a correct solution every time, but only have an O(d · logn) expected, notguaranteed, query time. Similarly, Ball trees (Omohundro, 1989) partition thespace in a series of potentially overlapping hyperspheres, and provide an expectedO(d · logn) query time. Finally, Norouzi et al. (2014) introduce a method for ex-act search with expected sub-linear search time, but their results is only applicableto Hamming space. These methods, however, do not work well for high dimen-sions (Kibriya and Frank, 2007; Muja and Lowe, 2009). In this case, it is commonto resort to approximate algorithms.2.2 Approximate nearest neighbour search (ANNS)We now review ANNS algorithms, dividing them into memory-oblivious (i.e., onesthat do not reduce storage), and memory-conscious. The latter scale better withdata growth, and are the main focus of this thesis.2.3 Memory-oblivious ANNS techniquesA modern commodity computer handles memory in a hierarchy, trading off capac-ity for speed of access. In a desktop server, a hard-drive disk (HDD) or a solid-statedrive (SSD) is often the largest and slowest unit of memory, currently in the orderof a Terabyte. Next is the random access memory (RAM), several orders of magni-tude faster than main storage, but also more scarce, currently in the order of 32-64gigabytes per machine. Closer to the processor are several levels of cache memo-ries, typically in the order of a few megabytes to hundreds of kilobytes, followedby the fastest – and smallest – registers, with a few bytes of capacity each.Typically, operating systems allow user programs to store their state and vari-ables in RAM, as disk accesses are considered too slow for user interaction. Simi-larly, most research in methods for ANNS consider datasets small enough in magni-tude to be fully stored in RAM, and often build data structures that require additional8memory on top of the dataset itself. These memory requirements clearly imposepractical limitations and the size of datasets that a user can handle. For this reason,we refer to these methods as “memory-oblivious”.Most research in ANNS falls under this category. In the computer vision com-munity, the most popular methods are perhaps randomized kd-trees (Silpa-Ananand Hartley, 2008), and hierarchical k-means clustering (Nister and Stewenius,2006), both implemented in the very successful Fast Library for ApproximateNearest Neighbours (FLANN) (Muja and Lowe, 2009). More recent data structuresand efficient implementations include the Hierarchical Navigable Small WorldGraphs (Malkov and Yashunin, 2016), or the Fast Lookups of Cosine and OtherNearest Neighbors (FALCONN) library based on an improved version of E2LSH (An-doni et al., 2015). FALCONN, for example, provides theoretical guarantees as well,but as Matthijs Douze points out1:[Locality sensitive hashing] and its derivatives suffer from two draw-backs: [(1)] They require a lot of hash functions ([number of] parti-tions) to achieve acceptable results, leading to a lot of extra memory.Memory is not cheap[, and (2) t]he hash function are [sic] not adaptedto the input data. This is good for proofs but leads to suboptimal choiceresults [sic] in practice.Indeed, for very large datasets, memory is usually the most constrained re-source. Conversely, however, theory-driven methods such as Locallity-sentitivehashing (LSH) often provide theoretical bounds on the worst performance of anyquery.Quantization-based methods, in general, do not provide such guarantees andin practice assume that the queries come from the same distribution as the trainingset. This reflects the workflow of applications such as image search or 3d recon-struction, where this assumption is true. Moreover, vector quantization methodsare memory-conscious (compressing the data is at their core) and have great per-formance in practice. We now review work in this area.1https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#relationship-with-lsh92.4 Multi-codebook quantization (MCQ) for approximatenearest neighbour searchHistorically, the first methods for ANNS using MCQ considered only orthogonalcodebooks (i.e., methods that encode different subspaces of the data indepen-dently). These methods were introduced mainly before 2012, when Scale invariantfeature transform (SIFT) descriptors (Lowe, 2004) were the main features used incomputer vision tasks. Given a database of images, the object matching pipelineproposed by Lowe consists of (i) offline computing interest points and SIFT de-scriptors of each image, and (ii) matching features from an unseen image to thosein the database – hence the name of the task: feature matching.To match SIFT features, Lowe suggests for each SIFT descriptor in the image,finding the two nearest neighbours in the database, and keeping only the matcheswhose ratio of the distances to the first and the second neighbours is below 0.8, toensure discriminability. This came to be known as Lowe’s distance criterion.Given the success of SIFT, and since a high-resolution image can easily pro-duce tens of thousands of descriptors, the computer vision community becameinterested in finding fast and scalable methods for nearest neighbour search in highdimensions. In this context, orthogonal VQ methods are particularly attractive, notonly because they are relatively easy to optimize, but also because they have theadvantage that one can compute approximate Euclidean distances using a numberof table lookups linear in the number of codebooks.Orthogonal methods, however, are bound to have fewer parameters than meth-ods with full-dimensional (i.e., non-orthogonal) codebooks. Non-orthogonal meth-ods are conversely harder to optimize, have quadratic query times (a detail that canand must be included for a fair comparison), and have overall better recall resultswhen properly optimized. With non-orthogonal methods one can also compute ap-proximate dot products with a number of table lookups linear on the number ofcodebooks, so they are generally better suited for ANNS under dot product similar-ity.102.4.1 Orthogonal methodsProduct quantization (PQ)The first uses of vector compression for ANNS are due to both Je´gou et al. (2009)and Sandhawalia and Je´gou (2010). These two papers noted that the approximationquality (i.e., the quantization error) of a quantizer bounds the distance approxima-tion of a query, and suggested using lookup tables to speed-up query response time.Je´gou et al. (2009) called their method product quantization (PQ). The empiricalevaluation compared PQ against spectral hashing (SH) (Weiss et al., 2009), and theHamming embedding method of Je´gou et al. (2008), obtaining considerably higherrecall given the same memory usage.Formally, the goal in PQ is to determineminC,B‖X−CB‖2F , (2.2)where the set to quantize is X ∈Rd×n, having n data points with d dimensions each.The latent variables are both the codebooks C = [C1,C2, . . . ,Cm], Ci ∈Rd×n, whichare constrained to be mutually orthogonal (i.e., C is block-diagonal):C = [C1,C2, . . . ,Cm] =Cˆ1 0 . . . 00 Cˆ2 0.... . ....0 0 . . . Cˆm , (2.3)where, without loss of generality, Cˆ ∈Rbd/mc×h (if d is not divisible by m, the con-vention is to assign the extra dimensions to the first few codebooks), and the codesB = [b1,b2, . . . ,bn] ∈ {0,1}m·h×n, where bi = [b>i1,b>i2, . . . ,b>im]> ∈ {0,1}m·h, andeach subcode bi j ∈ {0,1}h is limited to having only one non-zero entry: ‖bi j‖0 = 1and ‖bi j‖1 = 1. Figure 2.1 shows these constraints in a more graphical and intuitiveway.Je´gou et al. (2009) also introduced the idea of using an inverted file, similarto that proposed by Sivic and Zisserman (2003), to shortlist candidate vectors andfurther reduce query time, which allowed the system to scale up to one billion11minC,BkX CBk22 minC,BkX CBk22C 2 Rd⇥mhX Rd⇥nB 2 {0, 1}mh⇥nk2FFigure 2.1: Illustration of the PQ optimization problem.vectors.On the other hand, Sandhawalia and Je´gou did not use product codes, but quan-tized each dimension independently, and compared only against Spectral hash-ing (SH). In parallel work, Brandt (2010) proposed a similar method – that quan-tizes dimensions independently – but also has data-driven bit allocation and non-uniform minimum distortion quantization. Brandt simply calls the method trans-form coding (TC).A followup version of the tech report by Je´gou et al. (2009) was later publishedin a computer vision journal (Je´gou et al., 2011), and further included a compari-son against FLANN (Muja and Lowe, 2009), a library for ANNS that is extremelypopular in the computer vision community, obtaining slightly better recall at com-parable query times. Since ANNS requires storing the entire database in RAM atruntime, these results demonstrated that PQ with an inverted file would a betteroption for large-scale ANNS.Optimized product quantization (OPQ)Je´gou et al. noticed that using either a random permutation of the dimensions, orusing a random rotation (i.e., equalizing variance among subspaces) to preprocessthe data resulted in better recall rates. Followup work (Je´gou et al., 2012) thensuggested PCA-rotating the data for dimensionality reduction, and then applyinga random orthogonal transformation to equalize variance among subspaces andfurther improve accuracy. With these observations in mind, Ge et al. (2014) intro-duced optimized product quantization (OPQ), a method that learns a rotation thatboth maximizes independence and distributes the variance among the subspaces12of PQ. Norouzi and Fleet (2013) independently discovered the same method, andpublished it at the same conference, under the name of Cartesian k-means (CKM).2.4.2 Semi-orthogonal methodsSome vector quantization methods in the ANNS literature use neither fully-independent,nor fully-dimensional codebooks, and thus exist as a middle ground between bothends of the spectrum.Composite quantization (CQ)Zhang et al. (2014) propose composite quantization (CQ), a method that opti-mizes the codebooks such that they are fully-dimensional, but the dot productof any vectors in two different codebooks is close to a constant – the so-called“near-orthogonality” constraint. This results in a non-linear constraint optimiza-tion problem that in CQ is solved with the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Nocedal, 1980). Followup work has ex-tended CQ to use sparse codebooks, in a method coined sparse composite quan-tization (SCQ) (Zhang et al., 2015), and mixed CQ with collaborative filtering toimprove cross-modal search (Zhang and Wang, 2016).Notably, CQ manages to keep query time linear in the number of codebooks,without incurring additional memory cost. This is an impressive feat that no othersemi or non-orthogonal method can achieve without using a codebook from thememory budget to store norm approximation. This detail, however, is often omittedin the literature.Zhang et al. released code to reproduce their results more than 3 years af-ter publication (https://github.com/hellozting/CompositeQuantization/), at the sametime that an extended version with journal formatting appeared on arXiv (Wang andZhang, 2017). Inspecting this code, we have learned that the paper reported resultsusing a non-standard protocol, which follow-up work on non-orthogonal quanti-zation is unaware of. We give more details about this finding in the experimentalevaluation of Chapters 3 and 4.13Optimized tree quantization (OTQ)Babenko and Lempitsky (2015) introduce optimized tree quantization (OTQ), amethod that limits the codebook non-orthogonality to either one or two interactionsper codebook, and constrains the topology of these interactions to form a tree.While in fully non-orthogonal methods encoding is usually NP-hard, using a treestructure means that one can find optimal encodings in polynomial time using themin-sum (resp. max-product) algorithm. Updating the codebooks, however, resultsin a binary integer linear program, which in OTQ is solved using the commercialGurobi solver. A major downside of OTQ is that it increases query time by a factorof two, compared to PQ and OPQ; thus, the results reported in that work are notstrictly comparable to these baselines.Optimized Cartesian k-means (OCKM)Wang et al. (2014) proposed optimized Cartesian k-means (OCKM) a semi-orthogonalmethod whose codebooks are non-orthgonal with only one other codebook. Topo-logically, this results in a forest, rather than a chain or a tree. Like OTQ, this resultsin increased query time, which was not accounted for during the experimental eval-uation comparing against PQ and OPQ.2.4.3 Non-orthogonal methodsNon-orthogonal MCQ methods use full-dimensional codebooks, thus have morefree parameters, and pose more challenging optimization problems. They also tendto deliver lower quantization error, and thus better distance approximations andhigher recall. Formally, the goal of non-orthogonal MCQ is to determineminC,B‖X−CB‖2F , (2.4)and the variables are defined just as in PQ, except that the codebooks C= [C1,C2, . . . ,Cm],Ci ∈ Rd×h, are not constrained to be mutually orthogonal. Figure 2.2 shows thisformulation in a more graphical and intuitive way.14minC,BkX CBk22 minC,BkX CBk22C 2 Rd⇥mhX Rd⇥nB 2 {0, 1}mh⇥n!1k2F<latexit sha1_base64="O41x/oYxdx0Jj1FZvAPCIR1n8zc=">AAAB8HicbVBNS8N AEJ3Ur1q/qh69BIvgqSRFUG9FQTxWMG2xjWWznbRLN5uwuxFK6L/w4kHFqz/Hm//GbZuDtj4YeLw3w8y8IOFMacf5tgorq2vrG8XN0tb2zu5eef+gqeJUUvRozGPZDohCzgR6m mmO7UQiiQKOrWB0PfVbTygVi8W9HifoR2QgWMgo0UZ66MomSv1Y6930yhWn6sxgLxM3JxXI0eiVv7r9mKYRCk05UarjOon2MyI1oxwnpW6qMCF0RAbYMVSQCJWfzS6e2CdG6d thLE0Jbc/U3xMZiZQaR4HpjIgeqkVvKv7ndVIdXvgZE0mqUdD5ojDlto7t6ft2n0mkmo8NIVQyc6tNh0QSqk1IJROCu/jyMvFq1cuqe3dWqV/laRThCI7hFFw4hzrcQgM8oCDg GV7hzVLWi/VufcxbC1Y+cwh/YH3+AH4rkFY=</latexit><latexit sha1_base64="O41x/oYxdx0Jj1FZvAPCIR1n8zc=">AAAB8HicbVBNS8N AEJ3Ur1q/qh69BIvgqSRFUG9FQTxWMG2xjWWznbRLN5uwuxFK6L/w4kHFqz/Hm//GbZuDtj4YeLw3w8y8IOFMacf5tgorq2vrG8XN0tb2zu5eef+gqeJUUvRozGPZDohCzgR6m mmO7UQiiQKOrWB0PfVbTygVi8W9HifoR2QgWMgo0UZ66MomSv1Y6930yhWn6sxgLxM3JxXI0eiVv7r9mKYRCk05UarjOon2MyI1oxwnpW6qMCF0RAbYMVSQCJWfzS6e2CdG6d thLE0Jbc/U3xMZiZQaR4HpjIgeqkVvKv7ndVIdXvgZE0mqUdD5ojDlto7t6ft2n0mkmo8NIVQyc6tNh0QSqk1IJROCu/jyMvFq1cuqe3dWqV/laRThCI7hFFw4hzrcQgM8oCDg GV7hzVLWi/VufcxbC1Y+cwh/YH3+AH4rkFY=</latexit><latexit sha1_base64="O41x/oYxdx0Jj1FZvAPCIR1n8zc=">AAAB8HicbVBNS8N AEJ3Ur1q/qh69BIvgqSRFUG9FQTxWMG2xjWWznbRLN5uwuxFK6L/w4kHFqz/Hm//GbZuDtj4YeLw3w8y8IOFMacf5tgorq2vrG8XN0tb2zu5eef+gqeJUUvRozGPZDohCzgR6m mmO7UQiiQKOrWB0PfVbTygVi8W9HifoR2QgWMgo0UZ66MomSv1Y6930yhWn6sxgLxM3JxXI0eiVv7r9mKYRCk05UarjOon2MyI1oxwnpW6qMCF0RAbYMVSQCJWfzS6e2CdG6d thLE0Jbc/U3xMZiZQaR4HpjIgeqkVvKv7ndVIdXvgZE0mqUdD5ojDlto7t6ft2n0mkmo8NIVQyc6tNh0QSqk1IJROCu/jyMvFq1cuqe3dWqV/laRThCI7hFFw4hzrcQgM8oCDg GV7hzVLWi/VufcxbC1Y+cwh/YH3+AH4rkFY=</latexit><latexit sha1_base64="O41x/oYxdx0Jj1FZvAPCIR1n8zc=">AAAB8HicbVBNS8N AEJ3Ur1q/qh69BIvgqSRFUG9FQTxWMG2xjWWznbRLN5uwuxFK6L/w4kHFqz/Hm//GbZuDtj4YeLw3w8y8IOFMacf5tgorq2vrG8XN0tb2zu5eef+gqeJUUvRozGPZDohCzgR6m mmO7UQiiQKOrWB0PfVbTygVi8W9HifoR2QgWMgo0UZ66MomSv1Y6930yhWn6sxgLxM3JxXI0eiVv7r9mKYRCk05UarjOon2MyI1oxwnpW6qMCF0RAbYMVSQCJWfzS6e2CdG6d thLE0Jbc/U3xMZiZQaR4HpjIgeqkVvKv7ndVIdXvgZE0mqUdD5ojDlto7t6ft2n0mkmo8NIVQyc6tNh0QSqk1IJROCu/jyMvFq1cuqe3dWqV/laRThCI7hFFw4hzrcQgM8oCDg GV7hzVLWi/VufcxbC1Y+cwh/YH3+AH4rkFY=</latexit>Figure 2.2: Illustration of the non-orthogonal MCQ optimization problem.Residual vector quantization (RVQ)The first use of non-orthogonal codebooks for ANNS is due to Chen et al. (2010).The proposed optimization approach simply optimizes the codebook sequentiallyusing k-means. Their study, however, does not address the issue of the extra mem-ory incurred by the need to store the norms of the approximations, and simply sug-gests storing them during the offline training stage, which makes the comparisonwith PQ somewhat problematic.Enhanced residual vector quantization (ERVQ), Stacked quantizers (SQ)Ai et al. (2014, 2017) proposed an enhanced version of RVQ, borrowing ideas fromthe vector quantization (VQ) literature of the 1990s (Chan et al., 1992). ERVQuses RVQ as initialization, and further iterates to refine the learned codebooks byfixing the parameters of all but the ith codebook, updating its parameters by co-ordinate descent until convergence. The authors further propose using the short-listing approach of Hwang et al. (2012b) to speed up encoding. The same ideawas independently applied to ANNS by Martinez et al. (2014), who also evaluatedtheir approach on features obtained from deep neural networks, showing for thefirst time that non-orthogonal approaches produce particularly better results thanorthogonal ones on such datasets.Additive quantization (AQ)Babenko and Lempitsky (2014a) introduced a method that has the same formu-lation as RVQ and ERVQ, but were apparently not aware of such previous work.15Babenko and Lempitsky put an emphasis on the combinatorial optimization prob-lems that arise when one tries to optimize multiple codebooks jointly, and proposedan efficient least-squares formulation and method for codebook update. They alsonoted that finding the optimal codes after fixing the codebooks results in multipleNP-hard combinatorial problems, and proposed to use beam search for this stage.Moreover, they suggest simply quantizing the norm of the approximation and look-ing it up during search, which results in a query time comparable to PQ and OPQ,but achieves marginal improvements over those baselines.Although Babenko and Lempitsky (2014a) appear to be unaware of previouswork on the subject, and did not report particularly good results over PQ and OPQ,their emphasis on the computational challenges of the problem and its practicalsolution to the query time overhead of non-orthogonal methods have proven influ-ential within the computer vision community.The authors released a Python implementation of their method 11 months afterpublication at https://github.com/arbabenko/Quantizations.Self-organizing binary encoding (SOBE)Ozan et al. (2016c) proposed a method that encodes each vector greedily, similar toRVQ but using a lookahead over the next two codebooks. Moreover, after each vec-tor is encoded, the codewords that it indexes into are updated with gradient descent.The paper presents a comparison against RVQ (Chen et al., 2010) and OCKM (Wanget al., 2014), achieving slightly better results than these two baselines.Joint k-means quantization (JKM)Followup work by Ozan et al. (2016b) proposed a hierarchical beam search en-coding method that keeps track of a shortlist of H candidates for each codebook,and then explored the best H2 candidates for every subsequent codebook pair.This is much faster than the beam search proposed by Babenko and Lempitsky,and achieves slightly better results than AQ (Babenko and Lempitsky, 2014a) andOTQ (Babenko and Lempitsky, 2015). Liu et al. (2016) have proposed the sameencoding strategy, and called it multi-path encoding.16Competitive quantization (COMPQ)The most recent work by Ozan et al. (2016a) uses the gradient-based codebookupdate of SOBE (Ozan et al., 2016c) and the hierarchical beam search encoding ofJKM (Ozan et al., 2016b), and achieves better results than both methods combined,albeit without mentioning runtime.These three papers (Ozan et al., 2016a,b,c) did not take into consideration theincreased query time that comes with using full-dimensional codebooks, and werebenchmarked only on datasets of SIFT and Gist (Oliva and Torralba, 2006) features.Improving the speed of AQ encodingMuravev et al. (2017) follow the AQ formulation and use a pyramid structure toprune beam search candidates, resulting in faster, but not always more accurate,encoding. They simply call their approach pyramid encoding. Li et al. (2017)propose finding an optimal codebook ordering for sequential encoding based onthe indexing frequency of codebook indexing. They show results similar to AQin accuracy, but with much improved runtimes. They call their approach fast AQencoding.MCQ recap. Table 2.1 shows a summary of the methods reviewed so far.17Reference Name Abbr. CodeOrthogonal methodsBrandt (2010) Transform coding TCSandhawalia and Je´gou (2010) Expectation-based coding –Je´gou et al. (2009, 2011) Product quantization PQ Matlab, C2Ge et al. (2014) Optimized product quantization OPQ MatlabNorouzi and Fleet (2013) Cartesian k-means CKM Matlab, C++Douze et al. (2016) Polysemous codes – C++Semi-orthogonal methodsZhang et al. (2014) Composite quantization CQ MSVC C++3Babenko and Lempitsky (2015) Optimized tree quantization OTQWang et al. (2014) Optimized Cartesian k-means OCKMNon-orthogonal methodsChen et al. (2010) Residual vector quantization RVQAi et al. (2014, 2017) Enhanced residual vector quantization ERVQBabenko and Lempitsky (2014a) Additive quantization AQ Python4Martinez et al. (2014) Stacked quantizers SQ Matlab, C++Ozan et al. (2016c) Self-organizing binary encoding SOBEOzan et al. (2016b) Joint k-means quantization JKMOzan et al. (2016a) Competitive quantization COMPQMuravev et al. (2017) Pyramid additive quantization –Li et al. (2017) Fast additive quantization –Table 2.1: Summary of MCQ methods in the literature2The Matlab version is free, and a “quite optimized” C version is available for purchase at http://people.rennes.inria.fr/Herve.Jegou/projects/ann.html. However, Andre´ et al. (2015) found that thisC version is, in fact, slower than a naı¨ve C implementation on a commodity Haswell processor.3Not mentioned in the original paper. Released 37 months after publication at https://github.com/hellozting/CompositeQuantization4Not mentioned in the original paper. Released 11 months after publication at https://github.com/arbabenko/Quantizations2.4.4 Other MCQ improvementsHere we review other notable improvements to MCQ. These improvements can beapplied independently from the techniques described above, and we only reviewthem for completeness.Quantized sparse representationsJain et al. (2016) adapt sparse coding to the large-scale similarity search problem.In sparse coding, there is a set of (dense) codebooks, and a set of sparse, real-valuedcoefficients. Jain et al. further constrain the coefficients to belong to a small set.These coefficients can be learned jointly with the codebooks, and quantized witha small memory overhead. Search incurs an extra multiplication per lookup table,compared to PQ. The authors demonstrate increased accuracy over PQ and RVQ atthe cost of increased query time.Polysemous codesDouze et al. (2016) propose polysemous codes, binary codes whose Hamming dis-tance can be used to shortlist candidates, but whose indices are still used to indexinto codebooks, and thus can also be used to compute more accurate distances withlookup tables. The authors use simulated annealing to optimize their formulation,and show improved query times over PQ at the cost of a small decrease in recall.Distance lookup caching and vectorizationOther lines of work have focused on low-level software improvements to the lookupprocess of MCQ.Andre´ et al. (2015) quantize floating point lookup tables to 8 bits per value.These smaller tables fit into single instruction multiple data (SIMD) registers andaccelerate the lookups necessary to compute distances in MCQ.In parallel, Blalock and Guttag (2017) use smaller tables, and also quantizethe distance tables using 8 bits. Their experiments focus on demonstrating fastencoding and the application to faster, yet approximate, matrix multiplications.19MCQ for maximum inner product searchGuo et al. (2016) propose a modification of PQ tailored for maximum inner prod-uct search (MIPS). They modify the usual k-means procedure to use the Maha-lanobis distance, and demonstrate that the approximation yields an unbiased es-timator. They also propose a further modification that adapts the codebooks to aparticular query distribution.Finally, Wu et al. (2017) propose multi-scale quantization, a pipeline for non-orthogonal MCQ that quantizes the norms of the database separately (similar togain-shape quantization), and that is trainable end-to-end with gradient descent.The authors also demonstrate various theoretical guarantees on its performance,and show better empirical results compared to PQ and SQ (Martinez et al., 2014).2.4.5 Supervised quantizationIn a supervised setting, one has access to both the data and its corresponding la-bels. In this scenario, it is possible to use the labels to improve the accuracy ofsystems that compute ANNs as part of a semantic retrieval or large-scale classifica-tion pipelines. The main challenge in this area is that current deep learning systemsare usually trained with backpropagation, and are thus more efficiently trained withdifferentiable operations. Since quantization is in general non-differentiable, workmostly focuses on optimization techniques that approximate hard quantization as-signments.The work in this thesis does not make use of labels, and thus falls under theunsupervised category. However, it is interesting to review the recent literature insupervised quantization as it provides interesting avenues for future work.Supervised composite quantizationWang et al. (2016) introduce a method similar to composite quantization (Zhanget al., 2014), but that has the additional constraint of maximizing the separabilityof different classes during encoding.Deep quantization networkCao et al. (2016) jointly train a neural network with a product quantizer. However,20only the network benefits from supervision, and the quantizer is updated once per-epoch without label information. Cao et al. (2017) extend this work to optimize forMaximum inner product search (MIPS).Supervised structured binary codes (SUBIC)More recently, Jain et al. (2017) introduced SUBIC, a neural network that bothlearns feature representations and produces compact codes. SUBIC manages toimplicitly learn a set of codebooks by adding sparsity terms that ensure block-wise one-hot encodings, and balancing the use of all codebooks across the dataset.Klein and Wolf (2017) have further improved upon SUBIC with a network thatexplicitly learns soft- and hard-quantized representations. Followup work by Jainet al. (2018) introduces a differentiable large-scale indexing structure that can belearned end-to-end together with SUBIC, resulting in the first complete image in-dexing pipeline that can be learned end-to-end.2.4.6 SoftwareWhile some of the papers mentioned above have released code that reproducestheir results, no public library for quantization-based ANNS existed until mid-2017,when Johnson et al. (2017) released the Facebook AI Similarity Search (FAISS)library: https://github.com/facebookresearch/faiss.The implementations in FAISS are highly optimized, written in C++ and havePython bindings readily available, which makes it easy to use and an invaluabletool in practice. Furthermore, the authors have ported most of the methods toCUDA, with additional code that automatically takes advantage of multiple GPUsin a node transparently, using a computational model similar to Tensorflow (Abadiet al., 2016). FAISS implements multiple indexing methods, such as an invertedfile (Je´gou et al., 2011) and the inverted multi-index (Babenko and Lempitsky,2012), the latter being a crucial part of state-of-the-art, quantization-based systemsfor large-scale ANNS. However, the package only implements three vector quanti-zation methods: PQ, OPQ and polysemous codes (Douze et al., 2016); all of themorthogonal. We believe that FAISS can benefit from implementing state-of-the-artnon-orthogonal quantization methods as well.21Chapter 3Efficient encoding and sparsecodebooksColour my life with the chaos of trouble—Belle and SebastianIn multi-codebook quantization (MCQ), encoding amounts to finding, for eachvector in the database, the set of assignments into the codebooks that minimizequantization error. For a given dataset and a fixed set of codebooks, the encodingproblem can be expressed as maximum-a-posteriori (MAP) inference in multipleMarkov random fields (MRFs) (Babenko and Lempitsky, 2014a). This process canbe a computational bottleneck when MCQ is applied to large-scale databases.The combinatorial optimization community has been dealing with similar prob-lems for many years, and competitions to solve distribution-specific instances ofNP-hard problems as efficiently as possible (e.g., the SAT competition series (Ja¨rvisaloet al., 2012)) have driven research on fast combinatorial optimization methodssuch as stochastic local search (SLS) (Hoos and Stu¨tzle, 2004) and portfolio-basedsolvers (Xu et al., 2008). Inspired by the combinatorial optimization literature, ourmain contribution in this chapter is the introduction of an SLS-based algorithm thatachieves low quantization error at a high encoding speed in MCQ. We also discuss aseries of implementation details that make our algorithm fast in practice, includinga graphics processor unit (GPU) implementation. Finally, we also demonstrate an22application of our approach that incorporates sparsity constraints in the codebooks.3.1 Background and related workWe use the same notation as introduced in Chapter 2. Formally, we denote the setto quantize as X ∈Rd×n, having n data points with d dimensions each; MCQ is theproblem of finding m codebooks Ci ∈ Rd×h and the corresponding codes Bi thatminimize quantization error, i.e., to determineminCi,Bi‖X− [C1,C2, . . . ,Cm]B1B2...Bm‖2F , (3.1)where Bi = [bi1,bi2, . . . ,bin] ∈ {0,1}h×n, and each subcode bi j is limited to havingonly one non-zero entry: ‖bi j‖0 = 1, ‖bi j‖1 = 1. Letting C = [C1,C2, . . . ,Cm] andB =[B>1 ,B>2 , . . . ,B>m]>, we can rewrite expression 3.1 more succinctly asminC,B‖X−CB‖2F . (3.2)Although Babenko and Lempitsky were not technically the first to use thisformulation (that credit goes to Chen et al. (2010)), their paper was the first topropose a non-greedy optimization approach based on block coordinate descent(“EM-like”), consisting of, first, initializing B randomly, and then• updating C while maintaining B fixed, and• updating B while maintaining C fixed,until convergence. In this approach, updating the codebooks C amounts to solvinga large-scale least-squares problem, and updating B results in n hard combinatorialoptimization problems. In this chapter we focus on the latter problem.3.1.1 The encoding problem in MCQOur work focuses on improving the encoding time and performance of block co-ordinate approach approaches to MCQ – that is, given the data X and codebooks C,23we search for a method to find the codes B that minimize Expression 3.2.Formally, the encoding problem amounts to maximum-a-posteriori (MAP) in-ference1 in n fully-connected MRFs (one for each item in X) with m nodes each(one per subcode). These MRF nodes may take any value between 1 and h, so ingeneral they are not binary. In the general case, this problem is NP-hard.In these MRFs, the minimum energy is achieved by finding the subcodes bi thatminimize the squared distance between the vector to encode x, and its approxima-tion xˆ:‖x− xˆ‖22 = ‖x−m∑iCibi‖22 = ‖x‖22−2 ·m∑i〈x,Cibi〉+‖m∑iCibi‖22 (3.3)where the norm of the approximation ‖xˆ‖22 = ‖∑mi Cibi‖22 can be expanded as‖m∑iCibi‖22 =m∑i‖Cibi‖22+m∑im∑j 6=i〈Cibi,C jb j〉. (3.4)Posed as an MRF with m nodes, the ‖x‖22 term in (3.3) can be discarded becauseit is a constant with respect to bi; the −2 ·∑mi 〈x,Cibi〉 and ∑mi ‖Cibi‖22 terms aresummed up and become the unary terms, and ∑mi ∑mj 6=i〈Cibi,C jb j〉 yields the pair-wise terms. Since each code bi may take any value between 1 and h, there are hmpossible configurations for each MRF (typically, m = {8,16}, and h = 256), whichrenders the problem inherently combinatorial and, in general, NP-hard (Cooper,1990).On the empirical challenges of AQ-encodingSolving the MRFs that arise in AQ encoding is known to be challenging; in fact,Babenko and Lempitsky (2014a) noted:The optimization problem can be solved approximately by any of theexisting algorithms like loopy belief propagation (LBP), iterated con-ditional modes (ICM), etc. However, because of full connectivity andthe general form of the pairwise potentials, we observed that LBP and1Unfortunately enough, Maximum-a-posteriori (MAP) inference is often called “decoding”, dueto its historical use in the receiving end of error-correcting codes (Gersho and Gray, 1992).24ICM, and, in fact other algorithms from the MRF optimization library[Kappes et al. (2013)] perform poorly.Babenko and Lempitsky (2014a) proposed resorting to beam search, a compu-tationally expensive construction search method. This method, however, achievedonly marginal improvements over the baselines PQ an OPQ. Moreover, the authorsalso note:Interestingly, we tried to formulate the MRF optimization [...] as aninteger quadratic program and submit it to a general purpose branch-and-cut optimizer [(Gurobi)]. For small codebooks [h] = 64 and [m] =8 and given very large amount of time, the solver could find a globalminimum with much smaller energy (coding error) then [sic] thosefound by [b]eam [s]earch or other non-exhasutive [sic] algorithms.While such “smart bruteforce” approach is infeasible for large datasetsand meaningful codebook sizes, this result suggests the AQ-codingproblems as interesting instances for the [sic] research into new MRF-optimization algorithms.We build on this observation, and look for an encoding algorithm with im-proved computation vs accuracy trade-offs for this problem.3.1.2 Reducing query time in AQAQ (as well as most other non-orthogonal MCQ techniques) involves major compu-tational overhead during query time. The overhead occurs when a new query q isreceived, and we estimate the distance to the encoded vectors xˆi. This amounts toevaluating‖q− xˆ‖22 = ‖q−m∑iCibi‖22 = ‖q‖22−2 ·m∑i〈q,Cibi〉+‖m∑iCibi‖22 (3.5)which requires O(m2) table lookups for evaluating the norm of the encoded vector‖xˆ‖22.Babenko and Lempitsky (2014a) proposed a simple solution to this problem:use m− 1 codebooks to quantize x, and use an extra byte to quantize the norm of25the approximation ‖xˆ‖22. This approximation is likely to be very good, as we arethen using h= 256 centroids to quantize a scalar. Using this extra table is crucial tomake sure that the query time of non-orthogonal MCQ methods is comparable withPQ and OPQ. Unfortunately, with this constraint in mind, the improvement in recallof AQ over PQ and OPQ became rather marginal (see AQ-7 in Figure 5 of Babenkoand Lempitsky (2014a)).We use m− 1 codebooks in all our experiments, which makes sure that ourquery time is comparable with previous work.3.1.3 MAP estimation of MRFs using GPUsOur broader goal is to come up with algorithms that scale well on large datasets, sowe want our algorithm to be fast in practice. Since the encoding problem decom-poses into n independent MRFs, we also investigate the use of parallel architectures– in particular GPUs – for accelerating MAP estimation in MRFs.MRFs are a widely-used tool in computer vision, frequently leveraged to modelvisual smoothness in problems such as stereo, semantic segmentation, optical flowand visual localization. The first connection between these statistical models andimages is due to Geman and Geman (1984). In this context, Boykov et al. (2001)propose graph cuts, an iterative algorithm that makes large moves in labelling overmultiple nodes at once, achieving fast convergence. Kra¨henbu¨hl and Koltun (2011)use filtering to perform MAP inference on MRFs with binary terms coming fromGaussian kernels, and apply their optimization method to semantically segmentnatural images.Given the widespread use of MRFs as a modelling tool in computer vision,we have found previous work that performs MAP estimation in the GPU. Vineetand Narayanan (2008) introduce CUDACuts, a CUDA-based implementation ofthe graph cuts algorithm by Boykov et al., which achieves exact Markov ran-dom field (MRF) optimization on submodular MRFs. The implementation exploitsshared memory in the push-relabel step of graph cuts, and is demonstrated on aNvidia GTX 280 GPU. Zach et al. (2008) introduce a data-parallel approach toMRF optimization based on plane sweeping, which is applicable when the pairwiseterms follow a Potts prior. The work dates back to 2008, when CUDA was a more26difficult language to program in than it is today, so the authors used the OpenGLapplication programming interface (API) to program their solution. The authorsdemonstrate an implementation on an Nvidia Geforce 8800 Ultra GPU.Most previous work focuses on MRFs applied to image-related tasks, such asscene segmentation. In these applications, the MRFs typically have a large num-ber of nodes (1 per pixel), a low number of labels (e.g., 3 (Zach et al., 2008)),are sparsely connected, and have special constraints on their pairwise terms (e.g.Potts in the work of Zach et al. (2008) and submodular in the work of Vineet andNarayanan (2008)). Thus, it is unlikely that those implementations will fit ourproblem, which has a low number of nodes m = {8,16}, a relatively large num-ber of labels (h = 256), is densely connected, and whose pairwise terms are notconstrained to follow any particular formulation.3.1.4 Stochastic local search (SLS) algorithmsTop-performing methods for solving many NP-hard problems have been, at severalpoints in time, variations of stochastic local search (SLS), and continue to definethe state of the art for solving prominent NP-hard problems, such as the travellingsalesperson problem (TSP) (Nagata and Kobayashi, 2013). Given a candidate so-lution to a given problem instance, SLS methods iteratively examine neighbouringcandidates, and replace the candidate solution with one of its neighbours until astop criterion is met. A formal treatment of the subject involves defining neigh-bourhood functions, characterizing problem instances and formally defining localsearch procedures; while this is beyond the scope of this work, we direct interestedreaders to the book by Hoos and Stu¨tzle (2004).Iterated local search (ILS)Iterated local search (ILS) algorithms (Lourenc¸o et al., 2003), which form the basisfor the algorithm we propose in this work, alternate between perturbing the currentsolution s (with the goal of escaping local minima) and performing local searchstarting from the perturbed solution, leading to a new candidate solution, s′. At theend of each local search phase, a so-called acceptance criterion is used to decidewhether to continue the search process from s or s′. In many applications of ILS,27including ours described in the following, the acceptance criterion simply selectsthe better of s and s′.Theoretical bounds of SLS algorithmsA downside of SLS algorithms (and many other heuristic methods that performwell in practice) is that their theoretical performance has historically proven hardto analyze. Similar to prominent deep learning techniques, SLS methods oftenperform far better than theory predicts, and thus, research in the area is heavilybased on empirical analysis. An attempt to achieve a theoretical breakthrough forour method would be out of the scope of this work, so instead, we focus on thethorough empirical evaluation of its performance on a number of benchmarks withvarying sizes, protocols and descriptor types to show the strength of our approach.3.2 Iterated local search for MCQ encodingWe now introduce our ILS algorithm to optimize B in Expression 3.2. Our algo-rithm is defined by1. an initialization procedure (to create the first solution)2. a perturbation procedure (to escape local minima)3. a local search algorithm, and4. an acceptance criterion (stated above).We now explain our design choices for these four components.3.2.1 Local search procedureAs our local search procedure we choose cyclic iterated conditional modes ICM.Iterated conditional modes ICM offers two key advantages over other local searchalgorithms: (i) on the theoretical side, it exhibits very good speed, and, (ii) in prac-tice, it can be implemented in a way that is amenable to caching and vectorization.Although ICM by itself does not tend to perform well on large MRF models ap-plied to full images, our problem is different enough that we consider it a suitablecandidate. We discuss both advantages in more detail below.28Complexity analysis of ICMIterated conditional modes ICM iterates over all the nodes in the given MRF, con-ditioning the current node on the assignments of other nodes, and minimizing(finding the mode of) the resulting univariate distribution. In a fully-connectedMRF, such as the one in our problem, each node represents a subcode bi. Thus,ICM cycles through m subcodes, and conditions its value on the other m− 1 sub-codes, adding h terms for each conditioning. This results in a complexity ofO(m2 ·h). Comparing to the beam search procedure of AQ, which has a complexityof O(m · h2(m+ logm · h)) (Babenko and Lempitsky, 2015), we can see that fortypical values of m ∈ {8,16} and h = 256, a single ICM cycle is much faster thanbeam search. This suggests that we can afford to run several rounds of ICM, whichis necessary for ILS, while keeping the overall encoding time low. In practice, ourimplementation is 30-50 times faster than the beam search implementation due toBabenko: https://github.com/arbabenko/Quantizations.Cache hits and vectorization of ICMWhile this is not true in general, in our special case of interest ICM has a secondcrucial advantage: it can be programmed in a way that is cache-friendly and easyto vectorize.In practice, the computational bottleneck of ICM arises in the conditioning step,when the algorithm looks at all the neighbours of the ith node and adds the assign-ments in those nodes to the current one. A naı¨ve implementation, such as thatavailable from off-the-shelf MRF libraries (Kappes et al., 2013), encodes each datapoint sequentially, looking up the pairwise terms from O(m2) different tables ofsize h× h. This results in a large number of cache-misses, as different tables areloaded into cache for each conditioning.Our observation from Equation 3.3 is that only the unary term −2∑mi 〈x,Cibi〉depends on the vector to encode x. We also note that the pairwise terms∑mi ∑mj 6=i〈Cibi,C jb j〉are the same for all the MRFs that arise in the encoding problem. This means that,during the conditioning step, we can condition all the ith subcodes in the databasew.r.t. the jth subcode, loading only one h×h pairwise lookup table into cache at atime. This results in better caching performance, is easily vectorized, and dramati-29cally speeds up ICM in our case. In practice, this is accomplished by switching theorder of the for loops in ICM over the entire (or a large portion of the) database.We refer to this implementation as “batched Iterated conditional modes (ICM)”.Other variants of ICMOther local search procedures may be used instead of cyclic ICM. For example,one may explore all the combinations of multiple codes at a time and update themjointly (block ICM), or update only the code that offers the best improvement(greedy ICM). These algorithms are all implemented in the UGM toolbox duetoSchmidt (2007). Greedy ICM may be implemented efficiently using a heap (Nu-tini et al., 2015, Appendix. A). Exploring the cost and empirical performance ofthis implementation is an interesting area of future work.3.2.2 ILS PerturbationIn each incumbent solution s, we choose k codes to perturb by sampling withoutreplacement from the discrete uniform distribution from 1 to m: ik ∼ U {1,m}.We then perturb each selected code uniformly at random by setting it to a valuebetween 1 and h: bik ∼ U {1,h}. This perturbed solution s′ is then used as thestarting point for the subsequent local search phase, i.e., invocation of ICM. Whilesimple, this perturbation method is commonly used in the SLS literature (Hoos andStu¨tzle, 2004). We note that our approach generalizes previous work where ICMwas used but no perturbations were applied (Zhang et al., 2014), which correspondsto setting k = 0. This method is both effective and very fast in practice: comparedto ICM, the time spent in this step is negligible.3.2.3 ILS InitializationWe use a simple initialization method, setting all the codes to values between 1 andh uniformly at random: bi ∼ U {1,h} ∀i ∈ {1,2, . . . ,m}. We also experimentedwith other initialization approaches, such as using FLANN (Muja and Lowe, 2009)to copy the codes of the nearest neighbour in the training dataset, or using thecode that minimizes the binary terms (which are expected to dominate the unaryterms for large m). However, we did not observe significant improvements over30our random initialization after a few rounds of ILS optimization.Overall initializationLike AQ and composite quantization (CQ), we use an auxiliary quantization methodto initialize B and C in our optimization procedure. We run OPQ, followed bya method similar to optimized tree quantization (OTQ) (Babenko and Lempitsky,2015), but simplified to assume that the dimension assignments are given by anatural partition of adjacent dimensions.3.2.4 PseudocodeWe now provide the pseudocode for our algorihtm.3.3 Graphics processor unit (GPU) implementationGPUs are computing devices originally conceived for graphics applications, capa-ble of performing the same operation over many data points in a single clock cycle,achieving higher throughput than CPUs. Thus, they are a good fit for problems thatare amenable to parallelization. Next, we describe the details of our GPU imple-mentation.Our implementation consists of three main CUDA kernels: (a) a setup kernel,which initializes the solutions, computes unary lookup tables and creates a seriesof random number generators; (b) a perturbation kernel, which alters random codesin the solution, and (c) a local search kernel, which implements ICM.Initialization kernelOur initialization kernel creates a CUDA random (curand) state for each datapoint in the database and, during the first iteration, uses the samples to create arandom initial solution. The implementation takes less than 1/1000 of the totalrunning time.31Algorithm 1 Our SLS algorithm for encoding in MCQ1: function ENCODE(x ∈ Rd , . Vector to encodeb ∈ {0,1}m×h = [bi,b2, . . . ,bm];bi ∈ {0,1}h, . Initial codesk ∈ {0,1, . . . ,m}, . Number of codes to perturbm ∈ Z, h ∈ Z, . Number of codebooks and entries in each codebookI ∈ Z, J ∈ Z) . Number of ILS and ICM iterations2: for i← 1,2, . . . ,m do3: bi ∼ onehot(U {1,h}) . Initialize each code to a random value.4: end for5: s← [bi,b2, . . . ,bm] . Save the current solution6: for i← 1,2, . . . , I do . Loop over ILS iterations.7: b← PERTURB(b,k,m) . Perturb k codes8: for j← 1,2, . . . ,J do . Loop for the number of ICM iterations9: for k← 1,2, . . . ,m do . Update each code, keeping the rest fixed10: unary←−2〈x,Ck ·bk〉+‖Ck ·bk‖2211: binary← ∑mi6=k〈Ck ·bk,Ci ·bi〉12: bk ∈ argminbk(unary+binary)13: end for14: end for15: s′← [bi,b2, . . . ,bm]16: if cost(s′)< cost(s) then17: s← s′ . If the new solution is better, keep it18: end if19: end for20: return s21: end functionPerturbation kernelIn this kernel, we have one thread per data point, with the goal of maximizing thework/thread ratio. The task of each thread is to choose k entries in the currentsolution, and perturb them to random values between 1 and h.The main challenge in this kernel is that the choice of k entries to perturbamounts to sampling without replacement from the uniform distribution between 1and m. Since there are no off-the-shelf functions for sampling without replacement32in the curand library, we implemented our own version of reservoir sampling2.The pseudocode for this kernel is shown as Algorithm 2. Here we show our per-turbation algorithm in pseudocode.Algorithm 2 Our perturbation method using reservoir sampling1: function PERTURB(b ∈ {0,1}m×h = [bi,b2, . . . ,bm];bi ∈ {0,1}h, . Initial codesk ∈ {0,1, . . . ,m}, . Number of codes to perturbm ∈ Z, h ∈ Z) . Number of codebooks and entries in each codebook2: need← k . The number of codes that we still have to perturb3: le f t← m . The number of codes that we can still visit4: for i← 1,2, . . . ,m do5: r ∼U (0,1) . Sample a value between 0 and 16: if r < need/le f t then . With probability (need/le f t)...7: bi ∼U {1,h} . Alter the ith code8: need← need−1 . Decrease the codes we have left to perturb9: if need ≤ 0 then10: return b . If we have perturbed enough codes, we are done11: end if12: end if13: le f t← le f t−1 . Decrease the codes we can still perturb14: end for15: return b . Return the perturbed code16: end functionIt is easy to show by induction that each code has an equal probability k/m ofbeing perturbed, and the algorithm finishes when k samples have been perturbed –which is most likely achieved before visiting every code. Although elegant, reser-voir sampling leads to large thread divergence, which is likely suboptimal for theGPU architecture. Fortunately, the time spent in this perturbation step is negligiblecompared to the ICM kernel that we discuss next.2https://en.wikipedia.org/wiki/Reservoir sampling33ICM kernelThe task of the ICM kernel is to perform local search after the solution has beenperturbed; this corresponds to lines 8-14 in Algorithm 1. This is, by far, the mostexpensive part of our pipeline, so we discuss the implementation in more detail.In short, we have three main tasks: (a) copying the pre-computed unary termsto shared memory, (b) adding the corresponding pairwise terms to the unary termsfor each possible value of each code, and (c) finding the minimum of all h possi-bilities in each code.Task (a) is only done once and its implementation is straightforward. Task (b)is memory-limited – the pairwise tables are too big to fit in shared memory, so theyhave to be read from global memory. Task (c) amounts to a reduce operation, andis compute-limited.ICM conditioning. The main challenge in ICM conditioning is to access the pair-wise terms in a coalesced manner.We simply store each table of pairwise termsTi, j =−2 ·C>i ·C j, Ti, j ∈ Rh×h. (3.6)Since Ti, j = T>j,i , this incurs extra memory usage, but allows for a simple imple-mentation where the i, j represent the code being minimized over (i), and the codewhose terms are being accumulated ( j). Once the corresponding table is loadedinto cache, we can simply use the current codes to add the corresponding pairwiseterms, thus preserving coalesced memory access.Reduce operation. Once all the pairwise terms have been added, we run task (c)as a reduce findmin operation on the sum of unary and pairwise terms. Thisis a reduction over the h values that we computed in task (b). Note, however,that the output is not the minimum value itself, but its index (a number from 1 toh), so we use an extra array of indices during reduction. We implemented all theoptimizations available in the CUDA reduce guide (https://docs.nvidia.com/cuda/samples/6 Advanced/reduction/doc/reduction.pdf) for this step.343.4 The advantages of a simple formulation:easy sparse codebooksOur approach, building on top of AQ, benefits from using a simple formulationwith no extra constraints (Expression 2.4). The advantages of a simple formulationare not merely aesthetic; in practice, they result in a straightforward optimizationprocedure and less overhead for the programmer. Furthermore, a more standardformulation might render other variants of the problem easier to solve as well. Wenow demonstrate one such use case, by implementing a recent MCQ formulationthat enforces sparsity on the codebooks (Zhang et al., 2015).Motivation for sparse codebooksZhang et al. (2015) motivate the use of sparse codebooks in the context of verylarge-scale applications, which deal with billions of data-points and are bettersuited for search with an inverted-index (Babenko and Lempitsky, 2012, 2014b;Xia et al., 2013). In this case, the time spent computing the lookup tables becomesnon-negligible, which is specially true for methods that use full-dimensional code-books, such as CQ (Zhang et al., 2014), AQ (Babenko and Lempitsky, 2014a) and,by extension, ours. For example, Zhang et al. (2015) have demonstrated that en-forcing sparsity in the codebooks can lead up to 30% speedups with an inverted in-dex (Babenko and Lempitsky, 2014b) on the SIFT1B dataset (see Table 2 in (Zhanget al., 2015)). This method is called sparse composite quantization (SCQ).There is a second use case for sparse codebooks. Recently, Andre´ et al (Andre´et al., 2015) have demonstrated that PQ scan and other lookup-based sums, suchas those in our approach, can take advantage of vectorization. The authors havedemonstrated speedups of up to a factor of 6 on distance computation, thus empha-sizing further the time spent computing distance tables even for datasets with a fewmillion data points, where linear scanning is the preferred search procedure.35The sparsity constraintFormally, the sparsity constraint on the codebooks modifies the typical MCQ for-mulation such that, in this case, we want to determineminC,B‖X−CB‖2F s.t. ‖C‖0 ≤ S. (3.7)Unfortunately, minimizing a quadratic function with an `0 constraint is non-convexand, in general, hard to solve directly. Thus, the problem is often relaxed theminimize the convex `1 norm. Our objective then becomes to determineminC,B‖X−CB‖2F s.t. ‖C‖1,1 ≤ λ . (3.8)In SCQ (Zhang et al., 2015), the problem becomes even harder, because on topof sparsity, the codebook elements are forced to have constant products. Forthis reason, Sparse composite quantization (SCQ) uses an ad-hoc iterative soft-thresholding algorithm to find C. Our problem is simpler, because we do not haveto satisfy the inter-codebook constraint and can be posed as an `1-regularized least-squares problem. To achieve this, we rewrite Expression 3.8 asminC,B‖B>C>−X>‖2F s.t. ‖C‖1,1 ≤ λ , (3.9)and it becomes apparent that the approximation of the ith column of X> dependsonly on product of B> and the ith column of C>. Thus, we can rewrite Expres-sion 3.9 asmincˆ,B‖Bˆcˆ− xˆ‖22 s.t. ‖cˆ‖1 ≤ λ , (3.10)whereBˆ =B>(1) 0 . . . 00 B>(2) . . . 0.... . ....0 0 . . . B>(d) , cˆ =c>1c>2...c>d , and xˆ =x>1x>2...x>d . (3.11)Here, B>(i) is the ith copy of B>, ci is the ith row of C, and xi is the ith row of X .36To update Bˆ we use our previously introduced ILS procedure. Updating cˆ, as-suming Bˆ and xˆ are fixed, corresponds to the well-known lasso (Tibshirani, 1996).Nearly two decades of research on the lasso have produced many robust and scal-able off-the-shelf optimization routines for this problem.Our approach, as opposed to previous work (Zhang et al., 2015), lacks ex-tra constraints and thus can directly take advantage of lasso optimizers with littleoverhead to the programmer. For example, it took us only a few hours to integratethe Spectral projected gradient (SPGL1) solver by van den Berg and Friedlander(2008) into our pipeline. This solver has the additional advantage of not requiringan explicit representation of Bˆ, but can instead be given a function that evaluates toBˆcˆ – this can be implemented as a for loop, so we only need to store one copy ofthe codes B> in memory.SPGL1 thresholdingExpression 3.9 requires setting the value of λ . We follow a block coordinate de-scent procedure, and alternate between optimizing C (using SPGL1) and B (usingILS), and set λ = α · τ , where τ is the `1 norm of the solution given by PQ, whichwe also use for initialization.For a given set of λ = α · τ , SPGL1 may not return a solution with the level ofsparsity desired; thus, at the end of each SPGL1 iteration we keep only the entriesin C with the largest S absolute values, and set everything else in C to zero. Weperform a grid search for the best value of α ∈ {0.1,0.2, . . . ,0.9}, and choose thevalue that gives the lowest quantization error on the training set.3.5 ExperimentsWe now describe the experimental setup that empirically validates the impact ofour contributions.3.5.1 Evaluation protocolWe follow previous work and evaluate the performance of our system with re-call@N (Babenko and Lempitsky, 2014a, 2015; Ge et al., 2014; Je´gou et al., 2011;Zhang et al., 2014). Recall@N is a monotonically increasing curve from 1 to N37representing the empirical probability, computed over the query set, that the N es-timated nearest neighbours include the actual nearest neighbour in the database.The goal is to obtain the highest possible recall for given N. In information re-trieval, recall@1 is considered the most important number on this curve. Also inline with previous work (Babenko and Lempitsky, 2014a, 2015; Ge et al., 2014;Je´gou et al., 2011; Zhang et al., 2014, 2015), in all our experiments, the codebookshave h = 256 elements, so we use 8 bits of memory per code. We show resultsusing 64 and 128 bits for code length.We compute approximate squared Euclidean distances applying the expansionof Equation 3.3, and we use only 7 and 15 codebooks to store codebook indices,while the last 8 bits are dedicated to quantize the (scalar) squared norm of eachdatabase entry. In all our experiments, we run our method for 100 iterations anduse asymmetric distance (Je´gou et al., 2011), i.e., the distance tables are computedfor each query, as in all our baselines.3.5.2 BaselinesWe compare our approach to previous work, controlling for two critical factorsin large-scale retrieval: code length and query time. To control for code length,we use subcodebooks with h = 256 entries and produce final codes of 64 and 128bits. To control for query time, we restrict our comparison to methods that requirem = {8,16} table lookups to compute approximate distances. Thus, we compareagainst PQ (Je´gou et al., 2011), OPQ (Ge et al., 2014), RVQ (Chen et al., 2010),ERVQ/SQ (Ai et al., 2014; Martinez et al., 2014) and CQ (Zhang et al., 2014), aswell as the AQ-7 method introduced by Babenko and Lempitsky (2014a) (i.e., theoriginal formulation of AQ that we are building on).We use our own implementations of all these methods, except for CQ. We usethe public CQ implementation due to Zhang,3 with minor modifications to run ona Linux system, since the original implementation contains C++ constructs exclu-sive to Windows systems. We also take the results reported in the original paperintroducing AQ (Babenko and Lempitsky, 2014a). We compare our sparse exten-sion against the values reported for SCQ by Zhang et al. (2015). To the best of our3https://github.com/hellozting/CompositeQuantization38knowledge, the paper by Zhang et al. is the only one on the subject, and the authorshave not released code to reproduce their results.3.5.3 DatasetsWe test our approach on 6 datasets. Three of these, SIFT1M, SIFT10M andSIFT1B (Je´gou et al., 2011), consist of 128-dimensional gradient-based, hand-crafted SIFT features. We also collected a dataset of 128-dimensional deep featuresusing the CNN-M-128 network provided by Chatfield et al. (2014), computing thefeatures from a central 224×224 crop of the 1.2M images of the ImageNet Large-Scale Visual Recognition Challenge (ILSRVC) 2012 dataset, and then subsamplinguniformly at random from all classes. It is known that deep features can be effec-tively used as descriptors for image retrieval (Babenko et al., 2014; Razavian et al.,2014), and Chatfield et al. have shown that this intra-net compression results in aminimal loss of retrieval performance.SIFT1M and Convnet1M both have 100 000 training vectors, 10 000 query vec-tors and 1 million database vectors. SIFT1B has 100 million vectors for training,and 1 billion vectors for base, as well as 10 000 queries. On SIFT1B, we followprevious work (Ge et al., 2014; Zhang et al., 2014) and used only the first 1 millionvectors for training. Better results on SIFT1B can be obtained using an invertedindex, but we did not implement this data structure as we focus on improving en-coding performance. This also has the added benefit of making our results directlycomparable to those shown in CQ (Zhang et al., 2014) and CKM (Norouzi and Fleet,2013). With SIFT10M, we followed the same protocol, but use only the 10 millionfirst vectors of SIFT1B as base.We also use 2 datasets where CQ was benchmarked: MNIST (LeCun et al.,1998) and LabelMe22K (Norouzi and Fleet, 2011). MNIST is 784-dimensional,and has 10 000 vectors for query and 60 000 vectors for base. LabelMe22K is512-dimensional and has 2 000 vectors for query and 20 019 vectors for base.ProtocolsDifferent datasets have different partitions, and this leads to important differencesin the way learning and encoding are performed. The classical datasets SIFT1M,39SIFT10M and SIFT1B (Je´gou et al., 2011), as well as our Convnet1M dataset havethree partitions: train, query and database. In this case, the standard protocol is tofirst learn the codebooks using only the train set, then use the learned codebooks toencode the database, and finally evaluate recall@N of the queries with respect tothe database (Babenko and Lempitsky, 2014a, 2015; Ge et al., 2014; Je´gou et al.,2011; Ozan et al., 2016a; Zhang et al., 2015). Zhang et al. (2014), however, useda different protocol for the results reported on their paper. In our experiments,we run their code using the standard protocol. We refer to this partition as train/-query/base.Some datasets, however, have only two partitions of the data: query and database.In this case, the iterative codebook learning procedure is run directly on the database,and recall@N is evaluated on the queries with respect to the database thereafter.For example, Locally Optimized PQ (Kalantidis and Avrithis, 2014) was evalu-ated using this partition on SIFT1B by simply ignoring the training set, and CQwas evaluated on MNIST and LabelMe22K using the train/test partitions that thedatasets provide for classification (Zhang et al., 2014). It has also been arguedthat this partition is better suited for learning inverted indices on very large-scaledatasets (see the last paragraph of (Babenko and Lempitsky, 2014b)). We refer tothis partition as query/base.3.5.4 Parameter settingsOur approach needs to set the number of ILS iterations (i.e., the number of times asolution is perturbed and local search is done). At the same time, ICM may cyclethrough the nodes a number of times, which we call ICM iterations. Finally k, thenumber of elements to perturb, also needs to be defined.We chose our parameters using a subset of the training set of SIFT1M, keepingthe values that minimize quantization error. Figure 3.1 shows the results of our pa-rameter search on 10 000 SIFT descriptors. We note that, given the same numberof ICM cycles, using 4 ICM iterations and perturbing k = 4 code elements resultsin good performance. We observed similar results on other descriptor types, so weused these values in all our experiments. As shown in Figure 3.1, the performanceof our system depends on the number of ILS iterations used, trading-off compu-400 8 16 32 644.14.24.34.44.54.6Quantization error1e4 m=7, 1 ICM iter.k=1k=2k=4k=70 8 16 32 644.14.24.34.44.54.6Quantization error1e4 m=7, 2 ICM iter.k=1k=2k=4k=70 8 16 32 644.14.24.34.44.54.6Quantization error1e4 m=7, 4 ICM iter.k=1k=2k=4k=70 8 16 32 644.14.24.34.44.54.6Quantization error1e4 m=7, 8 ICM iter.k=1k=2k=4k=70 8 16 32 642.22.32.42.52.62.72.82.9Quantization error1e4 m=15, 1 ICM iter.k=1k=2k=4k=8k=150 8 16 32 642.22.32.42.52.62.72.82.9Quantization error1e4 m=15, 2 ICM iter.k=1k=2k=4k=8k=150 8 16 32 64Total ICM iterations2.22.32.42.52.62.72.82.9Quantization error1e4 m=15, 4 ICM iter.k=1k=2k=4k=8k=150 8 16 32 64Total ICM iterations2.22.32.42.52.62.72.82.9Quantization error1e4 m=15, 8 ICM iter.k=1k=2k=4k=8k=15Figure 3.1: Quantization error as a function of ILS iterations, ICM iterationsand number of codes perturbed k on 10 000 vectors of the SIFT1M traindataset after random initialization. Using 4 perturbations and 4 ICMiterations gives good results in all cases, so we use those parameters inall our experiments. Using no perturbations (k = 0), as done in previouswork (Babenko and Lempitsky, 2014a; Zhang et al., 2014), leads tovalues that are above all the plots that we are showing, and stagnatesafter about 3 ICM iterations.tation for recall. We evaluated our method using 16 and 32 ILS iterations on thebase set of train/query/base datasets, and refer to these methods as Local searchquantization LSQ-{16, 32}. During training, we used only 8 ILS iterations.413.6 ResultsSince our method relies heavily on randomness for encoding, it is natural to thinkthat the final performance of the system could exhibit large variance. To quantifythis effect, we ran our method 10 times on each dataset, and report the mean andstandard deviation in recall@N achieved by our method. To see how this comparesto previous work, we ran all the baselines 10 times and report their mean andstandard deviation in recall@N as well. We observe that Local search quantization(LSQ), despite relying heavily on randomness for encoding, exhibits a variabilityin recall similar to that of PQ and OPQ (and presumably that of other baselines aswell).This is consistent with the fact that LSQ solves a large number of independentinstances of combinatorial optimization problems of similar difficulty; in situationslike this, the solution qualities achieved within a fixed running time are typicallynormal-distributed (Hoos and Stu¨tzle, 2004, Chapter 4). In other words, despitebeing heavily randomized, the performance of our system turns out to be verystable in practice, because it is averaged over a large number of data points.3.6.1 Small training/query/base datasets100 101 102 103N0.20.30.40.50.60.70.80.91.0Recall@NSIFT1MLSQ-32 128 bitsCQ 128 bitsERVQ 128 bitsRVQ 128 bitsOPQ 128 bitsPQ 128 bitsLSQ-32 64 bitsCQ 64 bitsERVQ 64 bitsRVQ 64 bitsOPQ 64 bitsPQ 64 bits100 101 102 103N0.00.20.40.60.81.0Recall@NConvnet1MLSQ-32 128 bitsERVQ 128 bitsRVQ 128 bitsOPQ 128 bitsPQ 128 bitsLSQ-32 64 bitsERVQ 64 bitsRVQ 64 bitsOPQ 64 bitsPQ 64 bitsFigure 3.2: Recall@N curves for (left) the SIFT1M, and (right) Convnet1Mdatasets.First, we report the recall@N results for SIFT1M and Convnet1M in Figure 3.2,42SIFT1M – 64 bits SIFT1M – 128 bitsR@1 R@10 R@100 R@1 R@2 R@5PQ 22.59±0.43 59.90±0.28 91.98±0.17 44.53±0.40 60.49±0.40 78.94±0.28OPQ 24.43±0.41 63.73±0.27 94.20±0.16 46.14±0.50 62.27±0.36 80.97±0.27ERVQ 22.37±0.34 60.48±0.43 92.58±0.17 42.31±0.80 57.91±0.98 76.77±0.96ERVQ 23.63±0.37 63.25±0.46 93.92±0.24 45.43±0.40 61.64±0.51 80.11±0.43CQ 16.29 49.95 85.59 33.73 48.03 66.25AQ 26 70 95 − − −LSQ-32 29.54±0.33 72.65±0.51 97.34±0.14 55.10±0.38 72.09±0.43 88.78±0.25Table 3.1: Detailed recall@N values for our method on the SIFT1M dataset.where it is immediately clear that LSQ outperforms the classical baselines, PQ andOPQ, in recall@N for all values of N. Similarly, our method outperforms CQ whenusing 16 ILS iterations, and the gap is widened when using 32 iterations. As we willsee, analogous effects are observed throughout our results. In Table 3.1 we showour results in more detail, providing mean recall@N values and standard deviationsfor the SIFT1M dataset.We also observe that CQ performs much worse than reported in the originalpublication, and in fact its recall is lower than the classical baselines PQ and OPQ.This is due to the difference in the size of the training set used in the publication(the entire base set), versus what is normally used in the literature (10× smallertraining set). This suggests that CQ is not very sample efficient, as it needs moredata than the rest of the baselines to generalize.On Convnet1M, the advantage of LSQ over PQ, OPQ and other baselines is morepronounced. When using 64 bits, our method achieves a recall of 18.64, more thandouble that of PQ at 7.13, and with an 81% improvement over OPQ at 10.28. Theseresults show an increased advantage of our method over orthogonal approaches indeep-learned features, which currently dominate computer vision applications.3.6.2 Query/base datasetsIn Figure 3.3 and Table 3.2, we report results on datasets with query/base partitions:MNIST and LabelMe22K. Our method also outperforms the state of the art onboth datasets. We also note that CQ again performs 2 to 5 points below the resultsreported in the original publication, and in this case, this discrepancy cannot be43100 101 102N0.30.40.50.60.70.80.91.0Recall@NMNISTLSQ-32 128 bitsCQ 128 bitsERVQ 128 bitsRVQ 128 bitsOPQ 128 bitsPQ 128 bitsLSQ-32 64 bitsCQ 64 bitsERVQ 64 bitsRVQ 64 bitsOPQ 64 bitsPQ 64 bits100 101 102N0.20.30.40.50.60.70.80.91.0Recall@NLabelMeLSQ-32 128 bitsCQ 128 bitsERVQ 128 bitsRVQ 128 bitsOPQ 128 bitsPQ 128 bitsLSQ-32 64 bitsCQ 64 bitsERVQ 64 bitsRVQ 64 bitsOPQ 64 bitsPQ 64 bitsFigure 3.3: Recall@N curves for (left) the MNIST, and (right) LabelMe22Kdatasets.MNIST – 64 bits LabelMe22K – 64 bitsR@1 R@2 R@5 R@1 R@2 R@5PQ 29.76±0.37 45.24±0.33 67.88±0.63 16.63±0.46 24.43±0.73 38.61±0.74OPQ 35.13±0.65 51.89±0.64 74.95±0.47 32.43±0.78 46.17±0.92 66.70±1.08RVQ 32.85±0.70 48.98±0.82 71.83±0.83 28.56±0.97 41.47±1.22 60.57±1.68ERVQ 34.70±0.45 51.25±0.62 73.95±0.58 30.28±0.91 43.25±1.05 62.36±1.45CQ 40.94 58.80 81.16 31.40 45.45 65.50LSQ-8 42.38±0.49 60.78±0.38 83.42±0.28 35.32±1.13 50.39±1.54 71.25±0.70Table 3.2: Detailed recall@N on the MNIST and LabelMe22K datasets.attributed to a difference in training protocol. Unfortunately, this means that theresults reported by Zhang et al. (2014) cannot be reproduced even when using thecode provided by the original authors.3.6.3 Very large-scale training/query/base datasetsFinally, we report results on two very large-scale datasets: SIFT10M and SIFT1Bin Figure 3.4, with detailed results in Table 3.3. On SIFT1B with 64 bits, LSQ-32shows a relative improvement of 13% in recall@1 over CQ, our closest competitor,and consistently obtains between 2 and 3 points of advantage in recall over CQ inother cases. These are, to the best of our knowledge, the best results reported onSIFT1B using exhaustive search so far.44100 101 102 103N0.10.20.30.40.50.60.70.80.91.0Recall@NSIFT10MLSQ-32 128 bitsOPQ 128 bitsPQ 128 bitsLSQ-32 64 bitsOPQ 64 bitsPQ 64 bits100 101 102 103N0.00.20.40.60.81.0Recall@NSIFT1BLSQ-32 128 bitsOPQ 128 bitsPQ 128 bitsLSQ-32 64 bitsOPQ 64 bitsPQ 64 bitsFigure 3.4: Recall@N curves for very large-scale datasets: (left) theSIFT10M, and (right) SIFT1B.SIFT10M – 64 bits SIFT10M – 128 bitsR@1 R@10 R@100 R@1 R@2 R@5PQ 15.79 50.86 86.57 39.31 54.74 74.89OPQ 17.49 54.92 89.41 40.80 56.73 77.15CQ 21 63 93 47 64 84LSQ-16 22.51 64.62 94.67 49.26 66.75 85.36LSQ-32 22.94 65.20 94.85 49.50 67.31 86.33SIFT1B – 64 bits SIFT1B – 128 bitsR@1 R@10 R@100 R@1 R@2 R@5PQ 06.34 24.41 56.92 26.38 38.57 56.45OPQ 07.02 27.34 61.89 28.43 40.80 59.53CQ 09 33 70 34 48 68LSQ-16 09.73 35.82 73.84 35.32 50.84 70.66LSQ-32 10.18 36.96 75.31 36.35 51.99 72.13Table 3.3: Detailed recall@N values for our method on large-scale datasets:SIFT10M and SIFT1B.3.6.4 Sparse extensionFollowing Zhang et al. (2015), we evaluated the sparse version of our method using2 different levels of sparsity: SLSQ1, with ‖C‖0≤ S= h ·d, which has a query timecomparable to PQ, and SLSQ2, with ‖C‖0 ≤ S = h ·d+d2, which has a query timecomparable to OPQ. We compare against SCQ1 and SCQ2 by Zhang et al. (2015),which have the same levels of sparsity. A detailed study of the relationship betweensparsity and query times for very-large-scale datasets has already been presentedin Zhang et al. (2015), and it is clear that our improved encoding method has noeffect on query time, so we do not repeat those experiments. Instead, we focus ona smaller dataset, SIFT1M. As noted in Section 3.4, substantial speedups can alsobe obtained on small datasets by using sparse codebooks.Again, in this scenario we observed improved performance compared to our45100 101 102 103N0.20.30.40.50.60.70.80.91.0Recall@NSIFT1MSLSQ2-32 64 bitsSLSQ1-32 64 bitsOPQ 64 bitsPQ 64 bitsFigure 3.5: Recall@N curves forour sparse methods SLSQ1and SLSQ2 on SIFT1M.SIFT1M – 64 bitsR@1 R@10 R@100PQ 22.59±0.43 59.90±0.28 91.98SCQ 25 67 95SLSQ1-16 25.89±0.25 66.49±0.37 95.25±0.18SLSQ1-32 26.49±0.31 67.39±0.30 95.59±0.22OPQ 24.43±0.41 63.73±0.27 94.20±0.16SCQ2 27 68 96SLSQ2-16 27.87±0.23 69.85±0.28 96.47±0.15SLSQ2-32 28.40±0.17 70.59±0.29 96.76±0.22Table 3.4: Recall@N values for thesparse extension of our method onSIFT1M using 64 bits.baselines. Our improvement is of 1.49 and 1.40 over SCQ. This is close to the 1.84gain that OPQ achieves over PQ in the same dataset.3.6.5 Encoding speedMethod Sequential Batchedcodebooks (m) 7 15 7 15LSQ-16 (ms.) 1.52 7.02 0.53 2.01LSQ-32 (ms.) 3.01 13.93 1.05 4.03Method GPU (batched)codebooks (m) 7 15LSQ-16 (µs) 17.9 67.2LSQ-32 (µs) 35.7 134.3Method Exhaustive NNcodebooks (m) 8 16PQ (µs) 42.6 77.9OPQ (µs) 49.2 90.3Table 3.5: Time spent per vector during encoding in our approach. “Sequen-tial” refers to an LSQ implementation where ICM encodes each point se-quentially (i.e., does not take advantage of the shared pairwise terms).“Batched” is our LSQ implementation, which performs conditioning ofshared pairwise terms among several data points.In Table 3.5 we show the speed advantage that sharing the pairwise terms givesto LSQ over a naı¨ve implementation that encodes each point sequentially. The tableshows that our method, implemented in Julia (Bezanson et al., 2014) with someC++ bindings, remains fast when using up to 32 ILS iterations, handily achiev-ing speeds faster than real-time (we believe that even better performance could beachieved with a C implementation). With our GPU implementation, it is possible46to encode SIFT1B using 128 bits and 32 ILS iterations in around 1.5 days.3.7 ConclusionWe have introduced a new method for solving the encoding problem in Additivequantization (AQ) based on Stochastic local search (SLS). The high encoding per-formance of our method demonstrates that the elegant formulation introduced byAQ can be leveraged to achieve an improvement over the current state of the artin Multi-codebook quantization (MCQ). We have also shown that our method canbe easily extended to accommodate sparsity constraints in the codebooks, whichresults in another conceptually simple method that outperforms its competitors.47Chapter 4Stochastic relaxations and fastcodebook updatesEvils coming ’round at fiveTo get twice as much done in half the time—Street ChantA wide range of optimization methods for multi-codebook quantization (MCQ)alternatively update the codebooks and codes until convergence. A common prob-lem with such iterative methods is their tendency to stagnate in local minima; forexample, in the single-codebook case, equivalent to k-means, it is known that theselocal minima may be arbitrarily bad (Arthur and Vassilvitskii).In this chapter, we introduce two inexpensive approximations of simulated an-nealing (a well-established method for escaping local minima) that are directlyapplicable to MCQ. Our methods are inspired by a family of techniques used toimprove k-means clustering that date back to the work of Linde et al. (1980), andwere formalized by Zeger et al. (1992). This first contribution leads to improved(i.e., lower-error) MCQ solutions, with a negligible increase in computational cost.We also focus on tackling the computational cost of training in local searchqunatization (LSQ), which is a major barrier to its applicability to very large-scaledatasets. To this end, we introduce a codebook update method that is 1 to 2 ordersof magnitude faster than previous work. This novel method is based on two key48observations:1. It is possible to use a direct regularized least-squares method based on Choleskydecomposition to update the codebooks (rather than the iterative conjugategradient (CG) methods used in previous work), and2. The computational bottleneck of this process consists of a transposition andmultiplication of a very large binary matrix, and both of these operations canbe accelerated without the need to explicitly build large sparse matrices.Overall, our second contribution results in reduced training times.Our improvements directly improve LSQ, which we introduced in Chapter 3.4.1 Background and related workThe first of our improvements deals with the codebook update step of MCQ. Thesecond is a technique inspired in simulated annealing which, in the past, has beenused to improve k-means clustering. We review both areas next.4.1.1 Codebook update in MCQUpdating one codebook at a timeGreedy MCQ optimization approaches such as RVQ (Chen et al., 2010) and ERVQ/SQ (Ai et al., 2014; Martinez et al., 2014) update one codebook at a time. Updatinga single codebook in MCQ is equivalent to the codebook update step in k-means,which means that this step is not particularly computationally expensive. More-over, the best performing MCQ methods use Expectation-maximization (EM)-likeoptimization. In this case, the codebook update step amounts to solving a large-scale least-squares problem. In this chapter we focus on improving the codebookupdate step commonly used in EM-like MCQ optimization.Updating all codebooks at onceFormally, updating all the codebooks in MCQ in an EM-like approach amounts todeterminingminC‖X−CB‖2F . (4.1)49When introducing AQ, Babenko and Lempitsky (2014a) noted that this problemcan be solved independently across each dimension:While the least-squares optimization [...] may seem large (given largen and [d]), one can observe that it decomposes over each of the d di-mensions. Thus, instead of solving a single large-scale least squaresproblem with [mhd] variables, it is sufficient to solve [d] least-squaresproblems with [mh] variables, which can be formulated as an overcon-strained system of linear equations. [...] As a result, the complexity ofcodebook learning is dominated by the encoding step.The authors further proposed to use a large-scale conjugate-gradient solver (Fongand Saunders, 2011), and in practice built a sparse representation of B which waspassed to the solver, and reused for all d problems.Codebook update in CQ. Zhang et al. (2014) also used the AQ formulation toinitialize their main method (CQ). The authors derive a closed-form solution forthe problem: C = XB>(BB>)−1, but give no further details about the exact opti-mization procedure. Three years later, looking at the CQ code release due to Zhanget al. (https://github.com/hellozting/CompositeQuantization), we have realized thatthe matrix inversion is done with the help of a singular value decomposition, ignor-ing solution components associated with small eigenvalues (because the solutionmay not be numerically stable). Moreover, the authors build a dense matrix to rep-resent B. Thus, this method is in practice slower than using a conjugate gradientsolver, and can also benefit from the contributions that we introduce here.Advantages of reducing running time of the codebook update step in MCQIn MCQ, updating the codebooks and updating the codes are the two main opti-mization steps. In the original AQ paper, Babenko and Lempitsky noted that therunning time is dominated by the encoding step. However, after introducing LSQand its GPU encoding implementation, we noticed that on the SIFT1M dataset at64 bits:• updating B takes around 1.2 seconds, while50• updating C takes around 5.6 seconds.Thus, we clearly have a practical motivation to decrease the running time of thecodebook update step.4.1.2 Improving k-meansTo put stochastic relaxations in context, we review them next to several other tech-niques that improve the standard k-means algorithm; such improvements can bebroadly grouped in four categories.1. Encoding speed. In k-means encoding amounts to finding, for each vector inthe database, its nearest neighbour in a given codebook. For large datasets,large codebooks or high vector dimensionality, this can become a major com-putational bottleneck. Approaches to speed up encoding take advantage ofthe triangle inequality (Elkan, 2003) or convergent low-dimensional boundson metric spaces (Hwang et al., 2012a) to reduce the number of codebookvectors to which exact distance has to be computed. In the computer visioncommunity, Philbin et al. (2007) popularized the use of randomized kd-treesfor approximate encoding to build bag-of-visual-words representations.2. Initialization. A typical way to initialize the k-means algorithm is to producean initial codebook by sampling points from the dataset uniformly at ran-dom. Multiple alternatives for initialization have been proposed (for a recentsurvey, see the work by Celebi et al. (2013)), and perhaps the most popularinitialization algorithm is currently k-means++ (Arthur and Vassilvitskii),which also offers theoretical guarantees on the quality of the initialization.3. Soft assignments. It is also possible to assign each database vector to multi-ple entries in a codebook using multiple, weighted assignments (Dunn, 1973;Jain, 2010). According to Jain (2010), the idea can be attributed to Dunn(1973). Martinetz et al. (1993) used a ‘neural gas’ network with a softmaxlayer for soft cluster assignments, which allowed for batched optimizationusing backpropagation.514. Stochastic relaxations (SR). Several methods have exploited the idea of in-jecting randomness into the optimization process of k-means in order to im-prove the quality of the solution1 (Flanagan et al., 1989; Linde et al., 1980;Vaisey and Gersho, 1988); we follow Zeger et al. (1992) and refer collec-tively to these methods as Stochastic eelaxation (SR). The idea is to exploitSimulated annealing (SA), which is guaranteed to eventually find a globallyoptimal solution but in practice requires extremely slow temperature decayschedules, leading to impractical (exponential) running times. Therefore, theresearch challenge is to find SA approximations that produce good solutionsat reasonable computational costs.The first three approaches (improving encoding speed, improving initializationor using soft assignments) are not trivially adaptable to MCQ. For example, it is notimmediately clear how the triangle inequality or kd-trees could be used on the com-binatorial number of centroids created by MCQ. On the other hand, initializationapproaches typically add one codebook entry at a time (Arthur and Vassilvitskii);this is problematic in our case, as in MCQ, adding one entry in a codebook createsexponentially many new centroids. Finally, there is no straightforward way of list-ing multiple assignments on a combinatorial space in a computationally efficientway.Fortunately, some SR methods can be formalized by adopting a functional viewof k-means (Zeger et al., 1992), in which it is assumed that the optimization stateis fully observable given one of the two latent variables in the problem. In Sec-tion 4.2.2, we define a functional view of MCQ, which allows us to successfullyextend SR to MCQ.1Notably, in their seminal work on vector quantization, Linde et al. (1980) mention an examplewhere adding gradually decreasing amounts of Gaussian noise to the training examples results inbetter solutions. The authors “[. . . ] conjecture that this technique will work more generally [. . . ] but[. . . ] have been unable to prove this theoretically”. See (Linde et al., 1980, p. 93), for more details.524.2 Solution methodologyOur first contribution deals with the codebook update step, which amounts to de-terminingminC‖X−CB‖2F . (4.2)The current state-of-the-art method for this step was originally proposed byBabenko and Lempitsky (2014a), who noticed that finding C corresponds to a least-squares problem where C can be found independently in each dimension. Theauthors further use an iterative conjugate gradient method (Fong and Saunders,2011) in this step, which has the additional advantage that B can be reused for thed problems that finding C decomposes into. We have identified two problems withthis approach:1. Explicit sparse matrix construction is inefficient. Using conjugate gradientmethods requires that B be converted into an explicit sparse matrix. Althoughefficient data structures for sparse matrices exist (e.g., the compressed sparserow of numpy), in practice, B is stored as an m× n uint8 matrix. Ideally,we would like to use B directly and avoid using an additional data structure.2. Failure to exploit the binary nature of B. Matrix B is composed exclusivelyof ones and zeros (i.e., it is binary). Data structures used for sparse matricesare commonly designed for the general case when the non-zero entries arearbitrary real numbers, which leaves room for additional optimization.4.2.1 Direct codebook updateWe now introduce our method for fast codebook update, which takes advantage ofthese two observations. First, we note that it is possible to use a numerically-stabledirect method by rewriting Expression 4.2 as a regularized least-squares problem:minC‖X−CB‖2F +λ‖C‖2F , (4.3)53In this case, the optimal solution can be obtained by taking the derivative withrespect to C and setting it to zero, as is commonly done in ridge regressionC = XB>(BB>+λ I)−1. (4.4)While we are not interested in a regularized solution, we can still benefit fromthis formulation by setting λ to a very small value (λ = 10−4 in our experiments),which simply renders the solution numerically stable.We further note that the matrix BB> + λ I ∈ Rmh×mh is square, symmetric,positive-definite and fairly compact; notably, its size is independent of n. Thus, ma-trix inversion can be performed directly via Cholesky decomposition in O(m3h3)time. Since matrix inversion is efficient, the bottleneck of our method lies in com-puting BB> ∈ Nmh×mh, as well as XB> ∈ Rd×mh. We exploit the structure in B toaccelerate both operations.Computing BB>. By indexing B across each codebook, B = [B1, · · · ,Bm]>, BB>can be written as a block-symmetric matrix composed of m2 blocks of size h× heach:BB> =B1B>1 B1B>2 . . . B1B>mB2B>1 B2B>2 . . . B2B>m....... . ....BmB>1 BmB>2 . . . BmB>m , (4.5)here, the diagonal blocks BNB>N are diagonal matrices themselves, and since B isbinary, their entries are a histogram of the codes in BN . Moreover, the off-diagonalblocks are the transpose of their symmetric counterparts: BNB>M = (BMB>N )>, andcan be computed as bivariate histograms of the codes in BM and BN . Using thesetwo observations, this method takes O(m2n) time, while computing BB> naı¨velywould take O(m2h2n).Computing XB>. We again take advantage of the structure of B to accelerate thisstep. XB> can be written as a matrix of m blocks of size d×h each,54XB> = [XB>1 ,XB>2 , . . . ,XB>m ]. (4.6)Each block XB>i can be computed by treating the B>i columns as binary vectors thatselect the columns of X to sum. This method takes O(mnd) time, while computingXB> naively would take O(mhnd).4.2.2 Stochastic relaxations (SR)Broadly defined, simulated annealing (SA) is a classical stochastic local search(SLS) technique (Hoos and Stu¨tzle, 2004) that requires 4 major components:1. Intitialization: A function that creates the initial candidate solution s.2. Neighbourhood: The space of candidate solutions s′ that may be exploredfrom a starting solution s.3. Temperature schedule: A function that controls the probability to acceptor reject a new (worse) solution. (Better solutions are typically always ac-cepted). The temperature is normally decreased as the optimization pro-gresses. Various temperature schedules have been proposed and used in themany applications of simulated annealing.4. Termination criterion: A condition that must be satisfied for the search toend.A stochastic relaxation (Zeger et al., 1992) relaxes some of the typical SA stepsin order to make them more computationally efficient. For component (1), weinitialize our solutions uniformally at random, and for component (4) we simplylimit the number of search iterations to a constant.We now formally define our optimization state s, perturbation procedure pi(·)and acceptance criterion, for our method, which form the remaining componentsof our SA approximation.Defining a SA state: a functional view of MCQWe formally define an optimization state in MCQ. Since Expression 2.4 is definedover two latent variables, C and B, a naı¨ve approach is to define the state at time55step i as a 2-tuple si = (Ci,Bi). However, it is not clear how one could use thisdefinition to perturb the state: for example, adding random noise to B would alterits discrete nature, taking it out of the space of feasible solutions.Moreover, SLS methods are often designed to balance two main factors: (a)the amount of perturbation that allows them to escape from local minima, and(b) the ability of a local search method to recover from such perturbations and,potentially, find new and improved solutions. Globally, EM-like MCQ methodsalready rely on powerful local search techniques, namely the codebook-update andencoding methods. This suggests that these functions can be used as a subroutinesin a global method that improves the quality of the solution. Following this idea,the first step in SR is to make the assumption that the optimization state is fullyobservable given a single variable, either C or B, which fully specifies the other viaa function. We thus define two functions:• an “encoder” function C (X ,B)→C, and• a “decoder” function D(X ,C)→ B.Notice that, in k-means, C updates the centroids by taking the means of its assign-ments, and D assigns each training point to its nearest neighbour in the codebook.In MCQ, C amounts to the codebook-update step, for which we adopt the fast ridgeregression method described in Section 4.2.1. Similarly, D amounts to updatingthe codes B; in this case, we adopt the encoding method introduced in Chapter 3,based on Iterated local search (ILS) with ICM and random perturbations.We can now formally define the optimization state of MCQ in two ways, whichwill eventually give rise to two different SR methods. The first method, calledSR-C, uses the encoder function C , and the second method, called SR-D, uses thedecoder function D .Perturbing the SA stateThe next step is to define a way to perturb the SA state at time-step i. We similarlydefine 2 perturbation methods, one for SR-C and one for SR-D. Since we havedefined the state as fully-observable given either variable via a proxy function,we can perturb the state by simply perturbing the function used in SR-C or SR-D.Formally, we define the functions56• C ∗ := C (piC(X , i),B)→C for SR-C, and• D∗ :=D(X ,piD(C, i))→ B for SR-D,where the function piC(X , i)→ X +T (i) ·ε amounts to adding noise ε to X , accord-ing to a predefined temperature schedule T . In our case, we have chosen our noiseto be sampled from a zero-mean Gaussian with a diagonal covariance proportionalto X ; in other words, ε ∼ N(0,Σ), where Σ= diag(cov(X)).The main difference between k-means and MCQ is that, in MCQ, we use multi-ple codebooks. This difference is particularly important in SR-D, where the noiseaffects C, which represents m different codebooks. Since the centroids are ob-tained by summing one entry from each codebook, perturbing C amounts to per-turbing the centroids m times. We found that this perturbation is too extreme andresults in poorer solutions. Thus, in SR-D we instead use the average covarianceof each codebook – this amounts to multiplying the noise by a factor of 1/min SR-D. We thus define the perturbation function for SR-D slightly differently:piD(C, i)→ C+(T (i)/m) · ε . In other words, we multiply the noise by factor of1/m in SR-D.Notice that, in these functional formulations, the discrete variable B is neverperturbed. Thus, our approximations remain anytime algorithms and do not requireadditional rounding steps.Temperature schedule. In SA, it is common to gradually reduce the amount ofnoise that controls the probability of accepting of a new state (the so-called Metropolis-Hastings criterion). In SR, following Zeger et al. (1992), we instead use the tem-perature to control the amount of noise added in each time-step and consider threetemperature schedules:T (i)→ (1− (i/I))p , (4.7)T (i)→ 1/(i+1)p , and (4.8)T (i)→ pi, (4.9)where I is the total number of iterations, i represents the current iteration, andp ∈ (0,1] is a tunable hyper-parameter. We experimented with values of p =57{0.1,0.5,1} and found that the first temperature schedule (Eq. 4.7) and a value ofp = 0.5 produce the best results. We use these parameters in all our experiments.Acceptance criterionThe final building block of SA is an acceptance criterion, which decides whether thenew (perturbed) state will be accepted or rejected. Following Zeger et al. (1992),we decide to always accept the new state. As we will show, this criterion givesexcellent results in practice, and makes it very easy to control for running time, aseach iteration is practically identical to LSQ.4.2.3 Recap: SR-C and SR-DTo summarize, we have now introduced two algorithms that define crude (but, aswe will see, very effective) approximations to simulated annealing: SR-C and SR-D.These approximations are also very simple to implement, and are in fact not verydifferent in practice from LSQ. To highlight this simplicity, we have outlined theLSQ optimization pipeline in Algorithm 3:Algorithm 3 Block coordinate descent approaches to MCQ1: function LSQ(X , I)2: B← initialization()3: i← 1 . Iteration counter4: while i < I do5: C← argminC ‖X−CB‖2F . Codebook update6: B← argminB ‖X−CB‖2F . Encoding step7: i← i+18: end while9: return C,B10: end functionNotice that• SR-C follows Algorithm 3 exactly, except that line 5 is replaced byC← argminC‖piC(X , i)−CB‖2F ;• SR-D follows Algorithm 3 exactly, except that line 6 is replaced byB← argminB‖X−piD(C, i)B‖2F .58In other words, SR-C and SR-D amount to adding noise in different parts ofthe typical MCQ optimization pipeline, but the workhorse functions that performthe codebook-update and the encoding encoding step remain unchanged. As aresult, if in the future better codebook-update or encoding functions are developed,they can be seamlessly integrated into our pipelines. Finally, we note that ourmethods involve only minimal computational overhead, as they only require thecomputation of the covariance of either X (which can be computed once and re-used many times in (SR-C)), or C, which is a compact variable independent of n. Inpractice, this overhead is negligible: < 0.1 seconds for SR-C, and < 0.01 secondsfor SR-D.4.3 ExperimentsWe now describe the evaluation protocol that we use to empirically validate ourcontributions.Datasets and protocolsWe evaluate our approach on five datasets. The first three are SIFT1M (Je´gouet al., 2011), LabelMe22K (Russell et al., 2008) and MNIST (LeCun et al., 1998).These correspond to handcrafted features (SIFT and GIST), and small raw images(MNIST). The next two datasets are based on features extracted from deep con-volutional neural networks. The VGG dataset is extracted from the last layer ofthe CNN-M-128 network provided by Chatfield et al. (2014); this network inter-nally compresses the features by limiting the dimensionality of the output to 128units. Finally, we put together the Deep1M dataset, by sampling from the 10 mil-lion example set provided with the recently introduced Deep1B dataset (Babenkoand Lempitsky, 2016). These features come from the last convolutional layer ofa network similar to GoogLeNet v3 (Szegedy et al., 2014), and have been PCA-projected to 96 dimensions. Table 4.1 summarizes the datasets used in our ANNexperiments.We follow the same protocol as in the previous chapter, and use the training setto learn the codebooks C; we then use those codebooks to encode the base set (i.e.,obtain B), and finally use the query set to find approximate nearest neighbours in59Dataset d # training # query # baseSIFT1M 128 100 000 10 000 1 000 000LabelMe22K 512 – 2 000 20 019MNIST 784 – 10 000 60 000Deep1M 96 100 000 10 000 1 000 000VGG 128 100 000 10 000 1 000 000Table 4.1: Statistics of the datasets used in our ANN experiments.the compressed base set. MNIST and LabelMe22K, however, do not have a trainset, so we use the base set to learn both C and B. This protocol is in line withprevious work (Gong and Lazebnik, 2011; Zhang et al., 2014).EvaluationTo evaluate our first contribution (i.e., fast codebook update), we measure code-book update times. We also use our improved codebook update method whilereproducing the results of LSQ (introduced in Chapter 3). Obtaining the same re-call@N values as in previous work while using our fast codebook update methodvalidates that our contribution does not affect the quality of the solution.To evaluate SR, we report recall@N on all our datasets. Recall@N representsthe empirical probability over the query set that the actual nearest neighbour ofthe query is found within the first N retrieved entries. In the retrieval literature,recall@1 is often considered the most important value in this curve. We also runall the methods ten times on each dataset and report the mean of the five trials tocontrol for the inherent randomness in recall@N.BaselinesOur baselines are the orthogonal MCQ methods PQ (Je´gou et al., 2011) and OPQ (Geet al., 2014; Norouzi and Fleet, 2013), as well as the non-orthogonal RVQ (Chenet al., 2010), ERVQ/SQ (Ai et al., 2014; Martinez et al., 2014) and CQ (Zhang et al.,2014), as well as LSQ, i.e., the method that we introduced in Chapter 3. We controlfor memory usage by making our baselines and our method produce final codes of6064 and 128 bits. This means that PQ, OPQ and CQ use 8 and 16 codebooks, whilethe rest of the methods use 7 and 15 codebooks, plus an extra codebook to encodethe norm of the approximation. This has the side effect of controlling for querytime as well, because all the methods use 8 (resp. 16) table lookups to computeapproximate distances.4.4 ResultsIn this section we present the results of our experiments.4.4.1 Fast codebook update64 bits 128 bitsSIFT1M CG Chol CG CholJulia, C++ 14.2s 5.6s 38.5s 22.5sJulia, CUDA 6.8s 1.2s 20.3s 4.3s64 bits 128 bitsDeep1M CG Chol CG CholJulia, C++ 9.5s 7.2s 30.7s 24.2sJulia, CUDA 3.3s 1.0s 10.6s 4.1sTable 4.2: Total time per LSQ/LSQ++ iteration, depending on how we updateC (CG or Cholesky), and how we update B (using a C++ or a CUDAimplementation).We show the total elapsed time for training once we incorporate our codebookupdate method on Table 4.2. Our method saves anywhere from 2.3 (Deep1M, 64bits) to 16 seconds (SIFT1M, 128 bits) of training time per iteration. Alternatively,when encoding is GPU-accelerated, it results in 2.2−5.6× speedups in practice.Large-scale experimentsOn Figure 4.1, we show a “stress-test” comparison between our method for fastcodebook update and Conjugate gradient (CG), using dataset sizes of n= {104,105,106,107}.We take the first n training vectors from the SIFT1B (Je´gou et al., 2011) andDeep1B (Babenko and Lempitsky, 2016) datasets, and generate a random B. Thisis a specially easy case for CG, and it takes only 2-3 iterations to converge. Even inthis case, our method is orders of magnitude faster than previous work, and staysunder 10 seconds in all cases, while CQ takes up to 700 seconds for n = 107. Our61104 105 106 107Dataset size (log)100101102103Seconds (log)Codebook update time SIFT 1B104 105 106 107Dataset size (log)100101102103 Codebook update time Deep 1BCG 128 bits CG 64 bits Ours 128 bits Ours 64 bitsFigure 4.1: Time for codebook update as a function of dataset size with up toten million vectors.method is only slower on small training sets, perhaps due to the complexity ofmatrix inversion, which is independent of n.4.4.2 SR-C and SR-DSR requires the setting of several hyperparameters. First, we have to choose theperturbation method to use (either SR-C, or SR-D), and second, we have to choosethe temperature schedule with its corresponding decay parameter. We report re-sults using both methods, Eq. 4.7 for temperature decay, and p = 0.5. All resultsreported in this work have been obtained using the same hyperparameters, whichmakes our two contributions usable as out-of-the-box improvements on top of LSQwithout hyperparameter tuning.Experimentally, we have found that SR-D (which perturbs the codebooks) per-forms better on handcrafted features and on MNIST, and SR-C performs better ondeep features, which may be used as simple rule of thumb for practitioners. How-ever, SR-C also tends to be more unstable.Query/base datasetsIn Figure 4.2, we show results on the small query/base datasets MNIST and La-belMe22K, after training for 100 iterations. We also present results in detail inTable 4.3.62100 101 102 103 104 105Seconds (log)0.100.150.200.250.300.350.400.450.50Recall@1MNIST 64 bitsCQLSQSR-CSR-DLSQ GPUSR-C GPUSR-D GPUERVQRVQOPQPQ100 101 102 103 104 105Seconds (log)0.20.30.40.50.60.7Recall@1MNIST 128 bits100 101 102 103 104 105Seconds (log)0.100.150.200.250.300.350.400.450.50Recall@1LabelMe 64 bits100 101 102 103 104 105Seconds (log)0.20.30.40.50.60.7Recall@1LabelMe 128 bitsFigure 4.2: Running time vs recall@1 on the MNIST and LabelMe datasets.MNIST – 64 bits LabelMe22K – 64 bitsR@1 R@2 R@5 R@1 R@2 R@5PQ 29.76±0.37 45.24±0.33 67.88±0.63 16.63±0.46 24.43±0.73 38.61±0.74OPQ 35.13±0.65 51.89±0.64 74.95±0.47 32.43±0.78 46.17±0.92 66.70±1.08RVQ 32.85±0.70 48.98±0.82 71.83±0.83 28.56±0.97 41.47±1.22 60.57±1.68ERVQ 34.70±0.45 51.25±0.62 73.95±0.58 30.28±0.91 43.25±1.05 62.36±1.45CQ 40.94 58.80 81.16 31.40 45.45 65.50LSQ-8 42.38±0.49 60.78±0.38 83.42±0.28 35.32±1.13 50.39±1.54 71.25±0.70SR-D-8 42.90±0.34 61.25±0.29 83.29±0.26 37.61±1.00 52.54±0.94 73.45±0.91SR-C-8 42.26±0.26 60.79±0.41 82.75±0.29 32.51±0.86 46.34±0.78 66.33±1.03Table 4.3: Detailed recall@N on the MNIST and LabelMe22K datasets.63We can observe that SR-D always outperforms LSQ, and using the fast code-book update means that the method is also more efficient in practice. We alsoobserve that CQ (Zhang et al., 2014) is an outlier in terms of computation – themost expensive method by far – , but is not competitive in terms of recall. Finallywe also observe that, in these two datasets, SR-C often yields lower recall than LSQ.SR-D method achieves the best improvement in LabelMe22K at 64 bits, wherewith 100 iterations, it obtains 0.023 (absolute) better recall@1, or a relative 6.4%improvement over the state of the art. In other cases, the gain ranges from 0.005 to0.015 in recall@1 over the state of the art, requiring less computation than previouswork.Train/query/base datasetsIn Figure 4.3, we show the recall@1 achieved by our methods on train/query/basedatasets as a function of the total time they used for learning. In this Figure, we rantraining for 100 iterations and base encoding for 128 iterations.We observe a pattern similar to the previous experiment: SR-D obtains higherrecall than LSQ in all cases. However, we also observe that SR-C largely outper-forms both LSQ and SR-D on the VGG dataset, which has internally been com-pressed by the network down to 128 dimensions. This suggests that SR-C mightbe better suited for applications that use deep features, which currently dominatecomputer vision applications.SIFT1M – 64 bits SIFT1M – 128 bitsR@1 R@10 R@100 R@1 R@2 R@5PQ 22.59±0.43 59.90±0.28 91.98±0.17 44.53±0.40 60.49±0.40 78.94±0.28OPQ 24.43±0.41 63.73±0.27 94.20±0.16 46.14±0.50 62.27±0.36 80.97±0.27ERVQ 22.37±0.34 60.48±0.43 92.58±0.17 42.31±0.80 57.91±0.98 76.77±0.96ERVQ 23.63±0.37 63.25±0.46 93.92±0.24 45.43±0.40 61.64±0.51 80.11±0.43CQ 16.29 49.95 85.59 33.73 48.03 66.25AQ 26 70 95 − − −LSQ-128 29.85±0.23 73.10±0.58 97.46±0.13 55.62±0.38 72.85±0.19 89.46±0.21SRD-128 30.72±0.29 74.25±0.36 97.65±0.10 56.09±0.35 73.02±0.27 89.66±0.33SRC-128 30.60±0.32 73.80±0.34 97.50±0.13 52.16±0.37 69.07±0.37 86.40±0.35Table 4.4: Detailed recall@N values on the SIFT1M dataset.Finally, Tables 4.4, 4.5 and 4.6 show the detailed performance of our methods64100 101 102 103 104 105Seconds (log)0.000.050.100.150.200.250.300.35Recall@1SIFT1M 64 bitsCQLSQSR-CSR-DLSQ GPUSR-C GPUSR-D GPUERVQRVQOPQPQ100 101 102 103 104 105Seconds (log)0.200.250.300.350.400.450.500.550.60Recall@1SIFT1M 128 bits100 101 102 103 104 105Seconds (log)0.000.050.100.150.200.250.300.35Recall@1Deep1M 64 bits100 101 102 103 104 105Seconds (log)0.200.250.300.350.400.450.500.550.60Recall@1Deep1M 128 bits100 101 102 103 104 105Seconds (log)0.000.050.100.150.200.250.300.35Recall@1VGG 64 bits100 101 102 103 104 105Seconds (log)0.200.250.300.350.400.450.500.550.60Recall@1VGG 128 bitsFigure 4.3: Running time vs recall@1 on the SIFT1M, Deep1M and VGGdatasets.65Deep1M – 64 bits Deep1M – 128 bitsR@1 R@10 R@100 R@1 R@2 R@5PQ 09.13±0.41 33.54±0.24 73.14±0.54 27.98±0.28 41.16±0.45 60.10±0.47OPQ 15.90±0.43 52.82±0.47 89.61±0.29 35.47±0.59 50.75±0.57 71.15±0.31RVQ 15.90±0.53 52.64±0.49 89.80±0.38 35.38±0.73 50.70±0.91 71.45±0.68ERVQ 17.87±0.37 56.73±0.39 91.95±0.17 38.76±0.48 55.05±0.54 75.76±0.61LSQ-128 21.95±0.42 64.83±0.61 95.30±0.25 41.15±0.44 57.90±0.53 78.18±0.37SRD-128 22.60±0.55 66.64±0.49 95.75±0.15 42.28±0.62 58.95±0.57 79.23±0.45SRC-128 22.56±0.33 66.30±0.36 95.75±0.16 42.83±0.29 59.77±0.42 80.01±0.30Table 4.5: Detailed recall@N values on the Deep1M dataset.VGG – 64 bits VGG – 128 bitsR@1 R@10 R@100 R@1 R@2 R@5PQ 07.32±0.30 29.91±0.45 73.39±0.44 25.25±0.44 37.82±0.55 57.12±0.32OPQ 09.84±0.18 37.68±0.51 81.26±0.39 26.84±0.22 39.48±0.25 58.89±0.36RVQ 14.23±0.20 51.62±0.53 93.02±0.15 32.34±0.81 47.72±0.82 69.19±0.90ERVQ 14.67±0.42 52.67±0.38 93.89±0.25 35.60±0.31 51.36±0.59 72.98±0.40LSQ-128 18.44±0.26 61.13±0.39 96.49±0.16 39.60±0.64 56.12±0.52 77.38±0.37SRD-128 19.36±0.43 63.59±0.41 97.48±0.14 41.17±0.31 58.15±0.57 79.11±0.22SRC-128 19.60±0.29 63.87±0.32 97.43±0.13 44.67±0.45 62.62±0.37 83.60±0.23Table 4.6: Detailed recall@N values on the VGG dataset.and a comparison with previous work. We can again observe that the SR variantsconsistently obtain state-of-the-art performance when the computational budget isidentical to previous work. SR-C works specially well on the VGG dataset; for ex-ample, when using 64 bits, SR-C gains 0.016 (relative 6.2%) in Recall@1 and 0.05points (relative 12.8%) when using 128 bits with respect to LSQ. In this challeng-ing dataset, OPQ obtains a much smaller improvement of 0.015 (resp 0.016) overPQ. Finally, we observe a modest improvement of SR on SIFT1M, consistent withprevious work (Babenko and Lempitsky, 2016) which has shown that hand-craftedfeatures show small improvements with non-orthogonal codebooks.4.4.3 Comparison against Competitive quantization (COMPQ)Competitive quantization (COMPQ) (Ozan et al., 2016a) is an MCQ method basedon Stochatic gradient descent (SGD) with a batch size of one. To bypass the non-diffenrentiability of the encoding step, the authors consider the chosen codes as the66single sample of a function that produces soft assignments into the codebooks. Thisapproximation can be seen as a single-sample estimate of a policy gradient (Suttonet al., 2000).So far, in our evaluation we have omitted a direct comparison with COMPQ.This might strike the reader as odd, since COMPQ reports the best result to date onSIFT1M (recall@1 of 0.352). Part of the reason is that the experiments reportedby Ozan et al. (2016a) consider a quadratic query time, while in our experimentswe have set one codebook aside to keep query time linear – and thus comparablewith our baselines.Since there is no publicly available implementation of COMPQ, we have tried toreproduce the reported results ourselves, with moderate success. We have not, forexample, been able to reproduce the transform coding initialization reported in thepaper, but have instead used RVQ (Chen et al., 2010), which Ozan et al. reported toachieve slightly worse results. We obtained a recall@1 of 0.346 using 250 epochs(the parameter used in the best reported result). The small difference in recall maybe attributed to our different initialization.However, the largest barrier to experimentation on our side is that our COMPQimplementation, written in Julia, takes roughly 40 minutes per epoch to run. Thismeans that our experiment on SIFT1M with 64 bits took almost one week to finish.We contacted the COMPQ authors, and they mentioned using a multithreaded C++implementation with pinned memory, an ad-hoc sort implementation, and specialhandling of threads. Their implementation takes 551 seconds per epoch, or about38 hours (∼ 1.5 days) in total for 250 epochs on a desktop with a 10-core XeonE5 2650 v3 @2.3 GHz CPU. This is over three orders of magnitude more compu-tationally expensive than Product quantization (PQ) (Je´gou et al., 2011), a baselineconsidered in the same publication.We compare COMPQ to LSQ and SR-D using a multithreaded C++ encodingimplementation, as well as our GPU implementation, and show our results on Ta-ble 4.7. We use m = 8 codebooks in total, which controls for query time andmemory use with respect to COMPQ. With only 25 training iterations and 32 ILSiterations, SR-D narrows the difference between COMPQ and LSQ by 50%, from0.012 to 0.006, and is also overall faster due to the faster codebook update.We iteratively double the computational budget of SR-D (trading off compu-67Iters Init Training Base encoding Total R@1COMPQ (C++) 250 – – – 38 h 0.352LSQ (Julia, C++) 25 2.6 m 6.34 m 5.8 m (32 iters) 15.2 m 0.340LSQ (Julia, CUDA) 25 1.1 m 2.8 m 29 s (32 iters) 4.4 m 0.340SR-D (Julia, CUDA) 25 1.1 m 33 s 29 s (32 iters) 2.1 m 0.346SR-D (Julia, CUDA) 50 2.2 m 1.1 m 58 s (64 iters) 4.3 m 0.348SR-D (Julia, CUDA) 100 4.4 m 2.2 m 1.9 m (128 iters) 8.5 m 0.351SR-D (Julia, CUDA) 100 4.4 m 2.2 m 3.9 m (256 iters) 10.5 m 0.353Table 4.7: Comparison between CompQ, LSQ and LSQ++ on SIFT1M using64 bits.tation for accuracy), and bring the difference in recall to 0.001 with 100 trainingiterations and 128 Base encoding ILS iterations. Doubling the budget of this finalstep puts our method above CompQ by 0.001 in recall@1. Even under these cir-cumstances, SR-D is 200× faster than COMPQ. Finally, we note that, since COMPQis based on SGD with a batch size of one, it is unlikely to benefit from the parallelarchitectures found in modern GPUs, so the gap in speed with our method is noteasy to close.4.5 Conclusions and future workWe have introduced two stochastic relaxation methods for MCQ that provide inex-pensive approximations to simulated annealing. The methods consistently improveupon the state of the art, without requiring extra computation. We have also intro-duced a method for fast codebook update that results in faster training in MCQ.Both our improvements can be used as out-of-the-box improvements on top ofLSQ, and define a new state of the art in Multi-codebook quantization (MCQ).It is intuitively appealing to think of Stochastic eelaxation (SR) as an approx-imation to Simulated annealing, and it would be interesting to further investigatethis theoretically. In particular, we have noticed that the noise introduced in SR re-sults in a perturbation of the parameters that define the encoding Markov RandomFields, which suggests a connection between our method and the perturb-and-MAPframework of Papandreou and Yuille (2011). Exploring this theoretical connection68is an interesting area of future work.Finally, we note that while the bottleneck of our method now lies in updatingthe codes, for Composite quantization (CQ) (Zhang et al., 2014), the computationalburden is in the codebook update step. Future improvements to CQ may focus onusing recent efficient optimizers for large-scale continuous problems, such as thestochastic average gradient (Schmidt et al., 2017).69Chapter 5Automatic hyperparameteroptimization of MCQ algorithmsI think my spaceship knows which way to go—David BowieThe most computationally challenging part of multi-codebook quantization(MCQ) optimization is the encoding step, which amounts to maximum-a-posteriori(MAP) estimation of multiple fully-connected Markov random fields (MRFs). InChapter 3, we introduced a method based on iterated local search (ILS) for thisproblem. We have also introduced a graphics processor unit (GPU) implementationof our encoding algorithm, which reduces the encoding time in practice.In Chapter 4, we also introduced a method for codebook update which is ordersof magnitude more efficient than conjugate gradient methods. Together with ourGPU encoding implementation, these contributions greatly reduce training time inMCQ. We further introduced a simple modification to our algorithm that adds noiseto either the data or the codebooks, according to a pre-defined schedule, resultingin improved results.The algorithms introduced in Chapters 3 and 4 have a number of hyperparam-eters that control their behaviour and affect the quality of the optimization. Exam-ples of these hyperparameters include the number of iterations in ILS, the numberof iterated conditional modes (ICM) iterations, and the formula of the temperature70decay, which is itself parameterized by a scalar that controls the decay speed. Sofar, we have set all the hyperparameters to the same values in all our experiments,tunning their values from exhaustive search in a small search range. Thus, the per-formance attained by our algorithms demonstrates their robustness across a rangeof different input distributions including raw images, handcrafted image featuresand features learned by deep convolutional neural networks. This robustness isimportant to make our algorithms available to a large user base, and gives em-pirical evidence on the performance of the algorithm as out-of-the-box tools forlarge-scale MCQ.However, one of our main use cases of interest is the preprocessing of verylarge-scale datasets (with n ≥ 109 vectors) for fast approximate nearest neigh-bour search (ANNS). In this case, it is reasonable for a practitioner to spendtime finding a set of hyperparameters that work best on the distribution of interest.This dataset-specific hyperparameter tuning is already commonplace in algorithmsthat perform short-listing on very large datasets such as the Inverted File (Je´gouet al., 2011) and the inverted multi-index (Babenko and Lempitsky, 2012). Im-portantly, Matsui et al. (2015) found that the hyperparameters of these large-scaleindices have a large influence in their performance, which motivated them to de-velop the PQTable (Matsui et al., 2015), a hyperparameter-free data structure forlarge-scale indexing.Similarly, previous MCQ algorithms such as composite quantization (CQ) andsparse composite quantization (SCQ) (Zhang et al., 2014, 2015), require the settingof several hyperparameters, and we have learned, after inspecting the experimentalsource code, that the performance reported in their respective publications wasattained after setting these hyperparameters to different values for each dataset –a detail that is unfortunately missing in the publications. Therefore, performingdataset-specific hyperparameter optimization of our algorithms also amounts tobenchmarking on equal grounds with previous work.In this chapter, we explore the use of meta-algorithms for automatic hyperpa-rameter tuning, and apply them to MCQ. Our goal is to discover the performancepotential of our algorithms when they are adapted to specific input domains.715.1 Background and related workMachine learning algorithms typically have both parameters and hyperparameters.The parameters of the model are the set of values that are found by an optimizationalgorithm, while the hyperparameters are values that control either the space andsize of the model, or the behaviour of said optimization algorithm.For example, in a deep neural network, the parameters include the weightsof each layer, such as the convolutional filters or linear transformations, whichare typically learned via Stochatic gradient descent (SGD). The hyperparametersinclude the architecture of the network itself, the learning rate of SGD and themethod for weight initialization.Automatic hyperparameter optimizationSince hyperparameters are not typically optimized by an automated algorithm, it iscommon for researchers to set the hyperparameters of an algorithm “by hand”; thatis, by trial and error on a small set of values, or by grid search (exhaustive com-binatorial enumeration over a predefined range). Prominent deep learning tech-niques (Krizhevsky et al., 2012; Sutskever et al., 2014) owe their success, at leastpartially, to the finding of a set of hyperparameters that are stable enough such thatmanual hyperparameter tuning does not constitute a barrier to achieve state-of-the-art results on a number of benchmarks.In contrast, automatic hyperparameter optimization techniques (Bergstra et al.,2011; Hutter et al., 2007, 2011; Snoek et al., 2012) aim to replace the hand tuning ofhyperparameters with more principled algorithms that, given a fixed computationalbudget, trade-off exploitation and exploration to find the best hyperpameters for thetask at hand.Most methods for hyperparameter optimization are based on black-box opti-mization; i.e., they have no access to the details of the function (or algorithm) thatthey are trying to optimize, other than the function output for a given set of (hyper)parameters. Some methods thus internally model the optimization landscape usingdifferent underlying assumptions, or use simple heuristics to probe the hyperpa-rameter space.72ParamILSHutter et al. (2007) treat the hyperpameters of an algorithm as parameters to opti-mize in an iterated local search (ILS) framework. Given an initial hyperparameterconfiguration, ParamILS iteratively changes a single parameter (i.e., considers theone-exchange neighbourhood), and uses iterative first improvement as a surrogatelocal search procedure. The new solution (set of hyperparameters) is accepted if ityields better or equally good results. After each iteration, the search is randomlyrestarted from scratch with a small probability. Hutter et al. demonstrate goodresults optimizing Boolean satisfiability (SAT) solvers and randomly generated bi-nary Bayesian networks.SMACHutter et al. (2011) introduce Sequential model-based optimization for algorithmconfiguration (SMAC), which uses a random forest to predict expected hyperparam-eter performance. Sequentail model-based optimization, often known as “Bayesianoptimization” in the machine learning literature (Brochu et al., 2010), often usesGaussian processes to model the optimization landscape. Using random forestsinstead of Gaussian processes allows a more elegant approach to deal with integerand categorical parameters, and provides a natural way to quantify uncertainty inthe estimation. The authors demonstrate performance on par with or superior toParamILS on multi-instance algorithm configuration as applied to distributions ofBoolean satisfiability (SAT) problems, and to configure a commercial mixed integerprogramming solver.HyperoptBergstra et al. (2011) introduce Hyperopt, a software package that instead of mod-elling the performance of an algorithm given its hyperparameters, models the prob-ability that the performance lies between a fixed quantile of the losses observed sofar. These quantiles are in turn modeled with a Tree Parzen estimator, which alsoallows for efficient closed-form interpolation of expected hyperparameter perfor-mance.73SpearmintSnoek et al. (2012) implement Bayesian optimization to sequentially train machinelearning models under different hyperparameters. Spearmint assumes that the out-puts of the unknown function (the performance of the machine learning model)correspond to samples from a Gaussian process. Thus, it is possible to estimatea posterior of the parameter distribution and, thanks to the Gaussian assumption,finding the maximum of the surrogate function is tractable.5.1.1 Choosing a hyperparameter optimizerEggensperger et al. (2013) perform a head-to-head empirical comparison betweenSMAC, Hyperopt, and Spearmint on a series of benchmarks including MNIST clas-sification (LeCun et al., 1998), online latent Dirichlet allocation (Blei et al., 2003)for Wikipedia articles, neural networks, and the Auto-Weka framework (Thorntonet al., 2013). For a fixed running time or number of function evaluations, SMACfound better results than its competitors in 12 out of 15 scenarios. Based on thisempirical evidence, we have decided to use SMAC for our experiments.5.2 Experimental setupWe automatically optimize the hyperparameters of our algorithms using a Pythonimplementation of SMAC, which calls our own routines implemented in Julia. Aftermultiple iterations, SMAC returns the best hyperparameter configuration it found(also called the “incumbent” configuration), and the mean value that its internalmodel predicts for the best configuration.We list the parameters that we expose to SMAC in Table 5.1. We also enforce asimilar running time across different hyperparameter configurations by setting thenumber of ICM iterations to b32/Number of ILS iterationsc (since in the defaultconfiguration b32/8c = 4 ICM iterations). We let SMAC test 200 hyperparameterconfigurations in total, and we add the constraints that the Temperature scheduleand the Temperature decay should only be set when either SR-C or SR-D are cho-sen as the Stochastic eelaxation (SR) method. We initialize the codes B and thecodebooks C with Optimized product quantization (OPQ) followed by Optimizedtree quantization (OTQ). Finally, we perform 25 alternating updates of the codes B74Hyperparameter Type Domain DefaultNumber of ILS iterations Integer {1,2,3, . . . 16} 8Number of codebooks to perturb Integer {0,1,2, . . . m} 4Randomize the ICM codebook order Boolean [true, false] trueSR method Categorical {LSQ, SR-D, SR-C} SR-DTemperature schedule (Eq. 4.7-4.9) Categorical {1st, 2nd, 3rd} 1stTemperature decay Continuous [0,1] 0.5Table 5.1: MCQ hyperparameters optimized by SMAC; m corresponds to thenumber of codebooks (7 or 15, depending on the memory budget). Thedefault values correspond to the configuration that we have used in allour experiments so far.and the codebooks C.For datasets with train/query/base partitions, we encode the base set with thesame number of ILS iterations as given by SMAC, and set the total number of ICMiterations to b128/Number of ILS iterationsc. This computational budget is similarto that used for the results reported in Chapter 3.5.3 ResultsFor each dataset, we report the best configuration found by SMAC. We then runthis configuration 10 times, and report the obtained recall@1. If under this sce-nario the new configuration is consistently better than the default configuration,we also increase the computational budget to 100 updates of B and C, and, fortrain/query/base datasets, encode the base set for b128/Number of ILS iterationscICM iterations, which roughly corresponds to computational budget of the resultsreported in Chapter 4.Configurations found by SMACWe show the configurations obtained by SMAC in Table 5.2.We note that SMAC always chooses to perturb at least one code, which is inline with our finding from Chapter 3 that adding perturbations to the local searchroutine is necessary to escape local minima. SMAC also always prefers SR-D or75LabelMe MNISTHyperparameter 64 bits 128 bits 64 bits 128 bitsNumber of ILS iterations 9 8 9 8Number of codebooks to perturb 1 4 5 4Randomize the ICM codebook order true true false falseSR method SR-D SR-D SR-D SR-DTemperature schedule (Eq. 4.7-4.9) 1st 1st 1st 1stTemperature decay 0.43 0.5 0.19 0.83SIFT1M Deep1M VGGHyperparameter 64 bits 128 bits 64 bits 128 bits 64 bits 128 bitsNumber of ILS iterations 8 7 8 15 8 10Number of codebooks to perturb 4 2 4 2 4 5Randomize the ICM codebook order true true true true true falseSR method SR-D SR-D SR-D SR-C SR-C SR-CTemperature schedule (Eq. 4.7-4.9) 1st 1st 1st 1st 1st 1stTemperature decay 0.65 0.19 0.5 0.95 0.71 0.94Table 5.2: MCQ hyperparameters obtained by SMAC on query/base datasets(top) and train/query/base datasets (bottom).SR-C over LSQ, which confirms our intuition from Chapter 4 that SR is alwaysbetter than plain LSQ. Finally, SMAC also suggests the use of Schedule 1 (Eq 4.7)over the other temperature decay options; this is in line with our own findings, andthose of Zeger et al. (1992).Other parameter settings are more surprising. For example, SMAC has droppedthe randomization of ICM on MNIST, both for 64 and 128 bits, and for VGG with128 bits. The temperature decay is also more pronounced (p> 0.9) on the Deep1Mand VGG datasets. Finally, there is ample variety in the number of ILS iterationsincluding {7,8,9,10,15}, as well as in the number of codes to perturb: {1,2,4,5}.Testing the new configurationsWe show the results obtained by the SMAC configurations and LSQ, SR-D and SR-Cin Tables 5.3, 5.4, 5.5, 5.6, and 5.7. We omit values for SMAC when it did notsuggest a different configuration from the default one.On MNIST (Table 5.3) at 64 bits, the configuration found by SMAC obtainsa 1% absolute improvement over the previous best results obtained by SR-D. We76MNIST – 64 bits MNIST – 128 bitsR@1 R@10 R@100 R@1 R@2 R@525updates LSQ-8 41.88±0.28 93.60±0.20 99.99±0.01 63.26±0.41 82.12±0.50 96.28±0.20SR-D-8 42.30±0.33 93.74±0.15 99.99±0.01 63.65±0.28 82.41±0.32 96.41±0.20SR-C-8 41.80±0.36 93.55±0.27 99.99±0.01 61.36±0.27 80.43±0.24 95.48±0.21SMAC 43.62±0.41 94.36±0.14 100.00±0.01 64.09±0.29 82.81±0.23 96.61±0.20100updates LSQ-8 42.38±0.49 93.82±0.17 99.99±0.01 64.93±0.40 83.60±0.35 96.91±0.13SR-D-8 42.90±0.34 93.83±0.27 99.99±0.01 65.67±0.31 84.33±0.21 97.15±0.09SR-C-8 42.26±0.26 93.53±0.28 99.99±0.01 61.18±0.38 80.26±0.32 95.33±0.20SMAC 45.28±0.53 95.31±0.22 100.00±0.01 65.81±0.39 84.49±0.17 97.21±0.13Table 5.3: Detailed recall@N values on the MNIST dataset.further investigate this result by running the configuration for 100 iterations – thecomputational budget of the results presented in Chapter 4. This leads to a totalimprovement of 2.38% (absolute) on this dataset, which defines a new state of theart. At 128 bits, the improvements are smaller – between 0 and 1% – yet consistent.On the remaining datasets, the configurations obtained by SMAC do not consis-tently improve upon the default baseline. This suggests that the internal model ofSMAC may have modelled noise in recall@N as an improvement.LabelMe – 64 bits LabelMe – 128 bitsR@1 R@10 R@100 R@1 R@2 R@525updates LSQ-8 35.66±0.71 84.32±0.76 99.76±0.09 53.73±0.64 72.05±0.49 89.81±0.38SR-D-8 36.64±0.69 85.23±0.56 99.78±0.09 54.51±0.86 72.78±0.92 90.70±0.67SR-C-8 34.04±1.67 82.15±0.86 99.54±0.12 44.54±0.90 60.91±1.23 81.38±1.17SMAC 36.40±0.94 85.16±0.49 99.80±0.11 – – –Table 5.4: Detailed recall@N values on the LabelMe dataset.SIFT1M – 64 bits SIFT1M – 128 bitsR@1 R@10 R@100 R@1 R@2 R@525updates LSQ-32 29.42±0.34 72.43±0.43 97.36±0.17 55.15±0.34 72.14±0.43 88.86±0.29SR-D-32 29.98±0.36 72.87±0.33 97.47±0.15 54.96±0.64 72.05±0.49 88.98±0.17SR-C-32 29.55±0.45 72.89±0.41 97.40±0.18 52.24±0.47 69.15±0.28 86.56±0.21SMAC 29.73±0.19 72.96±0.31 97.43±0.12 55.11±0.30 72.30±0.34 88.94±0.24Table 5.5: Detailed recall@N values on the SIFT1M dataset.77Deep1M – 64 bits Deep1M – 128 bitsR@1 R@10 R@100 R@1 R@2 R@525updates LSQ-32 21.11±0.17 63.56±0.49 94.81±0.16 39.24±0.49 55.48±0.49 76.04±0.34SR-D-32 21.33±0.43 63.94±0.30 95.02±0.21 39.88±0.48 56.33±0.44 76.76±0.45SR-C-32 21.43±0.22 64.31±0.48 95.15±0.21 40.07±0.33 56.41±0.37 77.02±0.39SMAC – – – 40.02±0.43 56.65±0.46 76.80±0.30Table 5.6: Detailed recall@N values on the Deep1M dataset.VGG – 64 bits VGG – 128 bitsR@1 R@10 R@100 R@1 R@2 R@525updates LSQ-32 17.63±0.33 59.22±0.48 95.85±0.14 38.43±0.55 54.91±0.37 76.26±0.44SRD-32 18.52±0.38 60.86±0.45 96.60±0.26 39.19±0.38 55.95±0.34 76.84±0.36SRC-32 18.56±0.43 61.16±0.48 96.63±0.25 42.13±0.32 59.47±0.34 80.81±0.46SMAC 18.33±0.24 60.82±0.45 96.65±0.18 41.42±0.38 58.48±0.59 79.75±0.46Table 5.7: Detailed recall@N values on the VGG dataset.5.4 Conclusions and future workWe have experimented with an automatic approach to algorithm configuration for MCQ.Using this method, we have found a new state-of-the-art configuration on MNIST,a dataset of raw images, but otherwise have obtained no improvement over thealgorithm configuration that we had found by hand.Our experiments were rather small-scale, as we were able to run them in lessthan a week on a standalone desktop with a single GPU. Future work may explorebroader configurations (for example, including the OPQ initialization), or may letthe optimization run for more iterative updates in each function probe.78Chapter 6Deep stochastic local search:learning SLS perturbations withdeep reinforcement learningI stopped my dreamingI won’t do too much scheming these days—NicoIn Chapter 3, we introduced a method based on iterated local search (ILS) forthe encoding step of multi-codebook quantization (MCQ). Our encoding method isbased on principles from stochastic local search (SLS) (Hoos and Stu¨tzle, 2004), afamily of techniques based on random perturbations and iterative search known towork well for various NP-hard problems.Drawing inspiration from general principles and designing new algorithms forspecific problems is a typical approach to algorithm design. Doing so allows us totransfer knowledge across research fields, and gives us intuitive guiding principlesto rely on during the design process.An appealing alternative, however, is to avoid designing the encoding algo-rithm altogether, and instead learn a function that maps vectors to their assignmentsin a series of codebooks. Since the encoding problem is equivalent to maximum-a-posteriori (MAP) estimation in fully-connected Markov random fields (MRFs),79finding the optimal solution – even in a small set of vectors – is likely to be com-putationally demanding. In other words, acquiring labels (or “ground truth”) tosupervise the learning of an encoding function is inherently hard. Reinforcementlearning (RL) (Sutton and Barto, 1998) provides a framework to learn functionswithout supervision, as long as we can provide a “reward” function.However, learning an encoding function from scratch may prove too difficultwith limited computation – MAP estimation is, after all, an NP-hard problem towhich the research community has devoted many years of research. The samecommunity has, at the same time, developed useful guiding algorithmic principlessuch as SLS. In this chapter, we ask: is there a way in which we can combine thepower of SLS algorithms and reinforcement learning? We explore answers to thisquestion in the context of learning algorithms for MCQ encoding.6.1 Related workReinforcement learning (RL) is a classical subfield of artificial intelligence, distinctfrom the more common supervised learning in that RL can be used when no labelsare available. The most prominent reference in the subject is the book by Suttonand Barto (1998).In the most common setup, we have both an environment and an agent. For-mally, the environment is modelled as a set of states s ∈ S, and the agent mayperform a set of actions a ∈ A. The environment evolves stochastically, depend-ing on both the current state and the action that the agent takes, which under aMarkovian assumption can be fully described with a transition probability matrixP(s′ | s,a). Associated with each state-action pair, there is a (potentially latent)reward R : S×A→R that the agent obtains after each action. The goal is to learn adistribution over actions given the state (a.k.a. a “policy”) pi(a | s) that maximizesreward over a series of actions.6.1.1 Stochastic local search (SLS)Stochastic local search (SLS) (Hoos and Stu¨tzle, 2004) defines a family of meth-ods that alternate between searching the neighbourhood of a solution and stochas-tically perturbing the solution to escape local minima. Prominent examples of80SLS algorithms include Markov Chain Monte Carlo (MCMC), Simulated Anneal-ing (which we used in Chapter 4) and Iterated local search (ILS) (Lourenc¸o et al.,2003), (which we used in Chapter 3).A commonality of SLS algorithms is that they all, at some point in the searchprocess, sample from a distribution to perturb the current solution. In many cases,this distribution is either uniform, or otherwise amenable to parameterization andmanipulation, such as a conjugate prior. We note that this distribution could insteadbe learned using, for example, a neural network. Furthermore, in a deep RL setting,this distribution could be seen as a “policy”, and a pre-designed SLS algorithmcould be seen as the “environment”. This effectively allows us to leverage priorknowledge from SLS algorithms, incorporating an automatic learning component.6.1.2 Deep reinforcement learningRL provides a very general formulation that can in principle be applied to a widerange of problems. Classically, the agent would perceive the world via a set ofhandcrafted features. These features, however, have recently been replaced by deepneural networks that can be trained end-to-end for the task at hand. The intersectionbetween deep learning and RL is often called “deep reinforcement learning” (ordeep RL), and is our main subject of study in this chapter. Prominent breakthroughsof deep RL include attaining human-level performance in Atari games (Mnih et al.,2015), and superhuman performance in chess and the game of Go (Silver et al.,2017).Since deep neural networks are typically trained with stochastic gradient de-scent (SGD) and the agent often makes a discrete action in the environment, a largebody of work in deep RL focuses on methods to approximate gradients to train neu-ral networks in discrete settings (Bengio et al., 2013; Sutton et al., 2000; Williams,1992).Visual attentionIn modern convolutional neural networks (CNNs), the computational cost of pro-cessing an image is proportional to the image resolution. However, in tasks suchas classification, the object of interest is often located in a relatively small part of81the image. Moreover, in tasks such as image captioning, where the model predictsone word at a time, it is common for different words to refer to different objectsin the image. Xu et al. (2015) model the visual processing in image captioningas a convolutional network, and predict words using an Long Short-Term Memoryunit (LSTM) (Hochreiter and Schmidhuber, 1997). To model visual attention, theLSTM selects a square region in the image at each prediction timestep. Since se-lecting a single region of an image is a non-differentiable operation, the authors usethe REINFORCE algorithm (Williams, 1992) with multiple samples to estimate thegradient for backpropagation.Optimizing language metricsRennie et al. (2017) notice that most metrics of language prediction, often bor-rowed for image captioning, are non-differentiable, and thus models are oftentrained on surrogate losses such as cross entropy. To bridge this gap between ob-jectives, the authors use self-critical learning, a variant of the REINFORCE algo-rithm where the magnitude of the gradient is normalized with respect to the currentprediction of the model, instead of a randomly sampled baseline. Crucially, the au-thors pre-train their models with a cross-entropy loss, and after supervised trainingplateaus they switch to RL training.Superoptimizing programsBunel et al. (2017) apply deep RL to learn high-performance assembly programsfor problems from the “hacker’s delight” microbenchmarks (Warren, 2013). Theauthors define the reward function as a linear combination of speed and correctnessover a large collection of unit tests. The neural network takes as input a bag-of-words representation of the output of the GCC compiler, and as output predicts achange to a candidate program. After multiple iterations and proposals, the finalprogram obtained by the network is the output of the algorithm. This is very similarto a Monte-Carlo search, where the proposals are guided by the policy that thenetwork learns. In RL terms, the space of programs is the set of states S, the actionsA are changes to the current program, and the policy is the distribution over changesthat the network proposes.826.2 Problem formulationAlthough the MCQ formulation contains two latent variables (the codebooks C andthe codes B), we remind the reader that updating C is fairly easy; as in Chapter 4,we demonstrated that this can be done in less than a second for up to one millionvectors. The main challenge lies in updating the codes B. For this reason, weassume that the codebooks are given in our formulation, and focus on learning afunctionf ∗ = min f1nn∑i=1L ( f (xi,θ),C) , (6.1)whereL is a loss function, and f has two main components:1. A neural network with learnable parameters θ that predicts a series of distri-butions over codes f (x,θ)→ p(bˆ|x,θ). Since these are probability distribu-tions, their components sum to one: ∑hk=1 bˆ j,k = 1 for all j.2. An SLS-based method that uses these distributions in the perturbation step,and produces a final set of codes [b1,b2, . . . ,bm]. The intuition is that thesedistributions should be learned such that they accelerate the finding of goodsolutions by SLS algorithms. For this reason, we refer to our approach asdeep stochastic local search.In our case, we experiment with several architectures for the first part of thefunction, and for the SLS method we use the encoding algorithm based on ILS withICM encoding that we introduced in Chapter 3. The loss function – its negativebeing the reward – is simply the quantization error obtained by stochatic localsearch after using the proposed distributions and running for a number of iterations:L ( f (xi,θ),C) = ‖x−∑mj C j ·b j‖22.6.3 Experimental setupIn our setup, we assume that a set of codebooks fully determines the set of statess ∈ S, as they define the unary and binary terms of the MRF. An action a is theprediction of a distribution over the set of codes, which our SLS algorithm samplesfrom to compute quantization error – the negative of the “reward” R(s,a). The83objective is to learn to predict a sampling distribution given the states – a “policy”pi(a | s) that the SLS algorithm can use to obtain good results faster than with auniform distribution, as is typically used in SLS.DatasetsWe experiment on the train/query/base datasets that we used in Chapters 3 and 4:SIFT1M, VGG and Deep1M. We explicitly avoid query/base datasets such as La-belMe and MNIST because they lack a separate training partition. We learn ourfunctions on the training sets, and test for generalization on the base set. Our goalis to improve recall@N for a fixed computational encoding budget.Network designsWe experimented with three main network architecture:1. An early fusion network, which has a small set of shared parameters be-fore branching off to predict each codebook independently. The network isdepicted in Figure 6.1 (a).2. A late fusion network, which shares most of the parameters and branches offlate to predict each codebook. This network is depicted in Figure 6.1 (b).3. A recurrent neural network, which predicts each code sequentially, depictedin Figure 6.2.Our feed-forward architectures are based on the bilinear block, a computation-ally efficient building block that has recently achieved state-of-the-art performanceon 3d human pose estimation (Martinez et al., 2017b). Similarly, our recurrent ar-chitecture is based on a residual block that has shown state of the art performanceon human motion prediction (Martinez et al., 2017a).TrainingWe train our networks approximating the gradient with 32 runs of LSQ, and leteach episode run for 8 ILS iterations. We experimented with both self-criticallearning (Rennie et al., 2017) (i.e., using the current prediction of the model to84x+(a) (b)b̂1 , …b̂2 ,SLS( )b̂mx…++++ +b̂1 , …b̂2 ,SLS( )b̂mLinear Batch Norm Dropout 0.5RELUFigure 6.1: (a) Architecture with early fusion. (b) Architecture with late fu-sion.x+LSTMb̂1 , …b̂2 ,SLS( )b̂m…+LSTM+LSTMLinear Batch Norm Dropout 0.5RELUFigure 6.2: Recurrent architecture.85normalize the reward) and a random sampling baseline (as is common in RL), andfound no difference in performance. We also experimented with the three archi-tectures mentioned above, and obtained the best performance with the early fusionmodel.We implemented our networks on Tensorflow (Abadi et al., 2016), and trainedthem for 4 epochs with Adam (Kingma and Ba, 2014), and a learning rate of 0.001and a linear decay of 0.99 – this takes around one day of computation, and theseparameters have shown good results on similar setups (Martinez et al., 2017b).6.4 ResultsSince we are interested in accelerating the optimization speed of LSQ encoding,we compare vanilla LSQ, SR-D and SR-C (that is, the methods sampling from uni-form distributions), vs the same methods sampling from learned distributions. Theresults for our main network are shown on Figure 6.3.We observe that learning a sampling distribution barely makes a difference inmost cases. The only exception is on the VGG dataset at 128 bits, where SR-C witha uniform distribution has a poor start, and catches up with other methods at 16 ILSiterations to finally outperform the rest of the baselines by rougly 0.05 in recall. Inthis case, learning a sampling distribution accelerates convergence during the first16 iterations, but progress is roughly equal after 32 iterations.6.5 Conclusions and future workWe have shown our attempts to learn encoding functions for MCQ combiningStochastic local search (SLS) with deep reinforcement learning. We have describedthe learning algorithms and network architectures that we used, an discussed thelimitations of our approach. In particular, the functions that our models learned didnot noticeably improve upon an SLS baseline with a uniform (not learned) distri-bution in most cases, and never did so after 16 ILS iterations.Multiple reasons could be to responsible for the shortcomings of our approachin experimental performance. For once, MAP estimation might simply be a hardproblem for a neural network to solve, as we are not aware of deep RL outper-forming classical algorithms in similar scenarios. Another explanation might be86Figure 6.3: Recall@1 as a function of ILS iterations on the SIFT1M, VGGand Deep1M datasets. The methods with “RL” are using learned sam-pling distributions, while those without “RL” correspond to the methodsintroduced in Chapters 3 and 4 (ie, using a uniform distribution for per-turbation sampling).that we simply need larger models and more computation to robustly learn sam-pling distributions; after all, recent breakthroughs in deep RL (Mnih et al., 2015;Silver et al., 2017) have used several orders of magnitude more computation thanis currently available to us. Revisiting these approaches in the future, with moreavailable compute cycles and more robust deep RL algorithms, is an interestingarea of future work.87Chapter 7Rayuela.jl: A package forlarge-scale similarity searchThe paramedic thinks I’m clever cause I play guitarI think she’s clever cause she stops people dying—Courtney BarnettTogether with this thesis we are releasing Rayuela.jl, a software package forlarge-scale similarity search written in Julia (Bezanson et al., 2014). Rayuela.jlis available under an MIT licence, and can be downloaded at https://github.com/una-dinosauria/Rayuela.jl. Multiple motivations led us to develop our library:• As an approximate nearest neighbour search (ANNS) tool. In practice,approximate nearest neighbour search (ANNS) is rarely a problem of interestby itself. In other words, ANNS typically arises as a subproblem of other tasksuch as large-scale retrieval or classification. We want to release a library thatdevelopers can use to tackle large-scale ANNS problems that they run into,therefore maximizing the impact of our research.• As a free alternative to FAISS. Facebook AI Similarity Search (FAISS) (John-son et al., 2017) is another library that implements multiple MCQ baselines.Unfortunately, FAISS currently implements only orthogonal MCQ methods.Furthermore, FAISS is distributed under a BSD + Patents licence, which88includes strong patent retaliation terms. This licence has been explicitlydisavowed by the Apache Software Foundation1, which prompted multipleopen source software projects to stop using React.js, a popular javascriptlibrary written by Facebook. In September 2017, Facebook changed thelicence of React.js and other of their projects to MIT, but to date has notchanged the licence of FAISS.• For scientific reproducibility. We also want to provide a free codebasewhere known MCQ baselines can be reproduced, and reference implementa-tions can be freely consulted. This has the side effect of making our researchreproducible, which is unfortunately not common in this field (yet).• To test Julia’s capabilities. Julia is a nascent programming language thatis emerging as an alternative to Python and Matlab in scientific computing.Writing our implementations in Julia serves as a use-case from which wederived insights into the advantages and limitations of this new language.There are multiple benefits of reproducible research beyond ensuring the in-tegrity of publicly funded research. We find the following note by Donoho (2010)particularly eloquent:In my own experience, error is ubiquitous in scientific computing, andone needs to work very diligently and energetically to eliminate it.One needs a very clear idea of what has been done in order to knowwhere to look for likely sources of error. I often cannot really be surewhat a student or colleague has done from [their] own presentation,and in fact often [their] description does not agree with my own un-derstanding of what has been done, once I look carefully at the scripts.Actually, I find that researchers quite generally forget what they havedone and misrepresent their computations.Computing results are now being presented in a very loose, breezyway in journal articles, in conferences, and in books. All too often onesimply takes computations at face value. This is spectacularly against1https://github.com/facebook/react/issues/1019189the evidence of my own experience. I would much rather that at talksand in referee reports, the possibility of such error were seriously ex-amined.In the following, we describe the functionality implemented in Rayuela.jl. Wethen describe some of the strengths and downsides of Julia as our programminglanguage of choice, and finally discuss future work that we would like to see im-plemented in Rayuela.jl.7.1 FunctionalityThe largest effort, parallel to ours, to reproduce MCQ baselines is currently theFacebook AI Similarity Search (FAISS) library (Johnson et al., 2017). However,FAISS currently does not implement any non-orthogonal MCQ methods – Rayuela.jlfills this gap in reproducible research. In Table 7.1 we list a number of MCQ algo-rithms, whether the authors released code with their paper, whether the algorithmis implemented in FAISS, and whether it is implemented in Rayuela.jl. Overall,Rayuela implements seven MCQ methods that FAISS does not currently provide.Rayuela.jl, like FAISS, includes GPU-accelerated versions of the most computationally-expensive algorithms that it implements. In our case, we provide CUDA imple-mentations of LSQ and SR (both introduced in this thesis), and OTQ (Babenko andLempitsky, 2015).7.2 StrengthsThe main motivation behind developing Julia as a language for scientific com-puting is that languages such as R, Python and Matlab are interpreted, not com-piled. While interpreters are good for rapid prototyping, they are often not goodat generating efficient low-level instructions. In practice, this means that usersrely on libraries written in C (such as Numpy), and often have to write their ownperformance-critical routines in lower-level languages to achieve satisfactory run-ning times. This issue is commonly known as the “two language problem”. Julialeverages the Low Level Virtual Machine (LLVM) compiler infrastructure with astrong type system, which allows it to produce highly efficient machine code. This,90Name (Reference) Abbr. Code FAISS (GPU) Rayuela (GPU)Orthogonal methodsTransform coding (Brandt, 2010) TCExpectation-based coding (Sandhawalia and Je´gou, 2010) –Product quantization (Je´gou et al., 2009, 2011) PQ Matlab, C X(X) XOptimized product quantization (Ge et al., 2014) OPQ Matlab X(X) XCartesian k-means (Norouzi and Fleet, 2013) CKM Matlab, C++ X(X) XPolysemous codes (Douze et al., 2016) – C++ XSemi-orthogonal methodsComposite quantization (Zhang et al., 2014) CQ MSVC C++Optimized tree quantization (Babenko and Lempitsky, 2015) OTQ X(X)Optimized Cartesian k-means (Wang et al., 2014) OCKMNon-orthogonal methodsResidual vector quantization (Chen et al., 2010) RVQ XEnhanced residual vector quantization (Ai et al., 2014, 2017) ERVQ XAdditive quantization (Babenko and Lempitsky, 2014a) AQ PythonStacked quantizers (Martinez et al., 2014) SQ Matlab, C++ XSelf-organizing binary encoding (Ozan et al., 2016c) SOBEJoint k-means quantization (Ozan et al., 2016b) JKMCompetitive quantization (Ozan et al., 2016a) COMPQ XPyramid additive quantization (Muravev et al., 2017) –Fast additive quantization (Li et al., 2017) –Local search quantization (This thesis) LSQ Julia, CUDA X(X)Stochastic eelaxations (This thesis) SR Julia, CUDA X(X)Table 7.1: Summary of MCQ methods and their implementationsin theory, allows users to write their entire codebases in Julia, without having to re-sort to lower-level languages.In our experience, Julia delivers on this promise. While we provide C++ im-plementations of both LSQ and OTQ encoding, we note that the Julia version of LSQis only 1.5-2× slower than the C++ version, and the Julia version of OTQ encodingis in fact faster and more memory-efficient than the C++ version.7.3 Challenges and limitationsMost drawbacks of Julia as a language are performance issues. Before the Ju-lia 1.0 release, language developers have been focused on implementing languagefeatures, and some routines are lacking in speed with respect to their C/C++ coun-91terparts. We explain some instances and consequences of these drawbacks in detail.MCQ search code: slow sorting and (lack of) multithreadingThe search code takes a query vector q and computes distance tables with respectto the codebooks C. It then uses table lookups to compute approximate distancesto the compressed database vectors B and, at the end, partially sorts the distancesto produce a final ranking of the top k neighbours. Our search code is currentlywritten in C++. The two main reasons for this are (1) that sorting in Julia is slow,and (2) that multithreading in Julia is not fully supported.Slow sorting (issue #939) was first reported in 2012 and, to date, Julia’s perfor-mance remains about 2× slower than its C++ counterpart.Unfortunately, to this day multithreading is only supported as an experimen-tal feature in Julia. Historically, this is due to LLVM (the compiler infrastruc-ture that Julia uses) not originally supporting OpenMP. LLVM started supportingOpenMP development in 2014, but by then Julia was a rather mature language,conceived without parallelism at its core. Slow sorting and lack of multithreadingmake our C++ implementation search code about 6× faster in C++ than in Julia,so we choose to keep the C++ implementation in the code base for the time being.ICM encoding: compiling Julia from sourceIterated conditional modes (ICM) with random perturbations is the core algorithminside Local search quantization (LSQ), introduced in Chapter 3. Since the algo-rithm may run on up to billions of vectors, it was crucial for us to develop a fastimplementation. We have developed a CUDA implementation of this algorithm,and we also provide a C++ implementation that is 1.5-2× faster than the Juliabaseline.Unfortunately, running CUDA code in Julia currently requires the user to com-pile Julia from source. This is sometimes non trivial and error-prone, and not idealfrom a usability perspective.927.4 Future workFuture work should address the limitation that we have described. Ideally, weshould be able to provide a package that is fully written in Julia, that offers multi-threading support, and that has a speed comparable to a C implementation.To the best of our knowledge, mulithreading will be officially supported in afuture version of Julia (1.x). When such support comes, we would like to use thefeature to eliminate the remaining the C++ code from our package.Getting rid of sorting in search codeSince in ANNS the user often needs only a list of k nearest neighbours, we coulduse a heap to keep track of the top k nearest neighbours at any given time, insteadof sorting them in the end. If the implementation of this data structure is efficientenough, we may be able to bypass the slow sorting issue and eliminate the C++search code.Writing CUDA code in JuliaThere is an interesting initiative to generate CUDA code from Julia called CUDA-native.jl (Besard et al., 2017). CUDANative.jl implements currently only a subsetof CUDA functionality, but this might be enough to generate the LSQ and OTQ en-coding routines that we currently provide in CUDA. The Julia 1.0 release shoulddrop the requirement for users to compile Julia from source.More MCQ methodsWe would like to implement other MCQ methods which, to this day, lack a pub-lic implementation and are therefore hard to reproduce. The method by Guoet al. (2016) runs k-means under the Mahalanobis distance to learn a compressionamenable to maximum inner product search (MIPS). Another interesting approach,due to Wu et al. (2017), is based on Stochatic gradient descent and focuses ondistributions with large range of norms, such as those arising in inverted files. Itwould be interesting to compare these two approaches with our current library toempirically determine their performance against state-of-the-art MCQ approaches.937.5 ConclusionWe have introduced Rayuela.jl, a package for large-scale similarity search. Rayuelafulfills our commitment to make our research accessible and reproducible, and fillsan important gap in reproducible research in the MCQ literature. We have also dis-cussed the strengths and limitations of Julia as a programming language of choice,and discussed future work that will make the package faster, more portable, andeasier to use.94Chapter 8ConclusionDid I do okay?Did I pave the way?—Belle and SebastianIn this thesis, we have introduced several improvements to algorithms that pro-duce compact codes used in large-scale similarity search. In Chapter 3, we intro-duced a method for fast encoding with better computation vs accuracy trade-offsthan its competitors. In Chapter 4, we added noise to either the codes or the code-books with a decaying temperature schedule, resulting in improved solutions. Wealso introduced a method for fast codebook update that exploits the structure of thebinary codes, resulting in shorter running times. These contributions make it moreappealing to use non-orthogonal codebooks in MCQ, as they drastically reduce thecomputation-vs-accuracy trade-offs of current methods.We have also used an automatic hyperparameter configuration tool (Hutteret al., 2011) to adapt our algorithms to particular data distributions in Chapter 5.This resulted in a new state of the art in a dataset of raw image digits (LeCun et al.,1998). Finally, we also explored the use of deep reinforcement learning techniquesto learn sampling distributions for stochastic local search procedures in Chapter 6.This, unfortunately, did not lead to noticeable improvements for most datasets.The main limitation that we faced during the development of our work was,largely, that it was very challenging for us to reproduce previous baselines. A sig-nificant step towards reversing this trend, we are releasing an open source library95that implements all our contributions and several MCQ baselines. Our implementa-tion is detailed in Chapter 7.Future workWe identify mainly three areas of future work in multi-codebook quatization (MCQ)1. Improvements upon local-search quantization (LSQ), introduced in this work,may focus on reducing the running time of the code update step, which is itscomputational bottleneck. This means more research into algorithms forlarge-scale combinatorial problems.2. Improvements upon composite quantization (CQ) (Zhang et al., 2014) mayfocus on reducing the running time of the codebook update step, which iscurrently its bottleneck. Currently, the algorithm uses the L-BFGS (Nocedal,1980) algorithm to solve a constrained optimization problem. Exploring theuse of recent algorithms for large-scale continuous optimization such as thestochastic average gradient (Schmidt et al., 2017) may prove useful in thiscase.3. Finally, improvements upon competitive quantization (COMPQ) (Ozan et al.,2016a) may focus on reducing the overall running time of this gradient-basedalgorithm. With its gradient-based optimization, this algorithm is uniquelysituated to be incorporated inAn interesting direction of future research in MCQ lies in incorporating vectorcompression into end-to-end trainable learning pipelines. The work by Jain et al.(2018) is a first step in this direction that backpropagates quantization error to theindexing structure. Their work, however, is technically based on hashing, as itcomputes distances with Hamming codes and does not use codebooks. Klein andWolf (2017) explicitly learn codebooks and codes in a deep learning pipeline, thusintroducing the first fully-differentiable MCQ pipeline. Merging these two ideasinto a fully-differentiable, deeply-learned indexing structure is an interesting openproblem in the field.96BibliographyM. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scalemachine learning. In OSDI, volume 16, pages 265–283, 2016. → pages 21, 86L. Ai, Y. Junqing, T. Guan, and Y. He. Efficient approximate nearest neighborsearch by optimized residual vector quantization. In Content-Based MultimediaIndexing (CBMI), 12th International Workshop on, pages 1–4, 2014. → pages15, 18, 38, 49, 60, 91L. Ai, J. Yu, Z. Wu, Y. He, and T. Guan. Optimized residual vector quantizationfor efficient approximate nearest neighbor search. Multimedia Systems, 23(2):169–181, 2017. → pages 15, 18, 91A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical andoptimal LSH for angular distance. In Advances in Neural InformationProcessing Systems, pages 1225–1233, 2015. → pages 9F. Andre´, A.-M. Kermarrec, and N. Le Scouarnec. Cache locality is not enough:high-performance nearest neighbor search with product quantization fast scan.Proceedings of the VLDB Endowment, 9(4):288–299, 2015. → pages 18, 19, 35D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. InProceedings of the eighteenth annual ACM-SIAM Symposium on DiscreteAlgorithms. → pages 48, 51, 52F. Aurenhammer. Voronoi diagrams - a survey of a fundamental geometric datastructure. ACM Computing Surveys (CSUR), 23(3):345–405, 1991. → pages 7A. Babenko and V. Lempitsky. The inverted multi-index. In Computer Vision andPattern Recognition, pages 3069–3076, 2012. → pages 21, 35, 71A. Babenko and V. Lempitsky. Additive quantization for extreme vectorcompression. In Computer Vision and Pattern Recognition, pages 931–938,972014a. → pages 3, 4, 15, 16, 18, 22, 23, 24, 25, 26, 35, 37, 38, 40, 41, 50, 53,91A. Babenko and V. Lempitsky. Improving bilayer product quantization forbillion-scale approximate nearest neighbors in high dimensions. arXiv preprintarXiv:1404.1831, 2014b. → pages 35, 40A. Babenko and V. Lempitsky. Tree quantization for large-scale similarity searchand classification. In Computer Vision and Pattern Recognition, pages4240–4248, 2015. → pages 3, 14, 16, 18, 29, 31, 37, 38, 40, 90, 91A. Babenko and V. Lempitsky. Efficient indexing of billion-scale datasets of deepdescriptors. In Computer Vision and Pattern Recognition, pages 2055–2063,2016. → pages 59, 61, 66A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for imageretrieval. In European Conference on Computer Vision, pages 584–599.Springer, 2014. → pages 39Y. Bengio, N. Le´onard, and A. Courville. Estimating or propagating gradientsthrough stochastic neurons for conditional computation. arXiv preprintarXiv:1308.3432, 2013. → pages 81J. L. Bentley. Multidimensional binary search trees used for associative searching.Communications of the ACM, 18(9):509–517, 1975. → pages 8J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Ke´gl. Algorithms forhyper-parameter optimization. In Advances in Neural Information ProcessingSystems, pages 2546–2554, 2011. → pages 72, 73T. Besard, C. Foket, and B. De Sutter. Effective extensible programming:Unleashing julia on GPUs. arXiv preprint arXiv:1712.03112, 2017. → pages93J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah. Julia: A fresh approach tonumerical computing. arXiv preprint arXiv:1411.1607, 2014. → pages 5, 46,88D. W. Blalock and J. V. Guttag. Bolt: Accelerated data mining with fast vectorcompression. In ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, pages 727–735, 2017. → pages 19D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal ofMachine Learning Research, 3(1):993–1022, 2003. → pages 7498Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization viagraph cuts. IEEE Transactions on pattern analysis and machine intelligence,23(11):1222–1239, 2001. → pages 26J. Brandt. Transform coding for fast approximate nearest neighbor search in highdimensions. In Computer Vision and Pattern Recognition, pages 1815–1822,2010. → pages 12, 18, 91E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on bayesian optimization ofexpensive cost functions, with application to active user modeling andhierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010. →pages 73R. Bunel, A. Desmaison, M. P. Kumar, P. H. Torr, and P. Kohli. Learning tosuperoptimize programs. In International Conference on LearningRepresentations, 2017. → pages 82Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen. Deep quantization network forefficient image retrieval. In Association for the Advancement of ArtificialIntelligence, pages 3457–3463, 2016. → pages 20Y. Cao, M. Long, J. Wang, and S. Liu. Deep visual-semantic quantization forefficient image retrieval. In Computer Vision and Pattern Recognition,volume 2, pages 1328–1337, 2017. → pages 21M. E. Celebi, H. A. Kingravi, and P. A. Vela. A comparative study of efficientinitialization methods for the k-means clustering algorithm. Expert Systemswith Applications, 40(1):200–210, 2013. → pages 51W.-Y. Chan, S. Gupta, and A. Gersho. Enhanced multistage vector quantizationby joint codebook design. IEEE Transactions on Communications, 40(11):1693–1697, 1992. → pages 15K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil inthe details: Delving deep into convolutional nets. In British Machine VisionConference, 2014. → pages 39, 59Y. Chen, T. Guan, and C. Wang. Approximate nearest neighbor search by residualvector quantization. Sensors, 10(12):11259–11273, 2010. → pages 15, 16, 18,23, 38, 49, 60, 67, 91G. F. Cooper. The computational complexity of probabilistic inference usingBayesian belief networks. Artificial Intelligence, 42(2-3):393–405, 1990. →pages 2499J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: Alarge-scale hierarchical image database. In Computer Vision and PatternRecognition, pages 248–255, 2009. → pages 2D. L. Donoho. An invitation to reproducible computational research.Biostatistics, 11(3):385–388, 2010. → pages 89M. Douze, H. Je´gou, and F. Perronnin. Polysemous codes. In EuropeanConference on Computer Vision, pages 785–801, 2016. → pages 18, 19, 21, 91J. C. Dunn. A fuzzy relative of the ISODATA process and its use in detectingcompact well-separated clusters. Journal of Cybernetics, 3(3):32–57, 1973. →pages 51K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, andK. Leyton-Brown. Towards an empirical foundation for assessing Bayesianoptimization of hyperparameters. In NIPS Workshop on Bayesian Optimizationin Theory and Practice, volume 10, 2013. → pages 74C. Elkan. Using the triangle inequality to accelerate k-means. In InternationalConference on Machine Learning, pages 147–153, 2003. → pages 51J. K. Flanagan, D. R. Morrell, R. L. Frost, C. J. Read, and B. E. Nelson. Vectorquantization codebook generation using simulated annealing. In InternationalConference on Acoustics, Speech, and Signal Processing, pages 1759–1762,1989. → pages 52D. C.-L. Fong and M. Saunders. LSMR: An iterative algorithm for sparseleast-squares problems. SIAM Journal on Scientific Computing, 33(5):2950–2971, 2011. → pages 50, 53T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. Transactions onPattern Analysis and Machine Intelligence, 36(4):744–755, 2014. → pages 3,12, 18, 37, 38, 39, 40, 60, 91S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and thebayesian restoration of images. IEEE Transactions on pattern analysis andmachine intelligence, (6):721–741, 1984. → pages 26A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. KluwerAcademic Publishers, 1992. ISBN 0792391810;9780792391814;. → pages 24Y. Gong and S. Lazebnik. Iterative quantization: A Procrustean approach tolearning binary codes. In Computer Vision and Pattern Recognition, pages817–824, 2011. → pages 3, 60100A. Gordo, J. Almazan, J. Revaud, and D. Larlus. End-to-end learning of deepvisual representations for image retrieval. International Journal of ComputerVision, 124(2):237–254, 2017. → pages 2R. Guo, S. Kumar, K. Choromanski, and D. Simcha. Quantization based fast innerproduct search. In Artificial Intelligence and Statistics, pages 482–490, 2016.→ pages 3, 20, 93S. Hochreiter and J. Schmidhuber. Long short-term memory. NeuralComputation, 9(8):1735–1780, 1997. → pages 82H. H. Hoos and T. Stu¨tzle. Stochastic local search: Foundations & Applications.Elsevier, 2004. → pages 22, 27, 30, 42, 55, 79, 80F. Hutter, H. H. Hoos, and T. Stu¨tzle. Automatic algorithm configuration based onlocal search. In Association for the Advancement of Artificial Intelligence,volume 7, pages 1152–1157, 2007. → pages 72, 73F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-basedoptimization for general algorithm configuration. In International Conferenceon Learning and Intelligent Optimization, pages 507–523. Springer, 2011. →pages 72, 73, 95Y. Hwang, B. Han, and H.-K. Ahn. A fast nearest neighbor search algorithm bynonlinear embedding. In Computer Vision and Pattern Recognition, pages3053–3060, 2012a. → pages 51Y. Hwang, B. Han, and H.-K. Ahn. A fast nearest neighbor search algorithm bynonlinear embedding. In Computer Vision and Pattern Recognition, pages3053–3060, 2012b. → pages 15A. K. Jain. Data clustering: 50 years beyond k-means. Pattern RecognitionLetters, 31(8):651–666, 2010. → pages 51H. Jain, P. Pe´rez, R. Gribonval, J. Zepeda, and H. Je´gou. Approximate search withquantized sparse representations. In European Conference on Computer Vision,pages 681–696, 2016. → pages 19H. Jain, J. Zepeda, P. Pe´rez, and R. Gribonval. SUBIC: A supervised, structuredbinary code for image search. In International Conference on Computer Vision,pages 833–842, 2017. → pages 21H. Jain, J. Zepeda, P. Pe´rez, and R. Gribonval. Learning a complete imageindexing pipeline. Computer Vision and Pattern Recognition, pages4933–4941, 2018. → pages 21, 96101M. Ja¨rvisalo, D. Le Berre, O. Roussel, and L. Simon. The international SATsolver competitions. AI Magazine, 33(1):89–92, 2012. → pages 22H. Je´gou, M. Douze, and C. Schmid. Hamming embedding and weak geometricconsistency for large scale image search. In European Conference on ComputerVision, pages 304–317, 2008. → pages 11H. Je´gou, M. Douze, and C. Schmid. Searching with quantization: approximatenearest neighbor search using short codes and distance estimators. TechnicalReport RR-7020, INRIA, 2009. → pages 11, 12, 18, 91H. Je´gou, M. Douze, and C. Schmid. Product quantization for nearest neighborsearch. Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011. → pages 3, 12, 18, 21, 37, 38, 39, 40, 59, 60, 61, 67, 71, 91H. Je´gou, F. Perronnin, M. Douze, J. Sa´nchez, P. Perez, and C. Schmid.Aggregating local image descriptors into compact codes. Transactions onPattern Analysis and Machine Intelligence, 34(9):1704–1716, 2012. → pages12J. Johnson, M. Douze, and H. Je´gou. Billion-scale similarity search with GPUs.arXiv preprint arXiv:1702.08734, 2017. → pages 21, 88, 90Y. Kalantidis and Y. Avrithis. Locally optimized product quantization forapproximate nearest neighbor search. In Computer Vision and PatternRecognition, pages 2321–2328, 2014. → pages 40J. H. Kappes, B. Andres, F. Hamprecht, C. Schnorr, S. Nowozin, D. Batra,S. Kim, B. X. Kausler, J. Lellmann, N. Komodakis, et al. A comparative studyof modern inference techniques for discrete energy minimization problems. InComputer Vision and Pattern Recognition, pages 1328–1335, 2013. → pages25, 29A. M. Kibriya and E. Frank. An empirical comparison of exact nearest neighbouralgorithms. In European Conference on Principles of Data Mining andKnowledge Discovery, pages 140–151, 2007. → pages 8D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXivpreprint arXiv:1412.6980, 2014. → pages 86B. Klein and L. Wolf. In defense of product quantization. arXiv preprintarXiv:1711.08589, 2017. → pages 21, 96102P. Kra¨henbu¨hl and V. Koltun. Efficient inference in fully connected CRFs withGaussian edge potentials. In Advances in Neural Information ProcessingSystems, pages 109–117, 2011. → pages 26A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deepconvolutional neural networks. In Computer Vision and Pattern Recognition,pages 1097–1105, 2012. → pages 72Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning appliedto document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.→ pages 39, 59, 74, 95J. Li, X. Lan, J. Wang, M. Yang, and N. Zheng. Fast additive quantization forvector compression in nearest neighbor search. Multimedia Tools andApplications, 76(22):23273–23289, 2017. → pages 17, 18, 91Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantizer design. IEEETransactions on communications, 28(1):84–95, 1980. → pages 48, 52S. Liu, J. Shao, and H. Lu. Generalized residual vector quantization for largescale data. 2016. → pages 16H. R. Lourenc¸o, O. C. Martin, and T. Stu¨tzle. Iterated local search. In Handbookof Metaheuristics, pages 320–353. Springer, 2003. → pages 27, 81D. G. Lowe. Distinctive image features from scale-invariant keypoints.International Journal of Computer Vision, 60(2):91–110, 2004. → pages 2, 10Y. A. Malkov and D. Yashunin. Efficient and robust approximate nearest neighborsearch using hierarchical navigable small world graphs. arXiv preprintarXiv:1603.09320, 2016. → pages 9T. M. Martinetz, S. G. Berkovich, and K. J. Schulten. ‘Neural-gas’ network forvector quantization and its application to time-series prediction. IEEETransactions on Neural Networks, 4(4):558–569, 1993. → pages 51J. Martinez, H. H. Hoos, and J. J. Little. Stacked quantizers for compositionalvector compression. arXiv preprint arXiv:1411.2173, 2014. → pages 15, 18,20, 38, 49, 60, 91J. Martinez, J. Clement, H. H. Hoos, and J. J. Little. Revisiting additivequantization. In European Conference on Computer Vision, pages 137–153,2016a. → pages v103J. Martinez, H. H. Hoos, and J. J. Little. Solving multi-codebook quantization inthe GPU. In Workshop on Web-scale Vision and Social Media (VSM), ECCVworkshops, 2016b. → pages vJ. Martinez, M. J. Black, and J. Romero. On human motion prediction usingrecurrent neural networks. In Computer Vision and Pattern Recognition, pages4674–4683, 2017a. → pages 84J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple, yet effectivebaseline for 3d human pose estimation. In International Conference onComputer Vision, pages 2640–2649, 2017b. → pages 84, 86J. Martinez, S. Zakhmi, H. H. Hoos, and J. J. Little. LSQ++: Lower running timeand higher recall in multi-codebook quantization. In European Conference onComputer Vision, 2018. → pages vY. Matsui, T. Yamasaki, and K. Aizawa. PQtable: Fast exact asymmetric distanceneighbor search for product quantization using hash tables. In InternationalConference on Computer Vision, pages 1940–1948, 2015. → pages 71V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-levelcontrol through deep reinforcement learning. Nature, 518(7540):529–533,2015. → pages 81, 87M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automaticalgorithm configuration. In International Conference on Computer VisionTheory and Applications, pages 331–340, 2009. → pages 8, 9, 12, 30A. Muravev, E. C. Ozan, A. Iosifidis, and M. Gabbouj. Pyramid encoding for fastadditive quantization. In Signal Processing Conference, 2017. → pages 17, 18,91Y. Nagata and S. Kobayashi. A powerful genetic algorithm using edge assemblycrossover for the traveling salesman problem. Informs Journal on Computing,25(2):346–363, 2013. → pages 27D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. InComputer Vision and Pattern Recognition, volume 2, pages 2161–2168, 2006.→ pages 9J. Nocedal. Updating quasi-newton matrices with limited storage. Mathematics ofComputation, 35(151):773–782, 1980. → pages 13, 96104M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. InInternational Conference on Machine Learning, pages 353–360, 2011. →pages 39M. Norouzi and D. J. Fleet. Cartesian k-means. In Computer Vision and PatternRecognition, pages 3017–3024, 2013. → pages 3, 13, 18, 39, 60, 91M. Norouzi, A. Punjani, and D. J. Fleet. Fast exact search in hamming space withmulti-index hashing. Transactions on Pattern Analysis and MachineIntelligence, 36(6), 2014. → pages 8J. Nutini, M. Schmidt, I. Laradji, M. Friedlander, and H. Koepke. Coordinatedescent converges faster with the gauss-southwell rule than random selection.In International Conference on Machine Learning, pages 1632–1641, 2015. →pages 30A. Oliva and A. Torralba. Building the gist of a scene: The role of global imagefeatures in recognition. Progress in Brain Research, 155:23–36, 2006. →pages 17S. M. Omohundro. Five balltree construction algorithms. International ComputerScience Institute Berkeley, 1989. → pages 8E. C. Ozan, S. Kiranyaz, and M. Gabbouj. Competitive quantization forapproximate nearest neighbor search. IEEE Transactions on Knowledge andData Engineering, 28(11):2884–2894, 2016a. → pages 3, 17, 18, 40, 66, 67,91, 96E. C. Ozan, S. Kiranyaz, and M. Gabbouj. Joint k-means quantization forapproximate nearest neighbor search. In International Conference on PatternRecognition, pages 3645–3649, 2016b. → pages 16, 17, 18, 91E. C. Ozan, S. Kiranyaz, M. Gabbouj, and X. Hu. Self-organizing binaryencoding for approximate nearest neighbor search. In Signal ProcessingConference, pages 1103–1107, 2016c. → pages 16, 17, 18, 91G. Papandreou and A. Yuille. Perturb-and-MAP random fields: Using discreteoptimization to learn and sample from energy models. In InternationalConference on Computer Vision, pages 193–200, 2011. → pages 68J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval withlarge vocabularies and fast spatial matching. In Computer Vision and PatternRecognition, pages 1–8, 2007. → pages 51105A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN featuresoff-the-shelf: an astounding baseline for recognition. In Computer Vision andPattern Recognition Workshops, pages 512–519. IEEE, 2014. → pages 39S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-criticalsequence training for image captioning. In Computer Vision and PatternRecognition, pages 7008–7016, 2017. → pages 82, 84B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. Labelme: adatabase and web-based tool for image annotation. International Journal ofComputer Vision, 77(1-3):157–173, 2008. → pages 59H. Sandhawalia and H. Je´gou. Searching with expectations. In InternationalConference on Acoustics Speech and Signal Processing, pages 1242–1245,2010. → pages 11, 12, 18, 91M. Schmidt. UGM: Matlab code for undirected graphical models.http://www.cs.ubc.ca/∼schmidtm/Software/UGM/, 2007. Accessed:2018-08-18. → pages 30M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochasticaverage gradient. Mathematical Programming, 162(1-2):83–112, 2017. →pages 69, 96A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximuminner product search (MIPS). In Advances in Neural Information ProcessingSystems, pages 2321–2329, 2014. → pages 3C. Silpa-Anan and R. Hartley. Optimised kd-trees for fast image descriptormatching. In Computer Vision and Pattern Recognition, pages 1–8. IEEE,2008. → pages 9D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez,T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go withouthuman knowledge. Nature, 550(7676):354, 2017. → pages 81, 87J. Sivic and A. Zisserman. Video google: A text retrieval approach to objectmatching in videos. In International Conference on Computer Vision, pages1–8, 2003. → pages 11N. Snavely, S. M. Seitz, and R. Szeliski. 25(3):835–846, 2006. → pages 2J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization ofmachine learning algorithms. In Advances in Neural Information ProcessingSystems, pages 2951–2959, 2012. → pages 72, 74106I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neuralnetworks. In Advances in Neural Information Processing Systems, pages3104–3112, 2014. → pages 72R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction,volume 1. MIT Press Cambridge, 1998. → pages 80R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradientmethods for reinforcement learning with function approximation. In Advancesin Neural Information Processing Systems, pages 1057–1063, 2000. → pages67, 81C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXivpreprint arXiv:1409.4842, 2014. → pages 59C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA:Combined selection and hyperparameter optimization of classificationalgorithms. In ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, pages 847–855, 2013. → pages 74R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of theRoyal Statistical Society. Series B (Methodological), pages 267–288, 1996. →pages 37A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large dataset for nonparametric object and scene recognition. Transactions on PatternAnalysis and Machine Intelligence, 30(11):1958–1970, 2008. → pages 2J. Vaisey and A. Gersho. Simulated annealing and codebook design. InInternational Conference on Acoustics, Speech, and Signal Processing, pages1176–1179, 1988. → pages 52E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basispursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008.→ pages 37V. Vineet and P. Narayanan. Cuda cuts: Fast graph cuts on the GPU. In ComputerVision and Pattern Recognition Workshops, pages 1–8, 2008. → pages 26, 27J. Wang and T. Zhang. Composite quantization. arXiv preprintarXiv:1712.00955, 2017. → pages 13107J. Wang, J. Wang, J. Song, X.-S. Xu, H. T. Shen, and S. Li. Optimized Cartesiank-means. IEEE Transactions on Knowledge and Data Engineering, 27(1):180–192, 2014. → pages 14, 16, 18, 91X. Wang, T. Zhang, G.-J. Qi, J. Tang, and J. Wang. Supervised quantization forsimilarity search. In Computer Vision and Pattern Recognition, pages2018–2026, 2016. → pages 20H. S. Warren. Hacker’s Delight. Pearson Education, 2013. → pages 82Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in NeuralInformation Processing Systems, pages 1753–1760, 2009. → pages 3, 11R. J. Williams. Simple statistical gradient-following algorithms for connectionistreinforcement learning. Machine Learning, 8(3-4):229–256, 1992. → pages81, 82J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neuralnetworks for mobile devices. In Computer Vision and Pattern Recognition,pages 4820–4828, 2016. → pages 3X. Wu, R. Guo, A. T. Suresh, S. Kumar, D. N. Holtmann-Rice, D. Simcha, andX. Y. Felix. Multiscale quantization for fast similarity search. In Advances inNeural Information Processing Systems, pages 5749–5757, 2017. → pages 20,93Y. Xia, K. He, F. Wen, and J. Sun. Joint inverted indexing. In InternationalConference on Computer Vision, pages 3416–3423, 2013. → pages 35K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, andY. Bengio. Show, attend and tell: Neural image caption generation with visualattention. In International Conference on Machine Learning, pages2048–2057, 2015. → pages 82L. Xu, F. Hutter, H. H. Hoos, and K. Leyton-Brown. SATzilla: portfolio-basedalgorithm selection for SAT. Journal of Artificial Intelligence Research, 32:565–606, 2008. → pages 22C. Zach, D. Gallup, J.-M. Frahm, and M. Niethammer. Fast global labeling forreal-time stereo using multiple plane sweeps. In Vision, Modelling andVisualization (VMV), pages 243–252, 2008. → pages 26, 27K. Zeger, J. Vaisey, and A. Gersho. Globally optimal vector quantizer design bystochastic relaxation. IEEE Transactions on Signal Processing, 40(2):310–322,1992. → pages 48, 52, 55, 57, 58, 76108T. Zhang and J. Wang. Collaborative quantization for cross-modal similaritysearch. In Computer Vision and Pattern Recognition, pages 2036–2045, 2016.→ pages 13T. Zhang, C. Du, and J. Wang. Composite quantization for approximate nearestneighbor search. In International Conference on Machine Learning, pages838–846, 2014. → pages 3, 4, 13, 18, 20, 30, 35, 37, 38, 39, 40, 41, 44, 50, 60,64, 69, 71, 91, 96T. Zhang, G.-J. Qi, J. Tang, and J. Wang. Sparse composite quantization. InComputer Vision and Pattern Recognition, pages 4548–4556, 2015. → pages13, 35, 36, 37, 38, 39, 40, 45, 71109
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Algorithms for large-scale multi-codebook quantization
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Algorithms for large-scale multi-codebook quantization Martinez-Covarrubias, Julieta 2018
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Algorithms for large-scale multi-codebook quantization |
Creator |
Martinez-Covarrubias, Julieta |
Publisher | University of British Columbia |
Date Issued | 2018 |
Description | Combinatorial vector compression is the task of expressing a set of vectors as accurately as possible in terms of discrete entries in multiple bases. The problem is of interest in the context of large-scale similarity search, as it provides a memory-efficient, yet ready-to-use compact representation of high-dimensional data on which vector similarities such as Euclidean distances and dot products can be efficiently approximated. Combinatorial compression poses a series of challenging optimization problems that are often a barrier to its deployment on very large scale systems (e.g., of over a billion entries). In this thesis we explore algorithms and optimization techniques that make combinatorial compression more accurate and efficient in practice, and thus provide a practical alternative to current methods for large-scale similarity search. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2018-12-12 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0375712 |
URI | http://hdl.handle.net/2429/68041 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2019-02 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2019_february_martinez-covarrubias_julieta.pdf [ 2.41MB ]
- Metadata
- JSON: 24-1.0375712.json
- JSON-LD: 24-1.0375712-ld.json
- RDF/XML (Pretty): 24-1.0375712-rdf.xml
- RDF/JSON: 24-1.0375712-rdf.json
- Turtle: 24-1.0375712-turtle.txt
- N-Triples: 24-1.0375712-rdf-ntriples.txt
- Original Record: 24-1.0375712-source.json
- Full Text
- 24-1.0375712-fulltext.txt
- Citation
- 24-1.0375712.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0375712/manifest