Compositional Compression of Deep Image FeaturesUsing Stacked QuantizersbyJulieta Martinez-CovarrubiasA THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)December 2014c© Julieta Martinez-Covarrubias, 2014AbstractIn computer vision, it is common for image representations to be stored as high-dimensional real-valued vectors. In many computer vision applications, such asretrieval, classification, registration and reconstruction, the computational bottle-neck arises in a process known as feature matching, where, given a query vector,a similarity score has to be computed to many vectors in a (potentially very large)database. For example, it is not uncommon for object retrieval and classification tobe performed by matching global representations in collections with thousands [27]or millions [10, 32] of images. A popular approach to reduce the computationaland memory requirements of this process is vector quantization.In this work, we first analyze several vector compression methods typicallyused in the computer vision literature in terms of their computational trade-offs. Inparticular, we observe that Product Quantization (PQ) and Additive Quantization(AQ) lie on the extremes of a compositional vector compression design choice,where the former assumes complete codebook independence and the latter assumesfull codebook dependence. We explore an intermediate approach that exploits ahierarchical structure in the codebooks. This results in a method that is largelycompetitive with AQ in structured vectors, and outperforms AQ in unstructuredvectors while being several orders of magnitude faster.We also perform an extensive evaluation of our method on standard bench-marks of Scale Invariant Feature Transform (SIFT), and GIST descriptors, as wellas on new datasets of features obtained from state-of-the-art convolutional neuralnetworks. In benchmarks of low-dimensional deep features, our approach obtainsthe best known-to-date results, often requiring less than half the memory of PQ toachieve the same performance.iiPrefaceA slightly more compact version of the work in Chapters 1-4 has been submittedfor publication as Martinez, J., Hoos, H. H., & Little, J. J.: Stacked Quantizers forCompositional Vector Compression. Chapters 5 and 6 consist of original, unpub-lished work.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Deep features and their compression . . . . . . . . . . . . . . . . 42.2 Vector quantization . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Compositional quantization . . . . . . . . . . . . . . . . . . . . . 62.4 Product quantization . . . . . . . . . . . . . . . . . . . . . . . . 72.5 Optimized product quantization . . . . . . . . . . . . . . . . . . 92.6 Additive quantization . . . . . . . . . . . . . . . . . . . . . . . 
. 93 Stacked quantizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.1 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.2 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.3 Codebook refinement . . . . . . . . . . . . . . . . . . . . . . . . 15iv4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . 174.2 Quantization error . . . . . . . . . . . . . . . . . . . . . . . . . . 204.3 Approximate nearest neighbour search . . . . . . . . . . . . . . . 224.4 Large-scale object classification . . . . . . . . . . . . . . . . . . 234.5 Running times . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275 Image retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295.1 Qualitative retrieval . . . . . . . . . . . . . . . . . . . . . . . . . 296 Encoding with local search . . . . . . . . . . . . . . . . . . . . . . . 346.1 Design choices in compositional quantization . . . . . . . . . . . 346.2 Iterative best improvement search . . . . . . . . . . . . . . . . . 356.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . 367 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40vList of TablesTable 4.1 Summary of the datasets on which we test our approach. Thetwo datasets at the top of the table were introduced in [18] andare standard benchmarks in the literature. The four datasets onthe bottom are introduced in this work. . . . . . . . . . . . . . 18Table 6.1 Summary of the design choices that make AQ and SQ different. 35viList of FiguresFigure 3.1 A graphical model representation of codebook learning thathighlights the assumptions of PQ/OPQ, AQ, and our approach.Here we show 4 subcodebooks: a) PQ and OPQ assume sub-codebook independence; b) AQ attempts to solve the fully con-nected problem; c) we assume a hierarchical structure. . . . . 13Figure 3.2 Encoding as performed with Stacked Quantizers, shown for4 subcodebooks. Left: the vector is passed through a seriesof quantizers, with residuals further encoded down the line.Right: A geometric interpretation of our approach. After re-cursively encoding residuals, the representation is additive inthe encodings, and the quantization error is the remaining resid-ual. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14Figure 4.1 Quantization error on SIFT1M and GIST1M. SQ shows a mod-est advantage on structured features. . . . . . . . . . . . . . . 20Figure 4.2 Quantization error on deep features with different dimension-alities. The non-independent approaches AQ and SQ clearlyoutperform PQ and OPQ. SQ achieves the best performancewhen using 8 and 16 codebooks (64 and 128 bits per feature)in all cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . 21Figure 4.3 Recall@N curves on the SIFT1M and GIST1M datasets. . . . 23Figure 4.4 Recall@N curves on the ConvNet1M-128 and ConvNet1M-1024 datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . 24viiFigure 4.5 Recall@N curves on the ConvNet1M-2048 and ConvNet1M-4096 datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . 25Figure 4.6 Top-5 classification error on the ILSVRC-2012 dataset as afunction of compression. The dotted black line corresponds toperformance without compression. 
The left pane shows per-formance using 128-dimensional deep features, and the rightpane shows performance for 1024-dimensional deep features. 26Figure 4.7 Left: Training time vs. Quantization error of the benchmarkedmethods on the ConvNet1M-128 training dataset (100K fea-tures). For clarity, we plot each 50 iterations for PQ and OPQand each 25 iterations for SQ after initialization. PQ and OPQcomplete 100 iterations after 286 and 336 seconds respectively(4.8 and 5.6 minutes), SQ takes ∼1500 seconds for initializa-tion and ∼1000 seconds for 100 iterations of codebook refine-ment (42 minutes in total). APQ takes∼2.7 hours for 10 itera-tions. Right: Encoding of the database set of 1M features. PQand OPQ take ∼5 seconds, SQ ∼20 seconds, and APQ ∼9.2hours. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27Figure 5.1 Precision@N curves on the ILSRVC 2012 dataset. . . . . . . 30Figure 5.2 Retrieval errors introduced by quantization noise in natural im-ages. Top: the roundness of the snail is incorrectly matched toan oyster in retrieval with PQ and OPQ, and to a turtle in PQ.Middle: A soccer ball is matched against other round objectssuch as an acorn and a tennis ball. Bottom: a lemon is matchedagainst apples and a lime. Best viewed digitally. . . . . . . . . 31Figure 5.3 Typical retrieval errors in person-made objects. Top: Binocu-lars are incorrectly matched to photographic cameras. Middle:A bib is incorrectly matched to t-shirts. Bottom: A pickuptruck is incorrectly matched to cars. Best viewed digitally. . . 32viiiFigure 5.4 Top: A mantis brings up grasshoppers, which share some struc-ture in the legs, but not the head. Bottom: A picture of a burritois incorrectly matched with croissants and sandwiches. Bestviewed digitally. . . . . . . . . . . . . . . . . . . . . . . . . 33Figure 6.1 Quantization error results for compositional quantizers withlocal search on the SIFT1M and ConvNet1M-128 datasets. . . 37ixAcknowledgmentsI would like to thank my supervisors, Holger Hoos and James Little, for their en-couragement and support. Their advice, both academic and professional, has beenthe main influence on how I do and think of research.I also want to thank my advisor, Nick Harvey, who convinced me to pursueresearch in computer vision instead of computational theory. The more time goesby, the more I realize the immense value of his advice. Finally, I want to thankmy mentor, Joanna McGrenere. Her professional advice and encouragement at thevery beginning of my research career could not have been more timely.I am also grateful to Peter Carr, with whom I was fortunate to spend the summerof 2014 at Disney Research Pittsburgh. This gave me the opportunity to work onvision problems with a focus on industrial applications, and allowed me to havefruitful discussions with the very talented interns at Disney. In particular, I wantto thank Mohammad Haris Baig for the many discussions on the optimization ofcompositional quantizers.I also want to thank my fellow members of the computer vision group atthe Laboratory for Computational Intelligence, Ankur Gupta and Frederick Tung.Their passion for applied computer vision to large-scale problems spawned myown interest in large-scale visual retrieval.Finally, I want to thank my friends, both geographically distant (Anahi, Erica,Fernanda and Silvia) and close (Alice, Hugh, Jaimie and Victor). 
Words cannotdescribe how much I value their friendship and appreciate their support.xChapter 1IntroductionComputer vision applications often involve computing the similarity of many high-dimensional, real-valued image representations, in a process known as featurematching. When large databases of images are used, this results in significantcomputational bottlenecks. For example, approaches to structure from motion,such as Photo Tourism [30], estimate the relative viewpoint of each image in largecollections of photographs by computing the pairwise similarity of several millionSIFT [21] descriptors, and it is now common for retrieval and classification datasetsto feature thousands [11, 19, 26] or millions [10, 32] of images.In the context of these computational bottlenecks, vector quantization has es-tablished itself as a default approach to scale computer vision applications. Quan-tization is usually performed on large datasets of local descriptors (e.g., SIFT [21]),or global representations (e.g., VLAD [17], Fisher vectors [25] or deep features [8]).Outside the computer vision community, the problem is also studied in informationtheory, multimedia retrieval and unsupervised learning.Vector quantization is usually posed as the search for a set of codewords (i.e.,a codebook) and assignments (i.e., codes) that minimize quantization error. Theproblem can be solved in a straightforward manner with the k-means algorithmwhich, unfortunately, scales poorly for large codebooks. While larger codebooksachieve lower quantization error, the downside is that encoding and search timesscale linearly with the codebook size.Several algorithms, such as kd-trees and hierarchical k-means, alleviate the1search and encoding problems by indexing the codebook with complex data struc-tures [22], achieving sublinear search time as a trade-off for recall. These ap-proaches, however, have a large memory footprint, because all the uncompressedvectors must be kept in memory, and do not scale well on very large databases.Another line of research considers approaches with an emphasis on low mem-ory usage, compressing vectors into small binary codes. While for a long timehashing approaches were the dominant trend [14], they were shown to be largelyoutperformed by Product Quantization (PQ) [18]. PQ is a compositional vectorcompression algorithm that decomposes the data into orthogonal subspaces, andquantizes each subspace independently. As a result, vectors can be encoded inde-pendently in each subspace, and distances between uncompressed queries and thedatabase can be efficiently computed through a series of table lookups.The conspicuous success of PQ can be attributed to its small memory footprint,low quantization error and fast search (1 million vectors can be searched exhaus-tively in around 100 milliseconds), which make it very attractive for scaling com-puter vision applications. In particular, PQ demonstrated impressive performanceand great scalability when applied to nearest neighbour search of SIFT features,which are a good fit to the codebook orthogonality of PQ. This stems from the factthat SIFT features are, in fact, independently aggregated every 8 dimensions [21].However, PQ proved less successful when applied to the semi-structured GISTdescriptors [18], and is likely not a good choice for features obtained from deepconvolutional networks. 
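To make these mechanics concrete, the following minimal Python/NumPy sketch (illustrative only; the function and variable names are ours and do not correspond to any released implementation) shows the two operations just described, assuming the dimensionality d is divisible by the number of subcodebooks m: each subvector is encoded independently in its subspace, and the distance between an uncompressed query and an encoded vector reduces to a sum of m table lookups.

```python
import numpy as np

def pq_encode(x, subcodebooks):
    """Encode one d-dimensional vector with m orthogonal subcodebooks.

    subcodebooks: list of m arrays, each of shape (h, d/m); entry i holds the
    codewords learned (e.g., by k-means) for the i-th subspace. Assumes that
    d is divisible by m.
    """
    m = len(subcodebooks)
    subvectors = np.split(x, m)                    # d/m-dimensional chunks
    codes = []
    for xi, Di in zip(subvectors, subcodebooks):
        dists = np.sum((Di - xi) ** 2, axis=1)     # distance to every codeword
        codes.append(int(np.argmin(dists)))        # nearest codeword index
    return codes

def pq_asymmetric_distance(q, codes, subcodebooks):
    """Approximate distance between an uncompressed query q and a PQ code:
    precompute one lookup table per subspace, then sum m table entries."""
    m = len(subcodebooks)
    qs = np.split(q, m)
    tables = [np.sum((Di - qi) ** 2, axis=1) for qi, Di in zip(qs, subcodebooks)]
    return sum(tables[i][codes[i]] for i in range(m))
```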
This work is, to the best of our knowledge, the first tothoroughly investigate the compression of deep features using vector quantization.Recently, Babenko and Lempitsky [4] introduced Additive Quantization (AQ),a generalization of PQ that retains its compositional nature, but is able to handlesubcodebooks of the same dimensionality as the input vectors. With a few caveats,AQ can also be used for fast approximate nearest neighbour search, and can beapplied to fast dot-product evaluation in a straightforward manner. Moreover, AQwas shown to consistently achieve lower quantization error than PQ, motivatingmore research of similar methods. However, since in AQ the codebooks are nolonger pairwise orthogonal (i.e., no longer independent), encoding cannot be doneindependently in each subspace. In fact, the encoding problem was demonstratedto be equivalent to inference on a fully-connected pairwise Markov Random Field,2which is well-known to be NP-hard [9]. In [4], beam search was proposed asa solution to this problem, but this results in very slow encoding, which greatlylimits the scalability of the proposed solution.In this work, we first analyze PQ and AQ as compositional quantizers, under aframework that makes the simplifying assumptions of PQ with respect to AQ ratherevident. We then investigate the computational complexity implications resultingfrom the differences between AQ and PQ, and finally derive an intermediate ap-proach that retains the expressive power of AQ, while being only slightly slowerthan PQ. We achieve a modest gain in performance (e.g., 3-4 % better retrievalrecall) when evaluating our approach on datasets of SIFT and GIST descriptors.However, we demonstrate large improvements, state-of-the-art compression ratesand excellent scalability when benchmarking on datasets of features obtained fromdeep convolutional neural networks, outperforming all our competitors in the 64-128 bit range. For example, when compressing 128-dimensional deep features,our approach achieves up to 15% better recall, and 7% better classification ratesin the ILSVRC 2012 dataset. We find these results particularly encouraging, sincefeatures obtained from deep networks are likely to replace hand-crafted features inthe foreseeable future.In summary, our approach can be seen as an improvement upon PQ and AQ,and in particular compares favourably to AQ in 3 ways: (i) it consistently achievessimilar or lower quantization error (and therefore, lower error than PQ), (ii) it isseveral orders of magnitude faster and (iii), it is also simpler to implement.Chapters 1-4 of this thesis present an extended version of work that has beensubmitted for publication to the IEEE International Conference on Computer Vi-sion and Pattern Recognition (CVPR) 2015 as Martinez, J., Hoos, H. H., & Little,J. J.: Stacked Quantizers for Compositional Vector Compression. This includesan analysis of the computational trade-offs of PQ and AQ, and the derivation ofStacked Quantizers, as well as experimental results on quantization error, nearestneighbour search and large-scale classification with compressed features. Chap-ters 5 and 6 present additional, original and unpublished work. In Chapter 5, wequalitatively explore the noise introduced by quantization of deep features, visuallyinspecting the nearest-neighbours of several image queries. 
Finally, in Chapter 6we present a preliminary study of local search for encoding.3Chapter 2Related workWe build upon recent work on compositional vector quantization, with a focus onthe compression of features obtained from deep convolutional neural networks.First, we review recent work that has, to a limited extent, investigated the com-pression of deep features applied to retrieval and classification. We then review,in detail, compositional vector quantization methods typically used in computervision applications.2.1 Deep features and their compressionIn 2012, Krishevsky et al. [20] demonstrated impressive performance on imageclassification, outperforming all their competitors by nearly 10 points on the Im-age Large Scale Visual Recognition Challenge (ILSVRC). Their method, based ondeep convolutional neural networks trained on raw RGB values, greatly departedfrom the main classification pipelines used in the computer vision community at thetime, which consisted mainly of local feature extraction and aggregation in high-dimensional vectors that attempted to capture the statistics of object categories.The ground-breaking result of Krizhevsky et al. [20] has spawned large amountsof work on the application of deep features to other visual tasks such as classifica-tion [20, 31], detection [13, 15] and retrieval [5, 8], demonstrating state-of-the-artperformance on every task. In spite of the growing popularity of deep features,we find that no previous work has rigorously addressed their compression. Thus,4the question what is the best way to compress deep learning features? remainslargely open. The question is also rather elusive, as more research into deep convo-lutional neural networks keeps producing more complex and powerful descriptors,and it seems likely that this trend will continue for at least a few years. In thiswork, we obtained deep features from networks of 7 layers, 5 convolutional and 2fully connected, which follow the architecture of Krishevsky et al. [20] and yieldedstate-of-the-art classification results as of July 2014. We note, however, that recentwork [29, 31] has shown that up to 19 convolutional layers yield features with sig-nificantly better performance (half the error rate of the network from [20]). Weexpect our work to generalize well to these new features, but their thorough evalu-ation is left for future work.In fairness, several papers have considered compressed deep features as part ofclassification or retrieval pipelines, observing rather graceful degradations in per-formance. Agrawal et al. [1] binarized deep features for classification by simplythresholding at 0, resulting in 4096-bit long descriptors. Babenko et al. [5] experi-mented with PCA and discriminative dimensionality reduction down to 16 dimen-sions for image retrieval. Finally, Chatfield et al. [8] experimented with intra-netcompression for final descriptors of 128, 1024 and 2048 dimensions. They latercompressed the 128-dimensional deep features using PQ to 128 and 256 bits, andexperimented with binarization in the 1024-2048 bit range for retrieval. These pa-pers, however, did not focus on investigating state-of-the-art compression methodson deep features; rather, they simply used compression in their pipelines. To thebest of our knowledge, we are the first to thoroughly study the compression of deepfeatures through vector quantization.We now introduce some notation mostly following [23]. 
We review the vec-tor quantization problem, the scalability approaches proposed by PQ and AQ, anddiscuss their advantages and disadvantages.52.2 Vector quantizationGiven a set of vectors X = {x1,x2, . . . ,xn}, the objective of vector quantization isto minimize the quantization error, i.e., to determineminC,b1n ∑x∈X‖x−Cb‖22, (2.1)where C ∈ Rd×k contains k cluster centers, and b ∈ {0,1}k is subject to the con-straints ‖b‖0 = 1 and ‖b‖1 = 1. That is, b may only index into one entry of C. Cis usually referred to as a codebook, and b is called a code.If we let X = [x1,x2, . . . ,xn] ∈ Rd×n contain all the x ∈X , and similarly letB = [b1,b2, . . . ,bn]∈ {0,1}k×n contain all the codes, the problem can be expressedmore succinctly as determiningminC,B1n‖X−CB‖22. (2.2)Without further constraints, one may solve expression 2.2 using the k-meansalgorithm, which alternatively solves for B (typically exhaustively computing thedistance to the k clusters in C for each point in X) and C (finding the mean ofeach cluster) until convergence. The performance of k-means is better as the sizeof the codebook, k, grows larger but, unfortunately, the algorithm is infeasible forlarge codebook sizes (for example, k = 264 clusters would far exceed the memorycapacity of current machines). The challenge is thus to handle large codebooksthat achieve low quantization error while having low memory overhead.2.3 Compositional quantizationOne way of scaling the codebook size looks at compositional models, where smallersubcodebooks can be combined in different ways to potentially represent an expo-nential number of clusters. Compositional quantization can be formulated similarlyto k-means, but restricted to a series of constraints that introduce interesting com-putational trade-offs. The objective function of compositional quantization can be6expressed asminCi,bi1n ∑x∈X‖x−m∑iCibi‖22, (2.3)that is, the vector x can be approximated not only by a single codeword indexedby its code b, but by the addition of its encodings in a series of codebooks. Werefer to the Ci as subcodebooks, and similarly call the bi subcodes. We let eachsubcodebook contain h cluster centres: Ci ∈ Rd×h, and each subcode bi remainslimited to having only one non-zero entry: ‖bi‖0 = 1, ‖bi‖1 = 1. Since each bimay take a value in the range[1,2, . . . ,h], and there are m subcodes, the resultingnumber of possible cluster combinations is equal to hm, i.e., superlinear in m. Nowwe can more succinctly write expression 2.3 asminC,B‖X−CB‖22 = minCi,Bi‖X− [C1,C2, . . . ,Cm]B1B2...Bm‖22, (2.4)where Bi = [bi1,bi2, . . . ,bin] ∈ {0,1}h×n. As we will show next, AQ, PQ and Opti-mized Product Quantization (OPQ) [12, 23] belong to this family of models.2.4 Product quantizationPQ, introduced by Je´guo et al. [18] in 2011, can be formulated right away withEq. 2.4 under the constraint that all the subcodebooks be pairwise orthogonal [23]:∀i, j : i 6= j→C>i C j = 0h×h, (2.5)that is, C is block-diagonal [23]:7C = [C1,C2, . . . ,Cm] =D1 0 . . . 00 D2 0.... . ....0 0 . . . Dm, (2.6)where the entries Di ∈ R(d/m)×h are the only non-zero components of C. Thisconstraint assumes that the data in X was generated from a series of mutually in-dependent subspaces (those spanned by the subcodebooks Ci), which is rarely thecase in practice. There are, however, some advantages to this formulation.The subcodebook independence of PQ offers 3 main advantages,1. Fast learning. 
Under the orthogonality constraint we can efficiently learn thesubcodebooks Ci by independently running k-means on d/m dimensions. Thecomplexity of k-means is O(nkdi) for n datapoints, k cluster centres, d dimen-sions and i iterations. PQ solves m d/m-dimensional k-means problems withh cluster centres each, resulting in a complexity of O(mnh(d/m)i) = O(nhdi);i.e., training PQ is as complex as solving a k-means problem with h clustercentres.2. Fast encoding. Once training is done, the encoding of the database can also beperformed efficiently in O(nhd) (in line with k-means), which is essential forvery large databases.3. Fast distance computation. Distance computation between a query q and aencoded vector ∑mi=1Cibi is efficient because the subcodebooks are orthogonal,and therefore the total distance is equal to the sum of the distances in eachsubspace [18]‖q−m∑i=1Cibi‖22 =m∑i=1‖qi−Dibi‖22, (2.7)where q = [q1,q2, . . . ,qm], and qi ∈ Rd/m. These distances can be precom-puted for each query and quickly evaluated with m table lookups. This is called8Asymmetric Distance Computation in [18] and is the mechanism that makes PQattractive for fast approximate nearest neighbour search.2.5 Optimized product quantizationOne of the main disadvantages of PQ is that X is forced to fit in a model that as-sumes that the data was generated from statistically independent subspaces. Lowerquantization error can be achieved if more degrees of freedom are added to themodel. In particular, since rotation is a distance-preserving operation, it seems nat-ural to experiment with codebook rotations that minimize quantization error. InOPQ, the objective function becomes [23]minR,C,B1n‖X−RCB‖22, (2.8)where C and B are expanded as in Eq. 2.4, and R belongs to the Special Orthog-onal Group SO(d). In this sense, PQ is a special case of OPQ where R is thed-dimensional identity matrix: R = Id . Later work by Je´gou et al. [17] notedthe importance of this additional parameter; however, the optimization of R wasdeemed “not tractable”, and using a random orthogonal matrix was suggested. In2013, independently, Ge et al. [12] and Norouzi & Fleet [23] proposed an iterativemethod similar to Iterative Quantization [14] that optimizes R in expression 2.8.We notice, however, that the orthogonality constraint is maintained from PQ toOPQ.Lower quantization error can be achieved if the independence assumption is notenforced, at the cost of more complex encoding and distance computation. Thesetrade-offs were first introduced in [4] and called Additive Quantization (AQ). Next,we briefly review AQ.2.6 Additive quantizationIn AQ, the subspaces spanned by the subcodebooks Ci are not mutually orthogo-nal (i.e., not mutually independent). Formally, and although not explicitly statedin [4], AQ solves the formulation of Eq. 2.3 without any further constraints. This9makes of AQ a strictly more general model than PQ/OPQ. However, this addedexpressiveness comes at a cost.The subcodebook dependence of AQ comes with 3 main disadvantages withrespect to PQ/OPQ,1. Distance computation. The distance between a query q and a encoded vector∑mi=1Cibi cannot be computed with m table lookups. 
However, it can be foundusing the identity‖q−m∑i=1Cibi‖22 = ‖q‖22−m∑i=12〈q,Cibi〉+‖m∑i=1Cibi‖22 (2.9)where the first term is a constant and does not affect the query ranking; thesecond term can be precomputed and stored for fast evaluation with m tablelookups, and the third term can either be precomputed and quantized for eachvector in the database (at an additional memory cost), or can be computed onthe fly as‖m∑i=1Cibi‖22 =m∑i‖Cibi‖22 +2m∑i6= j〈Cibi,C jb j〉 (2.10)where the terms can also be precomputed and retrieved in m table lookups.Thus, AQ has either a time (2m vs. m lookups) or memory overhead (for stor-ing the quantized result of Eq. 2.10) during distance computation with respectto PQ. Although this may sound as a major problem for AQ, it was shownin [4] that sometimes the distortion error gain can be high enough that allocat-ing memory from the code budget to store the result of Eq. 2.10 results in betterrecall and faster distance computation compared to PQ/OPQ. This motivates usto look for better solutions to the AQ formulation.2. Encoding. For a given set of subcodebooks Ci and a vector x, encodingamounts to choosing the optimal set of codes bi that minimize quantization error‖x−∑mi=1Cibi‖22. Unfortunately, without the orthogonality constraint the choice10of bi cannot be made independently in each subcodebook. This means that, inorder to guarantee optimality, the search for the best encoding must be doneover a combinatorial space of codewords. Moreover, it was shown in [4] thatthis problem is equivalent to inference on a fully connected pairwise MarkovRandom Field, which is well-known to be NP-hard [9].Since brute force search is not possible, one must settle for a heuristic searchmethod. Beam search was proposed as a solution in [4], resulting in rather slowencoding. Beam search is done in m iterations. At iteration i the distance iscomputed from each of the b candidate solutions to the set of k ·(m− i) plausiblecandidates (in the m− i codebooks that have not contributed to the candidatesolution). At the end of the iteration we have b2 candidate solutions, fromwhich the top b are kept as seeds for the next iteration [4]. The complexityof this process is O(m2bhd) = O(m2bhd), where b is the search depth. Aswe will show, this makes the original solution of AQ impractical for very largedatabases.3. Training. Training consists of learning the subcodebooks Ci and subcode-book assignments bi that minimize expression 2.3. A typical approach is to usecoordinate descent by fixing the subcodebooks Ci while updating the codes bi(encoding), and later fixing bi while updating Ci (codebook update). As a sideeffect of slow encoding, we find that training is also very slow in AQ. Whilethis might seem as a minor weakness of AQ (since training is usually done off-line, without tight time constraints), having faster training also means that for afixed time budget we can handle larger amounts of training data. In the quanti-zation setting, this means that we can use a larger sample to better capture theunderlying distribution of the database.In [4], codebook update is done by solving the over-constrained least-squaresproblem that arises from Eq. 2.4 when holding B fixed and solving for C. For-tunately, this decomposes into d independent subproblems of n equations overmh variables [4]. This corresponds to an optimal codebook update in the leastsquares sense. 
We find that compared to encoding this step is rather fast, andthus focus on speeding up encoding.11Chapter 3Stacked quantizersWe now introduce our proposed approach to compositional quantization.Within the subcodebook dependence-independence framework introduced inChapter 2, we can see that PQ and OPQ assume subcodebook independence, whileAQ embraces the dependence and tries to solve a more complex problem. As wewill show next, there is a fertile middle ground between these approaches. Wepropose a hierarchical assumption, which has the advantage of being fast to solvewhile maintaining the expressive power of AQ (see Figure 3.1).Design goals. Due to the superior performance of AQ, we want to maintain itskey property: subcodebook dependence. However, we look for a representationthat can compete with PQ in terms of fast training and good scalability, for whichfast encoding is essential. We propose to use a hierarchy of quantizers (see Fig-ure 3.2, left), where the vector is sequentially compressed in a coarse-to-fine man-ner.3.1 EncodingFast encoding is at the heart of our approach. We assume that the subcodebooksCi have a hierarchical structure, where C1 gives the coarsest quantization and CMthe finest. Encoding is done greedily. In the first step, we choose the code b1 thatmost minimizes the quantization error ‖x−C1b1‖22. Since all the subcodebooks12a) PQ / OPQ b) AQ c) SQ (ours)Figure 3.1: A graphical model representation of codebook learning that high-lights the assumptions of PQ/OPQ, AQ, and our approach. Here weshow 4 subcodebooks: a) PQ and OPQ assume subcodebook indepen-dence; b) AQ attempts to solve the fully connected problem; c) we as-sume a hierarchical structure.are small, the search for b1 can be done exhaustively (as in k-means). Note that inFigure 3.1 this corresponds to solving for the assignment to the top-level codebookwhich, according to our model, is assumed to have no dependencies.Next, we compute the first residual r1 = x−C1b1. We now quantize r1 usingthe codewords in C2, choosing the one that minimizes the quantization error ‖r1−C2b2‖22. Again, in Figure 3.1, this corresponds to solving for the assignment tothe second codebook, which only depends on the top one and is now fixed. Thisprocess is repeated until we run out of codebooks to quantize residuals, with thelast residual rm being equal to the total quantization error (see Figure 3.2, right).Now it is clear that we satisfy our first desired property, as the representation isadditive in the encodings: x≈ ∑mi=1Cibi, and the codewords all are d-dimensional(i.e., not independent of each other).The complexity of this step is O(mhd) for m subcodebooks, each having hsubcodewords, and a vector of dimensionality d. This corresponds to a slight in-crease in computation with respect to PQ (O(hd)), but is much faster than AQ13Quantizer 1Quantizer 3Quantizer 41)3)2)4)r 1 = x C 1 b 1r2 = r 1 C2b2r3 = r2 C3b3r 4 = r3 C 4 b 4r 1r2r3 r 4C 4 b 4C 1 b 1C2b2C3b3x ⇡ C 1 b 1 + C2b2 + C3b3 + C 4 b 4x = C 1 b 1 + C2b2 + C3b3 + C 4 b 4 + r 4Quantizer 2xx xxxFigure 3.2: Encoding as performed with Stacked Quantizers, shown for 4subcodebooks. Left: the vector is passed through a series of quantiz-ers, with residuals further encoded down the line. Right: A geometricinterpretation of our approach. After recursively encoding residuals, therepresentation is additive in the encodings, and the quantization error isthe remaining residual.(O(m3bhd)). 
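The following Python/NumPy sketch (illustrative only; names are ours) summarizes this greedy encoding procedure: each quantizer in the stack encodes the residual left by the previous one, and the reconstruction is simply the sum of the selected codewords.

```python
import numpy as np

def sq_encode(x, codebooks):
    """Greedy encoding with Stacked Quantizers (a sketch of Section 3.1).

    codebooks: list of m arrays, each of shape (h, d), ordered from the
    coarsest quantizer (C1) to the finest (Cm).
    Returns the m subcodes and the final residual (the quantization error).
    """
    residual = x.astype(np.float64)
    codes = []
    for C in codebooks:
        # exhaustive search over the h codewords of this subcodebook
        dists = np.sum((C - residual) ** 2, axis=1)
        k = int(np.argmin(dists))
        codes.append(k)
        residual = residual - C[k]   # pass the residual down to the next quantizer
    return codes, residual

def sq_reconstruct(codes, codebooks):
    """The approximation is additive in the encodings: x is approximated by
    the sum of the selected codewords, one per subcodebook."""
    return sum(C[k] for k, C in zip(codes, codebooks))
```

With h = 256 codewords per subcodebook, the m selected indices can be stored in one byte each, exactly as in PQ.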
Given that encoding is only slightly more expensive than PQ, we cansay that we have also achieved our second desired property.3.2 InitializationThe goal of initialization is to create a coarse-to-fine set of codebooks. This can beachieved by simply performing k-means on X , obtaining residuals by subtractingthe assigned codewords, and then performing k-means on the residuals until werun out of codebooks.Formally, in the first step we obtain C1 from the cluster centres computed by k-means on X , and we obtain residuals by subtracting R1 = X −C1B1. In the secondstep we obtain C2 from k-means on R1, and the residuals are refined to R2 = R1−C2B2. This process continues until we run out of codebooks (notice how this both isanalogous to, and naturally gives rise to, the fast encoding proposed before). By the14end of this initialization, we have an initial set of codebooks C = [C1,C2, . . . ,Cm]that have a hierarchical structure, and with which encoding can be performed in agreedy manner.The computational cost of this step is that of running k-means on n vectors mtimes, i.e., O(mnhdi) for subcodebooks of size h, dimensionality d and i k-meansiterations.3.3 Codebook refinementThe initial set of codebooks can be further optimized with coordinate descent. Thisstep is based on the observation that, during initialization, we assume that in orderto learn codebook Ci we only need to know codebooks C1,C2, . . . ,Ci−1. However,after initialization all the codebooks are fixed. This allows us to fine-tune eachcodebook given the value of the rest.Although it is tempting to use the least-squares-optimal codebook update pro-posed in [4], we have found that this tends to destroy the hierarchical subcodebookstructure resulting from initialization. Without a hierarchical structure encodingcannot be done fast, which is one of the key properties that we wish to maintain.We therefore propose an ad hoc codebook refinement technique that preserves thehierarchical structure in the codebooks.Let us define Xˆ as the approximation of X from its encodingXˆ =CB. (3.1)Now, let us define Xˆ−i as an approximation to the original dataset X obtainedusing the learned codebooks [C1,C2, . . . ,Cm] and codes B = [B>1 ,B>2 , . . . ,B>n ]> ,except for Ci, i.e.,Xˆ−i = Xˆ−CiBi. (3.2)We can now see that the optimal value of Ci given the rest of the codebook-sis obtained by running k-means on X −X i−1, i.e., the residual after removing thecontribution of the rest of the codebooks. Since we already know the cluster mem-bership to Ci (i.e., we know Bi) either from initialization or the previous iteration,15we need to update only the cluster centres instead of restarting k-means (similar tohow OPQ updates the codebooks given an updated rotation [12, 23]).Enforcing codebook hierarchy is of the essence. Therefore, we run our code-book update in a top-down manner. We first update C1 and update all codes. Next,we update C2 and update codes again. We repeat the process until we have updatedCm, followed by a final update of the codes. Updating the codes after each code-book update ensures that the codebook hierarchy is maintained. A round of updatesfrom codebooks 1 to m amounts to one iteration of our codebook refinement.The algorithm involves encoding using m codebooks in the first pass, m−1 inthe second pass, m− 2 in the third pass and so on until only one set of codes isupdated. This means that the time complexity of the codebook refinement proce-dure is quadratic in the number of codebooks. 
This is a significant increase withrespect to PQ/OPQ, which are linear in m during their training, and is comparedto the scaling of AQ. However, notice that the training usually has to be done onlyonce with a small data sample, and database encoding remains efficient. We willshow that this makes a huge difference for large datasets in Chapter 4.16Chapter 4ExperimentsIn this chapter, we evaluate Stacked Quantizers empirically, and compare our methodagainst previous work on compositional quantization. Our main interest is to re-duce quantization error, because it has been demonstrated to lead to better retrievalrecall, mean average precision and classification performance [4, 12, 18, 23]. Wealso demonstrate two applications of our method: (i) approximate nearest neigh-bour search and (ii) classification performance with compressed features. In allour experiments we use codebooks of size h = 256. This is particularly convenientbecause the codes can be stored in one byte each, which has been the standardpractice in previous work [4, 12, 18, 23]. This also means that when we refer toexperiments using 2, 4, 8 and 16 codebooks, we are using codes of 16, 32, 64 and128 bits per vector, respectively.4.1 Experimental setupDatasets. Initially, we test our method on 2 datasets, SIFT1M and GIST1M, in-troduced in [18]. SIFT1M consists of 128-dimensional SIFT [21] descriptors, andGIST1M consists of 960-dimensional GIST [24] descriptors. In SIFT1M 100 000vectors are given for training, 10 000 for query and 1 000 000 for the database. InGIST1M, 500 000 vectors are given for training, 5 000 for query and 1 000 000 forthe database.17Name Structured DimensionalitySIFT1M Yes 128GIST1M Yes 960ConvNet1M-128 No 128ConvNet1M-1024 No 1024ConvNet1M-2048 No 2048ConvNet1M-4096 No 4096Table 4.1: Summary of the datasets on which we test our approach. The twodatasets at the top of the table were introduced in [18] and are standardbenchmarks in the literature. The four datasets on the bottom are intro-duced in this work.Since hand-crafted features are consistently being replaced by features ob-tained from deep convolutional neural networks, we also consider a series of datasetsof deep features. We call these datasets ConvNet1M-128, ConvNet1M-1024, ConvNet1M-2048 and ConvNet1M-4096. We obtained these datasets by computing deep learn-ing features on the ILSVRC-2012 training dataset [10], using the CNN-M-128,CNN-M-1024, CNN-M-2048 and CNN-M networks provided by Chatfield et al. [7],and then subsampling equally at random from all classes – sampling was done in-dependently for each dataset. This networks follows the architecture proposed byZeiler and Fergus [33], with the exception that, for CNN-M-{128, 1024 and 2048},the last fully-connected layer was reduced from 4096 to 128, 1024 and 2048 unitsrespectively. It has been shown that this intra-net compression has a minimal effecton classification performance [7] and exhibits state-of-the-art accuracy on imageretrieval [8]. In these 4 datasets, we set aside 100 000 vectors for training, 10 000for query and 1 000 000 for database. See Table 4.1 for a summary of the datasetsused in our experiments.Baselines. We compare against 3 baselines. The first one is AQ as described byBabenko and Lempitsky [4], which consists of beam search for encoding and aleast-squares codebook update in an iterative manner. As in [4], we set the beamsearch depth b to 16 during training and to 64 for the database encoding. 
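For reference, the following simplified Python/NumPy sketch (ours, not the reference implementation of [4]; it omits optimizations such as pruning duplicate candidates) illustrates the kind of beam search this baseline uses to encode a single vector.

```python
import numpy as np

def aq_beam_search_encode(x, codebooks, beam_width=16):
    """Simplified beam-search encoding of one vector, in the spirit of AQ [4].

    codebooks: list of m arrays, each of shape (h, d). A full solution picks
    exactly one codeword from every codebook; at each step we keep only the
    `beam_width` best partial solutions.
    """
    m = len(codebooks)
    # each candidate: (remaining residual, {codebook index -> codeword index})
    candidates = [(x.astype(np.float64), {})]
    for _ in range(m):
        expansions = []
        for residual, assignment in candidates:
            for ci, C in enumerate(codebooks):
                if ci in assignment:        # each codebook contributes exactly once
                    continue
                dists = np.sum((C - residual) ** 2, axis=1)
                # expand only the most promising codewords of this codebook
                for k in np.argsort(dists)[:beam_width]:
                    new_assignment = dict(assignment)
                    new_assignment[ci] = int(k)
                    expansions.append((residual - C[k], new_assignment))
        # keep the beam_width partial solutions with the lowest remaining error
        expansions.sort(key=lambda cand: float(np.dot(cand[0], cand[0])))
        candidates = expansions[:beam_width]
    _, best = candidates[0]
    return [best[i] for i in range(m)]
```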
Al-18though [4] does not mention the number of iterations used during training, wefound that 10 iterations reproduce the results reported by the authors and, as wewill show, this is already several orders of magnitude slower than our approach.For code lengths of 64 and 128 bits (8 and 16 codebooks respectively) we use thehybrid APQ algorithm suggested in [4], where the dataset is first preprocessed withOPQ, and then groups of 4 subcodebooks are refined independently with AQ. APQwas proposed for practical reasons, as otherwise AQ would require several days tocomplete given more than 4 subcodebooks: the need for this approximation startsto show the poor scalability of AQ. Since no code for AQ is available, we wroteour own implementation and incorporated the optimizations suggested in [4].The second baseline is Optimized Product Quantization [12, 23], which wasbriefly introduced in Chapter 2. We use the publicly available implementation byNorouzi & Fleet1, and set the number of optimization iterations to 100. The thirdbaseline is Product Quantization [18]. We slightly modified the OPQ code to createthis baseline. We also use 100 iterations in PQ.1https://github.com/norouzi/ckmeans192 4 8 160.511.522.533.544.55 x 104 SIFT1MNumber of codebooksQuantization error PQOPQAQ/APQSQ2 4 8 160.50.60.70.80.911.11.21.3 GIST1MNumber of codebooksQuantization error PQOPQAQ/APQSQ2 4 8 160.050.10.150.20.250.3 ConvNet1M−128Number of codebooksQuantization error PQOPQAQ/APQSQFigure 4.1: Quantization error on SIFT1M and GIST1M. SQ shows a modestadvantage on structured features.4.2 Quantization errorOur quantization results for SIFT1M and GIST1M are shown in Figure 4.1. InSIFT1M, which features short structured vectors, we observe performance com-petitive with AQ/APQ. This is already good news, given the better scalability ofour method. In GIST1M, OPQ achieves a large gain compared to PQ, and thisgap is only slightly improved by AQ and SQ. We speculate that, due to the highly-structured nature of GIST and its high dimensionality, OPQ is able to find a rotationthat makes the distribution orthogonal in the subspaces spanned by the codebooksof OPQ. Therefore, there is little performance to gain with non-orthogonal code-books. We do not investigate this hypothesis further here, but direct the reader tothe recent work of Heo et al. [16], who have shown promising results on GISTdescriptors, reducing both the bias and the variance of the quantization error.We now benchmark our method on the 128-, 1024-, 2048- and 4096-dimensionaldeep features from our ConvNet1M datasets. The quantization results on thesedatasets are shown in Figure 4.2. As expected, due to the unstructured nature ofdeep features, we observe that AQ and SQ maintain a large advantage over PQ andOPQ, particularly for the small 128-dimensional features. Moreover, our methodremains the clear winner for 8 and 16 codebooks, and largely competitive with202 4 8 160.050.10.150.20.250.3ConvNet1M−128Number of codebooksQuantization error PQOPQAQ/APQSQ2 4 8 160.20.250.30.350.40.450.5ConvNet1M−1024Number of codebooksQuantization error PQOPQAQ/APQSQ2 4 8 160.20.250.30.350.40.450.5ConvNet1M−2048Number of codebooksQuantization error PQOPQAQ/APQSQ2 4 8 160.250.30.350.40.450.50.550.6ConvNet1M−4096Number of codebooksQuantization error PQOPQAQ/APQSQFigure 4.2: Quantization error on deep features with different dimensionali-ties. The non-independent approaches AQ and SQ clearly outperformPQ and OPQ. 
SQ achieves the best performance when using 8 and 16codebooks (64 and 128 bits per feature) in all cases.21AQ for 4 codebooks. These results suggest that codebook independence hurts thecompression of deep features particularly badly and motivates more research ofcompositional quantization methods that follow the formulation of expression 2.3.4.3 Approximate nearest neighbour searchWe demonstrate the performance of our method on fast search of K nearest neigh-bours with recall@N curves [18]. These curves represent the probability of the trueK nearest neighbours being in a retrieved list of N neighbours for varying N. Weset K = 1 and observe little variability for other values.Our retrieval results for SIFT1M, GIST1M and ConvNet1M-128 are shown onFigure 4.3; in these relatively low-dimensional datasets, we also benchmark ourmethod against AQ/APQ. However, when benchmarking on higher-dimensionaldeep features, we do not run AQ/APQ due to its high computational cost – themethod would take several days to train and then encode the base datasets. Wefocus on the results for deep features in the next Figures, showing our performancefor ConvNet1M-128 and ConvNet-1024 on Figure 4.4, and for ConvNet1M-2048and ConvNet-4096 on Figure 4.5.We confirm previous observations [4, 12, 23] that correlate quantization er-ror with nearest neighbour search performance. Our method shows performancewithin 3-4% of AQ on SIFT1M, consistently outperforming PQ and OPQ. OnGIST1M, OPQ achieves a very competitive performance, which is barely improvedby AQ and SQ.As expected, our methods shows its best performance when evaluated on datasetsof deep convolutional features: SQ largely outperforms PQ and OPQ on all theConvNet datasets when using 32 and 64 bits. When using 32 bits, the differ-ence in retrieval is as large as 15%, and up to 10% for 64 bits. Curiously, inspite of the large difference in quantization error observed in the training datasetfor 128-bit codes, OPQ obtains a comparable retrieval performance to SQ on deep1024-, 2048- and 4096-dimensional features. We speculate that, in these very high-dimensional spaces, features are so sparsely-located that the order in which theyare retrieved is less affected by their quantization error. However, we would expect22100 101 102 10300.20.40.60.81NRecallSIFT1M − 32 bits PQOPQAQSQ100 101 102 103 10400.20.40.60.81NRecallGIST1M − 32 bits PQOPQAQSQ100 101 102 103 10400.20.40.60.81NRecallConvNet1M−128 − 32 bits PQOPQAQSQ100 101 102 1030.20.40.60.81NRecallSIFT1M − 64 bits PQOPQAQSQ100 101 102 103 10400.20.40.60.81NRecallGIST1M − 64 bits PQOPQAQSQ100 101 102 103 10400.20.40.60.81NRecallConvNet1M−128 − 64 bits PQOPQAQSQ100 101 102 1030.40.50.60.70.80.91NRecallSIFT1M − 128 bits PQOPQAQSQ100 101 102 103 10400.20.40.60.81NRecallGIST1M − 128 bits PQOPQAQSQ100 101 102 103 1040.20.40.60.81NRecallConvNet1M−128 − 128 bits PQOPQAQSQFigure 4.3: Recall@N curves on the SIFT1M and GIST1M datasets.our method to maintain its performance advantage for larger datasets. Moreover,the best retrieval performance is always observed on the 128-dimensional deepfeatures, where our methods consistently outperforms all its competitors by∼5%.4.4 Large-scale object classificationWe study the trade-off in classification performance vs. compression rate on theILSVRC-2012 dataset using deep learning features. We trained a linear SVM onthe 1.2 million uncompressed examples provided, and preprocessed the featureswith L2 normalization, which was found to improve performance in [7]. 
The 50 000 images in the validation set were preprocessed similarly and compressed before evaluation. This scenario is particularly useful when one wants to search for objects in large unlabelled datasets [2, 6], and in retrieval scenarios where classifiers are applied to large collections of images in search for high scores [8, 28].

Figure 4.4: Recall@N curves on the ConvNet1M-128 and ConvNet1M-1024 datasets.

Figure 4.5: Recall@N curves on the ConvNet1M-2048 and ConvNet1M-4096 datasets.

Figure 4.6: Top-5 classification error on the ILSVRC-2012 dataset as a function of compression. The dotted black line corresponds to performance without compression. The left pane shows performance using 128-dimensional deep features, and the right pane shows performance for 1024-dimensional deep features.

Notice that in this scenario, the only operation needed between the support vectors and the database descriptors is a dot product; as opposed to distance computation, this can be done with m lookups in AQ and SQ, the same as for PQ and OPQ. We report the classification error taking into account the top 5 predictions.

Classification results are shown in Figure 4.6. We observe a similar trend to that seen in our quantization results, with PQ and OPQ consistently outperformed by AQ and SQ. Using 128-dimensional features our method performs similarly to AQ with 4 codebooks, but shows better performance for larger code sizes. Using 1024-dimensional features AQ and SQ are practically equivalent but, curiously, the 128-dimensional features appear more amenable to compression: for all compression rates the 128-dimensional features outperform the 1024-dimensional features ([0.2646,0.2293,0.2101] vs.
[0.2917,0.2562,0.2246] in top-5 error), eventhough when uncompressed the 1024-dimensional features perform slightly better(0.1999 vs. 0.1893). This suggests that, if quantization is planned as part of alarge-scale classification pipeline, low-dimensional features should be preferred260 2000 4000 6000 8000 100000.050.10.150.20.250.3SecondsQuantization error ← refinement startsPQOPQAPQSQ100101102103104105QuantizerSeconds PQOPQSQAPQFigure 4.7: Left: Training time vs. Quantization error of the benchmarkedmethods on the ConvNet1M-128 training dataset (100K features). Forclarity, we plot each 50 iterations for PQ and OPQ and each 25 itera-tions for SQ after initialization. PQ and OPQ complete 100 iterationsafter 286 and 336 seconds respectively (4.8 and 5.6 minutes), SQ takes∼1500 seconds for initialization and ∼1000 seconds for 100 iterationsof codebook refinement (42 minutes in total). APQ takes ∼2.7 hoursfor 10 iterations. Right: Encoding of the database set of 1M features.PQ and OPQ take ∼5 seconds, SQ ∼20 seconds, and APQ ∼9.2 hours.over high-dimensional ones. It is also noticeable that for extreme compressionrates (e.g., 32 bits) PQ and OPQ have error rates in the 35-45% range, while AQand SQ degrade more gracefully and maintain a 25-30% error rate.4.5 Running timesFigure 4.7 shows the running time for training and database encoding for PQ/OPQ,APQ and SQ on the ConvNet1M-128 dataset using 8 codebooks (64 bits). All mea-surements were taken on a machine with a 3.20 GHz processor using a single core.We see that SQ obtains most of its performance advantage out of initialization, butcodebook refinement is still responsible for a 20% decrease to the final quantiza-tion error (0.12 to 0.10). We also see that APQ largely improves upon its OPQinitialization, but these iterations are extremely expensive compared to PQ/OPQ,and 3 iterations take almost as much computation as the entire SQ optimization.27Moreover, encoding the database with the learned codebooks is extremely expen-sive with APQ (9.2 hours), while for PQ/OPQ and SQ it stays in the 5-20 secondrange. Projecting these numbers to the encoding of a dataset with 1 billion fea-tures such as SIFT1B [18] suggests that PQ/OPQ would need about 1.5 hours tocomplete, and SQ would need around 6 hours; however, APQ would need around1.05 years (!). Although all these methods are highly parallelizable, these numbershighlight the importance of fast encoding for good scalability.28Chapter 5Image retrievalIn Section 4.3 we empirically benchmarked our method on the task of fast ap-proximate nearest neighbour search. This is a task that has several applicationsin computer vision, such as structure from motion [30] and non-parametric clas-sification [32]. In this chapter, we investigate nearest-neighbour image retrievalin terms of semantic classification. In this scenario, we are not interested in mea-suring whether the retrieved neighbours are close in the Euclidean space. Rather,given an image, we want to quantify how many of the retrieved neighbours belongto the same class. This can be evaluated with Precision@N curves, which indicatethe empirical probability that the top N retrieved images belong to the same class asthe query. Figure 5.1 shows our results on this task on the ILSVRC 2012 dataset.We used the same train and query partitions as in Chapter 4, and show resultsfor 64-bit codes from the ConvNet1M-128 dataset. 
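Under one common reading of this metric, Precision@N can be computed as in the following Python/NumPy sketch (illustrative only; array names are ours): for each query, count how many of its top-N neighbours share the query's class, divide by N, and average over all queries.

```python
import numpy as np

def precision_at_n(sorted_neighbour_labels, query_labels, max_n=100):
    """Average Precision@N for semantic retrieval.

    sorted_neighbour_labels: (num_queries, >= max_n) array with the class
    labels of each query's database neighbours, already sorted by the
    (compressed) distance to the query.
    query_labels: (num_queries,) array with the ground-truth class of each query.
    Returns one value for every N in 1..max_n: the fraction of the top-N
    neighbours that share the query's class, averaged over all queries.
    """
    hits = sorted_neighbour_labels[:, :max_n] == query_labels[:, None]
    # cumulative number of correct matches in the top N, divided by N
    precision = np.cumsum(hits, axis=1) / np.arange(1, max_n + 1)
    return precision.mean(axis=0)
```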
We observe that our methodachieves a consistent advantage of ∼2% over OPQ, ∼4% over PQ in precision forall values of N.5.1 Qualitative retrievalIn this section we show qualitative results of the previous task. Our goal is toobserve typical error cases, to get a visual sense of the noise introduced due toquantization error in image retrieval.We identify two types of errors: shape errors and part errors. Shape errors290 10 20 30 40 50 60 70 80 90 1000.50.510.520.530.540.550.560.570.580.590.6NPrecision PQOPQSQFigure 5.1: Precision@N curves on the ILSRVC 2012 dataset.occur when the shape of an object is highly discriminative, but fine details arelost due to compression artifacts. Part errors occur on objects that are made ofmany parts (e.g., a car that is made of lights, windows, tires, etc), and this causesconfusion with other object categories that share some of the parts (e.g., the tire ofa car might make it match a truck, or a motorcycle).Qualitative results are shown in Figures 5.2, 5.3 and 5.4. In the three figures,we show the query at the left, and the closest 5-nearest neighbours under differentcompression techniques. Incorrect matches are bordered with yellow. At the top,we show the neighbours returned when using SQ, then using OPQ, and finallyusing PQ. In Figure 5.2, we show examples of shape errors: as compression getsnoisier, we see soccer balls being matched to tennis balls or acorns, and snailsmatched to turtles. This suggests compression might make features focus more oncoarse features, but also more prone to lose fine details. In Figure 5.3, we showexamples of part errors: a set of binoculars is matched with photographic cameras,with which it shares a lens; a bib is matched to t-shirts, which also feature roundnecks, and a pickup truck is matched against racing and suburban cars. Finally, inFigure 5.4 we show examples that observe both shape and part errors; please referto the caption for further details.30PQOPQlemonSQPQOPQsoccer ballSQPQOPQsnailSQFigure 5.2: Retrieval errors introduced by quantization noise in natural im-ages. Top: the roundness of the snail is incorrectly matched to an oysterin retrieval with PQ and OPQ, and to a turtle in PQ. Middle: A soccerball is matched against other round objects such as an acorn and a ten-nis ball. Bottom: a lemon is matched against apples and a lime. Bestviewed digitally.31PQOPQbinoculars, field glasses, opera glassesSQPQOPQbibSQPQOPQpickup, pickup truckSQFigure 5.3: Typical retrieval errors in person-made objects. Top: Binocularsare incorrectly matched to photographic cameras. Middle: A bib isincorrectly matched to t-shirts. Bottom: A pickup truck is incorrectlymatched to cars. Best viewed digitally.32PQOPQmantis, mantidSQPQOPQSQFigure 5.4: Top: A mantis brings up grasshoppers, which share some struc-ture in the legs, but not the head. Bottom: A picture of a burrito isincorrectly matched with croissants and sandwiches. Best viewed digi-tally.33Chapter 6Encoding with local searchIn Chapter 2, we introduced a framework that focused on the different formula-tions that PQ, OPQ and AQ attempt to optimize, and later in Chapter 3 we deriveda method that maintains the formulation of AQ, but is focused on keeping fast en-coding, being only slightly more complex than PQ in this step. As we pointed out,our approach represents a middle ground between PQ and AQ.In a similar spirit, we can also examine AQ and SQ under a common frame-work, with the intent of exploring middle grounds between them. 
In a similar spirit, we can also examine AQ and SQ under a common framework, with the intent of exploring middle grounds between them. To do so, it is necessary to think about the design choices that characterize AQ and SQ.

6.1 Design choices in compositional quantization

There are three choices one must make when designing a compositional quantizer:

1. Initialization. The optimization process starts with a set of codebooks that will be refined subsequently. In AQ, these codebooks are obtained by first generating random codes Bi for the entries in the database, and then solving for the codebooks Ci via least-squares error minimization. In SQ, we opted for initializing the codebooks in a top-down manner, with the goal of creating a hierarchical structure that allows for fast encoding.

2. Codebook update. AQ and SQ also differ in the way that, for a given set of codes Bi, the codebooks Ci are updated. In AQ this is done via least-squares error minimization, resulting in an optimal update. In SQ, since our main goal is to preserve a codebook hierarchy (and with it fast encoding), we proposed a more localized approach, updating each codebook individually while keeping the rest fixed. For more details about this procedure, please refer to Section 3.3.

3. Encoding. Encoding is performed using a modified version of beam search in AQ, while in SQ it is done in a top-down manner, resulting in drastically different running times for the encoding of very large databases (see Section 4.5 for details). In this chapter, we are mainly interested in exploring approaches between sequential, greedy encoding (as performed in SQ) and more sophisticated search (such as the beam search of AQ).

For a summary of the differences between AQ and SQ, see Table 6.1.

Table 6.1: Summary of the design choices that make AQ and SQ different.

      Init     Codebook update   Encoding
  AQ  Random   Least-squares     Beam search
  SQ  Greedy   Local             Greedy

Looking at these design choices, it is natural to wonder how important each one is. For example, is greedy initialization better than initializing at random? Is local codebook update inherently better than a least-squares update? What is the performance of different combinations of these approaches, such as a greedy initialization followed by least-squares codebook updates and greedy encoding? Determining the importance of these design choices, and the interplay between them, is our first goal. Our second goal is to investigate an intermediate approach between greedy encoding and beam search, to further determine the importance of search during encoding.

6.2 Iterative best improvement search

We investigate local search as an intermediate approach between greedy encoding and beam search. Our method of choice is iterative best improvement search (IBI). In IBI, an initial code configuration is given (which we obtain with greedy encoding), all the neighbouring configurations are explored, and the best neighbour is selected as the new encoding. Since each code is composed of m subcodes, each of which indexes into a subcodebook, we define a neighbouring configuration as a change in a single subcode of the original configuration. For example, assuming subcodebooks of size h = 256, the code

B = [B1, B2, B3, B4] = [008, 056, 021, 253]

has neighbours

[001, 056, 021, 253], [002, 056, 021, 253], [003, 056, 021, 253], ..., [256, 056, 021, 253],
[008, 001, 021, 253], [008, 002, 021, 253], [008, 003, 021, 253], ..., [008, 256, 021, 253],
...
[008, 056, 021, 001], [008, 056, 021, 002], [008, 056, 021, 003], ..., [008, 056, 021, 256],

resulting in a total of h · m candidate configurations (of which m(h - 1) differ from the original code).
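A minimal sketch of one IBI pass over a single vector is shown below (NumPy-based; the function name, the array layout of the codebooks and the explicit loop are illustrative assumptions rather than the implementation benchmarked later). As in AQ and SQ, the approximation of a vector is the sum of one codeword per codebook:

import numpy as np

def ibi_step(x, code, codebooks):
    # x:         (d,) vector to encode.
    # code:      (m,) current subcodes, e.g., the greedy (SQ-style) encoding.
    # codebooks: list of m arrays, each of shape (h, d).
    # Returns the best neighbouring code (changing at most one subcode)
    # and its squared quantization error.
    m = len(codebooks)
    approx = sum(codebooks[i][code[i]] for i in range(m))
    best_code, best_err = code.copy(), np.sum((x - approx) ** 2)

    for i in range(m):
        # Residual once the contribution of codebook i is removed.
        residual = x - (approx - codebooks[i][code[i]])
        # Squared error of every candidate codeword in codebook i;
        # this evaluates all h neighbours that differ only in subcode i.
        errs = np.sum((residual[None, :] - codebooks[i]) ** 2, axis=1)
        j = int(np.argmin(errs))
        if errs[j] < best_err:
            best_err = errs[j]
            best_code = code.copy()
            best_code[i] = j

    return best_code, best_err

# Full IBI encoding repeats ibi_step until no neighbour improves the error.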
IBI explores more configurations than the greedy encoding of SQ, while remaining linear in m, and is cheaper than the quadratic cost (in m) incurred by beam search in AQ. Since IBI gives an upper bound on the performance of greedier approaches such as iterative first-improvement search, we focus solely on IBI.

6.3 Experimental setup

We can now build different compositional quantizers by making different design choices at each step. We explore the importance of initialization with random and greedy initializations, and of encoding with greedy or IBI search. The local codebook update of SQ was designed with the sole purpose of preserving the codebook hierarchy of SQ, so we exclude it from these experiments.

Competing quantizers. We compare four quantizers: RLG, RLI, GLG, and GLI. The first letter indicates the initialization: "R" for random and "G" for greedy. Codebook update is always done via least-squares error minimization, hence the "L" in the middle. The final letter indicates the encoding: "I" for IBI search and "G" for greedy encoding. When using greedy encoding we run the optimization for 50 iterations; when using IBI encoding we run 10 iterations.

Datasets. We test our quantizers on the SIFT1M and ConvNet1M-128 datasets, which showed the most promising results for SQ in the evaluations of Chapter 4.

Results and discussion. We report the quantization error achieved by these methods in Figure 6.1.

Figure 6.1: Quantization error for compositional quantizers with local search on the SIFT1M and ConvNet1M-128 datasets, as a function of the number of codebooks (2, 4, 8 and 16).

On SIFT1M, we observe that GLI (greedy initialization, least-squares codebook update and IBI encoding) outperforms the other local-search methods when using 32 and 64 bits (i.e., 4 and 8 codebooks). However, all the local-search methods achieve practically the same performance when using 128 bits, and in all cases they are outperformed by AQ and SQ. On ConvNet1M-128 we observe a different behaviour: all the local-search methods outperform AQ when using 64 and 128 bits, and achieve performance close to that of SQ for all codebook sizes. Moreover, while the approaches with greedy initialization perform best on SIFT1M, on ConvNet1M-128 we see quite the opposite. These mixed results make it hard to draw conclusions on the importance of initialization, and we suggest more research to further clarify this question.

It is notable that on ConvNet1M-128 AQ shows the worst performance across all methods. This suggests that OPQ is a poor initialization strategy for AQ on these features, and encourages more research into local-search methods for encoding deep features. Since we observe a different behaviour on SIFT1M, we believe it is too early to comment on the general effectiveness of local search for encoding, and rather suggest more experimentation with more datasets and more iterations.
Chapter 7
Conclusions

We have introduced Stacked Quantizers as an effective and efficient approach to compositional vector compression. After analyzing PQ and AQ in terms of their codebook assumptions, we derived a method that combines the best of both worlds, being only slightly more complex than PQ while maintaining the representational power of AQ. We have demonstrated state-of-the-art performance on datasets of SIFT, GIST and, perhaps most importantly, deep convolutional features. This final result is particularly strong, as we achieve results comparable to PQ and OPQ while requiring only half the memory of these approaches. We believe this matters because deep features are likely to replace hand-crafted features in the foreseeable future.

We have also characterized the qualitative results of image retrieval under different compression schemes, and quantified image-retrieval precision using PQ, OPQ and our method. We also presented a preliminary investigation into the use of local search for encoding, in an attempt to explore the performance of a middle ground between AQ and SQ.

Future work should look at the integration of SQ with non-exhaustive indexing techniques such as the inverted file [18] or the inverted multi-index [3]. We would also like to try other optimization approaches that have proven fruitful in network-like architectures, such as stochastic gradient descent and conjugate gradient.

Bibliography

[1] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In European Conference on Computer Vision (ECCV), pages 329–344, 2014.

[2] R. Arandjelovic and A. Zisserman. Multiple queries for large scale specific object retrieval. In British Machine Vision Conference (BMVC), pages 1–11, 2012.

[3] A. Babenko and V. Lempitsky. The inverted multi-index. In Computer Vision and Pattern Recognition (CVPR), pages 3069–3076, 2012.

[4] A. Babenko and V. Lempitsky. Additive quantization for extreme vector compression. In Computer Vision and Pattern Recognition (CVPR), pages 931–938, 2014.

[5] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In European Conference on Computer Vision (ECCV), pages 584–599, 2014.

[6] A. Bergamo, L. Torresani, and A. W. Fitzgibbon. Picodes: Learning a compact code for novel-category recognition. In Advances in Neural Information Processing Systems (NIPS), pages 2088–2096, 2011.

[7] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC), 2014.

[8] K. Chatfield, K. Simonyan, and A. Zisserman. Efficient on-the-fly category retrieval using convnets and GPUs. arXiv preprint arXiv:1407.4764, 2014.

[9] G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2):393–405, 1990.

[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.

[12] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(4):744–755, 2014.

[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2013.
[14] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(12):2916–2929, 2013.

[15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision (ECCV), pages 346–361, 2014.

[16] J.-P. Heo, Z. Lin, and S.-E. Yoon. Distance encoded product quantization. In Computer Vision and Pattern Recognition (CVPR), pages 744–755, 2014.

[17] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Computer Vision and Pattern Recognition (CVPR), pages 3304–3311, 2010.

[18] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(1):117–128, 2011.

[19] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.

[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.

[21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.

[22] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In Computer Vision Theory and Applications, 2009.

[23] M. Norouzi and D. J. Fleet. Cartesian k-means. In Computer Vision and Pattern Recognition (CVPR), pages 3017–3024, 2013.

[24] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 155:23–36, 2006.

[25] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision (ECCV), pages 143–156, 2010.

[26] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.

[27] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.

[28] J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In Computer Vision and Pattern Recognition (CVPR), pages 1665–1672, 2011.

[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[30] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. Transactions on Graphics (TOG), 25(3):835–846, 2006.

[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.

[32] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 30(11):1958–1970, 2008.
[33] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In European Conference on Computer Vision (ECCV), pages 818–833, 2014.