A N ANALYSIS OF MULTI-LEVEL FILTERING FOR HIGH DIMENSIONAL IMAGE DATA By Dominic Pok Man Tarn <, B. Sc. (Honour) Simon Fraser University , 1 9 9 4 A T H E S I S S U B M I T T E D I N P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F M A S T E R O F S C I E N C E in T H E F A C U L T Y O F G R A D U A T E S T U D I E S C O M P U T E R S C I E N C E We accept this thesis as conforming to the required standard, T H E U N I V E R S I T Y O F B R I T I S H C O L U M B I A Apri l , 1996 © Dominic Pok Man Tarn, 1996 In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I. agree .that the Library'shall make it freely available for reference and study. I further agree that permission "for. extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her -representatives. It is understood ;that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Computer Science . •" The University of British Columbia 2366 Main Mail < • : Vancouver, Canada . V6T 1Z4 " Date: '.m'iL 25,yiffi Abstract Image database systems are very useful in many applications. To design an effective image database system, high dimensional image feature vectors have to be extracted from the images automatically. Each comparison between them tends to be expensive, so sequen-tial comparisons are usually impractical. Moreover, the. traditional multi-dimensional indexing structures are incapable of handling these high-dimensional vectors efficiently. Thus, it has been proposed to abstract lower dimensional k-D vector from the original N-D feature vector, where k <C N. 2-level filtering is then used so that the k-D vector can fit in the indexing structure for coarse filtering and much fewer comparisons are needed between N-D vectors for the fine filtering stage. Unfortunately, both stages cannot be efficient at the same time. A major contribution of this thesis is to propose the idea of multi-level filtering by adding additional intermediate levels so that both the coarsest and finest filtering stages can be very efficient. Based on the cost models developed, the trends of 2-level and multi-level filterings are analysed and compared. The experimental evaluations further confirm that the 3-level filtering usually requires less C P U and I/O time than 2-level does. When compared with the previous approach of 2-level filtering, 3-levels can save from 15% to over 400% of time needed. Another contribution is to de-velop the optimizers which can find the (near) optimal configuration of 2-level and 3-level filterings for any image dataset. Experimental results show that in about 32 seconds, the developed optimizer can find a configuration whose total run-time per query exceeds that of the real optimal configuration by less than 2.5%. n Table of Contents Abstract ii List of Figures vii Acknowledgement viii 1 Introduction 1 1.1 Demand for Image Database Support •. 1 1.2 Image Features 2 1.2.1 Description of Image Content 2 1.2.2 Extraction of Image Features 2 1.2.3 Examples of Image Database Systems 3 1.3 High dimensionality of Image Feature Vectors 3 1.3.1 2-level Filtering Approach 4 1.4 Problem Definition and Thesis Contribution 7 1.4.1 . Problems 7 1.4.2 Multi-level filtering 8 1.5 Overview 10 2 Related Works 11 2.1 Systems Overiew 11 2.1.1 QBIC 11 2.1.2 Miyabi 12 iii 2.1.3 Photobook 12 2.2 Image Features and their Indexing . . 13 2.2.1 Colour Indexing • • • • 13 2.2.2 Shape and Texture 17 2.2.3 Eigenimages 19 2.3 Multi-dimensional Indexing Structure 20 2.3.1 K-D trees 21 2.3.2 K - D - B trees . 21 2.3.3 R-trees and its variants 22 3 Cost Models of Filtering 25 3.1 3-level Filtering 25 3.2 Motivations of Deriving Cost Model • • • • 26 3.3 A cost model of 2-level filtering 27 3.3.1 C P U cost of 2-level filtering 27 3.3.2 I/O cost of 2-level filtering 30 3.4 A cost model of 3-level filtering 31 3.4.1 C P U cost of 3-level filtering 31 3.4.2 I/O cost of 3-level filtering 31 3.5 A cost model of u-level filtering 32 . 3.5.1 C P U cost of w-level filtering 32 3.5.2 I/O cost of u-level filtering 32 4 Trend Analysis 34 4.1 Introduction . . . 34 4.2 With or without filtering: (4, N) vs (N) 35 4.3 Analysing the configuration (k,N) 35 iv 4.4 3-level vs 2-level filtering: (4, k, N) vs (k, N) at same k 39 4.5 3-level vs 2-level filtering: (4, k, N) vs (k, N) at their optimal k 41 4.6 4-level vs 3-level filtering: (4, klt k2, N) vs (4, fc, N) 46 5 Experimental Evaluation 49 5.1 Introduction 49 5.2 Details of Experiments 50 5.2.1 General Experimental Setup 50 5.2.2 Colour Histogram Experimental Setup 52 5.2.3 Eigenfaces Experimental Setup 55 5.3 Experimental Results 57 5.3.1 Colour Histograms . . . . . . . i 57 5.3.2 Eigenfaces *. 61 5.4 Effects of Different Values of Parameters 63 5.4.1 Page Size 63 5.4.2 C P U Speed 64 5.4.3 Indexing Structures . . . . 66 6 Design and Implementation of Optimizers 69 6.1 Introduction 69 6.2 Specifications of the Optimizer . . 70 6.2.1 Input to the Optimizer 70 6.2.2 Qualities of Service 70 6.2.3 Accuracy 71 6.3 Design of the Optimizer 72 6.3.1 A reversible 3-level Reference Optimizer 72 6.3.2 An Efficient 3-level Optimizer Using Sampling 75 v - 6.3.3 Selectivity Mix 78 6.3.4 A 2-level optimizer . 79 6.4 Experiments 80 6.4.1 3-level optimizer • 80 6.4.2 2-level Optimizer 86 7 Conclusions 89 7.1 Future Work 91 Bibliography 92 vi List of Figures 3.1 Definitions of symbols 29 4.2 Costs of 2-level-filtering with respect to A; 37 4.3 Trends of A a , A 2 and A i - A 2 40 4.4 Costs of 3-level filtering with respect to k 42 4.5 Total cost of 3-level and 2-level filtering 44 5.6 Relative efficiency with original histograms 58 5.7 Relative efficiency with transformed histograms 59 5.8 Relative efficiency with histograms from real images 60 5.9 Relative efficiency with eigenfaces 62 5.10 Effect of changing page size on 3-level filtering relative cost 63 5.11 Effect of changing C P U speed on multi-level filtering relative cost . . . . 65 5.12 Effects of changing indexing structure on 3-level and 2-level costs . . . . 67 6.13 Reversibility problem • • • ^4 6.14 Performance and accuracy of 3-level optimizer using different samplings in the second stage 83 6.15 Performance and accuracy of 3-level optimizer using different samplings in the first stage 84 6.16 Relative efficiency of different numbers w of selectivities . - 85 6.17 Performance and accuracy of 2-level optimizer using different samplings in the second stage 87 vii Acknowledgement First of all, I would like to thank Dr. Raymond Ng, my supervisor, for his ideas, time, and efforts. His support, patience and understanding are also invaluable to the completion of this thesis. Thanks, Raymond. I would also thank Dr. David Lowe, my second reader, for his time contributed to review this thesis and his constructive opinions. I am also grateful to Jack Snoeyink, my advisor, for his guidances in choosing courses when I was first admitted to the department. Special thanks go to Dr. Christos Faloutsos of the Department of Computer Science at the University of Maryland, College Park, and his students for allowing me to use their codes of the R-tree. I am equally grateful to Andishe Sedighian, my fellow student, for using her codes of K-D-tree. I would also like to thank Christopher Healey, my another fellow student, who teaches me about the color spaces. Many thanks also go to the professors in Simon Fraser University and City Polytechnic of Hong Kong. They let me know computer science is a very interesting area. I would also thank Dr. David Fracchia of Simon Fraser University. He was my supervisor during the summer research semester in 1993, and this experience encouraged me to continue research after bachelor graduation. I would especially like to express my appreciations to my parents and sister for their support and love. Most important of all, I thank Jesus Christ, my Lord and Saviour who leads me through all these years and let me reach this stage. This thesis would not be completed without His graces and loves to me. viii Chapter 1 Introduction 1.1 Demand for Image Database Support In the past few decades, databases have been an important research area as well as business tool. From the 1960's to 80's, almost all of the traditional database systems (such as relational database systems) were designed to support alpha-numeric data. However, the demand for image database support has become more prominent in recent years. Two often cited applications of image database systems are on-line museums and art galleries. The exhibits are made available as images so that users can browse through or find their required art works images. Real estate agents can use the traditional database systems to store different kinds of information, such as address, price, size, etc. However, the clients of the agents can find their favourite new homes easier if photographs of the houses can be searched on screen. Similarly, customers of a department store can find their favourite neck ties online, say, by specifying their favourite colours and/or patterns, and the system can retrieve the photographs of the ties which match these criteria. Finally, it is well documented that medical databases, scientific visualizations, engineering jobs and geographical information systems are four important application domains of image database systems[Jai93]. -1 Chapter 1. Introduction 2 1.2 Image Features 1.2.1 Description of Image Content If one wants to describe a certain car, he/she specifies the attributes of that particular car, such as its model, horsepower, dimensions, price and so on. In the same way, one specifies the attributes of an image in order to describe it. These attributes are called image features which are used to distinguish one image from others. Colour is one of the most important image features. A sky image can be instantly recognized as one with blue colour, and a sunset image is one with mixture of mostly red and yellow. The shapes of the objects in an image are also.very recognizable to human beings. For instance, a triangular shaped object can be easily spotted in an image containing Eiffel Tower. Texture is also a very useful image feature in some applications. Fashion designers can describe their favourite textile images by specifying the texture patterns, say, cross-hatched or stripped. Colour, shape and texture are three examples of image features which describe the syntactic properties of the image. These properties can be obtained directly from the pixels or group of pixels in the image. A n alternative method to describe an image is by the semantic objects of the image. For example, a user may ask for images which contain a car or a dog. 1.2.2 Extraction of Image Features A natural way to extract image features is to do so by human inspection. Every image in the database can be examined one by one, and its features can then be entered to the.system as the keywords of that particular image. Once the extraction of all images is completed, a user can then ask the system for the images according to the keywords of the content. Even though this approach is being used by some commercial image database systems, there are several serious flaws. Since the image database system may Chapter 1. Introduction 3 consist of tens of thousands of images (such as satellite images), specifying the contents manually one by one is too costly, if feasible at all. Even in the case where the database is sufficiently small, this method may not work because the specification of the keywords can be very subjective. For example, the database designer may consider it as a church, but the user may think that it is a chapel. It is therefore much more efficient and effective if the features can be extracted by the system itself automatically. 1.2.3 Examples of Image Database Systems It is an active research topic to develop image database systems which extract image features automatically. Query By Image Content (QBIC) [F+94] is a system developed by I B M to study the idea of querying an image database based on image content features such as colour, texture and shape of the whole or part of the image. The user can ask for all sunset images in the database by either directly describing the image features (e.g. 70% red and 30% yellow) or supplying a sample image (e.g. another similar sunset image). Miyabi system [H+93] is an image database system developed by N E C corporation. One of its strengths is the navigation guide provided by the system for the user throughout the searching process of a required image. For example, Miyabi allows the user to issue the next query by modifying one of the images returned as answer to the current query. Photobook [PPS94] is a system developed by the media lab of MIT. There are currently three subsystems of photobook, namely facebook for facial recognition, shape photobook, and texture photobook. More details of these systems will be discussed in Section 2.1. 1.3 High dimensionality of Image Feature Vectors Features of an image can be packed together in a vector called feature vector. It is generally preferable, if not necessary, to describe the image content by a large number of Chapter 1. Introduction 4 features. This implies that the high dimensionality of a feature vector is unavoidable in many cases. For instance, if there are not enough bins allocated for the spectrum in the red colour, different shades of red ( say, light and dark reds) may be treated as the same. Thus, colour histograms with 256 bins are used in [F+94]. In the same paper, twenty features are used to describe the image content based on shapes. Furthermore, a texture feature vector can have as many as 240 dimensions as in [A+95]. 1.3.1 2-level Filtering Approach Users of image database systems, unlike those of the relational database systems, do not ask for images that exactly match the given query. In an image database retrieval context, given a query feature vector Q, the user wishes to find all the images in the database —* —¥ whose vectors / are sufficiently close to Q. This often translates to the condition that the distance between / and Q, denoted by dist(Q,I), is within a threshold "radius" D.1 In other words, the answer set of query Q is the set IN = {T\dist(Q,I) < D) (1.1) where N is the dimension of the feature vectors. A naive approach to solving the above problem is certainly to compute dist(Q, I) for all the images stored in the database and compare the distances with threshold radius D. Even though this approach definitely returns the correct answers set Z/v, it is too inefficient, especially when the dimensionality of the feature vector is high or when the number of images stored in the database is large. In order to develop an efficient image database system, one solution is to insert the feature vectors of all images into an indexing structure. This technique is commonly *An alternative approach is to retrieve a certain specified number of closest images. Chapter 1. Introduction 5 used in traditional database systems to reduce the number of accesses and comparisons during searching. However, indexing structures which are used in traditional database systems (e.g. B-trees) usually support only one dimension index. There are a few multi-dimensional indexing structures, such as multi-dimensional binary search trees [Ben75], K-D-B trees [Rob81], R- trees [Gut84], R+-trees [SRF87], R*trees [B+90]. A common weakness among them is that they can only support data with fairly low dimension (e.g. < 20). Their searching performance degrades considerably when the dimension increases. This phenomenon is known as the "dimensionality curse". Because the dimensionality of image feature vectors can be more than 100, using existing multi-dimensional indexing structures is problematic. A 2-level filtering approach has been proposed in [F+94, SH93, SH94, A+95] to solve the above problem. In general, low dimensional vectors Iabs and Qabs are first abstracted —* —* from the original high dimensional image feature vectors I and Q respectively typically by a certain mathematical transformation technique which preserves most information. Query answering is broken up into two stages: coarse and fine filtering stages. During the coarse filtering stage, the distance between the low dimensional abstracted vector Q a b s of the query and those vectors Iabs of the images are measured. Only the abstracted vectors labs which are close enough to Q a b s (within threshold distance Dabs) can pass through the coarse filter. The coarse-level answer set, Iabs, is the set Tabs = {hbs | dist(Qabs, Iabs) < Dabs} (1-2) During the fine filtering stage, the original high dimensional image feature vectors / of the corresponding images in the coarse-level answer set Tabs are then compared with —* the query feature vector Q according to the original Equation(1.1). There are two important requirements of 2-level filtering. Chapter 1. Introduction 6 • Completeness. This requirement ensures that the final outcome after all stages of filtering must be exactly the same as those returned by direct comparison with full high dimensional feature vectors without using filtering. In other words, the correct image candidates cannot be excluded by the coarse filter (i.e. Xa\,s D X). In fact, whether the completeness requirement can be satisfied depends on the selection of threshold distances D and Dabs for both stages of filtering and the transformation which is used to abstract the low dimensional vector from the high dimensional vector . A family of this kind of transformation is distance preserving transforma-tions [F+94]. Examples of this family are Karhunen Loeve (KLT) , Discrete Fourier (DFT) and Discrete Cosine (DCT) transform. • Efficiency. Another essential requirement is efficiency. It means that the total-time spent on both the coarse and fine filtering stages has to be less than that on direct comparison by using the original full high dimensional vectors without filtering. In our application here, the filtering is likely to be efficient under the following conditions. 1. The cardinality of X a j s has to be sufficiently small so that there are only a few candidates left for the expensive high dimensional vector comparisons during the fine filtering stage. 2. If the dimension of Iabs vectors is low enough, they can be inserted into an . efficient multi-dimensional indexing structure. Therefore, only a small fraction of Iabs needs to be accessed and compared during the coarse filtering stage. Successful examples of applying 2-level filtering approach for high dimensional image data can be found in [F+94, SH93, SH94]. In [F+94], Faloutsos et al. recognize that Chapter 1. Introduction 7 measuring the distance between two colour histograms of 256 bins can,be very compu-tationally expensive. Thus, they propose to use the three average values of red, green and blue colour components of every image pixels as the 3-D abstracted feature vector in the coarse filtering stage. In [SH93, SH94], Sawhney and Hafner devise a technique which generalizes the coarse filter from 3 dimensions to any k dimensions. Therefore, the answer set returned by the coarse filtering stage can be represented as Ik-In the same paper [F+94], 2-level filtering approach is also used to match shape features. It uses 20-D shape feature vectors in the fine filtering stage. These vectors are abstracted to 2-D vectors by using the technique of Karhunen-Loeve transform (KLT) . In [A+95], a filtering approach is used to to match texture feature vectors. 1.4 Problem Definition and Thesis Contribution 1.4.1 Problems As can be seen in the previous section, the 2-level (coarse and fine) filtering is found to save time in many cases. Still, there are several questions which have to be addressed. 1. What is the optimal k in 2-level filtering? In [SH93, SH94], Sawhney and Hafner generalize the original 2-level colour his-togram filtering proposed by [F+94] to any coarse filtering dimension k. They also provide evidence showing that the filtering at k = 15 can outperform that at smaller values of k in terms of C P U time. This seems to suggest that the configuration with higher k needs less C P U time. Unfortunately, the dimensionality curse pro-hibits the dimension k from getting too highi It is still an unknown question as to how the overall performance is affected if the I/O time is also taken into account. This is an important issue because the dimension k determines the dimensionality of the indexing structure, and therefore it can also change the required I/O time. Chapter 1. Introduction 8 Therefore, the trend of the total cost as k increases is not obvious. Consequently, a framework is required to find the dimension k which gives the optimal total time of general 2-level filtering approach. 2. Can we perform better with 3-level filtering? If 2-level filtering works, can 3-level or 4-level work even better? Is there any limit of improving the performance by adding more levels? 3. How can we find the optimal configuration for filtering? Given an image database, it is important to find an optimal or near-optimal filtering configuration, be it 2-level, 3-level or 4-level. How can we find such a near-optimal configuration without spending ,too much time to test all the data? Can we design an optimizer that finds such a configuration? 1.4.2 M u l t i - l e v e l f i l t e r i n g In determining a good 2-level filtering configuration, we face with the following dilemma. If the dimension k of the coarse filtering stage is too low, there will be too many can-didates returned by the coarse filtering stage because the information abstracted by the low dimension feature vector is too little. Too much time will be spent on the high di-mensional image feature vector comparisons during the fine filtering stage. At the same time, the dimension k cannot get too high either because of the dimensionality curse of existing indexing structures. The crucial observation here is that the dimensionality of the indexing structure k is tied directly to the number of full high dimensional feature vector comparisons. The key idea examined in this thesis is multi-level filtering, which is designed precisely to detach these two quantities from one another. This allows us to enjoy the best of both worlds, namely keeping both the number of full high dimen-sional feature vector comparisons and the dimension of the indexing structure down. As Chapter 1. Introduction 9 the answer to second question in Section 1.4.1, the first contribution of this thesis is to analyse, develop and evaluate the idea of multi-level filtering. To be more specific, the idea of 3-level filtering is as follows. Full N-D feature vectors / are first extracted from the original images. They are then abstracted to a very low dimensional l-D vector indices /; where / <C N. I is selected to be 4 in this thesis. They can be stored in a l-D indexing structure. The high dimensional vectors are similarly reduced to k-D vectors where / < k <C N. Every k-D vector is stored in a flat file. The last level of filtering consists of the original full N-D image features which are also stored —* in another flat file. From a given query image, its high dimensional feature vector Q is —* —* • —* extracted. This vector Q is abstracted to Qi and Qk in the way same as described above. The filtering is done in three successive stages by comparing the feature vectors of query (Qh Qk and Q) and image (/;, Ik and / respectively) at the indexing structures and then the two flat file. Only the images which passes the previous coarser filtering stage will be passed to the current stage. The details of multi-level filtering will be described in Section 3.1 Another contribution of this thesis is the development of a cost model for evaluating 2- level filtering and multi-level filtering of high dimensional image data. This cost model allows us to estimate the I/O and C P U time spent on a query and therefore the overall optimal configurations of both filtering techniques. This cost model helps to answer the first two questions introduced in Section 1.4.1. The third contribution is the trend analysis of cost models of the 2-level and multi-level filtering. The trend analysis not only allows us to understand the performance trends of the 2-level and multi-level filtering, but also provides evidence showing how 3- level filtering typically costs less than 2-level filtering. Furthermore, this thesis also provides two sets of experimental results further confirming that 3-level filtering is more efficient than 2-level filtering. Chapter 1. Introduction 10 The final contribution is the design and implementation of an optimizer for choosing an optimal (or near-optimal) configuration of the filtering. Instead of processing all the image data in the whole database, the optimizer returns near-optimal configuration by considering only limited samples in database. The notion of near-optimal configuration requires a precise definition of optimality, and this leads to quantification of the accuracy. Based on statistical sampling techniques, the optimizer can guarantee to give a configu-ration which is accurate to a user-specified error level with user-specified confidence level. Thus, the optimizer can help to answer the third question posed in Section 1.4.1. 1.5 Overview The next chapter of the thesis gives a survey on the previous works related to the different systems, image features, indexing, examples of filtering and multi-dimensional indexing structures. Chapter three describes the details of 3-level filtering and also the cost models of 2-leveland multi-level filtering. Chapter four presents the trend analyses of these filtering approaches according to the cost models. Chapeter five discusses the various experiments performed and their results. Chapter six investigates the design principles and implementation of the optimizer. Chapter seven concludes the whole thesis. Chapter 2 Related Works This chapter starts with a discussion of three selected image database systems. It then investigates four different image features, namely colour, texture, shape, and eigenface. It also discusses their indexing techniques and the applications of filtering to these features. Finally, it examines several popular multi-dimensional indexing structures. 2.1 Systems Overiew 2.1.1 QBIC As introduced in section 1.2.3, QBIC is an image database system which allows query based on image contents such as colour, texture, shape as well as sketch features of the whole or portion of the image [N+93]. A l l these image features allow users to specify the queries efficiently and effectively. The colour features are represented as colour histograms in QBIC. The details of colour histograms will be described in Section 2.2.1. Shape is another important image feature of QBIC, but it is known that shape recognition is a difficult problem. QBIC is tested in [F+94] by using 20 shape features. These features in-clude area, circularity, eccentricity, major axis orientation, and a set of algebraic moment invariants [N+93]. Recall from section 1.2.3 that QBIC allows the specification of .queries by using sample image and direct query. The latter method of specification allows the users to enter the required colours, shapes or textures directly. QBIC provides a drawing pad for users to roughly sketch the required shapes. Abstractions of these sketches are 11 Chapter 2. Related Works 12 obtained by reducing their resolutions, and they can then be stored as sketch features. 2.1.2 Miyabi Like QBIC, Miyabi [H+93] allows the users to specify their queries in many different ways. First, a user can draw rough sketches and ask the system to look for similar images. Second, the user can also ask for images which look similar to any specified image in the previous answer set. Third, the user can refine the searching criteria by specifying some textual keywords too. In some special cases, the user can retrieve the required images based on either shape or colour only. For indexing, Miyabi generates a picture index for each image automatically that is based on the colour and shape features of the image. This picture index is composed of line and colour versions. Each image is divided into different regions according to this picture index, and the matching will be based on the shape, colour and position of each region. It is important that an image is divided into only a small number of regions. • It is because human beings can usually remember only a few regions, and therefore can only sketch images vaguely [H+93]. 2.1.3 Photobook The photobook is an image database system which uses the concept of semantics-preserving image compression [PPS94]. This concept makes its indices perceptually meaningful to human beings. The photobook system includes three important tools according to dif-ferent image features, namely face photobook for facial recognition, shape photobook for object shapes, and texture photobook for texture descriptions. Facial recognition is based on eigenimages or eigenfaces [TP91]. More details of the eigenimages will be given in Section 2.2.3. Experiments show that the recognition accuracy can be as good as 95%. Shape recognition is usually complicated by different viewing geometries between the image in the database and the query. Furthermore^ the objects in the images can be Chapter 2. Related Works 13 deformed. Shape photobook tackles this problem by using Finite Element Method (FEM) [PPS94]. The 2D-Wold like decomposition model [LP95] is chosen in texture photobook to index texture pattern because this model is related to the human perceptual similarity of grouped patterns. A common requirement among all image database systems is to provide efficient searching for user queries. This is particularly important for huge databases. The goal of this thesis is to study how the multi-level filtering can improve the performance of searching without sacrificing the correctness of results. 2.2 Image Features and their Indexing 2.2.1 Colour Indexing Definition and Properties Colour histograms have long been used in image processing for purposes such as image segmentation. However, as suggested by [SB90, SB91], it can also be an effective image feature vector. Given a discrete colour space defined by some colour axes (e.g. red, green, blue), the colour histogram is obtained by discretizing the image colours and counting the number of times each discrete colour occurs i n the image array [SB91]. More precisely, an A^-bin colour histogram x of an image is a vector (hi, /12, • • •, hpf) in a N-dimensional colour space, where hi represents the number of the pixels with.colour i. Alternatively, hi can be normalized to represent the coverage percentage of the colour i in the image. There are several adyantages of using colour histograms as image feature vectors. First, colour histogram indexing is quite robust in many application environments since colour histograms are invariant to translation and rotation about an axis perpendicular to the image plane, and change only slowly under change of angle of view, change in scale, occlusion [SB91, SKB94]. Even though colour constancy is an inherent problem of Chapter 2. Related Works 14 colour histogram, Funt and Finalayson in [FF91] solves this problem elegantly by using the characteristic that the ratio of the neighbouring pixel colours is constant under some circumstances. Colour Histogram Distances Sz Colour Similarity One way to measure the distance between two iV-bin colour histograms is to use L\-norm [SB90, SB91, FF91]. Another way is Z/2-norm, or called Euclidean distance. Both are quite efficient because they take 0(N) time to compute. Unfortunately, there is an inherent problem associated with both Zi-norm, i^-norm, and other unweighted dis-tances between colour histograms. These methods ignore the similarity between colours or "cross-talk effect" [SH93, SH94]. For example, suppose there are only three colour bins in the histogram, namely red, orange and blue. Therefore, if the three images X, Y, and Z are in purely red, orange and blue respectively, the unweighted distance between X and Y is the same as that between X and Z even though orange looks much closer to red than to blue. This is because the unweighted distance does not take this cross-talk effect between colours into consideration. A better method of measuring distance be-tween colour histograms is to weigh the distance by perceptual colour similarity matrix A. The entry atj in matrix A measures how close the colour i and the colour j are. The distance between colour histograms x and y involving the matrix A becomes where v1 is the transpose of v. When compared with other colour models, CIE LUV and Munsell HVC models [MY88] can reflect fairly accurate perceptual distance. Therefore, they are good can-didates to set up matrix A. dN<A(x, y) = (x - y)*A(x - y) (2.3) Chapter 2. Related Works 15 Because of its definition, the similarity matrix A is a real, symmetric matrix. By the spectral decomposition theorem of matrix algebra [Flu88], there exists an orthogonal matrix B and a diagonal matrix A, both of the dimension N x N such that A = BfAB (2.4) The rows of matrix B are the eigenvectors of A and the diagonal entries of A are the corresponding eigenvalues. Without loss of generality, A and B can be arranged such that the corresponding eigenvalues are in descending order of value. Filtering approaches of Colour histograms Section 1.3.1 has pointed out that the general 2-level filtering approach can help to solve the high dimensionality problem of image feature vectors. The following investigates in more details how 2-level filtering can be applied to the colour histograms using the examples of the Faloutsos approach [F+94] and the Sawhney and Hafner approach [SH93, SH94]. In this domain, the high dimensional feature vectors Q and I refer to the TV-bin colour histograms extracted from query image and each image in database respectively. —* —* * —* —* The distance function dist(Q,I) in Equation(l.l) is the distance function distj^^iQ,!) in Equaton(2.3) involving the similarity matrix A. In addition to the high dimensionality —* —* problem of colour histograms, computing dist^^iQ: I) is much more expensive than the Li -no rm and Z<2-norm distances because it is of the order 0(N2). Faloutsos approach proposes Iavg — (Ravg,Gavg,BavgY as the abstracted 3-D vector I^s used in the coarse filtering stage, where ^ w | w ^ w R ™ 9 = A(P)> Gavg = T T r Y . G(P)l B ^ 9 = TTr lZ B{P)l V V p=l V V P =l VV p=l' and W is the number of pixels in the image and R(p), G(p), B(p) are the red, green and blue values of the pixel p. Chapter 2. Related Works 16 Having computed the Qavg and I a v g for the query and every image respectively in the database, the filtering can be done by finding the coarse-level answer set Xavg. Iavg — {Iavg \ ds(Qavg, I a v g ) ^ Davg} (^'^) where d3(Qavg, I a v g ) = { Q a v g - IavgY(Qavg - I avg), and it is an efficient L 2 -norm distance function. The Faloutsos approach also derives a formula of Davg from D so that the completeness requirement can be satisfied. The final answer set IN = {T\dN,A{Q, I) < D} can now be much more efficiently computed based only on the small coarse-level answer set lavgi rather than on all the images in the database. Experiment results show that the Faloutsos approach saves huge amount of searching time when compared with the approach without filtering. The Faloutsos approach fixes the coarse filtering dimension to be 3, but Sawhney and Hafner approach generalizes the dimensionality of coarse filter to be any dimension k. In particular, this approach looks for a optimal transformation matrix Ck of dimension k x N that approximates I ( i.e. Ik = Ckl). This Ik can then be used as the abstracted low-dimensional vector Iabs in the coarse filtering stage. The coarse-level answer set is: IK = {Ik\dist{QkJk) < D k ) (2.6) The threshold distance Dk is simply equal to the original distance D. This transfor-mation Ck is optimal in the sense that it minimizes the maximum difference between the true distance cijv,^(Q,/) and the approximated distance dk(Qk,h)- By using Singular Value Decomposition (SVD), the key result in [SH93, SH94] is the theorem proving that the optimal transformation Ck is given by: Chapter 2. Related Works 17 Ck = y/A~kBk (2.7) where matrix Bk of dimension k x N is composed of the first k eigenvectors (i.e. rows) of B (as shown in Equation(2.4)), and Ak of dimension kxk is a diagonal matrix whose diagonal entries are the first k diagonal entries of A. Please note that as different symbols are used, the set Xavg from the Faloutsos ap-proach is not the same as X 3 from the Sawhney and Hafner approach with k — 3, even though the vectors of size 3 are used in both cases. Nonetheless, experiment results in [SH93, SH94] indicate that the behavior and performance of the Faloutsos approach is very close to those of the Sawhney and Hafner approach with k equal to 3 or 4. Thus, later on in the analysis and experimentation, whenever appropriate the Faloutsos approach is represented as the Sawheny and Hafner approach with k=4. 2.2.2 Shape and Texture Shape Shape features generally fall into two categories, namely global and local features. Global shape features are those which depend on the object as a whole. It is usually to describe the object with features which are invariant to translation, rotation and possibly scaling [KJ86]. Examples of global features include area, perimeter, number of holes, areas of holes, maximum and minimum radii from objects' centroids, algebraic moment invariants and normalized Fourier descriptor. In contrast, local features are those which depend only on a subset of the object. Examples are straight lines, circular arcs, and segments which are grouped together [SM92]. One of the main problems of global shape features is that it may not work well in a cluttered environment. For examples, the number of holes may be less when one object covers another. Therefore, global features should not Chapter 2. Related Works 18 be used in situation where the object may be partially visible or occluded[KJ86]. In such situation, local features are preferred. In QBIC, 20-D feature vectors I are used to store shape features of the images [F+94]. The 2-level filtering approach is also used here. It applies the technique of Karhunen Loeve Transform (KLT) on the 20-D I and truncates the resulting vector to a lower dimensional k-D vector Ik- Actually, K L T is also the same as principal component analysis which will be described in more details in Section 2.2.3. The experimental results show that it is sufficient for k to be 2. Texture In QBIC [N+93], texture features are composed of three components, namely coarseness, contrast and directionality. They describe its scale, vividness and whether the texture is directional and isotropic respectively. The texture photobook [PPS94] uses 2-D Wold-like decomposition to obtain the texture vectors. The texture field is decomposed into deterministic (or structural) and indeterministic (or stochastic) components which are orthogonal to each other. The deterministic component, in turn, is composed of half-plane deterministic (or harmonic) and generalized evanescent components. The details can be referred to [FMP93]. Alexandrov et al. [A+95] proposes another method to index texture. The Garbor filter is famous for its characteristic of optimality in localizing both space and frequency simultaneously. This filter can be modelled in any specified scale and orientation. It is this property to index texture. A family of Garbor filters of S different scales and O different orientations can be applied to any given texture image to output the required texture feature / . The texture vector I which is used in the experiment of [A+95] can be as high as 240-D. Alexandrov et al. then uses filtering to improve the performance. The low-dimensional abstracted vector Iabs developed by Alexandrov et al. is just a two Chapter 2. Related Works 19 dimensional vector J 2 . This abstracted vector J 2 is the output of the single filter which has the best discriminating power among the whole family of Garbor filters. Although all these image features may have different characteristics and applications, most of them share a common drawback. The vectors of these image features are usually very high dimensional. This imposes a serious problem for indexing. 2-level and multi-level filtering are designed to solve this problem. 2.2.3 Eigenimages Properties and its Indexing Explicit image features such as colour, texture and shape are those which can easily be observed and extracted by human perception. However, they may not be the optimal features for recognition. This brings up the idea of implicit image features. They are features which are automatically extracted by the computer and the best for efficient recognition. The extraction is done by a technique called Principal Component Anal-ysis (PCA) . The feature vectors in the real world are typically correlated among the dimensions. In other words, there exists redundancies in the storage of the data vector. The principle of P C A is to decorrelate these data as much as possible so as to reduce the redundancies to the minimal amount. The decorrelation can be visualized as the transformation of the multi-dimensional axes such that most of the variations of the data vectors are captured on few axes, and little or no variation of the data exhibits on the remaining axes. Therefore, the transformed data on those few axes can capture most of the information. In other words, P C A can analyse the set of images in the database and create the transformation matrix. The transformation gives a set,of N principle components (PCs) Chapter 2. Related Works 20 for each image.1 These PCs (also known as eigenimages in this application) are arranged in the descending order of eigenvalues so that the first few components can capture most of the feature information. Because these=PCs do not usually correspond to the features which can be easily recognized by human beings such as eyes, mouth, etc, they are called implicit features. The full set of these N PCs can be grouped into an N-D vector / . Comparing these image vectors I with the corresponding query vector Q at a threshold distance D gives the required answer set IN- The exact dimension of / depends on the size of the training images set. However, in most cases, this dimension can be well over 100. Therefore, direct comparison is unacceptable for its poor performance because of the high dimensionality. In [Sed95], Sedighian applies the 2-level filtering approach to solve the searching problem of high dimensionality. 2.3 Multi-dimensional Indexing Structure In traditional database systems, one of the most often used techniques to speed up the C P U and I/O performance is indexing. An advantage of indexing for query answering is to avoid accessing every single data item in the database system, but only a small fraction of them. There are different efficient indexing structures for traditional database systems. Binary search trees, B-trees and B +-trees are a few examples. Recall from 7 the previous sections that the image feature vectors are inherently high dimensional, so these indexing traditional indexing structures are insufficient. sEven though there are methods (e.g. Z-ordering [Ore86] ) to transform a K-D space onto a 1-D space using the technique of space filling curves, distance preserving is not guaranteed. In other words, points which are close to each other in K-D space may not be close in 1-D transformed space. These methods are therefore not desirable in our application. 1In fact, SVD on the covariance matrix gives the PCs. Chapter 2. Related Works 21 Multi-dimensional indexing structure such as K-D trees [Ben75], K-D-B trees [Rob81], R-trees [Gut84] and R+-trees[SRF87] and R*-trees [B+90]are developed for indexing the multi-dimensional vectors. The following describes these structures briefly. 2.3.1 K - D trees The structure and operations of K-D trees [Ben75] are basically similar to those of the l -D binary search tree. In a l -D binary search trees, there is only one key available to discriminate whether to traverse left or right during searching. However, in a K-D tree, this discriminating key varies with the level of traversal during searching. In general, the discriminating key alternates one after another among all the available K. keys. A node structure can be represented as (keyo, keyi,keyx-i)- The discriminating key of the i-ih level of the tree is the keyimo(iK- Moreover, it may need to traverse through multiple paths. It can be seen that the structure and the efficiency depend heavily on the data set and the sequence of the insertions. In the worst case, it can be degenerated to a linearly linked list. Adaptive K-D tree is proposed to tackle this problem [Sam90]. Instead of alternating among keys, the discriminating key of an adaptive K-D tree at a node is the one with the greatest range over the sub-tree rooted at that node. Therefore, it is possible that the discriminating keys for level i and i + 1 are the same. 2.3.2 K - D - B trees There are two important differences between K-D-B trees and K-D trees [Rob81]. First, the non-leaf nodes of a K-D-B tree do not contain the data points. Instead, they contain - the rectangles bounding all the data points in the leaf nodes of their subtrees. Second, the original K-D tree is not specifically designed for huge databases where all the indices cannot be fitted in the main memory. If a static environment is assumed where all the data is known before insertion and little or no modification will be made, Chapter 2. Related Works 22 the K-D tree can still be applicable. In this case, the tree is first built statically and then mapped onto the pages. However, in the dynamic environment of a huge database, K-D-B tree is designed to handle this case where it is necessary to store most of the indices in disk while it is being used [Rob81] [Gut84]. Each node of K - D - B tree is stored as a single page in disk so that efficient use can be made of a disk with paging. The above two multi-dimensional indexing structures are good for point data objects. In some other applications, point data objects may not be sufficient. Examples include spatial data objects in geographical information systems. Still, there are methods for converting spatial data objects to point data. One way is to transform a K - D rectangles to two K-D points. Take a 2-D rectangle with diagonal vertices (1,2) and (3,4) as an example. A possible corresponding 4-D point to represent this 2-D rectangle can be (1,2,3,4). However, higher dimensionality space may lead to poorer performance. In order to solve this problem, the following three multi-dimensional indexing structures are designed to handle spatial data objects. 2.3.3 R-trees and its variants In order to handle any irregular K-D spatial data object, one approximation is to represent the key of a node as K-D rectangle which bounds the object. This K-D rectangle key I can then be represented by K number of closed intervals. -, / = (Jo, h, • • •, ^A'-l) where Ii — [a,-, &;], with at- and b{ representing the lower and upper limit of the range in the i-th dimension of the rectangle. Each page can only store a singe node, which, in turn, can store a certain maximum number of rectangle keys. At the point when an extra rectangle is being added to a full node, that node has to be split into two. The rectangles which are in the original Chapter 2. Related Works 23 overflown node are then needed to be re-distributed into the two newly created nodes. The three members of R-tree family, namely R-tree, R +-tree, R*-tree differ in the ways of redistribution. The differences in the insertion algorithms between them affect the characteristics of the rectangles in the non-leaf nodes, and therefore the tree structures as well as the searching algorithms. In an R-tree, the rectangle key of each non-leaf node completely encloses all the rect-angles of the nodes at its subtree. During searching, if the query rectangle key overlaps with several rectangles of a certain non-leaf node, all the nodes of these overlapped rect-angles have to be traversed. Therefore, the efficiency of the searching depends on the overlapping area. Zero-overlapping area is thus the goal. Unfortunately, it is shown that R-trees can never achieve zero- overlapping area for keys of complete rectangles without splitting[SRF87]. In order to improve efficiency, R +-tree is designed to have zero-overlapping area by splitting the rectangle keys. The time of searching can then be reduced because it is less likely to follow the paths which do not lead to correct results. The R*-tree [B+90] enhances the searching performance from a more comprehensive approach. It also argues that the original R-tree which is based on minimizing the coverage area is not optimal. Like an R +-tree, the fe-distribution heuristic of R*-tree is based on coverage area and overlap area. In addition, it also considers two other optimization heuristics. The first one is to minimize the margin of the rectangle (i.e. sum of the edges length). The idea of this heuristic is as follows. Suppose the area is kept the same. When the margin is shorter, the rectangle is also more quadratic (i.e. it is closer to a square). Since the packaging of the square is better, the non-leaf rectangles can also be smaller. The goal of the second heuristic is to shorten the height of the tree, and therefore minimize the query cost. By combining all these criteria, R*-tree shows very encouraging experimental results compared with the original R-tree. Even though the above indexing structures can support multi-dimensional data, none Chapter 2. Related Works 24 of them can handle data efficiently with dimension over 20. Therefore, they cannot store the high dimensional image feature vectors directly. The aim of the 2-level filtering is to provide a coarse level filter where k-D vectors are abstracted from,the original high dimensional image features vectors. The dimensions of these A;-D vectors are usually so small that these vectors can fit in the indexing structure. Still, the coarse filter of cannot be too small. Otherwise, this may leave too many images for the expensive fine-level filtering. At the same time, the dimension k cannot be too large because of the "dimensionality curse". Therefore, the design goal of multi-level filtering is to solve this dilemma., Chapter 3 Cost Models of Filtering 3-level filtering is one of the major contributions of this thesis. The beginning of this chapter describes its details. The remaining sections then present the cost models which can help analysing and comparing 2-level, 3-level and multi-level filterings. 3.1 3-level Filtering As previewed in Section 1.4.2, the goal of the 3-level filtering is to solve the inherent problem of 2-level filtering. In other words, 3-level filtering can detach the dimension-ality of the indexing structure from the number of full high dimensional feature vector comparisons by adding an intermediate level of filtering. Suppose the N-D feature vector of the images in the database and the query are extracted to be I and Q respectively. / are then abstracted to much lower dimensional vectors Ik and /; by either P C A or SVD depending on particular application, where I < k <^ N. The dimension of 7; is small f —* —* —* enough so that I\ can be stored efficiently in the indexing structure. Ik and I can be stored in two separate flat files. Q are also abstracted to Qk and Q\ similarly. The three levels of filtering are done in following steps. 1. The coarsest level answer set X; is obtained by searching through the indexing structure with the query Q\ Xi = {T,\disti(Qi,Ii) < Di} 25 (3.8) Chapter 3. Cost Models of Filtering 26 2. 1\ is subset of the original database. Only the corresponding vectors Ik of this small subset are compared with Qk to return the second level answer set lk-lk = {Ik\distk{QkJk) <Dk f\(Ii G li)} (3.9) where // and Ik are the l-D and k-D abstractions of the same image. 3. Jfc is even a smaller subset of the database than J ; . The original high dimensional —* —* vectors / of this very small subset are compared wtih Q to return the final answer set 2jy. 1N = {T\distN(Q,T) <D f\(fk € lk)} (3.10) Since the transformation is chosen such that the distance between vectors is preserved after transformation, the threshold distances TJ/ and Dk can simply be equal to D. The 4-level, 5-level and even higher levels of filtering are simple extensions of the same idea of 3-level filtering by adding even more intermediate levels. 3.2 Motivations of Deriving Cost Model As the filtering concept is widely applicable, it is desirable to derive a cost model for the general filtering without restricting to any particular image index. It can serve two purposes. • If we are given with a particular image database, its corresponding high dimensional vectors and the selected indexing structure, the cost model can give us an estimation of C P U and I/O time spent on a query for a specific filtering configuration. From Chapter 3. Cost Models of Filtering 27 these data, an optimal number of levels of filtering and optimal dimension(s) can be obtained by comparing between these configurations. • Without the cost model, the hypothesis that 3-level filtering costs less than tradi-tional 2-level approach only seems true from a vague intuition. The cost model can help us to understand the trends of both of these filtering approaches, and therefore it can provide more concrete evidence that this hypothesis is correct in most cases. Because of the these two benefits, the cost models can help to find the optimal k in 2-level filtering and confirm the advantage of 3-level over 2-level filtering. Therefore, the cost models can answer the question 1 and 2 which are asked in Section 1.4.1. The rest of this chapter will discuss the cost models of 2-level,-3-level and the generalized w-level filterings. 3.3 A cost model of 2-level filtering 3.3.1 C P U cost of 2-level filtering Let M denote the total number of images in the database. Without filtering, the set IN is computed by checking the vectors of all M images. Thus, the C P U cost (in units of time) for computing IN is given by: • . CPU(N) = M * T d N (3.11) where T<iN denotes the C P U time taken to compute the L 2 -norm distance function rfjv() between Q and I. However, if the distance measurement between feature vectors involves similarity matrix A as Equation (2.3), TdN in the above Equation(3.11) should be replaced by TdN A which is the C P U time needed for computing Equation(2.3) between Q and I. Figure 3.1 summarizes the definitions of the symbols used throughout this paper. Chapter 3. Cost Models of Filtering 28 To compute the C P U cost of filtering, it is necessary to know the percentage of images that satisfy the filtering condition. Let ak denote the ratio of the cardinality of Xk to the total number of images, i.e., \Xk\/M. However, most multi-dimensional indexing structures, such as variants of R-trees, and k-d trees, do not directly support the kind of radius search required in the definition of Xk. Instead, a rectangular range search is used to approximate the radius search. More formally, it is the set Jk — {Ik\ AjLi < D} that is returned by the indexing' structure, where C[j 5 Z j 3XC the j-th entries of Qk,Ik respectively. The symbol fa is used to denote the ratio of the cardinality of Jk to the total number of images, i.e., \Jk\jM. Using a fc-dimensional indexing structure as a filter, the C P U cost of the 2-level filtering is given by: CPUiK,N) = Index.CPU(M,k,fa,P) + M * fa * Tdk + M * ak * 1 ^ 3 . 1 2 ) • The symbol (k, N) denotes the 2-level filtering configuration with ^-dimension in coarse-level filtering and TV-dimension in fine-level filtering. • The first term of the above C P U cost represents the C P U time taken to perform the range search that computes Jk, i-e., the time taken to search for fa* M points in a fc-dimensional indexing structure of M points and of page size P. • The second term of Equation (3.12) gives the C P U time to compute Xk from Jk. This amounts to calculating the i^-norm between Qk and Ik, for all Ik in Jk. The symbol Tdk in the second term denotes the C P U time taken to compute the Z/2-norm between two vectors of size k. • Finally, in order to compute the answer set ZJV, the full iV-dimensional vector distances between Q and / are required, for all the vectors in Xk. The C P U time required for this final step is described by the third term of Equation (3.12). Chapter 3. Cost Models of Filtering 29 symbols definitions N the number of dimensions of the image feature M number of images in database P page size Pt total number of pages occupied by full N-dimensional vectors Ptk. total number of pages occupied by transformed A;-D vectors TdN C P U time to compute Z-2-norm distance between two full N-dimensional vectors C P U time to compute distance involving similarity . matrix A between two full TV-dimensional vectors i.e., Equation (2.3) Tdk C P U time to compute the Z/2-norm between two vectors of k dimension OLk the ratio of the cardinality of Zk to M ft- the ratio of the cardinality of Jk to M Index-CPU{M,k,pk,P) C P U time to retrieve (5k * M points in a fc-dimensional indexing structure of M points and of page size P IndexJ/0{M,k,fik,P) number of page accesses to retrieve flk * M points in a A;-dimensional indexing structure of M points and of page size P Figure 3.1: Definitions of symbols Chapter 3. Cost Models of Filtering 30 3.3.2 I/O cost of 2-level filtering The I/O cost without filtering is quite straightforward. For simplicity, it is assumed that 4 bytes are used for every dimension of a vector. In other words, the size of a vector is AN. Thus, the I/O cost (in number of page accesses) for computing IN without filtering is given by: I/0{N) = ' M * AN I P (3.13) which is exactly the total number of pages occupied by the full TV-dimensional vectors, also denoted by the symbol Pt. To simplify presentation, we do not use the ceiling function (i.e., \P/AN\) in the above equation. Recall from Section 3.3.1 that the C P U cost of filtering consists of three parts: per-forming the range search to produce J K , reducing JK to IK, and finally computing IN-The situation for I/O cost is identical except that there is no I/O involved in the stage that reduces JK to IK. This is because the Ik are directly stored as fc-dimensional points in the indexing structure. Once the points./vectors are obtained, the appropriate Z^-norm can be computed outright, requiring no extra page access. Thus, the I/O cost of filtering consists of only the following two parts: I/0(ktN) = Index J/0(M, k, f3k, P) + Pt * (1 - (1 - l/Pt)a**M) (3.14) • Analogous to the first term of Equation (3.12) for the C P U cost, the first term above denotes the number of page accesses to retrieve f3k * M points in a ^-dimensional indexing structure of M points and of page size P. • The second term in the above equation is Cardenas' formula for estimating the dis-tinct number of pages occupied by ak*M randomly selected N-dimensional vectors [Car75]. A more accurate formula developed by Yao can also be used [Yao77]. Chapter 3. Cost Models of Filtering 31 3.4 A cost model of 3-level filtering 3.4.1 C P U cost of 3-level filtering The C P U cost is given by an extension to the 2-level formula in Equation (3.12): CPU{AXN) = Index-CPU(M,4,f34,P) + M * (34 * Tdi + M * a 4 * Tdk + M * ak * TdN (3.15) The notation (4, k, N) represents the 3-level filtering configuration with 4-D as the coars-est level, k-D as the intermediate level, and iV-D as the finest level-. The first two terms of the above equation are exactly the same as the first two terms of Equation (3.12) with k = 4. The third term corresponds to the C P U cost of producing Xk and computing Z/2-norm distances between Qk and all the a4 * M I^s. Finally, the last term describes the C P U cost of calculating the full iV-dimensional vectors distances of which there are ak * M of them. 3.4.2 I/O cost of 3-level filtering Similarly, the I/O cost is given by an extension to Equation (3.14): I/0{4XN) = Index J/0(M, 4, ft, P) + Ptk * (1 - (1 - l/Ptk)a**M) + Pt -l/Pt)ak*M) (3.16) where Ptk denotes the total number of pages occupied by the transformed vectors, of size k. The first term above is exactly the same as that in Equation (3.14) with k — 4. The second term is based on Cardenas' formula for estimating the distinct number of pages occupied by a4 * M randomly selected transformed vectors [Car75]. As a simple modification to Equation (3.13), the total number of pages occupied by the transformed vectors of size k is given by Ptk = AkM/P. Finally,.exactly the same as the last term in Chapter 3. Cost Models of Filtering 32 Equation (3.14), the last term in the above equation denotes the I/O cost.of computing IN from IK. 3.5 A cost model of u-level filtering 3.5.1 C P U cost of u-level filtering The same idea can be generalized to the cost models of u-level filtering. Let k\ < &2 < . . . < ku-2 be the u-2 parameters of a u-level indexing structure. That is to say, sandwiched between the first level of u-dimensional vectors and the last level of N-dimensional vectors, this structure has a level of transformed vectors of size ki, followed by a level of vectors of size k2, etc. Thus, the C P U cost can be defined as a simple extension to Equation (3.15): CPU{4MM ku_2,N) = -Index-CPU(M,4,/34,P) + M * f34 * Tdi + M * cc4 * Tdki + M * akl * Tdka + ...+ M * cYfcu_3 * Tdku_2 + M * aku_2 * TdN (3.17) 3.5.2 I /O cost of u-level filtering Similarly, the I/O cost of u-level filtering is a simple extension to Equation (3.16): I/0{4Mtk2_ku_2<N) = IndexJ/0(M,A,/34,P) + Ptkl * (1 - (1 - l/Ptkl)a**M) + Ptk2*(l-(l-l/Ptk2)a*i*M) + ...+ • Phu_2*(i-(i-i/ptku_2r^* M) + Pt * ( 1 - ( 1 - 1 / P t ) a ^ * M ) (3.18) where Ptkz = AkzM/P denotes the total number of pages occupied by the transformed vectors of size kz. Chapter 3. Cost Models of Filtering 33 It has been shown in this chapter that these cost models can capture both the C P U and I/O time needed for 2-level, 3-level, and u-level filtering. They will be used for trend analysis in the next chapter and experimental evaluation in chapter 5. Chapter 4 Trend Analysis 4.1 Introduction The cost model presented in Section 3.2 provides a good foundation for analysing the generaltrends of 2-level, 3-level and even u-level filtering costs. A very detailed analysis will be too complicated because of many unknowns involved, such as the characteristics of the dataset and the indexing structure. Even if this detailed analysis is feasible, it will be too specific to this particular set of variables. Unlike the detailed analysis, the trend analysis in this chapter does not suffer from this problem, and its results are general enough to be applied in various situations. This chapter first explains the reason why the 2-level filtering (k,N) costs much less than the approach without filtering (N). It then analyses the trend of 2-level filtering (k,N) with different dimensions k. Although this trend analysis cannot give a definite answer to finding the optimal k of 2-level filtering, it can estimate a general trend of how the cost changes with k. Afterwards, the next section compares the cost of 3-level filtering (4, k, N) with that of 2-level (k, N) at the same value of k (i.e., when the dimension of the intermediate level of 3-level filtering is equal to that of coarse level in 2-level filtering). The following section investigates whether 3-level filtering can still excel 2-level when they are at their respective optimal values of k. The final section examines whether there is any limit of increasing number of intermediate levels by comparing 4-level with 3-level filtering. 34 Chapter 4. Trend Analysis 35 4.2 With or without filtering: (4, TV) vs (TV) Recall that Faloutsos approach can be approximated by (4,7V). Even though the filter is fairly coarse, a4 (and thus, ft) is typically quite small. Thus, the quantity (1 — (1 — 1/Pt)ak*M) in Equation(3.14) is only a small percentage of the total number of pages (i.e., Pt). In fact, the difference between Pt and P i * (1 - (1 -l/Pt)ak*M) is much larger than the first term Index J jO(M,k, ft, P) in-Equation (3.14). Hence, the I/O cost, I/O^N) of the 2-level filtering approach is smaller than the I/O cost without filtering (cf: Equation (3.13)). Similarly, the smallness of a4 and ft accounts for significant reduction in C P U time. Thus, when a4 and ft are small, CPf/^jv) is smaller than CPU^N) in Equation (3.11). As discussed, in Section 3.3.1, the cost model of Faloutsos approach is the same as that of 2-level filtering as described here except the TjN A is used instead of TdN. It is known from the Equation (2.3) that TdN A is much larger than TdN because of the matrix A. Therefore, the third term in Equation (3.12) dominates other terms of CPU(4>N). When a4 and ft are small, CPU(4>N) is much smaller than CPU^yTherefore, this can explain the huge saving of Faloutsos approach when compared with the approach without filtering. 4.3 Analysing the configuration (A;, TV) As k increases, the trend of 2-level filtering cost is not as trivial as the analysis discussed above. The trend of the I/O cost in Equation(3.14) is determined by two "forces" going in opposite directions. • Observation 1 Pt * (1 - (1 - l/Pt)ak*M) decreases as k increases. Chapter 4. Trend Analysis 36 First, as k increases, it is known that ak decreases because less information is lost —* —* when I is truncated to Ik. Furthermore, all the other variables in Pt * (1 — (1 — l/Pt)a**M) are independent on k. Thus, the second term Pt*(l-(l-l/Pt)ak*M) of I/0(k,N) ( Equation( 3.14) ) decreases as k increases. This is shown in Figure 4.2(a). • O b s e r v a t i o n 2 Index-I/0(M, k,/3k,P) increases as k increases. When k increases, the size of each vector increases, as does the number of pages needed to retrieve the same number of vectors. More importantly, as the dimen-sion of the index increases, the number of page accesses grows at least linearly, if not exponentially, due to the effect of the so-called "dimensionality curse." This -outweighs the fact that (3k decreases as k increases. Consequently, the first term Index-I/0(M, k, (3k, P) increases as k increases. This is also shown in Figure 4.2(a). To understand the combined effect of these two opposite terms, the following key observation can be made: the rate of decrease of ak reduces as k increases. This is because Ik' corresponds to the k highest eigenvalues and eigenvectors of the similarity matrix. In other words, as A; increments by one, the extra dimension added corresponds to an eigenvalue that is less than all the k eigenvalues that have already been included. Thus, the impact of adding this extra dimension, though non-zero, is not as great as the impact of adding any of the k previous dimensions. As a consequence of this observation, the combined result of the above two opposite terms depends heavily on the ratio ak/a^+i • For small values of k, the ratio is significantly larger than 1. In this k increases, the decrease in the term Pt * (1 - (1 - 1 /Pt)°k*M) more than offsets the increase in the term Index J/0(M, k, /3k, P). This can be seen as the steep slope of the second term in Figure 4.2(a). The end result is a reduction in I/O cost, i.e., I/0^k,N) > 1/0(k+i,N)- However, for larger values of A:, the ratio cck/ctk+i approaches 1. Then, the decrease in the second term of Equation (3.14) is dominated by Chapter 4. Trend Analysis 37 (a) T rend of I/0{k>N) (b) Trend of CPU{k<N) Figure 4.2: Costs of 2-level filtering with respect to k the increase in the first term, resulting in a rise in I/O cost. The curve of the first term in Figure 4.2(a) is now much steeper than that of the second term. Thus, the trend of the I/O cost resembles a U-shape. The minimum cost configuration (k,N) depends on the exact slopes. Typically, A: is greater than 4. Hereafter, the value k that gives the minimum 2-level I/O cost is denoted as fc^tV O b s e r v a t i o n 3 I/0^:N) resembles a U-shape as k increases. The trend of the C P U cost is now considered. Like the analysis above, the trend of the three terms in C P U cost of Equation (3.12) can be analysed separately. • O b s e r v a t i o n 4 Index JJ PU(M, k,fa,P) increases as k increases. The behavior oi Index-CPU (M, k, fa, P) is similar to that of Index _7/0(M, k, fa,P) as Observation 2. In other words, the first term IndexJJPU(M, k,fa,P) of CPU(k,N) increases as k increases. Chapter 4. Trend Analysis 38 • Observation 5 M * /3k * Tdk increases as k increases. This is because the linear' increase in Tdk typically more than offsets the slight decrease in /3k. Hence, the second term M * /3k * Tdk increases as k increases. • Observation 6 M * ctk * T d N decreases as k increases. Similarly, since ctk decreases as k increases, the third term decreases. The net effect CPU(k,N) oi the sum of the three C P U terms are similar to that of I/0{k,N)- When k is small, the rapid decrease of the third term M * ctk * TdN will dominate the small increases of the other two terms. This is reflected as the steep slope of first term compared with those of the other two term in Figure 4.2(b). Therefore, the net effect on CPU[k,N) will decrease. However, when k is getting sufficiently large, the third term M * .ctk * TdN will follow ctk to flatten off. The first and the second term will then dominate, and C'PU^,N) will increase. This increase can also be seen in Figure 4.2(b). Observation 7 CPU[k,N) resembles a U-shape as k increases. Since both I /.0(k,N) a n d CPU(k,N) resemble U-shapes, it is logical to predict that the total cost of 2-level filtering should also resemble a U-shape which implies that there exists an k^j®1 which gives the minimum total cost of 2-level filtering. In the cases of Faloutsos approach and Sawhney and Hafner approach where TdN A is used instead of TdN, the analysis of the I/O cost is the same as above. The analysis of the C P U cost is different. T d N A helps the third term of Equation (3.12) to dominate the remaining two terms. Thus the trend of the third term, which is decreasing, defines the trend of the C P U cost. Since the trend of I/O cost resembles a U-shape, the trend of total cost resembles a wider U-shape. Although the exact value of optimal k1^ is still unknown, the above analysis can explain why the cost of Sawhney-Hafner filtering Chapter 4. Trend Analysis 39 approach at certain fcj^f ( where k^j^1 > 4 ) can cost less than that of the Faloutsos approach. 4.4 3-level vs 2-level filtering: (4, k, N) vs (k, N) at same k To answer the question of whether 3-level filtering leads to reduction in 1/0 cost when compared with 2-level, Equation (3.16) is compared with Equation (3.14). Since the last terms in both equations are the same, the difference between I/0^,N) a n d 1/0(4,fc,./v) is A i — A 2 , where: ' • A i = Index J/0(M,k,pk,P) - Index J/0(M, 4, ft,P), and • A 2 = Ptk*(l~(l-l/Ptkr**M). The trend of A 2 as k increases is first considered. Within A 2 , only Ptk depends on k, as 0 : 4 * M is constant with respect to k. When k increases, Ptk increases, but (1 — (1 - l/Ptk)a**M) decreases. The combined effect on A 2 can be analysed by investigating their respective relative rates of increasing and decreasing. If k increases by one, Ptk increases by Ptk+i - Ptk = AM/P. However, the term 1 - (1 - l/Ptk)ai*M decreases as k increases. Initially, when k is small, Ptk is small, and 1 — 1/Ptk is not too close to 1. The latter, in turn, causes 1 — (1 — 1 / P £ L ) Q 4 * m to be close to 1. This gives a rate of increase of A 2 close to 4M/P. As k increases, 1 — 1/Ptk becomes closer to 1, which causes 1 — (1 — 1/Pt k) Q i*M to become smaller. The consequence is that the rate of increase of A2 becomes much smaller. That is to say, the growth of A 2 flattens off at some point of k. The exact value of this k depends on the value of a4 * M. The larger 0:4 * M is, the. larger value this k is. In any event, the rate of increase of A 2 at any point cannot exceed AM I P. Observation 8 A 2 increases and eventually flattens off when k increases. Chapter 4. Trend Analysis 40 / (a) Case 1 (b) Case 2 Figure 4.3: Trends of A x , A 2 and A i - A2 The analysis of A i — A 2 can be done as follows. There are two cases. In the first case, if the rate of increase of Index JjO[M, k, fa, P) with respect to k is greater than AM/P, then A i — A 2 increases as k increases. That is to say, the larger the value of k, the larger is the difference between I/0(K,N) and I/O^^N) (Figure 4.3(a)). In the second case, the rate of increase of Index J./0(M,k, fa, P) with respect to k is less than AM/P. Then the growth of A i may only catch up with the growth of A 2 after the point k&2, when the growth of A 2 begins to flatten off. In other words, after this point, the larger the value of k, the larger A j — A 2 is (Figure 4.3(b)). Furthermore, when k = 4, A i - A 2 is equal to 0 - Pt4 * (1 - (1 - l/Pt4)ai*M), which is a very small negative number, because Pt4 is a small number. Then combined with the analysis described in the previous paragraph, A i — A 2 is expected to be positive even for rather small values of k (e.g., 6). And as argued above, as k increases, the value of Chapter 4. Trend Analysis 41 A i — A 2 is expected to become more and more positive, indicating that the cost I/0(k,N) of the 2-level filtering approach is getting higher and higher than the cost I/0(4tk,N) of 3-level filtering. As for the C P U cost, Equation (3.15) can be compared with Equation (3.12). Since the last terms in both equations are the same, the difference between CPU(k,N) and CPU(4ik,N) is A 3 - A 4 , where: • A 3 = IndexJCPU{M,k,fa,P)- IndexJCPU(M, 4, fa,P), and • A4 ' = M * (34 * Tdi + M * OL4 * Tdk - M * fa * Tdk. Similar to the above analysis of I/O cost, the larger the value of k, the larger is the value of A 3 . A 4 also grows as k increases, because of the growth in Tdk. However, Tdk is typically much smaller than Index-CPU(M, k, fa, P). Hence, A 3 dominate A 4 . In other words, CPU(k,N) is eventually greater than CPU(4tk,N)- ' The above analysis establishes that for both I/O and C P U costs, the 3-level filtering approach becomes more and more efficient than, the 2-level filtering approach for the same value oi k. 4.5 3-level vs 2-level filtering: (4,k,N)' vs (k,N) at their optimal k The next and most important question then is whether the 3-level filtering approach still outperforms the 2-level filtering when both approaches are at their respective optimal k values. The k value that gives the minimum total cost of the 3-level filtering approach can be denoted as & ^ | 3 , and as before, the minimum k value of the 2-level filtering is denoted as k^1. To determine the relationship between k^l3 and k^1, it is helpful to first examine the trend of C P U and I/O costs of the 3-level filtering approach. Among the four terms of CPU(4tk,N) m Equation(3.15), the first two terms, Index .CPU'(JVf, 4, fa, P) and M * fa * Tdi in are independent of k. On the one hand, Chapter 4. Trend Analysis 42 (a) Trend of I/0{4,k,N) (b) Trend of CPU{4tk,N) Figure 4.4: Costs of 3-level filtering with respect to k the third term, M * ct4 * Tdk increases with k. On the other hand, the Observation 6 of Section 4.3 indicates that the last term M * ak * TdN decreases with k. Because TjN has a dominating effect, CPU(4ik,N) w m follow the trend of its last term to decrease when k increases, at least when k is not too large. Their trends are shown in Figure 4.4(b). As compared with the trends shown in Figure 4.2(b), this trend is different from that of>CPU{k,N) since the total increase of Index JOPU(M, k,flk,P) and M * (3k * Tdk in CPU(k,N) is much more significant than the increase of M * a4 * Tdk in in CPU{4ik,N)-Observation 9 CPU^k,N) decreases when k increases, until k becomes very large in which case CPU(4ik,N) increases very slightly. As for the three terms of I/0^k,N) m Equation (3.16), the first term, IndexJlO(M,4.,f34,P) is constant with respect to k. The second term, Ptk * (1 — (1 — l/Ptk)ai*M) as indicated in Observation 8 of Section 4.4, grows as k increases,, but Chapter 4. Trend Analysis 43 flattens off eventually. The third term, Pt* (1 - (1 - l/Pt)ak*M) as also indicated in Observation 1 of Section 4.2, decreases.as k increases, but also flattens off eventually. The combined effect is that when k is small and ctk/ctk+i is not too close to 1, the decrease in the third term dominates the increase in the second term. This is because ctk is an exponent in the third term, whereas Ptk iri the second term is not. In other words, the cost I/0(4tk,N) is decreasing. Now as k increases, both the increase of the second term and the decrease of the third term flatten off. This translates to a very gradual decrease in I/0(4tk,N)- Eventually, when ctk/ctk+i is really close to 1, the second term finally outgains the third term, reversing the downward trend to an upward one. Because the growth of the second term itself flattens off, it is expected that the upward trend occurs only with large values of k, making the trend of I/0(4^,N) resemble a very wide U-shape as shown in Figure 4.4(a). In particular, I/0(4tk,N) is expected to reach its minimum beyond the point at which I/0(k,N) reaches its minimum. This is the case precisely because the 3-level filtering approach avoids using higher-dimensional indices. Observation 10 I/0(4,k,N) resembles a wider U-shape than I/0(k,N) does when k in-creases. By combining the two effects from Observations 9 and 10, it can be expected that total cost of 3-level filtering exhibits a U-shape wider than that of 2-level. In other words, the total cost of 3-level filtering will reach its minimum at a point beyond that of 2-level, i.e. k^1 < k ^ j 3 . The three graphs of Figure 4.5 show the three possible scenarios where k1^ < &^*£J3. Moreover, they also show the wide U-shaped curves of 3-level filtering total cost when compared with those of 2-level. In fact, the finding of kmin ^ ^min,3 l s precisely the key aspect that motivates the development of multi-level filtering as discussed in Section 1.4.2 (b) Scenario 2 (c) Scenario 3 Figure 4.5: Total cost of 3-level and 2-level filtering Chapter 4. Trend Analysis 45 Finally, some light can be shed on the relative performance of the 3-level and the 2-level filtering approaches when both approaches are at their respective optimal k values. It is already shown in the analysis of Section 4.4 that starting from some values of ko, the total cost CPU(k,N) + I/0(k,N) * dr of the 2-level filtering approach is expected to be greater than the total cost C PU(4ikiN) + I/0(4tk,N) * dr of 3-level filtering, where dr denotes the average time for reading one page from disk. There are now two cases. In the first case, the optimal k^j*1 of the 2-level approach exceeds k0. This is shown in Scenario 1 of Figure 4.5. Then it is immediately known that: CPU I [.total N \ + I/O/ ktotal N\* dr > CPU 14 ktotal N ) + I/'0 (4 fctotal N) * dr (4.19) \ iritn ' / \ min ' ' > ' min ' ' \ ' min * / Obviously, as A^*£'3 is the optimal k value of 3-level filtering, it is known by definition that the right-hand-side of the above inequality must be greater than the minimum total cost of CPU 14 ktotai M\+ I/O14 ktotal N) * dr. But a more revealing way to establish the \ mtn,3' . / V ' mm,3 * / same inference is to make use of the fact that k1^^ < A^'£j 3. The trends of CPU^k,N) and I/0(4ik,N) dictate that: C PU(4< ktotal^ ft} + 7/0^4^ tfotal^ * dr . > CPU(4ktotal^±[y)~^~I/0(4,ktota,+-l,N)*dr > ... >. CPU{4ik^iN) + r / 0 { 4 i k ^ t N ) * d r (4.20) Hence, by combining with Equation (4.19), the following is established: CPU{ fcjoj-i, N ) + 110{ fcjo,.«i N ) * dr > CPU{4t k ^ 3 , N ) + I/0{4t k^3,N) * dr(4.2l) That is to say, when the two approaches reach their respective minimums, the total cost of the 3-level filtering approach is smaller than that of the 2-level. This is shown in the scenario 1 of Figure 4.5. The above analysis only deals with the case when the optimal k^j®1 of the 2-level filtering approach exceeds k0. In the second case when A;^'"' is less than A:0, the comparison Chapter 4. Trend Analysis 46 is not as definite as in the first case. As k1^ < ko, the inequality sign in Equation (4.19) has to be reversed to give: CPU(ktotai iNj + I/O(ktotai j Ar\. * dr < C P U k t o t - a l , N) ^ 110 ( 4 , ' k t o t - a l , N) * dr (4.22) Combining this equation with Equation (4.20), however, does not give Equation (4.21). Thus, in this case, nothing definite can be concluded about which minimum value is smaller than the other one. Both possible cases are shown in Scenario 2 and 3 of Figure 4.5. Nonetheless, when this happens, it is expected that both minimum values are close to each other. Moreover, from experience, the critical value ko is typically quite small. For instance, for both experiments using random colour histograms in Section 5.3.1 and eigenfaces in Section 5.3.2, k0 and k**£ are around 6 and 12 respectively. Small ko values imply that the second case occurs much more rarely than the first case. Thus, the first case and Equation (4.21) are expected to hold most of the times. 4.6 4-level vs 3-level filtering: (4, kx,k2,N) vs (4, k, N) The I/O and C P U costs of u-level filtering (Equation(3.17) and 3.18) can be instantiated to those of 4-level filtering (4, k\, k2, N) as follows. CPU{4Mtk2tN) = Index JJPU{M^ fa, P) + M * fa * Tdi + M * a4 * Tdkj + M * akl * Tdk2 + M * ak2 * TdN (4.23) IIO{4MM,N) = IndexJ/0(M,4,fa,P) + Ptkl * (1 - (1 l/Ptkl)a**M) + Ptk2 * (1 - (1 - l/Ptk2)a*i*M) + Pt * (1 - (1 - l /Pt) a *2*^Ql.24) Chapter 4. Trend Analysis 47 When ki = k, the difference between CPU^^^N) and CPU^k,N)i as given by Equations(4.23) and (3.15), is the difference between M*akl *Tdk2 and M*(ak—ak2)*TdN. For small values of k (or ki), (ak — ctk2) is significant, causing M * (a:*; — Qk2) * TdN to dominate M * akl *Td^, and thus forcing C PU^kl,k2,N) to be less than CPU^k.N)- But the converse is true for medium to large values of k. Similarly, a comparison between Equations (4.24) and (3.16) reveals that 1/Oi^^^N) and I/O^^N) follow the same trend as their C P U counterparts. Thus, as far as total cost is concerned, it is expected that a 4-level structure could outperform a corresponding 3-level structure for small values of k, but the converse is true for larger values of k. Given that the optimal value A^'£' 3 of the 3-level filtering approach is typically not small, the latter statement implies that at that k value, the 3-level structure may be more efficient than the corresponding 4-level structure. But what about comparing the optimum of 4-level filtering with the optimum of 3-level filtering? Unfortunately, no analytic conclusion can be drawn that applies to most situations. However, the experimental results in the next chapter do indicate that while possible, it is not often that the optimal 4-level filtering can outperform the optimal 3- level filtering. And in situations where it does, the total cost difference between the 4- level optimum and the 3-level optimum is small. The cost models of C P U and I/O cost for even higher level, such as 5-level and 6-level can be instantiated similarly from the cost models of u-level filtering. However ^ as discussed in the previous paragraph, the limit seems to have already been reached by the 3-level, and occasionally 4-level filtering, there is little reason to consider filtering higher than four levels. This is because the extra storage space may outweigh the little gain in performance. In this section, it is shown that 2-level filtering (k, N) usually costs less than (4, N) when k > 4. Furthermore, the trend analysis also provides strong evidence that the optimal 3-level filtering (4, k, N) can also outperform the optimal 2-level filtering (k, N) in Chapter 4. Trend Analysis 48 most cases. In the next chapter, these findings are further confirmed by the experimental results. Chapter 5 Experimental Evaluation 5.1 Introduction r The trend analysis in Chapter 4 provides strong evidence that 3-level filtering can outper-form 2-level filtering in most cases. The goal of this chapter is to evaluate this analytical finding with experiments. Two different testbeds are chosen. One is a colour histogram which is an explicit image feature, while the other is an eigenface which is implicit, as discussed in Section 2.2.3. A colour histogram is chosen because it is a fairly popular feature among different image database systems. For example, one of the QBIC recognition techniques is based on colour histogram [F+94]. Likewise, Gong et al. also use colour histograms for their image database systems [G+94]. As discussed in Section 1.3, colour histogram is an example of high-dimensional image feature vector. The higher the dimension, the better the accuracy provided that all conditions are kept unchanged. For example, Faloutsos et al. uses as many as 256 bins for the dimension of the colour histogram [F+94] . Even higher dimensionality may be needed depending on the needs of the applications. Another testbed is eigenface [TP91]. It is an example of high-dimensional vector because the maximum number of dimension generated can be as high as the number of training images less one. This is certainly a considerable number. If the facial image database consists of five thousands facial images, and only one-tenth of them are used for training, the maximum dimension can be as high as 499. 49 Chapter 5. Experimental Evaluation 50 Using these two testbeds, we are now going to see which k gives the optimal configu-ration of 2-level filtering, and how much optimal multi-level filtering can save compared with the 2-level one. Before any result is presented, the details of experiments are de-scribed next. 5.2 Details of Experiments As there are two different testbeds, all the experimental details which are applicable to both are first described. Afterwards, the details specific to each of these two testbeds are then discussed separately. 5.2.1 General Experimental Setup Simulation In general, M images are first collected for the database, and the corresponding high-—* dimensional feature vectors I are extracted. (The exception is random colour histograms which are generated. The details will be discussed in the Section 5.2.2.) For a particular dimension k, the coarse vectors Ik are computed by reducing / by either P C A or SVD, depending on thfe application. A l l these Ik are inserted into an indexing structure of page size P , and the vectors I are stored in flat file. A query image is chosen among the image database, and its high-dimensional feature vector Q is also extracted and reduced —* —* to Qk- This Qk can then be submitted as query for the indexing structure. The set Jk = {h I A j= i l?j — ? i l 5: D] is returned as the result of the range searching by the indexing structure, fa is then the ratio of the cardinality of Jk and M. Each of the members in Jk is the compared with Q to give Xk as defined by Equation 2.6. ak can be found to be the ratio of the cardinality of Xk and M. During the tree traversal, the num-ber of pages accessed and the C P U time spent are recorded as IndexJf/0(M, k, fa, P) Chapter 5. Experimental Evaluation 51 and IndexJCPU(M,k, (3k,P) respectively. In order to avoid the bias of any particular query, the above procedures are repeated with 50 different queries in total for a given k. Similarly, the above complete process is repeated with every even number of k from 4 to 32. After all these steps, all the parameters needed for calculation of the every cost model are available, except Tdk and disk access time which will described in the following sections. These parameters can then be plugged into the cost models of 2-level filtering and 3-level filtering for each k and 4-level filtering for each k and I. The optimal values can then be found. Hardware Configuration Buffer Size. The buffer size is kept at the minimum value of one page. Typically, the buffer size can affect the number of pages to be accessed, and thus the searching perfor-mance. However, it does not affect the results of the following experiments significantly because of the two reasons. First, if a larger buffer size is used for the tree-like indexing structure, the additional buffer is naturally allocated to store the nodes near the root of the tree in order to increase the likelihood of reducing page faults. Unless there is a huge amount of buffer space, the performance gain is insignificant. Second, the saving of ac-cessing the flat file with larger buffer space can be achieved by another technique without actually having the space. The addresses of all the pages which need to be accessed are first recorded and sorted in a list. A l l pages which needs to be multiply accessed appear only once in the list. Therefore, each page needs to be accessed at most once, in this case, an 1-page buffer is sufficient. Disk Access Time. The random disk access time of a page is assumed to be 10 msec. Chapter 5. Experimental Evaluation 52 Computation Time Tdk- It is the C P U time needed to measure the the distance of two k-D vectors using £ 2 -norm. TdA,Td32 and Tdsl2 are obtained by measuring the actual C P U time taken for computing the distance between two 4-D, 32-D and 512-D vectors respectively. The measurements are repeated a thousand times, and the results are averaged. They are found to be 7.3997, 48.8579 and 828.5132 microseconds respectively. The other required Tdk can be calculated by linearly interpolation between 4-D and 32-D. A l l C P U timings are measured using a Sun Sparc-10. Indexing Structure The selected indexing structure used in the experiments is R-tree. The code of the R-tree was developed by Dr. Christos Faloutsos and his students. 5.2.2 Colour Histogram Experimental Setup Similarity Matrix A As discussed in Section 2.2.1, both CIE L U V and Munsell H V C colour models are suitable for the computation of the entries in the colour similarity matrix A. CIE L U V model is chosen for this experiment because it is simpler in concept, easier to implement and sufficient for our experiments. The similarity matrix A can be computed as follows. Each bin of the colour histogram corresponds to a point in the R G B colour space. To compute the entry a 8j in the matrix A , their points in R G B space are transformed to CIE L U V space. The distance dij between them are measured in the CIE L U V space by Z 2-norm. It can finally be normalized to any value between 0 and 1 as follows. Since the number of bins chosen for the colour histograms is 512, the size of this matrix Chapter 5. Experimental Evaluation 53 is 512x512. According to the Equation(2.4), singular value decomposition is applied on this matrix A using Matlab. Preprocessing of full high-dimensional histograms It is known from Equation 2.3 that computing the distance between two colour histograms involving cross-talk effect is of time complexity 0(N2). Even though the cost models of the filtering are still applicable by replacing by <ijv,A() in all formulas, the computation is very expensive especially when the dimension of the histogram is high. This happens in both the Faloutsos approach and Sawhney and Hafner approach [F+94, SH93, SH94]. Fortunately, dm^O can be computed in O(N) time using X 2-norm with some amount of preprocessing. The optimal transformation Ck defined in Equation (2.7) was proposed by Sawhney and Hafner for approximation purposes, i.e., k <C N. However, there is no particular reason to restrict k <C N, and not to allow k = N. The only doubt is when k is exactly —* —# A/, whether the transformation is exact. To answer this question, let IN = CNI and QN = C N Q , where CN = y/AB, and A and B are as defined in Equation (2.4). Equation (2.3) defines dN>A(Q,I) to be (Q — I)1 A(Q-I). Now by Equation (2.4), the latter term is identical to (Q — /)* (BTAB) (Q — I). Given that A is a diagonal matrix, BLAB is exactly the same as (Bty/A)(y/AB), or (y/ABY(y/AB). Thus, the term (Q - /)' A (Q - I), is equivalent to [(Q — I)*CN][CN{Q — I)}- Hence, it is necessary that dN,A{Q, I) is equivalent to dN(QN, IN), which, as defined in Equation (2.6), gives the L2-norm of QN and IN-Consequently, the IN can be stored for the final level of filtering. In this way, computing dN(QN, IN) costs much less time than dN,A{Q,I)-Chapter 5. Experimental Evaluation 54 Two sets of Colour Histograms There are two different types of colour histograms for the experiments. The first type is the colour histograms from real images. These images are collected from different sources. They include Smithsonian Institution images archive which consists mainly of exhibits images of museum, Hong Kong Picture Archive which consists of Hong Kong scenery and pop-stars.images, and some other landscape images. There are 1712 images in total. They cover a wide domain of different areas. After all these images are collected, they are cropped to remove the black margin, re-scaled to a fixed size of 128x128, and converted to Vista format [PL94]. The colour histogram of an image is built by incrementing the count of the bin which corresponds to the the R G B colour components of each pixel. After running through every pixel in the image, each bin count of every histogram can then be normalized to any value from 0 to 1. Another type is random colour histograms. Unlike the histograms from images, unlim-ited number of histograms can be generated easily without the restriction to the sources of testing images. Furthermore, the set of generated histograms is general enough without any possibility of bias to any particular domain. In [SB90], Swain and Ballard proposes a method to generate these random, but still realistic histograms. The procedure is as follows. 1. Randomly choose the number of non-empty bins Nne in histogram, where 3 < Nne < 512. 2. Randomly choose one of these Nne bins. Label it as bin i. 3. Increment the count of this bin i. 4. Repeat steps 2 to 3 Z times, where Z is the number of pixels in the image. Chapter 5. Experimental Evaluation 55 The colour histogram of a real image is typically clustered into a certain number of bins. The general idea of the above procedure is to simulate this clustering effect. 5.2.3 Eigenfaces Experimental Setup General Procedure The eigenfaces experiment basically follows the ideas used in [Sed95]. 400 grey-scale facial images were collected from the ftp site of Olivetti Research Laboratory. These images are taken from 40 different individuals. The sizes of all images are 112x92 pixels, and each pixel has 256 grey levels. Like the image testbed in the above section, these facial images are also stored in Vista format. The maximum number of principle components (i.e. 399) are extracted from these images using principle component analysis (PCA) . The full sets of all these principle components (PCs) are then the required full high-dimensional feature vectors. Some general ideas of P C A have been mentioned in Section 2.2.3, and the details of computing P C A will be discussed in the next section. However, the database of 400 images is too small for most practical purposes. A technique is devised ,in [Sed95] to increase the database to any size. The maximum and the minimum principle components at each dimension j of the above 400 images are recorded as maxj and mirij. A database of size M can be created by randomly generating M vectors where each j - t h dimension entry of every vector lies between maxj and mirij. This technique is to simulate the scenario that this new database is sampled to form a training set of these 400 facial images. Computing PCA In Section 2.2.3, it has been seen that the full set of N PCs are grouped into the vector I. In fact, these PCs are computed by applying a transformation matrix C onto / Chapter 5. Experimental Evaluation 56 [TP91, Sed95]. This matrix C can be found as follows. 1. Suppose there are M image feature vectors aft- of size RxR in the database and the training set consists of Ms sampled images. To ensure the rotation of axes is correctly done at the "center" of these images data, the axes have first to be translated to the average of all the feature vectors. The translated image data is called A l l of them can be combined to form a R2xMs matrix Y. 1 M s a = — J2 Xj yi = Xi - a Y = 2/1 2/2 • • • yiw, 2. The R2xR2 covariance matrix A is computed to measure how closely the dimensions of the data are correlated to each another. A = YYT 3. The eigenvectors 6tand corresponding eigenvalues A,- can then be computed for this covariance matrix A. Abi = Xibi 4. The eigenvectors bi is re-arranged according to the decreasing order of eigenvalues of Ai. 5. The required transformation matrix C for the P C A is then given by C - (^bx b2 ... bR2^ Chapter 5. Experimental Evaluation 57 The above procedure for computing the transformation matrix C is reasonable for data items where the dimension of the data vector R2 is small. However, there exists a serious practical problem in our case since R2 of even a small image can easily be over ten thousand. It is computationally infeasible to calculate the eigenvectors and eigenvalues of such a huge R2xR2 matrix A . To get around this problem, the eigenvectors b[ and eigenvalues A^ of a MsxMs matrix A' can be first calculated. A' = YTY The calculation of this matrix for images data is usually much faster because the number of sample Ms is much less than R2 in most cases. The required eigenvectors are equal to b{ = Yb\. Having t\, the original procedure can be followed as above. 5.3 Experimental Results 5.3.1 Colour Histograms Random Colour Histograms First, we will investigate how 3-level and 4-level filtering at their respective optimums compare with the optimums of the Sawhney and Hafner approach and the Faloutsos ap-proach using random colour histogram as the testbed. As discussed before, the Falout-sos approach has been found to behave experimentally very close to the Sawhney and Hafner approach with k = 4. Thus, in the graphs to be presented shortly, the nota-tion (4,512) and (Ar, 512) represents the Faloutsos approach and optimal Sawhney and Hafner approach respectively. Recall from Section 5.2.1 that both of them store origi-nal histograms. Hence, the ti/v^Q is used as the distance measurement in both cases. The 3-level and 4-level filtering approaches ( represented by (4, k, 512) and (4, k\, k2,512) respectively) store transformed histogram, and thus use d^-Chapter 5. Experimental Evaluation 58 Figure 5.6: Relative efficiency with original histograms The graph in Figure 5.6 compares the total time taken per query by various config-urations relative to the 3-level optimum (4,18,512). Clearly, the configuration (4,512) ( Faloutsos approach ) was the least efficient, taking over 9 times longer than (4,18, 512), and over 2 times longer than (12, 512). As the k value changes from 4 to 12, the configu-ration (12,512) becomes more efficient. However, even though (12,512) is the Sawhney and Hafner optimum, it still takes 4 times longer than the 3-level optimum. (4,18,512) is slightly more efficient than the 4-level optimum (4,8,28,512) which still outperforms the Sawhney and Hafner 2-level optimum. The selectivity at the finest level for this set of experiment is chosen to be about 0.1% and the page size is set to be 16K bytes. The efficiency gaps among all the configurations shown in Figure 5.6 are due largely to the fact that (4,512) and (12,512) store the original histograms, while the other two store the transformed histograms. The graph in Figure 5.7(a) shows the relative total time taken by various configurations, where this time all configurations store the transformed histograms. With transformed histograms, the efficiency gaps among the configurations Chapter 5. Experimental Evaluation 59 <4,512> <12,512> <4,18,512> <4,8.28,512> <4,512> <12,512> <4,18,512> <4,8,28,512> (a) Relative Total Time (b) Relative CPU & I/O Time Figure 5.7: Relative efficiency with transformed histograms become narrower. This is because with TdN A replaced by a much smaller T ^ , the last term in Equation (3.12) becomes less dominant. In fact, the entire C P U cost becomes much less than before, allowing the I/O cost to dominate in the total cost. This is confirmed by the graph shown in Figure 5.7(b). In that graph, for each configuration, the C P U and I/O costs are shown as percentages of the costs of (4,18,512). Note that even though for (4,18,512), both percentages are 100, this does not mean to say that the absolute C P U and I/O times are the same. It only means that the C P U and I/O costs of (4,18,512) are used as the bases for comparisons. The graphs in Figure 5.7 show that the 3-level optimum (4,18,512) and the 4-level optimum (4,8,28,512) st i l l compare favourably against the Sawhney and Hafner 2-level optimum (12,512). Specifically, the 2-level optimum takes 67% more C P U time and .18% more I/O time, and hence 22% more total time than 3-level optimum. The Faloutsos Chapter 5. Experimental Evaluation 60 (a) Storing Original Histograms (b) Storing Transformed Histograms Figure 5.8: Relative efficiency with histograms from real images configuration (4,512) takes 81% more total time than the 3-level optimum. Finally, even though the 4-level optimum takes 14% less C P U time than 3-level, a higher I/O cost forces the former configuration to be less efficient than the latter configuration. Colour Histograms from Real Images Next, we will investigate whether 3-level and 4-level filtering can still outperform 2-level if the testbed is changed to colour histograms derived from real images. We will first focus on the relative total time with all configurations storing transformed histograms at the finest level (i.e. ii/v() is used instead in all cases). Because there are fewer histograms from images than random histograms, the page size for this set of experiment is shrunk to 8K bytes. The selectivity of the finest filtering is raised to about 2%. The Figure 5.8(b) shows that the 3-level optimum (4,28,512) takes 15% less time than 2-level optimum Chapter 5. Experimental Evaluation 61 (18,512). On the other hand, the 3-level optimum is more than 5 times faster than 2-level optimum if the original histograms are used in the latter approach, as shown in Figure 5.8(a). This performance gain is even more significant than that of random histogram in the corresponding case. 5.3.2 E igenfaces The testbed is now changed from colour histograms to eigenfaces. We are going to use a slightly different methodology for this experiment. First, the notion of "original his-tograms vs transformed histograms" is no longer applicable in this set of experiments because only vectors of principle components of various dimensions are stored in the cor-responding levels of filtering. Second, only the costs of the optimal configurations of two, three and four levels filtering are reported. As presented in Section 5.2.3, the eigenfaces are used as an example of eigenimages and the data are those randomly generated in be-tween the dynamic range from original 400 training eigenfaces. Recall from Section 2.2.3 that the maximum dimension is the number of training images less one. The first set of the experiments uses this maximum number of dimension (i.e. 399) for the finest level of filtering. In order to compare with the results from random colour histogram as accurately as possible, the parameters for both sets of experiments are set to be very similar. A data set of size 5000 is used to match the size of random colour histogram. Like the experiment of random colour histograms, the page size is kept as 16K bytes, and the finest level selectivity is also to be 0.1%. The results of the experiments are shown in Figure 5.9(a). The 2-level optimum (12,399) takes 30% longer than the 3-level optimum (4,16,399). The 4-level optimum (4,14,32,399) takes 10% longer than the 3-level optimum, but is still takes faster than the 2-level optimum. Chapter 5. Experimental Evaluation 62 <4,16,399> <4,14,32,399> 140 120 100 0 E g 1 129% i H f i n % 100% • flip • ••1 rt < , 2 . , J 3 > <4,14,133> <4,14,32,133> (a) Finest level = 399-D (b) Finest level = 133-D Figure 5.9: Relative efficiency with eigenfaces However, it may be argued that the performance gain is due to the maximum di-mension used for finest level of filtering. One may prefer faster performance to extreme accuracy. Therefore, only a fraction of this maximum is required for the finest level. The second set of the experiments uses 133 dimensions for the finest level. Other details of the experiment are kept unchanged. The graph in Figure 5.9(b) is almost identical to the graph in Figure 5.9(a). It implies that the multi-level filtering can still be very useful even when the dimension of the feature vector is not very high. The experiments in this section show that multi-level filtering is very useful to enhance performance when compared with 2-level filtering no matter whether the testbed is colour histograms or eigenfaces. As predicted in Section 4.6, 3-level filtering may be the limit of multi-level filtering. Even 4-level filtering can usually outperform 2-level, it is inferior to 3-level. • Chapter 5. Experimental Evaluation 63 100 -> CO 95 CM - 90 0) > n 85 > 4-level 3-level ^ s * ^ % 41 Figure 5.10: Effect of changing page size on 3-level filtering relative cost 5.4 Effects of Different Values of Parameters 5.4.1 Page Size In this section, random colour histograms will be used as the testbed for experiments using different page sizes. Assume the size of each bin is 4 bytes. Because the size of one 512-bin colour histogram is already 2K byte, it is therefore unavoidable to have page sizes larger than the typical page sizes in the relational database. In this set of experiments, three different page sizes will be used, namely 4K, 8K and 16K bytes. The effect of different page sizes on the performance gain of optimal multi-level over the corresponding optimal 2-level filtering will be investigated. Unlike the previous sections, the results are normalized such that the cost of the corresponding optimal 2-level filtering at each page size is 100%. Figure 5.10 shows that the larger the page size is, and the smaller the percentage of the multi-level filtering costs, the more the multi-level saves compared with 2-level. Chapter 5. Experimental Evaluation 64 This effect can be explained if Figure 5.7 is examined again. The 2-level filtering I/O costs only 18% more than 3-level, but the 2-level CPU costs 67% more than 3-level. In other words, the performance gain of 3-level filtering over 2-level is much more . significant for CPU.than that for I/O. In fact, similar observations hold for 4-level filtering too. Unfortunately, the I/O cost shares a larger portion than CPU cost in the total cost. Therefore, the bottleneck of the overall multi-level saving is due to its relatively little saving of I/O cost. If the I/O cost plays a less significant role in total cost, the huge saving of CPU cost will be more obvious. Thus, there will be increased overall cost saving. One solution is to increase the page size, thus less pages are needed to be accessed. Therefore, the I/O cost will also be less, and share a smaller portion of the total cost. This explains why increasing page size can raise the relative performance gain of multi-level compared with 2-level. In fact, it can be predicted that increasing the page size further may save even more. . It is important to note that the explanation in the above paragraph assumes the page size does not affect access time. As matter of fact, this is a reasonable assumption. Nowadays, the transfer rate of hard disk can reach to the level of 20 M bytes per second. Doubling the page size from 8K to 16K bytes adds merely 0.39 msec which is insignificant compared with the total access time (which is assumed to be 10 msec per page). 5.4.2 C P U S p e e d This section will test how the saving of multi-level filtering is affected by varying the CPU speed. The experiments are very similar to the testing of different page sizes in the above Section 5.4.1, and the page size for these experiments is set to be 16K bytes. Three different CPU speeds are chosen: the speed of the original Sparc-10, half as fast and double as fast. Similarly, all the results reported are the percentage time relative to their respective 2-level optimal time. Chapter 5. Experimental Evaluation 65 8 If CO mm £ o tt 4-leveh/' 3-level<x»^'^ 05x Figure 5.11: Effect of changing C P U speed on multi-level filtering relative cost Figure 5.11.shows that the percentage time needed by both optimal 3-level and 4-level filtering relative to their optimal 2-level filtering increases with the speed of the C P U speed. In other words, the saving of 3-level and 4-level drops when the C P U is faster. The explanation is similar to the above Section 5.4.1. Recall that the saving of 3-level I/O cost is not much compared with that of C P U cost. However, I/O cost is the dominant factor of the total cost. Therefore, the total saving of 3-level cannot be too much either. As a result, if the C P U speed of the machine is faster than the testing machine ( Sun Sparc-10 ), the C P U cost will become cheaper. Relatively speaking, the I/O cost will become even more important among the total cost. Therefore, the saving of multi-level filtering decreases when the C P U is faster and vice versa. This implies that the benefits of multi-level filtering are particularly obvious for the machines with less processing power. Chapter 5. Experimental Evaluation 66 5.4.3 Indexing Structures The indexing structure used in all sets of experiment so far is R-tree. This is designed for spatial K-D objects and disk paging with large database. As far as our application is concerned, R-tree is not a very space efficient indexing structure compared with others such as adaptive K-D tree for two reasons. • In order to facilitate paging, R-tree stores a single node per page. When the node size is small compared with the page size, it wastes considerable amount of space. • In Section 2.3.3, it is seen that R-trees can support K-D spatial objects by specifying the K-D rectangular boxes bounding the objects. In contrast, the image feature vectors are only points in K-D space. When R-tree is used to store these vectors, the &>th dimension values of the points will be duplicated for both values of k-th dimension rectangular box range. Consequently, half of the space occupied is wasted. If a more compact indexing structure is used instead, some I/O accesses of the indices will be saved. Like the effect of increasing page size, fewer I/O accesses implies that the I/O cost occupies less share in the total cost. It seems to be true that this will eventually increase the performance gain of 3-level relative to 2-level. However, this con-clusion ignores other important factors. Consider the following experiment of replacing R-trees with more compact adaptive K-D-trees as indexing structures for storing colour histograms.1. A l l other parameters are kept unchanged. The analysis will be based on the comparison of the effects on the I/O costs of 3-level and 2-level filtering when the in-dexing structure is changed from R-tree to K-D-tree. Their C P U costs follow the similar analysis. 1The source codes of the adaptive K-D-tree are obtained from Ms Andishe Sedighian of Computer Science department in University of British Columbia Chapter 5. Experimental Evaluation 67 (a) 3-level I / O costs (b) 2-level I / O costs Figure 5.12: Effects of changing indexing structure on 3-level and 2-level costs Figure 5.12(a) shows that the I/O cost of 3-level filtering shifts downward when the indexing structure is changed from R-tree to K-D-tree. This is because the only difference between these two trends is the difference in the I/O costs of their respective 4-D tree structures. Indeed, this difference is independent of k. Therefore, the trend change is a simple downward shift. The I/O cost of 2-level filtering consists of the costs accessing the indexing structure and the flat file (i.e. the first and second terms of Equation. (3.14)). Figure 5.12(b) shows that the index I/O cost (first term) of 2-level filtering for K-D-tree is much less than that for R-tree. It is important to note that not only is the amount smaller, but also the slope flatter. The costs for accessing the flat file (second term) must be the same no matter what the indexing structures are. Combining the two terms, both the total I/O cost curve of 2-level filtering for R-tree and K-D-tree resembles U-shape. However, the Chapter 5. Experimental Evaluation 68 one of K-D-tree is a much wider U-shape than that of R-tree. This is due to the flatter slope of K-D-tree index term, not its smaller amount. Both the total I/O costs of 3-level and 2-level drop when a K-D-tree is used instead of R-tree. However, the cost of 3-level drops slightly while the 2-level drops sharply. Consequently, the 2-level optimum is so cheap that the 3-level optimum may lose. It can be concluded that the performance gain of 3-level filtering relative to 2-level filtering is not obvious when the indexing structure is changed. If the indexing structure behaves like K-D-tree ( i.e. its cost of accessing the index increases very slowly when the dimension k increases), 3-level filtering may not outperform 2-level filtering. However, if the indexing structure is only more compact, 3-level filtering may easily outperform 2-level filtering. As a summary, most of the experimental results agree with the the evidence provided by the trend analysis in the previous chapter that 3-level filtering can typically cost less than 2-level filtering. Chapter 6 Design and Implementation of Optimizers 6.1 Introduction Assume a dataset of M images is given. In order to create an efficient database for their image feature vectors, a certain filtering configuration needs to be selected so that the corresponding k-D indexing structure and the flat files of intermediate levels can be built with these image feature vectors. The configuration here refers to the number of levels of filtering and the dimension of each level. It has been shown in the previous chapters that 3-level, though often, may not always outperform 2-level and 4-level filtering. Further-more, Figures 5.7, 5.8 and 5.9 indicate that the optimal configurations need not be the same for all cases. The parameters such as ak, j3k, IndexJCPU and Index-I/0 may vary significantly according to the characteristics of the image features, size of the database, page size, indexing structures. It is necessary to find a new set of parameters for the new dataset, so its corresponding total cost and optimal configuration can be found. Other-wise, there is often a penalty at run-time for each query when sub-optimal configuration is used. Even a 10% run-time penalty for every query can add up to a significant amount. The goal of the optimizer is to find the optimal configuration. One simple approach of achieving this goal is to repeat the procedures of the previous experiments by considering all the images of the database and configurations find the optimal one with minimum cost. This exhaustive searching certainly takes a long time to complete, especially for huge databases. In the following, we present an optimizer that relies on sampling to find 69 Chapter 6. Design and Implementation of Optimizers 70 the optimal configuration for a certain dataset. 6.2 Specifications of the Optimizer 6.2.1 Input to the Optimizer Since the filtering structure, be it 2-level, 3-level or 4-level, is used for all queries, the op-timizer accepts a selectivity mix to be given as input. More specifically, the database de-signer can specify and input a list of the form: ( ( 7 1 , Pi) , . . . , (lw,Pw)), where 7 1 , . . . , 7™ are selectivities and p i , . . . ,pw are the respective (expected) frequencies. For instance, given that the database has 10,000 images, and that most users would ask for the top-10, top-15 and top-20 images, the database designer can input (0.001,33.3%), (0.0015,33.3%), and (0.002,33.3%). As will be detailed later, the output of the optimizer is the optimal configuration with value weighted by all the selectivities and their frequencies. 6.2.2 Qualities of Service Depending on how much "compile" time the database designer wants" to spend, differ-ent "qualities of service" can be provided There are three operations to be performed: (a) finding the 3-level optimum, (b) finding the 4-level optimum, and (c) finding the 2-level optimum. These operations are in ascending order of computation time, and they are performed by the 2-level, 4-level and 3-level optimizer respectively. As will be de-tailed later, Operation (a) takes the smallest amount of time because it is equivalent to searching for the minimum in a 1-dimensional space (i.e., the space of k, as presented in Section 3.4.1). Operation (b) takes considerably more time, because it amounts to searching for the minimum in a 2-dimensional space (i.e., the space of k^ and k2 as pre-sented in Section 3.5.-1). Finally, Operation (c) takes the most time, even though the search space is only 1-dimensional (i.e., the space of the dimensionality A:-of'the index). Chapter 6. Design and Implementation of Optimizers 71 This is because, unlike the other two operations, the 2-level optimizer has to estimate the C P U and I/O costs searching a ^-dimensional index - for each value of k tested by the optimizer. With respect to the three operations above, there are three qualities of service. The most basic service performs Operation (a) only. The next class of service does (a) and (b), and output the configuration with the lower cost found by the two operations. The most elaborate service performs all three operations, and output the lowest of the three. The rationale of such an ordering is as follows. As analyzed in Section 4.6, in most cases, the 3-level optimum is of lower total cost than the 4-level optimum and the 2-level optimum. And in those cases when this is not true, the 3-level optimum is not very far off the real optimum either. Thus, if limited time can be spent for optimization, performing Operation (a) alone is the best bet. If more time can be spent, it is worthwhile to perform Operation (b), which takes less time than Operation (c). If all three operations are carried out, it takes the longest time (e.g., tens of minutes of C P U time). But for an image database with very frequent colour queries, tens of minutes of compile time may still be worthwhile. The design of the the optimizer which finds the 3-level optimum will be discussed in Section 6.3.1 (3-level optimizer). Since the design of 4-level optimizer is a simple extension of the 3-level one, its discussion will be skipped from this thesis. Finally, Because the 2-level optimizer involves different issues, it will be discussed in Section 6.3.4. 6.2.3 Accuracy The two most important issues of an optimizer are its performance and accuracy. Obvi-ously, the performance can be easily measured in terms of the total time taken to find the optimal configuration. However, the definition of accuracy is not as obvious, let alone its measurement. What exactly is the accuracy of an optimizer? The accuracy must Chapter 6. Design and Implementation of Optimizers 72 be measured with respect to a certain reference point, or yardstick. In order to answer this question, this yardstick must be defined first. Recall that the ultimate goal of the optimizer is to find the optimal configuration (one with lowest cost) for any query. This yardstick could be defined as the the optimal configuration for any query, regardless of the resources and time needed. However, this goal is clearly unachievable because of two reasons. First, the real query of user is still unknown when the optimizer is invoked. Second, the configuration may be optimal for a query, but not for the others. There-fore, a certain reference query has to be defined for the sake of the cost comparisons even though the reference query may not be the same as the real queries of the users. Multiple reference queries are chosen to reduce the bias of any particular single reference query. The yardstick, or the best possible optimizer, is thus the one which considers all M images in the database with respect to the reference query (or queries). From here onwards, this yardstick is called the reference optimizer. 6.3 Design of the Optimizer In this section, the algorithm of the first proposed version of 3-level reference optimizer will be reported, and its problems will then be discussed. After that, the algorithm of its revised version will follow. A more efficient optimizer which uses sampling techniques will then be presented. Finally, the design of 2-level optimizer will be compared with that of 3-level. 6.3.1 A reversible 3-level Reference Optimizer As it will be seen in the latter sections, the 3-level reference optimizer is relatively simple when compared with the 2-level one. As the first terms in Equations(3.15) and (3.16) are constant to dimension k, it is sufficient. to consider only the remaining terms in Chapter 6. Design and Implementation of Optimizers 73 both equations. Only the selectivities ak and fa are needed to compare total costs of configurations with different k. In the beginning, it is also assumed that there is only one selectivity ((7,100%)) specified in the selectivity mix. The modification of the design which is needed for multiple selectivities in the mix will be considered in the Section 6.3.3. The objective of the reference optimizer for a database of size M is to 1. find the threshold distance D from 7; 2. compute the corresponding k-D selectivities (i.e. ak and fa ) from D and total cost for all k; choose a configuration with the minimum cost. The first approach to meet this objective is as follows. Algori thm 1 1. For a reference query Ql, compute the Li-norm distance dist^^l\Ql) between the original high-dimensional (N-D) feature vectors of every image I and query Ql. 2. Select the (7 x M)-th smallest distance of all dist^^iQ1) computed. This is the threshold distance Dl 3. Repeat step 1 and 2 with other reference queries. Calculate the average threshold distance Dav9 of all Di. -* -*• 4- For a given dimension k and query Ql, compute the Li-norm distances distk(Ik, Q\) between the abstracted k-D feature vector between every Ik and query Q\. 5. Count the number Ul of images whose feature distances from the corresponding k-D vector Q\ are less than Davg. The required selectivity ct\ is then given by Ul/M, while fa can be computed similarly by measuring range distance, rather than Li-norm distance. Chapter 6. Design and Implementation of Optimizers 74 (a) Irreversible (b) Reversible Figure 6.13: Reversibility problem 6. Repeat steps 4 and 5 with other reference queries. The required ak and fa are the averages of a\ and fa respectively over all i. 7. Having computed ctk and fa, the relative total cost can be calculated for comparison according to the cost models of Equations (3.16 and 3.15). 8. Repeat the whole procedures with different dimensions k and find the one which gives minimum relative cost. This is the required optimal configuration. When k = N, the computed selectivity ak should naturally be equal to the input selectivity 7. This property is called reversibility. An algorithm is reasonable only when it is reversible. Unfortunately, this algorithm is not reversible. Figure 6.13(a) shows that the problem happens when the threshold distance D1 and selectivities ak and faare averaged. Therefore, another approach is proposed to deal with this reversibility problem. Chapter 6. Design and Implementation of Optimizers 75 Algorithm 2 1. For a reference query Ql, compute the Li-norm distance distill, Ql) between the original high-dimensional ( N-D ) feature vectors of every image I and query Q1. 2. Select the (7 x M)-ih smallest distance of all dist^ilrQ1) computed. This is the threshold distance D%. —* ~* 3. For a given k and Q1, compute the Li-norm distances distk(I,Ql) between the abstracted k-D feature vectors of every image I and query Ql. 4- Count the number U of images whose feature distance less than D%. The required se-lectivity a\ is then given by U% jM, while f3k can be computed similarly by measuring range distance rather than Li-norm distance. 5. Having the computed a\ and J3k, the relative total cost cost\ of this query Ql and dimension k can be calculated. 6. Repeat steps 1 to 6 with different possible values of k and i. 7. For each dimension k, calculate the average cost costakV9 of costk over all i. 8. The dimension k which gives the minimum average cost is the optimal configuration. Basically, this revised version looks very similar to the original version. The crucial differences are at the points where the averagings are computed. Since this revised version completes the cycle of computing threshold distance and then ak ( and (3k) for a given query i and dimension k before it proceeds with other values of k and i, the requirement of reversibility can be satisfied. This is shown in Figure 6.13(b). 6.3.2 An Efficient 3-level Optimizer Using Sampling Algorithm 2 describes the best possible or reference optimizer because it consider all M images. As mentioned before, the problem is that it is inefficient especially when M is Chapter 6. Design and Implementation of Optimizers 76 large. It is desirable to devise an optimizer which can trade a certain degree of accuracy for better performance. In other words, only a fraction of the M images is sampled and considered in the above algorithm. The accuracy of the more efficient optimizer using sampling can be measured against the reference optimizer. By sampling theory, the exact numbers of samples can be computed according to the allowable errors and confidence levels which are specified by the database designer. Algorithm 2 is required to be modified for sampling. Instead of using all M images in step 1 and 2, only a sample fraction T\. of M is needed to calculate the threshold distance D\ for reference query Ql and dimension k. To compute selectivities a\ a n d a sample fraction F%k' of M is used. Now, the next section describes how to compute the sampling fractions Txk and J-lk' for a certain allowable error and confidence level. The sampling theory provides a way of computing the sample size needed such that the sample mean is close to the population mean with a certain allowable error and confidence level. In order to calculate the threshold distance D\ from the selectivity 7 (in step 1 and 2 of Algorithm 2), the (7 x M)- th smallest distance of distill, Q1) is used. In other words, this is to calculate sample 7 percentile, rather than the sample mean. Sample percentile is not as much discussed topic as sample mean. In [SSW92], Sarndal et al. describes how to use sampling to estimate population median, which is special case of 7 percentile. In general, estimating sample percentile can be quite complicated. However, since the sampling method used here is simple random sampling without.replacement, the implementation is still feasible. Simple modifications are made to the procedures presented in [SSW92] so that any 7 percentile can be accepted, rather than 7 = 0.5 only. Our aim is to compute the sample size m such that the given sample percentile 7 will be close to the population percentile with an allowable error e and confidence level cl (with the normal distribution statistic Zci/2, or simply Z). The formulas are as follows. Chapter 6. Design and Implementation of Optimizers 77 MMl - 7) ra Mi- + 7(1 - 7) • ra M where u is the variance and M is the population size. After the threshold distance is estimated, the k-D selectivities are then computed based on the threshold distance (in step 3 and 4 of Algorithm 2). The computation of sample A>D selectivities is, in fact, the estimation of the proportion of the samples whose k-D abstracted vector distance is less than threshold distance from the corresponding query vector. The sampling of proportion is very similar to that of mean. For a population size M, one approach of estimating the sample size ra' for given sample proportion p, population proportion P, allowable error e and confidence level cl (with the corresponding Z) is as follows [Coc64]. z2PQ ra where Q = 1 — P. Obviously, the population proportion P is unobtainable in our case.. A method of two steps sampling is used to estimate the population proportions P to determine the sample size m' [Coc64]. The pre-sampling step ( of size m[ ) is used to estimate pi of population proportion P, and then the sample size ra' for real sampling step can be computed as v. follows. Chapter 6. Design and Implementation of Optimizers 78 / Piqi , 3 - 8pi(?i 1 - 3pn7i m = h — h — v piqi vm-y T{' = — k M where qi = 1 — p\. Suppose p\ = 0.1, q\ = 0.9, e = 0.01, Z = 1.0. The population size M is 10000, and the initial sample size is 100. The calculated rn' is 998.33, and so the sample fraction r is 0.099833. 6.3.3 Se lec t iv i ty M i x Algorithm 2 can be generalized to handle selectivity mix ( ( 7 1 , Pi) , . . . , ('JwiPw)), not only single selectivity ((7,100%)). For a given query Q\ a threshold distance D1- is computed for each designer specified selectivity 7 ^ . The time of computing threshold distance TJ] for w number of 7 j needs not to be p times longer than that of a single 7 . The bottleneck is to compute the distN(I,Q*) for every images i (step 2 of Algorithm 2). In fact, this can be done once for all w number of jj. Although the selection of the (jj x M) has to be repeated w times, the selection algorithm can be implemented very efficiently in O(M). Similar savings hold for the procedure in calculation of k-D selectivities too. The expensive procedure of computing distk(Ik, Q\) for every images can also be done once for all w. In step (4), the counting of images whose feature distances less than the threshold distance TJ] can also be done in a single cycle for all w number of different j. With all these time-saving techniques, the algorithm for selectivity mix will only be slightly less efficient than that for single selectivity. The total cost of each configuration with selectivity 7 * can be weighted by the cor-responding frequency pl. The optimal configuration can be obtained by finding the one which gives the minimum weighted total cost. Chapter 6. Design and Implementation of Optimizers 79 6.3.4 A 2-level optimizer Since the algorithm for calculating the threshold distances of the 2-level optimizer is the same as that of the 3-level optimizer, the first two steps of Algorithm 2 hold for the 2-level optimizer too. But unlike the 3-level optimizer, the comparisons between the 2-level optimizers with different dimensions k have to take the indexing structure into consideration. In other words, the complete cost models of both C P U and I/O ( Equation(3.12) and (3.14) ) have to be computed. In addition to ct%k and /3k, Index-IOQ and Index-CPU() are both needed for different k and i. Therefore, the steps 3, 4 and 5 of Algorithm 2 need to be modified as follows. 3. For a given k, build an k-D indexing structure and insert all the feature vectors Ik of every images into it. 4. Search the indexing structure with the query Q\. Record the C P U time taken as IndexJjPUQ and pages accessed as Index -IOQ. ct\ and p\ can be computed similarly. 5. Having the computed Index-CPUQ, Index -IOQ, a\ and (3k, the total cost cost\ of this query i and dimension k can be calculated. This modified algorithm is then the most accurate 2-level optimizer which can be used as reference to other more efficient optimizers. Sampling technique can also be used to devise more efficient optimizers. Its idea for 2-level optimizer is the same as that for 3-level one in the calculation of threshold distances. However, the sampling of 2-level optimizer may not work as well for the following reason. For the 3-level optimizer, different sample sizes are computed for different queries i in the calculation of selectivities. If this were done for 2-level optimizer too, indexing structures of different sample sizes are needed to be built for different queries Ql (step 4). Chapter 6. Design and Implementation of Optimizers .80 Building all these indexing .structures is a very time consuming step. In order to avoid building indexing structures so many times for different sample sizes of different queries Q\ the maximum sample size is chosen among them. Therefore, only one indexing structure is built with this maximum number of samples. The advantage is that this bottleneck step is only required to be executed once for all queries. Unfortunately, it can be expected that even if a single sample size of a certain query Ql is large, say close to the population size, the indexing structure has to be built with almost all of the population. In other words, the technique of sampling may not help much to improve the performance. 6.4 Experiments 6.4.1 3-level optimizer Details of the Experiments In order to test how the design of the optimizers works in practice, they are implemented in C++. A dataset of 5000 512-bin random colour histograms is generated using the method described in Section 5.2.2. Ten histograms are chosen from these random his-tograms as the reference queries. The reference optimizer is first implemented by using all 5000 colour histograms. Recall that the procedures of optimizer in Algorithm 2 are divided into two stages. The first stage is to calculate the threshold distance (step 1 and 2), while the second one is to calculate the k-D selectivities of all k, and therefore the optimal cost and corresponding configuration can be found (step 3 to 8). The initial sample size mi for second stage is set to 500. The idea of the following experiments is to test the optimizers which use sampling at the first and second stage separately. Therefore, the individual effect of the samplings at each step can be observed. The exact procedures of testing 3-level (4, fc, 512) optimizer are as follows. Chapter 6. Design and Implementation of Optimizers 81 First of all, the reference optimizer is to be created by considering the whole popu-lation of 5000 histograms at both stages (i.e. no sampling at both stages). By following Algorithm 2, the optimal configuration (4, kref, 512) of reference optimizer and its cost function costakV9 can be found. Now, the more efficient optimizer can be created by modifying Algorithm 2 for sam-pling. The sample size for computing the threshold distance is fixed at population size (i.e. no sampling in the first stage). A certain confidence level cl and an allowable error e for the stage of computing the k-T) selectivities (sampling in the second stage) are chosen. This can give the optimal configuration (4, ksam, 512) and cost function costakvg of this optimizer using sampling. The allowable error e can be varied to investigate the effect of sampling at the second stage. Next, the effect of sampling in the first stage can be examined as follows. A certain confidence level cl' and error e' are chosen for computing threshold distance (first stage). The confidence level cl and error e can be fixed for computing the selectivities (second stage). Similarly, this gives another optimal configuration (4, Ar s a m , 512) and cost function costak9 of this optimizer using sampling, e' can be varied to investigate the effect of different sampling in the first stage. Error calculation Suppose the optimal configurations given by the reference optimizer and efficient opti-mizer using sampling are (4, fcre/,512) and (4, ksam, 512) respectively. Furthermore, both the cost functions costak9 and costak9 are now available for different k of reference op-timizer and optimizer using sampling respectively. One way of measuring the error of optimizer using sampling (4, ksam, 512) can simply be defined as the percentage differ-ence between the costs costT9 • and costT9 . However, the error may not be zero even if ksam = kref. A better way of defining error is the percentage difference between the costs Chapter 6. Design and Implementation of Optimizers 82 cost0?9 and costV19 . In other words, it is the difference between the costs computed by the same reference optimizer cost function with krej and ksam. This provides more accurate comparison between optimizers. Results The reason for using sampling for the optimizer is to save time when compared with the reference optimizer. The first set of the results shows the absolute time taken by the optimizer using three different allowable errors e in the second stage. These e are absolute errors which are chosen to illustrate the effect of different samplings. The results of the time needed are also reported in two parts separately. The first one measures the time taken to compute the threshold distance, while the second one measures the time taken to compute the selectivities k and the therefore optimal total cost for all k. The confidence level is assumed to be 70%. Figure 6.14(a) shows the result. It can be seen clearly that the time taken for computing the selectivities and the optimal cost (second stage) decreases when the allowable error e increases. This can be easily explained. Since e increases, the sample size computed from e drops, and the time taken will thus be less. The time taken by the first stage is also included for comparison purposes. It can also be seen that the time taken for computing the threshold distances (the first stage) almost remains unchanged when e increases. It is because the confidence level, error, and thus the sample size computed for the first stage are-kept constant in this set of experiments. Figure 6.14(b) shows the effects on the accuracy. The optimizer with sampling at confidence level of 70% and allowable error e of 0.003 gives the same configuration as the reference optimizer. Hence, no error is shown. Even when e reaches 0.005, the error is merely 2.48%. These results are very encouraging since the optimizer is shown to be both efficient and accurate. Chapter 6. Design and Implementation of Optimizers Figure 6.14: Performance and accuracy of 3-level optimizer using different samplings the second stage Chapter 6. Design and Implementation of Optimizers 84 (a) T o t a l t ime taken (b) E r r o r percentage Figure 6.15: Performance and accuracy of 3-level optimizer using different samplings in the first stage The next set of experiments is to test the effect of sampling in the first stage on accuracy and performance. Hence, the confidence level and allowable error are fixed (at 70% and 0.002 respectively ) for the second stage which computes all k-D selectivities and total costs. The experiment then compares the accuracy and performance at sampling with confidence level of 70% and three different allowable error e' of the first stage which computes threshold distance. Figure 6.15(a) shows the effect on performance. Similar to the result of varying sampling in the second stage, the time taken decreases as the allowable error increases because the computed sample sizes decrease. Further-more, it is also getting less accurate as allowable error e' increases. Even though the time taken is fairly short at e' = 0.0005, Figure 6.15b shows that the error is unreasonably high at almost 40%. In contrast, the error is much smaller when the sampling is varied in the second stage as shown in Figure 6.14b. It is thus better to vary the sampling in Chapter 6. Design and Implementation of Optimizers 85 200 2> > 180 ? o to o 160 140 120 100 4 80 y * y stage 2j / y y y stage 1 Figure 6.16: Relative efficiency of different numbers w of selectivities the second stage to achieve better performance with reasonable accuracy. The last set of experiments for 3-level optimizer is to test how the number of se-lectivities w in the designer specified selectivity mix can affect the performance of the optimizer. In order to avoid any effect of the selectivity and error values, all the selectiv-ities and the corresponding allowable errors are set to be the same throughout the mix, i.e. ((7, ^p%), • •., (7, ^ % ) ) - The graph in Figure 6.16 shows the relative time needed in the first stage (for computing threshold distance) and the second stage (for computing different fc-selectivities and total costs) at different number of selectivities. The darker curve shows that the. time needed for computing threshold distances is virtually the same for any number w of selectivities. This can be explained because the most time-consuming part which computes every distances (step 1 of Algorithm 2) is done only once for all number of selectivities. The lighter curve shows the time needed for the second stage increases slowly with w. In fact, the time needed increases by less than 100% when w increases from 1 to 5. This shows that the time-saving techniques Chapter 6. Design and Implementation of Optimizers 86 described in Section 6.3.3 work effectively. 6.4.2 2-lev.el Optimizer The experimental details, procedures and error calculation of 2-level optimizer are all very similar to those of 3-level optimizer. A 2-level reference optimizer is also needed to provide a yardstick of measuring accuracy. AIL the procedures of are still divided into the two stages of finding threshold distance and optimal configuration. Results Like the 3-level optimizer, both the performance and the accuracy results are also divided into two stages. The first set of the results tests the 2-level optimizers which use sampling at e = 0.002, 0.003 and 0.004 in the second stage and no sampling in the first stage. Thus, this set of results can show the effect of sampling in the second stage on the performance and accuracy of 2-level optimizer. Figure 6.17(a) shows similar results as the 3-level optimizer. The time taken in the second stage decreases when the allowable error increases. As shown in Figure 6.17(b), the sampling result is so accurate that the configuration returned by e = 0.002 and e = 0,003 are the same as the reference optimizer. Even when the error e reaches 0.004, the percentage error is only 0.7%. However, there are two important differences between the results of the 3-level and the 2-level optimizers. As explained in Section 6.3.4, the time saving of the 2-level optimizer due to sampling is not as great as that of 3-level optimizer since the computed sample sizes are typically close to the original population size. Moreover, the time spent on the second stage can be eight times more than that spent on first stage. This can be explained by the fact that the second stage of 2-level optimizer needs to find Index -IOQ and IndexJJ PUQ. This involves the building of the indexing structure which is a very time consuming process even when the sampling is Chapter 6. Design and Implementation of Optimizers 87 Figure 6.17: Performance and accuracy of 2-level optimizer using different samplings in the second stage Chapter 6. Design and Implementation of Optimizers 88 used. Hence, it is not necessary to do the experiment of varying the sampling allowable error e' in first stage. It is because the effect of the time saving of the first stage is insignificant on the total time spent on the both stages. Because of these two factors, the 3-level optimizer can be much faster than 2-level optimizer. For example, when e = 0.004, Figures 6.14 and 6.17 show that 3-level optimizer takes a total time of 32 seconds, but 2-level optimizer takes 227 seconds. It can be concluded that the 3-level optimizer can efficiently find the near-optimal (if not the real optimal) configuration with high accuracy by the technique of sampling. Although the 2-level optimizer may not be as efficient as the 3-level optimizer, Chap-ters 3 and 4 have shown that the configuration returned by the 3-level optimizer can typically outperform that returned by the 2-level one. Therefore, the most basic service provided by solely 3-level optimizer can be both efficient and accurate. For example, an experimental result indicates that in about 32 second, the 3-level optimizer can find the configuration whose error is less than 2-5%-Chapter 7 Conclusions In order to search efficiently for the database images which look similar to a specified query, features of the query image and the images in the database have to be extracted and compared. The dimension of these image feature vectors (such as colour, texture, shape and eigenimages) are typically very high so that sequential comparison or indexing is inefficient. 2-level filtering has been proposed to solve this problem. The high dimensional vectors can be abstracted to lower dimensional k-D vectors which are small enough to be inserted to multi-dimensional indexing structure. It is clear that 2-level filtering is more efficient because most of the invalid image candidates have been filtered out by the efficient coarse filtering stage, and the expensive fine filtering stage needs to process only the few remaining candidates. However, 2-level filtering still faces an inherent dilemma: When k is too small, there will be too many candidates left for the expensive fine filtering stage. When k is too large, the dimensionality curse makes the coarse filtering inefficient. This thesis proposes a new idea, multi-level filtering, to solve this dilemma. By inserting additional intermediate levels in between the coarsest and finest filtering stage, the dimension of the indexing structure will no longer affect the number of candidates left for the fine filtering stage. Although 3-level filtering looks beneficial intuitively, more concrete evidence can also been found in this thesis. To achieve this purpose, a cost model is derived for different filtering configurations. The cost model does not only estimate the actual time taken by a certain configuration for a particular dataset and indexing structure, it can also 89 Chapter 7. Conclusions 90 provide an essential foundation for trend analysis of different filtering configurations. The trend analysis is capable of predicting the general trends of 2-level (k, N) and 3-level filtering (4,k,N) with respect to different k. More importantly, it can also strengthen the important hypothesis that the optimal 3-level filtering can outperform the optimal 2- level filtering. Furthermore, experiments are set up to further test the correctness of the above hy-pothesis. Two different testbeds, namely colour histograms and eigenfaces, are selected. Both testbeds show that the optimal 3-level filtering can usually offer significant savings ranging from 15% to 30% when compared with the optimal 2-level filtering. In particu-lar, when compared with 2-level filtering approach by Sawhney and Hafner, the optimal 3- level filtering can be over 400% faster. It has also been shown that the percentage of saving can be further increased by enlarging the page and/or reducing the C P U speed. Last but not least, an optimizer is used to analyse the dataset and find the optimal configuration for it. Different qualities of services can be provided by different combina-tions of optimizers: 3-level alone, both 3-level and 4-level, and all 2 to 4-level optimizers. "A reference optimizer is first designed to take all the images of the database into account. Although this reference optimizer is too slow to be useful, it can serve as a yardstick so that the accuracies of other more efficient optimizers can be measured. These more effi-cient optimizers only consider a fraction of all images as samples. By sampling theories, the exact number of samples can be computed according to the designer specified confi-dence levels and error levels. In general, when more error is allowed, fewer samples are needed. Thus, the optimizer becomes efficient, but its accuracy drops. 2-level optimizer costs much more time than 3-level optimizer because it involves building the indexing structures. Typically, the 3-level optimizer alone is needed because it can usually gives the.optimal configuration among any number of levels of filtering, and it is the most efficient of all. Chapter 7. Conclusions 91 7.1 Future Work There are three possible extensions which can further verify the advantages of multi-level filtering. • One of the benefits of the cost models in this thesis is its flexibility. It is general enough to handle any dataset and indexing structure. However, it also suffers from the problem that the trend analysis may not reflect the real life situation too closely. This is because the behaviour of the indexing structure is not modelled very accu-rately. Therefore, only a rough general trend of the Index -IOQ and Index J3 PUQ is used for analysis. Hence, the overall trend can only be approximated. Possible future work is to analyse the behaviour of the indexing structure in detail. A more accurate formula can thus be derived to model both Index -IOQ and Index JZ'PU Q. Consequently, the results of the trend analysis can be more convincing. • In this thesis, only the R-tree and the K-D-tree have been considered as the indexing structure. It is certainly worth further investigation of the effects of other possible multi-dimensional indexing structures. For example, the R*-tree and other indexing structures which are designed for high-dimensionality data are good candidates. • Colour histograms and eigenfaces are the two selected testbeds to evaluate the idea of multi-level filtering experimentally. Although both of them are popular and represent explicit and implicit features, it is still worthwhile to test this idea with other high dimensional image features. Examples are shape and texture. This can enlarge the application domain of multi-level filtering. Another similar, but different idea is to increase the size of the database to test its effect. Bibliography [A+95] A . D. Alexandrov, W. Y . Ma, A . E l Abbadi and B.S. Manjunath. "Adaptive Filtering and Indexing for Image Databases." in SPIE Proceedings, Storage and Retriveal for Image and Video Database, III. Vol. 2420, pp. 12-23, February 1995. [B+90] Norbert Beckmann, Han-Peter Kriegel, Ralf Schnedier, Bernhard Seeger. "The R*-tree: A n Efficient and Robust Access Method for Points and Rectangles", in Proceedigs: ACM SIGMOD, pp. 322-311, 1990. [Ben75] Jon Louis Bentley. "Multidimensional Binary Search Trees Used for Associative Searching." in Communications of the ACM. Vol. 18, No. 9, pp. 509-517, 1975. [Car75] A . Cardenas. "Analysis and Performance of Inverted Data Base Structures" in Communications of the ACM, Vol 18, No. 5, pp253-263, 1975. [Chi94] Tzi-cker Chiueh. "Content Based Image Indexing" in Proceedings: 20th VLDB Conference, pp. 582-592, Satiago, Chile, 1994. [Coc64] Cochran, William Gemmell. Sampling techniques. New York, Wiley, 1963; [F+94] C. Faloutsos, W. Equitz, M.Flickner, W. Niblack, D.Petkovic, R. Barber. "Effi-cient and Effective Querying by Image Content" in Journal of Intelligent Information Systems, 3, 3/4, pp. 231-262, 1994. [FF91] B . V . Funt and G.D. Finlayson. Color constant color indexing. Simon Fraser University, School of Computing Science, Technical Report C S S / L C C R TR91-09, 1991. [Flu88] B . Flurry. Common Principal Components and Related Multivariate Models. John Wiley and Sons, 1988. [FMP93] Joseph M . Francos, Zvi Meiri and Boaz Porat. "A Unified Texture Model Based on a 2-D Wold-Like Decomposition, in IEEE transactions on signal processing, Vol. 41. No.8, pp.2665-2678, August 1993. [G+94] Yihong Gong, Hongjiang Zhang, H.C. Chuan and M . Sakauchi. "An Image Database System with Content Capturing and Fast Image Indexing Abilities" in Proceedings: Mulitimedia, 1994 International Conference, pp. 121-130, 1994. 92 Bibliography 93 [Gut84] Antonin Guttman. "R-Trees: A Dynamic index structure for Spatial Searching." in Proceedings: ACM SIGMOD, pp.47-57, 1984. [H+93] Kyoji Hirata, Yoshinori Hara, Naoki Shibata and Fusako Hirabayashi. "Media-based Navigation for Hypermedia Systems" in Proceedings: Hypertext '93, pp. 159-173, 1993. [Jai93] Ramesh Jain. "Workshop Report : NSF Workshop on Visual Information Man-agement Systems" in SIGMOD Record, Vol. 22, No. 3, pp. 57-75, September 1993. [KJ86] Thomas F. Knoll and Ramesh C. Jain. "Recognizing partially visible objects using . .feature indexed hypotheses", in IEEE journal of robotics and automation, Vol. RA-2 N o . l , pp. 3-13, March 1986. [LP95] F. Liu and R.W. Picard. Periodicity, directionality, and randomness: Wold fea-tures for image modeling and retrieval. M.I.T Media Lab. Technical Report No. 32., 1995 [MY88] Makoto Miyahara and Yasuhiro Yoshida. "Mathematical transform of (R,G,B) color data to Munsell (H,V,C) color data" in Visual Communications and Image Processing, SPIE Vol. 1001, pp. 650-657, 1988. [N+93] W. Niblack, R Barber, W. Equitz, M . Flickner, E . Glasman, D. Petkovic, P. Yanker, C. Faloutsos, G. Taubin. "The QBIC Project: Querying Images by Content Using Color, Texture, and Shape" in Proceedings of SPIE: Storage and Retrieval for Image and Video Database, pp. 173-187, Feb 1993. [Ore86] J . Orenstein. "Spatial Query Processing in an Object-Oriented Database Sys-tem" in Proceedings: ACM SIGMOD, pp. 326-336, 1986. [Per78] W. A . Perkins, "A model-based vision system for industrial parts". in IEEE trans-actions on computers, VOL C-27, pp.126-143, Februrary, 1978. [PL94] A . Pope, and D. Lowe. "Vista: A Software Environment for Computer Vision Research." in Proceedings 1994 IEEE Computer Society Conference, Seatle, WA, pp. 768-772, 1994. [PPS94] A . Pentland, R .W. Picard, S. Sclaroff. "Photobook: Tools for Content-Based Manipulation of Image Databases" . in Proceedings of SPIE : Storage and Retrieval for Image and Video Database II, vol. 2185, pp. 34-47, 1994. [Rob81] John T. Robinson. "The K-D-B-Tree: A Search Structure for Large Multidi-mensional Dynamic Indexes." in Proceedings: ACM SIGMOD, pp. 10-18,-1991. Bibliography 94 [Sam90] Hanan Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, New York, N Y , 1990. , . [SB90] Michael J . Swain and Dana Ha. Ballard. "Indexing Via Color Histograms" in Proceedings: Computer Vision 1990, pp. 390-393, 1990 [SB91] Michael J . Swain and Dana Ha. Ballard. "Color Indexing" in International Jour-nal of Computer Vision, 7:1, pp. 11-32, 1991 . [Sed95] A . Sedighian. A Comparative Analysis of Multi-Dimensional Indexing Structures for eigenimages. Thesis 1996. [SH93] H . Sawhney and J . Hafner. Efficient Color Histogram Indexing. I B M , Technical Report R J 9572, 1993. [SH94] H . Sawhney and J . Hafner. "Efficient Color Histogram Indexing" in Proceedings: International Conference on Image Processing, 1994. [SKB94] Michael J. Swain, Roger E. Khan and Dana Ha. Ballard. "Low Resolution Cues for Guiding Saccadic Eye Movements" in Computer Vision and Pattern Recognition, pp. 737-740, 1992. [SM92] Fridtjof Stein and Gerrad Medioni. "Structural indexing: efficient 2-D object recongition". in IEEE transactions on pattern analysis and machine intelligence, Vol 14 No. 12 1992. [SRF87] Timos Sellis, Nick Roussopoulos, and Christos Faloutsos. "The R+-tree: A Dynamic Index for Multi-dimensional objects." in Proceedings of the 13th VLDB Conference, pp. 507-518, 1987. [SS94] Markus Strieker and Michael Swain. "The capacity of color histogram Indexing." in Computer vision and pattern recognition, pp. 704-708, 1994. [SSW92] Carl-Erik Sarndal, Bengt Swensson and Jan Wretman. Model assisted survey ' sampling. New York, Springer-Verlag, 1992. [TP91] Matthew A . Turk and Alex P. Pentland. "Face Recognition Using Eigenfaces". in Computer Vision and Pattern Recognition, pp. 586-591, 1991. [Yao77] S. Yao. "Approximating Block Accesses inDatabase Organizations" in Commu-nications of the ACM Vol. 20, No. 4, pp260-261, 1977.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- An analysis of multi-level filtering for high dimensional...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
An analysis of multi-level filtering for high dimensional image data Tam, Dominic Pok Man 1996
pdf
Page Metadata
Item Metadata
Title | An analysis of multi-level filtering for high dimensional image data |
Creator |
Tam, Dominic Pok Man |
Date Issued | 1996 |
Description | Image database systems are very useful in many applications. To design an effective image
database system, high dimensional image feature vectors have to be extracted from the
images automatically. Each comparison between them tends to be expensive, so sequential
comparisons are usually impractical. Moreover, the. traditional multi-dimensional
indexing structures are incapable of handling these high-dimensional vectors efficiently.
Thus, it has been proposed to abstract lower dimensional k-D vector from the original
N-D feature vector, where k |
Extent | 5532572 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-02-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051429 |
URI | http://hdl.handle.net/2429/4435 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1996-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1996-0281.pdf [ 5.28MB ]
- Metadata
- JSON: 831-1.0051429.json
- JSON-LD: 831-1.0051429-ld.json
- RDF/XML (Pretty): 831-1.0051429-rdf.xml
- RDF/JSON: 831-1.0051429-rdf.json
- Turtle: 831-1.0051429-turtle.txt
- N-Triples: 831-1.0051429-rdf-ntriples.txt
- Original Record: 831-1.0051429-source.json
- Full Text
- 831-1.0051429-fulltext.txt
- Citation
- 831-1.0051429.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051429/manifest