A N ANALYSIS OF MULTI-LEVEL FILTERING FOR HIGH DIMENSIONAL IMAGE DATA By Dominic Pok Man Tarn <, B. Sc. (Honour) Simon Fraser University , 1 9 9 4 A T H E S I S S U B M I T T E D I N P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F M A S T E R O F S C I E N C E in T H E F A C U L T Y O F G R A D U A T E S T U D I E S C O M P U T E R S C I E N C E We accept this thesis as conforming to the required standard, T H E U N I V E R S I T Y O F B R I T I S H C O L U M B I A Apri l , 1996 © Dominic Pok Man Tarn, 1996 In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I. agree .that the Library'shall make it freely available for reference and study. I further agree that permission "for. extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her -representatives. It is understood ;that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Computer Science . •" The University of British Columbia 2366 Main Mail < • : Vancouver, Canada . V6T 1Z4 " Date: '.m'iL 25,yiffi Abstract Image database systems are very useful in many applications. To design an effective image database system, high dimensional image feature vectors have to be extracted from the images automatically. Each comparison between them tends to be expensive, so sequen-tial comparisons are usually impractical. Moreover, the. traditional multi-dimensional indexing structures are incapable of handling these high-dimensional vectors efficiently. Thus, it has been proposed to abstract lower dimensional k-D vector from the original N-D feature vector, where k Gavg = T T r Y . G(P)l B ^ 9 = TTr lZ B{P)l V V p=l V V P =l VV p=l' and W is the number of pixels in the image and R(p), G(p), B(p) are the red, green and blue values of the pixel p. Chapter 2. Related Works 16 Having computed the Qavg and I a v g for the query and every image respectively in the database, the filtering can be done by finding the coarse-level answer set Xavg. Iavg — {Iavg \ ds(Qavg, I a v g ) ^ Davg} (^'^) where d3(Qavg, I a v g ) = { Q a v g - IavgY(Qavg - I avg), and it is an efficient L 2 -norm distance function. The Faloutsos approach also derives a formula of Davg from D so that the completeness requirement can be satisfied. The final answer set IN = {T\dN,A{Q, I) < D} can now be much more efficiently computed based only on the small coarse-level answer set lavgi rather than on all the images in the database. Experiment results show that the Faloutsos approach saves huge amount of searching time when compared with the approach without filtering. The Faloutsos approach fixes the coarse filtering dimension to be 3, but Sawhney and Hafner approach generalizes the dimensionality of coarse filter to be any dimension k. In particular, this approach looks for a optimal transformation matrix Ck of dimension k x N that approximates I ( i.e. Ik = Ckl). This Ik can then be used as the abstracted low-dimensional vector Iabs in the coarse filtering stage. The coarse-level answer set is: IK = {Ik\dist{QkJk) < D k ) (2.6) The threshold distance Dk is simply equal to the original distance D. This transfor-mation Ck is optimal in the sense that it minimizes the maximum difference between the true distance cijv,^(Q,/) and the approximated distance dk(Qk,h)- By using Singular Value Decomposition (SVD), the key result in [SH93, SH94] is the theorem proving that the optimal transformation Ck is given by: Chapter 2. Related Works 17 Ck = y/A~kBk (2.7) where matrix Bk of dimension k x N is composed of the first k eigenvectors (i.e. rows) of B (as shown in Equation(2.4)), and Ak of dimension kxk is a diagonal matrix whose diagonal entries are the first k diagonal entries of A. Please note that as different symbols are used, the set Xavg from the Faloutsos ap-proach is not the same as X 3 from the Sawhney and Hafner approach with k — 3, even though the vectors of size 3 are used in both cases. Nonetheless, experiment results in [SH93, SH94] indicate that the behavior and performance of the Faloutsos approach is very close to those of the Sawhney and Hafner approach with k equal to 3 or 4. Thus, later on in the analysis and experimentation, whenever appropriate the Faloutsos approach is represented as the Sawheny and Hafner approach with k=4. 2.2.2 Shape and Texture Shape Shape features generally fall into two categories, namely global and local features. Global shape features are those which depend on the object as a whole. It is usually to describe the object with features which are invariant to translation, rotation and possibly scaling [KJ86]. Examples of global features include area, perimeter, number of holes, areas of holes, maximum and minimum radii from objects' centroids, algebraic moment invariants and normalized Fourier descriptor. In contrast, local features are those which depend only on a subset of the object. Examples are straight lines, circular arcs, and segments which are grouped together [SM92]. One of the main problems of global shape features is that it may not work well in a cluttered environment. For examples, the number of holes may be less when one object covers another. Therefore, global features should not Chapter 2. Related Works 18 be used in situation where the object may be partially visible or occluded[KJ86]. In such situation, local features are preferred. In QBIC, 20-D feature vectors I are used to store shape features of the images [F+94]. The 2-level filtering approach is also used here. It applies the technique of Karhunen Loeve Transform (KLT) on the 20-D I and truncates the resulting vector to a lower dimensional k-D vector Ik- Actually, K L T is also the same as principal component analysis which will be described in more details in Section 2.2.3. The experimental results show that it is sufficient for k to be 2. Texture In QBIC [N+93], texture features are composed of three components, namely coarseness, contrast and directionality. They describe its scale, vividness and whether the texture is directional and isotropic respectively. The texture photobook [PPS94] uses 2-D Wold-like decomposition to obtain the texture vectors. The texture field is decomposed into deterministic (or structural) and indeterministic (or stochastic) components which are orthogonal to each other. The deterministic component, in turn, is composed of half-plane deterministic (or harmonic) and generalized evanescent components. The details can be referred to [FMP93]. Alexandrov et al. [A+95] proposes another method to index texture. The Garbor filter is famous for its characteristic of optimality in localizing both space and frequency simultaneously. This filter can be modelled in any specified scale and orientation. It is this property to index texture. A family of Garbor filters of S different scales and O different orientations can be applied to any given texture image to output the required texture feature / . The texture vector I which is used in the experiment of [A+95] can be as high as 240-D. Alexandrov et al. then uses filtering to improve the performance. The low-dimensional abstracted vector Iabs developed by Alexandrov et al. is just a two Chapter 2. Related Works 19 dimensional vector J 2 . This abstracted vector J 2 is the output of the single filter which has the best discriminating power among the whole family of Garbor filters. Although all these image features may have different characteristics and applications, most of them share a common drawback. The vectors of these image features are usually very high dimensional. This imposes a serious problem for indexing. 2-level and multi-level filtering are designed to solve this problem. 2.2.3 Eigenimages Properties and its Indexing Explicit image features such as colour, texture and shape are those which can easily be observed and extracted by human perception. However, they may not be the optimal features for recognition. This brings up the idea of implicit image features. They are features which are automatically extracted by the computer and the best for efficient recognition. The extraction is done by a technique called Principal Component Anal-ysis (PCA) . The feature vectors in the real world are typically correlated among the dimensions. In other words, there exists redundancies in the storage of the data vector. The principle of P C A is to decorrelate these data as much as possible so as to reduce the redundancies to the minimal amount. The decorrelation can be visualized as the transformation of the multi-dimensional axes such that most of the variations of the data vectors are captured on few axes, and little or no variation of the data exhibits on the remaining axes. Therefore, the transformed data on those few axes can capture most of the information. In other words, P C A can analyse the set of images in the database and create the transformation matrix. The transformation gives a set,of N principle components (PCs) Chapter 2. Related Works 20 for each image.1 These PCs (also known as eigenimages in this application) are arranged in the descending order of eigenvalues so that the first few components can capture most of the feature information. Because these=PCs do not usually correspond to the features which can be easily recognized by human beings such as eyes, mouth, etc, they are called implicit features. The full set of these N PCs can be grouped into an N-D vector / . Comparing these image vectors I with the corresponding query vector Q at a threshold distance D gives the required answer set IN- The exact dimension of / depends on the size of the training images set. However, in most cases, this dimension can be well over 100. Therefore, direct comparison is unacceptable for its poor performance because of the high dimensionality. In [Sed95], Sedighian applies the 2-level filtering approach to solve the searching problem of high dimensionality. 2.3 Multi-dimensional Indexing Structure In traditional database systems, one of the most often used techniques to speed up the C P U and I/O performance is indexing. An advantage of indexing for query answering is to avoid accessing every single data item in the database system, but only a small fraction of them. There are different efficient indexing structures for traditional database systems. Binary search trees, B-trees and B +-trees are a few examples. Recall from 7 the previous sections that the image feature vectors are inherently high dimensional, so these indexing traditional indexing structures are insufficient. sEven though there are methods (e.g. Z-ordering [Ore86] ) to transform a K-D space onto a 1-D space using the technique of space filling curves, distance preserving is not guaranteed. In other words, points which are close to each other in K-D space may not be close in 1-D transformed space. These methods are therefore not desirable in our application. 1In fact, SVD on the covariance matrix gives the PCs. Chapter 2. Related Works 21 Multi-dimensional indexing structure such as K-D trees [Ben75], K-D-B trees [Rob81], R-trees [Gut84] and R+-trees[SRF87] and R*-trees [B+90]are developed for indexing the multi-dimensional vectors. The following describes these structures briefly. 2.3.1 K - D trees The structure and operations of K-D trees [Ben75] are basically similar to those of the l -D binary search tree. In a l -D binary search trees, there is only one key available to discriminate whether to traverse left or right during searching. However, in a K-D tree, this discriminating key varies with the level of traversal during searching. In general, the discriminating key alternates one after another among all the available K. keys. A node structure can be represented as (keyo, keyi,keyx-i)- The discriminating key of the i-ih level of the tree is the keyimo(iK- Moreover, it may need to traverse through multiple paths. It can be seen that the structure and the efficiency depend heavily on the data set and the sequence of the insertions. In the worst case, it can be degenerated to a linearly linked list. Adaptive K-D tree is proposed to tackle this problem [Sam90]. Instead of alternating among keys, the discriminating key of an adaptive K-D tree at a node is the one with the greatest range over the sub-tree rooted at that node. Therefore, it is possible that the discriminating keys for level i and i + 1 are the same. 2.3.2 K - D - B trees There are two important differences between K-D-B trees and K-D trees [Rob81]. First, the non-leaf nodes of a K-D-B tree do not contain the data points. Instead, they contain - the rectangles bounding all the data points in the leaf nodes of their subtrees. Second, the original K-D tree is not specifically designed for huge databases where all the indices cannot be fitted in the main memory. If a static environment is assumed where all the data is known before insertion and little or no modification will be made, Chapter 2. Related Works 22 the K-D tree can still be applicable. In this case, the tree is first built statically and then mapped onto the pages. However, in the dynamic environment of a huge database, K-D-B tree is designed to handle this case where it is necessary to store most of the indices in disk while it is being used [Rob81] [Gut84]. Each node of K - D - B tree is stored as a single page in disk so that efficient use can be made of a disk with paging. The above two multi-dimensional indexing structures are good for point data objects. In some other applications, point data objects may not be sufficient. Examples include spatial data objects in geographical information systems. Still, there are methods for converting spatial data objects to point data. One way is to transform a K - D rectangles to two K-D points. Take a 2-D rectangle with diagonal vertices (1,2) and (3,4) as an example. A possible corresponding 4-D point to represent this 2-D rectangle can be (1,2,3,4). However, higher dimensionality space may lead to poorer performance. In order to solve this problem, the following three multi-dimensional indexing structures are designed to handle spatial data objects. 2.3.3 R-trees and its variants In order to handle any irregular K-D spatial data object, one approximation is to represent the key of a node as K-D rectangle which bounds the object. This K-D rectangle key I can then be represented by K number of closed intervals. -, / = (Jo, h, • • •, ^A'-l) where Ii — [a,-, &;], with at- and b{ representing the lower and upper limit of the range in the i-th dimension of the rectangle. Each page can only store a singe node, which, in turn, can store a certain maximum number of rectangle keys. At the point when an extra rectangle is being added to a full node, that node has to be split into two. The rectangles which are in the original Chapter 2. Related Works 23 overflown node are then needed to be re-distributed into the two newly created nodes. The three members of R-tree family, namely R-tree, R +-tree, R*-tree differ in the ways of redistribution. The differences in the insertion algorithms between them affect the characteristics of the rectangles in the non-leaf nodes, and therefore the tree structures as well as the searching algorithms. In an R-tree, the rectangle key of each non-leaf node completely encloses all the rect-angles of the nodes at its subtree. During searching, if the query rectangle key overlaps with several rectangles of a certain non-leaf node, all the nodes of these overlapped rect-angles have to be traversed. Therefore, the efficiency of the searching depends on the overlapping area. Zero-overlapping area is thus the goal. Unfortunately, it is shown that R-trees can never achieve zero- overlapping area for keys of complete rectangles without splitting[SRF87]. In order to improve efficiency, R +-tree is designed to have zero-overlapping area by splitting the rectangle keys. The time of searching can then be reduced because it is less likely to follow the paths which do not lead to correct results. The R*-tree [B+90] enhances the searching performance from a more comprehensive approach. It also argues that the original R-tree which is based on minimizing the coverage area is not optimal. Like an R +-tree, the fe-distribution heuristic of R*-tree is based on coverage area and overlap area. In addition, it also considers two other optimization heuristics. The first one is to minimize the margin of the rectangle (i.e. sum of the edges length). The idea of this heuristic is as follows. Suppose the area is kept the same. When the margin is shorter, the rectangle is also more quadratic (i.e. it is closer to a square). Since the packaging of the square is better, the non-leaf rectangles can also be smaller. The goal of the second heuristic is to shorten the height of the tree, and therefore minimize the query cost. By combining all these criteria, R*-tree shows very encouraging experimental results compared with the original R-tree. Even though the above indexing structures can support multi-dimensional data, none Chapter 2. Related Works 24 of them can handle data efficiently with dimension over 20. Therefore, they cannot store the high dimensional image feature vectors directly. The aim of the 2-level filtering is to provide a coarse level filter where k-D vectors are abstracted from,the original high dimensional image features vectors. The dimensions of these A;-D vectors are usually so small that these vectors can fit in the indexing structure. Still, the coarse filter of cannot be too small. Otherwise, this may leave too many images for the expensive fine-level filtering. At the same time, the dimension k cannot be too large because of the "dimensionality curse". Therefore, the design goal of multi-level filtering is to solve this dilemma., Chapter 3 Cost Models of Filtering 3-level filtering is one of the major contributions of this thesis. The beginning of this chapter describes its details. The remaining sections then present the cost models which can help analysing and comparing 2-level, 3-level and multi-level filterings. 3.1 3-level Filtering As previewed in Section 1.4.2, the goal of the 3-level filtering is to solve the inherent problem of 2-level filtering. In other words, 3-level filtering can detach the dimension-ality of the indexing structure from the number of full high dimensional feature vector comparisons by adding an intermediate level of filtering. Suppose the N-D feature vector of the images in the database and the query are extracted to be I and Q respectively. / are then abstracted to much lower dimensional vectors Ik and /; by either P C A or SVD depending on particular application, where I < k <^ N. The dimension of 7; is small f —* —* —* enough so that I\ can be stored efficiently in the indexing structure. Ik and I can be stored in two separate flat files. Q are also abstracted to Qk and Q\ similarly. The three levels of filtering are done in following steps. 1. The coarsest level answer set X; is obtained by searching through the indexing structure with the query Q\ Xi = {T,\disti(Qi,Ii) < Di} 25 (3.8) Chapter 3. Cost Models of Filtering 26 2. 1\ is subset of the original database. Only the corresponding vectors Ik of this small subset are compared with Qk to return the second level answer set lk-lk = {Ik\distk{QkJk) N). When a4 and ft are small, CPU(4>N) is much smaller than CPU^yTherefore, this can explain the huge saving of Faloutsos approach when compared with the approach without filtering. 4.3 Analysing the configuration (A;, TV) As k increases, the trend of 2-level filtering cost is not as trivial as the analysis discussed above. The trend of the I/O cost in Equation(3.14) is determined by two "forces" going in opposite directions. • Observation 1 Pt * (1 - (1 - l/Pt)ak*M) decreases as k increases. Chapter 4. Trend Analysis 36 First, as k increases, it is known that ak decreases because less information is lost —* —* when I is truncated to Ik. Furthermore, all the other variables in Pt * (1 — (1 — l/Pt)a**M) are independent on k. Thus, the second term Pt*(l-(l-l/Pt)ak*M) of I/0(k,N) ( Equation( 3.14) ) decreases as k increases. This is shown in Figure 4.2(a). • O b s e r v a t i o n 2 Index-I/0(M, k,/3k,P) increases as k increases. When k increases, the size of each vector increases, as does the number of pages needed to retrieve the same number of vectors. More importantly, as the dimen-sion of the index increases, the number of page accesses grows at least linearly, if not exponentially, due to the effect of the so-called "dimensionality curse." This -outweighs the fact that (3k decreases as k increases. Consequently, the first term Index-I/0(M, k, (3k, P) increases as k increases. This is also shown in Figure 4.2(a). To understand the combined effect of these two opposite terms, the following key observation can be made: the rate of decrease of ak reduces as k increases. This is because Ik' corresponds to the k highest eigenvalues and eigenvectors of the similarity matrix. In other words, as A; increments by one, the extra dimension added corresponds to an eigenvalue that is less than all the k eigenvalues that have already been included. Thus, the impact of adding this extra dimension, though non-zero, is not as great as the impact of adding any of the k previous dimensions. As a consequence of this observation, the combined result of the above two opposite terms depends heavily on the ratio ak/a^+i • For small values of k, the ratio is significantly larger than 1. In this k increases, the decrease in the term Pt * (1 - (1 - 1 /Pt)°k*M) more than offsets the increase in the term Index J/0(M, k, /3k, P). This can be seen as the steep slope of the second term in Figure 4.2(a). The end result is a reduction in I/O cost, i.e., I/0^k,N) > 1/0(k+i,N)- However, for larger values of A:, the ratio cck/ctk+i approaches 1. Then, the decrease in the second term of Equation (3.14) is dominated by Chapter 4. Trend Analysis 37 (a) T rend of I/0{k>N) (b) Trend of CPU{k 4 ) can cost less than that of the Faloutsos approach. 4.4 3-level vs 2-level filtering: (4, k, N) vs (k, N) at same k To answer the question of whether 3-level filtering leads to reduction in 1/0 cost when compared with 2-level, Equation (3.16) is compared with Equation (3.14). Since the last terms in both equations are the same, the difference between I/0^,N) a n d 1/0(4,fc,./v) is A i — A 2 , where: ' • A i = Index J/0(M,k,pk,P) - Index J/0(M, 4, ft,P), and • A 2 = Ptk*(l~(l-l/Ptkr**M). The trend of A 2 as k increases is first considered. Within A 2 , only Ptk depends on k, as 0 : 4 * M is constant with respect to k. When k increases, Ptk increases, but (1 — (1 - l/Ptk)a**M) decreases. The combined effect on A 2 can be analysed by investigating their respective relative rates of increasing and decreasing. If k increases by one, Ptk increases by Ptk+i - Ptk = AM/P. However, the term 1 - (1 - l/Ptk)ai*M decreases as k increases. Initially, when k is small, Ptk is small, and 1 — 1/Ptk is not too close to 1. The latter, in turn, causes 1 — (1 — 1 / P £ L ) Q 4 * m to be close to 1. This gives a rate of increase of A 2 close to 4M/P. As k increases, 1 — 1/Ptk becomes closer to 1, which causes 1 — (1 — 1/Pt k) Q i*M to become smaller. The consequence is that the rate of increase of A2 becomes much smaller. That is to say, the growth of A 2 flattens off at some point of k. The exact value of this k depends on the value of a4 * M. The larger 0:4 * M is, the. larger value this k is. In any event, the rate of increase of A 2 at any point cannot exceed AM I P. Observation 8 A 2 increases and eventually flattens off when k increases. Chapter 4. Trend Analysis 40 / (a) Case 1 (b) Case 2 Figure 4.3: Trends of A x , A 2 and A i - A2 The analysis of A i — A 2 can be done as follows. There are two cases. In the first case, if the rate of increase of Index JjO[M, k, fa, P) with respect to k is greater than AM/P, then A i — A 2 increases as k increases. That is to say, the larger the value of k, the larger is the difference between I/0(K,N) and I/O^^N) (Figure 4.3(a)). In the second case, the rate of increase of Index J./0(M,k, fa, P) with respect to k is less than AM/P. Then the growth of A i may only catch up with the growth of A 2 after the point k&2, when the growth of A 2 begins to flatten off. In other words, after this point, the larger the value of k, the larger A j — A 2 is (Figure 4.3(b)). Furthermore, when k = 4, A i - A 2 is equal to 0 - Pt4 * (1 - (1 - l/Pt4)ai*M), which is a very small negative number, because Pt4 is a small number. Then combined with the analysis described in the previous paragraph, A i — A 2 is expected to be positive even for rather small values of k (e.g., 6). And as argued above, as k increases, the value of Chapter 4. Trend Analysis 41 A i — A 2 is expected to become more and more positive, indicating that the cost I/0(k,N) of the 2-level filtering approach is getting higher and higher than the cost I/0(4tk,N) of 3-level filtering. As for the C P U cost, Equation (3.15) can be compared with Equation (3.12). Since the last terms in both equations are the same, the difference between CPU(k,N) and CPU(4ik,N) is A 3 - A 4 , where: • A 3 = IndexJCPU{M,k,fa,P)- IndexJCPU(M, 4, fa,P), and • A4 ' = M * (34 * Tdi + M * OL4 * Tdk - M * fa * Tdk. Similar to the above analysis of I/O cost, the larger the value of k, the larger is the value of A 3 . A 4 also grows as k increases, because of the growth in Tdk. However, Tdk is typically much smaller than Index-CPU(M, k, fa, P). Hence, A 3 dominate A 4 . In other words, CPU(k,N) is eventually greater than CPU(4tk,N)- ' The above analysis establishes that for both I/O and C P U costs, the 3-level filtering approach becomes more and more efficient than, the 2-level filtering approach for the same value oi k. 4.5 3-level vs 2-level filtering: (4,k,N)' vs (k,N) at their optimal k The next and most important question then is whether the 3-level filtering approach still outperforms the 2-level filtering when both approaches are at their respective optimal k values. The k value that gives the minimum total cost of the 3-level filtering approach can be denoted as & ^ | 3 , and as before, the minimum k value of the 2-level filtering is denoted as k^1. To determine the relationship between k^l3 and k^1, it is helpful to first examine the trend of C P U and I/O costs of the 3-level filtering approach. Among the four terms of CPU(4tk,N) m Equation(3.15), the first two terms, Index .CPU'(JVf, 4, fa, P) and M * fa * Tdi in are independent of k. On the one hand, Chapter 4. Trend Analysis 42 (a) Trend of I/0{4,k,N) (b) Trend of CPU{4tk,N) Figure 4.4: Costs of 3-level filtering with respect to k the third term, M * ct4 * Tdk increases with k. On the other hand, the Observation 6 of Section 4.3 indicates that the last term M * ak * TdN decreases with k. Because TjN has a dominating effect, CPU(4ik,N) w m follow the trend of its last term to decrease when k increases, at least when k is not too large. Their trends are shown in Figure 4.4(b). As compared with the trends shown in Figure 4.2(b), this trend is different from that of>CPU{k,N) since the total increase of Index JOPU(M, k,flk,P) and M * (3k * Tdk in CPU(k,N) is much more significant than the increase of M * a4 * Tdk in in CPU{4ik,N)-Observation 9 CPU^k,N) decreases when k increases, until k becomes very large in which case CPU(4ik,N) increases very slightly. As for the three terms of I/0^k,N) m Equation (3.16), the first term, IndexJlO(M,4.,f34,P) is constant with respect to k. The second term, Ptk * (1 — (1 — l/Ptk)ai*M) as indicated in Observation 8 of Section 4.4, grows as k increases,, but Chapter 4. Trend Analysis 43 flattens off eventually. The third term, Pt* (1 - (1 - l/Pt)ak*M) as also indicated in Observation 1 of Section 4.2, decreases.as k increases, but also flattens off eventually. The combined effect is that when k is small and ctk/ctk+i is not too close to 1, the decrease in the third term dominates the increase in the second term. This is because ctk is an exponent in the third term, whereas Ptk iri the second term is not. In other words, the cost I/0(4tk,N) is decreasing. Now as k increases, both the increase of the second term and the decrease of the third term flatten off. This translates to a very gradual decrease in I/0(4tk,N)- Eventually, when ctk/ctk+i is really close to 1, the second term finally outgains the third term, reversing the downward trend to an upward one. Because the growth of the second term itself flattens off, it is expected that the upward trend occurs only with large values of k, making the trend of I/0(4^,N) resemble a very wide U-shape as shown in Figure 4.4(a). In particular, I/0(4tk,N) is expected to reach its minimum beyond the point at which I/0(k,N) reaches its minimum. This is the case precisely because the 3-level filtering approach avoids using higher-dimensional indices. Observation 10 I/0(4,k,N) resembles a wider U-shape than I/0(k,N) does when k in-creases. By combining the two effects from Observations 9 and 10, it can be expected that total cost of 3-level filtering exhibits a U-shape wider than that of 2-level. In other words, the total cost of 3-level filtering will reach its minimum at a point beyond that of 2-level, i.e. k^1 < k ^ j 3 . The three graphs of Figure 4.5 show the three possible scenarios where k1^ < &^*£J3. Moreover, they also show the wide U-shaped curves of 3-level filtering total cost when compared with those of 2-level. In fact, the finding of kmin ^ ^min,3 l s precisely the key aspect that motivates the development of multi-level filtering as discussed in Section 1.4.2 (b) Scenario 2 (c) Scenario 3 Figure 4.5: Total cost of 3-level and 2-level filtering Chapter 4. Trend Analysis 45 Finally, some light can be shed on the relative performance of the 3-level and the 2-level filtering approaches when both approaches are at their respective optimal k values. It is already shown in the analysis of Section 4.4 that starting from some values of ko, the total cost CPU(k,N) + I/0(k,N) * dr of the 2-level filtering approach is expected to be greater than the total cost C PU(4ikiN) + I/0(4tk,N) * dr of 3-level filtering, where dr denotes the average time for reading one page from disk. There are now two cases. In the first case, the optimal k^j*1 of the 2-level approach exceeds k0. This is shown in Scenario 1 of Figure 4.5. Then it is immediately known that: CPU I [.total N \ + I/O/ ktotal N\* dr > CPU 14 ktotal N ) + I/'0 (4 fctotal N) * dr (4.19) \ iritn ' / \ min ' ' > ' min ' ' \ ' min * / Obviously, as A^*£'3 is the optimal k value of 3-level filtering, it is known by definition that the right-hand-side of the above inequality must be greater than the minimum total cost of CPU 14 ktotai M\+ I/O14 ktotal N) * dr. But a more revealing way to establish the \ mtn,3' . / V ' mm,3 * / same inference is to make use of the fact that k1^^ < A^'£j 3. The trends of CPU^k,N) and I/0(4ik,N) dictate that: C PU(4< ktotal^ ft} + 7/0^4^ tfotal^ * dr . > CPU(4ktotal^±[y)~^~I/0(4,ktota,+-l,N)*dr > ... >. CPU{4ik^iN) + r / 0 { 4 i k ^ t N ) * d r (4.20) Hence, by combining with Equation (4.19), the following is established: CPU{ fcjoj-i, N ) + 110{ fcjo,.«i N ) * dr > CPU{4t k ^ 3 , N ) + I/0{4t k^3,N) * dr(4.2l) That is to say, when the two approaches reach their respective minimums, the total cost of the 3-level filtering approach is smaller than that of the 2-level. This is shown in the scenario 1 of Figure 4.5. The above analysis only deals with the case when the optimal k^j®1 of the 2-level filtering approach exceeds k0. In the second case when A;^'"' is less than A:0, the comparison Chapter 4. Trend Analysis 46 is not as definite as in the first case. As k1^ < ko, the inequality sign in Equation (4.19) has to be reversed to give: CPU(ktotai iNj + I/O(ktotai j Ar\. * dr < C P U k t o t - a l , N) ^ 110 ( 4 , ' k t o t - a l , N) * dr (4.22) Combining this equation with Equation (4.20), however, does not give Equation (4.21). Thus, in this case, nothing definite can be concluded about which minimum value is smaller than the other one. Both possible cases are shown in Scenario 2 and 3 of Figure 4.5. Nonetheless, when this happens, it is expected that both minimum values are close to each other. Moreover, from experience, the critical value ko is typically quite small. For instance, for both experiments using random colour histograms in Section 5.3.1 and eigenfaces in Section 5.3.2, k0 and k**£ are around 6 and 12 respectively. Small ko values imply that the second case occurs much more rarely than the first case. Thus, the first case and Equation (4.21) are expected to hold most of the times. 4.6 4-level vs 3-level filtering: (4, kx,k2,N) vs (4, k, N) The I/O and C P U costs of u-level filtering (Equation(3.17) and 3.18) can be instantiated to those of 4-level filtering (4, k\, k2, N) as follows. CPU{4Mtk2tN) = Index JJPU{M^ fa, P) + M * fa * Tdi + M * a4 * Tdkj + M * akl * Tdk2 + M * ak2 * TdN (4.23) IIO{4MM,N) = IndexJ/0(M,4,fa,P) + Ptkl * (1 - (1 l/Ptkl)a**M) + Ptk2 * (1 - (1 - l/Ptk2)a*i*M) + Pt * (1 - (1 - l /Pt) a *2*^Ql.24) Chapter 4. Trend Analysis 47 When ki = k, the difference between CPU^^^N) and CPU^k,N)i as given by Equations(4.23) and (3.15), is the difference between M*akl *Tdk2 and M*(ak—ak2)*TdN. For small values of k (or ki), (ak — ctk2) is significant, causing M * (a:*; — Qk2) * TdN to dominate M * akl *Td^, and thus forcing C PU^kl,k2,N) to be less than CPU^k.N)- But the converse is true for medium to large values of k. Similarly, a comparison between Equations (4.24) and (3.16) reveals that 1/Oi^^^N) and I/O^^N) follow the same trend as their C P U counterparts. Thus, as far as total cost is concerned, it is expected that a 4-level structure could outperform a corresponding 3-level structure for small values of k, but the converse is true for larger values of k. Given that the optimal value A^'£' 3 of the 3-level filtering approach is typically not small, the latter statement implies that at that k value, the 3-level structure may be more efficient than the corresponding 4-level structure. But what about comparing the optimum of 4-level filtering with the optimum of 3-level filtering? Unfortunately, no analytic conclusion can be drawn that applies to most situations. However, the experimental results in the next chapter do indicate that while possible, it is not often that the optimal 4-level filtering can outperform the optimal 3- level filtering. And in situations where it does, the total cost difference between the 4- level optimum and the 3-level optimum is small. The cost models of C P U and I/O cost for even higher level, such as 5-level and 6-level can be instantiated similarly from the cost models of u-level filtering. However ^ as discussed in the previous paragraph, the limit seems to have already been reached by the 3-level, and occasionally 4-level filtering, there is little reason to consider filtering higher than four levels. This is because the extra storage space may outweigh the little gain in performance. In this section, it is shown that 2-level filtering (k, N) usually costs less than (4, N) when k > 4. Furthermore, the trend analysis also provides strong evidence that the optimal 3-level filtering (4, k, N) can also outperform the optimal 2-level filtering (k, N) in Chapter 4. Trend Analysis 48 most cases. In the next chapter, these findings are further confirmed by the experimental results. Chapter 5 Experimental Evaluation 5.1 Introduction r The trend analysis in Chapter 4 provides strong evidence that 3-level filtering can outper-form 2-level filtering in most cases. The goal of this chapter is to evaluate this analytical finding with experiments. Two different testbeds are chosen. One is a colour histogram which is an explicit image feature, while the other is an eigenface which is implicit, as discussed in Section 2.2.3. A colour histogram is chosen because it is a fairly popular feature among different image database systems. For example, one of the QBIC recognition techniques is based on colour histogram [F+94]. Likewise, Gong et al. also use colour histograms for their image database systems [G+94]. As discussed in Section 1.3, colour histogram is an example of high-dimensional image feature vector. The higher the dimension, the better the accuracy provided that all conditions are kept unchanged. For example, Faloutsos et al. uses as many as 256 bins for the dimension of the colour histogram [F+94] . Even higher dimensionality may be needed depending on the needs of the applications. Another testbed is eigenface [TP91]. It is an example of high-dimensional vector because the maximum number of dimension generated can be as high as the number of training images less one. This is certainly a considerable number. If the facial image database consists of five thousands facial images, and only one-tenth of them are used for training, the maximum dimension can be as high as 499. 49 Chapter 5. Experimental Evaluation 50 Using these two testbeds, we are now going to see which k gives the optimal configu-ration of 2-level filtering, and how much optimal multi-level filtering can save compared with the 2-level one. Before any result is presented, the details of experiments are de-scribed next. 5.2 Details of Experiments As there are two different testbeds, all the experimental details which are applicable to both are first described. Afterwards, the details specific to each of these two testbeds are then discussed separately. 5.2.1 General Experimental Setup Simulation In general, M images are first collected for the database, and the corresponding high-—* dimensional feature vectors I are extracted. (The exception is random colour histograms which are generated. The details will be discussed in the Section 5.2.2.) For a particular dimension k, the coarse vectors Ik are computed by reducing / by either P C A or SVD, depending on thfe application. A l l these Ik are inserted into an indexing structure of page size P , and the vectors I are stored in flat file. A query image is chosen among the image database, and its high-dimensional feature vector Q is also extracted and reduced —* —* to Qk- This Qk can then be submitted as query for the indexing structure. The set Jk = {h I A j= i l?j — ? i l 5: D] is returned as the result of the range searching by the indexing structure, fa is then the ratio of the cardinality of Jk and M. Each of the members in Jk is the compared with Q to give Xk as defined by Equation 2.6. ak can be found to be the ratio of the cardinality of Xk and M. During the tree traversal, the num-ber of pages accessed and the C P U time spent are recorded as IndexJf/0(M, k, fa, P) Chapter 5. Experimental Evaluation 51 and IndexJCPU(M,k, (3k,P) respectively. In order to avoid the bias of any particular query, the above procedures are repeated with 50 different queries in total for a given k. Similarly, the above complete process is repeated with every even number of k from 4 to 32. After all these steps, all the parameters needed for calculation of the every cost model are available, except Tdk and disk access time which will described in the following sections. These parameters can then be plugged into the cost models of 2-level filtering and 3-level filtering for each k and 4-level filtering for each k and I. The optimal values can then be found. Hardware Configuration Buffer Size. The buffer size is kept at the minimum value of one page. Typically, the buffer size can affect the number of pages to be accessed, and thus the searching perfor-mance. However, it does not affect the results of the following experiments significantly because of the two reasons. First, if a larger buffer size is used for the tree-like indexing structure, the additional buffer is naturally allocated to store the nodes near the root of the tree in order to increase the likelihood of reducing page faults. Unless there is a huge amount of buffer space, the performance gain is insignificant. Second, the saving of ac-cessing the flat file with larger buffer space can be achieved by another technique without actually having the space. The addresses of all the pages which need to be accessed are first recorded and sorted in a list. A l l pages which needs to be multiply accessed appear only once in the list. Therefore, each page needs to be accessed at most once, in this case, an 1-page buffer is sufficient. Disk Access Time. The random disk access time of a page is assumed to be 10 msec. Chapter 5. Experimental Evaluation 52 Computation Time Tdk- It is the C P U time needed to measure the the distance of two k-D vectors using £ 2 -norm. TdA,Td32 and Tdsl2 are obtained by measuring the actual C P U time taken for computing the distance between two 4-D, 32-D and 512-D vectors respectively. The measurements are repeated a thousand times, and the results are averaged. They are found to be 7.3997, 48.8579 and 828.5132 microseconds respectively. The other required Tdk can be calculated by linearly interpolation between 4-D and 32-D. A l l C P U timings are measured using a Sun Sparc-10. Indexing Structure The selected indexing structure used in the experiments is R-tree. The code of the R-tree was developed by Dr. Christos Faloutsos and his students. 5.2.2 Colour Histogram Experimental Setup Similarity Matrix A As discussed in Section 2.2.1, both CIE L U V and Munsell H V C colour models are suitable for the computation of the entries in the colour similarity matrix A. CIE L U V model is chosen for this experiment because it is simpler in concept, easier to implement and sufficient for our experiments. The similarity matrix A can be computed as follows. Each bin of the colour histogram corresponds to a point in the R G B colour space. To compute the entry a 8j in the matrix A , their points in R G B space are transformed to CIE L U V space. The distance dij between them are measured in the CIE L U V space by Z 2-norm. It can finally be normalized to any value between 0 and 1 as follows. Since the number of bins chosen for the colour histograms is 512, the size of this matrix Chapter 5. Experimental Evaluation 53 is 512x512. According to the Equation(2.4), singular value decomposition is applied on this matrix A using Matlab. Preprocessing of full high-dimensional histograms It is known from Equation 2.3 that computing the distance between two colour histograms involving cross-talk effect is of time complexity 0(N2). Even though the cost models of the filtering are still applicable by replacing by A(Q,I) to be (Q — I)1 A(Q-I). Now by Equation (2.4), the latter term is identical to (Q — /)* (BTAB) (Q — I). Given that A is a diagonal matrix, BLAB is exactly the same as (Bty/A)(y/AB), or (y/ABY(y/AB). Thus, the term (Q - /)' A (Q - I), is equivalent to [(Q — I)*CN][CN{Q — I)}- Hence, it is necessary that dN,A{Q, I) is equivalent to dN(QN, IN), which, as defined in Equation (2.6), gives the L2-norm of QN and IN-Consequently, the IN can be stored for the final level of filtering. In this way, computing dN(QN, IN) costs much less time than dN,A{Q,I)-Chapter 5. Experimental Evaluation 54 Two sets of Colour Histograms There are two different types of colour histograms for the experiments. The first type is the colour histograms from real images. These images are collected from different sources. They include Smithsonian Institution images archive which consists mainly of exhibits images of museum, Hong Kong Picture Archive which consists of Hong Kong scenery and pop-stars.images, and some other landscape images. There are 1712 images in total. They cover a wide domain of different areas. After all these images are collected, they are cropped to remove the black margin, re-scaled to a fixed size of 128x128, and converted to Vista format [PL94]. The colour histogram of an image is built by incrementing the count of the bin which corresponds to the the R G B colour components of each pixel. After running through every pixel in the image, each bin count of every histogram can then be normalized to any value from 0 to 1. Another type is random colour histograms. Unlike the histograms from images, unlim-ited number of histograms can be generated easily without the restriction to the sources of testing images. Furthermore, the set of generated histograms is general enough without any possibility of bias to any particular domain. In [SB90], Swain and Ballard proposes a method to generate these random, but still realistic histograms. The procedure is as follows. 1. Randomly choose the number of non-empty bins Nne in histogram, where 3 < Nne < 512. 2. Randomly choose one of these Nne bins. Label it as bin i. 3. Increment the count of this bin i. 4. Repeat steps 2 to 3 Z times, where Z is the number of pixels in the image. Chapter 5. Experimental Evaluation 55 The colour histogram of a real image is typically clustered into a certain number of bins. The general idea of the above procedure is to simulate this clustering effect. 5.2.3 Eigenfaces Experimental Setup General Procedure The eigenfaces experiment basically follows the ideas used in [Sed95]. 400 grey-scale facial images were collected from the ftp site of Olivetti Research Laboratory. These images are taken from 40 different individuals. The sizes of all images are 112x92 pixels, and each pixel has 256 grey levels. Like the image testbed in the above section, these facial images are also stored in Vista format. The maximum number of principle components (i.e. 399) are extracted from these images using principle component analysis (PCA) . The full sets of all these principle components (PCs) are then the required full high-dimensional feature vectors. Some general ideas of P C A have been mentioned in Section 2.2.3, and the details of computing P C A will be discussed in the next section. However, the database of 400 images is too small for most practical purposes. A technique is devised ,in [Sed95] to increase the database to any size. The maximum and the minimum principle components at each dimension j of the above 400 images are recorded as maxj and mirij. A database of size M can be created by randomly generating M vectors where each j - t h dimension entry of every vector lies between maxj and mirij. This technique is to simulate the scenario that this new database is sampled to form a training set of these 400 facial images. Computing PCA In Section 2.2.3, it has been seen that the full set of N PCs are grouped into the vector I. In fact, these PCs are computed by applying a transformation matrix C onto / Chapter 5. Experimental Evaluation 56 [TP91, Sed95]. This matrix C can be found as follows. 1. Suppose there are M image feature vectors aft- of size RxR in the database and the training set consists of Ms sampled images. To ensure the rotation of axes is correctly done at the "center" of these images data, the axes have first to be translated to the average of all the feature vectors. The translated image data is called A l l of them can be combined to form a R2xMs matrix Y. 1 M s a = — J2 Xj yi = Xi - a Y = 2/1 2/2 • • • yiw, 2. The R2xR2 covariance matrix A is computed to measure how closely the dimensions of the data are correlated to each another. A = YYT 3. The eigenvectors 6tand corresponding eigenvalues A,- can then be computed for this covariance matrix A. Abi = Xibi 4. The eigenvectors bi is re-arranged according to the decreasing order of eigenvalues of Ai. 5. The required transformation matrix C for the P C A is then given by C - (^bx b2 ... bR2^ Chapter 5. Experimental Evaluation 57 The above procedure for computing the transformation matrix C is reasonable for data items where the dimension of the data vector R2 is small. However, there exists a serious practical problem in our case since R2 of even a small image can easily be over ten thousand. It is computationally infeasible to calculate the eigenvectors and eigenvalues of such a huge R2xR2 matrix A . To get around this problem, the eigenvectors b[ and eigenvalues A^ of a MsxMs matrix A' can be first calculated. A' = YTY The calculation of this matrix for images data is usually much faster because the number of sample Ms is much less than R2 in most cases. The required eigenvectors are equal to b{ = Yb\. Having t\, the original procedure can be followed as above. 5.3 Experimental Results 5.3.1 Colour Histograms Random Colour Histograms First, we will investigate how 3-level and 4-level filtering at their respective optimums compare with the optimums of the Sawhney and Hafner approach and the Faloutsos ap-proach using random colour histogram as the testbed. As discussed before, the Falout-sos approach has been found to behave experimentally very close to the Sawhney and Hafner approach with k = 4. Thus, in the graphs to be presented shortly, the nota-tion (4,512) and (Ar, 512) represents the Faloutsos approach and optimal Sawhney and Hafner approach respectively. Recall from Section 5.2.1 that both of them store origi-nal histograms. Hence, the ti/v^Q is used as the distance measurement in both cases. The 3-level and 4-level filtering approaches ( represented by (4, k, 512) and (4, k\, k2,512) respectively) store transformed histogram, and thus use d^-Chapter 5. Experimental Evaluation 58 Figure 5.6: Relative efficiency with original histograms The graph in Figure 5.6 compares the total time taken per query by various config-urations relative to the 3-level optimum (4,18,512). Clearly, the configuration (4,512) ( Faloutsos approach ) was the least efficient, taking over 9 times longer than (4,18, 512), and over 2 times longer than (12, 512). As the k value changes from 4 to 12, the configu-ration (12,512) becomes more efficient. However, even though (12,512) is the Sawhney and Hafner optimum, it still takes 4 times longer than the 3-level optimum. (4,18,512) is slightly more efficient than the 4-level optimum (4,8,28,512) which still outperforms the Sawhney and Hafner 2-level optimum. The selectivity at the finest level for this set of experiment is chosen to be about 0.1% and the page size is set to be 16K bytes. The efficiency gaps among all the configurations shown in Figure 5.6 are due largely to the fact that (4,512) and (12,512) store the original histograms, while the other two store the transformed histograms. The graph in Figure 5.7(a) shows the relative total time taken by various configurations, where this time all configurations store the transformed histograms. With transformed histograms, the efficiency gaps among the configurations Chapter 5. Experimental Evaluation 59 <4,512> <12,512> <4,18,512> <4,8.28,512> <4,512> <12,512> <4,18,512> <4,8,28,512> (a) Relative Total Time (b) Relative CPU & I/O Time Figure 5.7: Relative efficiency with transformed histograms become narrower. This is because with TdN A replaced by a much smaller T ^ , the last term in Equation (3.12) becomes less dominant. In fact, the entire C P U cost becomes much less than before, allowing the I/O cost to dominate in the total cost. This is confirmed by the graph shown in Figure 5.7(b). In that graph, for each configuration, the C P U and I/O costs are shown as percentages of the costs of (4,18,512). Note that even though for (4,18,512), both percentages are 100, this does not mean to say that the absolute C P U and I/O times are the same. It only means that the C P U and I/O costs of (4,18,512) are used as the bases for comparisons. The graphs in Figure 5.7 show that the 3-level optimum (4,18,512) and the 4-level optimum (4,8,28,512) st i l l compare favourably against the Sawhney and Hafner 2-level optimum (12,512). Specifically, the 2-level optimum takes 67% more C P U time and .18% more I/O time, and hence 22% more total time than 3-level optimum. The Faloutsos Chapter 5. Experimental Evaluation 60 (a) Storing Original Histograms (b) Storing Transformed Histograms Figure 5.8: Relative efficiency with histograms from real images configuration (4,512) takes 81% more total time than the 3-level optimum. Finally, even though the 4-level optimum takes 14% less C P U time than 3-level, a higher I/O cost forces the former configuration to be less efficient than the latter configuration. Colour Histograms from Real Images Next, we will investigate whether 3-level and 4-level filtering can still outperform 2-level if the testbed is changed to colour histograms derived from real images. We will first focus on the relative total time with all configurations storing transformed histograms at the finest level (i.e. ii/v() is used instead in all cases). Because there are fewer histograms from images than random histograms, the page size for this set of experiment is shrunk to 8K bytes. The selectivity of the finest filtering is raised to about 2%. The Figure 5.8(b) shows that the 3-level optimum (4,28,512) takes 15% less time than 2-level optimum Chapter 5. Experimental Evaluation 61 (18,512). On the other hand, the 3-level optimum is more than 5 times faster than 2-level optimum if the original histograms are used in the latter approach, as shown in Figure 5.8(a). This performance gain is even more significant than that of random histogram in the corresponding case. 5.3.2 E igenfaces The testbed is now changed from colour histograms to eigenfaces. We are going to use a slightly different methodology for this experiment. First, the notion of "original his-tograms vs transformed histograms" is no longer applicable in this set of experiments because only vectors of principle components of various dimensions are stored in the cor-responding levels of filtering. Second, only the costs of the optimal configurations of two, three and four levels filtering are reported. As presented in Section 5.2.3, the eigenfaces are used as an example of eigenimages and the data are those randomly generated in be-tween the dynamic range from original 400 training eigenfaces. Recall from Section 2.2.3 that the maximum dimension is the number of training images less one. The first set of the experiments uses this maximum number of dimension (i.e. 399) for the finest level of filtering. In order to compare with the results from random colour histogram as accurately as possible, the parameters for both sets of experiments are set to be very similar. A data set of size 5000 is used to match the size of random colour histogram. Like the experiment of random colour histograms, the page size is kept as 16K bytes, and the finest level selectivity is also to be 0.1%. The results of the experiments are shown in Figure 5.9(a). The 2-level optimum (12,399) takes 30% longer than the 3-level optimum (4,16,399). The 4-level optimum (4,14,32,399) takes 10% longer than the 3-level optimum, but is still takes faster than the 2-level optimum. Chapter 5. Experimental Evaluation 62 <4,16,399> <4,14,32,399> 140 120 100 0 E g 1 129% i H f i n % 100% • flip • ••1 rt < , 2 . , J 3 > <4,14,133> <4,14,32,133> (a) Finest level = 399-D (b) Finest level = 133-D Figure 5.9: Relative efficiency with eigenfaces However, it may be argued that the performance gain is due to the maximum di-mension used for finest level of filtering. One may prefer faster performance to extreme accuracy. Therefore, only a fraction of this maximum is required for the finest level. The second set of the experiments uses 133 dimensions for the finest level. Other details of the experiment are kept unchanged. The graph in Figure 5.9(b) is almost identical to the graph in Figure 5.9(a). It implies that the multi-level filtering can still be very useful even when the dimension of the feature vector is not very high. The experiments in this section show that multi-level filtering is very useful to enhance performance when compared with 2-level filtering no matter whether the testbed is colour histograms or eigenfaces. As predicted in Section 4.6, 3-level filtering may be the limit of multi-level filtering. Even 4-level filtering can usually outperform 2-level, it is inferior to 3-level. • Chapter 5. Experimental Evaluation 63 100 -> CO 95 CM - 90 0) > n 85 > 4-level 3-level ^ s * ^ % 41 Figure 5.10: Effect of changing page size on 3-level filtering relative cost 5.4 Effects of Different Values of Parameters 5.4.1 Page Size In this section, random colour histograms will be used as the testbed for experiments using different page sizes. Assume the size of each bin is 4 bytes. Because the size of one 512-bin colour histogram is already 2K byte, it is therefore unavoidable to have page sizes larger than the typical page sizes in the relational database. In this set of experiments, three different page sizes will be used, namely 4K, 8K and 16K bytes. The effect of different page sizes on the performance gain of optimal multi-level over the corresponding optimal 2-level filtering will be investigated. Unlike the previous sections, the results are normalized such that the cost of the corresponding optimal 2-level filtering at each page size is 100%. Figure 5.10 shows that the larger the page size is, and the smaller the percentage of the multi-level filtering costs, the more the multi-level saves compared with 2-level. Chapter 5. Experimental Evaluation 64 This effect can be explained if Figure 5.7 is examined again. The 2-level filtering I/O costs only 18% more than 3-level, but the 2-level CPU costs 67% more than 3-level. In other words, the performance gain of 3-level filtering over 2-level is much more . significant for CPU.than that for I/O. In fact, similar observations hold for 4-level filtering too. Unfortunately, the I/O cost shares a larger portion than CPU cost in the total cost. Therefore, the bottleneck of the overall multi-level saving is due to its relatively little saving of I/O cost. If the I/O cost plays a less significant role in total cost, the huge saving of CPU cost will be more obvious. Thus, there will be increased overall cost saving. One solution is to increase the page size, thus less pages are needed to be accessed. Therefore, the I/O cost will also be less, and share a smaller portion of the total cost. This explains why increasing page size can raise the relative performance gain of multi-level compared with 2-level. In fact, it can be predicted that increasing the page size further may save even more. . It is important to note that the explanation in the above paragraph assumes the page size does not affect access time. As matter of fact, this is a reasonable assumption. Nowadays, the transfer rate of hard disk can reach to the level of 20 M bytes per second. Doubling the page size from 8K to 16K bytes adds merely 0.39 msec which is insignificant compared with the total access time (which is assumed to be 10 msec per page). 5.4.2 C P U S p e e d This section will test how the saving of multi-level filtering is affected by varying the CPU speed. The experiments are very similar to the testing of different page sizes in the above Section 5.4.1, and the page size for these experiments is set to be 16K bytes. Three different CPU speeds are chosen: the speed of the original Sparc-10, half as fast and double as fast. Similarly, all the results reported are the percentage time relative to their respective 2-level optimal time. Chapter 5. Experimental Evaluation 65 8 If CO mm £ o tt 4-leveh/' 3-levelth dimension values of the points will be duplicated for both values of k-th dimension rectangular box range. Consequently, half of the space occupied is wasted. If a more compact indexing structure is used instead, some I/O accesses of the indices will be saved. Like the effect of increasing page size, fewer I/O accesses implies that the I/O cost occupies less share in the total cost. It seems to be true that this will eventually increase the performance gain of 3-level relative to 2-level. However, this con-clusion ignores other important factors. Consider the following experiment of replacing R-trees with more compact adaptive K-D-trees as indexing structures for storing colour histograms.1. A l l other parameters are kept unchanged. The analysis will be based on the comparison of the effects on the I/O costs of 3-level and 2-level filtering when the in-dexing structure is changed from R-tree to K-D-tree. Their C P U costs follow the similar analysis. 1The source codes of the adaptive K-D-tree are obtained from Ms Andishe Sedighian of Computer Science department in University of British Columbia Chapter 5. Experimental Evaluation 67 (a) 3-level I / O costs (b) 2-level I / O costs Figure 5.12: Effects of changing indexing structure on 3-level and 2-level costs Figure 5.12(a) shows that the I/O cost of 3-level filtering shifts downward when the indexing structure is changed from R-tree to K-D-tree. This is because the only difference between these two trends is the difference in the I/O costs of their respective 4-D tree structures. Indeed, this difference is independent of k. Therefore, the trend change is a simple downward shift. The I/O cost of 2-level filtering consists of the costs accessing the indexing structure and the flat file (i.e. the first and second terms of Equation. (3.14)). Figure 5.12(b) shows that the index I/O cost (first term) of 2-level filtering for K-D-tree is much less than that for R-tree. It is important to note that not only is the amount smaller, but also the slope flatter. The costs for accessing the flat file (second term) must be the same no matter what the indexing structures are. Combining the two terms, both the total I/O cost curve of 2-level filtering for R-tree and K-D-tree resembles U-shape. However, the Chapter 5. Experimental Evaluation 68 one of K-D-tree is a much wider U-shape than that of R-tree. This is due to the flatter slope of K-D-tree index term, not its smaller amount. Both the total I/O costs of 3-level and 2-level drop when a K-D-tree is used instead of R-tree. However, the cost of 3-level drops slightly while the 2-level drops sharply. Consequently, the 2-level optimum is so cheap that the 3-level optimum may lose. It can be concluded that the performance gain of 3-level filtering relative to 2-level filtering is not obvious when the indexing structure is changed. If the indexing structure behaves like K-D-tree ( i.e. its cost of accessing the index increases very slowly when the dimension k increases), 3-level filtering may not outperform 2-level filtering. However, if the indexing structure is only more compact, 3-level filtering may easily outperform 2-level filtering. As a summary, most of the experimental results agree with the the evidence provided by the trend analysis in the previous chapter that 3-level filtering can typically cost less than 2-level filtering. Chapter 6 Design and Implementation of Optimizers 6.1 Introduction Assume a dataset of M images is given. In order to create an efficient database for their image feature vectors, a certain filtering configuration needs to be selected so that the corresponding k-D indexing structure and the flat files of intermediate levels can be built with these image feature vectors. The configuration here refers to the number of levels of filtering and the dimension of each level. It has been shown in the previous chapters that 3-level, though often, may not always outperform 2-level and 4-level filtering. Further-more, Figures 5.7, 5.8 and 5.9 indicate that the optimal configurations need not be the same for all cases. The parameters such as ak, j3k, IndexJCPU and Index-I/0 may vary significantly according to the characteristics of the image features, size of the database, page size, indexing structures. It is necessary to find a new set of parameters for the new dataset, so its corresponding total cost and optimal configuration can be found. Other-wise, there is often a penalty at run-time for each query when sub-optimal configuration is used. Even a 10% run-time penalty for every query can add up to a significant amount. The goal of the optimizer is to find the optimal configuration. One simple approach of achieving this goal is to repeat the procedures of the previous experiments by considering all the images of the database and configurations find the optimal one with minimum cost. This exhaustive searching certainly takes a long time to complete, especially for huge databases. In the following, we present an optimizer that relies on sampling to find 69 Chapter 6. Design and Implementation of Optimizers 70 the optimal configuration for a certain dataset. 6.2 Specifications of the Optimizer 6.2.1 Input to the Optimizer Since the filtering structure, be it 2-level, 3-level or 4-level, is used for all queries, the op-timizer accepts a selectivity mix to be given as input. More specifically, the database de-signer can specify and input a list of the form: ( ( 7 1 , Pi) , . . . , (lw,Pw)), where 7 1 , . . . , 7™ are selectivities and p i , . . . ,pw are the respective (expected) frequencies. For instance, given that the database has 10,000 images, and that most users would ask for the top-10, top-15 and top-20 images, the database designer can input (0.001,33.3%), (0.0015,33.3%), and (0.002,33.3%). As will be detailed later, the output of the optimizer is the optimal configuration with value weighted by all the selectivities and their frequencies. 6.2.2 Qualities of Service Depending on how much "compile" time the database designer wants" to spend, differ-ent "qualities of service" can be provided There are three operations to be performed: (a) finding the 3-level optimum, (b) finding the 4-level optimum, and (c) finding the 2-level optimum. These operations are in ascending order of computation time, and they are performed by the 2-level, 4-level and 3-level optimizer respectively. As will be de-tailed later, Operation (a) takes the smallest amount of time because it is equivalent to searching for the minimum in a 1-dimensional space (i.e., the space of k, as presented in Section 3.4.1). Operation (b) takes considerably more time, because it amounts to searching for the minimum in a 2-dimensional space (i.e., the space of k^ and k2 as pre-sented in Section 3.5.-1). Finally, Operation (c) takes the most time, even though the search space is only 1-dimensional (i.e., the space of the dimensionality A:-of'the index). Chapter 6. Design and Implementation of Optimizers 71 This is because, unlike the other two operations, the 2-level optimizer has to estimate the C P U and I/O costs searching a ^-dimensional index - for each value of k tested by the optimizer. With respect to the three operations above, there are three qualities of service. The most basic service performs Operation (a) only. The next class of service does (a) and (b), and output the configuration with the lower cost found by the two operations. The most elaborate service performs all three operations, and output the lowest of the three. The rationale of such an ordering is as follows. As analyzed in Section 4.6, in most cases, the 3-level optimum is of lower total cost than the 4-level optimum and the 2-level optimum. And in those cases when this is not true, the 3-level optimum is not very far off the real optimum either. Thus, if limited time can be spent for optimization, performing Operation (a) alone is the best bet. If more time can be spent, it is worthwhile to perform Operation (b), which takes less time than Operation (c). If all three operations are carried out, it takes the longest time (e.g., tens of minutes of C P U time). But for an image database with very frequent colour queries, tens of minutes of compile time may still be worthwhile. The design of the the optimizer which finds the 3-level optimum will be discussed in Section 6.3.1 (3-level optimizer). Since the design of 4-level optimizer is a simple extension of the 3-level one, its discussion will be skipped from this thesis. Finally, Because the 2-level optimizer involves different issues, it will be discussed in Section 6.3.4. 6.2.3 Accuracy The two most important issues of an optimizer are its performance and accuracy. Obvi-ously, the performance can be easily measured in terms of the total time taken to find the optimal configuration. However, the definition of accuracy is not as obvious, let alone its measurement. What exactly is the accuracy of an optimizer? The accuracy must Chapter 6. Design and Implementation of Optimizers 72 be measured with respect to a certain reference point, or yardstick. In order to answer this question, this yardstick must be defined first. Recall that the ultimate goal of the optimizer is to find the optimal configuration (one with lowest cost) for any query. This yardstick could be defined as the the optimal configuration for any query, regardless of the resources and time needed. However, this goal is clearly unachievable because of two reasons. First, the real query of user is still unknown when the optimizer is invoked. Second, the configuration may be optimal for a query, but not for the others. There-fore, a certain reference query has to be defined for the sake of the cost comparisons even though the reference query may not be the same as the real queries of the users. Multiple reference queries are chosen to reduce the bias of any particular single reference query. The yardstick, or the best possible optimizer, is thus the one which considers all M images in the database with respect to the reference query (or queries). From here onwards, this yardstick is called the reference optimizer. 6.3 Design of the Optimizer In this section, the algorithm of the first proposed version of 3-level reference optimizer will be reported, and its problems will then be discussed. After that, the algorithm of its revised version will follow. A more efficient optimizer which uses sampling techniques will then be presented. Finally, the design of 2-level optimizer will be compared with that of 3-level. 6.3.1 A reversible 3-level Reference Optimizer As it will be seen in the latter sections, the 3-level reference optimizer is relatively simple when compared with the 2-level one. As the first terms in Equations(3.15) and (3.16) are constant to dimension k, it is sufficient. to consider only the remaining terms in Chapter 6. Design and Implementation of Optimizers 73 both equations. Only the selectivities ak and fa are needed to compare total costs of configurations with different k. In the beginning, it is also assumed that there is only one selectivity ((7,100%)) specified in the selectivity mix. The modification of the design which is needed for multiple selectivities in the mix will be considered in the Section 6.3.3. The objective of the reference optimizer for a database of size M is to 1. find the threshold distance D from 7; 2. compute the corresponding k-D selectivities (i.e. ak and fa ) from D and total cost for all k; choose a configuration with the minimum cost. The first approach to meet this objective is as follows. Algori thm 1 1. For a reference query Ql, compute the Li-norm distance dist^^l\Ql) between the original high-dimensional (N-D) feature vectors of every image I and query Ql. 2. Select the (7 x M)-th smallest distance of all dist^^iQ1) computed. This is the threshold distance Dl 3. Repeat step 1 and 2 with other reference queries. Calculate the average threshold distance Dav9 of all Di. -* -*• 4- For a given dimension k and query Ql, compute the Li-norm distances distk(Ik, Q\) between the abstracted k-D feature vector between every Ik and query Q\. 5. Count the number Ul of images whose feature distances from the corresponding k-D vector Q\ are less than Davg. The required selectivity ct\ is then given by Ul/M, while fa can be computed similarly by measuring range distance, rather than Li-norm distance. Chapter 6. Design and Implementation of Optimizers 74 (a) Irreversible (b) Reversible Figure 6.13: Reversibility problem 6. Repeat steps 4 and 5 with other reference queries. The required ak and fa are the averages of a\ and fa respectively over all i. 7. Having computed ctk and fa, the relative total cost can be calculated for comparison according to the cost models of Equations (3.16 and 3.15). 8. Repeat the whole procedures with different dimensions k and find the one which gives minimum relative cost. This is the required optimal configuration. When k = N, the computed selectivity ak should naturally be equal to the input selectivity 7. This property is called reversibility. An algorithm is reasonable only when it is reversible. Unfortunately, this algorithm is not reversible. Figure 6.13(a) shows that the problem happens when the threshold distance D1 and selectivities ak and faare averaged. Therefore, another approach is proposed to deal with this reversibility problem. Chapter 6. Design and Implementation of Optimizers 75 Algorithm 2 1. For a reference query Ql, compute the Li-norm distance distill, Ql) between the original high-dimensional ( N-D ) feature vectors of every image I and query Q1. 2. Select the (7 x M)-ih smallest distance of all dist^ilrQ1) computed. This is the threshold distance D%. —* ~* 3. For a given k and Q1, compute the Li-norm distances distk(I,Ql) between the abstracted k-D feature vectors of every image I and query Ql. 4- Count the number U of images whose feature distance less than D%. The required se-lectivity a\ is then given by U% jM, while f3k can be computed similarly by measuring range distance rather than Li-norm distance. 5. Having the computed a\ and J3k, the relative total cost cost\ of this query Ql and dimension k can be calculated. 6. Repeat steps 1 to 6 with different possible values of k and i. 7. For each dimension k, calculate the average cost costakV9 of costk over all i. 8. The dimension k which gives the minimum average cost is the optimal configuration. Basically, this revised version looks very similar to the original version. The crucial differences are at the points where the averagings are computed. Since this revised version completes the cycle of computing threshold distance and then ak ( and (3k) for a given query i and dimension k before it proceeds with other values of k and i, the requirement of reversibility can be satisfied. This is shown in Figure 6.13(b). 6.3.2 An Efficient 3-level Optimizer Using Sampling Algorithm 2 describes the best possible or reference optimizer because it consider all M images. As mentioned before, the problem is that it is inefficient especially when M is Chapter 6. Design and Implementation of Optimizers 76 large. It is desirable to devise an optimizer which can trade a certain degree of accuracy for better performance. In other words, only a fraction of the M images is sampled and considered in the above algorithm. The accuracy of the more efficient optimizer using sampling can be measured against the reference optimizer. By sampling theory, the exact numbers of samples can be computed according to the allowable errors and confidence levels which are specified by the database designer. Algorithm 2 is required to be modified for sampling. Instead of using all M images in step 1 and 2, only a sample fraction T\. of M is needed to calculate the threshold distance D\ for reference query Ql and dimension k. To compute selectivities a\ a n d a sample fraction F%k' of M is used. Now, the next section describes how to compute the sampling fractions Txk and J-lk' for a certain allowable error and confidence level. The sampling theory provides a way of computing the sample size needed such that the sample mean is close to the population mean with a certain allowable error and confidence level. In order to calculate the threshold distance D\ from the selectivity 7 (in step 1 and 2 of Algorithm 2), the (7 x M)- th smallest distance of distill, Q1) is used. In other words, this is to calculate sample 7 percentile, rather than the sample mean. Sample percentile is not as much discussed topic as sample mean. In [SSW92], Sarndal et al. describes how to use sampling to estimate population median, which is special case of 7 percentile. In general, estimating sample percentile can be quite complicated. However, since the sampling method used here is simple random sampling without.replacement, the implementation is still feasible. Simple modifications are made to the procedures presented in [SSW92] so that any 7 percentile can be accepted, rather than 7 = 0.5 only. Our aim is to compute the sample size m such that the given sample percentile 7 will be close to the population percentile with an allowable error e and confidence level cl (with the normal distribution statistic Zci/2, or simply Z). The formulas are as follows. Chapter 6. Design and Implementation of Optimizers 77 MMl - 7) ra Mi- + 7(1 - 7) • ra M where u is the variance and M is the population size. After the threshold distance is estimated, the k-D selectivities are then computed based on the threshold distance (in step 3 and 4 of Algorithm 2). The computation of sample A>D selectivities is, in fact, the estimation of the proportion of the samples whose k-D abstracted vector distance is less than threshold distance from the corresponding query vector. The sampling of proportion is very similar to that of mean. For a population size M, one approach of estimating the sample size ra' for given sample proportion p, population proportion P, allowable error e and confidence level cl (with the corresponding Z) is as follows [Coc64]. z2PQ ra where Q = 1 — P. Obviously, the population proportion P is unobtainable in our case.. A method of two steps sampling is used to estimate the population proportions P to determine the sample size m' [Coc64]. The pre-sampling step ( of size m[ ) is used to estimate pi of population proportion P, and then the sample size ra' for real sampling step can be computed as v. follows. Chapter 6. Design and Implementation of Optimizers 78 / Piqi , 3 - 8pi(?i 1 - 3pn7i m = h — h — v piqi vm-y T{' = — k M where qi = 1 — p\. Suppose p\ = 0.1, q\ = 0.9, e = 0.01, Z = 1.0. The population size M is 10000, and the initial sample size is 100. The calculated rn' is 998.33, and so the sample fraction r is 0.099833. 6.3.3 Se lec t iv i ty M i x Algorithm 2 can be generalized to handle selectivity mix ( ( 7 1 , Pi) , . . . , ('JwiPw)), not only single selectivity ((7,100%)). For a given query Q\ a threshold distance D1- is computed for each designer specified selectivity 7 ^ . The time of computing threshold distance TJ] for w number of 7 j needs not to be p times longer than that of a single 7 . The bottleneck is to compute the distN(I,Q*) for every images i (step 2 of Algorithm 2). In fact, this can be done once for all w number of jj. Although the selection of the (jj x M) has to be repeated w times, the selection algorithm can be implemented very efficiently in O(M). Similar savings hold for the procedure in calculation of k-D selectivities too. The expensive procedure of computing distk(Ik, Q\) for every images can also be done once for all w. In step (4), the counting of images whose feature distances less than the threshold distance TJ] can also be done in a single cycle for all w number of different j. With all these time-saving techniques, the algorithm for selectivity mix will only be slightly less efficient than that for single selectivity. The total cost of each configuration with selectivity 7 * can be weighted by the cor-responding frequency pl. The optimal configuration can be obtained by finding the one which gives the minimum weighted total cost. Chapter 6. Design and Implementation of Optimizers 79 6.3.4 A 2-level optimizer Since the algorithm for calculating the threshold distances of the 2-level optimizer is the same as that of the 3-level optimizer, the first two steps of Algorithm 2 hold for the 2-level optimizer too. But unlike the 3-level optimizer, the comparisons between the 2-level optimizers with different dimensions k have to take the indexing structure into consideration. In other words, the complete cost models of both C P U and I/O ( Equation(3.12) and (3.14) ) have to be computed. In addition to ct%k and /3k, Index-IOQ and Index-CPU() are both needed for different k and i. Therefore, the steps 3, 4 and 5 of Algorithm 2 need to be modified as follows. 3. For a given k, build an k-D indexing structure and insert all the feature vectors Ik of every images into it. 4. Search the indexing structure with the query Q\. Record the C P U time taken as IndexJjPUQ and pages accessed as Index -IOQ. ct\ and p\ can be computed similarly. 5. Having the computed Index-CPUQ, Index -IOQ, a\ and (3k, the total cost cost\ of this query i and dimension k can be calculated. This modified algorithm is then the most accurate 2-level optimizer which can be used as reference to other more efficient optimizers. Sampling technique can also be used to devise more efficient optimizers. Its idea for 2-level optimizer is the same as that for 3-level one in the calculation of threshold distances. However, the sampling of 2-level optimizer may not work as well for the following reason. For the 3-level optimizer, different sample sizes are computed for different queries i in the calculation of selectivities. If this were done for 2-level optimizer too, indexing structures of different sample sizes are needed to be built for different queries Ql (step 4). Chapter 6. Design and Implementation of Optimizers .80 Building all these indexing .structures is a very time consuming step. In order to avoid building indexing structures so many times for different sample sizes of different queries Q\ the maximum sample size is chosen among them. Therefore, only one indexing structure is built with this maximum number of samples. The advantage is that this bottleneck step is only required to be executed once for all queries. Unfortunately, it can be expected that even if a single sample size of a certain query Ql is large, say close to the population size, the indexing structure has to be built with almost all of the population. In other words, the technique of sampling may not help much to improve the performance. 6.4 Experiments 6.4.1 3-level optimizer Details of the Experiments In order to test how the design of the optimizers works in practice, they are implemented in C++. A dataset of 5000 512-bin random colour histograms is generated using the method described in Section 5.2.2. Ten histograms are chosen from these random his-tograms as the reference queries. The reference optimizer is first implemented by using all 5000 colour histograms. Recall that the procedures of optimizer in Algorithm 2 are divided into two stages. The first stage is to calculate the threshold distance (step 1 and 2), while the second one is to calculate the k-D selectivities of all k, and therefore the optimal cost and corresponding configuration can be found (step 3 to 8). The initial sample size mi for second stage is set to 500. The idea of the following experiments is to test the optimizers which use sampling at the first and second stage separately. Therefore, the individual effect of the samplings at each step can be observed. The exact procedures of testing 3-level (4, fc, 512) optimizer are as follows. Chapter 6. Design and Implementation of Optimizers 81 First of all, the reference optimizer is to be created by considering the whole popu-lation of 5000 histograms at both stages (i.e. no sampling at both stages). By following Algorithm 2, the optimal configuration (4, kref, 512) of reference optimizer and its cost function costakV9 can be found. Now, the more efficient optimizer can be created by modifying Algorithm 2 for sam-pling. The sample size for computing the threshold distance is fixed at population size (i.e. no sampling in the first stage). A certain confidence level cl and an allowable error e for the stage of computing the k-T) selectivities (sampling in the second stage) are chosen. This can give the optimal configuration (4, ksam, 512) and cost function costakvg of this optimizer using sampling. The allowable error e can be varied to investigate the effect of sampling at the second stage. Next, the effect of sampling in the first stage can be examined as follows. A certain confidence level cl' and error e' are chosen for computing threshold distance (first stage). The confidence level cl and error e can be fixed for computing the selectivities (second stage). Similarly, this gives another optimal configuration (4, Ar s a m , 512) and cost function costak9 of this optimizer using sampling, e' can be varied to investigate the effect of different sampling in the first stage. Error calculation Suppose the optimal configurations given by the reference optimizer and efficient opti-mizer using sampling are (4, fcre/,512) and (4, ksam, 512) respectively. Furthermore, both the cost functions costak9 and costak9 are now available for different k of reference op-timizer and optimizer using sampling respectively. One way of measuring the error of optimizer using sampling (4, ksam, 512) can simply be defined as the percentage differ-ence between the costs costT9 • and costT9 . However, the error may not be zero even if ksam = kref. A better way of defining error is the percentage difference between the costs Chapter 6. Design and Implementation of Optimizers 82 cost0?9 and costV19 . In other words, it is the difference between the costs computed by the same reference optimizer cost function with krej and ksam. This provides more accurate comparison between optimizers. Results The reason for using sampling for the optimizer is to save time when compared with the reference optimizer. The first set of the results shows the absolute time taken by the optimizer using three different allowable errors e in the second stage. These e are absolute errors which are chosen to illustrate the effect of different samplings. The results of the time needed are also reported in two parts separately. The first one measures the time taken to compute the threshold distance, while the second one measures the time taken to compute the selectivities k and the therefore optimal total cost for all k. The confidence level is assumed to be 70%. Figure 6.14(a) shows the result. It can be seen clearly that the time taken for computing the selectivities and the optimal cost (second stage) decreases when the allowable error e increases. This can be easily explained. Since e increases, the sample size computed from e drops, and the time taken will thus be less. The time taken by the first stage is also included for comparison purposes. It can also be seen that the time taken for computing the threshold distances (the first stage) almost remains unchanged when e increases. It is because the confidence level, error, and thus the sample size computed for the first stage are-kept constant in this set of experiments. Figure 6.14(b) shows the effects on the accuracy. The optimizer with sampling at confidence level of 70% and allowable error e of 0.003 gives the same configuration as the reference optimizer. Hence, no error is shown. Even when e reaches 0.005, the error is merely 2.48%. These results are very encouraging since the optimizer is shown to be both efficient and accurate. Chapter 6. Design and Implementation of Optimizers Figure 6.14: Performance and accuracy of 3-level optimizer using different samplings the second stage Chapter 6. Design and Implementation of Optimizers 84 (a) T o t a l t ime taken (b) E r r o r percentage Figure 6.15: Performance and accuracy of 3-level optimizer using different samplings in the first stage The next set of experiments is to test the effect of sampling in the first stage on accuracy and performance. Hence, the confidence level and allowable error are fixed (at 70% and 0.002 respectively ) for the second stage which computes all k-D selectivities and total costs. The experiment then compares the accuracy and performance at sampling with confidence level of 70% and three different allowable error e' of the first stage which computes threshold distance. Figure 6.15(a) shows the effect on performance. Similar to the result of varying sampling in the second stage, the time taken decreases as the allowable error increases because the computed sample sizes decrease. Further-more, it is also getting less accurate as allowable error e' increases. Even though the time taken is fairly short at e' = 0.0005, Figure 6.15b shows that the error is unreasonably high at almost 40%. In contrast, the error is much smaller when the sampling is varied in the second stage as shown in Figure 6.14b. It is thus better to vary the sampling in Chapter 6. Design and Implementation of Optimizers 85 200 2> > 180 ? o to o 160 140 120 100 4 80 y * y stage 2j / y y y stage 1 Figure 6.16: Relative efficiency of different numbers w of selectivities the second stage to achieve better performance with reasonable accuracy. The last set of experiments for 3-level optimizer is to test how the number of se-lectivities w in the designer specified selectivity mix can affect the performance of the optimizer. In order to avoid any effect of the selectivity and error values, all the selectiv-ities and the corresponding allowable errors are set to be the same throughout the mix, i.e. ((7, ^p%), • •., (7, ^ % ) ) - The graph in Figure 6.16 shows the relative time needed in the first stage (for computing threshold distance) and the second stage (for computing different fc-selectivities and total costs) at different number of selectivities. The darker curve shows that the. time needed for computing threshold distances is virtually the same for any number w of selectivities. This can be explained because the most time-consuming part which computes every distances (step 1 of Algorithm 2) is done only once for all number of selectivities. The lighter curve shows the time needed for the second stage increases slowly with w. In fact, the time needed increases by less than 100% when w increases from 1 to 5. This shows that the time-saving techniques Chapter 6. Design and Implementation of Optimizers 86 described in Section 6.3.3 work effectively. 6.4.2 2-lev.el Optimizer The experimental details, procedures and error calculation of 2-level optimizer are all very similar to those of 3-level optimizer. A 2-level reference optimizer is also needed to provide a yardstick of measuring accuracy. AIL the procedures of are still divided into the two stages of finding threshold distance and optimal configuration. Results Like the 3-level optimizer, both the performance and the accuracy results are also divided into two stages. The first set of the results tests the 2-level optimizers which use sampling at e = 0.002, 0.003 and 0.004 in the second stage and no sampling in the first stage. Thus, this set of results can show the effect of sampling in the second stage on the performance and accuracy of 2-level optimizer. Figure 6.17(a) shows similar results as the 3-level optimizer. The time taken in the second stage decreases when the allowable error increases. As shown in Figure 6.17(b), the sampling result is so accurate that the configuration returned by e = 0.002 and e = 0,003 are the same as the reference optimizer. Even when the error e reaches 0.004, the percentage error is only 0.7%. However, there are two important differences between the results of the 3-level and the 2-level optimizers. As explained in Section 6.3.4, the time saving of the 2-level optimizer due to sampling is not as great as that of 3-level optimizer since the computed sample sizes are typically close to the original population size. Moreover, the time spent on the second stage can be eight times more than that spent on first stage. This can be explained by the fact that the second stage of 2-level optimizer needs to find Index -IOQ and IndexJJ PUQ. This involves the building of the indexing structure which is a very time consuming process even when the sampling is Chapter 6. Design and Implementation of Optimizers 87 Figure 6.17: Performance and accuracy of 2-level optimizer using different samplings in the second stage Chapter 6. Design and Implementation of Optimizers 88 used. Hence, it is not necessary to do the experiment of varying the sampling allowable error e' in first stage. It is because the effect of the time saving of the first stage is insignificant on the total time spent on the both stages. Because of these two factors, the 3-level optimizer can be much faster than 2-level optimizer. For example, when e = 0.004, Figures 6.14 and 6.17 show that 3-level optimizer takes a total time of 32 seconds, but 2-level optimizer takes 227 seconds. It can be concluded that the 3-level optimizer can efficiently find the near-optimal (if not the real optimal) configuration with high accuracy by the technique of sampling. Although the 2-level optimizer may not be as efficient as the 3-level optimizer, Chap-ters 3 and 4 have shown that the configuration returned by the 3-level optimizer can typically outperform that returned by the 2-level one. Therefore, the most basic service provided by solely 3-level optimizer can be both efficient and accurate. For example, an experimental result indicates that in about 32 second, the 3-level optimizer can find the configuration whose error is less than 2-5%-Chapter 7 Conclusions In order to search efficiently for the database images which look similar to a specified query, features of the query image and the images in the database have to be extracted and compared. The dimension of these image feature vectors (such as colour, texture, shape and eigenimages) are typically very high so that sequential comparison or indexing is inefficient. 2-level filtering has been proposed to solve this problem. The high dimensional vectors can be abstracted to lower dimensional k-D vectors which are small enough to be inserted to multi-dimensional indexing structure. It is clear that 2-level filtering is more efficient because most of the invalid image candidates have been filtered out by the efficient coarse filtering stage, and the expensive fine filtering stage needs to process only the few remaining candidates. However, 2-level filtering still faces an inherent dilemma: When k is too small, there will be too many candidates left for the expensive fine filtering stage. When k is too large, the dimensionality curse makes the coarse filtering inefficient. This thesis proposes a new idea, multi-level filtering, to solve this dilemma. By inserting additional intermediate levels in between the coarsest and finest filtering stage, the dimension of the indexing structure will no longer affect the number of candidates left for the fine filtering stage. Although 3-level filtering looks beneficial intuitively, more concrete evidence can also been found in this thesis. To achieve this purpose, a cost model is derived for different filtering configurations. The cost model does not only estimate the actual time taken by a certain configuration for a particular dataset and indexing structure, it can also 89 Chapter 7. Conclusions 90 provide an essential foundation for trend analysis of different filtering configurations. The trend analysis is capable of predicting the general trends of 2-level (k, N) and 3-level filtering (4,k,N) with respect to different k. More importantly, it can also strengthen the important hypothesis that the optimal 3-level filtering can outperform the optimal 2- level filtering. Furthermore, experiments are set up to further test the correctness of the above hy-pothesis. Two different testbeds, namely colour histograms and eigenfaces, are selected. Both testbeds show that the optimal 3-level filtering can usually offer significant savings ranging from 15% to 30% when compared with the optimal 2-level filtering. In particu-lar, when compared with 2-level filtering approach by Sawhney and Hafner, the optimal 3- level filtering can be over 400% faster. It has also been shown that the percentage of saving can be further increased by enlarging the page and/or reducing the C P U speed. Last but not least, an optimizer is used to analyse the dataset and find the optimal configuration for it. Different qualities of services can be provided by different combina-tions of optimizers: 3-level alone, both 3-level and 4-level, and all 2 to 4-level optimizers. "A reference optimizer is first designed to take all the images of the database into account. Although this reference optimizer is too slow to be useful, it can serve as a yardstick so that the accuracies of other more efficient optimizers can be measured. These more effi-cient optimizers only consider a fraction of all images as samples. By sampling theories, the exact number of samples can be computed according to the designer specified confi-dence levels and error levels. In general, when more error is allowed, fewer samples are needed. Thus, the optimizer becomes efficient, but its accuracy drops. 2-level optimizer costs much more time than 3-level optimizer because it involves building the indexing structures. Typically, the 3-level optimizer alone is needed because it can usually gives the.optimal configuration among any number of levels of filtering, and it is the most efficient of all. Chapter 7. Conclusions 91 7.1 Future Work There are three possible extensions which can further verify the advantages of multi-level filtering. • One of the benefits of the cost models in this thesis is its flexibility. It is general enough to handle any dataset and indexing structure. However, it also suffers from the problem that the trend analysis may not reflect the real life situation too closely. This is because the behaviour of the indexing structure is not modelled very accu-rately. Therefore, only a rough general trend of the Index -IOQ and Index J3 PUQ is used for analysis. Hence, the overall trend can only be approximated. Possible future work is to analyse the behaviour of the indexing structure in detail. A more accurate formula can thus be derived to model both Index -IOQ and Index JZ'PU Q. Consequently, the results of the trend analysis can be more convincing. • In this thesis, only the R-tree and the K-D-tree have been considered as the indexing structure. It is certainly worth further investigation of the effects of other possible multi-dimensional indexing structures. For example, the R*-tree and other indexing structures which are designed for high-dimensionality data are good candidates. • Colour histograms and eigenfaces are the two selected testbeds to evaluate the idea of multi-level filtering experimentally. Although both of them are popular and represent explicit and implicit features, it is still worthwhile to test this idea with other high dimensional image features. Examples are shape and texture. This can enlarge the application domain of multi-level filtering. Another similar, but different idea is to increase the size of the database to test its effect. Bibliography [A+95] A . D. Alexandrov, W. Y . Ma, A . E l Abbadi and B.S. Manjunath. "Adaptive Filtering and Indexing for Image Databases." in SPIE Proceedings, Storage and Retriveal for Image and Video Database, III. Vol. 2420, pp. 12-23, February 1995. [B+90] Norbert Beckmann, Han-Peter Kriegel, Ralf Schnedier, Bernhard Seeger. "The R*-tree: A n Efficient and Robust Access Method for Points and Rectangles", in Proceedigs: ACM SIGMOD, pp. 322-311, 1990. [Ben75] Jon Louis Bentley. "Multidimensional Binary Search Trees Used for Associative Searching." in Communications of the ACM. Vol. 18, No. 9, pp. 509-517, 1975. [Car75] A . Cardenas. "Analysis and Performance of Inverted Data Base Structures" in Communications of the ACM, Vol 18, No. 5, pp253-263, 1975. [Chi94] Tzi-cker Chiueh. "Content Based Image Indexing" in Proceedings: 20th VLDB Conference, pp. 582-592, Satiago, Chile, 1994. [Coc64] Cochran, William Gemmell. Sampling techniques. New York, Wiley, 1963; [F+94] C. Faloutsos, W. Equitz, M.Flickner, W. Niblack, D.Petkovic, R. Barber. "Effi-cient and Effective Querying by Image Content" in Journal of Intelligent Information Systems, 3, 3/4, pp. 231-262, 1994. [FF91] B . V . Funt and G.D. Finlayson. Color constant color indexing. Simon Fraser University, School of Computing Science, Technical Report C S S / L C C R TR91-09, 1991. [Flu88] B . Flurry. Common Principal Components and Related Multivariate Models. John Wiley and Sons, 1988. [FMP93] Joseph M . Francos, Zvi Meiri and Boaz Porat. "A Unified Texture Model Based on a 2-D Wold-Like Decomposition, in IEEE transactions on signal processing, Vol. 41. No.8, pp.2665-2678, August 1993. [G+94] Yihong Gong, Hongjiang Zhang, H.C. Chuan and M . Sakauchi. "An Image Database System with Content Capturing and Fast Image Indexing Abilities" in Proceedings: Mulitimedia, 1994 International Conference, pp. 121-130, 1994. 92 Bibliography 93 [Gut84] Antonin Guttman. "R-Trees: A Dynamic index structure for Spatial Searching." in Proceedings: ACM SIGMOD, pp.47-57, 1984. [H+93] Kyoji Hirata, Yoshinori Hara, Naoki Shibata and Fusako Hirabayashi. "Media-based Navigation for Hypermedia Systems" in Proceedings: Hypertext '93, pp. 159-173, 1993. [Jai93] Ramesh Jain. "Workshop Report : NSF Workshop on Visual Information Man-agement Systems" in SIGMOD Record, Vol. 22, No. 3, pp. 57-75, September 1993. [KJ86] Thomas F. Knoll and Ramesh C. Jain. "Recognizing partially visible objects using . .feature indexed hypotheses", in IEEE journal of robotics and automation, Vol. RA-2 N o . l , pp. 3-13, March 1986. [LP95] F. Liu and R.W. Picard. Periodicity, directionality, and randomness: Wold fea-tures for image modeling and retrieval. M.I.T Media Lab. Technical Report No. 32., 1995 [MY88] Makoto Miyahara and Yasuhiro Yoshida. "Mathematical transform of (R,G,B) color data to Munsell (H,V,C) color data" in Visual Communications and Image Processing, SPIE Vol. 1001, pp. 650-657, 1988. [N+93] W. Niblack, R Barber, W. Equitz, M . Flickner, E . Glasman, D. Petkovic, P. Yanker, C. Faloutsos, G. Taubin. "The QBIC Project: Querying Images by Content Using Color, Texture, and Shape" in Proceedings of SPIE: Storage and Retrieval for Image and Video Database, pp. 173-187, Feb 1993. [Ore86] J . Orenstein. "Spatial Query Processing in an Object-Oriented Database Sys-tem" in Proceedings: ACM SIGMOD, pp. 326-336, 1986. [Per78] W. A . Perkins, "A model-based vision system for industrial parts". in IEEE trans-actions on computers, VOL C-27, pp.126-143, Februrary, 1978. [PL94] A . Pope, and D. Lowe. "Vista: A Software Environment for Computer Vision Research." in Proceedings 1994 IEEE Computer Society Conference, Seatle, WA, pp. 768-772, 1994. [PPS94] A . Pentland, R .W. Picard, S. Sclaroff. "Photobook: Tools for Content-Based Manipulation of Image Databases" . in Proceedings of SPIE : Storage and Retrieval for Image and Video Database II, vol. 2185, pp. 34-47, 1994. [Rob81] John T. Robinson. "The K-D-B-Tree: A Search Structure for Large Multidi-mensional Dynamic Indexes." in Proceedings: ACM SIGMOD, pp. 10-18,-1991. Bibliography 94 [Sam90] Hanan Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, New York, N Y , 1990. , . [SB90] Michael J . Swain and Dana Ha. Ballard. "Indexing Via Color Histograms" in Proceedings: Computer Vision 1990, pp. 390-393, 1990 [SB91] Michael J . Swain and Dana Ha. Ballard. "Color Indexing" in International Jour-nal of Computer Vision, 7:1, pp. 11-32, 1991 . [Sed95] A . Sedighian. A Comparative Analysis of Multi-Dimensional Indexing Structures for eigenimages. Thesis 1996. [SH93] H . Sawhney and J . Hafner. Efficient Color Histogram Indexing. I B M , Technical Report R J 9572, 1993. [SH94] H . Sawhney and J . Hafner. "Efficient Color Histogram Indexing" in Proceedings: International Conference on Image Processing, 1994. [SKB94] Michael J. Swain, Roger E. Khan and Dana Ha. Ballard. "Low Resolution Cues for Guiding Saccadic Eye Movements" in Computer Vision and Pattern Recognition, pp. 737-740, 1992. [SM92] Fridtjof Stein and Gerrad Medioni. "Structural indexing: efficient 2-D object recongition". in IEEE transactions on pattern analysis and machine intelligence, Vol 14 No. 12 1992. [SRF87] Timos Sellis, Nick Roussopoulos, and Christos Faloutsos. "The R+-tree: A Dynamic Index for Multi-dimensional objects." in Proceedings of the 13th VLDB Conference, pp. 507-518, 1987. [SS94] Markus Strieker and Michael Swain. "The capacity of color histogram Indexing." in Computer vision and pattern recognition, pp. 704-708, 1994. [SSW92] Carl-Erik Sarndal, Bengt Swensson and Jan Wretman. Model assisted survey ' sampling. New York, Springer-Verlag, 1992. [TP91] Matthew A . Turk and Alex P. Pentland. "Face Recognition Using Eigenfaces". in Computer Vision and Pattern Recognition, pp. 586-591, 1991. [Yao77] S. Yao. "Approximating Block Accesses inDatabase Organizations" in Commu-nications of the ACM Vol. 20, No. 4, pp260-261, 1977.