ON T H E ORDERING OF MULTIATTRIBUTE DATA IN INFORMATION RETRIEVAL SYSTEMS by X I A N LIU B.A.Sc, Xi 'an Jiaotong University, 1982 M.A.Sc, Xi 'an Jiaotong University, 1987 M.Math, University of Waterloo, 1992 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E OF D O C T O R OF P H I L O S O P H Y in T H E F A C U L T Y OF G R A D U A T E STUDIES Department of Electrical Engineering We accept this thesis as conforming to the required standard T H E U N I V E R S I T Y OF BRITISH C O L U M B I A March, 1996 ©Xian Liu, 1996 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of B e c f r l C * ! Engineer?!!^ The University of British Columbia Vancouver, Canada Date DE-6 (2/88) Abstract The requirement of ordering a set of multiattribute data arises frequently from spa-tial index processing and secondary key retrieval in modern information systems. The first application involves developing a single numerical index on a one-dimensional line for each point in multi-dimensional space, such that the spatial localizability can be preserved as best as possible. This work can be carried out by a mathematical trans-formation called spatial ordering. The Hilbert order has attained intensive interest in the literature due to its desirable performance. The lack of inexpensive encoding and decoding algorithms has been mentioned frequently in publications. The second application, secondary key retrieval, involves two issues: determining the resolution for each dimension in a multi-dimensional hashed space and ordering data blocks on disks. Aho and Ullman proposed a constrained nonlinear programming model applied to the partial match query. However, the literature is relatively silent on the range query retrieval. Faloutsos showed that the order determined by the reflected Gray code can significantly improve the cluster distribution of the data blocks. It is recommended to design symmetric Gray codes to reduce the bias introduced by the reflected Gray code. i i In this thesis, efficient analytical formulas are first derived to encode and decode the two-dimensional Hilbert order. Then the algorithms for the three-dimensional Hilbert order are discussed. The encoding and decoding processes for n-dimensional Hilbert orders (n > 4) would be considerably complicated and place a heavy computational burden on the application problems. Thus a new spatial order called the U order is proposed. Its major advantage is the significant simplicity of the encoding and decoding operations for multi-dimensional data. Performance assessments suggest that the U order is com-parable with the Hilbert order and superior to most other orders. The performance analysis is followed by a formal mathematical description for the n-dimensional U order. Then a set of analytical encoding and decoding formulas are presented. Besides the U order, several new spatial orders with the quadrant-recursive struc-ture are also investigated. The goal is to find the candidates that are competitive with the Hilbert order in performance and with relative simple encoding and/or de-coding formulas. Of these new orders, the Q4 order behaves best and its performance is very close to the Hilbert order. A n analysis shows that its encoding and decoding algorithms only need 64.5% and 84.2% of operations of the corresponding algorithms of the Hilbert order. The optimization of resolutions of multiattribute file systems is then discussed. Optimization models for the range query retrievals are presented by incorporating stochastic programming methodology. Three approaches axe discussed. The here-and-now and the wait-and-see approaches seem to be applicable to the situation in i i i which range queries follow a relative simple distribution. Four typical paradigms are analyzed. One of them leads to a result that is similar to the one used in industry to evaluate the performance of disk drives. The scenario tracking approach appears to be more flexible and more appropriate for the problems where the probability distribution is not available. It also provides an adaptive technique to formulate non-conventional stochastic programming problems. Then a stochastic programming software package is developed and implemented. Numerical experiments show the coordinating effect of the scenario tracking methodology. Finally, the symmetric Hamiltonian paths in a high-dimensional hypercube are discussed. A heuristic approach is applied to construct the symmetric Hamiltonian paths in the five-, six-, and seven-dimensional hypercubes. They are equivalent to the symmetrical Gray codes. In summary, this thesis proposes three major approaches applied to the ordering of multiattribute data: a set of analytical encoding and decoding formulas for the two-dimensional Hilbert order, two new spatial orders with significant simple encoding and decoding formulas, and a set of optimization models for the range query problem. iv Contents Abstract 1 1 Table of Contents v List of Tables x List of Figures xii Acknowledgement xviii 1 Introduction 1 2 The Hilbert Order 6 2.1 Overview 6 2.2 Metric Analysis of the 2D Hilbert Order 10 2.2.1 Range Query 15 2.3 Encoding and Decoding the 2D Hilbert Order 22 2.3.1 Encoding 23 2.3.2 Decoding 30 v 2.3.3 Implementation and Execution-Time Assessment 33 2.4 Encoding and Decoding the 3D Hilbert Order 37 2.4.1 Encoding 38 2.4.2 Decoding 49 2.5 Alternative Patterns of the Hilbert Order 50 2.5.1 Configuration of Pattern L I 53 2.5.2 Configuration of Pattern L2 54 2.5.3 Configuration of Pattern L3 58 2.5.4 Configuration of Pattern L4 60 2.6 Summary ; 62 3 The U Order 64 3.1 Overview 64 3.2 Metric Analysis of the 2D U Order 65 3.2.1 Partial Match Query 65 3.2.2 Range Query 70 3.3 Neighbour Finding 73 3.4 Neighbour Proximity 80 3.5 Encoding and Decoding 84 3.6 The Analytic Formulas for the 2D U Order 87 3.7 Extension to Three Dimensions 90 3.8 The Analytic Formulas for the 3D U Order 94 3.9 The n-dimensional U Order 96 vi 3.9.1 Definition 97 3.9.2 Encoding and Decoding 101 3.10 Kronecker Algebraic Structures of 2D Spatial Orders 103 3.10.1 The Z Order 103 3.10.2 The U Order 105 3.10.3 The X Order 106 3.11 Discussion on Conventional Bit-Manipulation Techniques 109 3.12 Summary 116 4 The Q Orders 117 4.1 Overview 117 4.2 The Q l Order 118 4.2.1 Configuration 118 4.2.2 Metric Analysis of Range Query 122 4.3 The Q2 Order 125 4.3.1 Configuration 125 4.3.2 Metric Analysis of Range Query 126 4.4 The Q3 Order 132 4.4.1 Configuration 132 4.4.2 Metric Analysis of Range Query 135 4.5 The Q4 Order 139 4.5.1 Configuration 139 4.5.2 Metric Analysis of Partial Match Query 142 vii 4.5.3 Metric Analysis of Range Query 143 4.6 The Q5, Q6, and Q7 Orders 147 4.7 Neighbour Finding and Proximity 151 4.8 Encoding and Decoding the Q4 Order 152 4.8.1 Encoding 153 4.8.2 Decoding 158 4.9 Summary 159 5 Optimization of the Key Resolution for Range Query Retrieval 162 5.1 Overview 163 5.2 Introduction to Stochastic Programming 165 5.3 The Here-and-Now Stochastic Programming Model 168 5.4 The Wait-and-See Stochastic Programming Model 175 5.5 The Scenario Tracking Stochastic Programming Model 178 5.6 Optimization of the Section Capacity 179 5.7 Generalization to Uncertainty Models 182 5.8 Summary 182 6 Optimization on the Bucket Distribution 184 6.1 Overview 184 6.2 Gray Coordinate Lattice and Symmetric Hn 189 6.3 Summary 195 7 Conclusion 187 viii Bibliography 201 Appendix A Implementation of Stochastic Programming Software and Numerical Experiments 209 ix List of Tables 2.1 Expected Number of Clusters for Resolution r 12 2.2 Coordinates and H-Values at Level 1 40 2.3 Tables for Encoding the 3D Hilbert Order 47 3.1 Expectation for Partial Match Query 70 3.2 Expectation for 2 x 2 Range Query 73 3.3 Neiboughood of the 8-connected Template 77 3.4 Average Maximum Distance of Finding 8-Neighbors in a 2D Grid . . 77 3.5 Average Total Distance of Finding 8-Neighbors in a 2D Grid 78 3.6 Average Maximum Distance of Finding 26-Neighbors in a 3D Grid . . 80 3.7 Average Total Distance of Finding 26-Neighbors in a 3D Grid . . . . 81 3.8 Average Farthest Distance of the Neighbors in a 2D Grid 82 3.9 Average Farthest Distance of the Neighbors in a 3D Grid 84 3.10 Coordinates and u-values at Level 1 92 4.1 Expectation for Partial Match Query 144 4.2 Behavior of the Average Case 148 x 4.3 Average Maximum Distance of Finding 8-Neighbors in a 2D Grid . . 151 4.4 Average Total Distance of Finding 8-Neighbors in a 2D Grid 152 4.5 Average Farthest Distance of the Neighbors in a 2D Grid 153 4.6 Computational Costs for Resolution r 160 5.1 Coefficients a*,- 174 6.1 Number of Hamiltonian Cycles in Hypercubes 189 .1 Input Data Set 1: Frequencies of Fields to be Specified 212 .2 Input Data Set 2: Weights of Range Queries 214 .3 Optimal Values of the Single Scenario Model 214 xi List of Figures 2.1 The B Order of Resolution 2 7 2.2 The D Order of Resolution 2 7 2.3 The G Order of Resolution 2 7 2.4 The Hilbert Order of Resolution 2 8 2.5 The K Order of Resolution 2 8 2.6 The R Order of Resolution 2 8 2.7 The RP Order of Resolution 2 9 2.8 The S Order of Resolution 2 9 2.9 The Z Order of Resolution 2 9 2.10 The Hilbert Order of Resolution 1 12 2.11 The Hilbert Order of Resolution 2 12 2.12 The Hilbert Order of Resolution 3 13 2.13 Path of the Hilbert Order of Resolution 1 . . 16 2.14 Path of the Hilbert Order of Resolution 2 16 2.15 Path of the Hilbert Order of Resolution 3 . . . 17 2.16 Path of the Hilbert Order of Resolution 4 . 18 xii 2.17 Structure and Orientation of Level 1 25 2.18 Structure and Orientation of Level 2 25 2.19 Two Instances with Directional Vectors at Level 2 26 2.20 Indexed Directions 27 2.21 The Hilbert Order of Resolution 4 . 31 2.22 Comparison of Encoding Algorithms for the Hilbert Order 36 2.23 Comparison of Decoding Algorithms for the Hilbert Order 36 2.24 Structure and Orientation of Level 1 39 2.25 Structure and Orientation of Level 2 40 2.26 Two Instances with Directional Vectors at Level 2 43 2.27 Rotation from U2 to U4 44 2.28 Rotation from U7 to Uf 44 2.29 Indexed Directions 48 2.30 Moore's Pattern of Resolution 2 51 2.31 Moore's Pattern of Resolution 3 51 2.32 Moore's Pattern of Resolution 4 52 2.33 Indexing L Patterns of Resolution 1 53 2.34 Pattern L I of Resolution 2 54 2.35 Pattern L I of Resolution 3 54 2.36 Pattern L I of Resolution 4 55 2.37 Pattern L2 of Resolution 2 56 2.38 Pattern L2 of Resolution 3 56 xiii 2.39 Pattern L2 of Resolution 4 57 2.40 Pattern L3 of Resolution 2 58 2.41 Pattern L3 of Resolution 3 59 2.42 Pattern L3 of Resolution 4 59 2.43 Pattern L4 of Resolution 2 60 2.44 Pattern L4 of Resolution 3 61 2.45 Pattern L4 of Resolution 4 61 3.1 The U Order of Resolution 1 65 3.2 The U Order of Resolution 2 66 3.3 The U Order of Resolution 3 66 3.4 Path of the U Order of Resolution 1 67 3.5 Path of the U Order of Resolution 2 67 3.6 Path of the U Order of Resolution 3 . 68 3.7 Average Maximum Distance of Finding Neighbours in a 2D Lattice . 79 3.8 Average Total Distance of Finding Neighbours in a 2D Lattice . . . . 79 3.9 The R D Order of Resolution 2 80 3.10 Average Farthest Distance of the Neighbours in a 2D Lattice 83 3.11 Average Farthest Distance of the Neighbours in a 3D Lattice 83 3.12 The First Level 85 3.13 The Second Level . . . . 85 3.14 Execution-Time Assessments for Encoding 2D Orders 86 3.15 Execution-Time Assessments for Decoding 2D Orders 87 xiv 3.16 The Three-Dimensional U Order of Resolution 2 91 3.17 Structure and Orientation of Level 1 91 3.18 Structure and Orientation of Level 2 93 3.19 Execution-Time Assessments for Encoding 3D Orders 94 3.20 Execution-Time Assessments for Decoding 3D Orders 95 3.21 The Four-Dimensional Hypercube 98 3.22 The Four-Dimensional U order of Resolution 1 99 3.23 The Z Order of Resolution 3 104 3.24 The X Order of Resolution 1 . . 106 3.25 The X Order of Resolution 2 107 4.1 The Q l Order of Resolution 1 119 4.2 Path of the Q l Order of Resolution 1 119 4.3 The Q l Order of Resolution 2 120 4.4 Path of the Q l Order of Resolution 2 120 4.5 The Q l Order of Resolution 3 121 4.6 Path of the Q l Order of Resolution 3 121 4.7 The Q2 Order of Resolution 1 126 4.8 The Q2 Order of Resolution 2 126 4.9 The Q2 Order of Resolution 3 127 4.10 Path of the Q2 Order of Resolution 1 127 4.11 Path of the Q2 Order of Resolution 2 •. 128 4.12 Path of the Q2 Order of Resolution 3 128 xv 4.13 The Q3 Order of Resolution 1 132 4.14 The Q3 Order of Resolution 2 133 4.15 The Q3 Order of Resolution 3 133 4.16 Path of the Q3 Order of Resolution 1 134 4.17 Path of the Q3 Order of Resolution 2 134 4.18 Path of the Q3 Order of Resolution 3 134 4.19 The Q4 Order, of Resolution 1 139 4.20 The Q4 Order of Resolution 2 140 4.21 The Q4 Order of Resolution 3 140 4.22 Path of the Q4 Order of Resolution 1 141 4.23 Path of the Q4 Order of Resolution 2 141 4.24 Path of the Q4 Order of Resolution 3 141 4.25 Paths of the Q5, Q6 and Q7 Orders of Resolution 3 150 4.26 The First Level 154 4.27 The Second Level 154 4.28 Structure and Orientation of Level 1 155 4.29 Structure and Orientation of Level 2 156 4.30 Two Instances with Directional Vectors at Level 2 156 6.1 Symmetric # 4 • 1 9 0 6.2 Approximate Symmetric He 192 6.3 Approximate Symmetric H5 193 6.4 Approximate Symmetric #7 195 xvi The Structure of the Software . . 213 xvii Acknowledgement The author wishes to acknowledge gratefully the valuable advice and guidance of Dr. Gunther F . Schrack, who supervised this thesis throughout all stages. It is his research on quadtrees and the corresponding spatial order that inspired the author's initial interest in exploring a class of new hierarchical data structures. In the research progress, the author was always encouraged by Dr. Schrack's great patience and understanding. The author has also benefited by reading the up-to-date literature timely recommended by him. His meticulous scholarship on critically reading and discussing the penultimate draft of the thesis is especially appreciated. This research was, in part, supported by the Natural Sciences and Engineering Research Council (Research Grant 5-0148 to Dr. Schrack). Sincere thanks should also be extended to the committee for co-supervising the research. Dr. Mabo Ito provided valuable suggestions to the macrostructure of this research. The queries from Dr. Don Gillies and Dr. Alain Fournier helped the author to refine some important concepts. The author has also benefited from the discussion with Dr. Rabab Ward in preparing the departmental examination. In the early phase of this research, Dr. David Kirkpatrick inspired the author to investigate and choose the appropriate methodology applied to the proposed subject. At the final oral examination, several constructive suggestions were given by the Examination Committee. Especially, Dr. Jagadish carefully read the thesis and provided valuable comments. The author would like to express his sincere gratitude to all of the members of the Examination Committee. The wife and daughter of the author have shown unusual patience and under-standing during the period of this study. It is they who had to endure the prolonged absences and the chronic absent-mindedness. Without their support, it would have been impossible for the author to complete the research on schedule. The author would like to express his deepest gratitude to them. xviii Chapter 1 Introduction The multi-dimensional lattice system is a fundamental model in modern computer sciences, especially in information systems. A record with n attributes can be mapped by hash functions to a point in an n-dimensional integer space. Thus a lattice can be defined to be the rectangular region that contains a subset of such points. The requirement for ordering a set,of multi-dimensional data arises frequently from two applications: • Spatial Index Processing • Secondary Key Retrieval The first application requires developing a single numerical index on a one di-mensional line for each point in multi-dimensional space, such that the spatial lo-calizability can be preserved as best as possible. This work can be carried out by a mathematical transformation called spatial ordering. Usually, a parameter called resolution is used to describe the granularity of the domain of a spatial order. For 1 example, a spatial order is of resolution r if it is defined on a two-dimensional square image with 2 r x 2 r pixels. Interest in spatial orders has been high since the 1960s because of their broad applications to many fields in applied sciences. The most representative ones include Computer Graphics, Image Processing, Geographic Information Systems (GIS), Dig-ital Halftoning, Mathematical Programming, and Image Encryption. Our research concentrates on the application of spatial orders to information re-trieval systems. Faloutsos [17] showed that the Hilbert order exhibits satisfactory performance. His conclusion is supported by the mathematical analysis presented by Jagadish [31]. Recently, Abel and Mark [2] assessed empirically the relative merits of five orderings (row, row-prime, Hilbert, Morton and Gray-Morton) for four paradigmatic geographi-cal data processing tasks in spatial analysis. They concluded that the Hilbert ordering deserves further investigation. Several researchers have mentioned the lack of inexpensive encoding and decoding algorithms for the Hilbert ordering (e.g. [2], [10], [17], [18]). The second application, secondary key retrieval, involves two issues: determining the resolution for each dimension in the multi-dimensional lattice system and order-ing the data units on disks. Aho and Ullman [3] proposed a constrained nonlinear programming model applied to the partial match query. In a partial match query, a bit-string with &,• bits called the field signature, is generated by the hash function hi with the specified field as its argument. The number of bits in a field signature 2 is called field resolution and it determines the cardinality of the hashed space. For instance, we may have: 00 if X < "Cr" 01 if "G" <x< "JV" 10 if "N" <x< UT" 11 others In this example, the field resolution is 2 and the cardinality of the hashed space is 4. In Aho and Ullmans' model, the arguments are field resolutions and the objective function to be minimized is the expected value of data blocks that need to be retrieved from the secondary storage to answer a query. The model proposed by Aho and Ullman presents an elegant mathematical formu-lation for the partial match query problem. The basic formulation has been developed into several variants. However, most models proposed are limited to the application of the partial match problem. The literature is relatively silent on the range query retrieval problem. The performance of the secondary retrieval depends on the ordering of the data units on disks. Recently, Faloutsos [15] proposed a new approach for this issue. Using combinatorial analysis, he showed that the order determined by the Gray Code can significantly improve the cluster distribution of the records. The major open question that remains is the lack of symmetric Gray codes to reduce the bias introduced by the reflected Gray code. We are motivated by the above issues. 3 This thesis is organized as follows. The Hilbert order is analyzed in Chapter 2. A recursive approach is presented first to analyze a typical paradigm. The approach is analytical, thus can be applied to other spatial orders. Next, a set of analyt-ical formulas is derived to encode and decode the two-dimensional Hilbert order. The execution-time assessment suggests that they are faster than a look-up-table ap-proach published previously. Then, the encoding and decoding algorithms for the three-dimensional Hilbert order are discussed. A n approach based on the rotational matrices is presented. Since the encoding and decoding processes for high-dimensional Hilbert orders involve complicated operations, we propose a new spatial order, the U order, in Chapter 3. A n analysis shows that the U order is superior to most other or-ders. The encoding and decoding formulas are presented. These operations are much simpler than those of the Hilbert order. In Chapter 4, several new spatial orders with the quadrant-recursive structure are investigated. The goal is to find the candidates that are competitive with the Hilbert order in performance and with relative simple encoding and/or decoding formulas. In Chapter 5, stochastic programming models for optimizing the resolutions of the multi-dimensional lattice system are proposed. The analysis concentrates on the range query problem. Three types of stochastic pro-gramming models are presented. Four paradigms are discussed. One of them leads to a result that is similar to the one used to evaluate the performance of disk drives. In Chapter 6, the symmetric Gray codes are analyzed. A heuristic approach is presented to construct the symmetric Hamiltonian paths in the five-, six-, and seven-dimensional hypercubes. Finally, a conclusion is given in Chapter 7. The development and imple-4 mentation of the stochastic programming algorithms are described in Appendix A . Numerical experiments are also reported. 5 Chapter 2 The Hilbert Order 2.1 Overview A spatial ordering is a bijection relating a finite n-dimensional lattice to a finite set of natural numbers. Some well-known orders (shown with resolution r = 2), namely, B (Beyer) order, D (Cantor-diagonal) order, G (Gray) order, H (Hilbert) order, K (Knuth) order, R (Row) order, RP (Row-prime) order, S (Spiral) order, and Z (Morton) order are presented in Fig. 2.1 to Fig. 2.9. Interest in spatial orders has been high since the 1960s because of their broad applications to many fields in applied sciences. The most representative ones are: • Computer Graphics, Image Processing and Geographic Information Systems (GIS): The Z order corresponding to the quadtree and octree structures have attained high attention since 1970s. Samet has provided an excellent survey and presented detailed analyses in his two monographs ([60], [61]). 6 15 7 13 5 3 11 1 9 12 4 14 6 0 8 2 10 Figure 2.1: The B Order of Resolution 2 0 1 5 6 2 4 7 12 3 8 11 13 9 10 14 15 Figure 2.2: The D Order of Resolution 2 10 11 9 8 12 13 15 14 6 7 5 4 0 1 3 2 Figure 2.3: The G Order of Resolution 2 7 5 6 9 10 4 7 8 11 3 2 13 12 0 1 14 15 Figure 2.4: The Hilbert Order of Resolution 14 13 1 2 4 6 11 9 0 3 15 12 10 8 5 7 Figure 2.5: The K Order of Resolution 2 12 13 14 15 8 9 10 11 4 5 6 7 0 1 2 3 Figure 2.6: The R Order of Resolution 2 8 15 14 13 12 8, 9 10 11 7 6 5 4 0 1 2 3 Figure 2.7: The R P Order of Resolution 2 15 14 13 12 4 3 2 11 5 0 1 10 6 7 8 9 Figure 2.8: The S Order of Resolution 2 10 11 14 15 8 9 12 13 2 3 6 7 0 1 4 5 Figure 2.9: The Z Order of Resolution 2 9 • Digital Halftoning: Limb presented an Ordered Dither method in 1969 [37]. Beyer [6] proposed an optimal criterion based on the wavelength analysis of the dot patterns (the spatial order he designed is called B order in the foregoing). Knuth [36] claimed another class of spatial orders and initialized the research for Dot Diffusion (the spatial order he designed is called K order'm the foregoing). • Mathematical Programming [58]. • Image Encryption [8]. The literature shows that various spatial orders have found extensive applications in modern science and engineering. Our research will concentrate on the application of spatial orders to information retrieval systems. Due to its desirable property, the Hilbert order has attained intensive interest in the literature. In this chapter, a recursive approach is presented first to analyze a typical application. The result matches the one reported by the algorithmical method suggested in the literature. Next, a set of analytical formulas based on Boolean algebra is proposed to encode and decode the two-dimensional Hilbert order. Finally, the encoding and decoding algorithms for the three-dimensional Hilbert order are discussed. 2.2 Metric Analysis of the 2D Hilbert Order Orenstein [47] showed that the Z order is well suited for range queries. Faloutsos [17] studied the performance of the Hilbert order for range query and near neighbour 10 query problems. Evaluations of the performance of the Hilbert, Z, G, R and R P orders applied to spatial data systems were presented by Jagadish [31] and Abel [2]. The metrics they employed are based on the following rationale. A particular spatial order determines a linear traversal in the multi-dimensional space. It specifies the order with which the data blocks are allocated on disk. The appropriate measures of cost associated with a particular linear traversal are: 1. number of disk blocks fetched; 2. number of non-consecutive disk blocks fetched. It is preferred to fetch a set of consecutive disk blocks, which is called a "cluster" or "run", due to the seek time involved. Jagadish [31] reported a set of statistics (see Table 2.1), which are based on two types of templates: 1 x n region for partial match query, where n = 2 r , and 2 x 2 region for range query. Since it is preferred to process fewer clusters, Table 2.1 suggests that the Hilbert order exhibits good performance. In addition, since the Hilbert order exhausts a quadrant of a square before exiting it and does not include diagonal paths (like the Z order), the locality of the range query is well preserved. This provides, among others, the motivation to design more efficient encoding and decoding algorithms for the Hilbert order. Fig. 2.10 to Fig. 2.12 show the Hilbert order with resolution r — 1,2, and 3, respectively. It is believed that the range metric analysis is of central importance for under-standing the inherent property of the Hilbert order and its significance for applica-tions. However, a range metric analysis for this order is not presented in the literature 11 Order Type Partial Match Query Range Query R 2 '" 1 + 0.5 2 RP 2 - (2 r - l ) " 1 Z 1.5 x 2 r " x 2.625 G 2 r - i + 2 - r - 1 2.5 H 2 r _a + 2 _ r _ ! 2 Table 2.1: Expected Number of Clusters for Resolution r 1 2 0 3 Figure 2.10: The Hilbert Order of Resolution 1 5 6 9 10 4 7 8 11 CO 2 13 12 0 1 14 15 Figure 2.11: The Hilbert Order of Resolution 2 12 21 22 25 26 37 38 41 42 20 23 24 27 36 39 40 43 19 18 29 28 35 34 45 44 16 17 30 31 32 33 46 47 15 12 11 10 53 52 51 48 14 13 8 9 54 55 50 49 1 2 7 6 57 56 61 62 0 3 4 5 58 59 60 63 Figure 2.12: The Hilbert Order of Resolution 3 due to its complexity. For this reason, in this chapter we present a comprehensive recursive approach for the range metric analysis of the Hilbert order. This approach will also serve as a general procedure to conduct a metric analysis for other spatial orders proposed in Chapter 3 and 4. A record consisting of two attributes corresponds to a point in the two-dimensional Cartesian space and a collection of such records corresponds to a lattice. For conve-nience, in the following an equivalent system in which the points are represented by cells is considered. Without loss of generality, the discussion is applied to a square lattice system with 2 r x 2 r cells, where the parameter r is the resolution. In a range query, the ranges of the attributes are specified in a multiple attribute record. Geometrically, a range query to the two-attribute record system corresponds to a subset of rectangular shape in the two-dimensional lattice system. Given a lattice 13 with 2 r x 2 r cells, there is a large number of various rectangular subsets. Apparently it is impractical to conduct an exhaustive analytical investigation for all of them, thus a particular subset is to be selected as a representative to gain some mathematical insight. A set of terminology covering some key concepts in the metric analysis is defined here. Definition 2.2.1 A cluster is a subset of records with contiguous addresses in the one-dimensional space. Equivalently, it is a subset of cells with contiguous indices in the two-dimensional lattice. Such a cluster can be represented graphically with a path that links the cells concerned. Definition 2.2.2 A template is a square region with 2 x 2 cells. According to the number of clusters contained, there are four types of templates: Definition 2.2.3 A Ck template (k = 1,2,3,4) is a template in which the four cell indices form k clusters. For example, in Fig. 2.12, the template with indices 8, 9, 10, and 11 is a C\ template; the template with indices 5, 6, 57, and 58 is a C2 template; the template with indices 8, 11, 12, and 13 is also a Ci template; the template with indices 10, 31, 32, and 53 is a C 3 template; 14 the template with indices 6, 9, 54, and 57 is a C 4 template. In the rest of this chapter as well as Chapter 3 and 4, the range query associated with the template defined above will serve as a representative for the purpose of analytical analysis. The number of Cjt templates in a T x 2 r square image will be denoted J V « ( * = 1,2,3,4). A particular spatial order specifies a rule to allocate the records into a one-dimensional space. Consideration is to be given to the number of clusters that need to be retrieved in a query. A small number is preferred to a large number because of the access architecture of modern disk systems. The expected value of clusters is used as the metric to evaluate the performance. 2.2.1 Range Query A graphical representation for the Hilbert order is more convenient for a metric anal-ysis. The graphs of resolution 1,2,3 and 4 are depicted in Fig. 2.13 to Fig. 2.16, respectively. The analysis is presented as a series of theorems: Theorem 2.2.1 NM + NM = ( i | ) 4 r - | ( r > 2 ) [proof] It follows from Fig. 2.13 and Fig. 2.14 that = 1 and JV 2 ( 1 ) = 47V1(1) + 1 = 5. The latter is established due to the fact that, in addition to the four one-cluster tem-plates arising from replicating the Hilbert order with resolution 1, a new one-cluster 15 Figure 2.13: Path of the Hilbert Order of Resolution 1 Figure 2.14: Path of the Hilbert Order of Resolution 2 template with a continuous path is formed. In general, = 4 i V | i \ + 1 holds for k even and = AN^ holds for k odd (k = 2 ,3 , . . . , r ) , according to the structure of the Hilbert order. With the intermediate variable 2* = + (k = 1,2,.. . , r — 1), we have: Tk = ATk-i + 1 k = 2 , 3 , . . . , r - l 16 n n n L J L j i r J r _J L J L_ r 1 r n LJ r J L - l LJ n r J n 1 L L j i Figure 2.15: Path of the Hilbert Order of Resolution 3 By iteration we have: Tk = 4 ^ - ; + ^ ^ j = l , 2 , . . . , & - 2 The closed form is obtained by setting j to k — 2: 4*- 2 - 1 Tk = 4*- 2 T 2 + l ^ — -U 2 ; 3 Therefore, 19 1 r _ 1 W 3 Theorem 2 .2 .2 NW + NV = ( l l ) r - ( ^ (r > 2 ) [proof] According to the structure of the Hilbert order (the graphs with resolution up to 4 are presented in Fig. 2.13 to Fig. 2.16), a sequence can be established as 17 F=bdFJ-H 3 $ : -+-F— -=r-F— -++-R—=r-E=f l : F*=3-=J"t=-r ? 7 ± : zJ-t-tj==bJ~ F=r~T~ F—FT--b -F - - -++-Figure 2.16: Path of the Hilbert Order of Resolution 4 follows: TV?* = 0 JV 2 ( 2 ) = 47V?) + 3 *3(a) = 4/V<2) + 8 iV 4 ( 2 ) = 4/V<2> + 14 iV 5 ( 2 ) = 4 iVi2 ) + 30 / v i 2 ) = 4/V<2> + 58 where the first term of the right-hand side is according to the quadruple constructing process from level k — 1 to level k (k = 2 , 3 , . . . , r), while the second term comes from the contribution of the additional new templates generated by the cells in the single 18 row or column which is one of the sides of the lower level Hilbert order. The subset consisting of all of the new templates is shaped like a Greek Cross with width of two cells. With the intermediate variable Tk = JVf } + NJQi (k = 1,2,.. . , r - 1), we have: 7 i = 3 Tk = 42V_ 1 +2 f c - 1 & (k > 2) where b = y . Consequently, by iteration, Tk = 4jTk_j + 2k-1(2j - 1 ) 6 Let j = k — 2, then Therefore, Theorem 2.2.3 Tk = (y)4 f c -(j)2 f e jVW + j V r ( i \ = 4 1 - 1 - 2 r + 1 (r > 2) [proof] According to the structure of the Hilbert order a sequence can be established. With the intermediate variable Tk = + Nl% (k = 1,2,.. . , r - 1), we have: T x = 1 19 Tk = m~i + 2 f c + 1 - 3 (k > 2) Consequently, by iteration, Tk = AjTk-j + 2 f c + 1 ( 2 J ' - l ) - ( 4 j - l ) Let j = k — 2, then T f e = 4 f c - 2 f c + 1 + l Therefore, r r _i = 4 r - 1 - 2 r + i • Theorem 2.2.4 N^.+ NW = ( 1 ) 4 ' - ( |)JT + | ( r > 2 ) [proof] According to the structure of the Hilbert order a sequence can be established. With the intermediate variable Tk = NJ^ + Nl% (k = 1,2,..., r - 1), we have: Ti = 0 Tk = 4r fc_x + ( j ^ * - 1 - 4 (Jfc > 2) 20 By iteration, Tk = 4»rJk_i + ( | )2 f c - 1 (y - l ) - ( | ) (4 ' - - l ) Let j = k — 2, then Therefore, Corollary 2.2.1 When r is large, the expectation of the number of clusters per range query tends to: Rr = 2 [proof] There are (2k - l ) 2 templates in a lattice system with 2* x 2k cells (& = 1,2,.. . , r). Let Pi,P2>P3> a i *d p 4 denote the occurrence frequencies of the templates with one, two, three, and four clusters in the lattice of resolution r. Then we have: P l r 1 ^ ( 2r _ 1)2 + ( 2 r - l _ 1)2 19 60 /y(2> + A f f a P 2 riJS) (2r - 1)2 + (2r-l _ 1)2 17 40 jy(3) + 7Vr (!\ P 3 = r^o ( 2r - 1)2 + (2r-l _ 1)2 1 21 J V P ( 4 ) + w r (! }i P i = (2 r - l ) 2 + (2 r - 1 - l ) 2 7 120 Consequently, when r is large, the expectation of the number of clusters per range tends to: Rr = lp!+ 2p2 + 3p3 + 4p4 19 17 3 _7_ 60 + 20 + 5 + 30 = 2 • The statistics obtained above will be compared with other typical spatial orders ap-plied in data processing systems later. 2.3 Encoding and Decoding the 2D Hilbert Order A space-filling curve, defined on an integer lattice, defines a corresponding order, e.g., the Hilbert order. By numbering the points visited sequentially on the curve, location codes are defined. For plotting purposes, several sequential algorithms for the Hilbert order (and others) have been proposed for generating the coordinates in sequence ([24], [65]). For many applications in information retrieval systems, however, random-access algorithms are required, i.e., encoding and decoding without previous context or historical information. An encoding algorithm derives the location code for a given coordinate point, while a decoding algorithm derives the coordinates from a given location code. 22 It appears that only the algorithms proposed by Fisher [18] and Jagadish [31] are random-access algorithms. Fisher's algorithms are serial-bit processes and use two short look-up tables. The algorithms are stated in terms of the occam language. Jagadish only presents the encoding algorithm and also uses look-up tables. In this section, encoding and decoding algorithms are proposed which do not depend on look-up tables; they are stated in terms of an analytical formulation; in addition, they appear to be faster than Fisher's according to experimental evidence. 2.3.1 Encoding The Hilbert orders and curves of various resolutions have been depicted in Fig. 2.10 to Fig. 2.16. In this section, the iteration rules to construct the Hilbert order are described first. These rules establish the basis for deriving the analytic encoding and decoding formulas. The Hilbert order is to be established in a subset Z) = { 0 < x < 2 r , 0 < j / < 2 r } in the two-dimensional integer space. The parameter r is the specified resolution or level of the spatial order. Small values of r correspond to low resolution or low levels. Given the coordinates of a particular point P in the lattice system D, the corre-sponding value of P in the Hilbert order (H-value in brief) is to be determined. This process is called encoding. Let the coordinate values x and y be expressed by bit-strings, x = xr-\... xiXo and y = yr-\.. .yiyo, and the corresponding H-value be expressed in terms of quaternary digits (h-digits in brief), h = 4 r _ 1 x / i r _ a + . . . + 4 1 x fti + 4° x h0. At level i 23 (i = 1,2,.. . , r), given xr_,- and yr-i, the t-th most significant digit of the coordinates x and y, /tr_,-, the t-th most significant digit of the H-value is to be determined. Some terminology is required in the following. A unit is an item specifying the indexing strategy for four quadrants of a square region. It can be represented by a simple graph with three edges and four nodes labeled by indexes in terms of qua-ternary digits (n-digits in brief), as shown in Fig. 2.17. Two important features are associated with a unit: orientation and aspect. The orientation indicates the current rotation state of a unit, while the aspect specifies the relation between h-digits and n-digits. The nodes of a unit are successively indexed clockwise, but the sequence of h-digits may be either clockwise or counterclockwise. The orientation of the unit at the lowest level (called level 1, due to the correspondence to the resolution r = 1) is called the principal orientation of the Hilbert order. Both orientation and aspect determine the structure of a unit. However, the orientation is the key feature in the recursive construction process and its iteration can be isolated from the effect of the aspect. According to the inherent property of the Hilbert order, the units generated from node 0, 1, 2, 3 of the unit at level k (k = 1,2,.. . , r) always rotate 90 degrees clockwise, 0 degree, 0 degree, and 90 degrees counterclockwise, to the orientation of the unit at level k, respectively. Given a unit with a specific orientation, the aspect affects only the current digit of the H-value. The aspect of a unit generated from the nodes 0 and 3 of the unit at the lower level is always reversed, while a unit generated from node 1 and 2 is 24 unchanged. The analysis for the orientation will be developed in the two-dimensional Cartesian coordinate system. As a reference point, the coordinate origin is set at the bottom left of the lattice. The principal structure at level 1 is specified as shown in Fig. 2.17. This structure determines the overall appearance of the order considered. Level 2 is generated by constructing four units, labeled Uv (v = 0,1,2,3), from the corresponding indexed nodes of the unit at level 1 (see Fig. 2.18). The indexes of nodes are obtained by concatenating a quaternary digit to the index of the unit. Note the nodes in these units are still indexed progressively clockwise. y l 2 0 3 x Figure 2.17: Structure and Orientation of Level 1 Figure 2.18: Structure and Orientation of Level 2 The orientation of a unit is specified by a normalized vector m. According to 25 the specified arrangement of the Hilbert order relative to the coordinate system, m G { (0 ,1 ) T , ( - 1 , 0 ) T , ( 0 , - 1 ) T , (1,0) T) 1. Two units with their normal vector are demonstrated in Fig. 2.19. to, 1] -1, 0] U3 U l Figure 2.19: Two Instances with Directional Vectors at Level 2 The iteration of the orientation from the current level to the next level could be characterized by multiplying vector m by a two-dimensional rotation matrix. If the positive rotation direction is defined counterclockwise, then a rotation with angular shift 6 can be represented by: R(6) = cos(6) —sin(9) sin(6) cos(8) For instance, i?(90°) = 0 - 1 1 0 Since matrix multiplication may involve many operations, we propose an alternative approach called the directional index method. Based on this method, analytic formulas for encoding and decoding will be derived. 1 The superscript "T" means transposition. In the following, it wi l l be ignored for simplicity. 26 The idea of indexing the directions is suggested by the fact that the rotational freedom comprises only the four principal axis directions, namely, x,y,—x, and — y, thus they are sufficient to characterize a unit's orientation. Therefore, a relation between the directional vector m and an index v can be defined (see Fig. 2.20): vector m index v (0,1) T 0 (-i,o) r 1 ( 0 , - i f 2 (i,o)r 3 Since v has four possible values, it can be represented by two bits a and 6. Similarly, -x 2 -y Figure 2.20: Indexed Directions two bits n2fc_i and ri2k-2 are used to represent an njdigit. From here on, the following notation will be used for Boolean variables p and q: p and q is denoted as pq, p inclusive or q is denoted as p + q, p exclusive or q is denoted as p © q, 27 the complement of p is denoted as p. According to the inherent structure of the Hilbert order being discussed, n2jt_i and n2fc_2 can be expressed in terms of a,b,Xk-i and yk-i'-"2fc-i = abxk-i + abyk-\ + abxk-\ + abyk-\ = b(a®xk-i) + b(a®yk-1) n2k-2 = b® © yk-i k = l ,2 , . . . ,r . The iteration of v can be carried out by either multiplying or adding an incremental factor that is governed by the inherent structure of a particular Hilbert order. In the following the addition approach is chosen by reason that it appears to facilitate anal-ysis as well as implementation. Denote the incremental factor by dv, the transition rules may then be explicitly presented by two if clauses: if n-digit — 0 then dv = 3; elseif n-digit = 3 then dv = 1; else dv = 0; Since dv G {0,1,3}, two bits are sufficient to represent it. With the two bits denoted as c and d, the relation defined by the if clauses above may be written as a table Tl2k-2 c d 0 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 28 which in turn can be formulated analytically as two Boolean expressions: c = n2jfc-in2fc-2 (2.1) d = n2k-\ © n 2 F C _2 (2.2) The updated value v' of the directional index v is obtained by the addition: v' = (v + dv) mod 4 (2.3) Let v' be represented by two bits a' and 6', then Eq.(2.3) can be formulated as: a' = (bd)®a®c (2.4) b' = b®d (2.5) Substituting (2.1) and (2.2) into (2.4) and (2.5) yields a' = [b(n2k-i © n2k-2)] © a © (n 2 f c_in 2 f c_ 2) (2.6) b' = b © n 2 f c _! © n 2 f c _ 2 (2.7) In order to obtain the final expressions of a' and b' in terms of a, b,Xk-i, and yk-i, tedious algebraical calculations need to be performed which will not be presented here. The iteration process of the directional index i ; can be analytically expressed as as follows: a r _i = 0 K-i = 0 Oj_i = ajfaQyfi + bjXjyj + bj + Xj + yj 6 i - i = bj(a,- © xj) + bj (oj © yj) i = r - l , . . . , 2 , l . 29 To determine the H-value of a particular point with two integer coordinates x and y, the aspect of the units have to be taken into consideration. Let the two bits of an h-digit be denoted by h2k and h2k+i, respectively. Then one has h2k = xk®yk h2k+i = akhxk + akbkVk + akbkXk + akbkyk = bk(ak © xk) + bk(ok © yk) Jb = 0 ,1 , . . . , i — 1. The order of this algorithm is 31r. As an example, the Hilbert order of resolution r = 4 is generated as shown in Figure 2.21. 2.3.2 Decoding Decoding is the inverse process of encoding: given the H-value of a particular point, the corresponding coordinate pair (x, y) is to be determined. A n analytical formula for decoding can be derived by the same methodology as applied to encoding. Let an h-digit be represented by the two bits h2k-i and h2k-2, and an n-digit be represented by the two bits n2k-i and n2k-2- Then one has «2fc- i = b®h2k-\ (2.8) n2jt-2 = b®hik-2 (2.9) fc = l , 2 , . . . , r where a and b are the two bits of the directional index v. 30 85 86 89 90 101 102 105 106 149 150 153 154 165 166 169 170 83 82 93 92 99 98 109 108 147 146 157 156 163 162 173 172 80 81 94 95 96 97 110 111 144 145 158 159 160 161 174 175 79 76 75 74 117 116 115 112 143 140 139 "138 181 180 179 176 78 77 72 73 118 119 114 113 142 141 136 137 182 183 178 177 65 66 71 70 121 120 125 126 129 130 135 134 185 184 189 190 64 67 68 69 122 123 124 127 128 131 132 133 186 187 188 191 63 62 49 48 47 44 43 42 213 212 211 208 207 206 193 192 60 61 50 51 46 45 40 41 214 215 210 209 204 205 194 195 59 56 55 52 33 34 39 38 217 216 221 222 203 200 199 196 58 57 54 53 32 35 36 37 218 219 220 223 202 201 198 197 5 6 9 10 31 28 27 26 229 228 227 224 245 246 249 250 4 7 8 11 30 29 24 25 230 231 226 225 244 247 248 251 3 2 13 12 17 18 23 22 233 232 237 238 243 242 253 252 0 1 14 15 16 19 20 21 234 235 236 239 240 241 254 255 Figure 2.21: The Hilbert Order of Resolution 4 31 The transition rules for iterating the index v are the same as applied to the encoding process. Substituting (2.8) and (2.9) into (2.6) and (2.7) yields a' = 0 © (iffi kjM))] 0 a ffi ((5 © k 2t-i)(6 ffi ^ t-j)) V = 6 © ( ( 6 © / i 2 f c - i ) e ( 6 © / i 2 f c _ 2 ) ) Extensive operations in Boolean algebra need to be performed to obtain a simplified analytical formula set. The detailed derivation is not reproduced here. The result is a' = a © (Ji2k-\h2k^2) b' = b © h2k-i © hk-2 To determine the values of the coordinates x and y from the h-digits, the aspect of the units has to be considered. According to the relation between n-digit and h-digit (Eq. (2.8), (2.9)), Xjt_i = abh2k-i + ab(h2k-i © h2k-2) + abh2k-\ + ab(h2k-i © h2k-2) = (bh2k-2) © a © h2k-i yk-i = ab(h2k-i © h2k-2) + abh2k-i + ab(h2k-\ © h2k-2) + abh2k-\ = (b + h2k-2) © a © h2k-i 32 Finally, the decoding formula can be expressed as follows: Or- l bT-i = 0 = 0 = ajh2j + h2j(aj © h2j+i) bj-x = aj © {h2j + h2j+i) Tj®h2j®h2j+1 j = r - l , . . . , 2 , l . Vk The order of this algorithm is = (bkh2k) © ak © h2k+i = {h + h2k) © ak © h2k+i k = 0 , l , . . . , r - 1. 19r. 2.3.3 Implementation and Execution-Time Assessment The operands in the foregoing formulas are binary digits. Some languages, however, support the bit-wise logical operations and, or, exclusive or, and complement on an entire machine word. Therefore, two integers, heven and h0dd, are introduced. They comprise the bits at the even and odd digit positions of the location code, respectively. Thus, when interleaved bit-wise, they will yield the location code. Assuming that x, y, heven and h0dd are declared as integers, the encoding algorithm 33 can be implemented in the C language as follows: mask = (1 << res) — 1; heven = x A y; tempi = x Sz y; tempi = (~ x & ~ y) &; masA:; a = 0; b = 0; / o r (i = 0; i < res — 1; + + i) { a = ((a &; heven) | (6 &; iempl) | (~ 6 & (<emp2))) >> 1; 6 = (((6 & (a A ~ x)) | (~ b & (a A ~ j/))) & mask) » 1; } ZiocM = (~ b Sz (a " x)) | (b & (~ a * y)); The decoding algorithm becomes: mask = (1 << res) — 1; a = 0; 6 = 0; iemp = heven " hodd; for (i = 0; i < res — 1; + + i) { a = (a " (~ (heven \ hodd))) » 1; 6 = ((~ b & masA;) A temp) » 1; } x = (6 &; ~ heven) a hodd; y = (b | heven) a hodd; The proposed algorithms as well as the algorithms presented in [18] were implemented in the C language. For small values of the resolution, the overhead cost of the system is significant. Thus we choose r=7 as the starting point. The time is evaluated in normalized values. The base used is the average time for encoding (or decoding) a 34 point over 4 points. It should be mentioned that the performance of the algorithms proposed in this section and Fisher's algorithms differs only by a constant. A simple analysis for the implemented codes gives the following results: • Encoding: I our algorithm: 21r Fisher's algorithm: 23r Thus, the ratio of the constants is 23/21 « 1.095, while the experiment gives 1.2 approximately. • Decoding: our algorithm: 16r Fisher's algorithm: 23r Thus, the ratio of the constants is 23/16 = 1.4375, while the experiment gives 1.36 approximately. The time assessment employs a C library function "clock". Another C library function "times" is also used, and the simulation result is similar. It is not reported here to save space. The programs have been run on SPARC stations 1, 2, IPC, and IPX. The data reported are from IPX. The results of execution-time tests for encoding and decoding are shown in Figure 2.22 and Figure 2.23, respectively. 35 Figure 2.23: Comparison of Decoding Algorithms for the Hilbert Order 36 2.4 Encoding and Decoding the 3D Hilbert Order The representation of three-dimensional data is important because of its crucial func-tion in solid modeling in computer aided design and computer graphics ([22], [61]), as well as the reported applications to spatial database management and image pro-cessing ([2], [17]). We believe that it is worthwhile to develop algorithms transforming three-dimen-sional data to a one-dimensional ordering and vice versa. The significance of the three-dimensional Hilbert ordering is suggested by the active research in two dimensions reported in the literature ([2], [10], [17], [18], [24], [31], [39], [65], etc.). However, little has been published in the literature on three-dimensional orders. In this section, algorithms for the three-dimensional Hilbert order are discussed. A digit-wise approach which maps the coordinates (x,y,z) of a particular point to the corresponding value in the three-dimensional Hilbert ordering and vice versa is described. The discussion for encoding is presented first. A small set of terminology is defined at the beginning, followed by the specification of geometric primitives. The constructing rule is then explored by means of the top-down approach. As the consequence of such a hierarchical process, an iterative algorithm is designed. The decoding process is described in the next section. 37 2.4.1 Encoding The main purpose of this section is to describe the iteration rules applied to construct the Hilbert order. The Hilbert order is to be established in a subset D = {0 < x < 2 r ,0 < y < 2 r , 0 < z < 2 r} in the three-dimensional integer space. The top-down approach is to be applied to construct an encoding formula for the Hilbert order. The cubic lattice is partitioned and indexed similarly to the two-dimensional case. A unit is an item specifying the indexing strategy for eight octants of a cubic region. It can be represented by a set of 2 x 2 x 2 cells, or by a simple graph with seven edges and eight nodes, as shown in Fig. 2.24. The latter will be used in this section, since it appears more convenient to present the unit's two important features: orientation and aspect. The orientation specifies the current rotation state of a unit, which can be represented by two normalized vectors. Given a unit with a specified orientation, the aspect affects the current digit of the H-value only. According to the inherent property of the Hilbert order, it is observed that the aspect of a unit generated from the nodes 0, 1, 6 and 7 of the unit at the lower level is always reversed, while a unit from nodes 2, 3, 4 and 5 remains the same. Consequently, a Boolean variable ds can be employed as the iteration factor: ds = false if node = 0 or 1 or 6 or 7 then ds = true 38 where node is the node index determined at the last level. The analysis for the orientation is to be developed in the three-dimensional Carte-sian coordinate system. As a reference point, the coordinate origin is set at the front bottom left of the lattice concerned. The frame of the lowest level (called level 1, due to the correspondence to the resolution r = 1) is specified as shown in Fig. 2.24. This frame determines the overall appearance of the ordering considered. Analytically, the relation between the coordinates and H-values at level 1 is shown in Table 2.2. Figure 2.24: Structure and Orientation of Level 1 Level 2 of resolution r = 2 is generated by constructing eight units, labeled with Uk (k = 0,1,2,3,4,5,6, 7), from the corresponding indexed nodes in the unit at level 1 (see Fig. 2.25). The orientation of a unit is specified by two normalized vectors m and n, orthog-onal to each other. Two units with their normal vector are shown in Fig. 2.26. Note that the orientation of f/2 has been chosen as the principal orientation of the Hilbert order discussed. The iteration of the orientation from the current level to next level is characterized 39 xbit ybit zbit h digit 0 0 0 0 0 1 0 1 0 1 1 2 0 0 1 3 1 0 1 4 1 1 1 5 1 1 0 6 1 0 0 7 Table 2.2: Coordinates and H-Values at Level 1 U2 U5 U3 U4 • ui / 71 l > — V U7 ZTZ71 U6 / — 7 1 Figure 2.25: Structure and Orientation of Level 2 40 by a three-dimensional rotation matrix. On a specified plane, the positive rotation direction is defined counterclockwise. For instance, on the x-y plane, a rotation with angular shift 8 can be represented by: . coS{0) -sin{8) 0 R{0) = sin(6) cos(0) 0 0 0 1 Since there are eight rotation states for the units, eight rotation matrices are needed. These matrices can be indexed according to the node indexes at level 1. Thus, the normal vectors specifying the orientation of the unit Uk at level 2, can be obtained by multiplying matrix Rk2 to the normal vectors specifying the orientation of the unit at level 1. The first and second subscript specify the final and initial states of a unit in a transition, respectively. The second subscript (2) also indicates the fact that the orientation of unit at level 1 is the same as the unit 2 at level 2. These matrices are: 0 1 0 - 1 0 0 0 0 1 0 0 1 R\2 = 0 1 0 - 1 0 0 41 i?22 — -#32 = i?42 — i?52 — R&2 — R72 = 1 0 0 0 1 0 0 0 1 0 0 - 1 - 1 0 0 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 - 1 0 1 0 1 0 0 0 - 1 0 1 0 0 0 0 1 42 [0, 0, 1] to, 1, 0] l - l , 0, 0] m U2 tO, 0, 1] n U7 Figure 2.26: Two Instances with Directional Vectors at Level 2 The transition from the unit at level 1 to the units at level 2 then can be char-acterized by means of rotation matrices. For instance, unit U7 is obtained by the following two equations: mj = Pt72rn2 Tl-j = i?7 2 n 2 where m7 = [-1,0,0] T , n7 = [0,0,1] T, m 2 = [0,1,0]T, and n 2 = [0,0, l ] r . Although the transition from level 1 to level 2 can be perceived visually, the recursive generating rules have to be found for constructing the higher level frames. The non-commutativity of matrix multiplication will have to be taken into account. Geometrically, this property can be interpreted by the statement that two unit-pairs with the same relative rotation may not be characterized by the same rotation matrix. A n example illustrating this situation is shown in Fig. 2.27 and Fig. 2.28. 43 [0, 0, 1] n [0, 0, 1] [•0, 1, 0] m R42 / t l , / U2 U4 Figure 2.27: Rotation from U2 to U4 l-l. 0, 0] m [0, 0, 1] n U7 Rf7 to , 0, 1] / Uf Figure 2.28: Rotation from Ur to Uf 4 4 It can be shown that the matrix determining the rotation from U2 to U4 is 0 0 1 R 4 2 = 1 0 0 0 1 0 Now consider another initial state f/7. The final state Uj is to be reached such that the relative rotation between U7 and U/ is the same as the one between U2 and f/4. One might expect that the corresponding rotation matrix, Rf7, is the same as i? 4 . Unfortunately, this is not the case. It can be easily verified that -R/7 = 0 - 1 0 0 0 1 - 1 0 0 In general, the problem can be stated as follows: Given an initial state Ua and a matrix Rka that determines the rotation from Ua to Uk, the same relative rotation from another initial state Ub to the corresponding final state Uk> needs to be characterized. The solution is a compound matrix. By matrix algebra, one can prove the following equation: -1 Rk'b = RbaRkaRba (2.10) Eq. (2.10) is crucial to develop a robust iterative algorithm. The encoding algorithm for the three-dimensional Hilbert ordering would have an obscure appearance but for employing Eq. (2.10). There exist, however, some variant formulations of Eq. (2.10). They are mathematically equivalent, but may be of significant difference in operating 45 efficiency. It is not intended to discuss this topic here, since the main purpose of this section is to present a compact and, more importantly, comprehensible algorithm. To determine the H-value of a particular point with three integer coordinates x, y, and z, a set of tables are introduced (see 2.3). These tables are constructed according to the complete set of units comprising the frame concerned. At level 2, there are seven distinct units. The types of other units at the higher levels are specified by an auxiliary procedure. The verification follows the foregoing iterative rules. It has been found that there are fourteen new units at level 3 and three new ones at level 4. The total of twenty-four types is sufficient to comprise the ordering discussed this section. A n H-value is expressed in terms of octal digits, namely, h = 8 r _ 1 x hr-i + .. .+8 1 x h\ + 8° x h0. At level i (i = 1,2,..., r), given xr_,- and yr_,-, the i — th most significant digit of the coordinates x and y, ftr_,-, the i — th most significant digit of the H-value is determined from an appropriate table. The iterated directional vector m and n are used as the key to select the table needed. The tables are subscripted as follows. Six numerals 0,1,2,3,4 and 5 are assigned to the six axis-directions in the three-dimensional Cartesian space, i.e., directions of x,—z,y,—x,z, and —y, respectively (Fig. 2.29). For instance, the unit with vectors m and n aligning axes y and x is associated with the table with index 20. Then the table corresponding to the unit with its normal vectors aligning the directions with indexes j and k is denoted as table tjk- For instance, table t24 is associated the unit with vectors m and n aligning axes y and z, respectively. The tables follow: In software the vectors m and n are initialized to [0,1,0] and [0,0,1], 46 h - digit Tables Xbit Vbit t<)2 to4 tu * 1 3 * 1 5 * 2 3 <24 0 0 0 4 0 0 4 6 6 2 2 0 4 4 0 0 0 1 3 3 7 7 1 5 5 1 1 5 5 1 0 1 0 5 1 1 5 5 1 1 5 3 3 7 7 0 1 1 2 2 6 6 2 2 6 6 2 2 6 6 1 0 0 7 7 3 3 7 7 3 3 7 7 3 3 1 0 1 0 4 4 0 0 4 4 0 6 6 2 2 1 1 0 6 6 2 2 4 0 0 4 4 0 0 4 1 1 1 1 5 5 1 3 3 7 7 5 1 1 5 h - digit s Tables, continued Zbit Xbit Vbit <32 * 3 4 <35 ^42 * 4 3 * 4 5 ^ 5 0 * 5 4 0 0 0 2 6 6 5 0 0 4 4 6 2 2 6 0 0 1 5 5 1 6 7 3 3 7 7 3 3 7 0 1 0 3 7 7 4 3 7 7 3 5 5 1 1 0 1 1 4 4 0 7 4 4 0 0 4 4 0 0 1 0 0 1 1 5 2 1 1 5 5 1 1 5 5 1 0 1 6 2 2 1 6 2 2 6 0 0 4 4 1 1 0 0 0 4 3 2 6 6 2 2 6 6 2 1 1 1 7 3 3 0 5 5 1 1 3 7 7 3 Table 2.3: Tables for Encoding the 3D Hilbert Order 47 z 4 2 Z y - X 3 7 0 x 5 -y l -z Figure 2.29: Indexed Directions respectively, according to the orientation specified at level 1. A n example is provided here. Given the resolution r = 2, the H-value correspond-ing to the point with coordinates x = l , j / = l , z = 2 is to be determined. First, the integers in base 10 are converted to base 2, namely, (x, y, z) = (01,01,10). According to the initial state of vectors m and n, the table with index 24 is used. The most significant digits of x,y and z are extracted as key, i.e., (0,0,1). Thus from Table 24 the octal digit 3 is obtained. This digit implies that rotation matrix R32 needs to be used to update the directional vectors, m and n then become [0,0,1] and [—1,0,0], respectively. Consequently, the table with index 43 is visited at level 2. Since the key is (1,1,0), consisting of the second most significant digit of (x,y,z), the octal digit 0 is retrieved from Table 43. Therefore, the H-value in base 10 is 8 1 x 3 + 8° x 0 = 24. The encoding process for r > 2 will need to iterate eight rotation matrices at each level. At level k, among eight units j = 0 , . . . , 7, there is one and only one unit, say, Uk, generating eight child-units t7* + 1 , j = 0, . . . , 7 . The generating scheme is 48 inherited from that constructing Uk itself and its seven siblings from their parent, a unit at level k — 1. It is Eq. 2.10 that characterizes the heredity. We end this section by indicating that the octree, a data structure widely used in computer graphics, spatial data analysis, and image processing [61], may be regarded as a special case of the hierarchical structure described above, since no rotation is needed at the levels of the octree when generating child-units. 2.4.2 Decoding The methodology is the same as that used for encoding, i.e., the top-down approach. The decoding process begins at the most significant digit of the given H-value, denoted as for-1, in base 8 system. This digit specifies the octant to which the point belongs at level 1, and determines how to construct eight child-octants from the current octant. Then hT-i is used as the key to extract the most significant digits of x, y, and z from an appropriate table. Next, the directional vectors are updated by the appropriate rotation matrix determined by for_i. At level 2, the second most significant digit of the H-value, denoted as for-2? is used to extract the second most significant digits of x,y, and z. The visited table is determined by the updated directional vectors. The process is repeated until the least significant digits) of x,y, and z are reached. To illustrate the process, consider an example. Given the resolution r = 2, the coordinate triple (x, y, z) corresponding to the point with the H-value 38 is to be de-termined. First, the decimal 38 is converted to octal notation (hi h0) = 46. According to the initial states of vector m and n, the table with index 24 is visited with hi as 49 key. Thus the bit-triple (1,0,1) is obtained. The value of hi implies that rotation matrix R42 should be used to update the directional vectors, m and n then become [0,0,1] and [1,0,0], respectively. Consequently, the table with index 40 is visited at level 2. Since the key is h0 = 6, the bit-triple (0,1,1) is obtained. Therefore, in decimal form, x = 2 1 x 1+ 2° x 0 = 2, y = 2 1 x 0 + 2° x 1 = 1, z = 2 1 x 1+ 2° x 1 = 3. The tables used for decoding are the same ones used for encoding, except that the roles of the key and the contents are exchanged. For implementation, one may employ a small inverted list or, most simply, introduce two sets of tables. The table size for encoding is 2 x 2 x 2 and for decoding are 8 x 3 . The issue of software efficiency is not to be investigated here. In this section, a hierarchical approach to encode and decode the Hilbert ordering is described. The main purpose of this section is to present the iterative property of the three-dimensional Hilbert ordering. Future work will focus on designing efficient algorithms. 2.5 Alternative Patterns of the Hilbert Order The Hilbert order discussed above is constructed by setting the starting and ending points at the bottom left and the bottom right of the lattice, respectively. Alternative geometric patterns can be generated by rotating the square lattice 90°, 180°, or 270° counterclockwise in the Cartesian coordinate system. A l l of them can be regarded as instances of a topological class with the feature that the ordering starts at a corner and ends at an adjacent corner. The literature concerning the Hilbert order has almost 50 exclusively referred to this particular class. However, there exist other topological classes. Moore [44] reported such a class. The graphical representations of resolution 1 of Moore's pattern is the same as shown in Fig. 2.13. The graphical representations of resolution 2,3 and 4 are depicted in Fig. 2.30 to Fig. 2.32. Figure 2.30: Moore's Pattern of Resolution 2 r r "1 1 I L L J L J r J r J r M m L I. J L J r r L L •i • J r J r J r "I r i L L J 1 1 L J Figure 2.31: Moore's Pattern of Resolution 3 In this section, we present other four topological classes, named pattern LI, L2, L3, and L4, respectively. They have not been found in the literature. The configurations of these patterns can be analytically described in terms of complex algebra. Let C be the complex space. Consider the region C + = {z|~ft(z) > 51 -t--fcj--- J - t - - zJ-t - --t-- t J - R f i - - tzh-zh -T--F--p3"E=: -t-- t J~ =d-fc--pj—t—-tJ~=J--fc--+.J-- tzh-J--=r-F--b f - -Figure 2.32: Moore's Pattern of Resolution 4 0,^(2) > 0, z € C}. We will construct the frame of the orders in C + . Given resolution r, consider a square D with its lower-left corner at the origin and with side 2 r , partitioned into four quadrants and indexed by 0, 1, 2, and 3, respectively. The sequence 0, 1, 2, 3 also specifies L patterns of resolution 1 (Fig. 2.33). The reference path for establishing the configuration of L patterns of resolution 1 in D is the same as shown in Fig. 2.13. 52 2.5.1 Configuration of Pattern LI The generation of the configuration of pattern L I from resolution 1 to 2 is determined by the following transformations: T0z = 1(1+j-z) Tiz = \ti + z) T2z = \(l+j + z) Tzz = l + i(j-z) where j is the imaginary number y/—T and Tk is the transformation that maps D onto its quadrant k (k = 0,1,2,3). The generation of the configuration from resolution A: to A;+l (A: > 2) is determined by the following transformations: T0z = l-jz T2z = \(l+j + z) T3z = l + 1 2 0 3 Figure 2.33: Indexing L Patterns of Resolution 1 53 The graphical representations of pattern L I of resolution 2,3 and 4 are depicted in Fig. 2.34 to Fig. 2.36. Figure 2.34: Pattern L I of Resolution 2 n n n n L J L j r J _r-J~ I T L-j r -L- - J r i r 1 LJ LJ LJ LJ Figure 2.35: Pattern L I of Resolution 3 2.5.2 Configuration of Pattern L2 The generation of the configuration of pattern L2 from resolution 1 to 2 is determined by the following transformations: 54 R f H 77= -=r-F— R-F— R-F-h± F—F=r~ F—F=| R-=r-F- -R- F--F] =J"t--: R : : -zJ-t - -- J - t - -=J-t= : f l : Figure 2.36: Pattern L I of Resolution 4 ft* = + 7/2z = I[ i+j(2-*)] ft* = | ( i + j + *) The transformations of the configuration from resolution fctofc-fl (k > 2) are the same as for pattern L I . The graphical representations of pattern L2 for the resolutions 2,3 and 4 are depicted in Fig. 2.37 to Fig. 2.39. 55 Figure 2.37: Pattern L2 of Resolution 2 r — r 1 r L L J L J r J r J r • L L — I L J mm r I r fm J L J L my r r 1 m J L • J L Figure 2.38: Pattern L2 of Resolution 3 56 -J—b—--j - t - --tf -3- - Lzf-—J— tz-tf pi??: b - E - : =J-t-- = J"t-EfiJ" Efti-Ef=t3-Figure 2.39: Pattern L2 of Resolution 4 57 2.5.3 Configuration of Pattern L3 The generation of the configuration of pattern L3 from resolution 1 to 2 is determined by the following transformations: T0z Txz T2z T3z 1 ._ -jz T 1 = 2-0- + *) The transformations of the configuration from resolution k to k + 1 (k > 2) are the same as for pattern L I . The graphical representations of pattern L3 for the resolutions 2,3 and 4 are depicted in Fig. 2.40 to Fig. 2.42. Figure 2.40: Pattern L3 of Resolution 2 58 n L J r L n _J 1 r r —1 LJ r J - J n L -| r l 1 L 1 LJ LJ Figure 2.41: Pattern L3 of Resolution 3 R f i - --T--F---T--F--lf-=3- t--tJ-lf=3- E==tJ-F±ff b - E - : tJ"=J" bf=3-b=-tJ-F ^ F ? ± ? : tJ~=J" =J"t=~ p-f-t:-Eft3~ EfiJ Figure 2.42: Pattern L3 of Resolution 4 59 2.5.4 Configuration of Pattern L4 The generation of the configuration of pattern L4 from resolution 1 to 2 is determined by the following transformations: T0z = \{j + z) Tiz = |[l+i(* + l)] T,z = I[i+j(2-*)] Tsz = The transformations of the configuration from resolution k to k + 1 (k > 2) are the same as for pattern L I . The graphical representations of pattern L4 for the resolutions 2,3 and 4 are depicted in Fig. 2.43 to Fig. 2.45. Figure 2.43: Pattern L4 of Resolution 2 60 r — | r r L L J L J r J r J r m L L — i L J r r J L m J r J r L l L J Figure 2.44: Pattern L4 of Resolution 3 ;J--t---d - t - -- t f -3 - -b=~t3~ pj~t—- E i - j --=r-F--Figure 2.45: Pattern L4 of Resolution 4 61 2.6 Summary Diverse applications of the Hilbert order have been reported in the literature. Many researchers have mentioned the lack of inexpensive encoding and decoding algorithms. In this chapter, a recursive approach is presented first to analyze a typical application. The result matches the one obtained by the algorithmical method suggested in the literature. Our approach is analytical, thus can be applied to other spatial orders. Next, a set of analytical formulas based on Boolean algebra is proposed to encode and decode the two-dimensional Hilbert order. The derivation process is conducted by extensive Karnaugh-map operations. A n execution-time assessment suggests that they are faster than the ones published previously. Then, the encoding and decoding algorithms for the three-dimensional Hilbert order are discussed. A n approach based on rotational matrices is proposed. Since a large number of Boolean variables is involved, the Karnaugh-map technique is inappropriate to derive irreducible analytic formulas. Other systematic methods like Quine-McCluskey Method ([42], [51], [52]) could be employed, but may lead to complicated mathematical operations. Therefore, it appears that the matrix approach is an acceptable method. Extensive simulations have been carried out on a SPARC station. The rotational-matrix algorithms can also applied to other spatial orders such as the U order and the X order discussed in the following chapter. The two-dimensional Hilbert order may have alternative patterns. Moore [44], an American mathematician, reported such a pattern in 1900. In this chapter, we proposed other four patterns of the Hilbert order. To the author's best knowledge, 62 these four patterns have not been reported elsewhere. In a two-dimensional lattice system of resolution 2, there are in total six possibilities of the quadrant-exhaustive and rook-connected patterns. In this sense, the original pattern, the Moore's pattern, and patterns LI through L4 comprise a complete set of the Hilbert order. The appli-cability of these alternative patterns may be paradigm-dependent. For example, the Finite-Element-Method (FEM) in electromagnetic field analysis would benefit from a particular pattern, since there are various boundary conditions that may make sense to "wrap around" the finite grid in defining a region. For the issue on computational complexity, it is anticipated that these alternative patterns behave similarly since the structure difference among them lies only in the iteration from resolution 1 to resolution 2. In Chapter 4, these patterns will be compared with each other for some typical applications in spatial data processing systems. The encoding and decoding processes for n-dimensional Hilbert orders (n > 4) would be considerably complicated and place a heavy computational burden on the user. Therefore, in the next chapter, we propose a new spatial order called the U order. 63 Chapter 3 The U Order 3.1 Overview It has been shown that the Hilbert order is fairly sophisticated and that it is a non-trivial task to derive and implement encoding and decoding algorithms. In large data processing systems, the response time for an interactive query is an important per-formance index. A fast addressing technique to point to positions of data sections is desired. In this chapter, a new spatial order called the U order is proposed. Its major advantage is a significant simplicity of the encoding and decoding operations for multi-dimensional data. We will show that the U order is comparable with the Hilbert order and that it is superior to most other orders by four paradigms, namely, partial match query, range query, eight-connected neighbour-finding, and neighbour proximity evaluation. Then, the encoding and decoding algorithms are discussed. Next, the U order is extended to the n-dimensional space. A formal mathematical 64 description is given. A set of analytical encoding and decoding formulas is also pre-sented. Then, the Kronecker algebraic structures of the two-dimensional U order as well as two other orders are briefly described. Finally, we show that conventional bit-manipulation techniques, such as bit-interleaving, bit-centering, bit-concatenation, and bit-complementation, are not sufficient for encoding most applied spatial orders. 3.2 Metric Analysis of the 2D U Order In this section, the performance of the U order applied to partial match queries and range queries in a two-dimensional attribute space is analyzed. The expected value of clusters (defined in Section 2.2) is used as the metric to evaluate the performance. The two-dimensional U orders of resolution 1, 2, and 3 are shown in Fig. 3.1, Fig. 3.2, and Fig. 3.3, respectively. 3 2 0 1 Figure 3.1: The U Order of Resolution 1 3.2.1 Partial Match Query In a partial match query, one of the attributes is specified in a two-attribute record while the other is not. Geometrically, the set of the selected records corresponds to a line parallel to one of the coordinate axes in the two-dimensional Cartesian space. 65 15 14 11 10 12 13 8 9 3 2 7 6 0 1 4 5 Figure 3.2: The U Order of Resolution 2 63 62 59 58 47 46 43 42 60 61 56 57 44 45 40 41 51 50 55 54 35 34 39 38 48 49 52 53 32 33 36 37 15 14 11 10 31 30 27 26 12 13 8 9 28 29 24 25 3 2 7 6 19 18 23 22 0 1 4 5 16 17 20 21 Figure 3.3: The U Order of Resolution 3 66 A graphical representation for the U order is more amenable for a metric analysis. The graphs for the resolutions 1,2, and 3 are shown in Fig. 3.4 to Fig. 3.6, respectively. Figure 3.4: Path of the U Order of Resolution 1 1*i 14 11 10 12 13 8 9 3 2 7 6 0 1 4 5 Figure 3.5: Path of the U Order of Resolution 2 The results are collected in the following theorems: Theorem 3.2.1 The total number of clusters arising from the horizontal type of partial match query in the lattice with resolution r is given by: Hr = 2 2 r - 1 67 Figure 3.6: Path of the U Order of Resolution 3 [proof] Consider the U orders shown in Fig. 3.4, Fig. 3.5, or Fig. 3.6. A horizontal type of partial match query specifies a row of cells. There are 2 r _ 1 clusters in each row. Each cluster consists of two cells with contiguous indices. Therefore, the total 2 r rows give HT = T x 2 r _ 1 = 2 2 r _ 1 clusters. • Theorem 3.2.2 The total number of clusters arising from the vertical type of partial match query in the lattice with resolution r is given by: v - 2 2 r + 1 +1 V r ~ 3 [proof] It is observed from Fig. 3.4 that there are three clusters for r = l , one with two contiguously indexed cells and two with a single cell each. This leads to V\ = 3, where Vr stands for the total number of clusters by vertical selection, for the resolution r. The U order with r=2 is constructed by replicating the one with r = l . However, there are two clusters being combined into one (Fig. 3.5), hence V 2 = 414 — 1- In general, 14 = 4I4_i — 1 (k = 2 , 3 , . . . , r). Consequently, by iteration we have: 68 4 fe — 1 (* = 2,3 , . . , , r - l ) The closed form is obtained by setting k to r — 1: 4 r _ 1 — 1 3 2 2r+l + ! Corollary 3.2.1 The expectation of the number of clusters per partial match query is given by: 7 x 2 r 1 Pr = - 7 7 ^ + 12 6 x 2 r [proof] There are 2 r horizontal selections and 2 r vertical selections altogether. There-fore, the expectation of the number of clusters per query is: Hr + Vr Pr = 2r + 2 r 2 2 r - l + ( 2 2r+l + ^^3 2 r + l 7 x 2 r 1 + 12 6 x 2 r w 1.167 x 2 r _ 1 • The statistic obtained above is compared with those of two other typical spatial orders which are applied in data processing systems, namely, the Z order and Hilbert order [31] (see Table 3.1). It is observed that the U order offers a desirable compromise. In the next section the range query will be analyzed. 69 Spatial Order The Expectation of Cluster Number Per Query Z 1.5 x 2 r _ 1 u 1.167 x 2 ' - 1 H 1.0 x y-1 Table 3.1: Expectation for Partial Match Query 3.2.2 Range Query The terminology used in the following has been defined in Chapter 2. The properties concerned are presented as a series of theorems. Theorem 3.2.3 [proof] It follows from Fig. 3.4 and Fig. 3.5 that iV x ( 1 ) = 1 and iV 2 ( 1 ) = 4A^X(1) + 1 = 5, respectively. The latter is established due to the fact that, in addition to the four one-cluster templates arising from replicating the U order with resolution 1, a new one-cluster template with contiguous indices 6, 7, 8, and 9 is formed. In general, = 4iV^L\ + 1 {k = 2 , 3 , . . . , r). Consequently, by iteration we have: AT(D = A"N(]}k + ^Lzl fc = 2 , 3 , . . . , r - l The closed form is obtained by setting k to r — 1: O 70 3 Theorem 3.2.4 (2) 22T~l +1 _ x . r 3 [proof] It follows from Fig. 3.4 and Fig. 3.5 that N[2) = 0 and = 1, respectively. In general, N l 2 ) = 4 i v £ \ + 6fc (fc = 2,3,...,r) where the first term of the right-hand side derives from the quadruple construction process from level k — 1 to level fc, while the second term stems from the contribution of the additional new templates generated by the cells in the single row or column which is one of the sides of the lower level U order. The subset consisting of all of the new templates is shaped like a Greek Cross with width of two cells. A sequence can be established as follows: bh = 2 f c " 1 - l (fc = 2,3,...,r) Consequently, by iteration we have: N{2) = 4 * ^ W f c + 2 r - i ( 2 * _ i ) _ l l z J : fc = 2 , 3 , . . . , r - l o Then the theorem is proved by setting k to r — 1. • Theorem 3.2.5 7V(3) = 4 r " 1 - 2 r- 1 3 71 [proof] It follows from Fig. 3.4 and Fig. 3.5 that Nf} = 0 and Nf> = 3, respectively. In general, = 47Vf_ ) 1+c f c (fc = 2 ,3 , . . . , r ) To obtain c^, the Greek Cross region described in Theorem 3.2.4 is investigated again. A sequence can be derived as follows: ck = 2 f c~ 1 + l (fc = 2 ,3 , . . . , r ) Then the theorem is proved similarly to the proof of Theorem 3.2.4. • Theorem 3.2.6 ( 4 ) 2 » - » + 4 3 [proof] It follows from Fig. 3.4 and Fig. 3.5 that N[4) = 0 and _/V2(4) = 0, respectively. In general, JV<4) = AN^+dk (A: = 2 ,3 , . . . , r ) Once more, the Greek Cross region needs to be considered to deduce dk. It can be shown that: dk = 2 f c - 4 (fc = 2 ,3 , . . . , r ) from which the theorem follows similarly to the proof of Theorem 3.2.4. • Corollary 3.2.2 When r is large, the expected value of clusters per range tends to: * = I 2.333 72 Spatial Order The Expectation of Cluster Number Per Query Z 2.625 G 2.5 U 2.333 H 2.0 Table 3.2: Expectation for 2 x 2 Range Query [proof] There are (2 r — l ) 2 templates in a lattice system with 2 r x 2 r cells. Therefore, the expectation of the number of clusters per range query is: IN™ + 2N& + 3JVr<3> + 4JVr<4> Rr = l im = - • 3 (2* - iy The statistics obtained above are compared with those of the three other typical spatial orders applied in data processing systems, namely, the Z order, the G order, and the Hilbert order (see Table 3.2). The data of expected values for the Z, G, and Hilbert orders were initially reported in [31]. It is observed that the U order offers a desirable compromise. 3.3 Neighbour Finding In a 2D cellular representation system, the neighborhood of a cell is defined as a subset of cells that are 4-connected or 8-connected adjacent. Finding the neighbors 73 of a specified cell is an elementary operation in many applications. For instance, to enhance a 2 D image with random noise, it is a prerequisite to examine the attributes of the eight neighbors. Another example involves the evaluation of the gradient from the elevation values of the cells in the neighborhood. Given a spatial order, a traversal path is specified in the spatial domain and a sequential order of positions in which the cells reside is determined on the secondary storage. After the transformation, the locality established by a set of cells in the same neighborhood in the high-dimensional spatial domain is no longer preserved: the adjacent cells are scattered in the ID space. For a 2 D system of resolution r, the scattering can be characterized by the following function: F = j}d{i) i'=0 where d(i) is a metric scaling the linear span of the neighborhood of the cell with index i. Mathematically, any norm-like formulations can be employed for this purpose. For instance, we may have: d(i) = E = <M0 or d(i) = maximumj£N(i) I * ~ J I dmax(i) or d(i) = maximumjltj2eN(i) I jl - J 2 | d= d„(i) where N(i) stands for the neighborhood of the cell with index i and its configuration is application-dependent. In the following, we choose the 8-connected template and 74 use both dmax a n d d\ to evaluate the performance of the spatial orders concerned. For 8-connected applications, the neighborhood is defined as shown in Fig. 3.3, where " C " stands for the centroid, the position in which the current cell resides, and " N " means neighbor cells concerned, respectively. The metric dmax is the maximum linear span between the centroid and its adjacent cells. In the case of tape systems, the time for retriving the neighborhood depends directly on dmax. In the case of disk systems, dmax is also relevant to the retrieval time, plus the effect of track positions. Thus dmax can be chosen as an appropriate metric to assess the scattering degree. The metric d\ is the sum of the linear spans between the centroid and its adjacent cells. In real circumstances, the performance of the neighbor-finding paradigm is asso-ciated with both the block size in the secondary storage and the capacity of the main memory. However, for purposes of machine-independent analysis, it is reasonable to assume that each cell occupies one block. Then the metric dmax or d\ is measured in block units. We have designed and implemented a set of algorithms to conduct the perfor-mance evaluation for the U order, the Z order, and six patterns of the Hilbert order. The relative merits of these orders are assessed computationally for the 8-connected neighbor-finding problem over the range of resolution r from 2 through 8. The average values of dmax and d\ over the set of 2 r x 2 r cells are numerically reported in Table 3.4 and Table 3.5, respectively. Figure 3.7 and Figure 3.8 also graphically present the results of the Hilbert order (original pattern), the U order, and the Z orders. It is 75 concluded that the performance of the U order is between the Hilbert order and the Z order. Two other spatial orders with structures that are not quadrant-exhaustive, namely, the the R (Row) order and the R D (Row-Diagonal) order, have also been evaluated for comparison (Table 3.4 and Table 3.5). The Row order has been shown in Fig. 2.6. The R D order of resolution 2 is illustrated in Fig. 3.9. It appears that, for the 8-connected neighbour finding problems, both the R order and the R D order behave well. However, the R order presents significant asymmetry to horizontal and vertical directions and may introduce large variance to query retrieval applications. The R D order is symmetrical diagonally, but the average number of clusters in a 2 x 2 template tends to 3, which is higher 50% than the Hilbert order. In addition, neither the R order nor the R D order presents satisfactory results when applied to neighbour proximity problems (see Section 3.4). In the 3D cellular representation system, cells can be adjacent along a face, an edge, or a vertex. Since an cell has eight vertices, twelve edges, and six faces, the neighborhood of an interior cell can be defined as a subset of cells that are 26-connected adjacent. For the cells that are on the boundary of the system, if the wrap-around effect is not an issue, then their neighborhood cardinality is dependent on the particular location of the cell, but the configuration is similar. A set of algorithms has been designed and implemented to conduct the perfor-mance evaluation for the Hilbert order (original pattern), the Z order, and the U order. The relative merits of these orders are assessed computationally for the 26-76 N N N N C N N N N Table 3.3: Neiboughood of the 8-connected Template Resolution 2 3 4 5 6 7 8 R 4.88 8.97 16.99 33.00 65.00 129.00 257.00 R D 5.88 11.84 22.90 44.44 87.22 172.61 343.30 Z 5.25 12.25 27.16 57.96 120.70 247.41 501.93 U 5.50 13.38 30.53 66.29 139.34 287.03 584.40 H 5.75 14.03 32.14 69.68 146.07 300.17 609.68 M 6.13 14.00 32.13 69.66 146.06 300.17 609.68 L4 6.00 14.14 32.27 69.82 146.22 300.33 609.84 L2 5.88 14.28 32.41 69.98 146.39 300.50 610.01 L3 6.19 14.88 34.09 73.91 154.87 318.12 645.91 L I 6.50 15.44 35.72 77.81 163.35 335.74 681.82 Table 3.4: Average Maximum Distance of Finding 8-Neighbors in a 2D Grid 77 Resolution 2 3 4 5 6 7 8 R D 16.25 38.06 81.02 166.50 337.25 678.63 1360.13 Z 15.50 37.50 83.19 176.70 366.13 747.51 1513.61 R 16.50 40.25 88.13 184.06 376.03 760.02 1526.68 U 16.50 41.63 94.53 203.32 424.05 868.75 1761.06 H 16.50 42.38 97.34 210.59 440.40 903.35 1832.55 M 17.00 43.00 98.00 211.25 441.06 904.02 1833.21 L4 17.00 43.00 98.00 211.25 441.06 904.02 1833.21 L2 17.00 43.00 98.00 211.25 441.06 904.02 1833.22 L3 17.75 45.44 103.86 223.96 467.49 957.87 1941.93 L I 19.00 48.50 110.38 237.34 494.59 1012.40 2051.34 Table 3.5: Average Total Distance of Finding 8-Neighbors in a 2D Grid 78 79 9 12 14 15 5 8 11 13 2 4 7 10 0 1 3 6 Figure 3.9: The R D Order of Resolution 2 connected neighbor-finding problem over the range of resolution r from 1 through 6. The average values of dmax and d\ over the set of 2 r x 2 r x 2 r cells are numerically re-ported in Table 3.6 and Table 3.7, respectively. It is concluded that the performance of the U order is between the other two. Resolution 1 2 3 4 5 6 H 5.50 25.44 117.34 506.87 2113.42 8646.15 U 5.50 24.53 114.34 496.66 2079.95 8529.41 Z 5.50 23.88 108.86 467.60 1948.84 7981.19 Table 3.6: Average Maximum Distance of Finding 26-Neighbors in a 3D Grid 3.4 Neighbour Proximity The neighbor proximity problem, also called nearest neighbor problem, has been an active research topic in the field of computational geometry. Its solution process 80 Resolution 1 2 3 4 5 6 H 21.00 166.38 913.38 4266.69 18444.64 76693.16 U 21.00 164.50 900.44 4210.77 18224.71 75823.30 z 21.00 157.50 848.44 3945.45 17039.35 70839.63 Table 3.7: Average Total Distance of Finding 26-Neighbors in a 3D Grid serves as a fundamental operation for core modules in many geographical informa-tion systems ([9], [14]). The problem involves searching a subset of cells with the specified attributes such that the distance between the given subset and the current cell is a minimum. In the literature, various metrics have been proposed for defining the distance ([56], [63]). Of them, the City-block metric is one of the most widely used metrics. Based on it, the performance of the Hilbert order and the Z order of resolutions from 1 through 4 was reported in [17], where the authors evaluated the average farthest distance from each cell to a subset defined as the collection of cells such that the difference between their indices and the current cell's index is not greater than a specified threshold. The rationale of this metric is based on the fact that less cells need to be visited to find the nearest neighbors if the cells with similar indices are closer each other in the cellular system. We have designed and implemented a set of algorithms to evaluate the performance for the U order, Z order, and six patterns of the Hilbert order. Similarly to the model described in [17], 2 r _ 1 and the number in the particular order is chosen as the threshold and the index, respectively. The results for two dimensions over the set 81 of T x 2 r cells (r = 1,2,...,7) are numerically reported in Table 3.8. Figure 3.10 also graphically presents the results of the Hilbert (original pattern), the U order, and the Z orders. The results for three dimensions over the set of 2 r x 2T x 2 r cells (r = 1,2,... ,6) are numerically reported in Table 3.9 and graphically presented in Figure 3.11, respectively. It is clear that the U order behaves better than the Z order. Resolution 1 2 3 4 5 6 7 LI 1.00 2.00 3.25 4.88 7.54 10.82 16.04 L2 1.00 2.00 3.25 4.88 7.54 10.81 16.04 L3 1.00 2.00 3.27 4.89 7.53 10.81 16.04 L4 1.00 2.00 3.27 4.89 7.53 10.81 16.04 H 1.00 2.00 3.28 4.89 7.53 10.81 16.05 M 1.00 2.00 3.28 4.90 7.53 10.81 16.05 U 1.00 2.63 4.06 7.28 10.76 17.25 24.26 Z 1.50 2.75 4.84 7.91 12.15 18.48 27.01 R 1.50 3.13 6.19 12.22 24.23 48.24 96.25 R D 1.50 3.38 7.41 15.41 31.42 63.44 127.45 Table 3.8: Average Farthest Distance of the Neighbors in a 2D Grid We have shown that the U order exhibits satisfactory characteristics for several paradigms frequently arising from applications of spatial data processing. Thus the 82 30 2 5 2 0 -1 5 S CD § 1 0 2 order • • ' / U ord • sr /• • • J' y\y^ ^. „ < 5- ' " s* Horder 3 4 resolution > Figure 3.10: Average Farthest Distance of the Neighbours in a 2D Lattice Figure 3.11: Average Farthest Distance of the Neighbours in a 3D Lattice 83 Resolution 1 2 3 4 5 6 H 1.00 2.00 3.17 4.41 5.85 8.07 U 1.00 2.56 4.35 5.25 7.51 10.59 Z 2.00 3.31 5.10 7.03 9.32 12.39 R 2.00 3.97 7.23 13.36 25.43 49.36 Table 3.9: Average Farthest Distance of the Neighbors in a 3D Grid importance of developing efficient encoding and decoding algorithms is strongly im-plied. In the next section the encoding and decoding processes for the U order will be analyzed. It will be seen that neither encoding nor decoding involves any rota-tions and analytic transformation formulas can be easily derived. It is its remarkable simplicity and the developed efficient algorithms that render the U order competitive with other spatial orders. 3.5 Encoding and Decoding The main purpose of this section is to describe the iteration rules applied to construct the U order. These rules are important since they establish the basis for deriving the analytic encoding formula and designing efficient algorithms. The U ordering is to be established in a subset £) = { 0 < x < 2 r , 0 < t / < 2 r } i n the two-dimensional integer space. Similar to the Hilbert order, the analysis for the U order is based on the top-down 84 approach. First, the whole lattice is partitioned into four quadrants and indexed in base 4. Then each quadrant is recursively processed identically. The first and second levels are illustrated in Fig. 3.12 and Fig. 3.13, respectively. 3 2 0 1 Figure 3.12: The First Level 33 32 23 22 30 31 20 21 03 02 13 12 00 01 10 11 Figure 3.13: The Second Level The units of the U order can be defined similarly as the Hilbert order (see Sec-tion 2.3), but with simpler features. Instead of multiform states, now both the orien-tation and the aspect of a unit are invariable. For the specified U order considered in this section, the "mouth" of a unit is oriented to the negative x axis and the indexing order of the nodes is counterclockwise in the two dimensional Cartesian coordinate system (see Fig. 3.4 to Fig. 3.6). According to the inherent property of the U order, the recursive construction pro-cess does not involve any rotations. Consequently, the encoding algorithm designed 85 160 160 140 120 1100 i i 80 i : 60 40 20 H order U order 5.5 6 6.5 7 resolution > 7.5 Figure 3.14: Execution-Time Assessments for Encoding 2D Orders for the Hilbert order is applicable to the U order by deactivating the switching tests and the iteration of rotations. Indubitably, this simplicity considerably reduces the encoding time. The encoding algorithm is implemented similarly to that of the Hilbert order. A series of execution-time assessments for various resolutions has been conducted. The performance of the encoding algorithm has been compared with that of the Hilbert order; the results are depicted in Fig. 3.14. The discussion above is also applicable to the decoding process. The performance comparison between the U order and the Hilbert order is presented in Fig. 3.15. 86 180 160 140 A j 120 J 100 8 1 80 l 6 0 40 20 0* H order : U order; 5.5 6 6.5 resolution — 7.5 Figure 3.15: Execution-Time Assessments for Decoding 2D Orders 3.6 The Analytic Formulas for the 2D U Order In this section, we derive the analytical encoding and decoding formulas for the two-dimensional U order. The discussion is related to the U order shown in Fig. 3.1 to Fig. 3.6. The relation defined between the u-value and the (x, y) values in the U order can be expressed in a tabular form in terms of bits. Consider the case of r = 2 first, then x = 2lxx + 2 ° i 0 , y = 21y1 + 2°y0, and u = 2lux + 2°u0. The following table is associated to Fig. 3.2. 87 "3 u2 til UQ yi yo X l Xrj 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 0 1 1 0 0 0 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 0 1 0 1 0 1 1 1 1 With Boolean algebra, a set of analytic formulas can be derived from the relation defined in the table: „ _ „ u2 = xi © yi «i = yo UQ = x0 © t/o It follows the hierarchical structure of the U order that the u-values with resolution r are determined by u2k+i = Vk (3-1) u2k = xk@yk (3.2) fc = 0 , l , . . . , r - l . The order of this algorithm is 5r. 88 Equation (3.1) and (3.2) can be implemented conveniently in modern computer languages. For instance, one can write the following expressions in C: u0dd = y where u0dd and u e v e n are the words consisting of the bits at odd and even positions in the expression of a u-value in base 2, respectively. The final step is to interleave the two machine words u0dd and u e v e n to generate a single word u. One of the sim-plest approaches is to use a loop structure. However, the interleaving process can considerably sped up by employing well-designed data structures. Schrack [62] has designed a set of sophisticated interleaving algorithms and the numerical simulations demonstrate remarkable performance. The methodology for deriving the analytic decoding formula is the same as that used for encoding. Decoding is the inverse operation of encoding, namely, given the u-value of a particular point, the corresponding coordinate pair (x,y) is to be determined. For a given resolution r, the decoding formulas become Xk = U2k+i (3.3) yk = u2k®U2k+i (3-4) k = 0 , 1 , . . . , r - 1. The order of this algorithm is also 5r. Just as the Delta Modulation (DM) is comparable with the Pulse Code Modulation (PCM) in digital communication, the principal advantage of the U order over the Hilbert order is the simplicity of its implementation. Later we will show that this 89 simplicity holds in n-dimensional space and it is in the form of a prefix calculation, a typical structure widely applied in parallel processing. 3.7 Extension to Three Dimensions The significance of the two-dimensional U order is suggested by the satisfactory per-formance revealed in the foregoing section. The discussion presented in this section extends the analysis from two dimensions to three dimensions. The encoding and decoding algorithms are developed parallelly to those for the Hilbert order. The U order is to be established in a subset £> = { 0 < x < 2 r , 0 < 2 / < 2 r , 0 < z < 2 r} in the three-dimensional integer space. Fig. 3.16 shows the U order of resolution 2. The terminology to be used below has been defined in Section 2.3. A unit of the three-dimensional U order can be represented by a set of 2 x 2 x 2 cells, or by a simple graph with seven edges and eight nodes, as shown in Fig. 3.17. The latter will be used in this section, since it appears more convenient to present the order's hierarchical features. The analysis for the orientation is to be developed in the three-dimensional Carte-sian coordinate system. As a reference point, the coordinate origin is set at the front bottom left of the lattice concerned. The frame of the highest level (called level 1, at resolution r = 1) is specified as shown in Fig. 3.17. This frame determines the overall appearance of the ordering considered. Analytically, the relation between the coordinates and u-values at level 1 is shown in Table 3.10. 90 U3 24 31 U2 0 16 23 U5 Z7I 40 47 32 39 uo U l / a. 8 15 U6 U7 0 48 55 56 63 Figure 3.16: The Three-Dimensional U Order of Resolution 2 Figure 3.17: Structure and Orientation of Level 1 91 xbit ybit zbit udigit 0 0 0 0 0 1 0 1 0 1 1 2 0 0 1 3 1 0 1 4 1 1 1 5 1 1 0 6 1 0 0 7 Table 3.10: Coordinates and u-values at Level 1 Level 2 of resolution r = 2 is generated by constructing eight units, labeled [4 (k — 0,1,2,3,4,5,6,7), from the corresponding indexed nodes in the unit at level 1 (see Fig. 3.18). Again, according to the inherent property of the U order, the recursive construc-tion process does not involve rotations. Therefore, the encoding algorithm designed for the Hilbert order is applicable to the U order by deactivating the switching tests and the iteration of rotation matrices. This simplicity considerably reduces the en-coding and decoding time. The encoding algorithm is implemented corresponding to the Hilbert order. A series of execution-time assessments for various resolutions has been conducted. The performance of the encoding algorithm has been compared with that of the Hilbert order, the results are shown in Fig. 3.19. The discussion above applies as well to the decoding process. The performance 92 T J 3 2z U2 U4 05 IE ui U6 UO U7 Figure 3.18: Structure and Orientation of Level 2 93 700 • 600 -| 500 • § 4 0 0 -"g 300 • ^ 2 0 0 -100-0*-3 H order 3.5 U order -* i— 4 4.5 resolution — 5.5 Figure 3.19: Execution-Time Assessments for Encoding 3D Orders comparison between the U order and the Hilbert order is presented in Fig. 3.20. 3.8 The Analytic Formulas for the 3D U Order Basically the encoding and decoding processes discussed in the preceding section can be regarded as a look-up-table approach. However, unlike the Hilbert order, a single look-up-table is sufficient for a specified U order. The following discussion is related to the U orders shown in Fig. 3.17 to Fig. 3.18. The analytic formulas can be derived much as for the case of two dimensions. First, the relation defined between the u-value and the (x,y,z) values in the U order is expressed in a tabular form in terms of bits. Then employing Boolean algebra, 94 700 600 A 500 § 4 0 0 '43 i • "8300 §200 100 0* : ) order _H : U order 3.5 4 4.5 resolution > 5.5 Figure 3.20: Execution-Time Assessments for Decoding 3D Orders the analytic formulas are derived from the table. To save space, only the result is presented here without listing the derivation details: U3k+2 = xk (3.5) «3fc+i = xk © zk (3.6) U3k = xk © yk © zk (3.7) Jb = 0 , l , . . . , r - 1 Equations 3.6 and 3.7 can be conveniently implemented in the C language. For instance, ua = x * y * z Ub = x A z Uc = X where ua and uj, are the machine words consisting of the bits at the 3fc-th and the 95 (3k + l)-th digit positions (k = 0 ,1 , . . . , r — 1), respectively. The final step is to interleave three words ua,ut, and uc to generate a single word it. The methodology for deriving the analytic decoding formula is the same as that used for encoding. Decoding is the inverse operation of encoding, namely, given the u-value of a particular point, the corresponding coordinate triple (x,y,z) is to be determined. Given resolution r, the decoding formulas are written as follows: Xk = "3*4-2 (3.8) Vk = «3fc © «3fc+i (3.9) Zk = «3fe+l©«3fc+2 (3.10) k = 0,1,. . . , r - 1 It is the simplicity inherent in the structure and the developed efficient interleaving algorithms that make the U order competitive with other spatial orders, especially for the high-dimensional cases. 3.9 The n-dimensional U Order The high-dimensional lattice system is a fundamental model in modern computer sciences, especially in database systems. A record with n attributes can be mapped to a point in n-dimensional integer space Zn. Thus a lattice can be defined to be the rectangular region that contains a subset of such points. The cardinality of the i-th. attribute is the resolution of the i-th dimension of the lattice (i = 1,. . . , n). The significance of the high-dimensional U order is suggested by the satisfied per-96 formance revealed in the proceeding sections. In this section, a formal mathematical definition is given to the n-dimensional U order. Then both the encoding and the decoding formulas are presented. 3.9.1 Definition Some terminology needs to be introduced before presenting a mathematical defini-tion for the high-dimensional U order. These concepts can be found in most college textbooks on graph theory (e.g., [27]). A graph G is a set (V,E), where V is a finite set and E is a binary relation on V. If it is needed from the context, the sets V and E are denoted V(G) and E(G), respectively, to stress the correspondence to a particular graph G. The elements of V are called nodes. The cardinality of a node is not necessarily unity. A node whose cardinality is more than one is called a complex node, while when its cardinality is one it is called a unitary node. The elements of E are called edges. The product G i x G2 of two graphs G\ and G2 is the graph with node set Vi x ^ , where the cross means Cartesian product, such that two nodes (<*i, a2) and (61, b2) of G\ x G2 are adjacent if and only if either ai = bi and a2b2 6 E(G2) or a2 = b2 and axbi 6 E{G\). A complete graph is a graph in which there exists an edge for every pair of nodes. The complete graph with two nodes is denoted K2. The n-dimensional hypercube Qn is defined recursively as follows: Q1 = K2 Qn = Q 1 x Qn~l 97 (n > 2) Thus Qn is a graph G = (V, E) with 2 n nodes and n 2 n _ 1 edges. Qn can be labeled or unlabeled. In the former case, each vertex is labeled with a bit string (6162 • • • &n)? where € {0,1}. Two vertices of Qn are adjacent if and only if their labels differ in exactly one bit. A Hamiltonian path is a path passing through every node exactly once. Fig. 3.21 shows a four-dimensional hypercube. Let Qn(V) be the n-dimensional cube with the elements in the set V as its nodes, then the Hamiltonian path in Qn(V) is denoted H(Qn,V). Let Vi be the set of unitary nodes. With the concepts introduced above, the U order can be recursively defined as follows: Definition 8.9.1 The n-dimensional U order of resolution 1 is the order determined by sequentially indexing the nodes of H(Qn, Vi); Let VT = {Hj(Qn, Vr-i),j = 1,2,... , 2 n } . Then the n-dimensional TJ order of resolu-tion r is the order determined by sequentially indexing the nodes of H(Qn,VT) and Figure 3.21: The Four-Dimensional Hypercube 98 incorporating the orders defined for resolution r — 1. As an instance, the four-dimensional U order of resolution 1 and H(Q4,Vi) is shown in Fig. 3.22. 7 Figure 3.22: The Four-Dimensional U order of Resolution 1 One of the interesting properties of high-dimensional U orders is the anti-Gray property. By anti-Gray it is meant that the large values of the Hamming distance between adjacent lines in a code are obtained. The following anti-Gray code was 99 proposed by Hamming [26]: 93 92 9i 9o 0 0 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 1 0 1 1 1 1 0 0 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 1 Let d(k) = row(ifc)© row(fc + 1) (k = 1,2,..., 16 (modl6).), then the anti-Gray code gives: £ j t t i 4 = 56. On the other hand, an instance of the four-dimensional U order can be written as 100 follows: W 3 U2 U\ U0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 0 1 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 1 0 1 1 0 1 0 1 1 0 1 0 It gives £fc=i dk = 54. The anti-Gray type codes can be applied to design multiple disk allocation for partial match queries ([1], [16]). 3.9.2 Encoding and Decoding In this section, the analytic encoding and decoding formulas of the n-dimensional U order are presented. For encoding, both serial and parallel versions are given. Consider a point p in the n-dimensional integer space. Let the coordinates of p be (xi, X 2 , . . . , x n). The bits of Xj (j = 1,2, ..,n) are expressed as (xj ) r_iXj, r_2 . • • £j,o)> 101 where r is the specified resolution. The bits of u-values are expressed as (u n ) I . _ iu n i r _ 2 . . . u Then the serial encoding formula of the U order can be written as follows: "n.Jfc+n-l = Sl.fc Un,k+n-j = "n.Jfc+n-j+l © Xj.fc j = 2 , 3 , . . . , n , fc = 0 , l , . . . , r - l The parallel encoding formula can be expressed as follows: Un,k+n-l = #1,* Un,k+n-j = Xl,k © X2,k • • • © Xj,k j = 2 , 3 , . . . , n , fc = 0 , l , . . . , r - l The decoding formula can be expressed as follows: Xn,k = •Un-l+Jt.n Xn-j,k ~ Un-j+k,n © ^n-j-l+k,n j = 1,2, . . . ,n - 1, = 0 , l , . . . , r - 1 It is important that the presented encoding formula of the U order is in the form of a prefix calculation, a typical structure widely applied in parallel processing [11]. The complexity of many primitive tasks, such as binary addition, list ranking, etc., can be effectively reduced by introducing the prefix calculation. We also showed the "anti-Gray" property of the U order. A l l of these characteristics suggest that the potential of the U order is not limited to the application of spatial data processing only. 102 3.10 Kronecker Algebraic Structures of 2D Spa-tial Orders 3.10.1 The Z Order The Z order has been used widely in image processing and computer graphics because it corresponds to the quadtree structure. The location code of the Z order for r > 2 can be simply constructed by interleaving the bits of the y coordinate with the bits of the x coordinates: TVf > = wTb where w = [ 2 2 r - 1 , 2 2 r - 2 , . . . , 2 1 , 2 ° ] T b = [r/ r_i,x r_i,... ,yo,x0]T Fig. 3.23 shows the Z order for r = 3. In this section, we propose an alternative formulation by using Kronecker algebra ([25]). Define 2 3 0 1 h = 1 1 1 1 103 42 43 46 47 58 59 62 63 40 41 44 45 56 57 60 61 34 35 38 39 50 51 54 55 32 33 36 37 48 49 52 53 10 11 14 15 26 27 30 31 8 9 12 13 24 25 28 29 2 3 6 7 18 19 22 23 0 1 4 5 16 17 20 21 Figure 3.23: The Z Order of Resolution 3 Then the Kronecker Products of Z\ and I\, and of I\ and Z\ are M = Z i ® J i 2 2 3 3 2 2 3 3 0 0 1 1 0 0 1 1 N = h®Zy 2 3 2 3 0 1 0 1 2 3 2 3 0 1 0 1 104 The matrix of the Z order for r = 2 can be obtained by concatenating M and N entry by entry: It can be shown that M 0 N 22 23 32 33 20 21 30 31 02 03 12 13 00 01 10 11 10 11 14 15 8 9 12 13 2 3 6 7 0 1 4 5 (basei) -I (fcaaelO) Z3 = [Z2®h)(d{h®Zx) The Z order of higher resolution can be constructed recursively. 3.10.2 The U Order In this section, an alternative construction method based on the Kronecker algebra for the two-dimensional U order is presented to demonstrate its recursive property. The U order for r = 1 and r = 2 can be expressed by the matrices: 3 2 Ui = 0 1 105 15 14 11 10 12 13 8 9 3 2 7 6 0 1 4 5 It can be shown that U2 = ( t f i ® / i ) 0 ( / i ® C / i ) UT (r > 2) can be constructed recursively. 3.10.3 The X Order The Z order corresponds to the quadtree structure. It is recursive, or, equivalently, exhausts a quadrant or sub-quadrant of a square before exiting it. However, besides the Z order, the U order and the X order can also represent the quadtree structure. More importantly, only these three orders can represent quad-tree structures in the sense of topology. The former has been discussed. The latter is briefly discussed in this section. The X orders of resolution 1 and 2 are shown in Fig. 3.24 and Fig. 3.25, respectively. 2 1 0 3 Figure 3.24: The X Order of Resolution 1 106 10 9 6 5 8 11 4 7 2 1 14 13 0 3 12 15 Figure 3.25: The X Order of Resolution 2 The X order for r = 1 and r = 2 can be expressed by the matrices: X2 = 2 1 0 3 10 9 6 5 8 11 4 7 2 1 14 13 0 3 12 15 It can be shown that X2 = (Xi ® h) <2) {h ® Xi) Xr (r > 2) can be constructed recursively in this fashion. It is interesting to define the dual of the X order as: X2 = ( / l ® * l ) 0 107 22 12 21 11 02 32 01 31 20 10 23 13 00 30 03 33 10 6 9 5 2 14 1 13 8 4 11 7 0 12 3 15 (6ase4) (6aselO) X.2 happens to represent a typical spatial order applied to digital halftoning. In digital halftoning, three approaches are well known — ordered dithering [32], dot diffusion [36], and error diffusion [21]. Among them, the first two approaches involve using spatial orders, which will be called the B order and the K order. The B order for r = 1 and r = 2 can be represented by the matrices [32]: N? = i42) = 0 2 3 1 0 8 2 10 12 4 14 6 3 11 1 9 15 7 13 5 108 In terms of the Kronecker operators, it can be shown that Comparing this formulation with that of X2 (Section 3.10.3), we find that hap-pens to be the clockwise rotated version of X2\ This observation enhances the belief that there is an inherent relation between the X order and the B order. The analytical encoding and decoding formulas for the n-dimensional X order can be constructed similarly to those for the U order. The details will not be discussed further. 3.11 Discussion on Conventional Bit-Manipulation Techniques In this section, we will show that bit-interleaving, bit-centering, bit-concatenating, and bit-complementing are not sufficient for encoding the Hilbert, U , and X orders. Bit-interleaving is a method to map a spatial point in the Cartesian coordinate system into a one-dimensional sequence. Usually, the desired one-dimensional se-quence consists of non-negative integers, no matter what number base is used. The elements in this sequence are called location codes. Consider a point (a;, y) in the two-dimensional Cartesian coordinate system. Its coordinates can be expressed with base 2 as x r _ i x r _ 2 . . . X i x 0 and yT-\yT-2 • • • J/iJ/o, re-spectively, where r is the resolution of the image, Xj,t/j € {0,1}, and i = 0 ,1 , . . ; , r — 1. By inserting the bit sequence y,- into x,-, we have y r _ i x r _ i . . . yox0, called 1-interleaving, 109 or j / r _ 1 j / r _ 2 x r - i a ; r - 2 • • • y i y o X i X 0 , called 2-interleaving, etc. In the following, 1-interleaving will be simply called "interleaving" since it is the most popular technique to construct location codes. It is not possible to derive formulas to generate the location codes of the H , U , and X orders by using the interleaving technique only. Even other bit-manipulating techniques, such as centering, concatenation or complementation, are not sufficient for this purpose. Consider the case of r = 2 to introduce the concepts of new bit-manipulating techniques. Given a spatial point (x,y) in the two-dimensional Cartesian coordinate system, a four bit set, viz., (x iXo , yiyo) needs to be considered. The total number of their permutations is 4! or 24. The 24 instances can be partitioned to form three classes, which will be named the interleaving class, the concatenating class and the centering class. The three classes and their instances are shown below: Interleaving: yiXiyoXo x i j / i x 0 y o yiXQy0Xi X iyoZoyi yoxxyxxo x 0 y i X i y 0 yoxoyix-i x 0 y o Z i y i Concatenating: ViVoXiXo xixoyij/o ViVaXoXx x ix 0 y 0 y i yoV\X\Xo XrjXiyiJ/o yoyixoxi xoxij /oyi 110 Centering: yiXix0y0 iij/ij/o^o yix0xiy0 xiy0yix0 j/oxixoj/i x0yiyoX! yox0xiyi x0yoy\Xi By complementing one or more bits, more spatial orders can be constructed. Since we are considering the four-bit set xi,Xrj, y\,yo, there are 2 4 possible bit-complement operation for each instance out of the 4! permutations. Consequently, the total number of spatial orders is 4! x 2 4 = 384. It is interesting that the property discussed above is just the one of the hyperoctahedral group 0„ of order 2 n n! [23] for n — 4. The typical symmetry operation of 0 n consists of one of the n\ possible permutations of the coordinates of the n-tuple, followed by one of the 2 n possible complementations. Note that so far we have implicitly used the weight set { 2 3 , 2 2 , 2 1 , 2 ° } . This is just one instance of an integer subset {u>3, w2, u>i, WQ}. In general, we have: Defini t ion 3.11.1 For resolution r = 2, an M-code is the location code formulated as the combination of bits of the x-coordinate and ^-coordinate: where hi € {xi,x0,j/i,yo, ^1)^0)2/1) 2/b} xi, yi € {0,1} and the bit set 63, b2,61, 60 is one instance of 384 permutations. Below, we will show that the M-code system can not generate the H , U , and X orders. Some concepts in linear algebra need to be introduced before investigating the M-code. The details can be found in text books on linear algebra, e. g., [4]. I l l A system L of linear equations is inconsistent if it has no solutions, and it is consistent if there is at least one solution. L is overdetermined if the number of its equations is greater than the number of unknowns. Lemma 3.11.1 A system of linear equations Aw = b is consistent if and only if the rank of the matrix A equals the rank of the augmented matrix [A\b]. We begin the analysis with the following definition: Definition 3.11.2 Given a spatial order corresponding to an image with resolution r, its characteristic matrix is a 4 r x 2 r 0-1 Boolean matrix, with the ordinal numbers expressed in bits as its rows, and the order of such rows is determined by the row scanning order from the bottom of this spatial order. By this definition, the characteristic matrix of the Z order with r = 2 is: 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 1 1 112 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 0 1 It is convenient to formulate the M-code in the form of linear equation systems. For example, the Z order bit-interleaving formulation can be written as: 0 1 4 5 2 3 8 I 6 4 7 2 8 1 9 12 13 10 11 14 15 The system shown above may be represented by the matrix equation Aw = 6, where A is the characteristic matrix and b is the right hand side vector. Note that in this case, the system Aw = b is over determined, although not necessarily consistent. An analysis of the solution existence follows. For simplicity, it is carried out for the r = 2 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 0 1 1 0 1 1 1 0 1 1 113 case. However, the methodology and conclusion can be generalized to arbitrary values of r by induction without difficulty. In the case of r = 2, the 384 different bit-manipulating structures can be parti-tioned to form 16 classes such that each class consists of 4! instances. For each class, the 4! instances are just permutations of columns of A and such permutations do not change the rank of A or [A\b]. For this reason, only one instance needs to be analyzed from among the 4! ones. Picking one instance from each of the 16 classes, we obtain: y\X\yox0 yixiy0x0 yiXiy0x0 yiXiy0x0 ViXiVoXo yixiy0x0 yixiy0x0 yiXiy0x0 y~iXiy0xo yiXiy0x0 j/iXi^oXo j/iZij/oZo j/iXit/rjZo yiXiy0x0 j/iXij/oXo j/iZij/oZo We can write 16 characteristic matrices, denoted as For example, j/iXij/rjZo generates as follows: , ^ 6 ^ , respectively. 0 0 0 0 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 1 0 0 1 114 Let the right-hand side vectors corresponding to the H , U , X orders be bjj,bu,bx, respectively, then we have: 1 : bH = [0,1,14,15,3,2,13,12,4,7,8,11,5,6,9,10] T bv = [0,1,4,5,3,2,7,6,12,13,8,9,15,14,11,10] T bx = [0,3,12,15,2,1,14,13,8,11,4,7,10,9,6,5] T where the superscript "T" denotes transposition. It is easy to find that the rank of N-1^ is 4 while the rank of [./v/^ l&t] is 5, where i = 1,2,.. . , 16, t G {H, U, X}. Therefore, it follows from Lemma 3.11.1 that there exist no solutions of the weight vector w. Consequently, no M-code exists for the H , U , and X orders. The significance of above analysis is that it demonstrates that there does not exist any weight vector w such that w is a solution. Particularly, w = [8,4,2, l]T or a per-mutation can not be a solution. This implies that bit-interleaving, bit-centering, bit-concatenation, and bit-complementation are not sufficient for encoding the Hilbert, U , and X orders. Therefore, other mathematical systems have to be used to derive the analytical formulas for this purpose, as we discussed in the foregoing sections. The analysis for the decoding process of the H , U , and X orders is parallel to the analysis presented in this section. The conclusion is similar. In addition, we have investigated the D order (Fig. 2.2), R P (Fig. 2.7), and S order (Fig. 2.8). The above conclusion applies to them as well. 1 The results shown were obtained by using the M A T L A B system. 115 3.12 Summary The encoding and decoding processes for n-dimensional Hilbert orders (n > 4) is quite complicated and places a heavy computational burden on the user. In this chapter, we proposed a new spatial order called the U order. Its performance is as-sessed analytically and empirically through four paradigmatic data processing tasks in spatial analysis: the partial match query, the range query, the eight-connected neighbour-finding, and the neighbour proximity evaluation. These paradigms have frequently been chosen as typical applications to evaluate the performance of a partic-ular spatial order (e.g., [17], [31]). The result suggests that the U order is comparable with the Hilbert order and superior to others. A formal description for the n-dimensional U order is presented, whereas this topic is considerably more complicated for the Hilbert order. Several encoding and decoding methods are discussed. It is concluded that the major advantage of the U order is the significant simplicity of the encoding and decoding operations for multi-dimensional data. Future research may involve the U order applied to quadtree and octree structures as well as investigations into its parallel algorithms. 116 Chapter 4 The Q Orders 4.1 Overview The U order described in the preceding chapter is of advantage for computing ef-ficiency, at a small cost of the performance. In this chapter, we will investigate several new spatial orders. Our goal is to find candidates that are competitive with the Hilbert order in performance yet require relative simple encoding and/or decod-ing formulas. Attention will be paid to the spatial orders with a quadrant-recursive structure, i.e., one which exhausts a quadrant of a square before exiting it. Such orders will be called the Q orders in the following. In this chapter, in Sections 4.2 to 4.5, a recursive analysis is presented for four members of the Q order family, the Q l , Q2, Q3, and the Q4 orders. They differ in performance. Of the four orders, it appears that the Q4 order is the best. Its performance is very close to that of the Hilbert order. Thus, in Section 4.8, the 117 algorithms of the Q4 order are constructed. An analysis shows that its encoding and decoding algorithms only need 64.5% and 84.2% of operations of the corresponding algorithms of the Hilbert order. Finally, the relative merits of various orders are tabulated in Section 4.9. 4.2 The Q l Order First, the configuration of the Q l order is established, then its performance applied to range queries in a two-dimensional attribute space is analyzed. The expected value of clusters (defined in Section 2.2) is used as a metric to evaluate the performance. 4.2.1 Configuration Let C be the complex space. Consider again the region C + = {2|9ft(z) > 0, S(z) > 0, z € C} to construct the frame of the Q l order in C + . Given resolution r, consider a square D with its lower-left corner at the origin and with side 2 r , partitioned into four quadrants and indexed by 0, 1, 2, and 3, respectively. The sequence 0, 1, 2, 3 also specifies the Q l order of resolution 1 (Fig. 4.1). To establish the configuration of the Q l order, a path is introduced in D as a reference (Fig. 4.2). The generation of the configuration from resolution 1 to 2 then can be determined by the following transformations: Toz = \(l+j-z) Tiz = \(j + z) 118 T2z = \(l+j + z) T3z = l + \(j-z) where j is the imaginary number and Tk is the transformation that maps D onto its quadrant A; (fc = 0,1,2,3). The Q l order of resolution 2 and its reference path is shown in Fig. 4.3 and Fig. 4.4, respectively. The configurations of higher resolutions are determined by recursively applying the transformations defined above. The Q l order of resolution 3 and its reference path is shown in Fig. 4.5 and Fig. 4.6, respectively. 1—» 2 0 3 Figure 4.1: The Q l Order of Resolution 1 Figure 4.2: Path of the Q l Order of Resolution 1 119 5 6 9 10 4 7 8 11 3 0 15 12 2 1 14 13 Figure 4.3: The Q l Order of Resolution 2 Figure 4.4: Path of the Q l Order of Resolution 2 120 21 22 25 26 37 38 41 42 20 23 24 27 36 39 40 43 19 16 31 28 35 32 47 44 18 17 30 29 34 33 46 45 13 14 1 2 61 62 49 50 12 15 0 3 60 63 48 51 11 8 7 4 59 56 55 52 10 9 6 5 58 57 54 53 Figure 4.5: The Q l Order of Resolution 3 R JR - R 1 J 1 i LJ LJ LJ LJ n n n i i i 4f =fc: Figure 4.6: Path of the Q l Order of Resolution 3 121 4.2.2 Metric Analysis of Range Query The terminology used in the following has been defined in Chapter 2. The graphs of resolution 1,2 and 3 of the Q l order are shown in Fig. 4.2 to Fig. 4.6, respectively. Then, the following theorems capture the properties of interest. Theorem 4.2.1 = (^)4 P (r > 2) [proof] From Fig. 4.2 to Fig. 4.6 it follows that N[1] = l.JVj 1* = AN^ + 1, and /Vi x ) = 4A 2 ( 1 ) + 0. No new links are introduced for higher resolutions, hence JVf) = (k > 3). Therefore, JV/1) = 4,'/Vr(i). (t = l , . . . , r - 2 ) from which the closed form is obtained by setting i to r — 2: = ( - ) 4 r • 4 6 ; Theorem 4.2.2 iV r ( 2 ) = - ( ( r > 2) [proof] According to the structure of the Q l order (the graphs with resolution up to 3 are presented in Fig. 4.2 to Fig. 4.6), a sequence can be established as follows: 122 jvf) = 0 Tvf > = 1 JVf) = 14 Nl2) = AN^ + h (fc > 4,63 = 10) Then the theorem is proved by recurrence. • Theorem 4.2.3 ATr(3) = (^K (r > 2) [proof] According to the structure of the Q l order, a sequence can be established as follows: N(3) = Q TVf) = 3 i v f 0 = 12 In general, we have NJ® = 4A^i\ (fc > 3), since no new links are introduced into the graphs of resolution higher than 2. Then the theorem is proved by recurrence. • Theorem 4.2.4 ATr(4) = (\)±r ~ (^ )2r + 1 (r > 1) 123 [proof] According to the structure of the Q l order, a sequence can be established as follows: N[4) = 0 JV« = 0 Ai 4) = 3 Ni4) = 21 A<4) = 4NJ;4\+dk 3 dk = dk-i + (-)2 (k > 4,d3 = 3) Then the theorem is proved by recurrence. • Corollary 4.2.1 When r is large, the expectation of the number of clusters per tem-plate query tends to: ^ " "8" = 2.125 [proof] There are (2 r - l ) 2 templates in a lattice system with 2 r x 2 r cells. Let p 1 , p 2 , P 3 ) and p 4 denote the occurrence frequencies of the templates with one, two, three, and four clusters in this lattice system. Then we have: jV(D * = £ j g > ( 2 T ± T ) 2 _5_ 16 Similarly, we have: 124 Consequently, when r is large, the expectation of the number of clusters per template tends to: Rr = l p i + 2p2 + 3p 3 + 4p4 17 8 = 2.125 • The statistic obtained above will be compared with other spatial orders (Ta-ble 4.2). 4.3 The Q2 Order In this section, the configuration of the Q2 order is established, then its performance applied to range queries in a two-dimensional attribute space is analyzed. The ex-pected value of clusters (defined in Section 2.2) is used as a metric to evaluate the performance. 4.3.1 Configuration The configuration of the Q2 order is determined by the following transformations: Toz = \(l+jz) Tiz = + 125 T2z = T3z = l + where j is the imaginary number and Tk is the transformation that maps D onto its quadrant k (k = 0,1,2,3). The Q2 orders of resolution 1 to 3 and the corresponding reference paths are shown in Fig. 4.7 to Fig. 4.12. 1 2 0 3 Figure 4.7: The Q2 Order of Resolution 1 6 7 9 8 5 4 10 11 2 3 13 12 1 0 14 15 Figure 4.8: The Q2 Order of Resolution 2 4.3.2 Metric Analysis of Range Query The graphs of resolution 1,2 and 3 of the Q2 order are depicted in Fig. 4.10 to Fig. 4.12, respectively. The terminology used in the following has been defined in 126 24 27 28 31 39 36 35 32 25 26 29 30 38 37 34 33 23 20 19 16 40 43 44 47 22 21 18 17 41 42 45 46 8 11 12 15 55 52 51 48 9 10 13 14 54 53 50 49 7 4 3 0 56 59 60 63 6 5 2 1 57 58 61 62 Figure 4.9: The Q2 Order of Resolution 3 ;ure 4.10: Path of the Q2 Order of Resolution 1 127 Figure 4.11: Path of the Q2 Order of Resolution 2 1 r n 1 1 r i 1 L J L J L J L J l r i 1 1 r -\ 1 L J L J L J L J l r l 1 r 1 L J L J L J L J 1 r l 1 r l L J L J L J L J Figure 4.12: Path of the Q2 Order of Resolution 3 128 Chapter 2. Theorem 4.3.1 iV r ( 1 ) = ( § ) 4 ' (r > 2) [proof] From Fig. 4.10 to Fig. 4.12 it follows that A a ( 1 ) = 1, = 4 A a ( 1 ) + 2, and + 0. Again, no new links are introduced for higher resolutions, hence JVW = AN^ (Jfc > 3). Therefore, A r ( 1> = 4*X(-/ ' (t = l , . . . , r - 2 ) As before, the closed form is obtained by setting i to r — 2: jVU) = 4 r " 2 A 2 ( 1 ) = (fn- • Theorem 4.3.2 ^ = ^ " ( r " 2 ) [proof] According to the structure of the Q 2 order (the graphs with resolution up to 3 are presented in Fig. 4.10 to Fig. 4 .12), a sequence can be established as follows: A f > = 0 A f > = 0 A f } = 4 A f > = 4 A & + 6* bk = 26jt_i (k > 4 , 6 3 = 4) 129 Then the theorem is proved by recurrence. • Theorem 4.3.3 ^ ( 3 ) = (f)4' " fa (r > 2) [proof] According to the structure of the Q2 order, a sequence can be established as follows: TV*3 > = 0 iV? > = 3 J V f = AN^+Ck Ck = 2cjt-i (k > 3,c 2 = 3) Then the theorem is proved by recurrence. • Theorem 4.3.4 i V r ( 4 ) = (\W - (\W + 1 (r > 2) [proof] According to the structure of the Q2 order, a sequence can be established as follows: TV?* = 0 7Vt> = 0 = 3 Nl4) = AN^+dk 3 dk = dk~i + (-)2 (k > 4,d3 = 3) Then the theorem is proved by recurrence. • 130 Corollary 4.3.1 When r is large, the expectation of the number of clusters per template query tends to: * • ! = 2.25 [proof] There are (2 r — l ) 2 templates in a lattice system with 2 r x 2 r cells. Let p 1 , p 2 , P 3 , and P4 denote the occurrence frequencies of the templates with one, two, three, and four clusters in this lattice system. Then we have: PI — l im r-oo ( 2 r _ 1)2 3 8 Similarly, we have: 1 P2 = g 3 P3 = g 1 P i = 8 Consequently, when r is large, the expectation of the number of clusters per template tends to: Rr = lpx + 2p2 + 3p 3 + 4p4 9 4 = 2.25 • The statistic obtained above will be compared with other spatial orders (Ta-ble 4.2). 131 4.4 The Q3 Order In this section, the configuration of the Q3 order is established, then its performance applied to range queries in a two-dimensional attribute space is analyzed. The ex-pected value of clusters (defined in Section 2.2) is used as a metric to evaluate the performance. 4.4.1 Configuration The configuration of the Q3 order is determined by the following transformations: Tiz = + T2z = 1 T3z = l + where j is the imaginary number and Tk is the transformation that maps D onto its quadrant k (k = 0,1,2,3). The Q3 order of resolution 1 to 3 and the corresponding reference paths are shown in Fig. 4.13 to Fig. 4.18. 1 2 0 3 Figure 4.13: The Q3 Order of Resolution 1 132 7 6 9 8 4 5 10 11 3 2 13 12 0 1 14 15 Figure 4.14: The Q3 Order of Resolution 2 31 28 27 24 39 36 35 32 30 29 26 25 38 37 34 33 17 18 21 22 41 42 45 46 16 19 20 23 40 43 44 47 15 12 11 8 55 52 51 48 14 13 10 9 54 53 50 49 1 2 5 6 57 58 61 62 0 3 4 7 56 59 60 63 Figure 4.15: The Q3 Order of Resolution 3 133 Figure 4.16: Path of the Q3 Order of Resolution 1 Figure 4.17: Path of the Q3 Order of Resolution 2 i r i i i r i i LJ L J L j LJ n r r n L J i i L J r i i r n LJ L J L j LJ n r i r n n 1 L J 1 i L J 1 Figure 4.18: Path of the Q3 Order of Resolution 3 134 4.4.2 Metric Analysis of Range Query The graphical representation for the Q3 order is more convenient for the metric anal-ysis. The graphs of resolution 1,2 and 3 are demonstrated in Fig. 4.16 to Fig. 4.18, respectively. The terminology used in the following has been defined in Chapter 2. The properties concerned are narrated as a series of theorems. Theorem 4.4.1 [proof] It follows from Fig. 4.16 and Fig. 4.17 that JVX ( 1 ) = 1 and A 2 ( 1 ) = 4iV x ( 1 )+2 = 6. The latter is established due to the fact that, in addition to the four one-cluster tem-plates arising from replicating the Q3 order with resolution 1, a new one-cluster template with a continuous path is formed. In general, Nk^ = 4/V^l\ + 2 holds for k even and Nk^ = for k odd (k = 2 , 3 , . . . , r), according to the structure of the Q3 order. With the intermediate variable Tk = JVf* + N^+\ (k = 1,2,.. . , r -1), we NW + N l * = ( | ) 4 r - | (r>2) have: T2 4(iV 1 ( 1 ) + /V 2 ( 1 ) ) + 2 4Tx + 2 m~i + 2 (k = 2,3, . . . , r - l ) Then by iteration we have: 2(4' - 1) (j = l , 2 , . . . , * - 2 ) 3 135 The closed form is obtained by setting j to k — 1: 2(4fc"1 —1) Tk = 4 f c - 1 r 1 + K12} 3 3 Therefore, r 1 v48 ; 3 Theorem 4.4.2 N?) + m = ( H ) 4 ' - ( | ) 2 ' + 2 (r > 3) [proof] According to the structure of the Q3 order (the graphs with resolution up to 3 are presented in Fig. 4.16 to Fig. 4.18), a sequence can be established. With the intermediate variable Tk = + Nk% (k = 1,2,..., r — 1), we have: Tx •= 2 T2 = 10 Tk = 4r f c_1 + 3-2 f c -6 (k > 3) Then the theorem is proved by recurrence. • Theorem 4.4.3 r ^ J V ' - i 24 3 (r > 2) 136 [proof] According to the structure of the Q3 order a sequence can be established. With the intermediate variable Tk = N[3) + [k = 1,2,..., r - 1), we have: 2i = 0 Tk = 4T f c _ 1 +2 (Ar > 2) Then the theorem is proved by recurrence. • Theorem 4.4.4 NM + N " = ( § ) 4 - - (3-)r +1 (r > 3) [proof] According to the structure of the Q3 order a sequence can be established. With the intermediate variable Tk = N(K4) + N$X (k = 1,2,..., r - 1), we have: ft = 1 T2 = 10 Tk = ATk-i + 3 • 2* - 4 (jfc > 3) Then the theorem is proved by recurrence. • Corollary 4.4.1 When r is large, the expectation of the number of clusters per range query tends to: * " "6" w 2.167 137 [proof] There are (2k — l ) 2 templates in a lattice system with 2fc x 2fc cells (k = 1,2,... , r). Let Pi ,P2 5 P3 5 and p4 denote the occurrence frequencies of the templates with one, two, three, and four clusters in the lattice of resolution r. Then we have: P l ~ r f e (2' - 1)2 + (2 ' - 1 - l ) 2 23 60 + jv rW P 2 = r f e ( 2 r - 1)2 + ( 2 r - l _ 1)2 13 40 /Vr(3) + jv£\ P 3 ~ ( 2 r _ 1)2 + ( 2 r - l _ 1)2 30 j\T(4) + jv rW P 4 = (2 r - l ) 2 + (2 '" 1 - l ) 2 31 120 Consequently, when r is large, the expectation of the number of clusters per range tends to: Rr = l p i + 2p2 + 3p3 + 4p4 13 6 « 2.167 • The statistic obtained above will be compared with other spatial orders (Table 4.2). 138 4.5 The Q4 Order In this section, the configuration of the Q4 order is established, then its performance applied to range queries in a two-dimensional attribute space is analyzed. The ex-pected value of clusters (defined in Section 2.2) is used as a metric to evaluate the performance. 4.5.1 Configuration The configuration of the Q4 order is determined by the following transformations: T2z = I [ l + j ( 2 - z ) ] Tzz = where j is the imaginary number and Tk is the transformation that maps D onto its quadrant k (k = 0,1,2,3). The Q4 order of resolution 1 to 3 and the corresponding reference paths are shown in Fig. 4.19 to Fig. 4.24. r-« 2 0 3 Figure 4.19: The Q4 Order of Resolution 1 139 6 7 8 9 5 4 11 10 2 3 12 13 1 0 15 14 Figure 4.20: The Q4 Order of Resolution 2 25 26 29 30 33 34 37 38 24 27 28 31 32 35 36 39 23 20 19 16 47 44 43 40 22 21 18 17 46 45 42 41 9 10 13 14 49 50 53 54 8 11 12 15 48 51 52 55 7 4 3 0 63 60 59 56 6 5 2 1 62 61 58 57 Figure 4.21: The Q4 Order of Resolution 3 140 Figure 4.22: Path of the Q4 Order of Resolution 1 Figure 4.23: Path of the Q4 Order of Resolution 2 "1 r r n L J L J L J r .1 ( 4' 1 r J L J L J LJ • •n r r n L J i i 4f L J r 0 1 6: 1 r n • .J L J L J LJ Figure 4.24: Path of the Q4 Order of Resolution 3 141 4.5.2 Metric Analysis of Partial Match Query In a partial match query, one of the attributes is specified in a two-attribute record while the other is not. Geometrically, the set of the selected records corresponds to a line parallel to one of the coordinate axes in the two-dimensional Cartesian space. The graphical representation for the Q4 order is more convenient for the metric analysis. The graphs of resolution 1,2 and 3 have been demonstrated in Fig. 4.22 to Fig. 4.24, respectively. Let Rr be the total number of clusters in the lattice of 2 r x 2 r . The result is narrated as the following theorem: Theorem 4.5.1 The total number of clusters arising from the partial match query in the lattice of resolution r is given by: * - <6l>4' ( r * 3> [proof] It is observed from Fig. 4.22 that there are five clusters for r = l . This leads to i?i = 5. For the Q4 order with r=2, in addition to the four copies arising from replicating the graph with r = l , there are three new links introduced (Fig. 4.23), hence R2 = 4i?! — 3. Similarly, one has R3 = 4R2 — 1. However, no new links are introduced for r > 4. Therefore, Rk = 4Rk-i {k > 4). Consequently, Rr = IRr-l = 42Rr-2 (« = l , . . . , r - 3 ) 142 The closed form is obtained by setting i to r — 3: Corollary 4.5.1 The expectation of the number of clusters per partial match query [proof] There are 2 r horizontal selections and 2 r vertical selections altogether. There-fore, the expectation of the number of clusters per query is: Rr Pr = 2 r + 2 r v 6 4 y « 1.047 x 2 r _ 1 • The statistic obtained above is compared with those of the U , Z, and Hilbert order (Table 4.1). It is observed that the performance of the Q4 order is close to the Hilbert order, and is better than other two orders. In the next section the range query will be analyzed. 4.5.3 Metric Analysis of Range Query The terminology used in this section has been defined in Chapter 2. The properties concerned are narrated as a series of theorems: 143 Spatial Order The Expectation of Cluster Number Per Query Z 1.5 x 2 r - 1 u 1.167 x 2 ' - 1 Q4 1.047 x 2 ' " 1 H 1.0 x 2 r - J Table 4.1: Expectation for Partial Match Query Theorem 4.5.2 JV r ( 1 ) = ( f ) 4 ' (r > 3) [proof] It follows from Fig. 4.22 to Fig. 4.24 that = 1,JV2(1) = 47Va(1) + 2, and Ni1] = 47V2(1) + 1. No new links are introduced for higher resolutions, hence ArW = 4JVJp_\ (k > 4). Therefore, iV r ( 1 ) = 4*'iVr(-,- (»==l,...,r-3) The closed form is obtained by setting i to r — 3: N(l) = 4r-*NM = 25 x 4 r _ 3 = ( ~ ) 4 r • Theorem 4.5.3 NW = (|)4' _ (5 ) 2 r ( r > 3 ) 144 [proof] According to the structure of the Q4 order (the graphs with resolution up to 3 are presented in Fig. 4.22 to Fig. 4.24), a sequence can be established as follows: M2) = 0 7Vf> = 1 = 11 Nl2) = 4Ng>1+.bk bk = 2bk-i (k > 5,64 = 20) Then the theorem is proved by recurrence. • Theorem 4.5.4 ^ ( 3 ) = (r * 3 ) [proof] According to the structure of the Q4 order, a sequence can be established as follows: A f ) = 0 i V f > = 1 A f > = 5 In general, we have = 4JV£ ) 1 (k > 4), since no new links are introduced into the graphs of resolution higher than 3. Then the theorem is proved by recurrence. • Theorem 4.5.5 JV<4) = ( ^ ) 4 r - ( j )2 r + 1 (r > 3) 145 [proof] According to the structure of the Q4 order, a sequence can be established as follows: 7Vf> = 0 A f > = 1 = § = 41 Nl4) = ANl^+dk 3 dk = dk-i + (~)2 (Jfe > 5,d4 = 9) Then the theorem is proved by recurrence. • Corollary 4.5.2 When r is large, the expectation of the number of clusters per tem-plate query tends to: U e p t " 32 = 2.09375 [proof] There are (2r — l ) 2 templates in a lattice system with 2r x 2r cells. Let PiiP2,P3, and p 4 denote the occurrence frequencies of the templates with one, two, three, and four clusters in this lattice system. Then we have: 25 64 Similarly, we have: 21 P 2 = 64 146 64 13 P 4 = 64 Consequently, when r is large, the expectation of the number of clusters per template tends to: Repi = l p i + 2p2 + 3p 3 + 4p 4 67 32 = 2.09375 • The statistic obtained above is compared with those of the G, U , Z, and Hilbert order (Table 4.2). It is observed that the performance of the Q4 order is close to the Hilbert order. 4.6 The Q5, Q6, and Q7 Orders A n eight-variable integer programming model can be constructed to determine the optimal spatial order(s) with quadrant-exhaustive and single-type hierarchical struc-tures. Single-type implies that the iteration rule is identical for all resolutions. To simplify the analysis, consider the orders developed from the rook-connected path in the system of resolution 1 only (as shown in Fig. 4.2, for example). Let this path be called the basic path. The sub-ordering of each quadrant in the system of resolution 2 can be generated from this basic path by rotating it 0°, 90°, 180°, and 270°, with two possible indexing sequences each. Therefore, two integer variables are required to define the sub-ordering for each quadrant, one for the rotation state and another for 147 Spatial Order The Expectation of Cluster Number Per Query Z 2.625 G 2.5 U 2.333 Q l 2.125 Q2 2.25 Q3 2.167 Q4 2.094 H 2.0 Table 4.2: Behavior of the Average Case 148 the indexing sequence. Their domains can be defined as {0,90,180,270} and {0,1}, respectively. Since the basic path has eight distinct ordering possibilities, there are in total 8 4 = 4096 ordering instances for resolution 2. The objective function can be any metric that is meaningful to characterize the performance of a particular order applied to query retrievals, for example, the metrics dmax,di, or dt defined in Sec-tion 3.3. Instead of trying to solve this type of integer programming problem, which may have to rely on some enumeration techniques, in this section we investigate a subset of orders that are rook-connected at resolution 2. There are only six possible rook-connected patterns in a 2 x 2 lattice system. Three of them, namely, the Hilbert order (original pattern), the Q l order, and the Q4 order, have been discussed. In the following, the remaining three are named the Q5, Q6, and Q7 order, respectively. Their configuration for resolution 2 are the same as those of patterns L2, L3, and L4 of the Hilbert order, but the iterations for resolution 3 are different. To save space, the mathematical descriptions are omitted here and only the paths are illustrated in Fig. 4.25. 149 1 "1 r 1 1 1 I IrT J L ""4' r -i I n 15 41 J| •"1 *1 L JJ L 1 r _4i L J L r ~ n 1 "I r I X — L J ,| r 1 1 L ft 1 r _4 1 15 1 f- r 1 L J L Figure 4.25: Paths of the Q5, Q6 and Q7 Orders of Resolution 3 150 4.7 Neighbour Finding and Proximity A set of algorithms has been designed and implemented for the purpose of evaluating the performance of the Q orders. Their relative merits are assessed computationally for the 8-connected neighbor-finding problem over the range of resolution r from 2 through 8. The average values of dmax and d\ over the set of 2 r x 2 r cells are reported in Table 4.3 and Table 4.4, respectively. Resolution 2 3 4 5 6 7 8 H 5.75 14.03 32.14 69.68 146.07 300.17 609.68 Q5 5.88 14.41 32.41 69.96 146.30 300.34 609.80 Q7 6.00 14.59 32.98 71.32 149.42 307.03 623.67 Q4 6.13 14.63 33.02 71.51 150.01 308.52 627.05 Q3 6.13 14.88 33.53 72.12 150.65 309.16 627.74 Q2 6.13 14.86 33.61 72.31 150.95 309.57 628.10 Q6 6.19 15.17 35.20 76.92 161.89 333.36 677.94 Q l 6.50 16.00 37.91 83.90 177.90 367.91 749.96 Table 4.3: Average Maximum Distance of Finding 8-Neighbors in a 2D Grid The Q orders have also been compared with each other for the proximity problem. The scheme is similar to that described in Section 3.4. The results over the set of 2 r x 2 r cells (r = 1,2,.. . , 7) are reported in Table 4.5. 151 Resolution 2 3 4 5 6 7 8 H 16.50 42.38 97.34 210.59 440.40 903.35 1832.55 Q5 17.00 43.25 98.56 212.14 442.29 905.57 1835.05 Q7 17.00 43.81 100.44 216.70 452.23 926.29 1877.53 Q2 17.00 44.00 100.98 217.97 454.96 931.95 1888.84 Q3 17.00 44.00 101.00 218.00 455.00 932.00 1889.24 Q4 17.00 44.00 101.00 218.00 455.00 932.00 1889.25 Q6 17.75 46.56 107.81 233.95 489.84 1005.26 2039.63 Q l 19.00 51.00 119.00 259.00 543.00 1114.99 2262.88 Table 4.4: Average Total Distance of Finding 8-Neighbors in a 2D Grid 4.8 Encoding and Decoding the Q4 Order Of the seven Q orders proposed above, the Q4 order behaves best on the average for the 2 x 2 templates. For other applications like neighbor finding and proximity problems, the Q4 order also exhibits satisfactory performance. On the other hand, of the seven Q orders, only the Q l and Q4 orders do not involve the aspect iteration. This property implies that these algorithms may be simpler than the other ones. However, the behaviour of the Q l order is not as good as the Q4 order. Thus we choose the Q4 order as a representative and derive its encoding and decoding algorithms. These algorithms do not depend on look-up tables. The methodology is similar to that used for the Hilbert order. 152 Resolution 1 2 3 4 5 6 7 H 1.00 2.00 3.28 4.89 7.53 10.81 16.05 Q6 1.00 2.00 3.42 5.23 8.10 12.02 17.74 Q4 1.00 2.00 3.31 5.32 8.11 12.44 18.13 Q5 1.00 2.00 3.47 5.36 8.48 12.44 18.65 Q7 1.00 2.00 3.33 5.30 8.34 12.58 18.74 Q2 1.00 2.00 3.36 5.69 8.46 13.22 18.82 Q l 1.00 2.00 3.63 5.55 8.73 12.85 19.12 Q3 1.00 2.00 3.84 6.25 9.41 14.32 20.81 Table 4.5: Average Farthest Distance of the Neighbors in a 2D Grid 4.8.1 Encoding The Q4 orders and curves of various resolutions have been depicted in Fig. 4.19 to Fig. 4.24. In this section, the iteration rules to construct the Q4 order are described first. These rules establish the basis for deriving the analytic encoding and decoding formulas. The Q4 order is to be established in a subset D = {0 < x < 2 r ,0 < y < 2 r} in the two-dimensional integer space. Given the coordinates of a particular point P in the lattice system D, the corre-sponding value of P in the Q4 order (W — value in brief) is to be determined. This process is called encoding. 153 Similar to the Hilbert order, the analysis for the Q4 order is based on the top-down approach. First, the whole lattice is partitioned into four quadrants and indexed in base 4. Then each quadrant is recursively processed in the same way. The first and second levels are illustrated in Fig. 4.26 and Fig. 4.27, respectively. 1 2 0 3 Figure 4.26: The First Level 12 13 20 21 11 10 23 22 02 03 30 31 01 00 33 32 Figure 4.27: The Second Level Let the coordinate values x and y be expressed by bit-strings: x = x r _ j . . . x i X o and y = yr~\... j/ij/o, and the corresponding W-value be expressed in terms of qua-ternary digits (w-digits in brief): w = 4 r _ 1 x wr~i +.. . + 4 1 x w\ + 4 ° x WQ. At level i (i = 1,2,.. . , r), given x r _; and yr_,-, the t-th most significant digit of the coordinates x and y, the i-th most significant digit of the W-value (denoted by uv-t") is to be determined. The terminology to be used has been defined in Section 2.3. For example, a unit is 154 an item specifying the indexing strategy for four quadrants of a square region. It can be represented by a simple graph with three edges and four nodes labeled by indexes in terms of quaternary digits (n.digit in brief), as shown in Fig. 4.28. For the Q4 order, the units rotate, thus their orientations need to be iterated. Their aspects, however, do not change. In other words, the nodes of a unit are always indexed clockwise. According to the structure of the Q4 order, the units generated from node 0 and 1 of the unit at level i (i = 1,2,.. . , r) always rotate 90 degrees counterclockwise, while the units generated from node 2 and 3 always rotate 90 degrees clockwise. The orientation of the unit at the level 1 is called the principal orientation of the Q4 order. The analysis for the orientation will be developed in the two-dimensional Cartesian coordinate system. The structure of level 1 is specified as shown in Fig. 4.28. This structure determines the overall appearance of the order considered. Figure 4.28: Structure and Orientation of Level 1 Level 2 is generated by constructing four units, labeled Uv (v = 0,1,2,3), from the corresponding indexed nodes of the unit at level 1 (Fig. 4.29). The indexes of nodes are obtained by concatenating a quaternary digit to the index of the unit. Note the nodes in these units are indexed progressively clockwise (Fig. 4.27). 155 Figure 4.29: Structure and Orientation of Level 2 The orientation of a unit is specified by a normalized vector m. According to the specified arrangement of the Q4 order relative to the coordinate system, m € { (0 ,1 ) T , ( -1 ,0 ) T , ( 0 , - l ) T , ( l , 0 ) r ) . Two units with their normal vector are demonstrated in Fig. 4.30. -1, 0] [1, 0] U l V2 Figure 4.30: Two Instances with Directional Vic tors at Level 2 Similarly to the Hilbert order, we use a directional index to conduct the encoding process. The relation between the directional vector m and an index v has been shown in Fig. 2.20. Since v has four possible values, it can be represented by two bits a and b. Similarly, two bits ri2k-\ and n 2 * - 2 are used to represent an n-digit. According to the inherent structure of the Q4 order, n2k-\ and n2fc-2 can be expressed in terms of a, 6, Xfc_i and yk-i' n 2jt-i = abxk-i + abyk-\ + abxk-\ + abyk-i 156 = b(a®xk-i) + b(a®yk-i) n2jt-2 = b © xk-\ © j/jt-i A = 1,... , r . The iteration of v can be carried out by adding an incremental factor dv: if rt-digit = 0 or 1 then dv = 1; e/se dv = 3; Using a bit c to represent du, the relation defined by the if clauses above can be expressed as a table Tl2k-1 n2fc-2 c 0 0 0 0 1 0 1 0 1 1 1 1 which in turn can be formulated as a simple expression: c = n2jfc-i (4.1) The updated value v' of the directional index v is obtained by the addition: v' = (v + dv) mod 4 (4.2) Let v' be represented by two bits a' and 6', then Eq.(4.2) can be formulated as: a' = a © 6 © c (4.3) b' = b (4.4) 157 Substituting (4.1) into (4.3) yields a' = a®b@ n2k-i (4.5) In order to obtain the final expressions of a' and V in terms of a, 6,x k-i, and yk-i, tedious algebraical calculations need to be performed. The derivation details are ignored here. The encoding formula is listed as below: a r _ a = 0 K-i = 0 W2r-2 = K-l © Xr-i © u>2r-i = K-i(aT-i®xr-i) + K-i(ar-i@yr-i) ajt-i = hfk + hxk 6jt_i = bk w2k-2 = © xjt_i © j/fc-i u>2k-i = &fc-i(a*-i © xk-i) + bk-x(ak-x © yk-i) (fc = r - l , r - 2 , . . . , l ) . The order of this algorithm is 20r. 4.8.2 Decoding Decoding is the inverse process of encoding: given the W-value of a particular point, the corresponding coordinate pair (x,y) is to be determined. Let wudigit be repre-sented by the two bits io 2 *- i and w2k-2. The transition rules for iterating the index v are the same as applied to the encoding process, except using w-digit in the right 158 hand side: K-Xr-Vr-h-Xk-Vk-= 0 = 0 = ar_i © w2r-i © (6 r_iu;2r-2) = O r _ ! © l W 2 r _ i © ( & r _ i U > 2 r _ 2 ) = ak © bk © u>2fc-i = ^ = a*_i © u>2*-i © (&*-i">2*_2) = ak-i © io2fc-i © (&fc-iu>2*-a) (fc = r _ i , r - 2 , . . . , l ) . The order of this algorithm is 16r. 4.9 Summary In applications of data retrieval, the Hilbert order exhibits good performance. How-ever, its encoding and decoding processes involve many operations and the response time for interactive queries is long when the data volume is large. A fast addressing technique to point to positions of data blocks is desired. The U order is faster than the Hilbert order, but its performance is decreased somewhat. In this chapter, we pro-posed several new spatial orders and evaluated their performance. Among them, the Q4 order presents a simple structure. Thus it is selected as an applicable representa-tive and its analytical encoding and decoding formulas are derived. The performance of the Q4 order has been assessed both analytically and computationally for several typical applications in spatial data processing systems. The result suggests that the 159 Q4 order behaves similarly to the Hilbert order. A n algorithm analysis has also been conducted to estimate the computational costs of several spatial orders and the results are listed in Table 4.6. Spatial Order Encoding Decoding Z 4r 4r U 5r 5r Q4 20r 16r H 31r 19r Table 4.6: Computational Costs for Resolution r The uniform distribution of queries has been implicitly assumed throughout the foregoing discussions. For distributions that are non-uniform, the following model can be considered: given a range query that may correspond to an arbitrary polyhedron in an n-dimensional lattice system, the maximum linear span of the indices (i.e., location codes) in this polyhedron can be used as a metric to assess the scattering degree. Each range query may be associates with a probability. Therefore, the expectation and the variance can be determined given a particular spatial order. Thus the performance of spatial orders can be evaluated by comparing these statistics. If the probability distribution changes over time, then the approach similar to the Scenario Tracking model (Chapter 5) can be applied. In general, the concerned time range can be partitioned into a series of phases, and each phase corresponds to a scenario. A set of scenario probabilities can be assigned to these phases. Then a 160 tracking model can be constructed to determine a spatial order that performs best over all scenarios. 161 Chapter 5 Optimization of the Key Resolution for Range Query Retrieval In foregoing discussions, the resolution of the lattice systems is given and assumed to be identical for all dimensions. In this chapter, the optimization of the resolution is discussed. The analysis will concentrate on a typical application in information retrieval systems, the range query retrieval problem. The methodology and conclusion are expected to be applicable to other lattice models in data processing systems. 162 5.1 Overview In modern file systems, data records are grouped into a set of sections based upon the values of attributes. A hashing function is a means of calculating the index of a section containing a given record from its key. Multiattribute hashing is an effective mechanism solving partial match query retrieval problems. Suppose each record has n fields. In a multihashing structure, each field is as-sociated with a distinct hashing function h*. The n hash values can be arranged according to interleaving, combination or permutation rules to generate a number. Mathematically, this is a mapping from n to 1. Usually the arrangement for the n hash values is carried out in terms of bits, rather than in terms of decimal digits. The resulting number is used as a pointer to a data section. Interest in applying multihashing to partial match query retrieval has been high since the 1970s. The multiattribute hashing scheme was initially proposed by Rothnie and Lozano [57] as an alternative structure to invert files to improve a system's performance. Rivest [54] analyzed the performance of non-binary attributes and discussed some hashing schemes by means of combinatorial design. Aho and Ullman [3] proposed a mathematical programming model to determine the optimal number of the bits allocated to the fields. In a partial match query, a bit-string with 6,- bits called the field signature, is generated by the hash function hi with the specified field as its argument. The number of bits in a field signature is called field resolution and it determines the 163 cardinality of the hashed space. For instance, we may have: 00 if x < UG 01 if UG" < x < UN' 10 if "JV" < x <UT 11 others In this example, the field resolution is 2 and the cardinality of the hashed space is 4. The hashing process has a two-fold effect: 1. The domain of a field is subdivided into a finite set by decision thresholds, and 2. the range of the hash function is a finite discrete set of representation values. In this chapter, the word "range" has two meanings. One is associated with range query, a concept defined in the attribute space, whereas the other stands for the set of the hashed values. The latter is used because of mathematical conventions. For partial match query problems, Aho and Ullman [3] considered a storage space partitioned into 2 B sections (B is a positive integer), each of equal size. Thus B bits are needed to determine a section address. They proposed a constrained nonlinear programming model to optimize the resolution for each field, based on the specified query probabilities. The objective function to be minimized is the expected value of data sections that need to be searched to answer a query. Suppose that each record has n fields. Let pi be the relative importance coefficient or weights attached to the t-th field. Let the field resolution to be determined be 6,. A query q can be represented by a subset in the integer set [1,2, . . . , n]. There are 2 n 164 such subsets. Aho and Ullman proposed the following model: minimize y = E f t II**] (5.1) 9 i(q n subject to ^2 bi = B «'=i (5.2) where Pg = n, 6, Pi IW1 - Pi) and g C {1 ,2 , . . . , n}. The model characterized by (5.1) and (5.2) presents an elegant mathematical formulation for the partial match query optimization problem. This basic formulation has been developed and applied to various partial match query applications ([40], [41], Much research has been done on partial match query models. However, the lit-erature is relatively sparse on the range query retrieval problem. We think that the optimization models for the range query problem can be nicely established by means of stochastic programming methodology. A brief introduction to stochastic program-ming is given in the next section. 5.2 Introduction to Stochastic Programming A mathematical programming problem can be generally formulated as follows: subject to g{X) > 0 where / : PC1 —*• Rl,g : R" —• PC71. X is called the decision variable. f(X) is called the objective function and g(X) the constraint function. The feasible region of X is [53], etc.). minimize f(X) 165 represented by the region enclosed by g(X) = 0. Usually, both f(X) and g(X) are expressed analytically in terms of X and a set of coefficients (e. g., price of goods, cost of transportation). Such a coefficient set is denoted Cl in the following. Conventional mathematical programming research deals with the optimization theory and methods on the assumption that each u> € Cl is deterministic. Stochastic programming relaxes this assumption, namely, Cl is regarded as a set of stochastic variables. Tintner [64] described two kinds of problems: • subjective risk there exists a probability distribution of anticipation which is itself known with certainty, and • subjective uncertainty: there is an a priori probability of the probability distri-butions themselves. The first field leads to stochastic programming. In the literature, two classical types of stochastic programming models have been proposed ([13], [34]): • the wait-and-see type (also called the passive or adaptive type), and • the here-and-now type (also called the active or anticipative type). The wait-and-see model concerns the situation in which the decision variable X is chosen after making an observation for each u € Cl. In other words, the samples of realized u are available. The here-and-now model concerns the situation in which the decision variable X is chosen before making an observation for each UJ G Cl. This 166 implies that the moments of u can be incorporated into the formulations of f(X) and g{x). The traditional tasks of the wait-and-see approach are to find the moments of the optimal solution of X and/or the optimal value of f(X), or, more generally, the distributions of theirs. Recently, Dembo [12] proposed an alternative task for the wait-and-see model, scenario tracking. The basic ideas can be considered as coming from original concepts of norm regulation or from the connotation of least square techniques. Consider the following formulation of general stochastic programming problems: minimize f(X,ad,au) subject to g(X,b,i,bu) > 0 where / : Rn —• B},g : BJ1 -* R71; the subscripts d, u stand for the deterministic and uncertain parameters respectively. A scenario can be defined as a particular re-alization of the uncertain data, au and 6U, represented by a, and bs. Thus, for each scenario s € S = {set of all scenarios}, the above problem reduces to the determinis-tic problem below, which will be referred to as the scenario subproblem: ya = minimize f(X,ad,aa) subject to g(X,bd,ba) > 0 Corresponding to each scenario, a probability p„ is assigned. In such models, a solution Xj to a stochastic nonlinear system g(X, bd, ba) > 0 is said to be (norm) feasible if it minimizes J2»Ps[\\ minimum [0,g(X, bd, bs)] \\]. 167 Now, a coordination model can be constructed as: minimize J^P ' t l l f{X,ad,aa) -y,\\ + || minimum[0,g(X,bd,bt)} ||] (5.3) » This formulation can be named tracking model since it tracks the scenario solutions as closely as possible while still maintaining feasibility. The norm included can be the 1-norm or 2-norm. It may be instructive to give some remarks to end this section. The essential of scenario optimization is the idea of coordination. Having realized this, we can construct diverse tracking models. In principle, any reasonable way in which scenario solutions can be combined into a single feasible solution to the underlying stochastic programming problem can be considered as a candidate. The suitable formulation of the tracking models should differ from application to application. In the following sections, we will show that the optimization models of range query retrieval can be nicely established by means of stochastic programming techniques. 5.3 The Here-and-Now Stochastic Programming Model In applications, the domain of an attribute is not necessarily a subset in a numeral space (e.g., the integer space). It is, however, not difficult to design a transformation to map non-numeral items to numerals in a given application. Actually, this is the case that most hashing techniques assume. In the discussion to follow, therefore, this assumption is held. 168 Given a range query, one or more sequences of contiguous values of some fields are required. For example, (25 < age < 40) and (140 < weight < 160) (5.4) It would be beneficial to give some interpretations to the philosophy implied by the formulation (5.1) before constructing a model for the range query problem. The objective function in (5.1) can be rewritten as follows: y = E ^ I W _ \ ^ p r rLgg2' n.t6g2' " v n,€,2>< J = 2 B £ P , [ T T r 6 i ] (5-5) 9 « ' € 9 Therefore, the minimization of (5.5) is similar to: maximize 2B£ AHI2 ]^ (5-6) Formulation (5.6) explicitly presents the intuition: the more frequently a field is specified, the more bits it should be have. In other words, the frequency Pq acts as a weight to these fields i € q. In applications, the weights may be contributed not only by frequencies but also by various other factors. For the range query problem, one of these factors is the size of range instances. Let the factor that characterizes the relative importance of range sizes be u,-, which is a monotone decreasing function of the range size, then the model (5.1) becomes: minimize y = ] Q n u « P « IK1 - Pi)2*"] (5-7) 169 Obviously the parameters u,- (i = 1,2,..., n) should be regarded as stochastic param-eters for most practical range query problems. Consequently, the optimization model (5.7) is a stochastic programming model. One major difficulty arising in range query problems is the large amount of range query instances. Let the cardinality of the domain of the i-th. fields be denoted m,\ There may be two ways to classify range types. The first way considers both the size and the starting position of a range. By size it meant the cardinality of a range. For example, the size of [2, 2] is 1. Consider the case that the domain is {1,2,3,4,5,6,7,8}. Then with numeral 1 as the starting point, there are [1, 1], [1, 2], . . . , [1, 8], in total eight instances. In general, with numeral j (1 < j < m,) as the starting point, there are altogether m; — j + 1 instances. Thus the total number of all possible instances would be J2T=i(mi ~ J' + = m « ( m » + l)/2- However, a query incorporating an m^-size range specification for field i actually corresponds to unspecifying field i. For instance, the following two queries are equivalent: retrieve all records such that: (25 < age < 40) and all weights retrieve all records such that: 25 < age < 40 In other words, the m,-size instance should be excluded from the calculation for the probability of events i € q. Consequently, the number of total instances is given by: _ (m,- - l)(m t- + 2) ~ 2 The second way takes the size into account only. Thus there are m,- — 1 instances, each with size 1,2, . . . , m,- — 1, respectively. 170 In principle, a set of frequencies can be associated to each range instance in the attribute space. The number of them depends on the particular instance distribution. In this section, the here-and-now approach is discussed. Four frequency distributions of range types are analyzed as paradigm studies. Paradigm 1 : Paradigm 1 considers the average size of ranges over all possible instances, taking both the size and the starting point into account. For field i, there are m,- 1-size ranges, namely, [1, 1], [2, 2], [ TO , - , TO , - ] . Similarly, there are TOJ — 1 2-size ranges, namely, [1, 2], [2, 3], . . . , [TO,- — l ,m,] . In general, there are (TO,- - k + 1) k-size ranges (k = 1 , 2 , . . . , T O , - — 1). If assuming that all instances are equally likely to be specified, then the frequency of the fc-size (k = 1,2,..., TO,- — 1) range is TO,- — k + 1 fk = Ni 2(TO,- -k + l) (mi - 1)(TO,- + 2) i.e., the frequency follows a linear distribution. As a result, the average size of ranges is calculated as follows: m __ 1 Si = ^2 kfk k=i _ m,-(TO,- + 4) 3(mt- + 2) TO; 3 Paradigm 2 : On the assumption that the range instances are uniformly distributed by con-171 sidering both the size and the starting position, it has been shown that the frequency is a linear function of interval size k. Another possible situation is that the instances are uniformly distributed over the size only. Thus the fre-quency is a constant against size k. Without the m,-size interval, the frequency is given by fk = —^—r (fc = l , 2 , . . . , m , - l ) mi — l and the average size is expressed as m,—1 Si = J2 k h k=\ Y Paradigm 3 : The partial match query problem has the special significance in that it has been regarded as a representative paradigm in applications and extensive research has been done. Thus it is instructive to distinguish the instances of the partial match query from the ones of the general range query family. Paradigm 3 considers the case of the uniform distribution of range instances, taking both sizes and starting positions into account, over the size set k € { 2 , 3 , . . . , m,- — 1}. Given / i , the following formulas are obtained: _ 2(m,- + 1 - k) h ~ ( m i + l)(m, - 2) ( * - 2 ' - . " » . - - l ) m,—1 k=2 172 = / l + (m,- + l)(m, - 2) fc=2 m,(m,- — 4) 3(m, - 2) Paradigm 4 : The case of the uniform distribution of ranges about sizes only is considered in this paradigm. With f\ specified, a set of formulas can be obtained as follows: fk = — ^ (fc = 2 , . . . , m , - l ) k=2 I m,—1 m « - 2 Jfc=2 At present, on the uniformity assumption, we have derived a set of "mathematical expressions of the average size s,-. The uniformity assumption has frequently been applied in relevant studies. For example, assuming that all 2 n subsets for the binary reflected Gray code of order n are equiprobable, Faloutsos [15] has showed that the improvement on the cluster number is up to 50% to the natural binary order. Another example based on the same assumption is the model used in industry to estimate the performance of different disk drives [59], where the average seek time for a disk is derived as the same time required to traverse one third of the tracks or cylinders. It is interesting that this result is similar to the one derived in Paradigm 1 shown above. 173 Paradigm au a2i 03t a 4; 1 1/3 2/3 -4/3 2 2 1/2 0 0 -3 1/3 fi - 2/3 -4/3 -2 4 1/2 h + 1/2 0 -Table 5.1: Coefficients a*,-To save text space, in the following we use a unified formula of s,-: Si = aurrii + a 2; + (5.8) m,- + a4t-The values of aki (k = 1,2,3,4) are given in Table 5.1. By replacing u,- by /(s,) in Eq. (5.7), where /(s,) is a monotone decreasing function of Si, we have: minimize y = EQI/(5«)P«- IK1 ~ Pi)2*'] 9 • € « tgg (5.9) The objective function y in (5.9) can be rewritten by incorporating the equality constraint (5.2) as follows: y = 2 B P£[Il7. /(*)2- f c ' ] 9 » ' € ? (5.10) where P = n?=i(l - Pi) a n d r.' = Pi/(1 - Pi)-Since query g C {1,2, . . . , n } , the summation of q is over all 2 n subsets. The empty product is considered unity by convention. Consequently, Eq. (5.10) becomes: y = 2Bpfl[l + nf(si)2-b<} (5.11) i = i 174 The minimization of the objective function y characterized by Eq. (5.11) subject to equality constraint (5.2) can be solved by the classical Lagrangian multiplier method. To save space, the mathematical details are not presented here. The optimal solution is of closed form: b* = ^ + lo92[rif(sA)-±£log2[rjf(sj)} The optimal value of the objective function y is: y* = P[2* + (nri/(-i))*]" i=i 5.4 The Wait-and-See Stochastic Programming Model The here-and-now approach based on using the moments of the stochastic parame-ters is appropriate as long as analytic descriptions of the frequency distributions are available, especially for those cases with simple distributions like uniform or linear distributions, as shown in Section 5.3. In practice, however, the range size in the range query problem may be a stochastic variable with an unknown distribution. The moments are difficult or even impossible to be expressed in terms of m,-, the cardinality of the i-th field. In this section, the wait-and-see approach is investigated. Let x — (xi, x 2 , . . . , xn) be an observation to the stochastic vector v = (vi, v2,.. •, vn), where Vi is the range size of field i , and let f(t) is a decreasing monotone function of 175 t, then model (5.7) is written as follows: minimize y = EGI/O^P.' Iii1 ~ P«')26i] (5.12) Subject to equality constraint (5.2), the optimal solution of (5.12) can be determined by the Lagrangian multiplier method. The derivation details are ignored here to save space. The optimal solution is: b* = - + ^ 2 ( r 0 - - E / o 5 2 ( r j ) + / O c 7 2 [ / (x0 ] - - f : /o 5 2 [ / (x^ (5.13) n n j = 1 n j = 1 Let us choose Paradigm 1 and 2 (see Section 5.3) as the representatives and use a unified formula to express their density distribution functions: di(t) = qit + Ci 1 < t < mi - 1 where for the uniform distribution: qi = 0, 1 a = m,- — 1' and for the linear distribution: - 2 q i " ( m , . - l ) ( m , - f 2 ) ' 2(m< + 1) C i ~ (m,-- l)(m,-+ 2) Consequently, the expected value of the optimal solution 6* can be calculated as fol-lows: mi—1 m,—1 mn—1 E{K) = £ + C l ) • • • £ + « ) ] . . . £ ( ? n X n + (5.14) X l =1 X j = l s n = l 176 The availability of a closed form for Eq. (5.14) after substituting Eq. (5.13) into it depends on the specification of / (x,-). In general, given the density distribution func-tion di(t) (i = 1,2,... ,n), determining the expected value of the optimal solution may have to rely on numerical techniques to evaluate the following summation: mj—1 T7i | — 1 m n—1 m) = E fowl ••• E [*:*(*.•)] ••• E K M ] (5.i5) X l = l Xi=l x n = l So far we have applied the here-and-now and wait-and-see methods to deal with the stochastic programming models arising from the optimization problem of the range query retrieval. It has been shown that both methods may involve finding the moments of stochastic variables. For the uniform or linear distributions, the here-and-now model does not involve complicated computations, while the wait-and-see model needs to evaluate a complicated summation. For this reason, the here-and-now model seems to have advantage over the wait-and-see model. Concerning applications, it appears to be appropriate to apply the here-and-now and wait-and-see methods to commercial data systems, where the assumption on the uniformity is usually justified. For engineering data systems, where diverse distributions are prevalent, the here-and-now and wait-and-see approaches may have to rely on numerical summation or integration techniques. In the next section, we discuss the third method — the scenario tracking method. 177 5.5 The Scenario Tracking Stochastic Program-ming Model The conventional wait-and-see approach deals with the moments of the optimal so-lution of decision variables and/or the optimal value of the objective function, or, more generally, their distributions. Recently, the scenario tracking method was sug-gested to be an alternative task for the wait-and-see model [12]. The basic idea has been introduced in Section 5.2. In this section, this methodology is applied to the optimization of field resolutions. Scenario tracking consists of two phases: the first is scenario optimization and the second is scenario coordination. Recall that a scenario is defined by a set of parameters (u>i, u)2,..., wn), a particular realization of a set of stochastic data (ui, v2,..., vn). Thus model (5.12) is actually the formulation of the scenario optimization: minimize y = £ [ I I / K ) P i IK 1 " w)26<] ( 5 - 1 6 ) 9 t '€9 i(q The optimal solution can be expressed as follows: B 1 " b* = — + log2(ri) - - E l°92{rj) n n j = 1 +lo92[f(wi)]--J2log2[f(wj)} nj=i By substituting the above expression of b* into the expression of the objective function y, we obtain the optimal value: y: = ^ {2»+[nri/K-)]*}n 178 where P = I l ? = i ( l — .?«)> r«- = Pi/(\ ~P«)> where the equality constraint Eq. (5.2) has been incorporated. The subscript us" stands for scenarios. Using the value y*, the scenario coordination model can be formulated as follows: minimize y = £ / . || y4* - P„2B f[( l + / K ^ r , - , ^ " 6 ' ) || (5.17) n subject to J2bi = B (5-18) »=i The complete model may include some inequality constraints such as 6;mm ^ k < bimax) etc. A n appropriate solution process for the optimization model characterized by (5.17) and (5.18) is constrained nonlinear programming. A set of algorithms has been devel-oped and implemented. The description is given in the appendix. 5.6 Optimization of the Section Capacity In the foregoing discussion on the optimization of field resolution, the number of total data sections (2 B) has been assumed to be a specified parameter. In this section, we discuss how to determine the optimal value of this parameter. Let R be the number of total records and S be the capacity of each data section. Then, the following relation holds: 2 s 5 = R Therefore, changing B implies either changing R or changing S. The former concerns dynamic file systems and is not to be covered here. The latter is usually constrained 179 by the particular storage structure. Although this issue is machine-dependent, it would be instructive to gain some mathematical insight. Two more machine-dependent parameters, the time to locate the data section and the time to read it, need to be specified in order to proceed with the analysis. Let them be L and T, respectively. Thus, given the optimized expectation of data sections (denoted y*), the objective function with argument B is formulated as follows: r = (L + TR2-B)y* (5.19) In the following, the cases corresponding to the here-and-now and wait-and-see ap-proaches are discussed separately: 1. The here-and-now approach The optimal value determined by the here-and-now model is: y" = P(2%+C) (5.20) where C = [ n r i / ( s i ) l n The definition of Sj was given by Eq. (5.8). Substituting (5.20) into (5.19) leads to: t = (L + TR2~B)P(2" + C ) n Let the first derivative be zero, then we obtain the optimal solution of B: 180 2. The wait-and-see approach The optimal value determined by the wait-and-see model is: V*a = / ,{2* + [E[rj/(xj)]i}n (5-21) The meaning of Xj was described in Section 5.4. Substituting (5.21) into (5.19) leads to: t = (L + TR2-B)P{2% + [f[rjf(xj)]±y i=i Let the first derivative be zero, then we obtain the optimal solution of B: B* = ( ^ ) { n l o 9 2 ( ^ ) + ^092(rj) + f:iog2[f(xj)}} " + 1 L j=i j=i Then its moment can be determined through the process similar to that de-scribed in Section 5.4. In this section, we have briefly discussed how to optimize the capacity of data sections by means of the here-and-now and wait-and-see stochastic programming methods. The ST method is also applicable. In addition, the dynamic file model can be analyzed similarly. These topics are not to be discussed further, since the main subject of this section is on the optimization of field resolutions. The analytic methodology used in this section was initially described by Aho and Ullman [3] and was applied to the partial match retrieval problem. 181 5.7 Generalization to Uncertainty Models The uncertainty problems concern the situation in which there is an a priori prob-ability of the probability distributions themselves [64]. Conventional stochastic pro-gramming techniques do not deal with models related to such problems. Consider the following situation: There are several scenarios for the probability distributions. In scenario 1, the range distribution of field j follows the linear distribution; in scenario 2, it follows the uniform distribution; etc. By specifying a set of scenario probabilities, we can formulate a tracking model for this problem: minimize £ / . || { ^ P Y M ^ f M 2 ^ » 9 t'G? -{P{2* + [f[rj*f\\ (5.22) n subject to Y,b< = B (5-23) «=i The model characterized by (5.22) and (5.23) provide an adaptive approach to solve uncertainty problems arising from range query retrievals. 5.8 Summary Multiattribute hashing is an effective method dealing with partial match query re-trievals. In this chapter, optimization models for the range query retrievals are pro-posed by incorporating stochastic programming methodology. Three approaches are discussed. The here-and-now and the wait-and-see approaches seem to be applica-182 ble to the situation in which range queries follow a relative simple distribution. For the reason of computational complexity, the former seems to have advantage over the latter. Four paradigms are analyzed. The distribution models introduced have frequently been found in other relevant studies in computer sciences. One of them leads to a result that is similar to the one used in industry to evaluate the perfor-mance of disk drives. The scenario tracking approach appears to be more flexible and be appropriate for the problems where the probability distribution is not avail-able. It also provides an adaptive technique to formulate non-conventional stochastic programming problems. A stochastic programming software package is developed and implemented. The description is given in the Appendix. Numerical experiments produced the desired results. Future research may involve stochastic programming applied to range query prob-lems with general linear or nonlinear constraints. 183 Chapter 6 Optimization on the Bucket Distribution 6.1 Overview The performance of the multiattribute hashing system depends on how to order the data units on disks. In [15], Faloutsos proposed the following task: given a set of buckets containing records, determine their optimal or sub-optimal distribution on disks to minimize the retrieval time. With combinatorial analysis, he showed that the order determined by the Gray code can significantly improve the cluster distribution of the records. A major open question that remains is the lack of symmetric Gray codes to reduce the bias introduced by the reflected Gray code. In the model concerned, the multiattribute hashing scheme is applied to each record to generate an integer, called the record signature. The buckets are determined by a partition for the domain of record signatures. A cluster for a given query is a set 184 of consecutive, qualifying buckets, surrounded by non-qualifying ones. The number of clusters is considered the performance metric. The model is based on the following assumptions: • Each bucket occupies exactly one disk block; • the buckets are stored consecutively; • the overflow problem is not considered. A query can be represented by an n-tuple (si, 5 2 , . . . , s n ) , where s,- € {0,1, u} and is constructed by the multiattribute hashing function. Such a tuple is called a query signature. The element u stands for "unspecified item". For instance, we may have a query signature like "uulO" for n = 4; thus four buckets with the signature 0010,0110,1010 and 1110. The allocation by the binary code introduces four clusters, while the reflected Gray code introduces only two (see below). binary code reflected Gray code 0000 0000 0001 0001 0010 0011 0011 0010 0100 0110 0101 0111 0110 0101 0111 0100 1000 1100 1001 1101 1010 1111 1011 1110 1100 1010 1101 1011 1110 1001 1111 1000 Assuming that all 2" subsets for the binary reflected Gray code with order n are 185 equiprobable, Faloutsos [15] has showed that the improvement on the cluster number is up to 50% to the natural binary order. In the model discussed, the number of clusters determined by the reflected Gray code introduces an obvious bias over the attribute sets. For instance, the cluster distribution from left to right is (2,2,4,8) for n = 4. Therefore, other Gray codes with a better distribution need to be investigated. Since this topic involves some open questions in graph theory, a brief introduction may be helpful to explain the complexity. The hypercube Qn is an undirected regular graph G = (V, E) with | V | = 2 n and \E\ = n 2 n _ 1 . Qn can be labeled or unlabeled. In the former case, each vertex is labeled with a bit string (bib2 • • - bn), where 6,- € {0,1}. Two vertices of Qn are adjacent if and only if their labels differ in exactly one bit. A Hamiltonian path is a path that passes through every vertex in Qn exactly once. A Hamiltonian cycle Hn is a closed Hamiltonian path. Thus, mathematically, the distinct types of Gray codes are equivalent to the Hamiltonian problems in the hypercube Qn. The research on the Hamiltonicity of the hypercubes was initiated by Gilbert [23]. Based on the algebraic group theory, Gilbert classified the types of Hn (n > 2) into a finite number of classes: 1. Composite class, (a) ultracomposite subclass, (b) non-ultracomposite subclass, 186 2. non-composite class. By enumeration, Gilbert has shown that there are nine distinct types of unlabeled H4. Of them, five types are ultracomposite, including the reflected Gray code path, and three are non-ultracomposite. The last one is non-composite, which happens to present an even distribution of clusters, namely, (4, 4, 4, 4): #3 92 9i go 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 In Gilbert's work, the sequence of the coordinate places in which the changes occur is used as a compact notation to specify a particular h-cycle. For instance, if the left-most bit is specified as the first bit, then the coordinate sequence [4 3 4 2 1 4 1 3 1 2 3 4 3 2 1 2] 187 represents the above h-cycle. In modern database systems, a moderate size of the secondary key retrieval prob-lems usually needs n being above 4. Thus the Hn (n > 4) that present an even dis-tribution of clusters (symmetric Hn in brief) deserve to be studied. Gilbert proposed a set of necessary and sufficient conditions (although not completely constructive) for the ultracomposite subclass as well as a set of sufficient conditions for the non-composite class. It is quite obvious from the geometric view, however, that symmetric Hn have to be a subclass of non-composite class. Several years later, Mills [43] re-ported a symmetric Hs and proved that there exists a new class of Hn for n > 4 such that Hn does not traverse any Qk (k = 2 , . . . , n — 1). We will prove, however, that his construction is neither necessary nor sufficient for symmetric Recently, Faloutsos has also mentioned that systematic ways for designing symmetric Hn for n > 4 are not known [15]. It appears that the latest development relevant to this topic is re-ported in [28]. Some impressive results are presented in Table 6.1, where hn and Hn represent the Hamiltonian cycle in the labeled and unlabeled Qn, respectively. The analysis for n > 5 has not been reported, especially on the analysis of symmetrical Hn- Thus how to construct systematically symmetric Hn is still unsolved. From reviewing the literature of mathematics, we are aware that the Hamiltonic-ity problem of the hypercubes may lead to nontrivial graph-theoretic topics. Pay-ing too much attention to the theoretical mathematical issues is beyond the scope of the proposed project. In this chapter, a heuristic approach to produce a set of high-dimensional symmetric Gray codes is presented. By constructing special non-188 n number of Hn number of hn 2 1 1 3 1 6 4 9 1344 5 275065 906545760 Table 6.1: Number of Hamiltonian Cycles in Hypercubes composite Hn with a few local adjustments, we will derive symmetric Hn in the 5, 6 and 7-dimensional hypercubes. Our results show that Mil ls ' construction is neither necessary nor sufficient for symmetric Hn (n = 5,6,7). 6.2 Gray Coordinate Lattice and Symmetric Hn The Gray coordinate lattice is a two-dimensional lattice with the coordinates indexed according to the order determined by a Gray code. We begin the discussion by considering the symmetric HA proposed in [23]. Its path is shown in Fig. 6.1 in the Gray coordinate system. The planar graphical representation of H4 may suggest the configuration of since the six bits of the labels can be evenly assigned to two coordinate axes, three for each. Based on this intuition, a developed path is constructed as shown in Fig. 6.2. This path represents a six-dimensional Hamiltonian path in Q& and the corresponding 189 fl3g2 10 11 01 00 glgO 00 01 11 10 Figure 6.1: Symmetric H4 Gray code can be written as follows: 95 94 93 92 9i 5o 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 1 0 1 0 0 1 1 1 1 0 0 1 0 1 1 0 0 1 0 0 1 0 1 1 0 0 190 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 1 0 1 1 0 1 1 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 191 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 The Gray code shown above has a much better cluster distribution (8, 8, 16, 8, 8, g5g4g3 100 101 111 110 010 011 001 000 g2glg0 000 011 110 101 001 010 111 100 Figure 6.2: Approximate Symmetric H& 192 16) than the reflected Gray code that has a distribution (2, 2, 4, 8, 16, 32). A further improvement is achieved by conducting a few local adjustment only, since the original path exhibits an approximate symmetry. By rearranging four pairs of rows, we obtain a symmetrical Gray code with distribution (10, 10, 12, 10, 10, 12). The coordinate sequence corresponding to this Gray code is written as follows: [ 3 2 3 1 3 2 3 6 5 6 4 5 3 2 3 1 3 2 6 5 6 4 6 2 3 1 3 6 4 6 3 2 5 2 1 5 6 1 4 1 3 4 6 5 1 2 5 6 4 6 5 1 2 1 3 1 2 4 5 4 6 4 5 4 ] A similar approach can be applied to construct the balanced Hamiltonian cycles in five or seven-dimensional hypercubes. Since five or seven bits can not be separated into two parts evenly, the resulted graphical representations are not squares (Fig. 6.3, Fig. 6.4). 10 11 01 00 mm 000 011 110 101 001 010 111 100 g2glg0 Figure 6.3: Approximate Symmetric H5 The Gray code shown in Fig. 6.3 gives a cluster distribution (6, 8, 4, 6, 8) and its 193 coordinate sequence is written as follows: [ 2 1 2 5 3 4 3 5 3 4 2 1 2 5 2 1 4 1 2 5 3 2 3 1 3 5 4 1 4 2 4 5 ] The Gray code shown above has a much better cluster distribution than the reflected Gray code that has a distribution (2, 2, 4, 8, 16). To obtain the final symmetric Gray code with distribution (6, 6, 6, 6, 8), only one pair of rows needs to be rearranged. The Gray code shown in Fig. 6.4 gives a cluster distribution (12, 16, 30, 6, 12, 16, 36). It is better than than the reflected Gray code that has a distribution (2, 2, 4, 8, 16, 32, 64) and even further improvement is possible. The final distribution is (18, 18, 18, 18, 18, 18, 20) and the coordinate sequence corresponding to the adjusted code is written as follows: [ 2 1 3 1 2 1 3 5 4 6 4 5 4 7 5 4 5 6 5 4 5 7 2 1 3 1 2 1 3 7 5 6 5 7 3 2 3 1 5 1 3 2 3 6 4 3 2 3 4 1 4 6 5 1 5 7 1 6 1 3 2 3 6 7 5 7 3 7 5 7 2 7 5 7 3 1 3 7 4 5 4 7 6 7 4 7 1 7 2 7 6 2 3 6 1 6 5 6 4 1 4 6 3 2 6 2 1 2 6 2 5 2 4 6 4 7 4 6 4 5 4 7 6 2 1 2 3 7 ] The procedures for local adjusting are better demonstrated by the developed code lines. To save space, the detail is omitted. In this section, we have derived three balanced Hamiltonian cycles, for the 5, 6 and 7-dimensional hypercubes, respectively. The basic method is to construct a special non-composite h-cycle, followed by local adjustments. Finally, it should be mentioned that the constructed Hamiltonian cycles traverse Q2, due to the following lemma [43]: 194 Lemma 6.2.1 Let Pj, Pj+i, Pj+2, and P J + 3 be four consecutive vertices of a Hamil-tonian cycle in Qn. Let the coordinate sequence corresponding to the Hamiltonian cycle be (Ai,A2n). Then P , , P , + i , P J + 2 , and Pj+3 are the vertices of Q2 if and only if Aj = Aj+2. On the other hand, by starting from Mills ' symmetric H$ [43] and following Mil ls ' construction procedure, an H$ with distribution (9, 12, 11, 14, 14, 4) resulted. There-fore, the subclass proposed by Mills is neither necessary nor sufficient for symmetric Hn. g6g5g4 100 101 111 110 010 011 001 000 WW g3g2glg0 0000 0011 0110 0101 1100 1111 1010 1001 0001 .0010 0111 0100 1101 1110 1011 1000 Figure 6.4: Approximate Symmetric H7 6.3 Summary The symmetric Gray codes or the symmetric Hamiltonian paths in a high-dimensional hypercube can be applied to improve the performance of information retrieval sys-195 tems. The Haxniltonian path in the four-dimensional hypercube has been reported in the literature. In this chapter, a heuristic approach is proposed to construct the the symmetric Hamiltonian paths in the five-, six-, and seven-dimensional hypercubes. The same methodology may be applied to the n-dimensional (n > 8) cases. 196 Chapter 7 Conclusion The requirement of ordering.a set of multiattribute data arises frequently from spatial index processing and secondary key retrieval in modern information systems. The first application involves developing a single numerical index on a one-dimensional line for each point in multi-dimensional space, such that the spatial localizability can be preserved as best as possible. The two-dimensional Hilbert order has attained intensive interest in the literature due to its desirable performance. The lack of inex-pensive encoding and decoding algorithms is mentioned several times in publications. Especially, analytical formulas have not been reported. In this thesis, we propose analytical encoding and decoding formulas for the two-dimensional Hilbert order. The execution-time assessment suggests that they are faster than a look-up-table approach published previously [39]. Then, the encoding and decoding algorithms for the three-dimensional Hilbert order are discussed. An approach based on the rotational matrices is presented. Since a large number of 197 Boolean variables is involved, the Karnaugh-map technique is inappropriate to derive irreducible analytical formulas. Other systematic methods may lead to extensive mathematical operations. Therefore, for the three-dimensional case, it appears that the matrix approach is an acceptable method. The encoding and decoding processes for n-dimensional Hilbert orders (n > 4) would be considerably complicated and place a heavy computational burden on the application problems. Thus a new spatial order called the U order is proposed. Its major advantage is the significant simplicity of the encoding and decoding operations for multi-dimensional data. Performance assessments suggest that the U order is comparable with the Hilbert order and superior to many other orders. A performance analysis is followed by a formal mathematical description for the n-dimensional U order. Then a set of analytical encoding and decoding formulas are presented. Besides the U order, several new spatial orders with a quadrant-recursive structure are also investigated. The goal is to find the candidates that are competitive with the Hilbert order in performance and with relative simple encoding and/or decoding formulas. Of these new orders, the Q4 order behaves best and its performance is very close to that of the Hilbert order. A n analysis shows that its encoding and decoding algorithms only need 64.5% and 84.2% of operations of the corresponding algorithms of the Hilbert order. The second application, secondary key retrieval, involves two issues: determining the resolution for each dimension in a multi-dimensional hashed space and ordering data blocks on disks. Aho and Ullman proposed a constrained nonlinear programming 198 model applied to the partial match query. Faloutsos showed that the order determined by the reflected Gray code can significantly improve the cluster distribution of the data blocks. It is recommended to design symmetric Gray codes to reduce the bias introduced by the reflected Gray code. In this thesis, optimization models for the range query retrievals are presented by incorporating a stochastic programming methodology. Three approaches are dis-cussed. The here-and-now and the wait-and-see approaches seem to be applicable to the situation in which range queries follow a relative simple distribution. Four typical paradigms are analyzed. One of them leads to a result that is similar to the one used in industry to evaluate the performance of disk drives. The scenario tracking approach appears to be more flexible and more appropriate for the problems where the probability distribution is not available. It also provides an adaptive technique to formulate non-conventional stochastic programming problems. Then a stochastic pro-gramming software package is developed and implemented. Numerical experiments show the coordinating effect of the scenario tracking methodology. Finally, the symmetric Hamiltonian paths in a high-dimensional hypercube are discussed. A heuristic approach is applied to construct the symmetric Hamiltonian paths in the five, six, and seven-dimensional hypercubes. They are equivalent to the symmetrical Gray codes. In summary, in this thesis three major approaches applied to the ordering of multiattribute data are proposed: a set of analytical encoding and decoding formulas for the two-dimensional Hilbert order, two new spatial orders with significant simple 199 encoding and decoding formulas, and a set of optimization models for the range query-problem. The following issues are recommended as future research topics: 1. Quadtree and octree data structures based on the U order and the Q orders; 2. digital halftoning techniques based on the X order; 3. stochastic programming models of range query problems with general linear or nonlinear constraints. 200 Bibliography [1] K . Abdel-Ghaffar and A . Abbadi, Optimal Disk Allocation for Partial Match Queries, ACM Trans, on Database Syst., Vol. 18, No. 1, March 1993, pp. 132-156. [2] D. Abel and D. Mark, A Comparative Analysis of Some Two-dimensional Or-derings, Int. J. GIS., Vol. 4, No. 1, 1990, pp. 21-31. [3] A . Aho and J . Ullman, Optimal Partial Match Retrieval When Fields Are Inde-pendently Specified, ACM Trans. Database Syst., Vol. 4, No. 2, June 1979, pp. 168-179. [4] H . Anton and C. Rorres, Elementary Linear Algebra with Applications, 1987. [5] M . Avriel, Nonlinear Programming: Analyses and Methods, Prentice-Hall Co. Inc., 1976. [6] B . E . Beyer, A n Optimum Method for Two-Level Rendition of Continuous-Tone Pictures, International Conference on Communications, Conference Record, 1973, pp. (26-ll)-(26-15). 201 [7] T. Bially, Space-filling Curves: Their Generation and Their Application to Band-width Reduction, IEEE Trans. Inform. Theory, Vol. IT-15, Nov. 1969, pp. 658-664. [8] N . Bourbakis and C. Alexopoulos, Picture Data Encryption Using Scan Patterns, Pattern Recognition, Vol. 25, No. 6, 1992, pp. 567-581. [9] P. Burrough, Principles of Geographical Information Systems for Land Resources Assessment, Clarendon Press, Oxford, 1986. [10] A . Cole, Halftoning without Dither or Edge Enhancement, The Visual Computer, Vol. 7, 1991, pp. 232-246. [11] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms, McGraw-Hi l l , New York, 1990. [12] R. Dembo, Scenario Optimization, Annals of Operations Research, Vol . 30, 1991, pp. 63-80. [13] M . Dempster, Stochastic Programming, Academic Press Inc., New York, 1980. [14] J . Eastman, IDRISI Student Manual (Ver. 4.0, Rev. 1), Clark University, Mas-sachusetts, USA, 1992. [15] C. Faloutsos, Gray Codes for Partial Match and Range Queries, IEEE Trans, on Software Engineering, Vol. 14, No. 10, Oct. 1988, pp. 1381-1393. 202 [16] C. Faloutsos'and D. Metaxas, De-clustering Using Error Correcting Codes, Pro-ceedings of ACM Symposium on the Principles of Database Systems (Philadel-phia, Penn., March 29-31), A C M , New York, 1989, pp. 253-258. [17] C. Faloutsos and S. Roseman, Fractals for Secondary Key Retrieval, Proceedings of ACM Symposium on the Principles of Database Systems (Philadelphia, Penn., March 29-31), A C M , New York, 1989, pp. 247-252. [18] A . Fisher, A New Algorithm for Generating Hilbert Curves, Software-Practice and Experience, Vol. 16, January 1986, pp. 5-12. [19] H . Flegg, Boolean Algebra and Its Applications, John Wiley & Sons, Ltd. , 1964. [20] R. Fletcher, Practical Methods of Optimization, John Wiley & Sons, Ltd. , 1987. [21] R. Floyd and L. Steinberg, A n Adaptive Algorithm for Spatial Grey Scale, SID 75 Digest, Society for Information Display, 1975, pp. 36-37. [22] J . Foley, A . Van Dam, S. Feiner, and J . Hughes, Computer Graphics: Principles and Practice, Addison-Wesley, Reading, M A (1990) [23] E . Gilbert, Gray Codes and Paths on the n-cube, Bell System Technical Journal, Vol. 37, May 1958, pp. 815-826. [24] L . Goldschlager, Short Algorithms for Space-filling' Curves, Software Practice and Experience, Vol. 11, January 1981, pp. 99-100. [25] A . Graham, Kronecker Products and Matrix Calculus with Applications, Wiley, New York, 1981. [26] R. Hamming, Coding and Information Theory (2nd Ed.), Prentice-Hall, Engle-wood Cliffs, N J , 1986. [27] F . Harary, Graph Theory, Addison-Wesley Publishing Company, 1969. [28] F . Harary, J . Hayes and H . Wu, A Survey of the Theory of Hypercube Graphs, Comput. Math. Applic, Vol. 15, No. 4, 1988, pp. 277-289. [29] M . Hestenes, Multiplier and Gradient Method, Journal of Optimization Theory and Applications, Vol. 3, 1969, pp. 303-320. [30] D. Hilbert, Ueber die stetige Abbildung einer Linie auf ein Flaechenstueck, Math-ematische Annalen, Vol. 38, 1891, pp. 459-460. [31] H . Jagadish, Linear Clustering of Objects with Multiple Attributes, Proc. of the ACM SIGMOD Conf., Atlantic City, New Jersey, 1990, pp. 332-342. [32] J . Jarvis, C. Judice, and W. Ninke, A Survey of Techniques for the Display of Continuous Tone Pictures on Bilevel Displays, Computer Graphics and Image Processing, Vol. 5, 1976, pp. 13-40. [33] C. Judice, J . Jarvis, and W. Ninke, Using Ordered Dither to Display Continuous Tone Pictures on an A C Plasma Panel, Proceeding of the SID, Vol. 15, No. 4 (Fourth Quarter, 1974), pp. 161-169. [34] P. Ka i l , Stochastic Linear Programming, Springer-Verlag, Berlin Heidelberg, 1976. 204 [35] M . Karnaugh, The Map Method for Synthesis of Combinational Logic Circuits, Trans. AIEE, Part I, Vol. 72, No. 9, 1953, pp. 593-598. [36] D. Knuth, Digital Halftones by Dot Diffusion, ACM Trans, on Graphics Vol. 6, No. 4, Oct. 1987, pp. 245-273. [37] J . Limb, Design of Dither Waveforms for Quantized Visual Signals, Bell System Technical Journal, Vol. 48, Sept. 1969, pp. 2555-2582. [38] X . Liu and G. Slemon, A n Improved Method of Optimization for Electrical Machines, IEEE Trans, on Energy Conversion, Sept. 1991, pp. 492-496. [39] X . Liu and G. Schrack, Encoding and Decoding the Hilbert Order, accepted by Software Practice and Experience, February 1996. [40] J . Lloyd, Optimal Partial-Match Retrieval, BIT, Vol. 20, 1980, pp. 406-413. [41] J . Lloyd and K . Ramamohanarao, Partial-Match Retrieval for Dynamic Files, BIT, Vol. 22, 1982, pp. 150-168. [42] E. J . McCluskey, Minimization of Boolean Functions, Bell System Technical Journal, Vol. 35, pp. 1417-1444, November 1956. [43] W. Mills , Some Complete Cycles on the n-cube, Proc. Amer. Math. Soc, Vol. 14, 1963, pp. 640-643. [44] E . H . Moore, On Certain Crinkly Curves, Trans. Amer. Math. Soc, Vol. 1, 1900, pp. 72-90. 205 [45] G . Morton, A Computer Oriented Geodetic Data Base and a New Technique in File Sequencing, Tech. Rpt. I B M Ltd. , Ottawa, Canada, 1966. [46] A . Null , Space-filling Curves, or How to Waste Time with a Plotter, Software-Practice and Experience, Vol. 1, 1971, pp. 403-410. [47] J. Orenstein and T. Merrett, A Class of Data Structures for Associative Search-ing, Proceedings of the Third ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, Waterloo, Apri l 1984, pp. 181-190. [48] G. Peano, Sur une courbe qui remplit toute une aire plaine, Mathematische Annalen, Vol. 36, 1890, pp. 157-160. [49] M . Powell, A Method for Nonlinear Constraints in Minimization Problems, Op-timization: Symposium of the Institute of Mathematics and its Applications (ed. R. Fletcher), Academic Press, London, 1969. [50] F . Preparata and M . Shamos, Computational Geometry: An Introduction, Springer-Verlag, New York, 1985. [51] W. V . Quine, The Problem of Simplifying Truth Functions, Am. Math. Monthly, Vol. 59, No. 8, pp. 521-531, October 1952. [52] W. V . Quine, A Way to Simplify Truth Functions, Am. Math. Monthly, Vol. 62, No. 9, pp. 627-631, November 1955. 206 [53] K . Ramamohanarao, J . Lloyd, and J . Thorn, Partial-Match Retrieval Using Hashing and Descriptors, ACM Trans. Database Syst., Vol. 8, No. 4, Dec. 1983, pp. 552-576. [54] R. Rivest, Partial Match Retrieval Algorithms, SIAM J. Comput., Vol. 5, 1976, pp. 19-50. [55] R. Rockafellar, Augmented Lagrange Multiplier Functions and Duality in Non-convex Programming, SIAM, J. Control, Vol. 12, 1974, pp. 268-285. [56] A . Rosenfeld and J . Pfaltz, Distance Functions on Digital Pictures, Pattern Recognition, Vol. 1, pp. 33-61, 1968. [57] J . Rothnie and T. Lozano, Attribute Based File Organization in a Paged Memory Environment, Commun. ACM, Vol. 17, No. 2, Feb. 1974, pp. 63-69. [58] R. Strongin and Y . Sergeyev, Global Multidimensional Optimization on Parallel Computer, Parallel Computing, Vol. 18, 1992, pp. 1259-1273. [59] B . Salzberg, File Structures, Prentice-Hall Inc., Englewood Cliffs, N J , 1988. [60] H . Samet, The Design and Analysis of Spatial Data Structures, Addison-Wesley, Reading, M A , 1990. [61] H . Samet, Application of Spatial Data Structures: Computer Graphics, Image Processing, and GIS, Addison-Wesley, Reading, M A , 1990. 207 [62] G. Schrack, Fast encoding and decoding algorithms for quadtree and octree loca-tion codes, Tech. Rpt., Department of Electrical Engineering, The University of British Columbia, 1995, 19 pp. [63] G. Schrack, W. Wu, and X . Liu , A fast distance approximation algorithm for encoded quadtree locations, Proc. Canadian Conf. on Electrical and Comput. Eng., Vancouver, September 1993, pp. 1135-1138. [64] G. Tintner, The Pure Theory of Production under Technological Risk and Un-certainty, Econometrica, Vol . 9, 1941, pp. 298-304. [65] N . Wirth, Algorithm+Data Structures-Programs, Prentice-Hall Inc., Englewood Cliffs, N J , 1976. 208 Appendix A Implementation of Stochastic Programming Software and Numerical Experiments The general formulation for a nonlinear optimization model can be expressed as fol-lows: minimize f{X) subject to g(X) > 0 where / : R71 —> R},g : RJ1 —» R™. f(X) is called the objective function and g(X) the constraint function. The feasible region of X is represented by the region enclosed by g(X) = 0. Most effective methods for solving the optimization model above are known as indirect methods in that the constrained problem is transformed to an unconstrained 209 one. The solving methodology consists of two levels: • The technique of converting the constrained formulation to an unconstrained one, and • the technique of minimizing the unconstrained problem. One of the widely applied methods applied to level 1 is known as the exterior penalty function method ([5], [20]). The major advantage of the exterior penalty function method is that an infeasible starting point XQ is allowed. It has, how-ever, an inherent weakness. The value of the penalty factor, an interated parameter, may become very large to force the optimization process to converge. To overcome this disadvantage of the exterior penalty function method, Hestenes [29] and Pow-ell [49] independently proposed the augmented Lagrangian multiplier method. Later, Rockafellar [55] adapted the augmented Lagrangian multiplier method for inequality constrained problems and established some general computational properties. In augmented Lagrangian multiplier method, the augmented objective function is formulated as: L(X,r,h) = f(X) + rJ2\\ minimum[0, 9 I(X) +.A,-/r] || t=i where r and ft,- are the penalty factor and the multiplier, respectively. The augmented Lagrangian multiplier method shares with the exterior penalty function method the property that an infeasible starting point is allowed but does not suffer the problems the exterior penalty function method exhibits. As a relative new technique in optimization, augmented Lagrangian multiplier method has not 210 yet been widely applied to solve practical problems [38]. In this thesis, augmented Lagrangian multiplier method is implemented to solve the range query optimization model. At level 2 of the solving process, the conjugate direction method ([5], [20]) is imple-mented. The conjugate direction method has obtained a good reputation regarding its robustness for solving complex nonlinear problems. It is expected to be appropri-ate for the range query optimization model because of the basic product formulation in the model. The conjugate direction method requires searching the optimal points along sev-eral directions in the n-dimensional space. The parabolic interpolation method is chosen for carrying out this task as it exhibits good performance when incorporated in the conjugate direction method. The three major numerical algorithms described above as well as a few working subroutines are implemented to form a compact soft-ware package. Fortran is used as the implementation language since it is appropriate for scientific calculations. The optimization package consists of a main program and six subroutines: ALMM, CDM, PIM, XSTEP, RE, and STOCK. A brief description follows: 1. STOCH — The subroutine incorporating stochastic parameters into the construc-tions of the objective function, the constraint function(s), and the argumented objective function. 2. XSTEP — The subroutine iterating the argument X and calling STOCH. 3. RE — The subroutine swapping data. It is a working routine to simplify the 211 program structure. 4. PIM — The subroutine executing the parabolic interpolation process and calling XSTEP and RE. 5. CDM — The subroutine performing the unconstrained minimization. Specifically, its functions are: (a) Call PIM to seek the optimal points along the initial set of conjugate di-rections; (b) after searching n directions, construct the (n + l)-th direction and use Powell's criterion to determine if it is worthwhile choose this new direction to replace one of the previous directions; (c) check the convergence condition. 6. ALMM — the subroutine performing the augmented Lagrangian multiplier method. Specifically, its functions are: (a) Construct the initial set of conjugate directions; (b) call STOCH to calculate the objective function's value at the starting point to estimate the base value used for normalization; (c) call CDM; (d) check the convergence condition. The structure of the software is shown in Fig. A . l . 212 MAIN ALMM CDM 1 PIM XSTEP STOCH Figure A . l : The Structure of the Software Numerical experiments were conducted on a SPARC Station IPX. A typical result is presented below. The total number of sections is specified to be 2 2 0 . The number of fields chosen is 5 (i. e. n = 5). Three scenarios are considered. The corresponding input data are recorded in Table A . l and A.2. The optimal values of the single scenario model (Eq. (5.16)), j / * , are presented in Table A.3 , where / , is the specified probability of scenario s. The optimal value obtained by minimizing the scenario tracking model (5.17) is y* = 251389. The compensated amounts are 23.8%, -14.7%, and -21.0% for scenario 1, 2, and 3, respectively. 213 Scenario Pi P2 P3 P4 Ps 1 0.8 0.1 0.9 0.2 0.5 2 0.7 0.2 0.8 0.1 0.4 3 0.6 0.2 0.7 0.2 0.5 Table A . l : Input Data Set 1: Frequencies of Fields to be Specified Scenario l i 2 " 3 U4 " 5 1 0.5 0.5 0.5 0.5 0.5 2 0.4 0.6 0.7 0.3 0.5 3 0.8 0.4 0.5 0.5 0.7 Table A.2: Input Data Set 2: Weights of Range Queries Scenario v.* 1 0.3 203068 2 0.4 294571 3 0.5 318367 Table A.3: Optimal Values of the Single Scenario Model 214
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- On the ordering of multiattribute data in information...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
On the ordering of multiattribute data in information retrieval systems Liu, Xian 1996
pdf
Page Metadata
Item Metadata
Title | On the ordering of multiattribute data in information retrieval systems |
Creator |
Liu, Xian |
Date Issued | 1996 |
Description | The requirement of ordering a set of multiattribute data arises frequently from spatial index processing and secondary key retrieval in modern information systems. The first application involves developing a single numerical index on a one-dimensional line for each point in multi-dimensional space, such that the spatial localizability can be preserved as best as possible. This work can be carried out by a mathematical transformation called spatial ordering. The Hilbert order has attained intensive interest in the literature due to its desirable performance. The lack of inexpensive encoding and decoding algorithms has been mentioned frequently in publications. The second application, secondary key retrieval, involves two issues: determining the resolution for each dimension in a multi-dimensional hashed space and ordering data blocks on disks. Aho and Ullman proposed a constrained nonlinear programming model applied to the partial match query. However, the literature is relatively silent on the range query retrieval. Faloutsos showed that the order determined by the reflected Gray code can significantly improve the cluster distribution of the data blocks. It is recommended to design symmetric Gray codes to reduce the bias introduced by the reflected Gray code. In this thesis, efficient analytical formulas are first derived to encode and decode the two-dimensional Hilbert order. Then the algorithms for the three-dimensional Hilbert order are discussed. The encoding and decoding processes for n-dimensional Hilbert orders (n > 4) would be considerably complicated and place a heavy computational burden on the application problems. Thus a new spatial order called the U order is proposed. Its major advantage is the significant simplicity of the encoding and decoding operations for multi-dimensional data. Performance assessments suggest that the U order is comparable with the Hilbert order and superior to most other orders. The performance analysis is followed by a formal mathematical description for the n-dimensional U order. Then a set of analytical encoding and decoding formulas are presented. Besides the U order, several new spatial orders with the quadrant-recursive structure are also investigated. The goal is to find the candidates that are competitive with the Hilbert order in performance and with relative simple encoding and/or decoding formulas. Of these new orders, the Q4 order behaves best and its performance is very close to the Hilbert order. An analysis shows that its encoding and decoding algorithms only need 64.5% and 84.2% of operations of the corresponding algorithms of the Hilbert order. The optimization of resolutions of multiattribute file systems is then discussed. Optimization models for the range query retrievals are presented by incorporating stochastic programming methodology. Three approaches axe discussed. The hereand- now and the wait-and-see approaches seem to be applicable to the situation in which range queries follow a relative simple distribution. Four typical paradigms are analyzed. One of them leads to a result that is similar to the one used in industry to evaluate the performance of disk drives. The scenario tracking approach appears to be more flexible and more appropriate for the problems where the probability distribution is not available. It also provides an adaptive technique to formulate non-conventional stochastic programming problems. Then a stochastic programming software package is developed and implemented. Numerical experiments show the coordinating effect of the scenario tracking methodology. Finally, the symmetric Hamiltonian paths in a high-dimensional hypercube are discussed. A heuristic approach is applied to construct the symmetric Hamiltonian paths in the five-, six-, and seven-dimensional hypercubes. They are equivalent to the symmetrical Gray codes. In summary, this thesis proposes three major approaches applied to the ordering of multiattribute data: a set of analytical encoding and decoding formulas for the twodimensional Hilbert order, two new spatial orders with significant simple encoding and decoding formulas, and a set of optimization models for the range query problem. |
Extent | 6341419 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2018-05-07 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0065151 |
URI | http://hdl.handle.net/2429/4787 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1996-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1996-091201.pdf [ 6.05MB ]
- Metadata
- JSON: 831-1.0065151.json
- JSON-LD: 831-1.0065151-ld.json
- RDF/XML (Pretty): 831-1.0065151-rdf.xml
- RDF/JSON: 831-1.0065151-rdf.json
- Turtle: 831-1.0065151-turtle.txt
- N-Triples: 831-1.0065151-rdf-ntriples.txt
- Original Record: 831-1.0065151-source.json
- Full Text
- 831-1.0065151-fulltext.txt
- Citation
- 831-1.0065151.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0065151/manifest