"Science, Faculty of"@en . "Computer Science, Department of"@en . "DSpace"@en . "UBCV"@en . "Cai, Yuhan"@en . "2009-11-21T20:26:25Z"@en . "2004"@en . "Master of Science - MSc"@en . "University of British Columbia"@en . "In this thesis, we investigate the subject of indexing large collections of spatiotemporal\r\ntrajectories for similarity matching. Our proposed technique is to first mitigate\r\nthe dimensionality curse problem by approximating each trajectory with a low order\r\npolynomial-like curve, and then incorporate a multidimensional index into the reduced\r\nspace of polynomial coefficients. There are many possible ways to choose the\r\npolynomial, including Fourier transforms, splines, non-linear regressions, etc. Some\r\nof these possibilities have indeed been studied before. We hypothesize that one of\r\nthe best approaches is the polynomial that minimizes the maximum deviation from\r\nthe true value, which is called the minimax polynomial. Minimax approximation is\r\nparticularly meaningful for indexing because in a branch-and-bound search (i.e., for\r\nfinding nearest neighbours), the smaller the maximum deviation, the more pruning\r\nopportunities there exist. In general, among all the polynomials of the same degree,\r\nthe optimal minimax polynomial is very hard to compute. However, it has been\r\nshown that the Chebyshev approximation is almost identical to the optimal minimax\r\npolynomial, and is easy to compute [32]. Thus, we shall explore how to use\r\nthe Chebyshev polynomials as a basis for approximating and indexing d-dimensional\r\n(d\u00E2\u0089\u00A51) trajectories. ,\r\nThe key analytic result of this thesis is the Lower Bounding Lemma. That is,\r\nwe show that the Euclidean distance between two d-dimensional trajectories is lower\r\nbounded by the weighted Euclidean distance between the two vectors of Chebyshev\r\ncoefficients. 
Indexing Spatiotemporal Trajectories with Chebyshev Polynomials

by

Yuhan Cai

B.Sc. (Honours), The University of British Columbia, 2002

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia

April 2004

© Yuhan Cai, 2004

Library Authorization

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.
Yuhan Cai          27/04/2004
Name of Author (please print)          Date (dd/mm/yyyy)

Degree: Master of Science          Year: 2004

Department of Computer Science
The University of British Columbia
Vancouver, BC, Canada

Abstract

In this thesis, we investigate the subject of indexing large collections of spatiotemporal trajectories for similarity matching. Our proposed technique is to first mitigate the dimensionality curse problem by approximating each trajectory with a low-order polynomial-like curve, and then incorporate a multidimensional index into the reduced space of polynomial coefficients. There are many possible ways to choose the polynomial, including Fourier transforms, splines, non-linear regressions, etc. Some of these possibilities have indeed been studied before. We hypothesize that one of the best approaches is the polynomial that minimizes the maximum deviation from the true value, which is called the minimax polynomial. Minimax approximation is particularly meaningful for indexing because in a branch-and-bound search (i.e., for finding nearest neighbours), the smaller the maximum deviation, the more pruning opportunities there exist. In general, among all the polynomials of the same degree, the optimal minimax polynomial is very hard to compute. However, it has been shown that the Chebyshev approximation is almost identical to the optimal minimax polynomial, and is easy to compute [32]. Thus, we shall explore how to use the Chebyshev polynomials as a basis for approximating and indexing d-dimensional (d ≥ 1) trajectories.

The key analytic result of this thesis is the Lower Bounding Lemma. That is, we show that the Euclidean distance between two d-dimensional trajectories is lower bounded by the weighted Euclidean distance between the two vectors of Chebyshev coefficients. This lemma is not trivial to show, and it ensures that indexing with Chebyshev coefficients admits no false negatives.
To complement the analytic result, we conduct comprehensive experimental evaluation with real and generated 1-dimensional to 4-dimensional data sets. We compare the proposed scheme with the Adaptive Piecewise Constant Approximation (APCA) scheme. Our preliminary results indicate that in all situations we test, Chebyshev indexing dominates APCA in pruning power, I/O and CPU costs.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgments
Dedication

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
    1.2.1 Similarity Model
    1.2.2 Data Representation
    1.2.3 Index Structure
  1.3 Contributions
  1.4 Thesis Outline

2 Related Works
  2.1 Indexing 1-dimensional Time Series
    2.1.1 Whole Matching
    2.1.2 Subsequence Matching
  2.2 Indexing Multidimensional Trajectories

3 Background: Chebyshev Polynomials
  3.1 Trigonometric Definitions and Recurrences
  3.2 Approximation Theory
  3.3 Orthogonality
  3.4 Series Expansions

4 Indexing with No False Negatives
  4.1 Assumptions
  4.2 Chebyshev Approximation of a Time Series
  4.3 An Example
  4.4 A Metric for Chebyshev Coefficients
  4.5 The Lower Bounding Lemma
  4.6 Extension to the Weighted Euclidean Framework

5 Indexing Multidimensional Trajectories
  5.1 Lower Bounding for the Multidimensional Case
  5.2 Algorithms for Building and Searching a Single Index
  5.3 Algorithms for Building and Searching Multiple Indices
  5.4 Properties of Chebyshev Indexing

6 Experimental Evaluation
  6.1 Data Sets and Programs Used
  6.2 Comparison Criteria: Pruning Power and Search Time
  6.3 Pruning Power Comparison: Real Data Sets
  6.4 Building Time and the Choice of n
  6.5 On Scalability: Generated Data
  6.6 Comparisons with Indexing Included
    6.6.1 I/O Cost Comparison
    6.6.2 CPU Cost Comparison
    6.6.3 Recommendations
  6.7 A Single Index vs.
    Multiple Indices
  6.8 Subsequence Matching

7 Conclusions

Bibliography

List of Tables

4.1 Maximum Deviations for Different Approximation Schemes
6.1 Data Sets Used

List of Figures

2.1 Algorithm to Compute the APCA Representation of a Time Series
3.1 Chebyshev Polynomials
4.1 Summary of Notation
4.2 A Comparison of Approximation Schemes (n = 4)
4.3 A Comparison of Approximation Schemes (n = 8)
5.1 Algorithm for Building a Single Index of Chebyshev Coefficients
5.2 Algorithm for a Range Search in a Single Index
5.3 Algorithm for a kNN Search in a Single Index
5.4 Algorithm for Building Multiple Indices of Chebyshev Coefficients
5.5 Algorithm for a Range Search in Multiple Indices
5.6 Algorithm for a kNN Search in Multiple Indices
6.1 Pruning Power Comparisons: Real 1- to 4-Dimensional Data Sets
6.2 Computing Chebyshev Coefficients
6.3 Scalability: Pruning Power and Building Time
6.4 Search Time Comparison: Indexing Included
6.5 A Single Index vs. Multiple Indices (n = 3)
6.6 A Single Index vs. Multiple Indices (n = 5)
6.7 Subsequence Matching: Pruning Power, I/O Costs and CPU Cost

Acknowledgments

First of all, I would like to express my gratitude to my supervisor, Dr. Raymond T. Ng, for his rewarding guidance and constant support throughout the course of my work. It was his supervision and inspiration that not only enabled me to complete this thesis but also led me in the right direction of research. Additionally, special thanks go to both Dr. Eamonn Keogh at the University of California, Riverside, for his APCA code and experimental data sets, and Dr. Christos Faloutsos at Carnegie Mellon University for his DR-tree package. I would also like to thank Dr. Jason Harrison and Dr. Michiel van de Panne for providing me with their motion capture data.
I did most of my research in the Database Systems Laboratory, where there was an enlightening and stimulating academic environment. My appreciation is extended to all my colleagues for their friendship and help. I owe many thanks to Dr. Edwin Knorr for his valuable comments and precious suggestions on my thesis. Moreover, for two years, my studies were mainly funded by scholarships from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Institute of Robotics and Intelligent Systems (IRIS). Finally, I am indebted to my parents for their love, education and encouragement during the eighteen years of my schooling.

YUHAN CAI

The University of British Columbia
April 2004

To my parents.

Chapter 1

Introduction

A spatiotemporal trajectory is a time-stamped sequence of vectors representing space and/or time information. More formally, a d-dimensional trajectory is an ordered collection S of the form

S = ((t_1, v_1), (t_2, v_2), ..., (t_N, v_N))

where:

• N is the length of the trajectory S.
• t_1 < t_2 < ... < t_N are time stamps.
• Each vector v_i is of arity d for all 1 ≤ i ≤ N.
• Each pair (t_i, v_i) records the values of a vector of d scalars at time t_i.

For example, if d = 1, then the trajectory is a time series. For a second example, v_i may capture the 2-dimensional or 3-dimensional coordinates of a moving object at time t_i, in which case we have a spatiotemporal trajectory. For yet another example, a trajectory may represent the change of the attributes or features of an entity over time.

1.1 Motivation

Time series are ubiquitous in temporal databases, which is a well-established area in database studies [16].
Massive time series or sequence data sets arise naturally in a variety of real-world applications, such as medical informatics, meteorology, stock markets, and image/video databases. For example, doctors monitor the health conditions of their patients by keeping track of their body temperatures, geologists record monthly or annual rainfall data for weather forecasting, and financial analysts try to find patterns in their large pools of stock prices of different companies.

There are also many large collections of higher-dimensional spatiotemporal trajectories, thanks in part to the development of cost-effective mobile technologies, such as Geographic Information Systems, wireless communication electronics, and multimedia applications [14, 43, 56]. Examples include spatiotemporal trajectories of cars, airplanes, and other moving objects generated by motion tracking devices in surveillance applications and electronic games applications. Additionally, a video stream can also be regarded as a multidimensional trajectory, as it consists of a sequence of multiple frames, each of which is characterized by a set of feature attributes.

Specifically, as part of our collaboration with an electronic games company, we encounter large collections of 2-, 3- and 4-dimensional spatiotemporal trajectories. A 2-dimensional example is the coordinates of National Football League (NFL) players moving on a football field, or of National Hockey League (NHL) players skating on an ice rink. A 3-dimensional example is the positions of aircraft during a flight simulation. Finally, a 4-dimensional example is the four angles of body joints of a person playing kung-fu or dancing. Such data sets are useful for games developers and medical professionals. The point here is that beyond 1-dimensional time series, applications of higher-dimensional spatiotemporal trajectories are very common.
Given those enormous databases of trajectories, what can we do with them, and how do we retrieve valuable information from them? One of the fundamental operations in mining trajectories is similarity matching, which refers to the process of finding trajectories that are similar to a given query trajectory. Similarity matching is useful in two aspects. First, it is a subroutine of many data mining tasks, such as classification, clustering, rule discovery, outlier detection, and query by contents. Second, it is important in its own right for exploratory data analysis. The following are typical similarity queries:

• Identify companies that have sales patterns similar to Microsoft's.
• Find out if a given musical score is similar to any of the existing scores.
• Discover all images that contain regions similar to regions of a given image.

1.2 Problem Statement

The problem of retrieving similar trajectories can be formulated as follows: given a reference trajectory database DB, a distance measure Dist, a query trajectory q, and a positive number r, find the set R of trajectories that are within distance r of q, or more precisely:

R = {x ∈ DB | Dist(x, q) ≤ r}    (1.1)

This is called a range query or radius search. Alternatively, one might be interested in finding the k nearest neighbours (kNN) of q, which is equivalent to setting r so that |R| = k.

Similarity-based pattern querying has three major components: the similarity model that defines a distance measure between trajectories, the data representation that abstracts features from raw data sets, and the index structure that enables efficient searching for the closest matches.

1.2.1 Similarity Model

Many similarity distance functions have been studied in the literature, and which one is the "best" always depends on the specific user, data set and task.
In general, they can be classified into two categories: metric functions and non-metric functions. A distance function D is a metric if it satisfies the following requirements:

• Symmetry: D(a, b) = D(b, a).
• Non-negativity: D(a, b) > 0 if a ≠ b, and D(a, b) = 0 if and only if a = b.
• Triangle Inequality: D(a, b) ≤ D(a, c) + D(c, b).

In most cases, a metric function is desired, because the triangle inequality can then be used to prune the index during search. The most popular distance metric is the Lp-norm.

Definition 1.1 Given two d-dimensional trajectories

tu = ((t_1, u_1), (t_2, u_2), ..., (t_N, u_N)) and tv = ((t_1, v_1), (t_2, v_2), ..., (t_N, v_N)),

the Lp-norm distance between them is:

Dist_p(tu, tv) = [ Σ_{i=1}^{N} Σ_{j=1}^{d} |u_i^j − v_i^j|^p ]^{1/p}    (1.2)

where u_i^j denotes the j-th component of the vector u_i. It is called the Manhattan distance Dist_man if p = 1, the Euclidean distance Dist_euc if p = 2, and the Max distance Dist_max if p = ∞.

A simple variant of (1.2) is the weighted Lp-norm defined by:

Dist_{W,p}(tu, tv) = [ Σ_{i=1}^{N} Σ_{j=1}^{d} W_i^j |u_i^j − v_i^j|^p ]^{1/p}    (1.3)

where W is a matrix of (nonnegative) weights for different points on different trajectories.

While the Euclidean distance Dist_euc is used in most existing studies, it is nevertheless insufficient for all situations because:

• It works only for trajectories of the same length.
• It cannot handle outliers or noise.
• It is very sensitive to scale or amplitude.
• It does not work well with trajectories that are similar in shape, but out of phase.
• It does not allow stretching or compression of the time axis.
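To make Definition 1.1 concrete, here is a minimal Python sketch. The function name and the list-of-tuples encoding of the vectors u_i and v_i are our own choices for illustration; the aligned time stamps are omitted since they do not enter the sum.

```python
def lp_distance(tu, tv, p=2):
    """Lp-norm distance of Definition 1.1 between two equal-length
    d-dimensional trajectories, each given as a list of d-tuples
    (the vectors u_i and v_i)."""
    assert len(tu) == len(tv), "Definition 1.1 requires equal lengths"
    total = sum(abs(uj - vj) ** p
                for u, v in zip(tu, tv)
                for uj, vj in zip(u, v))
    return total ** (1.0 / p)

# Two 2-dimensional trajectories of length N = 2.
tu = [(0.0, 0.0), (3.0, 4.0)]
tv = [(0.0, 0.0), (0.0, 0.0)]
print(lp_distance(tu, tv, p=2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
print(lp_distance(tu, tv, p=1))  # Manhattan: 3 + 4 = 7.0
```

Setting p = 2 gives the Euclidean distance Dist_euc adopted throughout this thesis.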
As a result, many attempts have been made to come up with distance functions that are invariant with respect to six transformations: shifting, uniform amplitude scaling, uniform time scaling, uniform bi-scaling, time warping and non-uniform amplitude scaling. Unfortunately, none of them is a metric. Some of the best-known distance notions are:

• Dynamic Time Warping (DTW) [4, 19]: The idea is to use dynamic programming to construct the warping path in the distance matrix that minimizes the warping cost, and then define the distance as the minimized cost. In effect, it allows shifting and stretching in order to align trajectories.

• Longest Common Subsequence (LCSS) [52, 53]: As a variant of edit distance, it describes how well two trajectories can match one another, by allowing them to stretch and to translate in space, without any rearrangements of the sequence of elements. One of its advantages is that it is robust to noise, as it gives more weight to the similar portions and pays less attention to regions of great dissimilarity.

• Landmark Model [25, 35]: Generally speaking, it identifies points of "great importance" as landmarks, based on which the similarity patterns are defined. For example, first-order landmarks are global or local extrema, second-order landmarks are inflection points, and so on. It is claimed to better match human intuition and episodic memory, as it takes smoothing into account by letting certain landmarks be overshadowed by others.

In this thesis, we adopt the Euclidean distance function Dist_euc for spatiotemporal trajectories. While this distance function is easy to compute, it is natural for many applications of spatiotemporal trajectories, including those for airplanes and other flying objects. Additionally, it allows scalable solutions to other problems such as clustering and classification.
It is also the distance function adopted by most studies on indexing time series, including [21]. For more advanced distance functions such as time warping [4] and longest common subsequence [53], we consider them future topics of investigation.

1.2.2 Data Representation

It is not necessary to separate data representation from the similarity model, but most previous works did, as an abstract representation permits more efficient computations than the raw data could, and may allow for an even more sophisticated indexing technique. While we shall discuss related works in greater detail in Chapter 2, it suffices to say that most existing frameworks are based on piecewise approximations, where each piece is either constant or linear. However, recall that, among the examples cited in Section 1.1, one thing in common is that they have smooth and continuous trajectories. This is because all those activities (e.g., human movement, flying objects) are governed by the laws of physics, giving rise to smooth motion trajectories. Under a piecewise framework, then, a smooth and continuous trajectory is approximated with a piecewise discontinuous function. This mismatch may cause an unnecessary error or deviation, and may lead to a loss in pruning power in a branch-and-bound search.

In this thesis, we seek to approximate and index a d-dimensional spatiotemporal trajectory with a low-order continuous polynomial-like curve. There are many possible ways to choose the polynomial, including (continuous) Fourier transforms, splines, non-linear regression, and so on. While no approximation is exact by definition, the approximation that minimizes the maximum deviation from the true value is very desirable. This is called the minimax approximation. Minimax approximation is particularly meaningful for indexing because in a branch-and-bound search (i.e., for finding nearest neighbours), the smaller the maximum deviation, the more pruning opportunities there exist.
However, in general, among all the polynomials of the same degree, the optimal minimax polynomial is very hard to compute. It has been shown that the Chebyshev approximation is almost identical to the optimal minimax polynomial, and is easy to compute [32]. Thus, we shall explore how to use the Chebyshev polynomials as a basis for indexing d-dimensional trajectories.

1.2.3 Index Structure

Indexing is one of the many searching techniques available for similarity matching. If a searching mechanism retrieves a (proper) subset S of R, then the wrongly dismissed trajectories in R − S are called false dismissals or false negatives. On the other hand, if S is a (proper) superset of R, then the wrongly retrieved trajectories in S − R are called false alarms or false positives. As we can always remove false positives in a post-processing stage, they can be tolerated as long as there are not too many of them. Searching techniques that guarantee no false negatives are said to be exact; however, there are studies which consider providing faster approximate search at the expense of allowing both false positives and negatives [46, 25].

The most obvious brute-force solution for similarity matching would be a sequential scan of the whole database, in which we compute the distance between every trajectory x ∈ DB and q, and return x if it qualifies. This approach requires that we access every single page in the database, which is clearly unrealistic for large data sets. Any mechanism that avoids retrieving all the data pages could potentially increase the speed of the search, which naturally leads to the idea of using an index. While we will discuss the existing indexing schemes in Chapter 2 and propose our own scheme in Chapter 4 and Chapter 5, here we shall give an outline of the desirable properties for any indexing technique [9]:

• It should be faster than sequential scanning.
• It should incur little space overhead.
• It should be able to handle queries of different lengths.
• It should be incremental, that is, it should allow insertions and deletions without rebuilding the index.
• It should guarantee no false negatives.

In addition, two other desirable criteria [21] are:

• It should be possible to build the index in a reasonable time.
• It should be able to handle more than one distance measure.

1.3 Contributions

As a preview, we make the following contributions in this thesis:

• Recall that a spatiotemporal trajectory is of the form ((t_1, v_1), ..., (t_N, v_N)). Thus, it is discrete in nature. We show how to approximate such a discrete "function" with Chebyshev polynomials. We begin with the 1-dimensional case of time series. Our representation scheme allows us to prove a main result of this thesis, the Lower Bounding Lemma: the true distance between two time series is lower-bounded by the distance in the index space (i.e., the space of Chebyshev coefficients in our case). As shown in Chapter 4, this is not a trivial result to prove.

• We generalize from the 1-dimensional case to the d-dimensional case (d ≥ 1). Specifically, a d-dimensional trajectory is projected onto each dimension to create d 1-dimensional trajectories. We show that this projection preserves the Lower Bounding Lemma. We also give algorithms for building an index of Chebyshev coefficients, and for supporting similarity searching of whole trajectories.

• To evaluate the effectiveness of the minimax property of Chebyshev polynomials on indexing, we conduct an extensive experimental evaluation. We use 1- to 4-dimensional real data sets, as well as generated (i.e., synthetic) data sets.
For time series, the Adaptive Piecewise Constant Approximation (APCA) scheme has been shown to outperform all other schemes, including the Discrete Fourier Transform (DFT), the Discrete Wavelet Transform (DWT) and Piecewise Aggregate Approximation (PAA) [21]. We obtain the APCA code from Keogh et al., and compare it with Chebyshev approximation. We also extend APCA to d-dimensional situations as a "straw man" strategy.

As a preview of our results, on 1- to 4-dimensional, real and generated data, Chebyshev dominates APCA in pruning power, I/O cost and CPU cost. Our empirical results indicate that Chebyshev approximation can deliver a 3- to 5-fold reduction in the dimensionality of the index space. For instance, it takes only 4 to 6 Chebyshev coefficients to deliver the same pruning power produced by 20 APCA coefficients. This is a very important advantage. As the dimensionality curse on the indexing structure is bound to set in sooner or later, Chebyshev coefficients are far more effective than APCA in delivering additional pruning power before that happens.

1.4 Thesis Outline

The thesis is organized as follows. In Chapter 2, we discuss related works. In Chapter 3, we review Chebyshev polynomials and the properties central to the development of this thesis. In Chapter 4, we show how to approximate a time series with a Chebyshev polynomial, and give an example. We also propose a metric distance function between two vectors of Chebyshev coefficients. Finally, we prove the Lower Bounding Lemma. In Chapter 5, we generalize the earlier results for time series to deal with d-dimensional trajectories. In Chapter 6, we present our experimental setup and results. We compare Chebyshev and APCA with respect to pruning power, I/O costs and CPU costs.

Chapter 2

Related Works

2.1 Indexing 1-dimensional Time Series

Substantial efforts have been made on the problem of indexing one-dimensional time series.
There are basically two ways to pose a similarity query:

• Whole Matching: Given a collection DB of M time series, each of length N, and a query time series Q of the same length, we want to find those time series that are within distance r of Q. Mathematically, we seek the set

R = {S ∈ DB | Dist(S, Q) ≤ r}

Note that every time series, including Q, must have the same length N.

• Subsequence Matching: Given a collection DB of M time series S_1, ..., S_M, where |S_i| = N_i for 1 ≤ i ≤ M, and a query time series Q of length |Q| = N_Q ≤ min_i N_i, we want to find all subsequences of the series in DB that are within distance r of Q.

The GEMINI framework reduces matching in the original space to a search in a reduced feature space: each time series S is mapped to a low-dimensional feature vector F(S), and the feature-space distance is required to lower-bound the true distance. Under that condition, any qualifying object O with Dist_true(Q, O) ≤ r satisfies Dist_feature(F(Q), F(O)) ≤ r. This is clear since Dist_feature(F(Q), F(O)) ≤ Dist_true(Q, O) ≤ r.

Intuitively, this theorem means that if two objects are far apart in the feature space, then they must be far apart in the original space. The performance of GEMINI methods depends solely on the tightness of the lower bound. The closer Dist_feature is to Dist_true, the fewer false positives there are, and the more efficient the algorithm will be.

In the following sections, we shall review the existing dimensionality reduction techniques. Note that they are also applicable to the subsequence matching algorithm to be presented in Section 2.1.2. In general, these studies can be divided into the following categories based on the underlying approximation schemes:

• DFT: [9, 3, 40, 8]
• DWT: [7, 58, 17, 37]
• PAA: [24, 59]
• APCA: [21]
• SVD: [29, 18, 24]

Discrete Fourier Transform (DFT)

The first dimensionality reduction technique proposed for indexing time series in the literature is the Discrete Fourier Transform. The basic idea is that any realistic signal can be characterized by the superposition of a finite number of sine/cosine waves, where each wave is represented by a single complex number known as a Fourier coefficient. The key observation is that a signal of length N can be decomposed into N sine/cosine waves that can be recombined into the original signal, and many Fourier coefficients have a very low amplitude and therefore can be discarded without much loss of information in the reconstruction process.
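This truncation idea can be sketched in a few lines of Python. The code below is our own illustration, not part of the thesis; it uses a naive O(N²) transform for clarity, normalized by 1/√N so that, by Parseval's theorem, the signal and its coefficient vector have the same energy.

```python
import cmath
import math

def dft(x):
    """Naive N-point DFT, normalized by 1/sqrt(N) so that the Euclidean
    norm of the signal equals that of its coefficients (Parseval)."""
    N = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * F * i / N)
                for i in range(N)) / math.sqrt(N)
            for F in range(N)]

# A pure sine of length 16: all of its energy sits in coefficients 1 and
# 15, so the remaining 14 coefficients can be discarded with no loss.
x = [math.sin(2 * math.pi * i / 16) for i in range(16)]
X = dft(x)
energy_kept = abs(X[1]) ** 2 + abs(X[15]) ** 2
energy_total = sum(abs(c) ** 2 for c in X)
```

For this idealized signal the two retained coefficients carry all of the energy; for realistic signals the discarded coefficients are merely small rather than zero, which is exactly what makes DFT-based dimensionality reduction lossy but effective.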
The key observation is that, a signal of length N can be decomposed into N sine/cosine waves that can be recombined into the original signal, and many Fourier coefficients have a very low amplitude and therefore can be discarded without much loss of information in the reconstruction process. 15 The iV-point Discrete Fourier Transform of a signal x = (xo,. \u00E2\u0080\u00A2 \u00E2\u0080\u00A2, XJV-I) is a sequence X = (Xo,..., Xpf-i) of complex numbers, where XF = ^Ylx 2 with T0(t) = 1 and T^t) = t. From the above definition, the first six Chebyshev polynomials are: T0(t) = 1 Ti(t) = t T2(t) = 2 t 2 - l T3(t) = 4 t 3 - 3 f T4(t) = 8 i 4 - 8 t 2 + l T5(t) = 16t5 - 20*3 + 5* And Figure 3.1 shows the graphs of T\(t) to T5(t). Even though in the above definitions, t is defined,only over the interval [\u00E2\u0080\u00941,1], we may define Chebyshev Polynomials appropriate to any given finite inter-33 val [a, b] (\u00E2\u0080\u009400 < a < b < + 0 0 ) by transforming this domain into [\u00E2\u0080\u00941,1] of a new variable s under the linear transformation: 2x - (a + b) s = \u00E2\u0080\u0094\u00C2\u00BB; 0 \u00E2\u0080\u0094 a Then, the Chebyshev Polynomials appropriate to [a, b] are Tm(s). Without loss of generality, hereafter we simply focus on the interval [\u00E2\u0080\u00941,1]. Finally, we conclude this section with a key property of Chebyshev Polyno-mials: T h e o r e m 3.1 The Chebyshev Polynomial Tm(t) has m zeros and m + 1 local ex-trema in [\u00E2\u0080\u00941,1]. The zeros are: (i - -)ir tj = cos- j = l , 2 , . . . , m (3.3) m The extrema are: Z7T U = cos\u00E2\u0080\u0094, i = 0, l , . . . , m (3.4) Note that all the zeros are interior to [\u00E2\u0080\u00941,1], while there are two extrema at the endpoints \u00C2\u00B1 1 , and m \u00E2\u0080\u0094 1 \"true\" alternate maxima and minima where the first derivative is 0. 
3.2 Approximation Theory

As described before, polynomials are among the simplest classes of functions in the sense that they can be easily specified, compactly represented, and efficiently evaluated. However, there exist many other functions that are complex in nature. This is where approximation comes in: it is useful and sometimes essential to approximate any given function f by a much simpler function f*, such that the approximating value f*(x) is very close to the corresponding real value f(x). In this section, we shall review the fundamental concepts of approximation theory, with an emphasis on uniform (minimax) approximation, and then introduce the minimax property of Chebyshev polynomials.

Approximation theory consists of three major components:

1. A function class F (containing all the functions to be approximated).
2. A form (for the approximation function f*) that is parameterized by a few adjustable coefficients. This defines a set A of possible approximations to the given f.
3. A norm || · || (of the approximation error) that measures how good the approximation is. That is, ||f − f*|| defines the closeness of f* to f.

There exists a vast body of literature on how to define different approximation problems by making appropriate selections of these components; however, in this thesis, our main focus is as follows:

1. Function class F = Lp[a, b], that is, the set of Lp-integrable functions on [a, b], defined by:

F = Lp[a, b] = { h(x) for which ∫_a^b w(x) |h(x)|^p dx exists }    (3.5)

where w(x) is a given nonnegative weight function, and 1 ≤ p ≤ ∞.

2. Form

A = Π_m = { f*(x) = P_m(x) = a_0 + a_1 x + ··· + a_m x^m }    (3.6)

where the adjustable coefficients are a_0, ..., a_m.

3.
Norm || · || = ||h||_p = L_p-norm, where

    \|h\|_p = \left( \int_a^b w(x) |h(x)|^p \, dx \right)^{1/p}    (3.7)

Specifically, the Chebyshev (minimax) norm is:

    \|h\|_\infty = \max_{a \le x \le b} |h(x)|    (3.8)

obtained as the limit of the L_p-norm as p → ∞.

Theorem 3.2 For any function f continuous on [a, b] and any ε > 0, there exists a polynomial P_m for some sufficiently large m such that ||f − P_m||_p < ε for any p ≥ 1. Furthermore, there exists a unique best polynomial approximation to any function f ∈ L_p[a, b] in the L_p-norm, where w(x) = 1 in the case p → ∞.

Best approximations also exist in the L_p-norm on finite point sets for 1 ≤ p < ∞, and are unique if and only if p > 1. Such L_p norms are defined by:

    \|f - f^*\|_p = \left[ \sum_i w_i \, |f(x_i) - f^*(x_i)|^p \right]^{1/p}

where the {w_i} are positive scalar weights and the {x_i} form a discrete set of fitting points.

In particular, Theorem 3.2 guarantees the existence of a unique best approximation in the minimax norm; however, it does not tell us how to recognize such an approximation. Rather surprisingly, it is possible to do so explicitly, as the following powerful theorem illustrates.

Theorem 3.3 (Alternation Theorem for Polynomials) For any continuous function f, a unique minimax polynomial approximation P_m exists, and it is characterized by the "alternating" or "equioscillation" property that there are at least m + 2 points in [a, b] at which f(x) − P_m(x) attains its maximum absolute value with alternating signs.

It is clear from Theorem 3.1 and Figure 3.1 in Section 3.1 that the Chebyshev Polynomial T_m(t) has m + 1 extrema with alternating signs on the interval [-1, 1]. By invoking Theorem 3.3, we have:

Theorem 3.4 The polynomial 2^{1-m} T_m(t) is the minimax approximation on [-1, 1] to the zero function by a monic polynomial of degree m.

3.3 Orthogonality

In the last section, we showed that their minimax property has earned Chebyshev Polynomials a key position in the development of approximations.
On the other hand, Chebyshev Polynomials also possess another equally important property: they are a family of orthogonal polynomials. The orthogonality of those polynomials, in addition to having a strong linkage with L_2 (or least-squares) approximations, lends itself to the subject of series expansions.

Definition 3.3 Two functions f(x) and g(x) in L_2[a, b] are orthogonal on the interval [a, b] with respect to a given continuous and nonnegative weight function w(x) if

    \int_a^b w(x) f(x) g(x) \, dx = 0    (3.9)

If we use the "inner product" notation

    (f, g) = \int_a^b w(x) f(x) g(x) \, dx    (3.10)

then the orthogonality condition (3.9) is equivalent to saying that f is orthogonal to g if

    (f, g) = 0    (3.11)

Furthermore, with this notation of inner product, the L_2-norm is

    \|h\| = \|h\|_2 := \sqrt{(h, h)}    (3.12)

Definition 3.4 A family of polynomials {φ_i(x) : i = 0, 1, ...} is orthogonal on the interval [a, b] with respect to a given continuous and nonnegative weight function w(x) if and only if for each i = 0, 1, ... and j = 0, 1, ...:

1. φ_i(x) is of degree i;

2. (φ_i(x), φ_j(x)) = 0 for i ≠ j;

3. (φ_i(x), φ_i(x)) = ||φ_i||^2 > 0.

The family is orthonormal if, for all i, (φ_i(x), φ_i(x)) = 1 = ||φ_i||^2.

If we define the inner product (3.10) using the interval and weight function

    [a, b] = [-1, 1], \qquad w(t) = (1 - t^2)^{-1/2}

then the following theorem holds for the inner product between Chebyshev Polynomials:

Theorem 3.5

    (T_i, T_j) = \begin{cases} 0 & \text{if } i \ne j \\ \pi/2 & \text{if } i = j \ne 0 \\ \pi & \text{if } i = j = 0 \end{cases}    (3.13)

The system {T_i} is therefore orthogonal with respect to w(t), but not orthonormal. Hereafter we shall refer to

    w(t) = (1 - t^2)^{-1/2}

as the Chebyshev weight function.

3.4 Series Expansions

One of the best ways to approximate a given function is to expand it in terms of an infinite series of an orthogonal family of simpler functions. Indeed, many expansion techniques have been studied before, including the familiar Taylor series, Laurent series and Fourier series.
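As an aside, the orthogonality relations (3.13), which underpin such expansions, can be checked numerically. The sketch below uses Gauss–Chebyshev quadrature, a standard rule (not taken from the thesis) in which the weight (1 − t²)^{-1/2} is absorbed into the nodes cos((k − 1/2)π/K), and which is exact for polynomial integrands of degree below 2K:

```python
import math

def T(m, t):
    """Chebyshev polynomial via the closed form T_m(t) = cos(m * arccos(t))."""
    return math.cos(m * math.acos(t))

def inner(i, j, K=64):
    """Approximate the inner product (T_i, T_j) with weight (1 - t^2)^(-1/2)
    by Gauss-Chebyshev quadrature at the K roots of T_K."""
    nodes = [math.cos((k - 0.5) * math.pi / K) for k in range(1, K + 1)]
    return (math.pi / K) * sum(T(i, t) * T(j, t) for t in nodes)

assert abs(inner(0, 0) - math.pi) < 1e-9        # (T_0, T_0) = pi
assert abs(inner(3, 3) - math.pi / 2) < 1e-9    # (T_i, T_i) = pi/2 for i != 0
assert abs(inner(2, 5)) < 1e-9                  # (T_i, T_j) = 0 for i != j
```

Because the integrands here are polynomials of degree at most 8, the quadrature with K = 64 reproduces Theorem 3.5 to machine precision.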
We may write the Chebyshev series expansion of a given function f(t) as

    S_T^{\infty}(f)(t) = \sum_{i=0}^{\infty} c_i T_i(t)    (3.14)

where S_T^m(f) denotes the m-th partial sum of the infinite series S_T^{\infty}(f). The following theorem asserts that S_T^{\infty}(f) is in fact L_2-convergent with respect to the Chebyshev weight function, provided that f is L_2-integrable with respect to the same weight function.

Theorem 3.6 If f(t) is L_2-integrable with respect to the inner product (3.10), then its Chebyshev series expansion (3.14) converges in L_2; in other words,

    \int_{-1}^{1} (1 - t^2)^{-1/2} \left[ f(t) - S_T^m(f)(t) \right]^2 dt \to 0 \quad \text{as } m \to \infty

Thus, we may write:

    f(t) = S_T^{\infty}(f)(t) = \sum_{i=0}^{\infty} c_i T_i(t)    (3.15)

It follows, by taking inner products with T_j, that

    (f, T_j) = \sum_{i=0}^{\infty} c_i (T_i, T_j) = c_j (T_j, T_j)    (3.16)

since (T_i, T_j) = 0 for i ≠ j. Therefore,

    c_0 = \frac{(f, T_0)}{(T_0, T_0)} = \frac{1}{\pi} \int_{-1}^{1} \frac{f(t)}{\sqrt{1 - t^2}} \, dt    (3.17)

    c_j = \frac{(f, T_j)}{(T_j, T_j)} = \frac{2}{\pi} \int_{-1}^{1} \frac{f(t) T_j(t)}{\sqrt{1 - t^2}} \, dt \quad \text{for } j \ge 1    (3.18)

Note that in the above formulae for c_0 and the c_j's, the constants differ by a factor of 2. This is a direct consequence of the second and third cases shown in Theorem 3.5 (i.e., π versus π/2).

For a particular degree m, the error function for the partial sum S_T^m f is e_T^m f = f − S_T^m f, and it satisfies:

    \|e_T^m f\|^2 = (f - S_T^m f, \, f - S_T^m f)
                 = (f, f) - 2 (f, S_T^m f) + (S_T^m f, S_T^m f)
                 = \|f\|^2 - 2 \sum_{i=0}^{m} c_i (T_i, f) + \sum_{i=0}^{m} c_i^2 (T_i, T_i)
                 = \|f\|^2 - \pi c_0^2 - \frac{\pi}{2} \sum_{i=1}^{m} c_i^2

From Theorem 3.6, e_T^m f → 0 as m → ∞; therefore, we obtain an important convergence theorem for Chebyshev coefficients:

Theorem 3.7

    \frac{\pi}{2} \sum_{i=0}^{\infty} c_i^2 \;\le\; \pi c_0^2 + \frac{\pi}{2} \sum_{i=1}^{\infty} c_i^2 \;=\; \|f\|^2 \;=\; \int_{-1}^{1} (1 - t^2)^{-1/2} f^2(t) \, dt    (3.19)

Chapter 4

Indexing with No False Negatives

In this chapter, we focus on 1-dimensional spatiotemporal trajectories, that is, time series. In Chapter 5, we shall generalize our framework to higher dimensional trajectories. Figure 4.1 summarizes all the symbols used in this thesis.
Given a collection of time series of length N, we intend to represent each time series by its Chebyshev approximation of degree m, with m << N. Given two time series S_1 = ((t_1, u_1), ..., (t_N, u_N)) and S_2 = ((t_1, v_1), ..., (t_N, v_N)), we consider the time series Z = ((t_1, u_1 − v_1), ..., (t_N, u_N − v_N)). Let z_j = u_j − v_j for all 1 ≤ j ≤ N. Then it is clear that the Euclidean distance between S_1, S_2 satisfies the following equality:

    Dist_{euc}^2(S_1, S_2) = \sum_{j=1}^{N} (u_j - v_j)^2 = \sum_{j=1}^{N} z_j^2    (4.9)

Recall from Section 4.2 how an interval function is defined for a time series. Let the interval functions corresponding to S_1, S_2 and Z be f_1, f_2 and f_Z respectively. The lemma below is easy to establish by following Equations (4.3) and (4.4) in Section 4.2.

Lemma 4.4 For all t ∈ [-1, 1], f_Z(t) = f_1(t) − f_2(t).

Proof: For each i = 1, ..., N, let x ∈ I_i be arbitrarily given. Then:

    f_Z(x) = \sqrt{N/2} \, (1 - x^2)^{1/4} \, g_Z(x)        [Equation (4.4)]
           = \sqrt{N/2} \, (1 - x^2)^{1/4} \, z_i           [Equation (4.3)]
           = \sqrt{N/2} \, (1 - x^2)^{1/4} \, (u_i - v_i)   [Definition of Z]
           = f_1(x) - f_2(x)                                 [Equation (4.4)]  ∎

The above lemma can then be used to establish a useful result for Chebyshev approximation. Let us consider the Chebyshev approximation of Z based on Equation (4.7). Let the corresponding vector of Chebyshev coefficients be denoted as C_Z. Given that Z is the "difference" between S_1 and S_2, the following lemma says that the vector of Chebyshev coefficients preserves the difference.

Lemma 4.5 Let C_1, C_2 and C_Z be the vectors of Chebyshev coefficients for S_1, S_2 and Z respectively. Specifically, let C_1^T = [a_0, ..., a_m], C_2^T = [b_0, ..., b_m] and C_Z^T = [c_0, ..., c_m]. Then for all 0 ≤ i ≤ m, it is the case that c_i = a_i − b_i.

Proof: In the following, we only focus on c_i for 1 ≤ i ≤ m; the situation is almost identical for c_0.
    c_i = \frac{2}{N} \sum_{j=1}^{N} f_Z(t_j) T_i(t_j)                      [Equation (4.7)]
        = \frac{2}{N} \sum_{j=1}^{N} [f_1(t_j) - f_2(t_j)] T_i(t_j)          [Lemma 4.4]
        = a_i - b_i                                                          [Equation (4.7)]  ∎

Based on the above lemma and Definition 4.1, it is clear that:

    Dist_{cby}^2(C_1, C_2) = \frac{\pi}{2} \sum_{i=0}^{m} c_i^2 \;\le\; \frac{\pi}{2} \sum_{i=0}^{\infty} c_i^2    (4.10)

Lemma 4.2 tells us that the function f_Z(t), as defined in Equation (4.4), is L_2-integrable with respect to the Chebyshev weight function. By invoking Theorem 3.7 in Chapter 3, we can finally put the various pieces together and conclude with the following theorem.

Theorem 4.2 (The Lower Bounding Lemma) Let S_1, S_2 be two time series, and C_1, C_2 be the corresponding vectors of Chebyshev coefficients. Then:

    Dist_{cby}(C_1, C_2) \le Dist_{euc}(S_1, S_2)

Proof:

    Dist_{cby}^2(C_1, C_2) \le \frac{\pi}{2} \sum_{i=0}^{\infty} c_i^2                      [Equation (4.10)]
                           \le \int_{-1}^{1} \frac{f_Z^2(t)}{\sqrt{1 - t^2}} \, dt           [Theorem 3.7]
                           = \sum_{j=1}^{N} |I_j| \cdot \frac{z_j^2}{|I_j|}                  [Equation (4.4)]
                           = \sum_{j=1}^{N} z_j^2
                           = Dist_{euc}^2(S_1, S_2)                                          [Equation (4.9)]  ∎

Notice that a key step in the above proof is

    \int_{-1}^{1} \frac{f_Z^2(t)}{\sqrt{1 - t^2}} \, dt = \sum_{j=1}^{N} |I_j| \cdot \frac{z_j^2}{|I_j|} = \sum_{j=1}^{N} z_j^2

This is due to the fact that the integrand is a step function, and hence stepwise integrable. The result of the integration at each step is the area under the curve, which is the width |I_j| multiplied by the height z_j^2 / |I_j|, that is, z_j^2.

4.6 Extension to the Weighted Euclidean Framework

In this section, we shall extend the strict Euclidean framework into a generalized weighted version.

Definition 4.2 For any two time series S_1 = ((t_1, u_1), ..., (t_N, u_N)) and S_2 = ((t_1, v_1), ..., (t_N, v_N)), the weighted Euclidean distance between them is:

    Dist_{euc_w}(S_1, S_2) = \sqrt{ \sum_{i=1}^{N} w_i (u_i - v_i)^2 }    (4.11)

where {w_i}_{i=1}^{N} are nonnegative scalar weights.

We redefine the function g(t) (as in Equation (4.3)) as follows: g(t) = \sqrt{w_i} \, v_i if t ∈ I_i (for 1 ≤ i ≤ N).

In the next chapter, we generalize this framework to d-dimensional (d ≥ 1) spatiotemporal trajectories. Then we present algorithms for indexing and kNN searches.
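Before moving to d dimensions, the Lower Bounding Lemma can be illustrated numerically with a simplified discrete analogue of the construction above: the N values of each series are treated as samples of the interval function at the N roots of T_N, the coefficients are computed as in Equation (4.7), and the factor N/2 plays the role of the π/2 weight of Definition 4.1. This is a sketch under those simplifying assumptions, not the thesis's exact construction.

```python
import math, random

def cheb_nodes(N):
    """Roots of T_N: t_j = cos((j - 1/2) * pi / N)."""
    return [math.cos((j - 0.5) * math.pi / N) for j in range(1, N + 1)]

def coeffs(values, m):
    """Coefficients c_0..c_m, treating the N values as samples at the roots
    of T_N (discrete analogue of Equation (4.7))."""
    N = len(values)
    out = []
    for i in range(m + 1):
        s = sum(v * math.cos(i * math.acos(t))
                for v, t in zip(values, cheb_nodes(N)))
        out.append(s / N if i == 0 else 2.0 * s / N)
    return out

def dist_euc(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dist_cby(c1, c2, N):
    # N/2 is the discrete counterpart of the pi/2 weight in Definition 4.1.
    return math.sqrt((N / 2.0) * sum((a - b) ** 2 for a, b in zip(c1, c2)))

random.seed(0)
N, m = 64, 8
for _ in range(100):
    s1 = [random.gauss(0.0, 1.0) for _ in range(N)]
    s2 = [random.gauss(0.0, 1.0) for _ in range(N)]
    lb = dist_cby(coeffs(s1, m), coeffs(s2, m), N)
    assert lb <= dist_euc(s1, s2) + 1e-9   # lower bound: no false negatives
```

The guarantee follows from the discrete orthogonality of T_0, ..., T_{N-1} at the roots of T_N, which yields a Parseval identity mirroring Theorem 3.7.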
5.1 Lower Bounding for the Multidimensional Case

Let S be a d-dimensional spatiotemporal trajectory of the form ((t_1, v_1), ..., (t_N, v_N)), where v_i is of arity d. Let the d dimensions be {Dim_1, ..., Dim_d}. Then S is decomposed into d 1-dimensional series: S_Dim_1, ..., S_Dim_d. Let each of these series S_Dim_i be approximated and represented with the vector C_i of Chebyshev coefficients. The vector C_i is of arity n_i, and need not be of the same arity as C_j for j ≠ i. Finally, let C be the vector of Chebyshev coefficients for S, i.e., C^T = [C_1^T, ..., C_d^T].

We generalize Definition 4.1 to give a metric distance function between two vectors of Chebyshev coefficients for two d-dimensional trajectories.

Definition 5.1 Let S, R be d-dimensional spatiotemporal trajectories. Let their vectors of Chebyshev coefficients be C^T = [C_1^T, ..., C_d^T] and D^T = [D_1^T, ..., D_d^T] respectively. Define:

    Dist_{cby}(C, D) = \sqrt{ \sum_{i=1}^{d} Dist_{cby}^2(C_i, D_i) }

The following corollary is a simple extension of Theorem 4.2, generalizing the Lower Bounding Lemma from 1-dimensional to d-dimensional trajectories. This is because the d-dimensional distance is based on the sum-of-squares distances along each dimension.

Corollary 5.1 Let S, R be d-dimensional spatiotemporal trajectories, and C, D be the corresponding vectors of Chebyshev coefficients. Then:

    Dist_{cby}(C, D) \le Dist_{euc}(S, R)

5.2 Algorithms for Building and Searching a Single Index

Having established the Lower Bounding Lemma in Corollary 5.1 for the d-dimensional case, we can build an index of Chebyshev coefficients. Figure 5.1 shows a skeleton of an algorithm which takes M d-dimensional spatiotemporal trajectories, obtains the Chebyshev coefficients for each trajectory, and inserts the vectors of coefficients into a multidimensional index.

Algorithm BuildOneIndex(DB, Index, n_1, ..., n_d) {
  /* input: a database DB of M d-dimensional trajectories */
  /* input: Index, a multidimensional index which may already contain some entries */
  /* input: n_i (1 ≤ i ≤ d) denotes the number of Chebyshev coefficients to be used for the ith dimension */
  /* output: the trajectories approximated and added to Index */
  for each trajectory S {
    (1) initialize C to be empty
    (2) project S to its d dimensions {Dim_1, ..., Dim_d}, creating S_Dim_1, ..., S_Dim_d
    (3) for (1 ≤ i ≤ d) {
    (4)   apply Equations (4.2) to (4.7) to S_Dim_i
    (5)   add all the computed n_i coefficients to C
        } /* end for-loop */
    (6) insert the coefficients in C as a single multidimensional point into Index
  } /* end for-loop */
} /* end algorithm */

Figure 5.1: Algorithm for Building a Single Index of Chebyshev Coefficients

Recall from Section 4.2 that the complexity of step (4) of the algorithm is O(N), where N is the length of each trajectory. Thus, it is clear from Figure 5.1 that building the index takes O(dMN) time.

Next we consider range and kNN searches. In both cases, the search is rather straightforward, following the GEMINI framework [9]. Figure 5.2 shows a skeleton of the range search algorithm, and Figure 5.3 shows a skeleton of the kNN search algorithm.

5.3 Algorithms for Building and Searching Multiple Indices

A keen reader will notice that the algorithm presented in Figure 5.1 does not scale very well with the dimensionality of the trajectories. For example, if the trajectories have a dimensionality of 10, then even as few as four coefficients per dimension would result in an index of 40 dimensions, which is clearly unacceptable due to the dimensionality curse problem of multidimensional indices.
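Before turning to multiple indices, the single-index construction of Figure 5.1 can be sketched as follows. A plain Python list stands in for the multidimensional index, and the coefficient computation uses a discrete form of Equation (4.7); both are simplifying assumptions for illustration only.

```python
import math

def cheb_coeffs_1d(values, n):
    """First n Chebyshev coefficients of a 1-d series whose N values are
    treated as samples at the roots of T_N (discrete form of Eq. (4.7))."""
    N = len(values)
    nodes = [math.cos((j - 0.5) * math.pi / N) for j in range(1, N + 1)]
    return [sum(v * math.cos(i * math.acos(t)) for v, t in zip(values, nodes))
            * ((1.0 if i == 0 else 2.0) / N) for i in range(n)]

def build_one_index(db, ns):
    """Figure 5.1 sketch. db: list of d-dimensional trajectories, each given
    as d lists of values; ns[i] coefficients are kept for dimension i. Each
    index entry is (coefficient point, trajectory id)."""
    index = []
    for tid, traj in enumerate(db):
        point = []                                          # step (1)
        for dim, values in enumerate(traj):                 # steps (2)-(3)
            point.extend(cheb_coeffs_1d(values, ns[dim]))   # steps (4)-(5)
        index.append((point, tid))                          # step (6)
    return index

# Two 2-dimensional trajectories of length 8, with 3 coefficients per dimension.
db = [[[1.0, 2, 3, 4, 4, 3, 2, 1], [0.0, 1, 0, 1, 0, 1, 0, 1]],
      [[5.0, 5, 5, 5, 5, 5, 5, 5], [1.0, 2, 1, 2, 1, 2, 1, 2]]]
index = build_one_index(db, [3, 3])
assert len(index) == 2 and all(len(p) == 6 for p, _ in index)
```

Each trajectory becomes a single 6-dimensional point; in the thesis, such points are inserted into an R-tree-style structure instead of a list.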
Algorithm RangeSearchOneIndex(Q, Index, r) {
  /* input: a d-dimensional query trajectory Q */
  /* input: the index of Chebyshev coefficients Index */
  /* input: a radius r for range search */
  /* output: all trajectories within distance r of Q with respect to Dist_euc */
  (1) apply Equations (4.2) to (4.7) to obtain the vector of coefficients for Q
  (2) find all trajectories in Index within distance r of Q using Dist_cby
  (3) retrieve from disk the corresponding (full) trajectories
  (4) compute the true distances using Dist_euc and discard all the false positives
} /* end algorithm */

Figure 5.2: Algorithm for a Range Search in a Single Index

In this Section, we shall propose an alternative approach, in which we "distribute" the Chebyshev coefficients of trajectories to multiple indices instead of using a single index. Figure 5.4 illustrates how we build multiple indices based on the Chebyshev coefficients of each trajectory in a collection of M d-dimensional spatiotemporal trajectories. Figure 5.5 gives an algorithm for range search, and Figure 5.6 is for kNN search. Note that, in Step (5) of Figure 5.4, there are many ways to choose l. In fact, there are exponentially many ways to distribute the sets of Chebyshev coefficients to indices, and we shall leave the optimization of distributions as a topic for future research.

5.4 Properties of Chebyshev Indexing

Recall from Section 1.2.3 that we outlined a list of desirable properties for indexing techniques.
In this Section, we shall argue that Chebyshev indexing satisfies all those criteria:

Algorithm kNNSearchOneIndex(Q, Index, k) {
  /* input: a d-dimensional query trajectory Q */
  /* input: the index of Chebyshev coefficients Index */
  /* input: k, a positive integer */
  /* output: the k most similar trajectories to Q with respect to Dist_euc */
  (1) apply Equations (4.2) to (4.7) to obtain the vector of coefficients for Q
  (2) find the k-nearest neighbours to Q in Index using Dist_cby
  (3) retrieve from disk the corresponding (full) trajectories
  (4) compute the true distances using Dist_euc and record the maximum, max
  (5) invoke the range search RangeSearchOneIndex(Q, Index, max)
  (6) retrieve from disk the corresponding (full) trajectories
  (7) compute the true distances using Dist_euc and retain the nearest k trajectories
} /* end algorithm */

Figure 5.3: Algorithm for a kNN Search in a Single Index

• Indexing is intrinsically much faster than sequential scanning in terms of query performance, since it partitions the search space into hierarchical components and does not require access to every data page. As long as the dimensionality is not too high, indexing always dominates sequential scanning. This is also confirmed by our experimental results in Chapter 6.

• As n is quite small, our indexing structure does not take much space and is expected to be memory resident.

• Chebyshev indexing allows both whole matching and subsequence matching.

• For each trajectory in the database, we compute its Chebyshev coefficients based only on the trajectory itself. Unlike SVD, Chebyshev approximation is incremental.

• No false negatives are guaranteed (Theorem 4.2 and Corollary 5.1).
• It is clear from Figure 5.1 that the index building process takes O(dMN) time. Thus, Chebyshev indexing is scalable with respect to all of d, M and N.

• As shown in Chapter 4, our framework can handle both the Euclidean distance and the weighted Euclidean distance.

Algorithm BuildMultipleIndices(DB, Index_1, ..., Index_g, n_1, ..., n_d) {
  /* input: a database DB of M d-dimensional trajectories */
  /* input: Index_1, ..., Index_g, g indices which may already contain some entries */
  /* input: n_i (1 ≤ i ≤ d) denotes the number of Chebyshev coefficients to be used for the ith dimension */
  /* output: the trajectories approximated and indexed */
  for each trajectory S {
    (1) initialize C_1, ..., C_g to be empty, where C_j is the collection of coefficients to be inserted into Index_j
    (2) project S to its d dimensions {Dim_1, ..., Dim_d}, creating S_Dim_1, ..., S_Dim_d
    (3) for (1 ≤ i ≤ d) {
    (4)   apply Equations (4.2) to (4.7) to S_Dim_i to get the set of Chebyshev coefficients V_i for S_Dim_i
    (5)   add all the coefficients in V_i to C_l for some l
        } /* end for-loop */
    (6) for (1 ≤ j ≤ g) {
    (7)   insert the coefficients in C_j as a single multidimensional point into Index_j
        } /* end for-loop */
  } /* end for-loop */
} /* end algorithm */

Figure 5.4: Algorithm for Building Multiple Indices of Chebyshev Coefficients
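The single-index kNN search of Figure 5.3 can be sketched generically as below. Any lower-bounding feature distance may be plugged in; the demo uses a deliberately crude one-coefficient feature (the series mean), which lower-bounds the Euclidean distance by the Cauchy–Schwarz inequality. This is an illustrative sketch, not the thesis's implementation.

```python
import math, random

def dist_euc(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_search(index, db, q_feat, q, k, dist_lb):
    """GEMINI-style kNN (Figure 5.3 sketch). index: list of (feature, id)
    pairs; dist_lb must lower-bound dist_euc, guaranteeing no false negatives."""
    by_lb = sorted(index, key=lambda e: dist_lb(e[0], q_feat))        # step (2)
    d_max = max(dist_euc(db[tid], q) for _, tid in by_lb[:k])         # steps (3)-(4)
    cand = [tid for f, tid in index if dist_lb(f, q_feat) <= d_max]   # step (5)
    cand.sort(key=lambda tid: dist_euc(db[tid], q))                   # steps (6)-(7)
    return cand[:k]

# Demo with a crude 1-number "approximation": the series mean. By Cauchy-Schwarz,
# sqrt(N) * |mean(u) - mean(v)| <= dist_euc(u, v), so it is a valid lower bound.
random.seed(1)
N, M, k = 32, 200, 3
db = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
index = [([sum(s) / N], tid) for tid, s in enumerate(db)]
q = db[7]
lb = lambda f, g: math.sqrt(N) * abs(f[0] - g[0])
result = knn_search(index, db, [sum(q) / N], q, k, lb)
brute = sorted(range(M), key=lambda tid: dist_euc(db[tid], q))[:k]
assert result == brute   # exact answer: no false dismissals
```

The tighter the lower bound (e.g., more Chebyshev coefficients), the smaller the candidate set in step (5), which is exactly the pruning-power effect measured in Chapter 6.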
Algorithm RangeSearchMultipleIndices(Q, Index_1, ..., Index_g, r) {
  /* input: a d-dimensional query trajectory Q */
  /* input: the indices of Chebyshev coefficients */
  /* input: a radius r for range search */
  /* output: all trajectories within distance r of Q with respect to Dist_euc */
  (1) apply Equations (4.2) to (4.7) to obtain the vector C of coefficients for Q
  (2) for (1 ≤ j ≤ g) {
  (3)   find all trajectories in Index_j within distance r of the corresponding coefficients of C using Dist_cby
      } /* end for-loop */
  (4) intersect the g candidate sets
  (5) retrieve from disk the corresponding (full) trajectories
  (6) compute the true distances using Dist_euc and discard all the false positives
} /* end algorithm */

Figure 5.5: Algorithm for a Range Search in Multiple Indices

Figure 6.3: Scalability: Pruning Power and Building Time ((b) APCA pruning power; (c) Chebyshev building time; x-axis: Number of Trajectories, M)

Figure 6.4: Search Time Comparison: Indexing Included ((a) 1-D Stocks data: I/O cost; (b) 3-D Kungfu data: I/O cost; (c) 3-D Generated data: I/O cost; (d) 1-D Stocks data: CPU time; (e) 3-D Kungfu data: CPU time; (f) 3-D Generated data: CPU time; x-axis: Number of Coefficients, n)

6.6 Comparisons with Indexing Included

So far, our discussions have not taken account of the indexing structure. The comparison between Chebyshev and APCA has been based on pruning power and sequential scans. In the remainder of this section, we compare these two schemes with indexing included, in terms of both I/O cost and CPU cost.
I/O cost, if reported in seconds, may depend heavily on implementation and experimentation details, such as buffer space, speed of a random page read, etc. To eliminate these details, we report I/O cost as the sum of the number of index nodes/pages accessed and the number of page reads required to retrieve the specific trajectories needed by the kNNSearch procedure. We used a page size of 10 Kbytes.

CPU time includes the time taken to navigate the index nodes, the time taken to compute the lower-bounding distances Dist_cby(C_Q, C_S), and the time taken to compute the real Euclidean distances Dist_euc(Q, S), whenever needed. As discussed before, the time taken to perform the initial approximation of the query is not included, due to the fact that the APCA code is written in Matlab. Even though the exact CPU time is highly dependent on the size of the data set and the length of the trajectories, the CPU time can at least be used as a relative measurement between Chebyshev and APCA. Like the figures reported on pruning power, the timing figures reported here on I/O and CPU costs represent averages over 10 randomly picked queries.

6.6.1 I/O Cost Comparison

Figure 6.4 shows the I/O and CPU costs for the Stocks data, the Kungfu data and the 3-dimensional generated data with 10,000 trajectories, each of length 720. The x-axis of the graphs shows the number of coefficients used, n, for each dimension. For graphs (a) to (c), the y-axis shows the I/O cost in page accesses. Given the differences in length and number of trajectories in each data set, the absolute values in graphs (a) to (c) are relatively unimportant; what is important are the curves within each graph.

For the 1-dimensional Stocks data in graph (a), the reduction in page accesses as n increases flattens off for n beyond 16. For the 3-dimensional Kungfu data in graph (b), the number of page accesses reaches a minimum when n = 8, corresponding to a 24-dimensional index.
Beyond that, the dimensionality curse on the index structure sets in, and the number of total page accesses starts to rise. For the 3-dimensional generated data in graph (c), there is not yet any observed increase in total page accesses beyond n = 8. However, recall that total page accesses come from two sources: the number of data pages and the number of index nodes/pages. As n increases, the former decreases due to the increase in pruning power. In contrast, the latter goes up due to increasing dimensionality, and accounts for a larger and larger percentage of total page accesses. Eventually, the latter dominates the former.

Dimensionality curse aside, the number of page accesses required by Chebyshev approximation in all cases is about 50% to 60% that of APCA. This improvement is highly consistent with the pruning power results shown earlier in Figure 6.1 and Figure 6.3.

6.6.2 CPU Cost Comparison

For graphs (d) to (f), the y-axis shows the CPU time taken (in seconds) for the entire kNNSearch. Within each graph, we show the times taken by Chebyshev and APCA, with indexing included. Furthermore, whenever the sequential scan strategy (as described in Section 6.2) becomes competitive, the timing figures for scans are included as well. The key difference between indexing and sequential scans is that with the former, the dimensionality curse on the indexing structure will set in sooner or later.

In graph (d), for the 1-dimensional Stocks data, the minimum CPU time occurs when n = 20. But for the 3-dimensional Kungfu and generated data in graphs (e) and (f), the minimum CPU time occurs when n = 4 or n = 6 (corresponding to a 12-dimensional or 18-dimensional index). And if the total time is considered by summing up the CPU and I/O costs, the best situation is when n = 6.

As expected and consistent with the literature [55], our sequential scan strategy starts to dominate indexing.
For graphs (e) and (f), this occurs when n = 10. As our sequential scan strategy is not optimized, it is conceivable that a more optimized sequential scan procedure may dominate even earlier. Recall from Figure 6.1 that the pruning power continues to grow beyond n = 8. Thus, it is important to include sequential scans as a viable alternative to indexing for spatiotemporal trajectories.

For the comparison between Chebyshev and APCA, again the former dominates in CPU time taken. This is consistent with all the previous comparisons on pruning power and I/O cost. But besides pruning power, there is an additional reason why the CPU time for Chebyshev is lower than that for APCA. As defined in Definition 4.1, the computation of the distance between two vectors of Chebyshev coefficients is O(n). However, based on the distance measure given in [21], the corresponding computation between two vectors of APCA coefficients is in fact O(N) (where N is the length of each trajectory), which requires extra CPU time.

6.6.3 Recommendations

In closing, we make the following suggestions regarding indexing for d-dimensional spatiotemporal trajectories. They are based on the DR-tree we used, and should be adjusted depending on the choice of index structure. For the 1-dimensional case, using n = 20 Chebyshev coefficients appears to be the best. For the 2-dimensional case, the suggested value of n is 8-12 for each dimension. The corresponding suggestion is 4-6 for 3-dimensional trajectories. And finally, 4 coefficients for each dimension are recommended for 4-dimensional trajectories. For higher dimensionality, or for additional pruning power, a sequential scan using a higher number of Chebyshev coefficients is recommended.

6.7 A Single Index vs. Multiple Indices

Until now, the highest dimensionality of trajectories in all the experimental data sets used has been 4, but there are many examples of higher dimensionality in real life.
In this Section, we shall explore the performance of our algorithms with respect to the dimensionality of trajectories, focusing only on Chebyshev approximation. We use our trajectory generator to generate five databases, each having M = 5000 trajectories of length N = 500, but with different dimensionalities (d = 2, 4, 6, 8, 10). We compare the performance under three situations (where n = 3 and 5):

1. SingleIndex corresponds to the scenario where n coefficients are computed for each attribute and a single index of nd dimensions is used.

2. MultipleIndices corresponds to the scenario where n coefficients are computed for each attribute and the coefficients for two attributes are grouped into one index. In total, d/2 indices of 2n dimensions are used.

3. SequentialScan corresponds to the scenario where n coefficients are computed for each attribute and kNN search is conducted using a sequential scan.

Figure 6.5 and Figure 6.6 show the comparisons in terms of pruning power, I/O cost and CPU cost, respectively. They both display a similar trend as dimensionality increases from 2 to 10. When d is small, putting all the coefficients together in one single index appears to be the best, but as d gets larger, the dimensionality curse sets in, and sequential scanning starts to dominate. It is interesting to note that, when d ≥ 6, distributing Chebyshev coefficients to multiple indices turns out to be better than the single index approach, and the difference becomes more obvious as d increases. On the other hand, however, sequential scan is definitely the winner over the two indexing schemes at very high dimensionalities.

6.8 Subsequence Matching

All the previous sections, whether indexing is included or not, and whether a single index or multiple indices are used, are devoted to the problem of whole matching. In this Section, our focus is subsequence matching, as discussed in Section 2.1.2.
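For whole matching we index one point per trajectory; for subsequence matching, every sliding window of length w becomes a candidate, as the short sketch below illustrates. The window-extraction step is the standard approach of [9]; the concrete sizes mirror the experiment described next.

```python
def sliding_windows(series, w):
    """All contiguous subsequences of length w; a series of length N yields
    N - w + 1 of them."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

# With M = 500 series of length N = 500 and window length w = 180,
# each series yields N - w + 1 = 321 candidate windows:
M, N, w = 500, 500, 180
db = [list(range(N)) for _ in range(M)]
positions = [(sid, off) for sid in range(M) for off in range(N - w + 1)]
assert len(positions) == (N - w + 1) * M == 160500

# A window is materialized (and approximated) only when needed:
first = sliding_windows(db[0], w)[0]
assert len(first) == w and first[:3] == [0, 1, 2]
```

In practice each window (or a representative subset of windows) is approximated by its Chebyshev coefficients and indexed just like a whole trajectory.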
We use our trajectory generator to create a database of M = 500 time series, each of which has a length of N = 500. Queries of length w = 180 are randomly generated, and range searches with a radius r = 10 are performed. In total, there are (N − w + 1) × M = 160,500 subsequences of length w.

Note that, in the subsequence matching algorithm described in Section 2.1.2, both Chebyshev and APCA can be used for the dimensionality reduction step. In Figure 6.7, we compare Chebyshev approximation and APCA approximation in terms of pruning power, I/O cost and CPU cost.

Figure 6.5: A Single Index vs. Multiple Indices (n = 3) ((a) Pruning Power Comparison; (b) I/O Cost Comparison; (c) CPU Time Comparison; x-axis: Dimensionality of Trajectories, d)

Figure 6.6: A Single Index vs. Multiple Indices (n = 5) ((a) Pruning Power Comparison; (b) I/O Cost Comparison; (c) CPU Time Comparison; x-axis: Dimensionality of Trajectories, d)
Figure 6.7: Subsequence Matching: Pruning Power, I/O Cost and CPU Cost ((a) Pruning Power Comparison; (b) I/O Cost Comparison; (c) CPU Time Comparison; x-axis: Number of Coefficients, n)

As expected, the results are highly consistent with those for whole matching in Section 6.6.

Chapter 7

Conclusions

In this thesis, we explore how to apply Chebyshev polynomials for approximating and indexing d-dimensional spatiotemporal trajectories. Chebyshev polynomials enjoy the property that they are almost identical to the minimax polynomials, yet they are easier to compute. Computing Chebyshev coefficients is linear with respect to the data set size M, as well as to the trajectory length N. Our experimental results further show that computing extra Chebyshev coefficients takes negligible time (i.e., increasing n incurs little extra cost).

In order for Chebyshev approximation to be used for indexing, a key analytic result of this thesis is the Lower Bounding Lemma. To achieve this result, we need to extend a discrete trajectory into an interval function, so that Chebyshev approximation becomes applicable. We also need to define a distance function between two vectors of Chebyshev coefficients.

To evaluate the effectiveness of the minimax property of Chebyshev polynomials on indexing, we conducted an extensive experimental evaluation. From 1- to 4-dimensional, real to generated data, Chebyshev dominates the widely used APCA in pruning power, I/O costs and CPU costs. Our empirical results indicate that Chebyshev approximation can deliver a 3- to 5-fold reduction in the dimensionality of the index space.
That is, it only takes 4 to 6 Chebyshev coefficients to deliver the same pruning power produced by 20 APCA coefficients. This is a very important advantage. As the dimensionality curse on the indexing structure is bound to set in sooner or later, Chebyshev coefficients are far more effective than APCA in delivering additional pruning power before that happens.

In ongoing work, we would like to extend the Lower Bounding Lemma to other distance functions, such as the dynamic time warping distance [4] and the longest common subsequence distance [53]. We would also like to expand our framework to conduct sub-trajectory matching. The fixed-window strategy proposed in [9] is applicable; yet we seek to exploit properties of Chebyshev approximation for further optimization. The experimental results reported here are based on using the same number of coefficients for each dimension. In ongoing work, we would explore how to allocate a fixed number of Chebyshev coefficients to the d dimensions according to the "need" of each dimension. Finally, we would explore how to develop an optimized sequential scan algorithm to use in conjunction with Chebyshev coefficients.

Bibliography

[1] P. Agarwal, L. Arge and J. Erickson. Indexing Moving Points. Proc. 2000 ACM PODS, pp. 175-186.

[2] R. Agrawal, C. Faloutsos and A. Swami. Efficient Similarity Search in Sequence Databases. Proc. of the 4th Conference on Foundations of Data Organization and Algorithms, 1993, pp. 69-84.

[3] R. Agrawal, K. Lin, H. Sawhney and K. Shim. Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time-Series Databases. Proc. 1995 VLDB, pp. 490-501.

[4] D. J. Berndt and J. Clifford. Using Dynamic Time Warping to Find Patterns in Time Series. Working Notes of the Knowledge Discovery in Databases Workshop, 1994, pp. 359-370.

[5] Y. Cai and R. T. Ng. Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials. Proc. 2004 SIGMOD, to appear.

[6] K.
Chakrabarti and S. Mehrotra. Local Dimensionality Reduction: a new approach to indexing high dimensional spaces. Proc. 2000 VLDB, pp. 89-100.

[7] K. Chan and A. Fu. Efficient Time Series Matching by Wavelets. Proc. 1999 ICDE, pp. 126-133.

[8] K. Chu and M. Wong. Fast Time-Series Searching with Scaling and Shifting. Proc. 1999 PODS, pp. 237-248.

[9] C. Faloutsos, M. Ranganathan and Y. Manolopoulos. Fast Subsequence Matching in Time-Series Databases. Proc. 1994 SIGMOD, pp. 419-429.

[10] V. Gaede and O. Gunther. Multidimensional Access Methods. ACM Computing Surveys, 30, pp. 170-231, 1998.

[11] D. Gunopulos and G. Das. A Tutorial on Time Series Similarity Measures and Time Series Indexing. Proc. 2001 SIGMOD, pp. 243-307.

[12] M. Hadjieleftheriou, G. Kollios, V. Tsotras and D. Gunopulos. Efficient Indexing of Spatiotemporal Objects. Proc. 2002 EDBT, pp. 251-268.

[13] M. L. Hetland. A Survey of Recent Methods for Efficient Retrieval of Similar Time Sequences. In Data Mining in Time Series Databases, A. Kandel and H. Bunke, Eds. Singapore: World Scientific, to be published.

[14] T. Imielinski and B. Badrinath. Querying in Highly Mobile Distributed Environments. Proc. 1992 VLDB, pp. 41-52.

[15] H. V. Jagadish. On Indexing Line Segments. Proc. 1990 VLDB, pp. 614-625.

[16] C. Jensen and R. Snodgrass. Temporal Data Management. TKDE 11(1), 1999, pp. 36-44.

[17] T. Kahveci and A. Singh. Variable length queries for time series data. Proc. 2001 ICDE, pp. 273-282.

[18] K. V. R. Kanth, D. Agrawal and A. Singh. Dimensionality reduction for similarity searching in dynamic databases. Proc. 1998 SIGMOD, pp. 166-176.

[19] E. Keogh. Exact indexing of dynamic time warping. Proc. 2002 VLDB, pp. 406-417.

[20] E. J. Keogh. Efficiently Finding Arbitrarily Scaled Patterns in Massive Time Series Databases. Proc. 2003 PKDD, pp. 253-265.

[21] E. Keogh, K. Chakrabarti, M. Pazzani and S. Mehrotra.
Locally adaptive dimensionality reduction for indexing large time series databases. Proc. 2001 SIGMOD, pp. 151-162.

[22] E. Keogh and M. Pazzani. An Indexing Scheme for Fast Similarity Search in Large Time Series Databases. Proc. 1999 SSDBM, pp. 56-67.

[23] E. J. Keogh and M. J. Pazzani. A simple dimensionality reduction technique for fast similarity search in large time series databases. Pacific-Asia Conference on Knowledge Discovery and Data Mining 2000, pp. 122-133.

[24] E. Keogh, K. Chakrabarti, M. Pazzani and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, 2000, pp. 263-286.

[25] E. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. Proc. 1997 KDD, pp. 20-24.

[26] G. Kollios, D. Gunopulos and V. Tsotras. Nearest Neighbor Queries in a Mobile Environment. Spatio-Temporal Database Management Workshop 1999, pp. 119-134.

[27] G. Kollios, D. Gunopulos and V. Tsotras. On Indexing Mobile Objects. Proc. 1999 PODS, pp. 261-272.

[28] G. Kollios, D. Gunopulos, V. Tsotras, A. Delis and M. Hadjieleftheriou. Indexing Animated Objects Using Spatio-Temporal Access Methods. IEEE Trans. Knowledge and Data Engineering 2001, pp. 742-777.

[29] F. Korn, H. Jagadish and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. Proc. 1997 SIGMOD, pp. 289-300.

[30] S. L. Lee, S. J. Chun, D. H. Kim, J. H. Lee and C. W. Chung. Similarity Search for Multidimensional Data Sequences. Proc. 2000 ICDE, pp. 599-608.

[31] J. Kuan and P. Lewis. Fast k nearest neighbor search for R-tree family. Proc. of First Int. Conf. on Information, Communication and Signal Processing 1997, pp. 924-928.

[32] J. C. Mason and D. Handscomb. Chebyshev Polynomials. Chapman & Hall, 2003.

[33] Geographic Data Mining and Knowledge Discovery. Research Monographs in Geographic Information Systems, edited by H.
Miller and J. Han, Taylor and Francis, 2000.

[34] D. Papadopoulos, G. Kollios, D. Gunopulos and V. Tsotras. Indexing Mobile Objects on the Plane. DEXA Workshops 2002, pp. 693-697.

[35] C. S. Perng, H. Wang, S. R. Zhang and D. S. Parker. Landmarks: a new model for similarity-based pattern querying in time series databases. Proc. 2000 ICDE, pp. 33-42.

[36] D. Pfoser, C. J. Jensen and Y. Theodoridis. Novel approaches to the indexing of moving object trajectories. Proc. 2000 VLDB, pp. 395-406.

[37] I. Popivanov and R. Miller. Similarity Search Over Time Series Data Using Wavelets. Proc. 2002 ICDE, pp. 212-221.

[38] W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986.

[39] D. Rafiei. On similarity-based queries for time series data. Proc. 1999 ICDE, pp. 410-417.

[40] D. Rafiei and A. Mendelzon. Efficient Retrieval of Similar Time Sequences Using DFT. Proc. 1998 FODO.

[41] T. J. Rivlin. Chebyshev Polynomials: From Approximation Theory to Algebra and Number Theory (2nd Edition). John Wiley & Sons, 1990.

[42] N. Roussopoulos, S. Kelley and F. Vincent. Nearest Neighbor Queries. Proc. 1995 SIGMOD, pp. 71-79.

[43] S. Saltenis and C. Jensen. Indexing of Moving Objects for Location-Based Services. Proc. 2002 ICDE, pp. 463-472.

[44] S. Saltenis, C. Jensen, S. Leutenegger and M. Lopez. Indexing the Positions of Continuously Moving Objects. Proc. 2000 SIGMOD, pp. 331-342.

[45] T. K. Sellis, N. Roussopoulos and C. Faloutsos. The R+-Tree: A Dynamic Index for Multi-dimensional Objects. Proc. 1987 VLDB, pp. 507-518.

[46] H. Shatkay and S. Zdonik. Approximate queries and representations for large data sequences. Proc. 1996 ICDE, pp. 546-553.

[47] P. Sistla, O. Wolfson, S. Chamberlain and S. Dao. Modeling and querying moving objects. Proc. 1997 ICDE, pp. 422-432.

[48] R. Snodgrass. The Temporal Query Language TQuel. TODS 12(2), pp.
247-298, 1987.

[49] Y. Theodoridis, T. Sellis, A. Papadopoulos and Y. Manolopoulos. Specifications for Efficient Indexing in Spatiotemporal Databases. Proc. 1998 SSDBM, pp. 123-132.

[50] E. Tsoukatos and D. Gunopulos. Efficient Mining of Spatiotemporal Patterns. Seventh International Symposium on Spatial and Temporal Databases 2001, pp. 425-442.

[51] J. Stewart. Single Variable Calculus: Early Transcendentals. Brooks/Cole Publishing Company, 1995.

[52] M. Vlachos, D. Gunopulos and G. Kollios. Robust Similarity Measures for Mobile Object Trajectories. Proc. of DEXA Workshops 2002, pp. 721-728.

[53] M. Vlachos, G. Kollios and D. Gunopulos. Discovering similar multidimensional trajectories. Proc. 2002 ICDE, pp. 673-684.

[54] C. Wang and S. Wang. Supporting content-based searches on time series via approximation. Proc. 2000 SSDBM, pp. 69-81.

[55] R. Weber, H. Schek and S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. Proc. 1998 VLDB, pp. 194-205.

[56] O. Wolfson, B. Xu, S. Chamberlain and L. Jiang. Moving objects databases: Issues and solutions. Proc. 1998 SSDBM, pp. 111-122.

[57] D. Wu, D. Agrawal, A. El Abbadi, A. Singh and T. R. Smith. Efficient retrieval for browsing large image databases. Proc. 1996 CIKM, pp. 11-18.

[58] Y. Wu, D. Agrawal and A. El Abbadi. A Comparison of DFT and DWT Based Similarity Search in Time-Series Databases. Proc. 2000 CIKM, pp. 488-495.

[59] B. Yi and C. Faloutsos. Fast Time Sequence Indexing for Arbitrary Lp Norms. Proc. 2000 VLDB, pp. 395-406.
"Indexing spatiotemporal trajectories with Chebyshev polynomials"@en . "Text"@en . "http://hdl.handle.net/2429/15433"@en .