UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Parallel locality sensitive hashing for network discovery from time series Sodol, Svetlana

Abstract

Similarity operations on time series are a vital area in data mining research. Science and systems applications require a scalable solution that is fast enough to work with streaming of real data and provide reliable results in the presence of noise, high dimensionality and time dependency. Locality sensitive hashing (LSH) has shown advancement in limiting the number of comparisons required for similarity operations, thus reducing the overall time, by pre-processing the data into buckets as candidates for approximate nearest neighbours. This thesis proposes a scalable system based on locality sensitive hashing by implementing it in parallel with independent hash functions. The parallel system we present is implemented in a message-passing framework for four locality sensitive hashing methods – minhash, approximate binary correlation (ABC), symbolic aggregate approximation (SAX), and ``sketch, shingle and hash" (SSH), each with its own pre-processing, hash creation, and similarity measure. A preliminary investigation implements minhash with a bag-of-words representation of text data to validate our proposed framework. The experimental part of the thesis focuses on comparing the other three LSH methods (ABC, SAX, SSH) on a real flight data set processed in a streaming fashion, flexible to the size of the time series used. The output of our parallel system is a similarity network discovered from the data, that we use to detect an anomaly present in the data set. The LSH methods are evaluated with respect to the time of execution, amount of communication, computation complexity, tuning of parameters and the required number of similarity operations. Our results indicate the feasibility of the implemented methods and proposed framework for this type of application and this sort of real-life time series. Our thesis concludes with discussion of the impact of the similarity measures on the network discovery results, as well as proposing further investigations into other parts of the parameter space.

Item Media

Item Citations and Data

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International