- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Parallel locality sensitive hashing for network discovery...
Open Collections
UBC Theses and Dissertations
UBC Theses and Dissertations
Parallel locality sensitive hashing for network discovery from time series Sodol, Svetlana
Abstract
Similarity operations on time series are a vital area in data mining research. Science and systems applications require a scalable solution that is fast enough to work with streaming of real data and provide reliable results in the presence of noise, high dimensionality and time dependency. Locality sensitive hashing (LSH) has shown advancement in limiting the number of comparisons required for similarity operations, thus reducing the overall time, by pre-processing the data into buckets as candidates for approximate nearest neighbours. This thesis proposes a scalable system based on locality sensitive hashing by implementing it in parallel with independent hash functions. The parallel system we present is implemented in a message-passing framework for four locality sensitive hashing methods – minhash, approximate binary correlation (ABC), symbolic aggregate approximation (SAX), and ``sketch, shingle and hash" (SSH), each with its own pre-processing, hash creation, and similarity measure. A preliminary investigation implements minhash with a bag-of-words representation of text data to validate our proposed framework. The experimental part of the thesis focuses on comparing the other three LSH methods (ABC, SAX, SSH) on a real flight data set processed in a streaming fashion, flexible to the size of the time series used. The output of our parallel system is a similarity network discovered from the data, that we use to detect an anomaly present in the data set. The LSH methods are evaluated with respect to the time of execution, amount of communication, computation complexity, tuning of parameters and the required number of similarity operations. Our results indicate the feasibility of the implemented methods and proposed framework for this type of application and this sort of real-life time series. Our thesis concludes with discussion of the impact of the similarity measures on the network discovery results, as well as proposing further investigations into other parts of the parameter space.
Item Metadata
Title |
Parallel locality sensitive hashing for network discovery from time series
|
Creator | |
Supervisor | |
Publisher |
University of British Columbia
|
Date Issued |
2022
|
Description |
Similarity operations on time series are a vital area in data mining research. Science and systems applications require a scalable solution that is fast enough to work with streaming of real data and provide reliable results in the presence of noise, high dimensionality and time dependency. Locality sensitive hashing (LSH) has shown advancement in limiting the number of comparisons required for similarity operations, thus reducing the overall time, by pre-processing the data into buckets as candidates for approximate nearest neighbours. This thesis proposes a scalable system based on locality sensitive hashing by implementing it in parallel with independent hash functions. The parallel system we present is implemented in a message-passing framework for four locality sensitive hashing methods – minhash, approximate binary correlation (ABC), symbolic aggregate approximation (SAX), and ``sketch, shingle and hash" (SSH), each with its own pre-processing, hash creation, and similarity measure. A preliminary investigation implements minhash with a bag-of-words representation of text data to validate our proposed framework. The experimental part of the thesis focuses on comparing the other three LSH methods (ABC, SAX, SSH) on a real flight data set processed in a streaming fashion, flexible to the size of the time series used. The output of our parallel system is a similarity network discovered from the data, that we use to detect an anomaly present in the data set. The LSH methods are evaluated with respect to the time of execution, amount of communication, computation complexity, tuning of parameters and the required number of similarity operations. Our results indicate the feasibility of the implemented methods and proposed framework for this type of application and this sort of real-life time series. Our thesis concludes with discussion of the impact of the similarity measures on the network discovery results, as well as proposing further investigations into other parts of the parameter space.
|
Genre | |
Type | |
Language |
eng
|
Date Available |
2022-01-25
|
Provider |
Vancouver : University of British Columbia Library
|
Rights |
Attribution-NonCommercial-NoDerivatives 4.0 International
|
DOI |
10.14288/1.0406350
|
URI | |
Degree | |
Program | |
Affiliation | |
Degree Grantor |
University of British Columbia
|
Graduation Date |
2022-05
|
Campus | |
Scholarly Level |
Graduate
|
Rights URI | |
Aggregated Source Repository |
DSpace
|
Item Media
Item Citations and Data
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International