Parallel locality sensitive hashing for network discovery from time series

UBC Theses and Dissertations

Featured Collection

UBC Theses and Dissertations

Parallel locality sensitive hashing for network discovery from time series Sodol, Svetlana

Abstract

Similarity operations on time series are a vital area in data mining research. Science and systems applications require a scalable solution that is fast enough to work with streaming of real data and provide reliable results in the presence of noise, high dimensionality and time dependency. Locality sensitive hashing (LSH) has shown advancement in limiting the number of comparisons required for similarity operations, thus reducing the overall time, by pre-processing the data into buckets as candidates for approximate nearest neighbours. This thesis proposes a scalable system based on locality sensitive hashing by implementing it in parallel with independent hash functions. The parallel system we present is implemented in a message-passing framework for four locality sensitive hashing methods – minhash, approximate binary correlation (ABC), symbolic aggregate approximation (SAX), and ``sketch, shingle and hash" (SSH), each with its own pre-processing, hash creation, and similarity measure. A preliminary investigation implements minhash with a bag-of-words representation of text data to validate our proposed framework. The experimental part of the thesis focuses on comparing the other three LSH methods (ABC, SAX, SSH) on a real flight data set processed in a streaming fashion, flexible to the size of the time series used. The output of our parallel system is a similarity network discovered from the data, that we use to detect an anomaly present in the data set. The LSH methods are evaluated with respect to the time of execution, amount of communication, computation complexity, tuning of parameters and the required number of similarity operations. Our results indicate the feasibility of the implemented methods and proposed framework for this type of application and this sort of real-life time series. Our thesis concludes with discussion of the impact of the similarity measures on the network discovery results, as well as proposing further investigations into other parts of the parameter space.

Item Metadata

Title	Parallel locality sensitive hashing for network discovery from time series
Creator	Sodol, Svetlana
Supervisor	Wagner, Alan S.
Publisher	University of British Columbia
Date Issued	2022
Description	Similarity operations on time series are a vital area in data mining research. Science and systems applications require a scalable solution that is fast enough to work with streaming of real data and provide reliable results in the presence of noise, high dimensionality and time dependency. Locality sensitive hashing (LSH) has shown advancement in limiting the number of comparisons required for similarity operations, thus reducing the overall time, by pre-processing the data into buckets as candidates for approximate nearest neighbours. This thesis proposes a scalable system based on locality sensitive hashing by implementing it in parallel with independent hash functions. The parallel system we present is implemented in a message-passing framework for four locality sensitive hashing methods – minhash, approximate binary correlation (ABC), symbolic aggregate approximation (SAX), and ``sketch, shingle and hash" (SSH), each with its own pre-processing, hash creation, and similarity measure. A preliminary investigation implements minhash with a bag-of-words representation of text data to validate our proposed framework. The experimental part of the thesis focuses on comparing the other three LSH methods (ABC, SAX, SSH) on a real flight data set processed in a streaming fashion, flexible to the size of the time series used. The output of our parallel system is a similarity network discovered from the data, that we use to detect an anomaly present in the data set. The LSH methods are evaluated with respect to the time of execution, amount of communication, computation complexity, tuning of parameters and the required number of similarity operations. Our results indicate the feasibility of the implemented methods and proposed framework for this type of application and this sort of real-life time series. Our thesis concludes with discussion of the impact of the similarity measures on the network discovery results, as well as proposing further investigations into other parts of the parameter space.
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2022-01-25
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0406350
URI	http://hdl.handle.net/2429/80726
Degree	Master of Science - MSc
Program	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2022-05
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace

Open Collections

UBC Theses and Dissertations

UBC Theses and Dissertations

Parallel locality sensitive hashing for network discovery from time series Sodol, Svetlana

Abstract

Item Metadata

Item Media

Item Citations and Data

Rights