Selectivity estimation of approximate predicates on text

UBC Theses and Dissertations

Selectivity estimation of approximate predicates on text Lee, Hongrae

Abstract

This dissertation studies selectivity estimation of approximate predicates on text. Intuitively, we aim to count the number of strings that are similar to a given query string. This type of problem is crucial for handling text in RDBMSs in an error-tolerant way. A common difficulty with textual data is that it may contain typographical errors or use similar but different textual representations for the same real-world entity. To handle such data in databases, approximate text processing has attracted extensive interest, and commercial databases have begun to incorporate such functionality. One of the key components in the successful integration of approximate text processing into RDBMSs is the selectivity estimation module, which is central to optimizing queries involving such predicates. However, these developments are relatively new, and ad hoc approaches, e.g., using a constant as the estimate, have been employed. This dissertation studies reliable selectivity estimation techniques for approximate predicates on text. Among many possible predicates, we focus on two types that are fundamental building blocks of SQL queries: selections and joins. We study two different semantics for each type of operator. We propose a set of related summary structures and algorithms to estimate the selectivity of selection and join operators with approximate matching. A common challenge is that there can be a huge number of variants to consider. The proposed data structures enable efficient counting by considering a group of similar variants together rather than each one separately. A lattice-based framework is proposed to account for overlapping counts among the groups. We performed an extensive evaluation of the proposed techniques using real-world and synthetic data sets. Our techniques support popular similarity measures, including edit distance, Jaccard similarity, and cosine similarity, and we show how to extend the techniques to other measures. The proposed solutions are compared with state-of-the-art and baseline methods. Experimental results show that the proposed techniques deliver accurate estimates with small space overhead.
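To make the estimated quantity concrete, the following minimal Python sketch computes the exact selectivity of an approximate selection by brute force, i.e., the fraction of strings within a given edit distance of a query string. It is not the dissertation's method: the summary structures and lattice-based framework described above exist precisely to avoid this full scan. All function names (`edit_distance`, `qgrams`, `jaccard`, `exact_selectivity`) and the sample data are hypothetical and chosen only for illustration, and the use of character q-grams as the sets for Jaccard similarity is an assumption.

```python
from typing import List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgrams(s: str, q: int = 2) -> Set[str]:
    """Set of overlapping character q-grams (assumed tokenization)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity of the two q-gram sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def exact_selectivity(strings: List[str], query: str, max_edits: int) -> float:
    """Fraction of strings within max_edits of query, by scanning every string.

    A selectivity estimator approximates this value from a small summary
    instead of comparing the query against each stored string.
    """
    if not strings:
        return 0.0
    matches = sum(1 for s in strings if edit_distance(s, query) <= max_edits)
    return matches / len(strings)


if __name__ == "__main__":
    names = ["smith", "smyth", "smithe", "jones"]  # hypothetical column values
    print(exact_selectivity(names, "smith", max_edits=1))
    print(jaccard("night", "nacht"))
```

The scan costs one distance computation per stored string, which is the work a query optimizer cannot afford at planning time and the reason a compact summary is needed.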

Item Metadata

Title	Selectivity estimation of approximate predicates on text
Creator	Lee, Hongrae
Publisher	University of British Columbia
Date Issued	2010
Genre	Thesis/Dissertation
Type	Text
Language	eng
Date Available	2010-09-22
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0051964
URI	http://hdl.handle.net/2429/28645
Degree (Theses)	Doctor of Philosophy - PhD
Program (Theses)	Computer Science
Affiliation	Science, Faculty of; Computer Science, Department of
Degree Grantor	University of British Columbia
Graduation Date	2010-11
Campus	UBCV
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace
