Improving hash join performance by exploiting intrinsic data skew

UBC Theses and Dissertations

Featured Collection

UBC Theses and Dissertations

Improving hash join performance by exploiting intrinsic data skew Cutt, Bryce

Abstract

Large relational databases are a part of all of our lives. The government uses them and almost any store you visit uses them to help process your purchases. Real-world data sets are not uniformly distributed and often contain significant skew. Skew is present in commercial databases where, for example, some items are purchased far more often than others. A relational database must be able to efficiently find related information that it stores. In large databases the most common method used to find related information is a hash join algorithm. Although mitigating the negative effects of skew on hash joins has been studied, no prior work has examined how the statistics present in modern database systems can allow skew to be exploited and used as an advantage to improve the performance of hash joins. This thesis presents Histojoin: a join algorithm that uses statistics to identify data skew and improve the performance of hash join operations. Experimental results show that for skewed data sets Histojoin performs significantly fewer I/O operations and is faster by 10 to 60% than standard hash join algorithms.

Item Metadata

Title	Improving hash join performance by exploiting intrinsic data skew
Creator	Cutt, Bryce
Publisher	University of British Columbia
Date Issued	2009
Description	Large relational databases are a part of all of our lives. The government uses them and almost any store you visit uses them to help process your purchases. Real-world data sets are not uniformly distributed and often contain significant skew. Skew is present in commercial databases where, for example, some items are purchased far more often than others. A relational database must be able to efficiently find related information that it stores. In large databases the most common method used to find related information is a hash join algorithm. Although mitigating the negative effects of skew on hash joins has been studied, no prior work has examined how the statistics present in modern database systems can allow skew to be exploited and used as an advantage to improve the performance of hash joins. This thesis presents Histojoin: a join algorithm that uses statistics to identify data skew and improve the performance of hash join operations. Experimental results show that for skewed data sets Histojoin performs significantly fewer I/O operations and is faster by 10 to 60% than standard hash join algorithms.
Extent	1296388 bytes
Genre	Thesis/Dissertation
Type	Text
File Format	application/pdf
Language	eng
Date Available	2009-06-17
Provider	Vancouver : University of British Columbia Library
Rights	Attribution-NonCommercial-NoDerivatives 4.0 International
DOI	10.14288/1.0067287
URI	http://hdl.handle.net/2429/9375
Degree (Theses)	Master of Science - MSc
Program (Theses)	Interdisciplinary Studies
Affiliation	Graduate Studies, College of (Okanagan)
Degree Grantor	University of British Columbia
Graduation Date	2009-05
Campus	UBCO
Scholarly Level	Graduate
Rights URI	http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository	DSpace

Open Collections

UBC Theses and Dissertations

UBC Theses and Dissertations

Improving hash join performance by exploiting intrinsic data skew Cutt, Bryce

Abstract

Item Metadata

Item Media

Item Citations and Data

Rights