UBC Theses and Dissertations
Improving hash join performance by exploiting intrinsic data skew
Cutt, Bryce
Large relational databases are a part of all of our lives. The government uses them, and almost any store you visit uses them to help process your purchases. Real-world data sets are not uniformly distributed and often contain significant skew. Skew is present in commercial databases where, for example, some items are purchased far more often than others. A relational database must be able to efficiently find related information that it stores. In large databases, the most common method used to find related information is a hash join algorithm. Although mitigating the negative effects of skew on hash joins has been studied, no prior work has examined how the statistics present in modern database systems can allow skew to be exploited to improve the performance of hash joins. This thesis presents Histojoin, a join algorithm that uses statistics to identify data skew and improve the performance of hash join operations. Experimental results show that for skewed data sets Histojoin performs significantly fewer I/O operations and is 10 to 60% faster than standard hash join algorithms.
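The abstract does not describe how Histojoin works internally, so the following is only a minimal sketch of the general idea of exploiting skew in a hash join, not the thesis's algorithm. It assumes that a frequency estimate for the probe-side join key is available (here the hypothetical dictionary `probe_key_freq`, standing in for a database histogram), keeps build tuples for the most frequent keys in memory, and counts every other tuple as "spilled" to a partition, a rough stand-in for partition I/O. The function name, its parameters, and the `hot_keys` cutoff are all illustrative assumptions.

```python
from collections import defaultdict


def skew_aware_hash_join(build, probe, probe_key_freq, hot_keys=2, partitions=4):
    """Join two lists of (key, payload) tuples on key.

    probe_key_freq is an assumed stand-in for optimizer statistics:
    a dict mapping join-key value -> estimated frequency on the probe side.
    Returns (results, spilled), where spilled counts tuples that had to be
    written to a partition instead of being handled in memory right away.
    """
    # Assumed policy: treat the `hot_keys` most frequent values as hot.
    ranked = sorted(probe_key_freq, key=probe_key_freq.get, reverse=True)
    hot = set(ranked[:hot_keys])

    hot_table = defaultdict(list)  # in-memory table for hot build tuples
    build_parts = [[] for _ in range(partitions)]
    probe_parts = [[] for _ in range(partitions)]
    spilled = 0

    # Build phase: hot keys stay in memory, everything else is partitioned.
    for key, payload in build:
        if key in hot:
            hot_table[key].append(payload)
        else:
            build_parts[hash(key) % partitions].append((key, payload))
            spilled += 1

    # Probe phase: tuples with hot keys are joined immediately.
    results = []
    for key, payload in probe:
        if key in hot:
            for build_payload in hot_table.get(key, ()):
                results.append((key, build_payload, payload))
        else:
            probe_parts[hash(key) % partitions].append((key, payload))
            spilled += 1

    # Cleanup phase: join each pair of spilled partitions separately.
    for build_part, probe_part in zip(build_parts, probe_parts):
        table = defaultdict(list)
        for key, payload in build_part:
            table[key].append(payload)
        for key, payload in probe_part:
            for build_payload in table.get(key, ()):
                results.append((key, build_payload, payload))

    return results, spilled


if __name__ == "__main__":
    items = [(1, "item-a"), (2, "item-b"), (3, "item-c")]
    sales = [(1, "sale-%d" % i) for i in range(8)] + [(2, "sale-x"), (3, "sale-y")]
    freq = {1: 8, 2: 1, 3: 1}  # item 1 is purchased far more often
    print(skew_aware_hash_join(items, sales, freq, hot_keys=1)[1])  # 4
    print(skew_aware_hash_join(items, sales, freq, hot_keys=0)[1])  # 13
```

In this toy data set, keeping the single most frequent key in memory means only 4 tuples are spilled instead of 13, while the join result is unchanged. A real hybrid hash join also keeps one regular partition in memory and works under a fixed memory budget, both of which this sketch leaves out.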
Attribution-NonCommercial-NoDerivatives 4.0 International