UBC Theses and Dissertations
Improving hash join performance by exploiting intrinsic data skew. Cutt, Bryce. 2009.

Full Text
Improving Hash Join Performance By Exploiting Intrinsic Data Skew

by

Bryce Cutt

BSc(Hons), University of British Columbia, 2007

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The College of Graduate Studies (Interdisciplinary Studies)

THE UNIVERSITY OF BRITISH COLUMBIA (Okanagan)

March, 2009

© Bryce Cutt 2009

Abstract

Large relational databases are a part of all of our lives. The government uses them and almost any store you visit uses them to help process your purchases. Real-world data sets are not uniformly distributed and often contain significant skew. Skew is present in commercial databases where, for example, some items are purchased far more often than others. A relational database must be able to efficiently find related information that it stores. In large databases the most common method used to find related information is a hash join algorithm. Although mitigating the negative effects of skew on hash joins has been studied, no prior work has examined how the statistics present in modern database systems can allow skew to be exploited and used as an advantage to improve the performance of hash joins. This thesis presents Histojoin: a join algorithm that uses statistics to identify data skew and improve the performance of hash join operations. Experimental results show that for skewed data sets Histojoin performs significantly fewer I/O operations and is 10 to 60% faster than standard hash join algorithms.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
2 Background
  2.1 Relational Databases
    2.1.1 Joins
    2.1.2 Keys
    2.1.3 Cardinality
  2.2 Hash Join
    2.2.1 In-Memory Hash Join
    2.2.2 Hash Partitioning
    2.2.3 Grace Hash Join
    2.2.4 Hybrid Hash Join
    2.2.5 Dynamic Hash Join
    2.2.6 Hash Join Performance Enhancements
  2.3 Skew
  2.4 Statistics and Histograms
    2.4.1 Histograms and Hash Joins
  2.5 Example Database
  2.6 Relational Algebra Query Plan Diagrams
3 Histojoin
  3.1 General Approach
    3.1.1 Theoretical Performance Analysis
  3.2 Histojoin Algorithm
    3.2.1 Algorithm Overview
    3.2.2 Selecting In-Memory Tuples
    3.2.3 Partitioning
  3.3 Using Histojoin
    3.3.1 Join Cardinality
    3.3.2 Histogram Inaccuracies
    3.3.3 Query Optimizer Modifications
4 Experimental Results
  4.1 Stand-Alone Evaluation
    4.1.1 Primary-to-Foreign Key Joins
    4.1.2 Many-to-Many Joins
    4.1.3 Histogram Inaccuracies
    4.1.4 Joins on String Keys
    4.1.5 Multi-Way Joins
  4.2 PostgreSQL Implementation
    4.2.1 Primary-to-Foreign Key Joins
    4.2.2 Multi-Way Joins
    4.2.3 Effect of Number of MCVs
  4.3 Results Summary
5 Discussion and Conclusion
Bibliography

List of Tables

2.1 Part Relation
2.2 Purchase Relation
2.3 Part-Purchase Join Result
2.4 Part-Purchase Hash Join Result
2.5 An Accurate Histogram on the partid Attribute of the Purchase Relation
2.6 An Aggregate Histogram on the partid Attribute of the Purchase Relation
3.1 Absolute Reduction in Total I/Os of Skew-Aware Partitioning versus Random Partitioning for Various Values of f and g and |R| = |S| = 1000
3.2 Histogram Partitioning Example
3.3 Join Cardinality Cases

List of Figures

2.1 The Parts of a Relation
2.2 Chained Hash Table Example (mod 5 hash function)
2.3 Partition 0 of Part and Purchase Relations
2.4 Partition 1 of Part and Purchase Relations
2.5 Partition 2 of Part and Purchase Relations
2.6 Partition 3 of Part and Purchase Relations
2.7 Partition 4 of Part and Purchase Relations
2.8 DHJ Partitioning Example Part 1
2.9 DHJ Partitioning Example Part 2
2.10 TPC-H Schema from [1]
2.11 Example Relational Algebra Diagram
3.1 Two Level Partitioning
3.2 Total I/Os Percent Difference
3.3 Partkey Histogram for Lineitem Relation TPC-H 1 GB Zipf Distribution (z=1)
3.4 Histojoin Flowchart
3.5 Example Multiple Join Plans
4.1 Lineitem-Part Join (1GB, z=1)
4.2 Lineitem-Part Join (1GB, z=2)
4.3 Percentage Improvement in Total I/Os of Histojoin vs. Hash Join (1GB)
4.4 Total I/Os for Wisconsin Many-to-Many Join (1GB, z=1)
4.5 Total I/Os for Lineitem-Part Join with Histogram Inaccuracies (1GB)
4.6 Total I/Os for Lineitem-Supplier Join on String Key (1GB, z=1)
4.7 Total I/Os for Lineitem-Supplier-Part Join (1GB)
4.8 PostgreSQL Lineitem-Part Join (10GB, z=1)
4.9 PostgreSQL Lineitem-Part Join (10GB, z=2)
4.10 PostgreSQL Percentage Improvement in Total I/Os of Histojoin vs. Hash Join (10GB)
4.11 Total I/Os for PostgreSQL Lineitem-Supplier-Part Join (10GB)
4.12 PostgreSQL Lineitem-Part Join With Various Amounts of MCVs (10GB, z=1)

Acknowledgements

I would like to thank my supervisor Dr. Ramon Lawrence for providing me with the opportunity to do a Master's degree under his supervision. Dr. Lawrence has been a valuable instructor, mentor, and most of all friend during this journey. His feedback, insight, and passion for the subject matter have been invaluable in re-igniting my own obsession with data management and in bringing my thesis to its current form.

I am forever grateful for the patience my wife and family have had for me when my mind is engrossed in a subject. Angela has always provided love and support no matter what mental state I was in. My parents Ken and Sue have always provided a safe harbour when nothing seems to be going right.

I would also like to thank Dr. Patricia Lasserre and Dr. Yves Lucet for mentoring me throughout my previous degree and providing many opportunities for me to build the confidence and skills needed to pursue a Master's degree.

Many people have provided feedback and support regarding my thesis and I would like to express my appreciation to them and acknowledge their contribution to my success.

Dedication

To my family

1. Introduction

The world contains enormous stores of information for various uses, be they academic, government, financial, commercial, etc. For this information to be useful it must be stored and retrieved efficiently. For many years the method of choice for storing large amounts of information has been database systems. Most modern commercial database systems are relational database systems. A relational database consists of tables of information (called relations) that are related to each other according to various rules (Section 2.1).

A relational database system provides methods for storing, retrieving, sorting, searching, and comparing information. It does so while limiting a user's need to understand the underlying system. The operations performed by a database are largely automated and the user interacts with the database primarily through queries in a very natural English-like language called SQL (Structured Query Language) [6].

When a user queries a database the actual operations performed and the algorithms used are hidden from the user. The database must carefully manage its memory usage as most large database systems contain far more information than will fit in the memory of a computer at one time. Databases choose how to search relations for information and how to compare information in multiple relations so that related information can be returned to the user. Returning related information from multiple relations requires that the database compare parts of the different relations and join the matching parts together.
This is done by specialized algorithms called join algorithms (Section 2.1.1).

Hash join is the standard join algorithm used in database systems to process large join queries (Section 2.2). Any performance improvement for hash joins is significant due to the cost and prevalence of hash-based joins, especially in the large queries present in data warehouses and decision-support systems. We are increasingly dependent on large governmental, educational, and commercial database systems that are queried regarding our tax information, our student records, and whenever we purchase an item.

With a centralized database system the primary concern of a join algorithm (other than producing an accurate join) is to produce a result quickly by efficiently using the limited memory available and only using disk resources when absolutely necessary, as the speed of a disk is an order of magnitude slower than the speed of memory. Parallel and distributed database systems attempt to load balance the work of a join across many nodes. Although parallel databases attempt to avoid partition skew for load balancing, no parallel join algorithm has examined maximizing the in-memory results produced by keeping frequent data memory-resident.

Real data sets often contain skew. Skew occurs in data when certain values occur more frequently than others (Section 2.3). Partition skew is when a data set is split into multiple partitions and some partitions contain more information than the others. Many data sets follow the "80/20 rule" where a small subset of the data items occur much more frequently. For example, consider a company that sells many products and has a database that stores information on customer purchases. This information is valuable for determining which products to re-stock, which sales will be most beneficial, and which products need to be featured in advertising. If a few products are sold far more often than other products then the database will contain far more entries related to those products.
This is a very common example of skew.

Traditionally skew has been seen as a negative for join algorithms. Prior work has focused on how to maximize the performance of join algorithms by avoiding the negative effects of skew and using memory more efficiently. Avoiding and mitigating partition skew has been considered in centralized and distributed databases for hash joins [10]. More recently in [19] various approaches were used to lower the negative effect of skew on the performance of sort-merge join algorithms. In each of these cases skew was seen as a problem to be avoided. It is a major issue for the largest databases used by corporations and government where data sizes are in terabytes and queries may take hours.

Modern relational databases contain statistics on the underlying data set that can be used to detect skew before a join algorithm is used. This research examines how skew can be detected and exploited by a modified hash join algorithm to improve performance. There has been no prior work that detects data skew in the relations and uses that skew as an advantage to maximize the number of in-memory results produced by a hash join. By using the features of a modern database system skew can finally be seen as an advantage and used to greatly increase database performance when properly exploited.

By detecting intrinsic data skew in the data set and preferentially buffering the most useful data, memory is used more efficiently, I/O operations are decreased, and join and database performance is improved. This thesis is confirmed by research that produced the Histojoin algorithm. Histojoin is a modification to hash join that exploits data skew to improve hash join performance. The Histojoin algorithm implementation uses statistics to detect skew in the input relations. Statistics such as histograms [13] are commonly produced by a database system for query optimization and can be exploited at no cost by the join algorithm.
The algorithm has better performance than standard hash joins for skewed data. The improvements made to hash join allow it to finally take advantage of statistics that have been available in commercial database systems for years.

The contributions are as follows.

- An analysis of the advantage of exploiting data skew to improve hash join performance.
- A modification of hash join called Histojoin that uses statistics to detect data skew and adapt its memory allocation to maximize its performance.
- An implementation of Histojoin in a stand-alone Java database system and an implementation of Histojoin in the popular PostgreSQL open source database system.
- An experimental evaluation that demonstrates the benefits of Histojoin for large data warehouse queries using the TPC-H data set.

This thesis expands on the presentation in [4, 5]. The PostgreSQL implementation is currently being evaluated for inclusion in the main production branch of PostgreSQL where it will have an impact on many real world databases and millions of users.

The organization of this thesis is as follows. In Chapter 2 the database and many of its operations are described in enough detail to provide a base of understanding necessary to appreciate the purpose and benefits of Histojoin. In Chapter 3 the Histojoin algorithm's operation and functionality are explained in detail. In Chapter 4 Histojoin is compared to a standard hash join in many experiments. The results demonstrate significant performance improvements with Histojoin vs. hash join that increase as the data skew increases. In Chapter 5 the content of this thesis is summarized and conclusions are drawn from the experimental results.

2. Background

2.1 Relational Databases

A modern relational database consists of tables of information that are related to each other according to various rules.
These tables are referred to as relations.

[Figure 2.1: The Parts of a Relation]

A relation consists of columns and rows where each row is an entry in the relation and each column specifies a piece of information that each row contains. A row is referred to as a tuple and a column is referred to as an attribute. In the example Part relation given in Figure 2.1 and Table 2.1, the attributes are (partid, name, mfgr, and price) and one of the tuples is (1, jeans, Manufacturer#1, $10.00). As can be seen the tuple contains a piece of information (a value) for each attribute in the relation and if multiple tuples are examined it is apparent that values in the same attribute of different tuples are of a similar type. The Purchase relation given in Table 2.2 follows a similar format. These relations will be used as examples throughout the thesis.

Table 2.1: Part Relation

partid  name       mfgr            price
1       jeans      Manufacturer#1  $10.00
2       shoes      Manufacturer#1  $2.00
3       linens     Manufacturer#1  $20.00
4       chocolate  Manufacturer#3  $30.00
5       firebrick  Manufacturer#1  $12.00
6       moccasin   Manufacturer#2  $100.00
7       saddle     Manufacturer#2  $50.00
8       khakis     Manufacturer#1  $32.00
9       soap       Manufacturer#3  $52.00
10      shampoo    Manufacturer#3  $55.00

Table 2.2: Purchase Relation

purchaseid  partid  quantity  tax    shipdate    shipmode
1           1       10000     $0.02  1993-10-10  MAIL
2           2       50000     $0.08  1995-10-28  RAIL
3           2       5000      $0.03  2001-04-19  TRUCK
4           2       3300      $0.02  1998-07-24  AIR
5           2       8300      $2.00  2004-01-14  MAIL
6           2       1000      $0.08  1993-10-11  RAIL
7           2       2000      $0.02  1995-10-29  TRUCK
8           2       3100      $0.03  2001-04-20  AIR
9           2       1900      $0.03  1998-07-25  MAIL
10          3       1800      $0.02  2004-01-15  RAIL
11          3       1500      $0.08  1993-10-12  TRUCK
12          3       1100      $0.08  1995-10-30  AIR
13          4       500       $0.02  2001-04-21  MAIL
14          4       1500      $0.03  1998-07-26  RAIL
15          5       100000    $0.02  2004-01-16  TRUCK
16          6       200000    $0.05  1993-10-13  AIR
17          7       1300      $0.08  1995-10-31  MAIL
18          8       10000     $0.02  2001-04-22  RAIL
19          9       5000      $0.02  1998-07-27  TRUCK
20          10      100000    $0.02  2004-01-17  AIR

A relational database also models the relationships between the data the relations represent.
The example database containing the Part and Purchase relations models types of parts that are for sale and individual purchases made of those parts. Each type of part can be purchased one or more times and each purchase is a purchase of one and only one type of part. This relationship can be seen by examining the partid attribute in the Part and Purchase relations. Each value in the partid attribute of the Purchase relation exists in one and only one tuple of the partid attribute in the Part relation. This is commonly referred to as a one-to-many relationship (see Section 2.1.3). The partid attribute contains unique values in the Part relation in that every tuple has its own value for this attribute and none are the same.

2.1.1 Joins

When a user retrieves information about a part they may want to also know all the individual purchases that have been made for that part. Also when they retrieve information about an individual purchase they may want to know about the part that was purchased. Using the relationship between Part and Purchase we can find the part that was sold by taking the partid of the purchase and looking it up in the Part relation. We can also find all the purchases made of a part by taking the part's partid and looking for all matching tuples in the Purchase relation.

If the database completed this process as described above the user would do the first query, make a note of the partid, and then do another query.
As the data in a real database is usually far more complex and abundant than this example (perhaps millions of tuples) it is more efficient to allow the database to do this second lookup using specifically designed algorithms called join algorithms, wherein the database joins the tuples of one relation with the tuples of another relation according to their relationship and some join condition.

The result of joining two relations is a collection of tuples where for each tuple in the first relation and each matching tuple in the second relation we have a result tuple whose values are a concatenation of the values from the first relation tuple and the second relation tuple. If the first relation has the attributes (partid, name, mfgr, and price) and the second relation has the attributes (purchaseid, partid, quantity, tax, shipdate, and shipmode) then each result tuple has the attributes (partid, name, mfgr, price, purchaseid, partid, quantity, tax, shipdate, and shipmode). In this simple database example it contains a duplicate of the join attribute (partid) because it is a simple concatenation.

The SQL statement for this join would be "SELECT * FROM Part, Purchase WHERE Part.partid = Purchase.partid" and the result returned by the query is shown in Table 2.3.

Table 2.3: Part-Purchase Join Result

partid  name       mfgr            price    purchaseid  partid  quantity  tax    shipdate    shipmode
1       jeans      Manufacturer#1  $10.00   1           1       10000     $0.02  1993-10-10  MAIL
2       shoes      Manufacturer#1  $2.00    2           2       50000     $0.08  1995-10-28  RAIL
2       shoes      Manufacturer#1  $2.00    3           2       5000      $0.03  2001-04-19  TRUCK
2       shoes      Manufacturer#1  $2.00    4           2       3300      $0.02  1998-07-24  AIR
2       shoes      Manufacturer#1  $2.00    5           2       8300      $2.00  2004-01-14  MAIL
2       shoes      Manufacturer#1  $2.00    6           2       1000      $0.08  1993-10-11  RAIL
2       shoes      Manufacturer#1  $2.00    7           2       2000      $0.02  1995-10-29  TRUCK
2       shoes      Manufacturer#1  $2.00    8           2       3100      $0.03  2001-04-20  AIR
2       shoes      Manufacturer#1  $2.00    9           2       1900      $0.03  1998-07-25  MAIL
3       linens     Manufacturer#1  $20.00   10          3       1800      $0.02  2004-01-15  RAIL
3       linens     Manufacturer#1  $20.00   11          3       1500      $0.08  1993-10-12  TRUCK
3       linens     Manufacturer#1  $20.00   12          3       1100      $0.08  1995-10-30  AIR
4       chocolate  Manufacturer#3  $30.00   13          4       500       $0.02  2001-04-21  MAIL
4       chocolate  Manufacturer#3  $30.00   14          4       1500      $0.03  1998-07-26  RAIL
5       firebrick  Manufacturer#1  $12.00   15          5       100000    $0.02  2004-01-16  TRUCK
6       moccasin   Manufacturer#2  $100.00  16          6       200000    $0.05  1993-10-13  AIR
7       saddle     Manufacturer#2  $50.00   17          7       1300      $0.08  1995-10-31  MAIL
8       khakis     Manufacturer#1  $32.00   18          8       10000     $0.02  2001-04-22  RAIL
9       soap       Manufacturer#3  $52.00   19          9       5000      $0.02  1998-07-27  TRUCK
10      shampoo    Manufacturer#3  $55.00   20          10      100000    $0.02  2004-01-17  AIR

2.1.2 Keys

In a database system it is important to be able to find an individual tuple in a relation. Keys are a means for a database to find and compare individual tuples in its relations. Usually when joins are performed on relations those joins are performed using key attributes as the join attributes. Many database systems also automatically generate statistics for key attributes which will be important in later sections of this thesis.

Primary Keys

A tuple contains values for each of its attributes. A tuple can be uniquely identified by finding the tuple that has exactly these values. A primary key (PK) is a minimal set of attributes that uniquely identifies a tuple in a relation. For example, in the Part relation the PK is partid. Inspection of Table 2.1 shows that every tuple in the Part relation has a different value in this attribute. In the Purchase relation the PK is purchaseid.

Foreign Keys

When two relations are related the database must store some information in these relations so that for each tuple in one relation all of the related tuples in the other relation can be found. A foreign key (FK) is a set of attributes in a relation whose values can be used to find related tuples in another relation. For example, in the case of the Purchase relation the partid attribute for each Purchase tuple contains a value that is equal to the partid value for the related Part tuple.
In the Part relation (Table 2.1) there is a tuple with the value 5 in its partid attribute. In the Purchase relation (Table 2.2) all tuples with the value 5 in their partid attribute are related to that single tuple from the Part relation.

Often a database system will enforce that a foreign key value is not valid unless that value exists in the primary key attribute for at least one tuple in the relation that the foreign key references.

2.1.3 Cardinality

If there is a relationship between two relations in a database then that relationship has a cardinality. The possible cardinalities are one-to-one (1:1), one-to-many (1:M), and many-to-many (M:N).

In a one-to-one relationship each tuple in one relation is related to (shares attribute values with) one and only one tuple in the related relation. For this relationship to exist the values in the key attributes must be unique and therefore those attributes can be a primary key of the relation.

In a one-to-many relationship each tuple in the "one" side relation can share the same key values as many tuples in the "many" side relation while each tuple in the "many" side contains the same value as one and only one tuple in the "one" side relation. Usually this type of relationship is implemented as a primary-to-foreign key relationship where the joins are performed using the values in the primary key of the "one" relation and a foreign key of the "many" relation. The key attributes of the "one" side relation must contain unique values and can therefore be a primary key of the relation. The key attributes of the "many" side relation are not unique and are foreign key attributes of the relation. The Part and Purchase relations follow this type of relationship. It is possible that a tuple in the "one" side relation may not be related to any tuples in the "many" relation.

A many-to-many relationship exists when each tuple in one of the relations can share the same attribute values as many tuples in the other relation and vice versa.
The set of join attributes in each relation cannot be a primary key simply because it cannot be required to be unique.

2.2 Hash Join

There are numerous join algorithms [11]. The three general types are nested loop join, sort-based join, and hash-based join. In a nested loop join each tuple in the first relation is compared linearly to each tuple in the second relation and any matching tuples generate a result tuple. A sort-based join first sorts each of the input relations on the join attributes and then linearly scans through both relations simultaneously, generating a result tuple each time the same join attribute values are encountered in both underlying relations. A hash-based join is one that uses a hash function (Section 2.2.1) on the join attributes of the input relations and only compares the tuples of the first relation with the tuples of the second relation that hash to the same value, as only those tuples have a chance of matching and generating result tuples.

In a hash join the smaller relation of the two (in a one-to-many join it is usually the "one" side of the join) is called the build relation and the other relation is called the probe relation. The build relation is the relation whose tuples are used to build and fill the in-memory data structures while the probe relation is the relation whose tuples are used to probe and search the in-memory data structures.

2.2.1 In-Memory Hash Join

An in-memory hash join is a join algorithm that uses a hash function and in-memory hash table to simplify the comparison of join attribute values and limit the number of tuple comparisons necessary to find all the result tuples of a join.

Hash Function

A hash function is a function (y = f(x)) that takes an input value and returns an output value that falls within an acceptable range of possible values. The input values could be lines of text and the output values could be all valid 8 digit hexadecimal numbers.
The input values could be all positive integer numbers and the output values could be all integers between (and including) 0 and 4.

The process of calling a hash function on a value is often called hashing, and the output value is often referred to as the hash of the input value. Often a hash function is used to take a set of input values and map those into locations where they are to be stored.

As a function, if two input values are the same, then the output values generated must be the same. For instance if the hash of 5 is 0 the first time the hash function is called then the hash of 5 must be 0 the second time. It is not required that different input values produce different output values.

One use of a hash function would be to separate a large unordered set of unsigned integers (positive whole numbers) into five groups of roughly equal size, making sure that all numbers that are equal fall in the same group, without first sorting those numbers. If we use a hash function that outputs five possible values we can have a group for each value the hash function produces.

A very simple hash function that can be performed on unsigned integers is an arithmetic modulo operation. If there are five groups then performing a modulo 5 (mod 5) on the input values would produce output values in the range 0 to 4, which is exactly 5 possible values. Assign these possible values to the groups much like the indexes of an array. Starting at 0 we have group 0, group 1, group 2, group 3, and group 4. The hash of the number is the group to place the number in.

A data structure that uses a hash function to determine the location that each value is to be stored and uses that same hash function to find values that are already stored in the data structure is called a hash table. In the example above each location in the hash table stores multiple values.
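The mod 5 grouping described above can be sketched in a few lines. This is an illustrative Python sketch, not code from the thesis (whose implementations are in Java and PostgreSQL's C); the names hash_mod5 and group_values are introduced here for illustration only.

```python
# Separate an unordered set of unsigned integers into 5 groups using
# the hash function h(x) = x mod 5. Equal numbers always land in the
# same group because a hash function maps equal inputs to equal outputs.
def hash_mod5(value: int) -> int:
    return value % 5

def group_values(values):
    groups = [[] for _ in range(5)]  # group 0 .. group 4
    for v in values:
        groups[hash_mod5(v)].append(v)
    return groups

groups = group_values([1, 2, 4, 5, 7])
# 5 hashes to group 0, 1 to group 1, 2 and 7 both to group 2, 4 to group 4.
```

Note that 2 and 7 collide in group 2: different inputs are allowed to produce the same hash value, which is why each group must hold multiple values.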
To store a new value is an O(1) operation as it requires only the hash function to be computed on the value and then the value can be stored directly in the appropriate location. To find and retrieve a value in this hash table is not O(1) however, as once the correct location is found from the hash value all values in that location must be searched to find the correct one. A perfect hash table is one that maps each input value to a unique location, which would make retrieval an O(1) operation. Finding the perfect hash function that creates a perfect hash table is often impossible. Database hash functions must be fast but not perfect. Further discussion of hashing as well as how to find a good hash function is in Section 6.4 of [17].

Separate Chaining Hash Table

One common form of hash table is a separate chaining hash table (or chained hash table). In a chained hash table you start by determining a rough estimate of the number of entries the table needs to store. Then a hash bucket is created for each value the hash table is to store and an array of pointers to those hash buckets is also created. The index of the array is the hash value that will be placed in that bucket so any input value that hashes to 3 will be placed in bucket 3. For example, if the hash table must store 5 values then 5 hash buckets are necessary, indexed from 0 to 4 as shown in Figure 2.2a.

As we assume our hash function is not perfect, a hash bucket must be able to accommodate multiple input values that hash to the same hash value. It is also possible that an input value could occur multiple times. A hash bucket can be as simple as a linked list. Each bucket in Figure 2.2a initially contains an empty linked list.

To insert an input tuple into the chained hash table the hash value of the input tuple is calculated and then the input tuple is appended to the corresponding hash bucket. This is an O(1) operation as long as appending to the hash bucket is O(1), and it is for any well built linked list.
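A chained hash table along the lines of Figure 2.2 can be sketched as follows. This is an illustrative Python sketch (the thesis's own implementation is Java); the class name ChainedHashTable is an assumption, and Python lists stand in for the linked-list buckets.

```python
# A minimal separate chaining hash table keyed on partid with a
# mod 5 hash function, mirroring the example in Figure 2.2.
class ChainedHashTable:
    def __init__(self, num_buckets=5):
        # One bucket per possible hash value; each bucket plays the
        # role of the linked list described in the text.
        self.buckets = [[] for _ in range(num_buckets)]

    def hash(self, key):
        return key % len(self.buckets)

    def insert(self, key, tuple_):
        # O(1): compute the hash, then append to the matching bucket.
        self.buckets[self.hash(key)].append((key, tuple_))

table = ChainedHashTable()
for partid in [1, 2, 4, 5, 7]:
    table.insert(partid, ("Part tuple", partid))
# As in Figure 2.2c, partids 2 and 7 both hash to bucket 2,
# so bucket 2 holds two entries while the collision is still handled.
```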
For example, the Part tuple with partid 1 hashes to bucket 1 (with a mod5 hash function) and is appended to the linked list for bucket 1 as shown in Figure 2.2b. If the tuples with partid 2, 4, 5, and 7 are also inserted, the hash table looks like Figure 2.2c. Both 2 and 7 hash to the same bucket, but because the bucket can accommodate multiple tuples this situation is handled.

(a) Initial Table  (b) After Inserting Tuple With partid 1  (c) After Inserting All Tuples
Figure 2.2: Chained Hash Table Example (mod5 hash function)

To retrieve or find an entry in the chained hash table, the hash value of the input value is calculated and then the corresponding hash bucket is linearly searched for any tuples that match the input value. For example, to find a tuple with partid 7 in the hash table, 7 is hashed and found to match bucket 2. The tuples in that bucket are then linearly scanned and the tuple that contains partid 7 is returned. If the hash function generates relatively unique hash values for input tuples and there are no duplicate partid values, then most hash buckets will contain only one tuple and this operation is in practice close to O(1). However, this operation could be O(N) in the worst case, where all input tuples happen to hash to the same bucket. Hash tables usually implement a feature that detects and corrects this problem by adapting the hash function and rehashing the input tuples. For a chained hash table this could be as simple as creating a new hash table that is multiple times larger than the current hash table. Rehashing is an expensive operation. Additional information on hashing and hash tables is in [11].

Probing A Hash Table

Probing a hash table of tuples is the process of searching the hash table for any tuples that will join with a given tuple and producing result tuples from any matches. It is possible for one probe to return multiple result tuples if multiple tuples in the hash table match the given tuple.
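A minimal chained hash table matching the example above might look like this. Python lists stand in for linked-list buckets, and the names (ChainedHashTable, insert, find) are illustrative.

```python
class ChainedHashTable:
    def __init__(self, num_buckets=5):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]  # one list per hash value

    def _hash(self, key):
        return key % self.num_buckets  # the mod5 hash from the text

    def insert(self, key, tuple_):
        # O(1): compute the hash and append to that bucket.
        self.buckets[self._hash(key)].append((key, tuple_))

    def find(self, key):
        # Linearly scan the single bucket the key hashes to.
        return [t for (k, t) in self.buckets[self._hash(key)] if k == key]

table = ChainedHashTable()
for partid, name in [(1, "jeans"), (2, "shoes"), (4, "chocolate"),
                     (5, "firebrick"), (7, "saddle")]:
    table.insert(partid, (partid, name))
# 2 and 7 share bucket 2, yet find(7) returns only the saddle tuple.
```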
Finding matching tuples in a hash table can be done while comparing very few tuples, which is far more efficient than comparing each probe tuple to every build tuple as in a nested loop join.

If the hash table in Figure 2.2c is probed with the Purchase tuple (17, 7, 1300, $0.08, 1995-10-31, MAIL), the partid value 7 is hashed and found to match bucket 2. Bucket 2 is then linearly scanned and any tuples that have partid 7 are joined with the Purchase tuple to produce a result tuple. The first tuple in bucket 2 does not match, but the second one does, which produces the result tuple (7, saddle, Manufacturer#2, $50.00, 17, 7, 1300, $0.08, 1995-10-31, MAIL). If multiple tuples in bucket 2 had matched the probe tuple then multiple result tuples would have been generated.

Hash Join Algorithm

An in-memory hash join is performed by storing the build relation tuples in an in-memory hash table and then probing those tuples with all probe relation tuples. A hash table is created in memory and then filled with build relation tuples. For each probe relation tuple the hash table is probed and any matches are returned as result tuples. If there are more build tuples than will fit in memory at one time then an in-memory hash join cannot be used. To solve this problem, hash join algorithms that partition the build relation into smaller chunks have been developed.

2.2.2 Hash Partitioning

If there are more build relation tuples than can fit in available memory, one approach is to partition the relations into smaller partitions of tuples so that all of the build relation tuples in a partition can fit in memory. The number of partitions necessary is determined by calculating what percentage of the build relation will fit in available memory.
For example, if 20% of the build relation will fit, then 5 partitions are necessary.

When splitting the input relations into partitions, the build relation is partitioned first. The hash value of the join attribute value of each tuple is calculated and used to determine which partition the tuple will be placed in. If a join of the Part and Purchase relations is being performed, the join attribute in each relation is the partid attribute. Note this attribute is an unsigned integer (always a positive whole number), so a hash function could be an arithmetic modulo operation as described in Section 2.2.1. If the database chooses to partition the join into five partitions then the hash function could be modulo five (mod5).

The build and probe relations have their own partitions, but these partitions are related. To see this, look at the hash function and the attribute values being hashed on. If the hash function is mod5 then an attribute value of 5 hashes to 0 for both relations. Therefore any tuple in the build relation with a partid attribute value of 5 will end up in partition 0 for the build relation, and any tuple in the probe relation with a partid attribute value of 5 will end up in partition 0 for the probe relation. Other tuple values (such as 10) will hash to this same partition, but we can at least guarantee that all tuples with the same value will be in corresponding partitions.

Figures 2.3a, 2.4a, 2.5a, 2.6a, and 2.7a represent the contents of the five partitions of the Part relation when partitioned using mod5 as the hash function. Notice how all tuples in partition 0 (Figure 2.3a) have values in the partid attribute that hash to 0 with the given hash function. Similarly, Figures 2.3b, 2.4b, 2.5b, 2.6b, and 2.7b represent the contents of the five partitions of the Purchase relation using the given hash function.
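Partitioning both relations with the same hash function, so matching tuples land in corresponding partitions, can be sketched as below. Tuple layouts and names are illustrative.

```python
# Hash-partition a list of tuples on the join attribute at key_pos.
def partition(tuples, key_pos, num_partitions=5):
    parts = [[] for _ in range(num_partitions)]
    for t in tuples:
        parts[t[key_pos] % num_partitions].append(t)
    return parts

part = [(5, "firebrick"), (10, "shampoo"), (2, "shoes")]
purchase = [(15, 5), (20, 10), (2, 2)]   # (purchaseid, partid)
build_parts = partition(part, key_pos=0)
probe_parts = partition(purchase, key_pos=1)
# partids 5 and 10 both hash to 0, so they share build partition 0,
# and their purchases share probe partition 0.
```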
All values of the partid attribute in partition 0 (Figure 2.3b) hash to 0.

partid  name       mfgr            price
5       firebrick  Manufacturer#1  $12.00
10      shampoo    Manufacturer#3  $55.00
(a) Part Partition 0

purchaseid  partid  quantity  tax    shipdate    shipmode
15          5       100000    $0.02  2004-01-16  TRUCK
20          10      100000    $0.02  2004-01-17  AIR
(b) Purchase Partition 0
Figure 2.3: Partition 0 of Part and Purchase Relations

partid  name      mfgr            price
1       jeans     Manufacturer#1  $10.00
6       moccasin  Manufacturer#2  $100.00
(a) Part Partition 1

purchaseid  partid  quantity  tax    shipdate    shipmode
1           1       10000     $0.02  1993-10-10  MAIL
16          6       200000    $0.05  1993-10-13  AIR
(b) Purchase Partition 1
Figure 2.4: Partition 1 of Part and Purchase Relations

partid  name    mfgr            price
2       shoes   Manufacturer#1  $2.00
7       saddle  Manufacturer#2  $50.00
(a) Part Partition 2

purchaseid  partid  quantity  tax    shipdate    shipmode
2           2       50000     $0.08  1995-10-28  RAIL
3           2       5000      $0.03  2001-04-19  TRUCK
4           2       3300      $0.02  1998-07-24  AIR
5           2       8300      $2.00  2004-01-14  MAIL
6           2       1000      $0.08  1993-10-11  RAIL
7           2       2000      $0.02  1995-10-29  TRUCK
8           2       3100      $0.03  2001-04-20  AIR
9           2       1900      $0.03  1998-07-25  MAIL
17          7       1300      $0.08  1995-10-31  MAIL
(b) Purchase Partition 2
Figure 2.5: Partition 2 of Part and Purchase Relations

partid  name    mfgr            price
3       linens  Manufacturer#1  $20.00
8       khakis  Manufacturer#1  $32.00
(a) Part Partition 3

purchaseid  partid  quantity  tax    shipdate    shipmode
10          3       1800      $0.02  2004-01-15  RAIL
11          3       1500      $0.08  1993-10-12  TRUCK
12          3       1100      $0.08  1995-10-30  AIR
18          8       10000     $0.02  2001-04-22  RAIL
(b) Purchase Partition 3
Figure 2.6: Partition 3 of Part and Purchase Relations

partid  name       mfgr            price
4       chocolate  Manufacturer#3  $30.00
9       soap       Manufacturer#3  $52.00
(a) Part Partition 4

purchaseid  partid  quantity  tax    shipdate    shipmode
13          4       500       $0.02  2001-04-21  MAIL
14          4       1500      $0.03  1998-07-26  RAIL
19          9       5000      $0.02  1998-07-27  TRUCK
(b) Purchase Partition 4
Figure 2.7: Partition 4 of Part and Purchase Relations

2.2.3 Grace Hash Join

The Grace Hash Join (GHJ) [16] is a hash join algorithm that partitions the input relations into multiple temporary disk files using hash
partitioning as described in Section 2.2.2, and then performs an in-memory hash join (Section 2.2.1) on each build and probe pair of disk files.

The first step of a GHJ is to partition the build and probe relations as described in Section 2.2.2. A file is created on disk for each build relation partition, and another file is created on disk for each probe relation partition.

The tuples of the build relation (given in Table 2.1) are read one at a time, their hash value is calculated from the join attribute value (partid), and they are appended to the corresponding build relation file on disk. The order that tuples are read from the underlying relations cannot be guaranteed, but for convenience we will assume they are read in the order that they appear in Tables 2.1 and 2.2 for this example. Looking at Table 2.1, the first build tuple read is (1, jeans, Manufacturer#1, $10.00) and it is placed in partition 1 (Figure 2.4a). This process is continued for all other build relation tuples.

Once the build relation is partitioned, a tuple is read from the probe relation and the hash value of its join attribute value is calculated. The tuple is then appended to the corresponding probe relation file on disk. Looking at Table 2.2, the first probe tuple retrieved from the Purchase relation is (1, 1, 10000, $0.02, 1993-10-10, MAIL); its hash value is 1, so it is placed in partition 1 (Figure 2.4b).

After partitioning both relations, there are pairs of related partitions on disk. The build relation partition 0 is related to probe relation partition 0 because they were partitioned using the same hash function on the common join attribute. Any tuples in probe partition 0 will join with only tuples from build partition 0.

The next step is to load tuples from the first build relation partition (partition 0) into memory and then probe them with tuples from the first probe partition.
An efficient data structure must be used so that it is efficient to store the build tuples in memory and search them for tuples that match a given probe tuple. A hash table, such as a separate chaining hash table (as discussed in Section 2.2.1), is created in memory and then filled with tuples from the build relation partition. Once this is complete, a tuple is read from the corresponding probe relation partition. The hash value of the tuple's join attribute is calculated, and then the probe tuple is compared to all build tuples in the corresponding hash table bucket. If any build tuples are found in the hash bucket that share the same join attribute value (not hash value) as the probe tuple, then a result tuple is generated for each match. The first tuple in probe partition 0 is (15, 5, 100000, $0.02, 2004-01-16, TRUCK). It matches the build relation tuple (5, firebrick, Manufacturer#1, $12.00) because they both have partid equal to 5. The result tuple is a concatenation of the values of both tuples and is exactly (5, firebrick, Manufacturer#1, $12.00, 15, 5, 100000, $0.02, 2004-01-16, TRUCK). This process continues for each tuple in probe relation partition 0.
The disk files for build partition 0 and probe partition 0 are then deleted, and the in-memory hash table is removed from memory. The above process is continued for each build/probe partition pair on disk until all files have been deleted.

The result of joining these two relations can be seen in Table 2.4.

partid  name       mfgr            price    purchaseid  partid  quantity  tax    shipdate    shipmode
5       firebrick  Manufacturer#1  $12.00   15          5       100000    $0.02  2004-01-16  TRUCK
10      shampoo    Manufacturer#3  $55.00   20          10      100000    $0.02  2004-01-17  AIR
1       jeans      Manufacturer#1  $10.00   1           1       10000     $0.02  1993-10-10  MAIL
6       moccasin   Manufacturer#2  $100.00  16          6       200000    $0.05  1993-10-13  AIR
2       shoes      Manufacturer#1  $2.00    2           2       50000     $0.08  1995-10-28  RAIL
2       shoes      Manufacturer#1  $2.00    3           2       5000      $0.03  2001-04-19  TRUCK
2       shoes      Manufacturer#1  $2.00    4           2       3300      $0.02  1998-07-24  AIR
2       shoes      Manufacturer#1  $2.00    5           2       8300      $2.00  2004-01-14  MAIL
2       shoes      Manufacturer#1  $2.00    6           2       1000      $0.08  1993-10-11  RAIL
2       shoes      Manufacturer#1  $2.00    7           2       2000      $0.02  1995-10-29  TRUCK
2       shoes      Manufacturer#1  $2.00    8           2       3100      $0.03  2001-04-20  AIR
2       shoes      Manufacturer#1  $2.00    9           2       1900      $0.03  1998-07-25  MAIL
7       saddle     Manufacturer#2  $50.00   17          7       1300      $0.08  1995-10-31  MAIL
3       linens     Manufacturer#1  $20.00   10          3       1800      $0.02  2004-01-15  RAIL
3       linens     Manufacturer#1  $20.00   11          3       1500      $0.08  1993-10-12  TRUCK
3       linens     Manufacturer#1  $20.00   12          3       1100      $0.08  1995-10-30  AIR
8       khakis     Manufacturer#1  $32.00   18          8       10000     $0.02  2001-04-22  RAIL
4       chocolate  Manufacturer#3  $30.00   13          4       500       $0.02  2001-04-21  MAIL
4       chocolate  Manufacturer#3  $30.00   14          4       1500      $0.03  1998-07-26  RAIL
9       soap       Manufacturer#3  $52.00   19          9       5000      $0.02  1998-07-27  TRUCK
Table 2.4: Part-Purchase Hash Join Result

2.2.4 Hybrid Hash Join

In the previous hash join example, none of the build partitions were more important than the others. All partitions were initially written to disk and then loaded back into memory to be joined a partition at a time.
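The Grace-style flow just recapped, partition both inputs fully and then join each partition pair, can be sketched as follows. Lists stand in for the temporary disk files, and all names are illustrative.

```python
def grace_hash_join(build, probe, build_key, probe_key, num_parts=5):
    # Partition phase: write every tuple of both relations to its
    # partition "file" (a list standing in for a disk file).
    build_files = [[] for _ in range(num_parts)]
    probe_files = [[] for _ in range(num_parts)]
    for b in build:
        build_files[b[build_key] % num_parts].append(b)
    for p in probe:
        probe_files[p[probe_key] % num_parts].append(p)

    # Join phase: for each pair, build an in-memory hash table from
    # the build partition and probe it with the probe partition.
    results = []
    for part in range(num_parts):
        table = {}
        for b in build_files[part]:
            table.setdefault(b[build_key], []).append(b)
        for p in probe_files[part]:
            for b in table.get(p[probe_key], []):
                results.append(b + p)  # result tuple is the concatenation
    return results

part = [(5, "firebrick"), (7, "saddle")]
purchase = [(15, 5), (17, 7)]   # (purchaseid, partid)
rows = grace_hash_join(part, purchase, build_key=0, probe_key=1)
```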
For any significantly large join in a real database it is likely that there is not enough memory to store all of the build relation tuples in memory at one time, but there is often enough memory to store some of the build tuples in memory while partitioning the build relation. Thus, fewer tuples must be written to disk while partitioning and then read back while probing. Storing data on and retrieving data from disk is on the order of a million times slower than storing and retrieving data from a computer's main memory. It is immediately apparent that limiting how much data must be stored on disk during the join is vitally important to the performance of a join algorithm. If some build tuples can be maintained in memory while partitioning, then those tuples do not need to be written to disk and then re-read later during the join.

Hybrid hash join (HHJ) [7] is a standard hash join algorithm used in database systems. HHJ works by partitioning the build relation as described in Section 2.2.2, except that the first partition (partition 0) is stored in memory and all other partitions are stored on disk. The number of partitions is determined just as in hash join, by calculating what percentage of the build relation will fit in memory. If 20% of the build relation will fit, then one fifth of the relation will fit in memory and 5 partitions are used.

At the start of the join, a file is created on disk for each partition other than the first one. A hash table is created in memory to store the build tuples of build relation partition 0. When tuples are read during partitioning of the build relation, any tuples that do not fall into partition 0 are appended to the end of the on-disk file for the corresponding partition. Any tuples that fall into partition 0 are placed in the in-memory hash table.

After the build relation tuples are partitioned, a file is created on disk for each of the probe relation partitions except the first one.
Probe tuples are then read one at a time from the probe relation and the hash value of their join attribute value is calculated. If the probe tuple falls into the first partition then the in-memory hash table is probed with that probe tuple and any matches are returned as result tuples, exactly as in the previous example. The probe tuple can then be discarded. It is not necessary to write it to disk, as all result tuples it would create have been generated. If the tuple falls in any other partition then it is appended to the end of the corresponding partition file on disk for the probe relation.

When all of the probe tuples have been read from the probe relation, the cleanup phase begins. All tuples and the hash table are deleted from memory. An in-memory hash table is created and the tuples in the disk file for the second build relation partition (partition 1) are loaded into it. Probe tuples are then read one at a time from the disk file for the second probe relation partition (partition 1) and used to probe the hash table. Any matches are returned as result tuples as above. All tuples and the hash table are then removed from memory and the two files that were just read are deleted from disk. This process is repeated for each partition that has a file on disk (partitions 2, 3, and 4). After this process the join is finished.

The partitioning process assumes that an equal number of tuples will fall into each build partition. This assumption makes sense if the data set is uniform. If, for example, it turns out that more than 20% of the build tuples hash to partition 0, then too much memory will be used to store these tuples. Various database systems handle this issue differently.
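The hybrid hash join phases described above might be sketched as below, with lists standing in for the on-disk partition files. This is a hedged sketch, not PostgreSQL's implementation; all names are illustrative.

```python
def hybrid_hash_join(build, probe, build_key, probe_key, num_parts=5):
    in_memory = {}                                  # hash table for partition 0
    build_files = [[] for _ in range(num_parts)]    # index 0 stays unused
    probe_files = [[] for _ in range(num_parts)]
    results = []

    # Build phase: partition 0 stays resident, the rest spill to disk.
    for b in build:
        p = b[build_key] % num_parts
        if p == 0:
            in_memory.setdefault(b[build_key], []).append(b)
        else:
            build_files[p].append(b)

    # Probe phase: partition-0 probes join immediately and are discarded;
    # probes for other partitions spill to disk.
    for pr in probe:
        p = pr[probe_key] % num_parts
        if p == 0:
            for b in in_memory.get(pr[probe_key], []):
                results.append(b + pr)
        else:
            probe_files[p].append(pr)

    # Cleanup phase: join each spilled partition pair in turn.
    for p in range(1, num_parts):
        table = {}
        for b in build_files[p]:
            table.setdefault(b[build_key], []).append(b)
        for pr in probe_files[p]:
            for b in table.get(pr[probe_key], []):
                results.append(b + pr)
    return results

# partid 5 joins during the probe phase; partid 1 spills and joins
# during cleanup.
rows = hybrid_hash_join([(5, "firebrick"), (1, "jeans")],
                        [(15, 5), (1, 1)], build_key=0, probe_key=1)
```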
The HHJ implementation in the PostgreSQL database system repartitions with twice as many partitions until it is no longer using too much memory.

If almost no build tuples hash to partition 0, then the algorithm is not making proper use of its memory and could perform faster if it kept more build tuples in memory. The solution to this problem is an algorithm called Dynamic Hash Join.

2.2.5 Dynamic Hash Join

Dynamic hash join (DHJ) [8, 21] is similar to hybrid hash join except that it dynamically selects memory-resident partitions during execution. Instead of picking only one partition to remain memory-resident before the join begins, DHJ allows all partitions to be memory-resident initially and then flushes partitions to disk as required when memory is full.

Although DHJ adapts to changing memory conditions, there has been no research on determining the best partition to flush to maximize performance. Various approaches select the largest or smallest partition, a random partition, or use a deterministic ordering. No approach has considered using data distributions to determine the optimal partition to flush. In the examples a deterministic ordering will be assumed because it is the simplest: the highest numbered partitions will be flushed first.

The DHJ algorithm is similar to HHJ. DHJ determines what percentage of tuples can fit in memory and then creates a number of partitions. If 20% of the tuples can fit in memory then, assuming a uniform distribution of tuples (an equal number will hash to each partition), the algorithm will need at least 5 partitions. Because a database system knows a uniform distribution of tuples is unlikely, it will often create more partitions than this estimate in an effort to compensate for data that is not uniform. In this example, 10 partitions may be used. It is also possible that the tuples are not evenly divided among the partitions due to the hash function used (partition skew).
When joining Part and Purchase, the build tuples from Part are uniformly distributed, which is acceptable for this example: 20% of the Part relation (the maximum amount that fits in memory) is equivalent to 2 tuples.

While partitioning the build relation, DHJ starts with all build partitions in memory. A hash table is stored in memory for each build partition. It reads the tuples in order from the Part relation and flushes a partition to disk only when more than 2 tuples would be stored in memory. When a partition is flushed to disk its hash table is removed from memory as well.

Tuple 1 (the tuple with partid 1) is read from the Part relation and placed in partition 1 (Figure 2.8a). The build partitions are now using 10% of memory. Tuple 2 is read and placed in partition 2 (Figure 2.8b); 20% of memory is now used. Tuple 3 is read and placed in partition 3 (Figure 2.8c). Because memory is over-full, partition 4 is flushed and its tuples written to disk (Figure 2.8d). This has been referred to as a frozen [8, 18] partition in previous work. It is marked as frozen so that any further tuples that would fall into that partition are immediately placed on disk. This did not free up any memory, so partition 3 is also flushed. The disk file for partition 3 now contains tuple 3 and memory is again only 20% full (Figure 2.8e). Tuple 4 is read and written to disk because partition 4 was previously frozen (Figure 2.8f). Tuple 5 is read and placed in partition 0 (Figure 2.9a). Memory is over-full again, so partition 2 is flushed to disk, which puts the partitions back at the memory limit (Figure 2.9b). Tuple 6 is read and placed in partition 1 (Figure 2.9c), which causes partition 1 to be flushed to disk (Figure 2.9d). Only 10% of memory is now in use. Tuples 7, 8, and 9 are all read and written to disk (Figure 2.9e). Tuple 10 is read and placed in partition 0 in memory (Figure 2.9f).
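The freezing behaviour walked through above can be sketched as follows, under the walkthrough's assumptions: 5 partitions, mod5 hashing, at most 2 tuples in memory, and highest-numbered partitions frozen first. Names such as dhj_partition_build and frozen are illustrative.

```python
def dhj_partition_build(tuples, key_pos, num_parts=5, mem_limit=2):
    in_memory = {p: [] for p in range(num_parts)}  # one table per partition
    frozen = set()   # partitions flushed to disk
    disk = []        # tuples written to disk, in arrival order
    for t in tuples:
        p = t[key_pos] % num_parts
        if p in frozen:
            disk.append(t)           # frozen partitions go straight to disk
            continue
        in_memory[p].append(t)
        # Flush highest-numbered unfrozen partitions until under the limit.
        while sum(len(v) for v in in_memory.values()) > mem_limit:
            victim = max(in_memory)
            disk.extend(in_memory.pop(victim))
            frozen.add(victim)
    return in_memory, frozen, disk

part = [(i, "part%d" % i) for i in range(1, 11)]  # partids 1..10
resident, frozen, spilled = dhj_partition_build(part, key_pos=0)
# Matches the walkthrough: partitions 4, 3, 2, then 1 are frozen, and
# only partition 0 (partids 5 and 10) remains in memory.
```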
All build tuples have now been partitioned and the build partitions are not over-using memory.

Reading tuples from the probe relation is handled exactly as for HHJ, except that if multiple partitions had been kept in memory after partitioning the build relation, then any probe tuples that hashed to an in-memory partition could have generated result tuples before the cleanup phase. The cleanup phase is handled exactly as for HHJ.

2.2.6 Hash Join Performance Enhancements

It is possible for the database optimizer to poorly estimate the size of the underlying relations and choose the larger one to be the build relation. After partitioning the relations, a hash join knows the exact size of its inputs. If the build input is larger than the probe input, hash join can swap them and use the probe input as the build input instead. This is called role reversal [12].

An optimization that can be used during the cleanup phase, after the input relations have been partitioned, is to process multiple partitions at the same time. With this optimization, instead of performing one cleanup iteration per partition, each cleanup iteration operates on as many partitions as can fit in memory.

When a single partitioning step is not sufficient to construct partitions that fit in memory, multiple rounds of recursive partitioning are used. Recursive partitioning avoids having many (hundreds of) open files during partitioning, which would increase the impact of random I/Os while writing to those files.

2.3 Skew

Skew can be classified [23] as either partition skew or intrinsic data skew. Partition skew is when the partitioning algorithm constructs partitions of non-equal size (often due to intrinsic data skew, but also due to the hash function itself). Minimizing partition skew has been considered for distributed databases [10] and DHJ [15, 21].
Partition skew can be partially mitigated by using many more partitions than required, as in DHJ, and by producing histograms on the data when recursive partitioning is required.

Consider two relations R(A) and S(B, A) where attribute A is the join attribute between R and S. The underlined attributes are the primary key attributes of the relations. Assume that the number of tuples of R, denoted as |R|, is smaller than the number of tuples of S (i.e. |R| < |S|). Assume S.A is a foreign key to R.A. In a hash join of R and S, the build relation is R and the probe relation is S.

When performing a hash join, most systems have no intelligent way of selecting which partition remains memory-resident. PostgreSQL simply selects the first partition. This assumption makes sense if the data set is uniform. In that case, each tuple in R is equally likely to join with tuples in S, so it does not matter which tuples of R are left in memory. If the data is skewed such that certain tuples in R join with many more tuples in S than the average, it is preferable that those tuples of R remain in memory.

Intrinsic data skew is when data values are not distributed uniformly. Intrinsic data skew may cause partition skew for hash joins when the join attribute on the build relation is not the primary key attribute. Data skew causes values to occur with different frequencies in the relations. In the example that joins R.A = S.A (a primary-to-foreign key join), data skew may cause the distribution of values of S.A (the probe relation) to vary dramatically.

Histograms have been previously used during recursive partitioning [12] to detect data skew and minimize partition skew. However, no previous work has discussed using existing histograms in the database system to estimate and exploit skew in the probe relation while partitioning the build relation.

2.4 Statistics and Histograms

Most commercial database systems create and maintain some form of aggregate statistics on their relations to be used for optimizing query processing.
A database will keep an accurate count of how many tuples are in each relation and the average size of a tuple, so that when a query is performed it can correctly calculate the amount of memory necessary at each step in the query and which order of operations will perform the fastest.

When a selection or filter is performed on a relation as part of a query, a database can make educated guesses as to how many tuples will make it past the filter based on the filter being used. If the database knows that there are roughly 10 distinct values in the attribute being filtered on, and the filter is of the form "attribute = value", then it could assume that one tenth of the relation's tuples will match the filter.

partid  Frequency
1       1
2       8
3       3
4       2
5       1
6       1
7       1
8       1
9       1
10      1
Table 2.5: An Accurate Histogram on the partid Attribute of the Purchase Relation

When there is a uniform distribution of values in an attribute this is a safe estimate, but this is not always the case. The partid attribute of the Purchase relation in Table 2.2 is a perfect example of a case where this estimate would be quite incorrect. There are 10 distinct values in the partid attribute, but they are not uniformly distributed. While the average frequency of each value is 2, the value 2 occurs 4 times as often. A histogram of the partid attribute of the Purchase relation is provided in Table 2.5. Generally a histogram bucket is defined by the range of values that fall into the bucket and the aggregate frequency of all those values (how many times those values occur). The first column of this histogram is the attribute value and the second column is how frequently that value occurs in all the tuples of the relation. Not all histograms are organized this way (see [13]).

With a pre-generated histogram on the partid attribute, the query optimizer can more accurately estimate the result of a filter. If the filter is "partid = 2", the database would estimate that 8 tuples would pass the filter.
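Estimating how many tuples pass an equality filter from a bucketed histogram, with a uniform assumption inside each bucket, can be sketched as below. The function name and histogram encoding are illustrative.

```python
import math

def estimate_equals(histogram, value):
    """histogram: list of ((low, high), frequency) buckets."""
    for (low, high), freq in histogram:
        if low <= value <= high:
            distinct = high - low + 1          # distinct values in bucket
            return math.ceil(freq / distinct)  # uniform share, rounded up
    return 0

# An accurate histogram: one bucket per distinct partid value.
accurate = [((v, v), f) for v, f in
            [(1, 1), (2, 8), (3, 3), (4, 2), (5, 1),
             (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]]
# An aggregate histogram: two partid values per bucket.
aggregate = [((1, 2), 9), ((3, 4), 5), ((5, 6), 2), ((7, 8), 2), ((9, 10), 2)]
# For "partid = 2" the accurate histogram gives 8; the aggregate one
# gives ceil(9 / 2) = 5, matching the estimates in the text.
```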
The histogram in Table 2.5 is ideal because every distinct value in the partid attribute has its frequency stored.

In a large relation with millions of tuples, a histogram aggregates attribute values so that one bucket contains many values. A simple example of how this affects the accuracy of the histogram can be seen in Table 2.6, where the bucket value ranges now contain 2 values each. The frequency column of the histogram specifies the sum of all the frequencies of the individual values in the bucket. Since no individual frequencies are maintained, a uniform assumption must be made: each value in the bucket is assumed to account for an equal portion of the frequency of the bucket. With the filter "partid = 2" and this second histogram, the database would estimate that 5 (rounded up from 4.5) tuples would pass the filter, because the frequency of the bucket is 9 and the number of distinct values in the range of the bucket is 2.

partid Range  Frequency
1-2           9
3-4           5
5-6           2
7-8           2
9-10          2
Table 2.6: An Aggregate Histogram on the partid Attribute of the Purchase Relation

To compensate for the bad estimates caused by aggregate histograms, many specialized histogram types are used by various database systems. For example, an end-biased histogram maintains the most frequent values individually, separate from the rest of the buckets. It keeps track of the frequency of those individual values as well as the frequency of each bucket. End-biased histograms take up only slightly more space than our example histograms but allow the most frequent values to be known and exploited. Equi-depth histograms arrange the buckets so that each contains the same number of tuples; consequently the frequency of all buckets is the same, but the width of each bucket (the number of values in its range) varies.

When the input to a join is not a base relation (it could be another join or a selection, for example), the base relation statistics may not accurately reflect the distribution of tuples on that input.
One method to compensate for this is SITs (Statistics on Intermediate Tables) [2]. When a database system supports SITs, it occasionally runs common queries that involve joins and other operators and stores the statistics of the intermediate results of these queries. The query optimizer then uses these statistics to improve estimates for queries that are equivalent to the common queries, or when one of the common queries is a sub-query of the current query being estimated.

For a further discussion of histograms see [13, 14, 22].

2.4.1 Histograms and Hash Joins

In the preceding discussion of HHJ and DHJ, it was apparent that keeping some of the build tuples in memory while partitioning the build relation not only kept those build tuples from having to be written to and re-read from disk, it also kept the related probe tuples from having to be written to and re-read from disk. The histograms in Tables 2.5 and 2.6 show that some of the build tuples are related to more probe tuples than the other build tuples are. If the hash join knew this information when choosing which build tuples to keep in memory during the build phase, it could save many more tuple disk writes and reads.

If the histogram for the partid attribute of the Purchase relation were examined before partitioning the Part relation, the order in which DHJ freezes partitions could be changed so that partition 1 stays in memory and partition 0 is flushed to disk. If this were done, the number of result tuples generated during the probe phase would increase significantly, and the number of probe tuples written to disk during the probe phase and re-read during the cleanup phase would decrease.

2.5 Example Database

In this thesis, the TPC-H database benchmark standard is used as an example database. TPC-H is a decision-support and data warehouse benchmark developed by the Transaction Processing Performance Council (TPC). More information about the TPC-H benchmark can be found at [1].
The TPC-H schema diagram is in Figure 2.10. The SF in the diagram represents the scale factor of the relations. Scale factors 1 and 10 will be used, which produce total database sizes of approximately 1 and 10 GB respectively. The largest relation, LineItem, has just over 6 million tuples for SF=1 and 60 million tuples for SF=10.

2.6 Relational Algebra Query Plan Diagrams

In this thesis, relational algebra (RA) diagrams are used to describe the order in which multiple relations are joined in a single database query. RA diagrams can contain many operators, but in this thesis only joins and relations are represented in these diagrams. In the example RA diagram in Figure 2.11, LI represents the LineItem relation, S represents the Supplier relation, and P represents the Part relation of the TPC-H database benchmark described in Section 2.5. The transposed hourglass symbol represents a join of the relations connected below it. In Figure 2.11, Supplier is joined with LineItem and then the result of this join is joined with Part.

Figure 2.8: DHJ Partitioning Example Part 1. (a) After Inserting Tuple 1 (b) After Inserting Tuple 2 (c) After Inserting Tuple 3 (d) After Freezing Partition 4 (e) After Freezing Partition 3 (f) After Inserting Tuple 4

Figure 2.9: DHJ Partitioning Example Part 2. (a) After Inserting Tuple 5 (b) After Freezing Partition 2 (c) After Inserting Tuple 6 (d) After Freezing Partition 1 (e) After Inserting Tuples 7, 8, and 9 (f) After Inserting Tuple 10

Figure 2.10: TPC-H Schema from [1]. Relations and cardinalities: PART (P_) SF*200,000; PARTSUPP (PS_) SF*800,000; LINEITEM (L_) SF*6,000,000; ORDERS (O_) SF*1,500,000; CUSTOMER (C_) SF*150,000; SUPPLIER (S_) SF*10,000; NATION (N_) 25; REGION (R_) 5.

Figure 2.11: Example Relational Algebra Diagram

3. Histojoin

3.1 General Approach

The Histojoin algorithm is designed to use statistics and histograms as currently implemented in the database system. The algorithm does not assume any particular histogram method and will work with any method. Commercial systems typically implement equi-depth [20] or maxdiff (Microsoft SQL Server) histograms. An overview of histograms can be found in [13, 14, 22]. The actual construction of the histograms is orthogonal to this work.

The general approach is to use the extra memory available to the hash join to buffer the tuples that participate in the most join results. Consider a primary-to-foreign key join between R(A) and S(B, A) on A, where R is the smaller relation and some subset of its tuples are buffered in memory. Unlike hybrid hash join, which selects a random subset of the tuples of R to buffer in memory, the tuples buffered in memory will be chosen based on the values of A that are the most frequently occurring in relation S.

For example, let R represent a Part relation, and S represent a Purchase relation. Every company has certain parts that are more commonly sold than others. A common part may be associated with thousands of purchases and a rare part with only a handful. If a single part tuple ordered thousands of times is kept in memory when performing the join, every matching tuple in Purchase does not need to be written to disk and re-read during the cleanup phase.

Hash partitioning randomizes tuples in partitions. This is desirable to minimize the effect of partition skew, but data skew is also randomized.
Traditional hash joins have no ability to detect data skew in the probe relation or exploit it by intelligent selection of in-memory partitions.

Figure 3.1: Two Level Partitioning

This approach uses two levels of partitioning. The first level performs range partitioning where ranges of values of R.A are selected to be memory-resident. Tuples that do not fall into the ranges are partitioned using a hash function as usual. The data structures used are shown in Figure 3.1.

In Figure 3.1 there are 3 in-memory partitions (a, b, c) and 10 hash partitions numbered 0 to 9. For Level 1 partitions, each partition is defined by one or more join attribute value ranges. For example, partition a consists of values from 0 to 39 and 740 to 799. Ideally, these attribute values are the most frequently occurring in S. The maximum partition size is bounded by the memory size available to the join. The Level 2 partitions are regular hash partitions. If a tuple does not fall into any of the Level 1 partitions, it is placed in a Level 2 partition by hashing the join attribute value. In general, there may be multiple Level 1 memory-resident partitions, each defined by multiple ranges of values. The only constraints are that each partition must fit in the available memory during a cleanup phase at the end of the join, and that the total memory used by in-memory partitions is always below the memory available.

3.1.1 Theoretical Performance Analysis

This section provides the theoretical maximum improvement of skew-aware partitioning using data distributions versus random partitioning (dynamic hash join). Let f represent the fraction of the smaller relation (R) that is memory-resident: f = M/|R| (approximately), where M is the memory size. The number of tuple I/O operations performed by dynamic hash join is 2 * (1 - f) * (|R| + |S|).
The factor 2 represents the two I/Os performed for each non-memory-resident tuple: one to flush it to disk and one to read it again during the cleanup phase of the join. Note that this does not count the cost of reading the tuple initially.

Let g represent the fraction of the larger relation (S) that joins with the in-memory fraction f of R. If the distribution of the join values in S is uniform, then f = g. Data skew allows g > f if memory-resident tuples are chosen properly. The number of I/Os performed by Histojoin is 2 * (1 - f) * |R| + 2 * (1 - g) * |S|. The absolute difference in I/Os performed between DHJ and Histojoin is 2 * (1 - f) * (|R| + |S|) - (2 * (1 - f) * |R| + 2 * (1 - g) * |S|), which simplifies to 2 * (g - f) * |S|. A negative number indicates DHJ is outperforming Histojoin, while a positive number indicates Histojoin performs better than DHJ. The percentage difference in I/Os is (g - f) * |S| / ((1 - f) * (|R| + |S|)).

The absolute difference in total I/Os performed for selected values of f and g is given in Table 3.1. The percentage difference in total I/Os performed is given in Figure 3.2. The absolute difference is directly proportional to the difference between f and g. The table shown is for a 1:1 ratio of R and S where |R| = |S| = 1000. The difference between f and g is bounded above by the intrinsic skew in the data set and is limited by how well that skew is exploited during partitioning.

Properly exploiting data skew allows g > f, but if the in-memory tuples are chosen poorly, it is possible for g < f. This is worse than the theoretical average of the uniform case for dynamic hash join. Tuples may be chosen improperly if the statistics used for deciding which tuples to buffer in memory are incorrect, causing the algorithm to buffer worse than average tuples.

As an example, consider a data set following the "80/20 rule".
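Before working through the example, the cost formulas can be written out as a short, self-contained calculation (a sketch for illustration only; the class and method names are mine, not part of the thesis implementation):

```java
// Sketch of the I/O cost formulas from Section 3.1.1.
public class JoinIoCost {
    // Tuple I/Os for dynamic hash join: 2 * (1 - f) * (|R| + |S|)
    static double dhjIos(double f, long r, long s) {
        return 2 * (1 - f) * (r + s);
    }

    // Tuple I/Os for Histojoin: 2 * (1 - f) * |R| + 2 * (1 - g) * |S|
    static double histojoinIos(double f, double g, long r, long s) {
        return 2 * (1 - f) * r + 2 * (1 - g) * s;
    }

    // The difference between the two simplifies to 2 * (g - f) * |S|.
    static double absoluteDifference(double f, double g, long s) {
        return 2 * (g - f) * s;
    }

    public static void main(String[] args) {
        long r = 1000, s = 1000;       // the 1:1 case used in Table 3.1
        double f = 0.10, g = 0.50;
        System.out.println(dhjIos(f, r, s) - histojoinIos(f, g, r, s)); // 800.0
        System.out.println(absoluteDifference(f, g, s));                // 800.0
    }
}
```

With f = 10% and g = 50%, both expressions give the 800 I/O reduction listed in Table 3.1.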
If we can keep 20% of the tuples of R in memory (f = 20%) that join with 80% of the tuples in S (g = 80%), then a skew-aware join will perform 68% fewer tuple I/Os than hybrid hash join. However, if the data had "80/20 skew" but the 20% of tuples of R buffered in memory were the least frequently occurring in S, then it may be possible for g = 5%, resulting in the skew-aware join performing 17% more I/Os than hybrid hash join.

f \ g    5%     10%    20%    50%    80%    90%    100%
5%       0      100    300    900    1500   1700   1900
10%      -100   0      200    800    1400   1600   1800
20%      -300   -200   0      600    1200   1400   1600
50%      -900   -800   -600   0      600    800    1000
80%      -1500  -1400  -1200  -600   0      200    400
90%      -1700  -1600  -1400  -800   -200   0      200

Table 3.1: Absolute Reduction in Total I/Os of Skew-Aware Partitioning versus Random Partitioning for Various Values of f and g and |R| = |S| = 1000

3.2 Histojoin Algorithm

A low cost technique for performing skew-aware partitioning is to use histograms. Histograms [13] are used in all commercial databases for query optimization and provide an approximation of the data distribution of an attribute. A histogram divides the domain into ranges and calculates the frequency of the values in each range. An example histogram produced by Microsoft SQL Server 2005 for the TPC-H relation LineItem on attribute partkey is provided in Figure 3.3.

The advantage of using histograms is that they are readily available, calculated and maintained external to the join algorithm, and require no modification to the query optimizer or join algorithm to use. On examination of the histogram, the query optimizer can determine if the Histojoin algorithm will be beneficial. An imprecise or out-of-date histogram limits Histojoin's ability to exploit the data skew.

3.2.1 Algorithm Overview

The Histojoin algorithm works by implementing a set of privileged partitions in addition to the partitions dynamic hash join would normally use.
These privileged partitions are the last partitions to be flushed from memory to disk and are arranged so that they are flushed in a specific order, whereas the non-privileged partitions are flushed in a randomized order. The privileged partitions correspond to the Level 1 partitions shown in Figure 3.1.

Figure 3.2: Total I/Os Percent Difference

Figure 3.3: Partkey Histogram for LineItem Relation TPC-H 1 GB Zipf Distribution (z=1)

The differences between Histojoin and dynamic hash join are isolated in the hash table. Histojoin's hash table is a two-layered table that is aware of which partitions are privileged, how to determine if tuples fall in the privileged partitions, and how to optimally flush the partitions by first randomly flushing non-privileged partitions and then flushing the privileged partitions in order of worst to best.

The key difference between Histojoin and dynamic hash join is that Histojoin attempts to isolate frequently occurring tuples in the privileged partitions, which are the last ones flushed. Dynamic hash join spreads frequently occurring tuples across all partitions and provides no special handling for frequent tuples.

A flowchart describing the Histojoin algorithm is in Figure 3.4. The first step is to load the histogram and determine which tuple value ranges are in the privileged partitions. At this point, if insufficient skew is detected or there is limited confidence in the histogram, a decision is made on the maximum size of the privileged partitions. If no skew is detected, Histojoin will allocate no memory to the privileged partitions, and the algorithm behaves identically to dynamic hash join.
Determining the privileged partitions is discussed in Section 4.2.

Figure 3.4: Histojoin Flowchart

Given sufficient detected skew, the privileged partition ranges are organized into efficient data structures to allow rapid determination of privileged tuples. Each build and probe tuple requires a range check to determine if it belongs in a privileged partition. The range check operation is discussed in Section 4.3.

Histojoin processes the join in a similar manner to dynamic hash join. Tuples are read from the build relation. When a tuple is read, the range check is performed. If the tuple falls into a privileged partition, it is placed there. Otherwise, the tuple's partition is determined using a hash function as in DHJ. Whenever memory is full while the build relation is being read, a partition flush is performed. Non-privileged partitions are flushed first, in random order. If all non-privileged partitions are flushed, the privileged partitions are flushed in reverse order of benefit. Once a partition is flushed, a single disk buffer is allocated to the partition to make writing tuples to its disk file more efficient. A flushed partition cannot receive any new tuples in memory. Once the build relation is completely read, there will be some build partitions in memory and others in disk files.
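The flush policy just described (a random non-privileged partition first, then privileged partitions from least to most beneficial) can be sketched as follows; the class and method names are illustrative, not the thesis code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the partition flush order: non-privileged partitions are
// flushed first in random order; privileged partitions are flushed last,
// least-beneficial first.
public class FlushOrder {
    // nonPrivileged: partition ids in any order.
    // privilegedByBenefit: partition ids sorted from most to least beneficial.
    static int pickVictim(List<Integer> nonPrivileged,
                          List<Integer> privilegedByBenefit, Random rnd) {
        if (!nonPrivileged.isEmpty()) {
            // Flush a random non-privileged partition first.
            return nonPrivileged.remove(rnd.nextInt(nonPrivileged.size()));
        }
        // Only privileged partitions remain: flush in reverse order of benefit.
        return privilegedByBenefit.remove(privilegedByBenefit.size() - 1);
    }

    public static void main(String[] args) {
        List<Integer> hashParts = new ArrayList<>(List.of(2, 3, 4));
        List<Integer> privileged = new ArrayList<>(List.of(0, 1)); // 0 is best
        Random rnd = new Random(42);
        while (!hashParts.isEmpty() || !privileged.isEmpty()) {
            System.out.println(pickVictim(hashParts, privileged, rnd));
        }
        // Prints 2, 3, 4 in some random order, then 1, then 0.
    }
}
```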
Partitions that are memory-resident have main-memory (chained) hash tables constructed to store their tuples. These hash tables will be probed using tuples from the probe relation.

The probe relation tuples are then read. The range check is performed on each tuple. If the tuple corresponds to an in-memory build partition (privileged or not), it probes the chained hash table for that partition to potentially generate results. If the corresponding build partition is not in memory, the probe tuple is written to the probe partition file on disk. Once all probe relation tuples are read, there will be pairs of build and probe partitions on disk. Main memory is cleared, and each partition pair is read from disk and processed. Typically, the build relation partition is read, a chained hash table is produced, and then results are generated by probing with the probe relation tuples. However, common practices such as those described in Section 2.2.6 can be applied.

In summary, Histojoin behaves like dynamic hash join except that its hash table structure allows for the identification and prioritization of frequently occurring tuples in the probe relation. The differences between DHJ and Histojoin are embedded in the distribution of tuples between the two layers of the hash table and the order in which partitions are frozen to disk to free up memory. All other hash join techniques are unaffected by these modifications.

3.2.2 Selecting In-Memory Tuples

Given a histogram that demonstrates skew, Histojoin must determine a set of join attribute ranges that constitute the most frequently occurring values in S. Tuples of R with these frequently occurring values are the ones in the privileged partitions. For instance, in Figure 3.3 there are several ranges of part keys that occur frequently in LineItem. The challenge is that the join partition size is determined independently from histogram partitioning. For example, let |R| = 1000 and M = 100. Thus, at least 10 partitions of R are required.
The in-memory partition can have 100 tuples. It may require multiple independent ranges in the histogram to define a set of attribute ranges that contain up to 100 distinct values of R and have high frequency in S.

The greedy algorithm reads the histogram for S on the join attribute, sorts its buckets by frequency, and selects as many buckets as fit into memory, in order of highest frequency first. The detailed steps are:

- Assume each histogram bucket entry is a 6-tuple of the form (MINVAL, MAXVAL, ROWCOUNT, EQROWS, DISTINCT_ROWS, ROWS_R). MINVAL and MAXVAL are the lower and upper (inclusive) values defining the bucket range. ROWCOUNT is the number of rows in S with a value in the range [MINVAL, MAXVAL). EQROWS is the number of rows in S whose value is exactly equal to MAXVAL. DISTINCT_ROWS is the number of distinct values in S in the bucket range. ROWS_R is a derived value (not present in the histogram) that is the estimate of the number of rows in R that have a value in the histogram bucket range. The estimation of ROWS_R is given in Section 4.2.1.

- A bucket frequency is calculated as (ROWCOUNT + EQROWS) / ROWS_R.

- Sort the buckets in decreasing order by frequency.

- The sorted list is traversed in order. Assume the size of memory in tuples is M, and count is the number of tuples currently in the in-memory partition. A histogram bucket range is added to the in-memory partition if count + ROWS_R <= M.

- The previous step is repeated until the histogram is exhausted, there is no memory left to allocate, or the current bucket does not fit entirely in memory.

Consider the histogram in Table 3.2, and a join memory size of 400 tuples. The first histogram bucket added has range 751-1000 (250 tuples), as its frequency is 4.6. The second histogram bucket added has range 101-200, with frequency 3.1. The remaining memory available can be allocated in various ways: leave it as overflow, find the next bucket that fits, or divide a bucket.
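The greedy steps above can be sketched as follows (an illustrative sketch only: the class and field names are mine, DISTINCT_ROWS is omitted because the selection itself does not use it, and the bucket values come from Table 3.2):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A histogram bucket with the fields used by the greedy selection.
class Bucket {
    final long minVal, maxVal, rowCount, eqRows, rowsR;
    Bucket(long minVal, long maxVal, long rowCount, long eqRows, long rowsR) {
        this.minVal = minVal; this.maxVal = maxVal;
        this.rowCount = rowCount; this.eqRows = eqRows; this.rowsR = rowsR;
    }
    // Bucket frequency: (ROWCOUNT + EQROWS) / ROWS_R
    double frequency() { return (double) (rowCount + eqRows) / rowsR; }
}

public class GreedySelect {
    // Returns the buckets chosen for the privileged (in-memory) partition.
    static List<Bucket> select(List<Bucket> histogram, long memoryTuples) {
        List<Bucket> sorted = new ArrayList<>(histogram);
        sorted.sort((a, b) -> Double.compare(b.frequency(), a.frequency()));
        List<Bucket> chosen = new ArrayList<>();
        long count = 0;
        for (Bucket b : sorted) {
            if (count + b.rowsR > memoryTuples) break; // bucket must fit entirely
            chosen.add(b);
            count += b.rowsR;
        }
        return chosen;
    }

    public static void main(String[] args) {
        // The histogram of Table 3.2 with a memory budget of M = 400 tuples.
        List<Bucket> hist = Arrays.asList(
                new Bucket(1, 100, 300, 5, 100),
                new Bucket(101, 200, 300, 10, 100),
                new Bucket(201, 350, 150, 100, 150),
                new Bucket(351, 500, 200, 40, 150),
                new Bucket(501, 750, 244, 6, 250),
                new Bucket(751, 1000, 650, 500, 250));
        for (Bucket b : select(hist, 400)) {
            System.out.println(b.minVal + "-" + b.maxVal);
        }
    }
}
```

Run with a memory budget of 400 tuples, this selects the 751-1000 bucket (frequency 4.6) and then the 101-200 bucket (frequency 3.1), matching the example.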
With integer values, it is possible to take the next best bucket and split the range. In this case, the range from 1-100 can be divided into a subrange of 1-50.

MINVAL  MAXVAL  ROWCOUNT  EQROWS  DISTINCT_ROWS  ROWS_R  FREQ
1       100     300       5       100            100     3.05
101     200     300       10      100            100     3.1
201     350     150       100     150            150     1.67
351     500     200       40      150            150     1.6
501     750     244       6       250            250     1
751     1000    650       500     250            250     4.6

Table 3.2: Histogram Partitioning Example

Not all histograms will separate out the frequency of a boundary value (such as EQROWS); in that case the frequency is calculated as ROWCOUNT / ROWS_R. When a histogram does separate out the frequency of a boundary value (such as with maxdiff histograms), these values can be used as separate bucket ranges, as they typically have very high frequencies. These single, high-frequency values are referred to as premium values. Premium values have a high payoff as they occupy little memory (one tuple each) and match with many rows in the probe relation. Premium values tend to be good values to keep in memory even when the accuracy of the histogram is low, especially when they are significantly more common than the average value.

Note in the example that value 350 occurs 100 times even though, on average, the other values in the range 201-349 only occur once. The tuple with key 350 should be memory-resident. The algorithm creates a separate one-value range for each separation value. When sorted, these ranges may be selected independently of the rest of their histogram bucket. For example, with a memory size of 400 tuples, the algorithm selects the following ranges: 1000, 350, 500, 200, 750, 100, 751-999, 101-199, and 1-46. (The last range is a partial range of 1 to 100.) A tuple is in the in-memory partition if it falls in one of these ranges.

Estimating Cardinality Of Value Ranges

A histogram estimates the number of distinct values and number of tuples in a histogram bucket (DISTINCT_ROWS and ROWCOUNT respectively).
If the histogram is on the probe relation S, then the histogram provides the number of tuples in each bucket for relation S. However, Histojoin also requires an estimate of the number of tuples in R that have a value in a histogram bucket. This estimate is used to determine approximately how many histogram buckets can be memory-resident for the build relation R. This value is also used to determine the relative value of each bucket. Histogram buckets with few rows in R and numerous rows in S are prime candidates for privileged partitions.

Given the number of distinct values in S, DISTINCT_ROWS, the estimate of the number of rows in R with values in that range, ROWS_R, is determined as follows:

- For integer values, it is calculated using the bucket low and high range values. That is, ROWS_R = MAXVAL - MINVAL + 1.

- For non-integer values, it is estimated as ROWS_R = DISTINCT_ROWS.

There will be inaccuracy in estimating ROWS_R for non-integer values. For one-to-many joins, ROWS_R will be underestimated due to primary key values not appearing in the foreign key relation. For many-to-many joins, it is impossible to determine exactly how many rows in R will have values in the range without having a histogram on R as well. Some heuristics can be used based on the size of relation R, but in general, the estimate may be imprecise. Thus, it is critical that Histojoin adapts its memory management to flush even privileged partitions in case of inaccurate estimates. This is discussed further in Section 3.3.2.

Partitioning

Histojoin partitions tuples in two layers. The first layer contains privileged partitions with join attribute ranges that are defined as described in Section 4.2. A tuple is placed in a privileged partition if its join attribute value falls into one of the privileged partition ranges. This is performed using a range check function. A range check is performed for each tuple by comparing its join attribute value with the ranges calculated in Section 3.2.2.
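One way to implement such a range check, assuming integer join keys, is a small hash map for single-value (premium) ranges combined with a binary search over the wider, sorted ranges (a sketch; all names here are mine, not the thesis implementation):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative two-step range check: a hash map for single-value (premium)
// ranges plus a binary search over the wider sorted ranges.
public class RangeCheck {
    private final Map<Long, Integer> singles = new HashMap<>(); // value -> partition
    private final long[] lows, highs; // wider ranges sorted by low endpoint
    private final int[] partitions;   // partition number per wider range

    RangeCheck(Map<Long, Integer> singleValues, long[] lows, long[] highs, int[] partitions) {
        this.singles.putAll(singleValues);
        this.lows = lows;
        this.highs = highs;
        this.partitions = partitions;
    }

    // Returns the privileged partition for a join value, or -1 when the
    // tuple should instead be hashed to a non-privileged partition.
    int privilegedPartition(long value) {
        Integer p = singles.get(value);
        if (p != null) return p;
        int i = Arrays.binarySearch(lows, value);
        if (i < 0) i = -i - 2; // index of the last low endpoint <= value
        if (i >= 0 && value <= highs[i]) return partitions[i];
        return -1;
    }

    public static void main(String[] args) {
        // Premium values 350 and 1000 in partition 0; ranges 101-199 and
        // 751-999 in partition 1 (values from the Section 3.2.2 example).
        RangeCheck rc = new RangeCheck(Map.of(350L, 0, 1000L, 0),
                new long[]{101, 751}, new long[]{199, 999}, new int[]{1, 1});
        System.out.println(rc.privilegedPartition(350)); // 0 (premium value)
        System.out.println(rc.privilegedPartition(800)); // 1 (inside 751-999)
        System.out.println(rc.privilegedPartition(450)); // -1 (not privileged)
    }
}
```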
In this example, the ranges are 1000, 350, 500, 200, 750, 100, 751-999, 101-199, and 1-46.

For efficiency, the range check is implemented in two steps. The first step uses a hash table to record all ranges of size one. Each hash table entry maps the join attribute value to a partition number. This step is used for very frequently occurring values such as premium values. The size of this hash table is very small, usually less than 200 entries, as the number of premium values is limited by the number of histogram buckets. The second step processes all ranges of size greater than one by storing them in a sorted array. This sorted array is searched using a binary search to detect if a value is in one of the ranges.

When a range check is performed, the value is first tested against the hash table. If it is in the hash table, the mapped partition is returned. If not, a binary search is performed on the sorted array, and if a containing range is found, the related partition is returned. If the value is not found in either of these two structures, it does not fall in a privileged partition, and the value is hashed to find which non-privileged partition it belongs in. This hash partitioning works exactly the same as in dynamic hash join.

This hash table and search array method works for all types and combinations of values. A further speed optimization for integer values is to enter every value that falls in a range into the hash table and not use the binary search. This works for integer values because the possible values in a range can be discretely enumerated.

3.3 Using Histojoin

This section contains a discussion of some of the issues in using Histojoin.
These issues include handling different join cardinalities, tolerating inaccuracy in histograms, and supporting joins where the input relations come from selection or other join operators.

3.3.1 Join Cardinality

Although the previous examples considered primary-to-foreign key joins, Histojoin works for all join cardinalities. Histojoin is useful when there is a histogram on the join attribute of the probe relation and the probe relation has skew. If, due to filtering, the foreign key relation is the smaller (build) relation, Histojoin is not usable because the probe (primary key) relation is uniform, and there is no skew to exploit. However, it may be possible to reverse the roles and still make the larger foreign key relation the probe relation if there is skew to exploit that improves performance.

Histojoin adds no benefit over dynamic hash join for one-to-one joins due to the uniform distribution of the probe relation. In this case, Histojoin behaves exactly as dynamic hash join and allocates no privileged partitions.

For many-to-many joins, Histojoin only requires the histogram on the probe relation. The algorithm behaves exactly as in the one-to-many case, but execution of the algorithm may result in flushing privileged partitions, as the size estimates of the privileged build partitions are less accurate. For example, a histogram may indicate that the values from 5 to 10 have high frequency in the probe relation. Histojoin will estimate that there are 6 tuples in the build relation in that range. However, there may be multiple occurrences of each value such that there are actually 30 tuples in the build relation with values in that range. This may force Histojoin to flush some privileged partitions to compensate for the over-allocation of memory. A histogram on the build relation may mitigate some of these estimation concerns, but may be hard to exploit as independently produced histograms may have widely differing bucket ranges.
Even when Histojoin over-allocates privileged partition ranges, dynamic flushing based on frequency improves performance over dynamic hash join while avoiding memory overflows.

The join cardinality cases are enumerated in Table 3.3.

Type  Larger Side  Approach             Special Notes
1-1   Either       behave like DHJ      No skew in relations.
1-M   1            behave like DHJ      No skew in probe. Evaluate role reversal if skew on many-side.
1-M   M            use probe histogram  Skew can be exploited.
M-N   M or N       use probe histogram  Skew can be exploited.

Table 3.3: Join Cardinality Cases

3.3.2 Histogram Inaccuracies

In the ideal case, the join algorithm would know the exact distribution of the probe relation and be able to determine exactly the skew and the frequently occurring values. Without pre-sampling the inputs, this requires a pre-existing summary of the distribution as provided by histograms. Histograms are not perfect because they summarize the information, which results in a lack of precision. Also, a histogram may be inaccurate because it was constructed using only a sample of the data or was constructed before some modifications occurred on the table.

Note that skew-aware partitioning, as implemented by Histojoin, can be used with a sampling approach as well as with pre-defined histograms. The advantage of using histograms is that there is no overhead during join processing as compared to sampling. The disadvantage is that the accuracy of the distribution estimate may be lower. Histograms are valuable because they require no pre-processing for the join and are kept reasonably up-to-date for other purposes by the optimizer. Non-random sampling has been experimented with by examining the first few thousand tuples of the probe relation before processing the build relation.
Although it is sometimes possible to determine very frequent values using this approach, in general most relational operators produce a set of initial tuples that is far from a random sample. True random sampling incurs a cost that is too high for the potential benefit.

There are two key histogram issues. First, the histogram may not be a precise summary of a base relation distribution due to issues in its construction and maintenance in the presence of updates. Second, if the join is performed on relations that are derived from other relational operators (selection, other joins), then a histogram on the base relation may poorly reflect the distribution of the derived relation. Without an approach to derive histograms through relational operators, we must decide on our confidence in the histogram when allocating memory in the operator.

This approach assigns a confidence value to the histogram. The confidence value reflects the confidence in the accuracy of the histogram in relation to the data it is designed to represent. Histograms derived after selections have lower confidence than those recently built on the base relation.

The confidence value is used to determine how many privileged partitions are used. With a high confidence value, privileged partition ranges are defined such that almost all of the memory is allocated to the privileged partitions, as we are reasonably certain that the best tuples in the histogram are actually the best tuples in the relation. For a low confidence value, only the absolute best values as determined by the histogram are used as the range partitions. The result is that the algorithm can control its benefit or penalty as compared to DHJ based on the confidence of the estimates. This improves the stability, robustness, and overall performance of the algorithm.

For example, consider M = 1000 (1000 tuples can fit into memory). Let the join attribute value range be 1 to 2000.
With a high confidence histogram, the algorithm would define privileged partition ranges to occupy all 1000 tuples of memory available. For instance, it may allocate 4 ranges, 100-199, 300-599, 1000-1199, and 1500-1899, that would correspond to 1000 tuples in the build relation. With a low confidence histogram, the algorithm only allocates the very best ranges, which may result in only 2 ranges such as 100-199 and 1000-1199 (300 total tuples). The algorithm determines the ranges to allocate based on the frequency of occurrence and the confidence value. With the low confidence histogram, the range 100-199 must have been significantly more common than average. The number of privileged partitions is reduced with a low confidence histogram to reduce the penalty of error. For instance, the range of values 100-199 may turn out to be very infrequently occurring in the probe relation. Buffering build tuples in the range 100-199 would then produce fewer results than buffering random tuples.

There are multiple possibilities for determining how many tuples to put in the privileged partition based on the histogram confidence level. One approach is to select ranges whose frequencies are one or more standard deviations better than the mean frequency of all ranges. The amount by which the ranges must be better than the mean is increased for lower confidence histograms. A high confidence histogram will fill up memory with histogram buckets that are above the mean. A low confidence histogram will only accept buckets that are multiple standard deviations better than the mean.

The approach chosen to measure the quality of the histogram depends on the database system and its optimizer. The two experimental implementations (see Chapter 4) use different approaches to selecting ranges based on the confidence level. The stand-alone Java implementation, which only performs the joins and does not have an optimizer, operates in two modes.
Histograms on base relations, with or without a selection operator, are considered high confidence, and all privileged ranges better than the average are selected. A low confidence histogram is one where a base relation histogram is used to estimate the distribution of a relation produced by an intermediate join operator. In this case, only single premium values are used and no ranges. The PostgreSQL implementation exploits PostgreSQL's statistics that capture the most common values (MCVs) of an attribute. All MCVs are kept regardless of the histogram confidence, and the equi-depth histogram is not used to select ranges. This is an effective approach as the penalty for being incorrect with MCVs is minimal, the payoff is potentially very high, and there is a high probability that MCVs of a base relation remain MCVs in derived relations. More details are in Section 6.

There are two potential "costs" in using this approach. The first is a lost opportunity cost that occurs when, due to low confidence in the histogram, we do not select ranges with frequently occurring values as privileged partitions. In this case, the performance of the algorithm could have been improved had it been more aggressive in selecting privileged partition ranges. However, the performance would be no worse than dynamic hash join, as any tuples that are not privileged get flushed randomly as in DHJ. The second cost, inaccuracy cost, is much more important. Inaccuracy cost occurs when a value range is selected as privileged and turns out to be less frequently occurring than average. For example, if the 100 build tuple values in the range 100-199 map to 2 tuples on average in the probe relation, and the average build tuple maps to 3 tuples, then skew-aware partitioning will have worse average performance than dynamic hash join. For low confidence histograms, it is better to be conservative in selecting privileged ranges, as there is a penalty for being too aggressive.
By selecting no privileged ranges, Histojoin behaves exactly as DHJ.

Handling Selections

The discussion so far has considered joins where both inputs are base relations. It is common that a selection is performed before a join. A selection on the probe relation may change the distribution of join values and result in lower confidence in the histogram. The confidence can be changed based on the attribute correlation. If the selection attributes are highly correlated with the join attribute, then the histogram will most likely be very inaccurate. If there is low correlation, then the histogram is more usable and the uniform assumption can be applied. For example, the uniform assumption assumes that if a selection reduces the cardinality of the entire relation by 90%, then the cardinality of each histogram bucket is also reduced by 90%. If present, multi-dimensional histograms on both the selection and join attributes may be used to estimate the distribution after selection. SITs may be used as well. See Section 2.4 for more information about database statistics.

Selections on the build relation are less problematic. A selection on the build relation may affect the number of build tuples in a privileged partition range. For instance, if the algorithm determines that the range 100-199 is valuable, it expects 100 unique values in the build relation. However, a selection may cause the actual number of build tuples to be 50. This is another example of a lost opportunity cost because, given this knowledge, the algorithm may have been able to select more privileged partitions (since memory is available) or select different ones, because the value of the partition range may be lowered since not all of its build tuples participate in the join. Note that since we do not allocate a static amount of memory to privileged partitions, the extra memory for the 50 tuples is available for other partitions (most likely non-privileged hash partitions) to use.
The algorithm will still outperform dynamic hash join if the build tuples actually in the privileged partition range join with more probe tuples than the average build tuple.

Multiple Join Plans

When a query consists of multiple joins, Histojoin can be used with each join as long as a histogram is available or can be estimated for the probe relations. Histojoin can be used for star joins, where multiple relations are joined to one common relation; these are very common in data warehouses.

For example, consider a star join of the tables Part, Supplier, and LineItem as shown in Figure 3.5a. With histograms on LineItem.partkey and LineItem.suppkey and no selection operations, Histojoin will have high confidence histograms for both joins. The bottom join of LineItem and Supplier will use the histogram on LineItem.suppkey. The second join will use the histogram on LineItem.partkey, which will accurately reflect the distribution of LineItem.partkey in the intermediate join result LineItem-Supplier, as the intermediate result was produced using a primary-to-foreign key join. In general, star joins with no selections and histograms on all join attributes of the fact table are accurately estimated and result in large performance improvements for skewed data.

Figure 3.5: Example Multiple Join Plans. (a) Part, Supplier, LineItem; (b) Customer, Orders, LineItem.

In contrast, consider a join of the tables Customer, Orders, and LineItem as shown in Figure 3.5b. The join of LineItem and Orders can exploit the histogram on LineItem.orderkey. However, the top join has no base histogram that can model the distribution of custkey in the intermediate relation LineItem-Orders. It is possible to estimate the histogram from one on Orders.custkey, but it would be a low confidence histogram.
When selections are added with joins, the confidence of the histograms decreases, especially with selections on the probe relations.

3.3.3 Query Optimizer Modifications

There are minimal query optimizer modifications required to use Histojoin. Histojoin can be used as a drop-in replacement for hybrid or dynamic hash join, or its concepts can be used to modify an existing hash join implementation. If Histojoin is implemented separately from hybrid hash join, then when costing a potential join, Histojoin will return a high, not-applicable cost for joins where a histogram does not exist or cannot be estimated for the probe relation. As a drop-in replacement for hybrid hash join, the cost for Histojoin when it cannot make use of histograms would be exactly equal to that of hybrid hash join. The cost of Histojoin is given by the formula in Section 3.1.1. To estimate the term g in the formula from the histogram, we first calculate which histogram buckets will be in the privileged partitions. The histogram tells us how many probe tuples are related to each histogram bucket, so using this we can estimate the number of results we will get from the in-memory build tuples. In practice this can be done in one step, as the estimate of result tuples can be calculated while choosing privileged build tuple ranges.

Given the list of privileged partitions, g is estimated by summing up FREQ * ROWS_R, or alternatively ROWCOUNT + EQROWS (see Section 4.2), over the privileged buckets and then dividing by |S|. That is, g = (sum of ROWCOUNT + EQROWS) / |S|.

Using the example histogram given in Table 3.2, without separating out the max values, the ranges in sorted order are 751-1000 (FREQ=4.6), 101-200 (FREQ=3.1), 1-100 (FREQ=3.05), 201-350 (FREQ=1.67), 351-500 (FREQ=1.6), and 501-750 (FREQ=1). In the example, the build relation R has 1000 tuples, and the probe relation S has 2505 tuples. With 350 tuples of memory (f = 35%), the first two ranges, 751-1000 and 101-200, would be selected as privileged, and E = 250 * 4.6 + 100 * 3.1 = 1460. Then g = E/|S| = 1460/2505 = 58%.
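This estimate can be checked with a short calculation using the bucket counts from Table 3.2 (a sketch with illustrative names; g is carried at full precision here, so the Histojoin I/O total comes out slightly below the rounded figure quoted in the text):

```java
// Numerical check of the costing example using the buckets of Table 3.2.
public class CostingExample {
    // g = sum(ROWCOUNT + EQROWS) over privileged buckets / |S|
    static double estimateG(double matchingProbeTuples, double sSize) {
        return matchingProbeTuples / sSize;
    }

    // I/O formulas from Section 3.1.1.
    static double dhjIos(double f, double r, double s) {
        return 2 * (1 - f) * (r + s);
    }
    static double histojoinIos(double f, double g, double r, double s) {
        return 2 * (1 - f) * r + 2 * (1 - g) * s;
    }

    public static void main(String[] args) {
        // Privileged buckets 751-1000 and 101-200: ROWCOUNT + EQROWS each.
        double e = (650 + 500) + (300 + 10);    // E = 1460 expected matches
        double f = 350.0 / 1000;                // 350 of 1000 build tuples fit
        double g = estimateG(e, 2505);          // ~0.583, i.e. about 58%
        System.out.println(dhjIos(f, 1000, 2505));          // ~4556
        System.out.println(histojoinIos(f, g, 1000, 2505)); // ~3390 with full-precision g
    }
}
```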
Using the formulas in Section 3.1.1, we expect DHJ to perform 4556 I/Os during this join and Histojoin to perform 3405 I/Os (25% less).

Using Histojoin this way will allow a query optimizer to only use the algorithm when Histojoin indicates that it will have a performance benefit (by exploiting the skew it potentially sees). A major benefit is that no major changes to the optimizer are required. The only issue is that the DBMS must make the histograms available to the Histojoin operator when costing and initializing.

Histojoin's performance and applicability increase with the database system's support for statistics collection. For instance, Histojoin works best when provided with a list of the most frequent values and their frequencies. It is this list of values and their associated tuples that must remain memory-resident. Some statistics systems collect this data explicitly, either separate from the histogram (PostgreSQL's most common values) or as part of the histogram (end-biased histograms). Note that histogram bucket ranges are a less accurate approximation to the most frequent value list.

The challenge of using Histojoin on derived operators (selections, joins, etc.) can also be mitigated by better statistics collection. For example, SITs [2] and statistics collection on views allow the optimizer to have improved distribution estimates for relations of intermediate operators. Instead of base histograms and uniform assumptions, these approaches can provide Histojoin with more accurate data when deciding on privileged ranges/values. Any technique that improves the histogram accuracy will improve Histojoin's performance. In summary, Histojoin will always produce a correct result that is robust and optimal according to the distribution estimate given. The more accurately the histogram reflects the actual data distribution, the better actual performance Histojoin will have.

4. Experimental Results

This chapter presents two separate experimental evaluations for Histojoin.
The first evaluation is a stand-alone Java application performing the joins. The second evaluation is an implementation of the algorithm in PostgreSQL. The Histojoin algorithm was tested with the TPC-H data set. The TPC-H generator produced by Microsoft Research [3] was used to generate skewed TPC-H relations. Skewed TPC-H relations have their attribute values generated using a Zipf distribution, where the Zipf value (z) controls the degree of skew. The data sets tested were of scale 1 GB and 10 GB and labeled as skewed (z=1) and high skew (z=2).

4.1 Stand-Alone Evaluation

The dynamic version [8] of hybrid hash join [7] (DHJ) was compared to Histojoin. Both algorithms were implemented in Java and used the same data structures and hash algorithms. The only difference between the implementations is that Histojoin allocated its in-memory partitions using a histogram, while DHJ flushed partitions to free memory without regard to the data distribution. DHJ typically flushes partitions in a deterministic ordering, but this implementation flushes randomly so that a more accurate average case is found. For instance, a deterministic ordering may always flush the exact worst partition for a join first. A random ordering will flush the worst partition first with probability 1/P (where P is the number of partitions).

The data files were loaded into Microsoft SQL Server 2005, and histograms were generated. The histograms were exported to disk, and the data files converted to binary form. Data files were loaded from one hard drive, and a second hard drive was used for temporary files produced during partitioning. The experimental machine was an Athlon 64 3700+ (2.2GHz) with 1.5GB RAM running Windows XP Pro and Java 1.6. All results are the average of 10 runs. These results use TPC-H scale 1 GB and demonstrate the applicability of the algorithm in various scenarios.

4.1.1 Primary-to-Foreign Key Joins

The joins tested were LineItem-Part on partkey and LineItem-Supplier on suppkey for z=1 and z=2.
Memory fractions, f, were tested ranging from 10% to 100%.

Histojoin performs approximately 20% fewer I/O operations with the z=1 data set, which results in it being about 20% faster overall. This is a major improvement for a standard operation like hash join. An improvement occurs over all memory sizes until full memory is available for both joins.

Figure 4.1: Lineitem-Part Join (1GB, z=1); (a) Total I/Os, (b) Total Time

For the z=2 data set, the performance difference is even more dramatic. In the 10% memory case Histojoin performs 60% fewer I/Os, resulting in 60% faster execution. The results by total I/Os and by time are in Figures 4.2a and 4.2b respectively.

DHJ is slower because random partitioning causes the most important tuples to be distributed across all partitions. Regardless of what partitions are flushed (or conversely what partition(s) remain in memory), hash join is guaranteed not to keep all of the most beneficial tuples in memory. Even worse, for highly skewed data sets, it is very likely to evict the absolute best partition. For instance, with 10% memory and 10 partitions, hash join has only a 10% probability of keeping in memory the partition with the most frequently occurring key value. The performance of dynamic hash join is unpredictable for skewed relations and is highly dependent on the partition flushing policy. For highly skewed relations and low memory percentages, the likelihood of DHJ flushing the best values is very high.

For the z=1 data set and LineItem-Supplier, Histojoin performs about 10-20% fewer total I/Os and executes 10-20% faster. For the z=2 data set, Histojoin performs between 20-60% fewer total I/Os and executes 20-60% faster.
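The probability argument above can be checked with a tiny simulation. This is an illustrative sketch, not code from the thesis; it only assumes that the hottest key value hashes to one fixed partition and that random flushing evicts partitions uniformly at random.

```python
import random

def best_partition_survives(num_partitions, partitions_kept,
                            trials=100_000, seed=7):
    """Estimate the probability that the partition holding the most
    frequent key value is among those kept under random flushing."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        kept = rng.sample(range(num_partitions), partitions_kept)
        # The hottest key hashes to one fixed partition, say partition 0.
        if 0 in kept:
            hits += 1
    return hits / trials

p = best_partition_survives(10, 1)
# p is close to 1/10: with 10 partitions and memory for only one,
# random flushing keeps the hottest partition about 10% of the time.
```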
A summary of the percentage total I/O savings of Histojoin versus dynamic hash join for all joins is in Figure 4.3.

Experiments with uniform data show that the performance of Histojoin and hash join is identical, as there are no tuples that occur more frequently than any other, and the performance is independent of the tuples buffered in memory. For totally uniform data, Histojoin selects no privileged partitions. For mostly uniform data, such as generated by the standard TPC-H generator, there are still some join attribute ranges that are marginally better and are used by Histojoin to improve performance slightly.

Figure 4.2: Lineitem-Part Join (1GB, z=2); (a) Total I/Os, (b) Total Time

Figure 4.3: Percentage Improvement in Total I/Os of Histojoin vs. Hash Join (1GB)

4.1.2 Many-to-Many Joins

The many-to-many join tested combined a randomized version of the Wisconsin relation [9] with a randomized and Zipfian skewed (z=1) version of the Wisconsin relation on the tenPercent attribute. Both relations contained 1,000,000 tuples. The tenPercent attribute has a domain that is 10% the size of the relation. For 1,000,000 tuple relations, the domain of tenPercent is 100,000. Memory fractions, f, were tested ranging from 10% to 100%.

For this test, the build relation (randomized Wisconsin) contains 1,000,000 tuples, and the tenPercent attribute contains values in the range 0 to 99,999, each value being shared by 10 tuples. The probe relation also has a domain of 0 to 99,999, with a Zipfian distribution of the values.
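For reference, a Zipf-skewed probe column of this general shape can be approximated in a few lines of Python. This is an idealized sketch, not the Microsoft Research generator used in the experiments; the domain size and tuple count here are illustrative.

```python
import random
from collections import Counter

def zipf_column(n_tuples, domain_size, z, seed=1):
    """Draw n_tuples values from 0..domain_size-1 with Zipf(z) skew:
    value k is drawn with probability proportional to 1 / (k + 1)**z."""
    weights = [1.0 / (k + 1) ** z for k in range(domain_size)]
    rng = random.Random(seed)
    return rng.choices(range(domain_size), weights=weights, k=n_tuples)

col = zipf_column(100_000, 1_000, z=1)
top_two = Counter(col).most_common(2)
# Under z=1 the single most frequent value already accounts for a large
# share of all tuples, while most of the domain is rare.
```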
In the generated Zipfian relation, the top 2 values occur in 127,380 of the 1,000,000 tuples (12.7%) and 31,266 of the 1,000,000 tuples (3.7%) respectively. Beyond the top 200 values, the average value occurs in approximately 5.7 of the 1,000,000 tuples. A histogram on the probe relation attribute is misleading because it shows an integer domain of 100,000 tuples, which underestimates the size of each privileged relation partition by a factor of 10.

This underestimation causes Histojoin to allocate too much memory for privileged partitions because it thinks the partitions contain far fewer tuples than they really do. However, these privileged partitions are dynamically flushed as required with no harm to the performance. The Wisconsin results by total I/Os (including the cost of reading each relation) for this join are in Figure 4.4. For all memory sizes, Histojoin performs approximately 10% fewer I/O operations than DHJ.

Figure 4.4: Total I/Os for Wisconsin Many-to-Many Join (1GB, z=1)

4.1.3 Histogram Inaccuracies

To demonstrate the effect of histogram inaccuracies on join performance, a modified TPC-H LineItem relation was created to show the worst case scenario for Histojoin and how the use of histogram confidence mitigates this scenario. The new LineItem relation contains only every 10000th partkey (1, 10000, 20000, ..., 200000), and each of these values occurs as often as the others. A histogram was created that indicates to Histojoin that these 10000th values never occur in LineItem, so that in all cases except the 100% memory case Histojoin will not store any of the corresponding build tuples from the Part relation in memory but will instead fill memory with build tuples whose partkey values never occur within LineItem.

Histojoin was compared to DHJ using the join LineItem-Part on partkey. Memory fractions, f, were tested ranging from 10% to 100%.
Histojoin executed the join under two confidence levels. At the high confidence level, it assumed the histogram was very accurate and fully allocated privileged partitions to memory. In Figure 4.5, this corresponds to the HJ Bad Histogram plot. Histojoin does considerably worse than DHJ by trusting a totally wrong histogram. In comparison, when executed at a low confidence level, Histojoin only selects premium values from the histogram (shown in the diagram as HJ Bad Premium Values). Since the histogram is completely inaccurate, the premium values give no performance improvement, but they also incur little cost compared to DHJ due to the minimal amount of memory occupied. If the histogram is only 10% correct (10% of the premium values are good), Histojoin in low confidence mode outperforms DHJ. In summary, executing Histojoin in low confidence mode has little risk and considerable reward if the histogram is even marginally accurate.

Figure 4.5: Total I/Os for Lineitem-Part Join with Histogram Inaccuracies (1GB)

4.1.4 Joins on String Keys

With string keys Histojoin is less accurate in predicting the size of build partition ranges for privileged partitions. To test joining on string keys, versions of the LineItem and Supplier relations were generated with the suppkey replaced by randomly generated strings. Once again memory fractions, f, were tested ranging from 10% to 100%. Much like the join of LineItem-Part on partkey using integer keys, Histojoin performs around 20% fewer total I/Os than DHJ for the z=1 dataset. The results are in Figure 4.6.

4.1.5 Multi-Way Joins

As described in Section 3.3.2, Histojoin can be used on multi-way star joins when a histogram exists on the join attributes of the fact relation.
A star join of the tables Part, Supplier, and LineItem as shown in Figure 3.5a falls into this category.

If the memory for the entire query is split evenly between the two joins, then for a memory percentage above 10% the first join of LineItem-Supplier would be done completely in memory, as Supplier is quite small in comparison to Part. For this reason Histojoin was compared to DHJ using memory fractions ranging from 3% to 10%. The total I/Os for the entire join using a z=1 dataset are shown in Figure 4.7a. For all memory sizes Histojoin performs about 20% fewer I/Os than DHJ. For memory sizes above 10%, Histojoin is faster than DHJ but only one join requires disk I/Os. Results for the z=2 dataset are shown in Figure 4.7b. Due to the high skew, Histojoin dramatically improves on the performance of DHJ.

Figure 4.6: Total I/Os for Lineitem-Supplier Join on String key (1GB, z=1)

Figure 4.7: Total I/Os for Lineitem-Supplier-Part Join (1GB); (a) z=1, (b) z=2

4.2 PostgreSQL Implementation

Histojoin was implemented in PostgreSQL 8.4 to test its performance for large-scale data sets in a production system. PostgreSQL implements hybrid hash join (HHJ). Its HHJ implementation requires that the number of partitions be a power of two, and it always keeps the first partition in memory. Thus, experimental data was only collected for memory fractions: 3.1% (1/32), 6.2% (1/16), 12.5% (1/8), 25% (1/4), 50% (1/2), and 100%.

PostgreSQL collects statistics on its tables. Statistics on an attribute of a table include the most common values (MCVs) and an equi-depth histogram. The user is able to control the number of MCVs on a per table basis.
The user can also initiate statistics re-calculations. The query optimizer has access to the histograms and a list of most common values (MCVs) that are automatically generated for foreign key attributes.

Histojoin was added to the PostgreSQL HHJ implementation. Using environment flags that PostgreSQL uses to control which joins are available, Histojoin can be turned on and off from the standard HHJ implementation. Thus, the existing HHJ implementation was altered instead of having two hash join algorithms for the optimizer to choose between.

Histojoin requires the ability to use the existing statistics, which were available in the planner. The code uses the join attributes of the probe relation to find statistics for that attribute. If no statistics were available, Histojoin would not be used. If the optimizer determines that the build relation will completely fit in memory, then Histojoin is not used, as it would have no positive effect and would add unnecessary overhead. If statistics are available, Histojoin only uses the MCVs (not the histogram), as the MCVs are more precise. However, this means that the privileged partitions do not occupy very much of the memory available for build relation tuples. The MCVs were determined and allocated into an in-memory hash table when the join operator was initialized. The default number of MCVs is 10, which produces a small hash table (less than 1KB); however, the database administrator can increase this number to as many as 10,000 MCVs. The hash table size is at least 4 times the number of MCVs (load factor is less than or equal to 25%) to make collisions unlikely.

During the partitioning of the build relation, a build tuple's join attributes are hashed according to the small MCV hash table to determine if its value is one of the MCVs. If it is, then the tuple is put into the hash table; otherwise it is processed using the regular hash function as usual.
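A rough sketch of this MCV routing, covering both the build side just described and the probe side described next, in illustrative Python (not the actual PostgreSQL C code; tuples and partitions are simplified):

```python
NUM_PARTITIONS = 8  # illustrative; PostgreSQL uses a power of two

def partition_build(build_tuples, mcvs):
    """Route build tuples: MCV keys go to the in-memory privileged table,
    all other keys go to the regular hash partitions."""
    mcv_table = {}  # small table holding tuples for the most common values
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, row in build_tuples:
        if key in mcvs:
            mcv_table.setdefault(key, []).append(row)
        else:
            partitions[hash(key) % NUM_PARTITIONS].append((key, row))
    return mcv_table, partitions

def probe(probe_tuples, mcv_table):
    """Probe tuples matching an MCV are joined immediately; the rest
    continue through the regular hash join (returned as deferred)."""
    results, deferred = [], []
    for key, row in probe_tuples:
        if key in mcv_table:
            results.extend((b_row, row) for b_row in mcv_table[key])
        else:
            deferred.append((key, row))
    return results, deferred
```

Probe tuples whose key matches an MCV produce join output immediately and never touch a flushed partition, which is where the I/O savings come from.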
While partitioning the probe relation, a probe tuple's join attribute is hashed and a lookup performed in the MCV table. If there is a match, the tuple is joined immediately; otherwise it proceeds through the hash join as normal. In effect, the MCV lookup table is a small mini-hash table for the most frequent values.

The equi-depth histograms are not used, as it is preferable to increase the number of MCVs rather than allocate ranges from the histogram. The experiments all use the default of 10 MCVs unless otherwise specified. Results are improved when MCVs are set to 100 or more.

The experimental machine for the PostgreSQL implementation was an Intel Core 2 Quad Q6600 (2.4GHz) with 8GB RAM running 64-bit Debian Linux with a 2.6.25 kernel. These results use TPC-H scale 10 GB. Note that even for machines with large main memories, a join operator is allocated only a small fraction of the memory available, as it must compete with other operators and other queries for system resources. The default join memory size for PostgreSQL is 1 MB. The experiments alter that memory size to produce the desired memory fraction based on the build table size.

There are a couple of differences from the Java experiments that should be noted. First, the execution times more accurately reflect the number of I/Os performed. This is due to increased stability and performance of PostgreSQL on Linux versus a Java implementation on Windows.
The Java I/O counts are exact, but the execution times are more variable. There is less I/O performance improvement for the PostgreSQL implementation compared to the Java implementation because the PostgreSQL implementation has very small privileged partitions (just the MCVs), whereas the Java implementation uses all available memory for privileged partitions by filling them with valuable histogram bucket ranges.

4.2.1 Primary-to-Foreign Key Joins

The LineItem-Part results by total I/Os (including the cost of reading each relation) and by time for the z=1 data set are in Figures 4.8a and 4.8b respectively. Histojoin is around 10% faster and performs 10% fewer I/Os than HHJ. With the z=2 dataset, Histojoin performs approximately 50% faster (Figures 4.9a and 4.9b). The percentage improvement of Histojoin is shown in Figure 4.10. Note that the sudden improvement of HHJ for the z=2 50% memory case occurs because HHJ manages, by chance, to get the best tuples from the build relation in its in-memory partition.

Figure 4.8: PostgreSQL Lineitem-Part Join (10GB, z=1); (a) Total I/Os, (b) Total Time

4.2.2 Multi-Way Joins

When performing a star join of the tables Part, Supplier, and LineItem, any memory size above 10% of the size of the Part table will run the smaller join of LineItem and Supplier completely in memory. This multi-way join was tested with memory fractions (sizes) of 0.78% (2770KB), 1.56% (5440KB), and 3.12% (10880KB). Figure 4.11a shows that for the z=1 dataset Histojoin performs around 6% fewer I/Os than HHJ, and for the z=2 dataset Histojoin performs around 40% fewer I/Os than HHJ.

4.2.3 Effect of Number of MCVs

By increasing the number of MCVs from the default 10, the performance of Histojoin increases, as Histojoin is able to capture more of the most valuable tuples.
The join of LineItem and Part was performed with a memory size of 6.2% and various numbers of MCVs. The results by total I/Os and by time are in Figures 4.12a and 4.12b respectively. The query was run with 10, 100, 300, 500, 700, and 1000 MCVs on partkey. Histojoin's performance with the z=1 dataset can be increased by adding more MCV statistics, as this dataset has many relatively good MCVs. As more MCVs are added, the benefit per new MCV decreases.

Figure 4.9: PostgreSQL Lineitem-Part Join (10GB, z=2); (a) Total I/Os, (b) Total Time

Figure 4.10: PostgreSQL Percentage Improvement in Total I/Os of Histojoin vs. Hash Join (10GB)

Figure 4.11: Total I/Os for PostgreSQL Lineitem-Supplier-Part Join (10GB); (a) z=1, (b) z=2

4.3 Results Summary

For skewed data sets, Histojoin dramatically outperforms traditional hash joins by 10% to 60%. This is significant because hash join is a very common operator used for processing the largest queries. As the amount of skew increases, the relative performance improvement of Histojoin increases.

Histojoin introduces no performance penalty compared to hash join for uniform data sets or data sets where the skew is undetected due to selection conditions or stale histograms. Histojoin's performance improvement depends on the amount of skew detected (as given by the formula in Section 3.1).
Histojoin has better performance with a more accurate estimate of the distribution of the probe relation. When the confidence in the histogram approximation of the distribution is low, Histojoin allocates fewer privileged partitions, which must be significantly better than the average. Thus, Histojoin will exploit whatever skew is detectable and fall back to dynamic hash join behavior otherwise. Even with low accuracy histograms, Histojoin will improve join performance over hash join for skewed data sets.

The implementation of Histojoin in PostgreSQL uses only premium values determined from pre-generated MCV lists to determine its privileged partitions. Histojoin is minimally affected by bad estimates, as the MCV lists are small and represent only a minimal memory overhead. In the experiments this implementation shows a large improvement over the standard hybrid hash join operator used for all large unsorted joins in PostgreSQL, while adding no noticeable overhead when skew cannot be exploited. Histojoin is especially valuable for smaller memory fractions, as its relative benefit over HHJ is higher.

Figure 4.12: PostgreSQL Lineitem-Part Join With Various Amounts of MCVs (10GB, z=1); (a) Total I/Os, (b) Total Time

5. Discussion and Conclusion

This thesis began by describing the general function of a database system (Section 2.1) and how join algorithms fit into that system (Section 2.2).
Hash join algorithms are used in many of the large, important, and costly queries that a database performs, especially in the large database systems used by governments and commercial organizations.

The original in-memory hash join algorithm (Section 2.2.1) increased the speed of joining large unordered data sets significantly over previous methods by reducing the number of tuple comparisons that needed to be made, but it was limited to data sets that could fit in memory. The Grace Hash Join (Section 2.2.3) handled very large relations by partitioning them into smaller memory-sized chunks that could be joined one at a time, but it required writing these partitions to disk and then re-reading them during a cleanup phase, which incurred extra I/O operations per tuple. Grace Hash Join also failed to make use of all available memory during the initial partitioning phase.

Hybrid Hash Join (Section 2.2.4) improved on the Grace Hash Join by keeping one of the partitions in memory after partitioning the build relation and producing results while partitioning the probe relation. Recognizing that partition sizes could vary due to partition skew in the build relation and that memory conditions could change, the Dynamic Hash Join algorithm was created (Section 2.2.5). Dynamic Hash Join is similar to Hybrid Hash Join except that it initially keeps all partitions in memory and then dynamically flushes them to disk whenever memory must be freed. Since many partitions can remain in memory, the targeted size of a partition does not necessarily have to be equivalent to the size of memory, and Dynamic Hash Join can adapt its memory usage much more easily if the partitions are targeted at being smaller than the amount of available memory.

In a hash join, the partitions that remain memory-resident after the build relation is partitioned are chosen before any probe relation tuples have been read. Consequently there is no guarantee that the memory-resident build tuples will join with many of the probe tuples.
Due to skew (Section 2.3), the memory-resident build tuples may join with far more probe tuples than the frozen partitions, or far fewer. Statistics such as histograms already exist in most commercial database systems, and when present they can give a good indication of the number of probe tuples that will join with each build tuple (Section 2.4).

The effect of skew on hash joins is well known, and modern hash join algorithms try to minimize the negative effect that skew has on their performance. Previously, skew has not been exploited by a hash join to improve algorithm performance. The theoretical benefits of exploiting skew (Section 3.1.1) are significant; however, the practical benefits rely on the statistics accurately representing skew and on the algorithm's ability to compensate when the statistics are incorrect.

The generic Histojoin algorithm described in Section 3.2 does not rely on any particular database system or type of statistics, although the maxdiff histograms used by Microsoft SQL Server are used in the examples. The ability of a Histojoin implementation to exploit skew is directly proportional to how good the statistics are. Two actual implementations of Histojoin have been created: one as a stand-alone Java algorithm and the other as an addition to the Hybrid Hash Join algorithm in the PostgreSQL open source database system. The statistics available to the Java implementation are more comprehensive than those available in PostgreSQL, and consequently that implementation is better able to detect and exploit skew in the underlying relations being joined. If Histojoin were implemented in a system that used SITs (Statistics on Intermediate Tables) [2] or other advanced statistics, the number of instances in which skew could be exploited would increase.
It is a distinct benefit of Histojoin that it can adapt to the database system it is implemented in.

Empirical experimental results (Section 4) have shown that Histojoin can have significant practical benefits over conventional hash join algorithms. Demonstrating an algorithm in only the ideal cases, however, does not prove that it is generally beneficial. Histojoin is able to adapt to cases where exploiting skew is not possible or necessary, such as when performing a one-to-one join or when the entire join can be performed in memory, without noticeable runtime overhead. Histojoin has also been designed to handle cases where statistics are inaccurate and attempting to exploit skew would reduce performance (although this ability is implementation specific (Sections 3.3.2 and 4.1.3)).

Through discussion and collaboration with core PostgreSQL developers, the PostgreSQL implementation of Histojoin has been modified to handle many potentially negative situations gracefully while still providing a significant benefit when skew can be exploited.

This research is important as it makes improvements to a core database algorithm that is used in all major database systems by corporations and individuals alike. The algorithm is generic in that it is not limited to one database system and does not require significant changes to the underlying system to gain its benefits.

Since the benefits of Histojoin depend heavily on the statistics available, it would be useful to examine more types of statistics used in current database systems (especially SITs) and how they can be exploited by Histojoin. As defined, Histojoin is a join algorithm for centralized database systems, but the concept should adapt to distributed and grid databases.

This thesis expands on the presentation in [4, 5]. The PostgreSQL implementation is currently being evaluated for inclusion in the main production branch of PostgreSQL, where it will have an impact on many real world databases and millions of users.
In conclusion, by exploiting statistics already existing in database systems, hash join algorithms can gain significantly increased performance.

Bibliography

[1] TPC-H Benchmark. Technical report, Transaction Processing Performance Council. Available at:
[2] N. Bruno and S. Chaudhuri. Exploiting Statistics on Query Expressions for Optimization. In ACM SIGMOD, pages 263–274, 2002.
[3] S. Chaudhuri and V. Narasayya. TPC-D data generation with skew. Technical report, Microsoft Research. Available at:, 1999.
[4] B. Cutt and R. Lawrence. Using Intrinsic Data Skew to Improve Hash Join Performance. Information Systems, To appear, 2008.
[5] B. Cutt and R. Lawrence. Histojoin: A Hash Join that Exploits Skew. In IEEE Canadian Conference of Electrical and Computer Engineering, May 2008.
[6] C. Date. The SQL Standard. Addison Wesley, Reading, US, third edition, 1994.
[7] D. DeWitt, R. Katz, F. Olken, L. Shapiro, M. Stonebraker, and D. Wood. Implementation Techniques for Main Memory Database Systems. In ACM SIGMOD, pages 1–8, 1984.
[8] D. DeWitt and J. Naughton. Dynamic Memory Hybrid Hash Join. Technical report, University of Wisconsin, 1995.
[9] D. J. DeWitt. The Wisconsin Benchmark: Past, Present, and Future. 1993.
[10] D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In VLDB, pages 27–40, 1992.
[11] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Pearson Education, Inc., 2002.
[12] G. Graefe. Five Performance Enhancements for Hybrid Hash Join. Technical Report CU-CS-606-92, University of Colorado at Boulder, 1992.
[13] Y. E. Ioannidis. The history of histograms (abridged). In VLDB, pages 19–30, 2003.
[14] Y. E. Ioannidis and V. Poosala. Balancing histogram optimality and practicality for query result size estimation. In ACM SIGMOD, pages 233–244. ACM, 1995.
[15] M. Kitsuregawa, M. Nakayama, and M. Takagi. The Effect of Bucket Size Tuning in the Dynamic Hybrid GRACE Hash Join Method.
In VLDB, pages 257–266, 1989.
[16] M. Kitsuregawa, H. Tanaka, and T. Moto-oka. Application of hash to database machine and its architecture. New Generation Computing, 1(1), 1983.
[17] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Series in computer science and information processing. Addison-Wesley, 1973.
[18] R. Lawrence. Early Hash Join: A Configurable Algorithm for the Efficient and Early Production of Join Results. In VLDB 2005, pages 841–842, 2005.
[19] W. Li, D. Gao, and R. T. Snodgrass. Skew handling techniques in sort-merge join. In SIGMOD, pages 169–180, 2002.
[20] M. Muralikrishna and D. J. DeWitt. Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries. In H. Boral and P.-Å. Larson, editors, ACM SIGMOD, pages 28–36. ACM Press, 1988.
[21] M. Nakayama, M. Kitsuregawa, and M. Takagi. Hash-partitioned join method using dynamic destaging strategy. In VLDB, pages 468–478, 1988.
[22] V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In SIGMOD '96: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 294–305, New York, NY, USA, 1996. ACM.
[23] C. B. Walton, A. G. Dale, and R. M. Jenevein. A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins. In VLDB, pages 537–548, 1991.

