09 - Parallel Hash Join Algorithms
202406232045
Status: #idea
Tags: CMU Advanced Database Systems
09 - Parallel Hash Join Algorithms
Dynamic Scaling vs Work Stealing
- Perform a a join between 2 relations on multiple threads simultaneously
- 2 main approaches
- Hash Join → Generally preferred.
- Sort-Merge Join
- OLAP DBMS almost never wants to use nested-loop join
- Hash join is one of the most important operators in OLAP
- But it is still not the dominant cost
Hashing vs Sort-Merge

Join Algorithm Design Goals
Goals mater whether the DBMS is using a hardware-conscious vs hardware-oblivious algorithm for joins
- Minimize synchronization
- Avoid taking latches during execution
- Minimize memory access cost
- Ensure data is always local to worker thread
- Ideally in L1, L2. Worst case in L3 (common to all cores)
- Should not be in a different NUMA region
- Increase cache utilization by reusing data when its in cache
- Factors that affect cache misses
- Cache + TLB capacity
- Using a lot of lines can lead to line not being in TLB
- Locality (temporal and spatial)
- Cache + TLB capacity
- Non-random access (Scan)
- Clustering data to a cache line
- Execute maximum number of operations per cache line
- Random access (Lookups)
- Partition data to fit in cache + TLB during the build/probe side of the join
- Involves more work, but the number of cycles is reduced
- Ensure data is always local to worker thread
Hash Join
- Partition (optional)
- Divide the tuples of
and into disjoint subsets based on a hast of the join key
- Divide the tuples of
- Build
- Scan relation
and create a hash table on join key
- Scan relation
- Probe
- For each tuple in
, lookup its join key in hash table for . If match is found, output combined tuple
- For each tuple in
Partition Phase
Build Phase
For each tuple, hash the join key for that tuple and add it to the appropriate bucket in the hash table
- The bucket should be only a few cache lines in size
Hash Tables
Design decision 1: Hash Function
- How to map a a large key space into a smaller domain
- Trade-off: speed vs. collision rate
Design decision 2: Hashing Scheme
- How to handle key collisions after hashing
- Trade-off: Large hash table vs additional instructions to find/insert keys
We do not want to use a crpytographic hash function for our join algorithm
Hashes we like: CRC32 (good for integers), XXHash, FarmHash (last one unsure)?
Hashing Schemes
Hash Table Contents
- Tuple Data vs Pointers/Offsets to Data
- Whether to store the tuple directly inside of the HT
- Storing tuples inside table not possible in open-addressing if there is variable length data
- Join Keys vs. Hashes
- Whether to store only the original join keys in the HT or also include computed hashed key
- Classic compute vs storage trade-off
Probe Phase
For each tuple in
- hash its join key
- check for a match in the bucket in
- If inputs are partitioned, then assign each thread a unique partition
- Otherwise, synchronize their access to
using a latch
Optimization: Bloom filter
- Create a bloom filter during the build phase
- Threads check the filter before probing the hash table
- Faster since the filter will fit in CPU cache
- AKA sideways information passing
Benchmarks
Join operator only takes
SAHA from ClickHouse is an extremely performant HT for strings
Partition-based joins outperform no-partition algorithms in most settings, but it is non-trvial to tune it
Most DBMS's pick one hash join and does not try to be adaptive