09 - Parallel Hash Join Algorithms

202406232045
Status: #idea
Tags: CMU Advanced Database Systems

09 - Parallel Hash Join Algorithms

Dynamic Scaling vs Work Stealing

Perform a a join between 2 relations on multiple threads simultaneously
2 main approaches
1. Hash Join → Generally preferred.
2. Sort-Merge Join
OLAP DBMS almost never wants to use nested-loop join
Hash join is one of the most important operators in OLAP
- But it is still not the dominant cost

Hashing vs Sort-Merge

Pasted image 20240623210049.png

Join Algorithm Design Goals

Goals mater whether the DBMS is using a hardware-conscious vs hardware-oblivious algorithm for joins

Minimize synchronization
- Avoid taking latches during execution
Minimize memory access cost
- Ensure data is always local to worker thread
  - Ideally in L1, L2. Worst case in L3 (common to all cores)
  - Should not be in a different NUMA region
- Increase cache utilization by reusing data when its in cache
- Factors that affect cache misses
  - Cache + TLB capacity
    - Using a lot of lines can lead to line not being in TLB
  - Locality (temporal and spatial)
- Non-random access (Scan)
  - Clustering data to a cache line
  - Execute maximum number of operations per cache line
- Random access (Lookups)
  - Partition data to fit in cache + TLB during the build/probe side of the join
  - Involves more work, but the number of cycles is reduced

Hash Join

Partition (optional)
- Divide the tuples of $R$ and $S$ into disjoint subsets based on a hast of the join key
Build
- Scan relation $R$ and create a hash table on join key
Probe
- For each tuple in $S$ , lookup its join key in hash table for $R$ . If match is found, output combined tuple

Partition Phase

Build Phase

For each tuple, hash the join key for that tuple and add it to the appropriate bucket in the hash table

The bucket should be only a few cache lines in size

Hash Tables

Design decision 1: Hash Function

How to map a a large key space into a smaller domain
Trade-off: speed vs. collision rate

Design decision 2: Hashing Scheme

How to handle key collisions after hashing
Trade-off: Large hash table vs additional instructions to find/insert keys

Note

We do not want to use a crpytographic hash function for our join algorithm

Attention

Hashes we like: CRC32 (good for integers), XXHash, FarmHash (last one unsure)?

Hashing Schemes

Hash Table Contents

Tuple Data vs Pointers/Offsets to Data
- Whether to store the tuple directly inside of the HT
- Storing tuples inside table not possible in open-addressing if there is variable length data
Join Keys vs. Hashes
- Whether to store only the original join keys in the HT or also include computed hashed key
- Classic compute vs storage trade-off

Probe Phase

For each tuple in $S$ ,

hash its join key
check for a match in the bucket in $H T_{R}$

If inputs are partitioned, then assign each thread a unique partition
Otherwise, synchronize their access to $S$ using a latch

Optimization: Bloom filter

Create a bloom filter during the build phase
Threads check the filter before probing the hash table
- Faster since the filter will fit in CPU cache
AKA sideways information passing

Benchmarks

Info

Join operator only takes $\sim 10 %$ of the total query time, making it not the most important

Note

SAHA from ClickHouse is an extremely performant HT for strings

Attention

Partition-based joins outperform no-partition algorithms in most settings, but it is non-trvial to tune it
Most DBMS's pick one hash join and does not try to be adaptive

References

Slides