Partition Phase

202406232233
Status: #idea
Tags: CMU Advanced Database Systems

Partition Phase

Split the input relations into partitioned buffers by hashing the tuples join keys

Divide the inner/outer relations and redistribute among workers (CPU cores)
Ideally, the cost of partitioning is less than the cost of cache misses during the build phase
Explicitly partitioning the input relations before a join, is sometimes called a Grace Hash Join

Approach 1: Non-Blocking Partitioning

Only scan input relation once
Have the threads access the data at the same time, and populate a single hash table
Have to use latches to synchronize

Approach 1.1: Shared Partitions

Single global set of partitions that all threads update
Must use a latch to synchronize threads

Approach 1.2: Private Partitions

Each thread has its own set of partitions
Must consolidate them after all threads finish
Each core has $n$ mini-partitions, which are then consolidated into global $n$ partitions at the end
Consolidation can have each core fetch data from outside its own NUMA region, which is bad. But is it better than locks? (Depends on the hardware)

Approach 2: Blocking Approach (Radix)

Scan the input relation multiple times
Only materialize results all at once
Requires more preprocessing
AKA Radix Hash Join
Two-pass algorithm
1. Scan $R$ and compute a histogram of the number of tuples per hash key for the radix at some offset
2. Compute prefix sum of this histogram, to determine per-thread offsets
3. Scan $R$ again and partition them according to the hash key

Optimizations

Software write combine buffers
- Each worker maintains a local output buffer to stage writes
- When buffer is full, write changes to global partition
- Essentially, buffer writes to the global partition
Non-temporal streaming writes
- Special instructions in the CPU which allows you to write to bypass the CPU cache and write to memory

References