Partition Phase
202406232233
Status: #idea
Tags: CMU Advanced Database Systems
Partition Phase
Split the input relations into partitioned buffers by hashing the tuples join keys
- Divide the inner/outer relations and redistribute among workers (CPU cores)
- Ideally, the cost of partitioning is less than the cost of cache misses during the build phase
- Explicitly partitioning the input relations before a join, is sometimes called a Grace Hash Join
Approach 1: Non-Blocking Partitioning
- Only scan input relation once
- Have the threads access the data at the same time, and populate a single hash table
- Have to use latches to synchronize
Approach 1.1: Shared Partitions
- Single global set of partitions that all threads update
- Must use a latch to synchronize threads
Approach 1.2: Private Partitions
- Each thread has its own set of partitions
- Must consolidate them after all threads finish
- Each core has
mini-partitions, which are then consolidated into global partitions at the end - Consolidation can have each core fetch data from outside its own NUMA region, which is bad. But is it better than locks? (Depends on the hardware)
Approach 2: Blocking Approach (Radix)
- Scan the input relation multiple times
- Only materialize results all at once
- Requires more preprocessing
- AKA Radix Hash Join
- Two-pass algorithm
- Scan
and compute a histogram of the number of tuples per hash key for the radix at some offset - Compute prefix sum of this histogram, to determine per-thread offsets
- Scan
again and partition them according to the hash key
- Scan
Optimizations
- Software write combine buffers
- Each worker maintains a local output buffer to stage writes
- When buffer is full, write changes to global partition
- Essentially, buffer writes to the global partition
- Non-temporal streaming writes
- Special instructions in the CPU which allows you to write to bypass the CPU cache and write to memory