Robust External Hash Aggregation in the Solid State Age
202407261504
Status: #idea
Tags: e6data
Robust External Hash Aggregation in the Solid State Age
- DuckDB does not allocate a fixed-size buffer pool
- Eviction
- Unpinned pages are added to the eviction queue
- Lock-free concurrent priority queue with LRU policy
- Buffers are evicted when newly requested memory would cause the memory limit to be exceeded
- If newly allocated page is same size as evicted one, buffer is reused
- Unpinned pages are added to the eviction queue
- Temporary data
- 3 allocation types
- Non-paged allocations
- Goes through buffer manager, to decide if other pages need to be evicted
- Paged fixed-size
- Swapped in an out of a temp file in storage
- Paged variable-size
- Used only if needed
- Hash table buckets
- Strings that won’t fit in fixed size pages
- Used only if needed
- Non-paged allocations
- Operator implementations try to destroy temporary pages as soon as they are no longer needed
- 3 allocation types
- Page layout
- Row-major is optimal for intermediates, even in column-major systems for join and aggregation HT’s and also for sorting
- Uses fixed size rows
- Variable-size data
- Only use explicit addressing (i.e. pointers) and not offsets
- Stored on different pages from fixed-size data
- (De)-Serialisation
- No serialisation (Similar to Arrow) on write. Minimal serialisation before read
- Variable-size row
- Uses German-style strings
- TODO: Doubt
- Pointer recomputation
- Recomputed based on previous base pointer of page
- Done in place and lazily
- Only after skew is detected
- Metadata about page base pointer is stored as in-memory metadata
- Can be implemented in any system that uses explicit paged IO
- Data distributions
- Data is partitioned after reducing
- Approach is more robust towards skewed distributions
- Resizing hash table could prevent this
- Should benchmark both cases. Resizing and not resizing
- Data is partitioned after reducing
- External aggregation algorithm
- Phase 1: Thread-Local Pre-Aggregation
- Chunks are fed to threads which pre-aggregate into small, fixed size hash tables
- Phase 2: Partition-Wise Aggregation
- Over-partitioning, to keep memory pressure low
- When a partition is fully aggregated, the results are immediately pushed to the next operator, freeing up used pages
- Hash table design
- Consists of 2 parts
1. Array of 64-bit entries
- Lower 48 bits contains pointer
- Salt → Upper 16 bits
- Contain the upper 16 bits of hash of corresponding tuple (not just of the join key)
- Helps with collision management in linear probing
3. Temporary pages consisting of groups and aggregates and corresponding pages for variable length values - Partitioning
- Directly materialised tuples into partitions
- Data is simultaneously converted into row-major during partitioning
- Partitions are determined by radix
- Radix → Middle bits of the hash that are not used for the salt or for the offset into the hash table (lower bits)
- Used bits should not overlap
- Radix → Middle bits of the hash that are not used for the salt or for the offset into the hash table (lower bits)
- RAM-Oblivious
- HT is reset once it is two-thirds full
- We can vary this threshold
- Resetting is inexpensive
- Only the array of 64-bit entries are reset
- Pages that store the tuples can be unpinned
- HT is reset once it is two-thirds full
- Phase 1: Thread-Local Pre-Aggregation
- Experimentation
- Set the tenancy of the AWS instance to dedicated (but did not use the rest of the node’s capacity)
- Used available Instance Storage, which is physically attached to the host
- Evaluation
- Umbra doesn’t have external aggregation yet? Seems strange for a disk-based database 🤔
- Clickhouse’s external aggregation scales well, but it still runs out of memory on the largest grouping they tested
- Future work
- Problem → Same group may be materialised multiple times during phase 1
- Can be mitigated by adaptively exchanging and aggregating partitions early during phase 1, if the memory limit would otherwise be exceeded
- Problem → Same group may be materialised multiple times during phase 1