Robust External Hash Aggregation in the Solid State Age

202407261504
Status: #idea
Tags: e6data

Robust External Hash Aggregation in the Solid State Age

DuckDB does not allocate a fixed-size buffer pool
Eviction
- Unpinned pages are added to the eviction queue
  - Lock-free concurrent priority queue with LRU policy
- Buffers are evicted when newly requested memory would cause the memory limit to be exceeded
- If newly allocated page is same size as evicted one, buffer is reused
Temporary data
- 3 allocation types
  1. Non-paged allocations
    - Goes through buffer manager, to decide if other pages need to be evicted
  2. Paged fixed-size
    - Swapped in an out of a temp file in storage
  3. Paged variable-size
    - Used only if needed
      - Hash table buckets
      - Strings that won’t fit in fixed size pages
- Operator implementations try to destroy temporary pages as soon as they are no longer needed
Page layout
- Row-major is optimal for intermediates, even in column-major systems for join and aggregation HT’s and also for sorting
- Uses fixed size rows
- Variable-size data
  - Only use explicit addressing (i.e. pointers) and not offsets
  - Stored on different pages from fixed-size data
(De)-Serialisation
- No serialisation (Similar to Arrow) on write. Minimal serialisation before read
Variable-size row
- Uses German-style strings
- TODO: Doubt
Pointer recomputation
- Recomputed based on previous base pointer of page
- Done in place and lazily
  - Only after skew is detected
- Metadata about page base pointer is stored as in-memory metadata
- Can be implemented in any system that uses explicit paged IO
Data distributions
- Data is partitioned after reducing
  - Approach is more robust towards skewed distributions
- Resizing hash table could prevent this
  - Should benchmark both cases. Resizing and not resizing
External aggregation algorithm
- Phase 1: Thread-Local Pre-Aggregation
  - Chunks are fed to threads which pre-aggregate into small, fixed size hash tables
- Phase 2: Partition-Wise Aggregation
  - Over-partitioning, to keep memory pressure low
  - When a partition is fully aggregated, the results are immediately pushed to the next operator, freeing up used pages
- Hash table design
  - Consists of 2 parts
  1. Array of 64-bit entries
  - Lower 48 bits contains pointer
  - Salt → Upper 16 bits
  - Contain the upper 16 bits of hash of corresponding tuple (not just of the join key)
  - Helps with collision management in linear probing
  3. Temporary pages consisting of groups and aggregates and corresponding pages for variable length values
- Partitioning
  - Directly materialised tuples into partitions
  - Data is simultaneously converted into row-major during partitioning
  - Partitions are determined by radix
    - Radix → Middle bits of the hash that are not used for the salt or for the offset into the hash table (lower bits)
      - Used bits should not overlap
- RAM-Oblivious
  - HT is reset once it is two-thirds full
    - We can vary this threshold
  - Resetting is inexpensive
    - Only the array of 64-bit entries are reset
    - Pages that store the tuples can be unpinned
Experimentation
- Set the tenancy of the AWS instance to dedicated (but did not use the rest of the node’s capacity)
- Used available Instance Storage, which is physically attached to the host
Evaluation
- Umbra doesn’t have external aggregation yet? Seems strange for a disk-based database 🤔
- Clickhouse’s external aggregation scales well, but it still runs out of memory on the largest grouping they tested
Future work
- Problem → Same group may be materialised multiple times during phase 1
  - Can be mitigated by adaptively exchanging and aggregating partitions early during phase 1, if the memory limit would otherwise be exceeded

Robust External Hash Aggregation in the Solid State Age

References