External Aggregation in DuckDB
202407261500
Status: #idea
Tags: e6data
External Aggregation in DuckDB
- HT for an aggregation is traditionally not stored in the buffer pool
- DuckDB stores both temporary and permanent data in buffer pool
- This means that DuckDB can use all memory for persistent data (if needed)
- If large aggregations are being done, the persistent data can be evicted from emory to make space for a large hash table
- Issues with variable page sizes
- Memory fragmentation
- Short lived allocations are only variable size
- Use
jemallocfor allocations when possible, which reduces fragmentation
- Invalid references
- Temporary data needs serdes when moving from/to storage
- Row major is more efficient for hash tables
- Created a custom page layout
- Row-major
- Uses old invalidated pointers to recompute new valid pointers after being read back into memory
- Places fixed size rows and variable size data on separate pages
- Pointers are recomputed (lazily) during deserialisation if address of the page has changed
- Memory fragmentation
- External Aggregation
- All threads have a thread local HT.
- Radix partition based aggregation, where each thread deals with 1 partition, across all thread-local HTs
- They use linear probing to resolve collisions, and salting to reduce the overhead of dealing with collisions. Explained here
- Now, each thread has one hash table, but the data in each HT is partitioned
- Data is stored on the page layout described above
- HT’s are not resized during thread-local pre-aggregation
- HT sizes are kept small, to reduce cache misses
- If HT is full, it is reset and they start over. TODO
- Data pages are unpinned, so the buffer manager can write them to disk if memory is needed.
- Partition-Wise Aggregation
- Each thread combines data of a single partition into a HT
- More partitions than threads are created
- Size of partitions will be smaller
- All partitions will not be combined simultaneously, so less memory footprint
- Size of partitions will be smaller