Parallel Grouped Aggregation in DuckDB

202407261508
Status: #idea
Tags: e6data

Parallel Grouped Aggregation in DuckDB

Uses linear probing due to better cache locality
For efficient resize: 2 part aggregate HT
- Separately allocated pointing array
  - Contains block, offset pairs
- On resize
  - Allocate a new, larger pointing array
  - Read all payload blocks, hash them and fill in pointers in the new pointing array
- Optimisations
  - Store hash along with the row, to prevent rehashing
Probing optimisation
- Following the pointer creates random memory access
- Pointer array now, stores some bytes of the hash, in addition to the block-offset pair
- So, if hash doesn’t match, no need to follow the pointer
- Has nice side-effect of greatly reducing full group comparisons
Minor memory utilisation optimisation
- For smaller HT’s, DuckDB supports both 32-bit and 64-bit pointer arrays
Code is found in aggregate_hashtable.cpp
Concurrency-friendly HT’s are not a solved problem yet (as of March 2022)
For parallelism, they use the method from this TUM paper
- Partitions are created using the radix partitioning scheme on the hash
- We do NOT need to merge these partitions into a single giant HT
- DuckDB-specific optimisations
  - Only start partitioning if aggregate HT exceeds a fixed number of entries (tuneable threshold. Requires benchmarking)
    - Since partitioning is local to a thread, it is possible that only some threads partition.
    - If this happens, we will need to partition the thread-local HT’s before the merge
      - This is entirely thread local
  - We stop adding values to a HT once its pointer array exceeds a certain threshold
    - Every thread then builds multiple sets of (potentially) partitioned tables
    - This leads to multiple entries for the same group, but that is okay, as during combination, all this is dealt with
- NOTE: Some aggregations can not be parallelised (e.g. median), but DuckDB provides approx_quanitle as an alternative

Parallel Grouped Aggregation in DuckDB

References