Caching at e6data

202407271045
Status: #idea
Tags: e6data

Caching at e6data

A major bottleneck in existing distributed query engines (e.g. Spark, Presto) is the central coordinator
Impact
- Speed up reading table metadata
- Avoid query re-processing
- Speed up reading table’s data
- How do we handle changing data?
- Avoid re-computation of intermediate results of a query

Impact on table metadata operations

Problem definition

Listing files in an object store is expensive (the key namespace is flat, so there is no real file tree to walk)
Metadata is stored in many small files (object stores are optimized for throughput, not for low-latency reads of small objects)
Metadata changes infrequently

Our solution

Cache a versioned snapshot of the table’s metadata
Refresh the table metadata asynchronously in the background
- Incremental snapshot updates
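
A minimal sketch of the idea above, not e6data's actual implementation. The names `MetadataCache`, `Snapshot`, and `load_delta` are hypothetical, and it assumes the catalog can report the files added and removed since a given version. Readers always get the cached snapshot immediately; a background thread applies deltas instead of re-listing the whole table.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# (new_version, added_files, removed_files); hypothetical delta format
Delta = Tuple[int, List[str], List[str]]


@dataclass(frozen=True)
class Snapshot:
    version: int
    files: Tuple[str, ...]  # data files visible at this version


class MetadataCache:
    """Serves cached snapshots and refreshes them with incremental deltas."""

    def __init__(self, load_delta: Callable[[str, int], Optional[Delta]]):
        # load_delta(table, since_version) returns None when nothing changed.
        self._load_delta = load_delta
        self._snapshots: Dict[str, Snapshot] = {}
        self._lock = threading.Lock()

    def get(self, table: str) -> Snapshot:
        with self._lock:
            snap = self._snapshots.get(table)
        # Only a cold miss pays for a synchronous load.
        return snap if snap is not None else self.refresh(table)

    def refresh(self, table: str) -> Snapshot:
        with self._lock:
            current = self._snapshots.get(table, Snapshot(0, ()))
        # The slow catalog call happens outside the lock.
        delta = self._load_delta(table, current.version)
        if delta is None:
            new = current
        else:
            version, added, removed = delta
            gone = set(removed)
            kept = tuple(f for f in current.files if f not in gone)
            new = Snapshot(version, kept + tuple(added))
        with self._lock:
            stored = self._snapshots.get(table)
            # Never replace a newer snapshot with an older one.
            if stored is None or new.version > stored.version:
                self._snapshots[table] = new
                return new
            return stored

    def refresh_async(self, table: str) -> threading.Thread:
        # Background refresh; readers keep using the old snapshot meanwhile.
        t = threading.Thread(target=self.refresh, args=(table,), daemon=True)
        t.start()
        return t
```

Because snapshots are immutable and versioned, a running query can keep the snapshot it started with while a newer one is swapped in.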

Impact on reading table data

Problem definition

Object stores have high first-byte latency
Object stores handle random access poorly (worker nodes compensate by prefetching from S3)
- Parquet requires some random access (e.g. reading the footer, then seeking to the selected column chunks)

Distributed data caching with task pinning

Pin tasks to workers which have already cached the relevant data
Work stealing occurs if a worker has a large queue (due to caching of popular files on a single worker)
Not used in e6data
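
For illustration only (the note above says this approach is not used in e6data), here is a small sketch of pinning with work stealing. `PinningScheduler`, `steal_threshold`, and the use of a file name as the task are all assumptions made for the example. Tasks go to a worker that already caches the file; an idle worker steals from the longest queue once it grows past the threshold.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set


class PinningScheduler:
    def __init__(self, workers: List[str], steal_threshold: int = 2):
        self.queues: Dict[str, Deque[str]] = {w: deque() for w in workers}
        self.cached: Dict[str, Set[str]] = {w: set() for w in workers}
        self.steal_threshold = steal_threshold

    def submit(self, file: str) -> str:
        """Queue a task on a worker that caches `file`, if any does."""
        holders = [w for w in self.queues if file in self.cached[w]]
        candidates = holders or list(self.queues)
        # Among the candidates, pick the one with the shortest queue.
        target = min(candidates, key=lambda w: len(self.queues[w]))
        self.queues[target].append(file)
        return target

    def next_task(self, worker: str) -> Optional[str]:
        """Return the worker's next task, stealing one if its queue is empty."""
        if self.queues[worker]:
            task = self.queues[worker].popleft()
        else:
            victim = max(self.queues, key=lambda w: len(self.queues[w]))
            if len(self.queues[victim]) <= self.steal_threshold:
                return None  # no queue is long enough to justify a cache miss
            task = self.queues[victim].pop()  # steal from the tail
        # After running the task, this worker has the file cached too.
        self.cached[worker].add(task)
        return task
```

A stolen task costs one cache miss, but it also replicates the popular file onto a second worker, so later tasks for that file can be spread across both.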

Distributed data caching without task pinning

Distribution service doesn’t know what data is available at each worker
Fetching data from the caches of other workers is not desirable (even though it is faster than fetching from S3), as each EC2 instance has limited network bandwidth, and right now we already saturate it

Merging ranges

Each Parquet file is modeled as a series of virtual pages (default size of 64 KB)
- Raw bytes are cached, so the cache works with any file type
Ranges are merged recursively; the minimum gap between ranges is configurable
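
A rough sketch of page alignment and range merging, under two assumptions that are not confirmed by the note: the page size is 64 * 1024 bytes, and two ranges are merged when the gap between them is at most `min_gap` bytes. The function names are hypothetical.

```python
from typing import List, Tuple

PAGE_SIZE = 64 * 1024  # assumed meaning of the 64K default

Range = Tuple[int, int]  # half-open byte range [start, end)


def to_pages(offset: int, length: int, page_size: int = PAGE_SIZE) -> Range:
    """Expand a byte request to the virtual pages that cover it."""
    start = (offset // page_size) * page_size
    end = -(-(offset + length) // page_size) * page_size  # round up
    return (start, end)


def merge_ranges(ranges: List[Range], min_gap: int = 0) -> List[Range]:
    """Merge ranges separated by at most `min_gap` bytes."""
    return _merge(sorted(ranges), min_gap)


def _merge(ranges: List[Range], min_gap: int) -> List[Range]:
    # Recursive for clarity; a long range list would want a loop instead.
    if len(ranges) < 2:
        return list(ranges)
    (s1, e1), (s2, e2) = ranges[0], ranges[1]
    if s2 - e1 <= min_gap:
        merged = (s1, max(e1, e2))
        return _merge([merged] + ranges[2:], min_gap)
    return [ranges[0]] + _merge(ranges[1:], min_gap)
```

A larger `min_gap` trades some wasted bytes for fewer requests, which matters when each request to the object store pays a high first-byte latency.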

Questions

References

Faiz Kothari & Kiran Nunna