Caching at e6data
202407271045
Status: #idea
Tags: e6data
- A major bottleneck in existing distributed query engines such as Spark and Presto is the central coordinator
- Impact
  - Speed up reading table metadata
  - Avoid query re-processing
  - Speed up reading table’s data
    - How do we handle changing data?
  - Avoid re-computation of intermediate results of a query
Impact on table metadata operations
Problem definition
- Listing files in an object store is very expensive (the key namespace is flat, so there is no file tree to walk)
- Metadata is stored in small files (object stores are optimized for throughput, not for low-latency reads of many small objects)
- Metadata changes infrequently
Our solution
- Cache a versioned snapshot of the table’s metadata
- Refresh the table metadata asynchronously in the background
- Incremental snapshot updates
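A minimal sketch of the three points above, not e6data's actual implementation. `Snapshot`, `MetadataCache`, and the `fetch_delta` callback are hypothetical names; the delta format (added files, removed paths) is an assumption about what an incremental update could look like.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """View of a table's file listing at one version (treated as immutable)."""
    version: int
    files: Dict[str, int]  # path -> size in bytes


# Hypothetical delta format: (added files, removed paths) since some version.
Delta = Tuple[Dict[str, int], List[str]]


class MetadataCache:
    def __init__(self, fetch_delta: Callable[[int], Tuple[int, Delta]]):
        # fetch_delta(since_version) -> (latest_version, delta). Assumed to read
        # a commit log instead of re-listing the whole table.
        self._fetch_delta = fetch_delta
        self._snapshot = Snapshot(version=0, files={})
        self._lock = threading.Lock()

    def current(self) -> Snapshot:
        """Readers get the cached snapshot without touching the object store."""
        with self._lock:
            return self._snapshot

    def refresh(self) -> Snapshot:
        """Incremental update: apply only the changes since the cached version."""
        base = self.current()
        latest, (added, removed) = self._fetch_delta(base.version)
        if latest == base.version:
            return base
        files = dict(base.files)  # copy, so older snapshots stay valid
        files.update(added)
        for path in removed:
            files.pop(path, None)
        new = Snapshot(version=latest, files=files)
        with self._lock:
            # Keep the newest snapshot if two refreshes race.
            if new.version > self._snapshot.version:
                self._snapshot = new
            return self._snapshot

    def refresh_in_background(self) -> threading.Thread:
        """Async refresh: queries keep using the old snapshot until the swap."""
        thread = threading.Thread(target=self.refresh, daemon=True)
        thread.start()
        return thread
```

Because each snapshot is a separate object, a query that started on version N keeps a consistent view even if the background refresh swaps in version N+1 mid-query.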
Impact on reading table data
Problem definition
- High first-byte latency for object stores
- Random access performs poorly in object stores (worker nodes perform prefetching on the S3 side)
- Parquet requires random access to a certain extent (e.g., reading the footer at the end of the file, then individual column chunks)
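To make the Parquet point concrete: a Parquet file ends with the footer metadata, followed by a 4-byte little-endian footer length and the magic bytes `PAR1`. A reader therefore needs at least a tail read, a footer read, and then reads at the column chunk offsets listed in the footer. The sketch below only computes the footer's byte range; `footer_range` is a hypothetical helper name, and the "file" in the usage example is fake bytes, not real Parquet data.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # Parquet magic bytes, present at the start and end of a file


def footer_range(tail: bytes, file_size: int) -> Tuple[int, int]:
    """Given the last 8 bytes of a file, return (offset, length) of the footer."""
    if len(tail) != 8 or tail[4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return file_size - 8 - footer_len, footer_len


# Fake file: magic, 100 bytes of "data", 20 bytes of "footer", length, magic.
blob = MAGIC + b"x" * 100 + b"m" * 20 + struct.pack("<I", 20) + MAGIC
offset, length = footer_range(blob[-8:], len(blob))
footer = blob[offset:offset + length]
```

Against an object store, each of these dependent reads pays the first-byte latency, which is why readers often fetch a larger tail speculatively to get the length and the footer in one request.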
Distributed data caching with task pinning
- Pin tasks to workers that have already cached the relevant data
- Work stealing occurs when a worker's queue grows large (e.g., because popular files are cached on a single worker)
- Not used in e6data
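For reference, a rough sketch of what pinning with work stealing could look like. This illustrates the approach described above, which per this note is not what e6data does. `PinningScheduler` and the `max_queue` threshold are hypothetical, and the scheduler is assumed to track which worker has cached which file.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set


class PinningScheduler:
    def __init__(self, workers: List[str], max_queue: int = 4):
        self.queues: Dict[str, Deque[str]] = {w: deque() for w in workers}
        self.cached: Dict[str, Set[str]] = {w: set() for w in workers}
        self.max_queue = max_queue  # hypothetical "queue is too long" threshold

    def submit(self, file: str) -> str:
        """Queue a task that reads `file`; return the chosen worker."""
        holders = [w for w in self.queues if file in self.cached[w]]
        # Prefer workers that already cache the file, else any worker;
        # break ties by shortest queue.
        candidates = holders or list(self.queues)
        worker = min(candidates, key=lambda w: len(self.queues[w]))
        self.queues[worker].append(file)
        self.cached[worker].add(file)  # it will hold the file after reading it
        return worker

    def steal(self, idle: str) -> Optional[str]:
        """Let an idle worker take one task from the most overloaded queue."""
        victim = max(self.queues, key=lambda w: len(self.queues[w]))
        if victim == idle or len(self.queues[victim]) <= self.max_queue:
            return None
        file = self.queues[victim].pop()  # take from the tail of the queue
        self.queues[idle].append(file)
        self.cached[idle].add(file)  # the thief now caches the file too
        return file
```

A side effect of stealing is that the hot file ends up cached on more than one worker, so later tasks for it can be spread out.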
Distributed data caching without task pinning
- The distribution service doesn’t know which data is cached on each worker
- Fetching data from the caches of other workers is not desirable, even though it is faster than fetching from S3: each EC2 instance has limited network bandwidth, and right now we already saturate it
Merging ranges
- Each Parquet file is modeled as a series of virtual pages (default size of 64K)
- Raw bytes are cached, to ensure compatibility with all file types
- Ranges are merged recursively: ranges separated by less than a configurable minimum gap are coalesced into a single read
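A small sketch of the idea, under a few assumptions: "64K" is read as 64 * 1024 bytes, the minimum gap is expressed in pages, and a single sorted sweep stands in for the recursive merge. `to_pages` and `merge_ranges` are hypothetical names.

```python
from typing import List, Tuple

PAGE_SIZE = 64 * 1024  # assumed reading of the 64K default virtual page size


def to_pages(offset: int, length: int, page_size: int = PAGE_SIZE) -> Tuple[int, int]:
    """Expand a byte range to the half-open range of virtual pages covering it."""
    if length <= 0:
        raise ValueError("length must be positive")
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return first, last + 1


def merge_ranges(ranges: List[Tuple[int, int]], min_gap: int) -> List[Tuple[int, int]]:
    """Coalesce half-open (start, end) ranges whose gap is smaller than min_gap."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] < min_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate read.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Three byte ranges requested by a reader, mapped to page ranges.
pages = [to_pages(0, 100), to_pages(70_000, 10), to_pages(1_000_000, 5)]
print(merge_ranges(pages, min_gap=1))   # [(0, 2), (15, 16)]
print(merge_ranges(pages, min_gap=16))  # [(0, 16)]
```

A larger minimum gap trades extra bytes read for fewer requests, which tends to pay off when first-byte latency dominates. Since the cache stores raw bytes per page, nothing here depends on the file being Parquet.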