Internal Data Representation
202406212157
Status: #idea
Tags: CMU Advanced Database Systems
Internal Data Representation
- All values must be fixed length, so that the same offset locates the corresponding value in every column
- Ideal properties
  - Move data structures without serializing
  - Zero-copy shared memory access
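
A minimal sketch of why fixed-length values matter, using only the Python standard library. The column names, the 4-byte width, and the `value_at` helper are assumptions made for illustration; they are not taken from the lecture or from any particular system.

```python
import struct

WIDTH = 4  # assumed: every value is a 4-byte little-endian integer

# Two hypothetical columns, each stored as one contiguous buffer.
prices = bytearray(struct.pack("<4i", 10, 20, 30, 40))
quantities = bytearray(struct.pack("<4i", 1, 2, 3, 4))

def value_at(column, row):
    """Read one value directly from the buffer at offset row * WIDTH."""
    return struct.unpack_from("<i", column, row * WIDTH)[0]

# The same row index addresses the matching value in every column.
row = 2
pair = (value_at(prices, row), value_at(quantities, row))

# A memoryview slice refers to the original bytes without copying them,
# so a write through the view is visible in the underlying column.
view = memoryview(prices)[WIDTH:2 * WIDTH]  # bytes of row 1
struct.pack_into("<i", view, 0, 99)
```

Because no value is ever parsed or copied to find its neighbours, the same buffers could in principle be handed to another component as-is, which is the property the "ideal properties" list is after.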
Apache Arrow
- Open-source format which describes how to store data in-memory in a columnar fashion
- Supports efficient random access
- Only supports 2 encoding schemes (Dictionary and RLE)
- Velox extended Arrow to use German-Style String Storage (Refer 05 slide 41)
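  - Fixed-length portion (16 bytes) contains size + prefix + payload
  - If the string is short enough to fit in the fixed-length portion (<= 12 bytes after the 4-byte size), it is stored inline in full. Otherwise, the payload is a pointer to the string.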
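
The layout can be sketched roughly as follows. This is an illustration, not Velox's actual code: it assumes a 16-byte slot with a 4-byte size, a 4-byte prefix, and an 8-byte payload, with strings of up to 12 bytes inlined. The `encode`, `decode`, and `prefix` helpers are hypothetical, and a `bytearray` offset stands in for a real pointer.

```python
import struct

INLINE_MAX = 12  # assumed: bytes left in a 16-byte slot after the 4-byte size

def encode(s: bytes, heap: bytearray) -> bytes:
    """Build the 16-byte fixed-length slot for s; long strings go to heap."""
    if len(s) <= INLINE_MAX:
        # size (4 bytes) + whole string, zero-padded to 12 bytes
        return struct.pack("<I12s", len(s), s)
    offset = len(heap)  # stand-in for a pointer to the out-of-line string
    heap.extend(s)
    # size (4 bytes) + first 4 bytes as prefix + 8-byte "pointer"
    return struct.pack("<I4sQ", len(s), s[:4], offset)

def decode(slot: bytes, heap: bytearray) -> bytes:
    """Recover the full string from a slot, following the pointer if needed."""
    (size,) = struct.unpack_from("<I", slot, 0)
    if size <= INLINE_MAX:
        return slot[4:4 + size]
    (offset,) = struct.unpack_from("<Q", slot, 8)
    return bytes(heap[offset:offset + size])

def prefix(slot: bytes) -> bytes:
    """First 4 bytes of the string, readable without touching the heap."""
    return slot[4:8]

heap = bytearray()
short_slot = encode(b"hello", heap)
long_slot = encode(b"a much longer string value", heap)
```

Every slot is the same width, so the column stays offset-addressable, and many comparisons can be settled from the size and prefix alone without following the pointer.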
Substrait
- Open-source spec for representing relational algebra query plans
- Like Arrow, but for relational algebra
- Not great for more complex query plans
Apache DataFusion
- Vectorized execution library for Apache Arrow data
- Provides more front-end functionality than Velox