e6-experiments

202406251015
Status: #idea
Tags: e6data

e6-experiments

Questions?

PRESENTATION
VarHandleStringCopy not working. Need help in fixing. Passing command-ling args to javac like stackoverflow suggested did not work
Packages/classes of interest
1. MethodHandle
2. VarHandle
  1. VarHandleByteArray as Longs, etc. This might be better a more Java idiomatic way
  2. VarHandle uses Unsafe internally.
  3. VarHandleMemorySegment as Longs.
    1. This seems to have a lot of scope.
    2. For my work, perhaps in the case of a buffer manager. Could be used to implement pages, and other page-backed data structures
3. MemorySegment
  1. https://www.baeldung.com/java-foreign-memory-access#1-memorysegment
  2. https://www.baeldung.com/java-foreign-memory-access#using-native-memory

Foreign Memory Access in Java

MemorySegment→ Model a contiguous region of memory
MemoryAddress → A location in a MemorySegment
MemoryLayout → A way to define the layout of the bytes of data in a MemorySegment in a Java idiomatic way

`MemorySegment`

A contiguous section of memory. Can be:
- heap memory
- off-heap memory

Info

A memory segment is attached to a particular thread
Other threads can gain access to it using the acquire() method

A memory memory segment has spatial bounds and temporal bounds (provided by Scope)

Memory locations

A native memory segment is backed by native memory
A memory segment can also be backed by a regular, heap-allocated Java array
Addresses
- Virtual address for on-heap segments
- Physical address for off-heap segments

MemorySegment memorySegment = MemorySegment.allocateNative(200); // Native memory segment
MemorySegment memorySegment = MemorySegment.ofArray(new long[100]); // Array memory segment
MemorySegment memorySegment = MemorySegment.ofByteBuffer(ByteBuffer.allocateDirect(200)); // Buffer memory segment (backed by existing ByteBuffer)
MemorySegment memorySegment = MemorySegment.mapFromPath(
  Path.of("/tmp/memory.txt"), 200, FileChannel.MapMode.READ_WRITE); // Memory mapped file

Scope

Memory segments’ lifetimes are modelled using Scope.
A memory segment can not be accessed if it is not ‘alive’
An unbounded lifetime is global and always alive
Bounded lifetimes modelled with a Scope that can be invalidated
- Scope invalidation can be done explicitly or is automatically done by the GC
Segments with automatic scope are:
- Array-backed segments
- Buffer-backed segments

Accessing memory segments

String Serialisation

ASCII Strings

Using
TODO: Compare ASCII string with VarHandle vs Unsafe.memory copy

Primitive Serialisation

A System.arraycopy from byte[] to byte[] is as performant as using VarHandle into a byte array
- Thoughts: byte[] backed vectors are not required for primitive data types
For byte arrays, using System.arraycopy is as performant as using Unsafe.copyMemory, and has less variation
- Thoughts: Use System.arraycopy for all byte[] to byte[] operations.
Using VarHandle to access a String’s private field does not yield any performance improvement over writing them 1 character at a time
- Perhaps VarHandle.get() does not pass a reference?
- Additionally, System.arraycopy is performing slightly worse than a character wise for-loop.
  - Note, that using a character-wise for-loop after accessing the private byte array gives the best performance
Charwise benchmarks have high variance ~50% (is it being JIT-ed way?)

Engine code benchmarks

MinIO

Spark streaming APIs, how reach from data from kafka to a lakehouse (open table format, parquet, catalogue)
How does one person setup a lakehouse from scratch (without data)
Maybe start stream writing into Parquet.

Java

Idea scratchpad

How can we strike a balance between spilling data (incoming/outgoing) and state?
Can we do a hybrid join where we start off with a hash join, and then use sorted-merge join for spilled data?
Can we perform sort-merge join directly on the data or should we perform the sort on hashes?
According to this blog post, POSIX is better for small writes. S3 performs poorly by default with small writes, due to metadata overhead? Need to confirm

Next Steps

Write a custom stream
- vector.write(Stream s)
  - where s.writeLongs(long[] l, start, end)
  - Implement theDataOutput interface. Ofc keep in mind all the optimisations of data movement, cache locality, etc.
HelloWorld serialisation
Microbenchmark for serialisation.
Start by describing what benchmarks we want to do.
Query patterns
- Table sizes (width and length)
- Join keys, and their types.
- Faiz can suggest patterns which we need to solve.
- Maybe make a data generator.
Join algo
Each spill optimisation from the paper?
Algorithms for join
- Changes from e6 prod onto Kiran’s algo
Start with DuckDB blog post
- Does DuckDB directly use jemalloc? Or do they use some special commands tht jemalloc has.
Old paper on hybrid hash join
Apache Spark and Presto are full query engines. They have spill to disk. Check Trino.
- Jira tickets
Lookup VLDB and SIGMOD proceedings
Read the patent.
Read A disk-based join with probabilistic guarantees
The Sort-Merge-Shrink Join
Adishesh
- Comprehensive analysis of S3 behaviour for every access pattern
- Some things to keep in mind, like how it varies for fixed length data types vs variable length data types
- Random vs sequential vs contiguous
- Block sizes
- reads vs writes? a thorough analysis of s3 and GCS.
- Also test variability in partition sizes. Should we consistently fetch chunks of fixed size?
- Use sorted set instead of hash map?
Priyam
- Sort-based allows streaming
- Bench sort vs hash. Sort is probably more expensive.
Papers about interacting with SSDs
- https://www.vldb.org/pvldb/vol16/p2090-haas.pdf
- https://www.vldb.org/pvldb/vol16/p1712-he.pdf
Papers about interacting with object-stores
- https://www.vldb.org/pvldb/vol16/p2769-durner.pdf
Other interesting papers
- Good hash tables for vectorised data → https://www.vldb.org/pvldb/vol16/p2755-bother.pdf
- FILM: a Fully Learned Index for Larger-than-Memory Databases
- Faster than spilling to SSDs. Made as a KV store. MSR → https://www.vldb.org/pvldb/vol15/p766-zhang.pdf
- Origami: A High-Performance Mergesort Framework
- New Wine in an Old Bottle: Data-Aware Hash Functions for Bloom Filters

Spill2Disk Papers

Umbra

DataBlocks

Doesn’t seem very relevant
Describes a data format for storing data.
Might be worth looking into if we choose to spill data (in contrast to state)
Can’t find implementation. Paper is extremely detailed at first glance

Patent: Disk-based hash join process

Waste of time 😠
- Does not provide any information
- Bunch of yap about how DB’s work, followed by a bunch of yap on what a lakehouse is, followed by a bunch of yap on what spill to disk means, in a very verbose manner, without any useful information whatsoever.
Belongs to ParAccel → Actian → Amazon Redshift

External Aggregation in DuckDB

Robust External Hash Aggregation in the Solid State Age

Parallel Grouped Aggregation in DuckDB

Redesigning DuckDB’s Sort

Survey paper on sorting in DB’s: http://doi.acm.org/10.1145/1132960.1132964
All columns in the the ORDER BY are converted into binary strings
- Padded if needed, to make them fixed size
- All columns in compare are concatenated into single binary string
DuckDB uses an implementation of Radix sort
- Parallelised via Morsel-level parallelism
Each thread sorts some data using DuckDB’s Radix sort which is then merged
- 2 types of merge sort:
  1. K-way merge → Traditionally used for external sorting
  2. Cascade merge → Used for in-memory sorting
- DuckDB prefers cascade merge for graceful degradation
- In order to fully parallelise this, DuckDB has implemented Merge Path
  - Pre-computes where the sorted lists will intersect
  - Allows all threads to remain busy during the entire merge phase
Similar to joins and aggregations, DuckDB uses a row-major representation for sorting as well, and stores the data on the pages provided by the buffer manager
- Each row has an additional 8-byte pointer field which points to the start of the row in the heap
- Enables pointer swizzling
- TODO doubt

e6-experiments

e6-experiments

Questions?

Foreign Memory Access in Java

`MemorySegment`

Memory locations

Scope

Accessing memory segments

String Serialisation

ASCII Strings

Primitive Serialisation

Engine code benchmarks

MinIO

Java

Idea scratchpad

Next Steps

Spill2Disk Papers

Umbra

DataBlocks

Patent: Disk-based hash join process

External Aggregation in DuckDB

Robust External Hash Aggregation in the Solid State Age

Parallel Grouped Aggregation in DuckDB

Redesigning DuckDB’s Sort

Papers for Spill to Disk

References