e6-experiments
202406251015
Status: #idea
Tags: e6data
e6-experiments
Questions?
- PRESENTATION
- VarHandleStringCopy not working. Need help in fixing. Passing command-ling args to javac like stackoverflow suggested did not work
- Packages/classes of interest
- MethodHandle
- VarHandle
- VarHandleByteArray as Longs, etc. This might be better a more Java idiomatic way
- VarHandle uses Unsafe internally.
- VarHandleMemorySegment as Longs.
- This seems to have a lot of scope.
- For my work, perhaps in the case of a buffer manager. Could be used to implement pages, and other page-backed data structures
- MemorySegment
Foreign Memory Access in Java
MemorySegment→ Model a contiguous region of memoryMemoryAddress→ A location in aMemorySegmentMemoryLayout→ A way to define the layout of the bytes of data in aMemorySegmentin a Java idiomatic way
MemorySegment
- A contiguous section of memory. Can be:
- heap memory
- off-heap memory
Info
A memory segment is attached to a particular thread
Other threads can gain access to it using the acquire() method
- A memory memory segment has spatial bounds and temporal bounds (provided by Scope)
Memory locations
- A native memory segment is backed by native memory
- A memory segment can also be backed by a regular, heap-allocated Java array
- Addresses
- Virtual address for on-heap segments
- Physical address for off-heap segments
MemorySegment memorySegment = MemorySegment.allocateNative(200); // Native memory segment
MemorySegment memorySegment = MemorySegment.ofArray(new long[100]); // Array memory segment
MemorySegment memorySegment = MemorySegment.ofByteBuffer(ByteBuffer.allocateDirect(200)); // Buffer memory segment (backed by existing ByteBuffer)
MemorySegment memorySegment = MemorySegment.mapFromPath(
Path.of("/tmp/memory.txt"), 200, FileChannel.MapMode.READ_WRITE); // Memory mapped file
Scope
- Memory segments’ lifetimes are modelled using Scope.
- A memory segment can not be accessed if it is not ‘alive’
- An unbounded lifetime is global and always alive
- Bounded lifetimes modelled with a Scope that can be invalidated
- Scope invalidation can be done explicitly or is automatically done by the GC
- Segments with automatic scope are:
- Array-backed segments
- Buffer-backed segments
Accessing memory segments
String Serialisation
ASCII Strings
- Using
- TODO: Compare ASCII string with VarHandle vs Unsafe.memory copy
Primitive Serialisation
- A
System.arraycopyfrombyte[]tobyte[]is as performant as using VarHandle into a byte array- Thoughts:
byte[]backed vectors are not required for primitive data types
- Thoughts:
- For byte arrays, using
System.arraycopyis as performant as usingUnsafe.copyMemory, and has less variation- Thoughts: Use
System.arraycopyfor allbyte[]tobyte[]operations.
- Thoughts: Use
- Using VarHandle to access a String’s private field does not yield any performance improvement over writing them 1 character at a time
- Perhaps
VarHandle.get()does not pass a reference? - Additionally,
System.arraycopyis performing slightly worse than a character wise for-loop.- Note, that using a character-wise for-loop after accessing the private byte array gives the best performance
- Perhaps
- Charwise benchmarks have high variance ~50% (is it being JIT-ed way?)
Engine code benchmarks
MinIO
- Spark streaming APIs, how reach from data from kafka to a lakehouse (open table format, parquet, catalogue)
- How does one person setup a lakehouse from scratch (without data)
- Maybe start stream writing into Parquet.
Java
Idea scratchpad
- How can we strike a balance between spilling data (incoming/outgoing) and state?
- Can we do a hybrid join where we start off with a hash join, and then use sorted-merge join for spilled data?
- Can we perform sort-merge join directly on the data or should we perform the sort on hashes?
- According to this blog post, POSIX is better for small writes. S3 performs poorly by default with small writes, due to metadata overhead? Need to confirm
Next Steps
- Write a custom stream
- vector.write(Stream s)
- where s.writeLongs(long[] l, start, end)
- Implement theDataOutput interface. Ofc keep in mind all the optimisations of data movement, cache locality, etc.
- vector.write(Stream s)
- HelloWorld serialisation
- Microbenchmark for serialisation.
- Start by describing what benchmarks we want to do.
- Query patterns
- Table sizes (width and length)
- Join keys, and their types.
- Faiz can suggest patterns which we need to solve.
- Maybe make a data generator.
- Join algo
- Each spill optimisation from the paper?
- Algorithms for join
- Changes from e6 prod onto Kiran’s algo
- Start with DuckDB blog post
- Does DuckDB directly use jemalloc? Or do they use some special commands tht jemalloc has.
- Old paper on hybrid hash join
- Apache Spark and Presto are full query engines. They have spill to disk. Check Trino.
- Jira tickets
- Lookup VLDB and SIGMOD proceedings
- Read the patent.
- Read A disk-based join with probabilistic guarantees
- The Sort-Merge-Shrink Join
- Adishesh
- Comprehensive analysis of S3 behaviour for every access pattern
- Some things to keep in mind, like how it varies for fixed length data types vs variable length data types
- Random vs sequential vs contiguous
- Block sizes
- reads vs writes? a thorough analysis of s3 and GCS.
- Also test variability in partition sizes. Should we consistently fetch chunks of fixed size?
- Use sorted set instead of hash map?
- Priyam
- Sort-based allows streaming
- Bench sort vs hash. Sort is probably more expensive.
- Papers about interacting with SSDs
- Papers about interacting with object-stores
- Other interesting papers
- Good hash tables for vectorised data → https://www.vldb.org/pvldb/vol16/p2755-bother.pdf
- FILM: a Fully Learned Index for Larger-than-Memory Databases
- Faster than spilling to SSDs. Made as a KV store. MSR → https://www.vldb.org/pvldb/vol15/p766-zhang.pdf
- Origami: A High-Performance Mergesort Framework
- New Wine in an Old Bottle: Data-Aware Hash Functions for Bloom Filters
Spill2Disk Papers
Umbra
DataBlocks
- Doesn’t seem very relevant
- Describes a data format for storing data.
- Might be worth looking into if we choose to spill data (in contrast to state)
- Can’t find implementation. Paper is extremely detailed at first glance
Patent: Disk-based hash join process
- Waste of time 😠
- Does not provide any information
- Bunch of yap about how DB’s work, followed by a bunch of yap on what a lakehouse is, followed by a bunch of yap on what spill to disk means, in a very verbose manner, without any useful information whatsoever.
- Belongs to ParAccel → Actian → Amazon Redshift
External Aggregation in DuckDB
Robust External Hash Aggregation in the Solid State Age
Parallel Grouped Aggregation in DuckDB
Redesigning DuckDB’s Sort
- Survey paper on sorting in DB’s: http://doi.acm.org/10.1145/1132960.1132964
- All columns in the the
ORDER BYare converted into binary strings- Padded if needed, to make them fixed size
- All columns in compare are concatenated into single binary string
- DuckDB uses an implementation of Radix sort
- Parallelised via Morsel-level parallelism
- Each thread sorts some data using DuckDB’s Radix sort which is then merged
- 2 types of merge sort:
- K-way merge → Traditionally used for external sorting
- Cascade merge → Used for in-memory sorting
- DuckDB prefers cascade merge for graceful degradation
- In order to fully parallelise this, DuckDB has implemented Merge Path
- Pre-computes where the sorted lists will intersect
- Allows all threads to remain busy during the entire merge phase
- 2 types of merge sort:
- Similar to joins and aggregations, DuckDB uses a row-major representation for sorting as well, and stores the data on the pages provided by the buffer manager
- Each row has an additional 8-byte
pointerfield which points to the start of the row in the heap - Enables pointer swizzling
- TODO doubt
- Each row has an additional 8-byte
Papers for Spill to Disk
- https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487
- https://scholar.google.com/scholar?as_sdt=2007&q=join+disk&hl=en
- http://www.vldb.org/pvldb/vol12/p681-zou.pdf
- https://link.springer.com/article/10.1007/s00778-020-00605-w
- https://dl.acm.org/doi/pdf/10.1145/1066157.1066222
- https://dl.acm.org/doi/pdf/10.1145/1807167.1807224
- https://dl.acm.org/doi/pdf/10.1145/1189769.1189775
- https://dl.acm.org/doi/10.1145/1066157.1066222
- https://dl.acm.org/doi/pdf/10.1145/1376616.1376740
- https://dl.acm.org/doi/10.1145/1386118.1386121
- http://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2014/2013/Papers/Jermaine-SMSJoin.pdf
- https://scholar.google.com/scholar?as_sdt=2007&q=join+disk&hl=en