04-05 - Query Execution & Processing
Status: #idea
Tags: CMU Advanced Database Systems
04-05 - Query Execution & Processing
- 3 main optimizations
- Data parallelization (Vectorization)
- Task parallelization (Multi-threading)
- Code specialization (Pre-compiled/JIT)
- JIT is very hard to implement, so often pre-compiled operators are more common
Optimization Goals
- Reduce instruction count
- Reduce cycles per instruction
- Reduce cache misses
- Maximize data locality
- Parallelized execution
Query Execution
- Query plan -> DAG of operators
- Operator instance -> Invocation of an operator on a unique segment of data
- Task -> Sequence of one or more operator instances
CPU Overview
Selection Scans
Scalar (Branching)
i = 0
for t in table:
key = t.key
if (low < key) && (key < high):
copy(t, output[i])
i = i + 1
Scalar (Branch-less)
i = 0
for t in table:
copy(t, output[i]) # Always copying the data
key = t.key
delta = (low<key ? 1 : 0) & (key<high ? 1 : 0)
i = i + delta
# NOTE: Ignore last element if delta is 0
Excessive instructions
DBMS must check a value's type before it performs any operation
- Usually implemented as giant switch statements
- Creates more branches, which is bad for branch prediction
Processing Model
Plan Processing Direction
Top-to-bottom (Pull)
- Start with root and pull data from children
- Tuples are passed between operators using function calls (jump instruction - bad for super-scalar CPU)
- Easy to control output via
LIMIT - Parent blocks till child returns
Next()lookup is expensive (due to branching)
Bottom-to-Top (Push)
- Start with leaf nodes and push data to parents, using an external scheduler
- Can fuse operators within a for-loop to minimize intermediate result staging
- Allows for tighter control of caches/registers in pipelines
- May not have exact control of intermediate result sizes
Observation: Vector may contain some tuples which do not satisfy filters
Selection vector
- Dense, sorted list of offsets of valid tuples
- Pre-allocate selection vector as max-size of input vector
Bitmap
- Some SIMD instructions natively use bitmaps as input masks
- Push-based systems with centralized scheduling enables fine-grained control of execution
Parallel Execution
Operator Output
- Contents can vary depending on:
- processing model
- storage model
- data requirements in query
Early materialization
- Copy the values for the attributes in outer and inner tuples into a new output tuple
- Subsequent operators in the query never need to go back to the base tables to fetch data
Late materialization
- Only passes the bare minimum amount of data, i.e. only copy the joins' keys along with the tuple ID's
- Common in most OLAP systems.
- Ideal for most column stores because the DBMS does not need to copy data that is not needed for the query
Internal Data Representation
Expression Evaluation
- DBMS represents a
WHEREclause as an expression tree - Nodes in the tree represent
- Comparisons
- Conjunction (
AND), Disjunction (OR) - Arithmetic operators
- Constant values
- Tuple attribute references
- Functions
- Evaluating predicates by a tree is terrible for the tree
Optimizations in Velox
- Constant Folding -> Compute the sub-expression on a constant once and reuse the result
- Common Sub-Expression Elimination -> Identify repeated sub-expressions that can be shared across expression tree
Adaptive Query Processing
-
Allow the execution engine to modify a query's plan and expression tree while it is running
-
Goal is to use information gathered from executing some part of the query to decide how to best proceed with executing the rest query
- In the extreme case, the DBMS can give up and return the query to the optimizer but with new information
- This is probably not done in any non-academic system
-
Predicate Reordering -> Decide the order to run predicates based on their selectivity and computation cost
-
Column Prefetching -> Async retrieval of columns
-
Not Null Fast Path -> Switch to faster functions that skip null checking if input vector has no null values
-
Elide ASCII Encoding Checks -> Use faster ASCII functions if no UTF-8 data
- Bonus: Can reuse buffers and perform the computation in-place, without additional
malloc's
- Bonus: Can reuse buffers and perform the computation in-place, without additional