04-05 - Query Execution & Processing

Status: #idea
Tags: CMU Advanced Database Systems

04-05 - Query Execution & Processing

3 main optimizations
1. Data parallelization (Vectorization)
2. Task parallelization (Multi-threading)
3. Code specialization (Pre-compiled/JIT)
  - JIT is very hard to implement, so often pre-compiled operators are more common

Optimization Goals

Reduce instruction count
Reduce cycles per instruction
- Reduce cache misses
- Maximize data locality
Parallelized execution

Query Execution

Query plan -> DAG of operators
Operator instance -> Invocation of an operator on a unique segment of data
Task -> Sequence of one or more operator instances

CPU Overview

Selection Scans

Scalar (Branching)

i = 0
for t in table:
	key = t.key
	if (low < key) && (key < high):
		copy(t, output[i])
		i = i + 1

Scalar (Branch-less)

i = 0
for t in table:
	copy(t, output[i]) # Always copying the data
	key = t.key
	delta = (low<key ? 1 : 0) & (key<high ? 1 : 0)
	i = i + delta
# NOTE: Ignore last element if delta is 0

Excessive instructions

DBMS must check a value's type before it performs any operation

Usually implemented as giant switch statements
Creates more branches, which is bad for branch prediction

Processing Model

Plan Processing Direction

Top-to-bottom (Pull)

Start with root and pull data from children
Tuples are passed between operators using function calls (jump instruction - bad for super-scalar CPU)
Easy to control output via LIMIT
Parent blocks till child returns
Next() lookup is expensive (due to branching)

Bottom-to-Top (Push)

Start with leaf nodes and push data to parents, using an external scheduler
Can fuse operators within a for-loop to minimize intermediate result staging
Allows for tighter control of caches/registers in pipelines
May not have exact control of intermediate result sizes

Observation: Vector may contain some tuples which do not satisfy filters

Selection vector

Dense, sorted list of offsets of valid tuples
Pre-allocate selection vector as max-size of input vector

Bitmap

Some SIMD instructions natively use bitmaps as input masks
Push-based systems with centralized scheduling enables fine-grained control of execution

Parallel Execution

Operator Output

Contents can vary depending on:
- processing model
- storage model
- data requirements in query

Early materialization

Copy the values for the attributes in outer and inner tuples into a new output tuple
Subsequent operators in the query never need to go back to the base tables to fetch data

Late materialization

Only passes the bare minimum amount of data, i.e. only copy the joins' keys along with the tuple ID's
Common in most OLAP systems.
- Ideal for most column stores because the DBMS does not need to copy data that is not needed for the query

Internal Data Representation

Expression Evaluation

DBMS represents a WHERE clause as an expression tree
Nodes in the tree represent
- Comparisons
- Conjunction (AND), Disjunction (OR)
- Arithmetic operators
- Constant values
- Tuple attribute references
- Functions
Evaluating predicates by a tree is terrible for the tree

Optimizations in Velox

Constant Folding -> Compute the sub-expression on a constant once and reuse the result
Common Sub-Expression Elimination -> Identify repeated sub-expressions that can be shared across expression tree

Adaptive Query Processing

Allow the execution engine to modify a query's plan and expression tree while it is running
Goal is to use information gathered from executing some part of the query to decide how to best proceed with executing the rest query
- In the extreme case, the DBMS can give up and return the query to the optimizer but with new information
- This is probably not done in any non-academic system
Predicate Reordering -> Decide the order to run predicates based on their selectivity and computation cost
Column Prefetching -> Async retrieval of columns
Not Null Fast Path -> Switch to faster functions that skip null checking if input vector has no null values
Elide ASCII Encoding Checks -> Use faster ASCII functions if no UTF-8 data
- Bonus: Can reuse buffers and perform the computation in-place, without additional malloc's

References

05 - Slides