My Notes

202502271027
Status: #idea
Tags:

Notes

Tunables

Model parallelization
- Tensor parallelization → Slice up the computations to happen on a differebt GPU
- Number of pipeline stages → Each layer(s) on a different gpu
- Number of replicas → Multiple requests
Scheduling algo - How to optimally process requests?
- Algorithm specific parameters
Max batch size, max wait time for a batch, etc.

Attention

Optimal configuration is a function of model-trace pair
Trace is list of requests and when they come

Challenges

Granularity should is much finer (few ms) than in the case of LLM training (~100ms)
Varying iteration times
- Different phases - Prefill and decode
- Input sizes vary (due to request being different)
- Batch size varies (unlike training)
  - Depending on the system load and the workload characteristics
Small errors in predictions lead to cascading effect
- Each iteration’s prediction should be super accurate, as there are many iterations for a single inference

Vidur

Most LLMs can be decomoposed into a small set of operators:
1. token-level
2. sequence-level
3. communication operators
Given a model
1. Vidur will decompose it into operators
2. Identifies a minimal set of input sizes that should be profiled (experimentally)
3. Build a predictor using the above data
Vidur will output:
- Time to First Token (TFFT)
- Time Between Tokens (TBT)
- Latency
- Throughput
- Cluster-level metrics
  - Model Flops Utilization (MFU)
  - Memory utilization

Profiling

Info

Runtime of some operations depend on total context length, while some others only depenend on the number of tokens in the current iteration

Attention

Only attention kernel is dependant on request history
During decode, the MLP will always take the same amount of compute
Hence, only attention kernel needs to be profiled.
- Attention during decode is memory-bound, so only the size of KV-cache needs to be modeled, to determine the kernel runtime

Automatic Profiling for Different Parallelism Strategies

No need to profile each strategy
Vidur knows each strat, so it can identify which subset of computation is performed on each device.
- During profiling, it automatically identifies tensor sharing configurations for each operator from a declarative specification of the model
- Hence, can simulate various parallelization after profiling just 1 GPU

Info

Vidur works. Has a 9% error rate at request level.
Claims to be able to predict cluster level metrics (PROOF?)

Vidur-Bench

Workload has a huge effect on inference perf
- Number of input tokents
- Number of decode tokens (DOUBT)
- Batch size
Acts as a standardised benchmark to evaluate LLM inference performance
- Consists of workload traces and existing abtching and scheduling policies

Vidur-Search

Figure out the optimal settings for a given workload trace
Finds the highest throughput-per-dollar config

Background

All (relevant) LLMs use self-attention
- Attention helps learn the relationship between tokens
- LLM = Attention block + MLP
2 phases during inference
1. Prefill
  - Processes entire input prompt
  - Produces first token
2. Decode
  - Output tokens are generated 1 at a time
  - Auto-regressive
  - Last generated token ( $T_{k}$ ) is used to generate $T_{k + 1}$
  - Continues till a special end-of-sequence token is generated ( $T_{end}$ )
  - Requires repeated access to key and value activations
    - Store these in a KV-cache, instead of re-computing

Model Parallelism

Tensor Parallelism

Shard the each layer across multiple devices
The layers can be0 broken up
- Model weights and KV cache are split equally across GPUs
Pros
- Throughput $↑$ if batch size $↑$
- Latency $↓$ → Due to splititng each operator across multiple GPUs
  - DOUBT: WHat is the atomic unit here? Which operator? (MatMul, Accum)?
Cons
- Frequent blocking communication b/w worker GPUs
  - AKA need NvLink or similar → Expensive, specialized hardware :(

Pipeline Parallelism

The model is broken into separate stages
- Each stage is on a separate device
Output of one stage is input to the next
Pro → Much lesser communication than tensor parallel
Everything else is same as CompArch pipelining
- Pipeline stalls can occur

LLM Scheduler Design

Schedulers can be:
1. Prefill prioritizing
  - Higher batch sizes → Throughput $↑$ and Latency $↑$
2. Decode prioritizing
  - Latency $↓$ but throughput $↓$
Existing work
- Sarathi-Serve → Tries to utilize computation slack in decode phase
- Splitwise and Dist-Serve → Computation for prefill and decodes on separate devices

Configuration Space (for LLM inference)

Tunables
- Parallelism strategy
- Scheduler
- Chunk size (DOUBT)
- Batch size
- SKU (DOUBT)
The optimal configuration is for a $(model, trace)$ pair
- Search is $O (| models | \cdot | traces |)$

How it actually works

Processing

Model onboarding (Section 4.2)

flowchart TD


1([Model specification]) -->|Generates| 2([Set of operations to profile])

3[Profiler]

2 --> 3
4([Runtime characteristics]) -->|Fed into| 3

5["Runtime estimator
(ML model)"]

3 -->|Profiled measurements| 5
5 -->|Outputs| 6(["*LUT*: operator -> runtime"])

Profiler

graph

0([Operators]) --- 1([Token-level]) & 2([Sequence-level]) & 3([Communication])

Token-level

Depends on total number of tokens processed (prefill + decode)
- E.g.: Linear, activation functions
2 main types
1. MatMul’s
2. Point-wise apply/reduction operations → Addition, normalization, activation functions
In order to profile
- Generate the tensor parallel sharing configurations and profile each combination
Tooling
- Standard PyTorch kernels
- CUDA Profiling Tools Interface (CUPTI)

Attention

This is how to obtain traces for different parallelism configs using a single GPU

Sequence-level

Depend on the number of tokens AND context length of each request
- E.g. Attention
Profiling
- Separately profile attention kernels for prefill and decode phases
  - Prefill
    $P \to Batch of prefills$
    $p_{i} \to Length of prefill$
    - We approximate the runtime of a batch to be equivalent to that of a single prefill of length of root of sum of squares ${(t_{a t t e n t i o n})}_{p r e f i l l} \propto \sum t (p_{i}^{2}) \approx t (\sqrt{\sum {p_{i}}^{2}})$
  - Decode
    - Largely memory bound
    - Runtime is determinded by total data volume that needs to be fetched from the KV-cache
    - All models can not effectively parallelize KV-cache fetch when there are large skews, but sequence parallel attention kernels can handle such skews

Communication

Depend ONLY on the amount of data to be transferred
- Independent of model architecture
- E.g. all-reduce, all-gather
3 types
1. all-reduce
2. all-gather → Tensor parallel
3. send-recv → Pipeline parallel
Profiled independantly in a model agnostic manner for different topologies (??)

Runtime Estimator (Section 4.4)

2 parts
1. Trains models on profiled data
2. Generates runtime estimates for a range of input tensor dimensions (from end-to-end simulation)
Use Random Forest

Hierarchical Scheduler (Section 4.5)

3-tier system

graph TD

subgraph S1[Global Scheduler]

G0([Scheduling Policy])

end

S1 --- S2 --- S3

  

subgraph S2[Replica Scheduler]
direction LR

A([Product Space]) --> B(Memory Planner)

C([Parallelism]) --> B

B --> D([Mem Available for KV Cache])

B --> F([High-Level API Used to Implement Task/Batching Policy])

end

  

subgraph S3[Replica Stage]

end

Global scheduler

Does request routing
Supports standard load balancing policies
- Round robin, least outstanding requests, etc.
Supports stateful policies
- Defer routing till a later time
  - Beneficial for bursty workloads (NOTE: Typo)

Replica scheduler

2 responsibilities
1. Batching
2. Memory management
  - Makes it easy to implement new batching policies

Replica stage scheduler

Handles the scheduling of micro-batches within a pipeline stage
Future work → Emulate optimizations like:
- asycn communication
- sequence parallelism
- pipelined decoding

Vidur-Bench

Benchmark suite for LLM inference
Provides a set of workloads from public datasets

Performance Metrics

Operator-level

Operators’ input size
Operators’ execution time

Request-level

Time-to-first-token (TFFT)
Time-between-tokens (TBT)
Additional metrics of interest
- How many times vLLM preempts or restarts each request, when it runs out of VRAM for KV-cache

Replica-level

Batch size
Tokens processed per iteration
Busy/idle time
Memory and compute utilization

Hardware

Cluster-wide memory and compute
Cluster’s energy consumption

Vidur-Search

Find the best config quickly
Input
- LLM model
- Workload
- Available GPU SKUs
- Maximum number of SKUs in replica
Constraints
- SLOs on metrics (like TFFT and TBT)
Search space (automated)
- Parallelism
  - Parallelism strategy
    - Tensor parallel
    - Pipeline parallel
  - Parallelism degree
- Scheduling policy
  - Scheduler-specific parameters
- Batch size
- Choice of GPU
Optimisation objective
- Maximise QPS per $

for every setup in all setups {
	do binary search to find max QPS {
		condition: queue delay stays under 5 seconds
	}
}

SORT all setups by QPS per dollar

SELECT cheapest config that meets latency targets

Each search is run on a separate core

Evaluation

Good eval. Tests various metrics of interest in isolation. Also, has good appendix with results
- Overall, seems trustworthy
Evaluated on both offline and online workloads
- Online → Request arrive based on a Poisson distribution (why?)
Eval metric
- Percentage error for normalised (wrt to output length) E2E latency
- For static workload, only request execution time is measured (not scheduling delay)
It works! Worst error is $< 10 %$
Dynamic workloads
- Need to evaluate close to capacity of the system
- Request rate at 85% of capacity is similar to prod (Citation? Proof?)
Good point about how if capacity is exceeded, massive delays occur.
- Prod systems have additional buffer to prevent this
Cost savings due to test via simulation (instead of actually running every configuration) is in the order of $10^{3} - 10^{4} \times$ range
- Practically (imo) no-one would do an exploration of this level, leading to inflated value prop. However, the value of simulation remains

Pareto Frontier Analysis

Note

Actually, a Pareto Frontier is very simple. Refer to https://youtu.be/ELLHqHk32II?t=157 this video for a visual representation

Figure 5 of the paper is a v nice graph.
- They plot $Q P S / $$ vs TTFT and TBT.
- Then they plot TTFT vs TBT, and see where the cheapest configuration shows up
- Seems like a useful tool to analyse the tradeoffs when choosing among multiple configurations.
  - Not very relevant to the paper (imo)
- DOUBT: Why are some blue dots (do not meet the SLO for all the metrics) present in the shaded region in the 3rd plot?
  - This is due to overloaded meaning of the blue color.
  - In the 3rd plot, the blue represents the cost (via a tempature colormap), while in the other 2 plots, it represents points which do not meet some (but not all) SLOs
“if the TBT SLO is changed from 0.12 seconds to 0.14 seconds (a difference of only 20ms), the Pareto curve point moves from approximately 0.07 to 0.13, ∼ 1.85× reduction in cost!”
- This line seems misleading. It is a $16 %$ increase. However, we do see a $100 %$ increase in $Q P S / $$ , which imo is worth investigating

Important Conclusions

Change in workload can drastically change the optimal configuration

Points to improve

Would have loved to see the prediction on a real-world workload (even if the workload is internal to MSoft)
- Would cement the relevance of simulation
- Would also like to know the existing internal processes for optimal configurations, and seeing if Vidur performs as well as (or better than) existing methods used in prod
The graphs do not indicate any form of error bars, and the paper doesn’t discuss if the experiments were run multiple times
- Since it incorporates hardware profiling, taking the mean and ensuring a low deviation is necessary
Would be nice to see a deeper analysis of the relationship between certain SLOs and $Q P S / $$
- Especially as they note the rapid change in QPS when SLO is only slightly changed
It is unclear if the profiling is fixed-time or fixed-work
- Systems benchmarks should always be fixed-work^[1]
Figures and graphs are inaccessible
- Poor choice of colours
- Lack of symbols is neither printer-friendly not colour-blindness friendly
- Perhaps prefer bright+ieee from SciencePlots

References

https://benchmarking-book.com/ ↩︎