My Notes
202502271027
Status: #idea
Tags:
Notes
Tunables
- Model parallelization
- Tensor parallelization → Slice up the computations to happen on a differebt GPU
- Number of pipeline stages → Each layer(s) on a different gpu
- Number of replicas → Multiple requests
- Scheduling algo - How to optimally process requests?
- Algorithm specific parameters
- Max batch size, max wait time for a batch, etc.
Optimal configuration is a function of model-trace pair
Trace is list of requests and when they come
Challenges
- Granularity should is much finer (few ms) than in the case of LLM training (~100ms)
- Varying iteration times
- Different phases - Prefill and decode
- Input sizes vary (due to request being different)
- Batch size varies (unlike training)
- Depending on the system load and the workload characteristics
- Small errors in predictions lead to cascading effect
- Each iteration’s prediction should be super accurate, as there are many iterations for a single inference
Vidur
- Most LLMs can be decomoposed into a small set of operators:
- token-level
- sequence-level
- communication operators
- Given a model
- Vidur will decompose it into operators
- Identifies a minimal set of input sizes that should be profiled (experimentally)
- Build a predictor using the above data
- Vidur will output:
- Time to First Token (TFFT)
- Time Between Tokens (TBT)
- Latency
- Throughput
- Cluster-level metrics
- Model Flops Utilization (MFU)
- Memory utilization
Profiling
Runtime of some operations depend on total context length, while some others only depenend on the number of tokens in the current iteration
- Only attention kernel is dependant on request history
- During decode, the MLP will always take the same amount of compute
- Hence, only attention kernel needs to be profiled.
- Attention during decode is memory-bound, so only the size of KV-cache needs to be modeled, to determine the kernel runtime
Automatic Profiling for Different Parallelism Strategies
- No need to profile each strategy
- Vidur knows each strat, so it can identify which subset of computation is performed on each device.
- During profiling, it automatically identifies tensor sharing configurations for each operator from a declarative specification of the model
- Hence, can simulate various parallelization after profiling just 1 GPU
Vidur works. Has a 9% error rate at request level.
Claims to be able to predict cluster level metrics (PROOF?)
Vidur-Bench
- Workload has a huge effect on inference perf
- Number of input tokents
- Number of decode tokens (DOUBT)
- Batch size
- Acts as a standardised benchmark to evaluate LLM inference performance
- Consists of workload traces and existing abtching and scheduling policies
Vidur-Search
- Figure out the optimal settings for a given workload trace
- Finds the highest throughput-per-dollar config
Background
- All (relevant) LLMs use self-attention
- Attention helps learn the relationship between tokens
- LLM = Attention block + MLP
- 2 phases during inference
- Prefill
- Processes entire input prompt
- Produces first token
- Decode
- Output tokens are generated 1 at a time
- Auto-regressive
- Last generated token (
) is used to generate - Continues till a special end-of-sequence token is generated (
) - Requires repeated access to key and value activations
- Store these in a KV-cache, instead of re-computing
- Prefill
Model Parallelism
Tensor Parallelism
- Shard the each layer across multiple devices
- The layers can be0 broken up
- Model weights and KV cache are split equally across GPUs
- Pros
- Throughput
if batch size - Latency
→ Due to splititng each operator across multiple GPUs - DOUBT: WHat is the atomic unit here? Which operator? (MatMul, Accum)?
- Throughput
- Cons
- Frequent blocking communication b/w worker GPUs
- AKA need NvLink or similar → Expensive, specialized hardware :(
- Frequent blocking communication b/w worker GPUs
Pipeline Parallelism
- The model is broken into separate stages
- Each stage is on a separate device
- Output of one stage is input to the next
- Pro → Much lesser communication than tensor parallel
- Everything else is same as CompArch pipelining
- Pipeline stalls can occur
LLM Scheduler Design
- Schedulers can be:
- Prefill prioritizing
- Higher batch sizes → Throughput
and Latency
- Higher batch sizes → Throughput
- Decode prioritizing
- Latency
but throughput
- Latency
- Prefill prioritizing
- Existing work
- Sarathi-Serve → Tries to utilize computation slack in decode phase
- Splitwise and Dist-Serve → Computation for prefill and decodes on separate devices
Configuration Space (for LLM inference)
- Tunables
- Parallelism strategy
- Scheduler
- Chunk size (DOUBT)
- Batch size
- SKU (DOUBT)
- The optimal configuration is for a
pair - Search is
- Search is
How it actually works
Processing
Model onboarding (Section 4.2)
flowchart TD 1([Model specification]) -->|Generates| 2([Set of operations to profile]) 3[Profiler] 2 --> 3 4([Runtime characteristics]) -->|Fed into| 3 5["Runtime estimator (ML model)"] 3 -->|Profiled measurements| 5 5 -->|Outputs| 6(["*LUT*: operator -> runtime"])
Profiler
graph 0([Operators]) --- 1([Token-level]) & 2([Sequence-level]) & 3([Communication])
Token-level
- Depends on total number of tokens processed (prefill + decode)
- E.g.: Linear, activation functions
- 2 main types
- MatMul’s
- Point-wise apply/reduction operations → Addition, normalization, activation functions
- In order to profile
- Generate the tensor parallel sharing configurations and profile each combination
- Tooling
- Standard PyTorch kernels
- CUDA Profiling Tools Interface (CUPTI)
This is how to obtain traces for different parallelism configs using a single GPU
Sequence-level
- Depend on the number of tokens AND context length of each request
- E.g. Attention
- Profiling
- Separately profile attention kernels for prefill and decode phases
- Prefill
- We approximate the runtime of a batch to be equivalent to that of a single prefill of length of root of sum of squares
- We approximate the runtime of a batch to be equivalent to that of a single prefill of length of root of sum of squares
- Decode
- Largely memory bound
- Runtime is determinded by total data volume that needs to be fetched from the KV-cache
- All models can not effectively parallelize KV-cache fetch when there are large skews, but sequence parallel attention kernels can handle such skews
- Prefill
- Separately profile attention kernels for prefill and decode phases
Communication
- Depend ONLY on the amount of data to be transferred
- Independent of model architecture
- E.g. all-reduce, all-gather
- 3 types
- all-reduce
- all-gather → Tensor parallel
- send-recv → Pipeline parallel
- Profiled independantly in a model agnostic manner for different topologies (??)
Runtime Estimator (Section 4.4)
- 2 parts
- Trains models on profiled data
- Generates runtime estimates for a range of input tensor dimensions (from end-to-end simulation)
- Use Random Forest
Hierarchical Scheduler (Section 4.5)
- 3-tier system
graph TD subgraph S1[Global Scheduler] G0([Scheduling Policy]) end S1 --- S2 --- S3 subgraph S2[Replica Scheduler] direction LR A([Product Space]) --> B(Memory Planner) C([Parallelism]) --> B B --> D([Mem Available for KV Cache]) B --> F([High-Level API Used to Implement Task/Batching Policy]) end subgraph S3[Replica Stage] end
Global scheduler
- Does request routing
- Supports standard load balancing policies
- Round robin, least outstanding requests, etc.
- Supports stateful policies
- Defer routing till a later time
- Beneficial for bursty workloads (NOTE: Typo)
- Defer routing till a later time
Replica scheduler
- 2 responsibilities
- Batching
- Memory management
- Makes it easy to implement new batching policies
Replica stage scheduler
- Handles the scheduling of micro-batches within a pipeline stage
- Future work → Emulate optimizations like:
- asycn communication
- sequence parallelism
- pipelined decoding
Vidur-Bench
- Benchmark suite for LLM inference
- Provides a set of workloads from public datasets
Performance Metrics
Operator-level
- Operators’ input size
- Operators’ execution time
Request-level
- Time-to-first-token (TFFT)
- Time-between-tokens (TBT)
- Additional metrics of interest
- How many times vLLM preempts or restarts each request, when it runs out of VRAM for KV-cache
Replica-level
- Batch size
- Tokens processed per iteration
- Busy/idle time
- Memory and compute utilization
Hardware
- Cluster-wide memory and compute
- Cluster’s energy consumption
Vidur-Search
- Find the best config quickly
- Input
- LLM model
- Workload
- Available GPU SKUs
- Maximum number of SKUs in replica
- Constraints
- SLOs on metrics (like TFFT and TBT)
- Search space (automated)
- Parallelism
- Parallelism strategy
- Tensor parallel
- Pipeline parallel
- Parallelism degree
- Parallelism strategy
- Scheduling policy
- Scheduler-specific parameters
- Batch size
- Choice of GPU
- Parallelism
- Optimisation objective
- Maximise QPS per $
for every setup in all setups {
do binary search to find max QPS {
condition: queue delay stays under 5 seconds
}
}
SORT all setups by QPS per dollar
SELECT cheapest config that meets latency targets
- Each search is run on a separate core
Evaluation
- Good eval. Tests various metrics of interest in isolation. Also, has good appendix with results
- Overall, seems trustworthy
- Evaluated on both offline and online workloads
- Online → Request arrive based on a Poisson distribution (why?)
- Eval metric
- Percentage error for normalised (wrt to output length) E2E latency
- For static workload, only request execution time is measured (not scheduling delay)
- It works! Worst error is
- Dynamic workloads
- Need to evaluate close to capacity of the system
- Request rate at 85% of capacity is similar to prod (Citation? Proof?)
- Good point about how if capacity is exceeded, massive delays occur.
- Prod systems have additional buffer to prevent this
- Cost savings due to test via simulation (instead of actually running every configuration) is in the order of
range - Practically (imo) no-one would do an exploration of this level, leading to inflated value prop. However, the value of simulation remains
Pareto Frontier Analysis
Actually, a Pareto Frontier is very simple. Refer to https://youtu.be/ELLHqHk32II?t=157 this video for a visual representation
-
Figure 5 of the paper is a v nice graph.
- They plot
vs TTFT and TBT. - Then they plot TTFT vs TBT, and see where the cheapest configuration shows up
- Seems like a useful tool to analyse the tradeoffs when choosing among multiple configurations.
- Not very relevant to the paper (imo)
- DOUBT: Why are some blue dots (do not meet the SLO for all the metrics) present in the shaded region in the 3rd plot?
- This is due to overloaded meaning of the blue color.
- In the 3rd plot, the blue represents the cost (via a tempature colormap), while in the other 2 plots, it represents points which do not meet some (but not all) SLOs
- They plot
-
“if the TBT SLO is changed from 0.12 seconds to 0.14 seconds (a difference of only 20ms), the Pareto curve point moves from approximately 0.07 to 0.13, ∼ 1.85× reduction in cost!”
- This line seems misleading. It is a
increase. However, we do see a increase in , which imo is worth investigating
- This line seems misleading. It is a
Important Conclusions
- Change in workload can drastically change the optimal configuration
Points to improve
- Would have loved to see the prediction on a real-world workload (even if the workload is internal to MSoft)
- Would cement the relevance of simulation
- Would also like to know the existing internal processes for optimal configurations, and seeing if Vidur performs as well as (or better than) existing methods used in prod
- The graphs do not indicate any form of error bars, and the paper doesn’t discuss if the experiments were run multiple times
- Since it incorporates hardware profiling, taking the mean and ensuring a low deviation is necessary
- Would be nice to see a deeper analysis of the relationship between certain SLOs and
- Especially as they note the rapid change in QPS when SLO is only slightly changed - It is unclear if the profiling is fixed-time or fixed-work
- Systems benchmarks should always be fixed-work[1]
- Figures and graphs are inaccessible
- Poor choice of colours
- Lack of symbols is neither printer-friendly not colour-blindness friendly
- Perhaps prefer
bright+ieeefrom SciencePlots