Review
202502271028
Status: #idea
Tags:
Critique
Introduction
- Presents a simulation-based approach to optimising LLM deployment by reducing the cost of experimenting with different configurations
- Combines
- operator-level profiling
- predictive modelling
- workload-aware scheduling
- Predicts performance accurately (under 9% error)
- Also includes:
- Vidur-Bench → A benchmark suite for LLM deployment
- Vidur-Search → Configuration optimiser
- Orders of magnitude cheaper than experimenting on real hardware (for finding the optimal config)
Core Contribution
1. Modular Performance Simulation (Runtime Estimator)
- Decomposes into token-level, sequence-level and communication operators
- Profiles runtime characteristics, and uses ML to predict the unprofiled ones
- Uses a random forest to interpolate between the profiled data points
- Why RF? → Fast enough and accurate enough
- Approximates prefill attention costs using aggregated batch length
- Models decode attention via KV-cache size
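A rough sketch of the two attention proxies and the interpolation step, for my own understanding. Assumptions: prefill attention cost is taken to be quadratic in sequence length, so a batch is collapsed into one equivalent length; decode attention is taken to be memory-bound, so the proxy is the total number of KV-cache tokens read. The linear interpolation is only a stand-in for the random forest, and all function names and profiled numbers here are hypothetical, not taken from the Vidur codebase.

```python
import math
from bisect import bisect_left


def prefill_equivalent_length(prefill_lengths):
    """Collapse a batch of prefill lengths into one proxy length.

    Assumes quadratic attention cost, so the batch is treated like a single
    prefill of length sqrt(sum(p_i ** 2)).
    """
    return math.sqrt(sum(p * p for p in prefill_lengths))


def decode_kv_cache_tokens(context_lengths):
    """Total KV-cache tokens read by a decode batch (memory-bound proxy)."""
    return sum(context_lengths)


def interpolate_runtime(profiled, x):
    """Estimate runtime at size x from sorted (size, runtime_ms) pairs.

    Linear interpolation as a stand-in for the learned regressor; values
    outside the profiled range are clamped to the nearest profiled point.
    """
    sizes = [size for size, _ in profiled]
    if x <= sizes[0]:
        return profiled[0][1]
    if x >= sizes[-1]:
        return profiled[-1][1]
    i = bisect_left(sizes, x)  # first profiled size >= x
    (x0, y0), (x1, y1) = profiled[i - 1], profiled[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical profiled points: (proxy size, runtime in ms).
profiled_points = [(0, 0.0), (10, 1.0), (20, 3.0)]
proxy = prefill_equivalent_length([3, 4])
estimate = interpolate_runtime(profiled_points, proxy)
```

The point of the proxies is that one scalar per batch is enough to query the runtime model, so profiling does not have to cover every combination of per-request lengths.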
2. Automated Configuration Optimisation
- Identifies Pareto-optimal configurations
- Simulates trade-offs between throughput per $ and latency SLOs (TBT and TTFT)
- Finds that cost is extremely sensitive to how tight the SLOs are
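A minimal sketch of what the Pareto filtering and SLO cut could look like, to make the trade-off concrete. This is not Vidur-Search itself: the `Config` fields, the two-objective simplification (throughput per dollar vs. TBT only) and every number below are made up for illustration.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    qps_per_dollar: float  # higher is better
    tbt_s: float           # time between tokens in seconds, lower is better


def dominates(a, b):
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.qps_per_dollar >= b.qps_per_dollar and a.tbt_s <= b.tbt_s
    better = a.qps_per_dollar > b.qps_per_dollar or a.tbt_s < b.tbt_s
    return no_worse and better


def pareto_front(configs):
    """Keep only the configs that no other config dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


def best_under_slo(configs, max_tbt_s):
    """Cheapest Pareto-optimal config meeting the TBT SLO, or None."""
    feasible = [c for c in pareto_front(configs) if c.tbt_s <= max_tbt_s]
    return max(feasible, key=lambda c: c.qps_per_dollar, default=None)


# Hypothetical simulated results for four deployment configurations.
configs = [
    Config("A", 1.0, 0.05),
    Config("B", 2.0, 0.10),
    Config("C", 1.5, 0.12),  # dominated by B on both axes
    Config("D", 3.0, 0.20),
]

front = pareto_front(configs)
loose = best_under_slo(configs, 0.20)
tight = best_under_slo(configs, 0.10)
```

With these toy numbers, tightening the TBT SLO from 0.20 s to 0.10 s moves the pick from D to B and cuts throughput per dollar from 3.0 to 2.0, which is the kind of SLO sensitivity noted above.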
3. Realistic Benchmark Suite
- Title essentially
- Has variations in prompt lengths and request arrival patterns
Limitations
- Fidelity degrades for smaller models, which are bottlenecked by CPU overhead
- Hardware heterogeneity
- Only focuses on a very limited set of GPUs (with successive micro-architectures)
- Would have been nice to see non-Nvidia GPUs as well as emerging accelerators (TPUs, Groq, Cerebras, Inferentia, etc.)
- Would be nice if energy consumption was also tracked (for non-cloud workloads)
Conclusion
- Significant tool in LLM deployment optimisation
- Operator-centric simulation approach is effective
- The integration with Vidur-Search allows for an immediate value-add
- Current limitations
- Hardware diversity (or lack thereof)