Review

202502271028
Status: #idea
Tags:

Critique

Presents a simulation-based approach to optimise the deployment of LLMs by reducing the cost of experimenting with different configurations
Combines
- operator-level profiling
- predictive modelling
- workload-aware scheduling
Achieves good performance predictions (under 9% error)
Also consists of:
- Vidur-Bench → A benchmark suite for LLM deployment
- Vidur-Search → Configuration optimiser
Orders of magnitude cheaper (for finding optimal config)

Decomposes model execution into token-level, sequence-level and communication operators
Profiles runtime characteristics, and uses ML to predict the unprofiled ones
- Uses random forest to interpolate between the profiled configurations
  - Why RF? → Fast enough and accurate enough
Approximates prefill attention costs using aggregated batch length
Models decode attention via KV-cache size
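
The two attention approximations above can be sketched roughly as follows. This is a minimal sketch under assumptions: prefill attention cost is taken to grow quadratically with prompt length, so a batch of prefills collapses into one equivalent length (my reading of "aggregated batch length"), and decode attention is taken to depend only on the total KV-cache tokens read. The function names are hypothetical and not taken from the paper's code.

```python
import math

def equivalent_prefill_length(prompt_lengths):
    """Collapse a batch of prefills into one equivalent prompt length.

    Assumption: prefill attention cost scales with length squared, so the
    batch cost sum(p_i^2) matches a single prefill of length sqrt(sum(p_i^2)).
    """
    return math.sqrt(sum(p * p for p in prompt_lengths))

def decode_kv_tokens(context_lengths):
    """Total KV-cache tokens read by one decode step across the batch.

    Assumption: decode attention is memory-bound, so its cost is modelled
    as a function of total KV-cache size rather than per-request lengths.
    """
    return sum(context_lengths)
```

For example, prefills of 3 and 4 tokens map to an equivalent length of 5, which would then be looked up in (or predicted from) the profiled single-prefill runtimes.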

Fidelity degrades for smaller models, which are bound by CPU overhead
Hardware heterogeneity
- Only focuses on a very limited set of GPUs (with successive micro-architectures)
- Would have been nice to see non-Nvidia GPUs as well as emerging accelerators (TPUs, Groq, Cerebras, Inferentia, etc.)
Would be nice if energy consumption was also tracked (for non-cloud workloads)

Would have loved to see the prediction on a real-world workload (even if the workload is internal to Microsoft)
- Would cement the relevance of simulation
- Would also like to know the existing internal processes for finding optimal configurations, and whether Vidur performs as well as (or better than) the methods used in prod
The graphs show no error bars, and the paper doesn’t say whether the experiments were run multiple times
- Since the approach relies on hardware profiling, it should report the mean over repeated runs and show that the deviation is low
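
As a rough illustration of what reporting repeated runs could look like, here is a minimal sketch using only the standard library. The latencies are made up, and `summarise_runs` is a hypothetical helper, not something from the paper.

```python
import statistics

def summarise_runs(latencies_ms):
    """Return mean, sample standard deviation and coefficient of variation."""
    mean = statistics.mean(latencies_ms)
    stdev = statistics.stdev(latencies_ms)  # needs at least two runs
    return mean, stdev, stdev / mean

runs = [10.0, 12.0, 11.0, 13.0, 9.0]  # made-up latencies from five repeats
mean, stdev, cv = summarise_runs(runs)
```

The standard deviation would give the error bars, and the coefficient of variation gives a unit-free way to check that the profiling is stable.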
Would be nice to see a deeper analysis of the relationship between certain SLOs and QPS per dollar
- Especially as they note the rapid change in QPS when SLO is only slightly changed
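
To make the metric concrete, here is a minimal sketch that assumes QPS per dollar means the highest QPS that still meets the SLO divided by the hourly cost of the GPUs (my reading, not necessarily the paper's exact definition). All numbers are invented, and `qps_per_dollar` is a hypothetical helper.

```python
def qps_per_dollar(max_qps, num_gpus, dollars_per_gpu_hour):
    """Sustained QPS divided by the hourly cost of the deployment."""
    return max_qps / (num_gpus * dollars_per_gpu_hour)

# Made-up capacities of one configuration under two latency SLOs.
capacity_by_slo = {0.20: 2.0, 0.25: 6.0}  # SLO in seconds -> max QPS
scores = {
    slo: qps_per_dollar(qps, num_gpus=4, dollars_per_gpu_hour=2.0)
    for slo, qps in capacity_by_slo.items()
}
```

In this invented case, relaxing the SLO by 50 ms triples the score, which is the kind of sensitivity a deeper analysis could quantify.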
It is unclear if the profiling is fixed-time or fixed-work
- Systems benchmarks should always be fixed-work^[1]
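
The distinction can be sketched as below. This is a minimal illustration, not the paper's profiling harness: `op` stands in for whatever operator is being profiled, and both function names are hypothetical.

```python
import time

def fixed_work(op, iterations):
    """Run op a fixed number of times; the measurement is elapsed time."""
    start = time.perf_counter()
    for _ in range(iterations):
        op()
    return time.perf_counter() - start

def fixed_time(op, budget_s):
    """Run op until a time budget expires; the measurement is the count."""
    deadline = time.perf_counter() + budget_s
    count = 0
    while time.perf_counter() < deadline:
        op()
        count += 1
    return count
```

With fixed work every run performs the same amount of computation, so elapsed times are directly comparable; with fixed time the amount of work itself varies between runs.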
Figures and graphs are inaccessible
- Poor choice of colours
- Lack of symbols is neither printer-friendly nor colour-blindness friendly
- Perhaps prefer bright+ieee from SciencePlots