My Notes

202502271027
Status: #idea
Tags:

Notes

Tunables

Attention

The optimal configuration is a function of the (model, trace) pair
A trace is a list of requests together with their arrival times
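
To make "trace" concrete, a minimal sketch of how one could be represented. The field names (`arrival_time_s`, `prompt_tokens`, `decode_tokens`) and the helper `mean_arrival_rate` are my own assumptions for illustration, not Vidur's actual trace format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    arrival_time_s: float  # when the request reaches the system (hypothetical field)
    prompt_tokens: int     # tokens processed during prefill (hypothetical field)
    decode_tokens: int     # tokens generated during decode (hypothetical field)


def mean_arrival_rate(trace: list[Request]) -> float:
    """Average requests per second over the time span covered by the trace."""
    if len(trace) < 2:
        raise ValueError("need at least two requests to measure a rate")
    times = [r.arrival_time_s for r in trace]
    span = max(times) - min(times)
    if span <= 0:
        raise ValueError("all requests arrive at the same time")
    return (len(trace) - 1) / span


# A toy three-request trace with made-up numbers.
trace = [
    Request(0.0, 512, 128),
    Request(0.5, 2048, 64),
    Request(1.0, 256, 256),
]
print(mean_arrival_rate(trace))  # 2.0
```

The same model can want a different configuration under a different trace (longer prompts, burstier arrivals), which is why the pair matters and not the model alone.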

Challenges

Vidur

Profiling

Info

The runtime of some operations depends on the total context length, while others depend only on the number of tokens in the current iteration

Attention

  • Only the attention kernel is dependent on request history
  • During decode, the MLP's compute depends only on the number of tokens in the iteration, not on their history
  • Hence, only the attention kernel needs to be profiled against request history.
    • Attention during decode is memory-bound, so modeling the size of the KV-cache is enough to determine the kernel runtime
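
A back-of-the-envelope sketch of the memory-bound argument above: if decode attention is limited by reading the KV-cache, its runtime is roughly cache bytes divided by memory bandwidth. This is a simple roofline-style lower bound of my own, not Vidur's estimator. The model dimensions and the bandwidth figure are made-up example values.

```python
def kv_cache_bytes(context_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of K and V stored for `context_tokens` tokens (fp16 by default)."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


def decode_attention_time_s(cache_bytes: int, mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on decode attention time if the kernel only streams the cache once."""
    return cache_bytes / mem_bandwidth_bytes_per_s


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16 values.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token

# 4096 tokens of total context, assumed 2e12 B/s of memory bandwidth.
cache = kv_cache_bytes(4096, num_layers=32, num_kv_heads=32, head_dim=128)
print(decode_attention_time_s(cache, 2e12))  # about 1.07e-3 seconds
```

The point is that the estimate depends on the total number of cached tokens, not on how those tokens are split across requests.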

Automatic Profiling for Different Parallelism Strategies

Info

Vidur works: the reported prediction error is about 9% at the request level.
Claims to be able to predict cluster-level metrics (PROOF?)

Vidur-Bench

Background

Model Parallelism

Tensor Parallelism

Pipeline Parallelism

LLM Scheduler Design

Configuration Space (for LLM inference)

How it actually works

Processing

Model onboarding (Section 4.2)

flowchart TD


1([Model specification]) -->|Generates| 2([Set of operations to profile])

3[Profiler]

2 --> 3
4([Runtime characteristics]) -->|Fed into| 3

5["Runtime estimator
(ML model)"]

3 -->|Profiled measurements| 5
5 -->|Outputs| 6(["*LUT*: operator -> runtime"])
Profiler
graph

0([Operators]) --- 1([Token-level]) & 2([Sequence-level]) & 3([Communication])
Token-level
Attention

This is how to obtain traces for different parallelism configs using a single GPU
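
My reading of the single-GPU idea, as a sketch: tensor parallelism shards the weight matrices, so the per-device operation is just a smaller matrix multiply that can be profiled on one GPU. Communication is profiled separately. The sketch assumes Megatron-style sharding of the MLP (intermediate dimension split across devices, hidden dimension unchanged). `MlpShape` and `per_device_mlp_shape` are hypothetical names, not part of Vidur.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MlpShape:
    hidden_dim: int        # model width, not sharded in this scheme
    intermediate_dim: int  # MLP inner width, split across tensor-parallel devices


def per_device_mlp_shape(full: MlpShape, tp_degree: int) -> MlpShape:
    """Shape of the MLP shard that one device computes under tensor parallelism."""
    if tp_degree < 1 or full.intermediate_dim % tp_degree != 0:
        raise ValueError("intermediate_dim must be divisible by tp_degree")
    return MlpShape(full.hidden_dim, full.intermediate_dim // tp_degree)


# Example dimensions chosen for illustration only.
full = MlpShape(hidden_dim=4096, intermediate_dim=11008)
for tp in (1, 2, 4):
    print(tp, per_device_mlp_shape(full, tp))
```

Profiling each of these shard shapes on one GPU gives the per-device compute time for every tensor-parallel degree without needing a multi-GPU machine.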

Sequence-level
Communication

Runtime Estimator (Section 4.4)
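
Per the onboarding flowchart, the estimator takes profiled measurements and outputs a lookup table from operator to runtime. A minimal sketch of that step follows. The flowchart says the estimator is an ML model. Here I swap in plain linear interpolation to keep the example dependency-free, so this shows the shape of the idea and not the actual method. Operator names and timings are made up.

```python
from bisect import bisect_left


def interpolate(points: list[tuple[float, float]], x: float) -> float:
    """Linear interpolation over (size, runtime) points sorted by size; clamps at the ends."""
    xs = [p[0] for p in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, x)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def build_lut(profiled: dict[str, list[tuple[float, float]]],
              query_sizes: list[int]) -> dict[tuple[str, int], float]:
    """Fill a (operator, size) -> runtime table from sparse profiled points."""
    return {
        (op, n): interpolate(sorted(points), n)
        for op, points in profiled.items()
        for n in query_sizes
    }


# Hypothetical measurements: (number of tokens, runtime in ms).
profiled = {"mlp_up_proj": [(128, 0.10), (256, 0.20), (512, 0.40)]}
lut = build_lut(profiled, [128, 192, 384])
print(lut[("mlp_up_proj", 192)])  # about 0.15
```

The useful property is that only a few input sizes need to be profiled; the table covers the sizes in between.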

Hierarchical Scheduler (Section 4.5)

graph TD

subgraph S1[Global Scheduler]

G0([Scheduling Policy])

end

S1 --- S2 --- S3

  

subgraph S2[Replica Scheduler]
direction LR

A([Product Space]) --> B(Memory Planner)

C([Parallelism]) --> B

B --> D([Mem Available for KV Cache])

B --> F([High-Level API Used to Implement Task/Batching Policy])

end

  

subgraph S3[Replica Stage]

end
Global scheduler
Replica scheduler
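
A rough sketch of what the memory planner in the diagram computes: given the parallelism configuration, how much device memory is left for the KV cache. It assumes weights split evenly across all tensor-parallel and pipeline-parallel devices, plus a fixed reserve for activations. Both are my simplifications, and the function name and numbers are hypothetical.

```python
GIB = 2**30


def kv_cache_budget_bytes(gpu_mem_bytes: int, model_weight_bytes: int,
                          tp_degree: int, pp_degree: int,
                          reserve_bytes: int = 0) -> int:
    """Memory left per device for the KV cache after weights and a fixed reserve."""
    devices = tp_degree * pp_degree
    # Ceiling division so the per-device weight share is never underestimated.
    weights_per_device = -(-model_weight_bytes // devices)
    return max(0, gpu_mem_bytes - weights_per_device - reserve_bytes)


# Example: 80 GiB device, 140 GiB of weights, TP=4, PP=1, 5 GiB reserved.
budget = kv_cache_budget_bytes(80 * GIB, 140 * GIB, tp_degree=4, pp_degree=1,
                               reserve_bytes=5 * GIB)
print(budget // GIB)  # 40
```

The budget then bounds how many tokens the batching policy can keep in flight on the replica.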
Replica stage scheduler

Vidur-Bench

Performance Metrics

Operator-level

Request-level

Replica-level

Hardware

Vidur-Search

for every setup in all setups {
	do binary search to find max QPS {
		condition: queue delay stays under 5 seconds
	}
}

SORT all setups by QPS per dollar

SELECT cheapest config that meets latency targets
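
The pseudocode above as a small runnable sketch. It assumes queue delay never decreases as QPS goes up, which is what makes binary search valid. The simulator is replaced by a toy delay function, and the setup names, capacities, and hourly costs are all invented for illustration.

```python
from typing import Callable


def max_qps(queue_delay_s: Callable[[float], float], lo: float = 0.0,
            hi: float = 64.0, delay_limit_s: float = 5.0, tol: float = 0.01) -> float:
    """Highest QPS (within `tol`) whose queue delay stays under the limit."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if queue_delay_s(mid) < delay_limit_s:
            lo = mid  # still within the delay limit, try a higher load
        else:
            hi = mid  # limit exceeded, back off
    return lo


def toy_delay(capacity_qps: float) -> Callable[[float], float]:
    """Stand-in for a simulator run: no delay up to capacity, then steep growth."""
    return lambda qps: max(0.0, 100.0 * (qps - capacity_qps))


# Hypothetical setups: name -> (delay function, cost in dollars per hour).
setups = {"A": (toy_delay(10.0), 4.0), "B": (toy_delay(16.0), 8.0)}

# Rank setups by QPS per dollar, best first.
ranked = sorted(
    ((name, max_qps(delay) / cost) for name, (delay, cost) in setups.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked[0][0])  # A
```

The final selection step (cheapest configuration that meets the latency targets) would filter `ranked` by whatever latency metrics the simulator reports; that part is left out here.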

Evaluation

Pareto Frontier Analysis

Note

A Pareto frontier is actually very simple: it is the set of configurations that no other configuration beats on every metric at once. See https://youtu.be/ELLHqHk32II?t=157 for a visual representation
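
A small sketch of computing a Pareto frontier over two metrics where lower is better for both. The choice of axes (cost and latency) and the sample points are assumptions for illustration.

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the non-dominated (cost, latency) points, sorted by cost.

    A point is dominated if another point is at least as good on both
    metrics and strictly better on at least one.
    """
    frontier = []
    best_latency = float("inf")
    # After sorting by cost, a point is on the frontier only if it improves
    # on the best latency seen among all cheaper (or equally cheap) points.
    for cost, latency in sorted(points):
        if latency < best_latency:
            frontier.append((cost, latency))
            best_latency = latency
    return frontier


configs = [(1, 9), (2, 7), (3, 8), (4, 3), (5, 4)]
print(pareto_frontier(configs))  # [(1, 9), (2, 7), (4, 3)]
```

Here (3, 8) drops out because (2, 7) is both cheaper and faster, and (5, 4) drops out because of (4, 3).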

Important Conclusions

Points to improve


References

  1. Benchmarking Book, https://benchmarking-book.com/
  2. SciencePlots
  3. VIDUR - A Large-Scale Simulation Framework for LLM Inference