Inference Benchmark

deepseek-ai/DeepSeek-R1-0528 on 8xH200

Model
deepseek-ai/DeepSeek-R1-0528
GPU
8xH200
Architecture
671B MoE, 37B active params

What Do the Scenarios Represent?

Each benchmark scenario simulates a different real-world usage pattern with distinct input and output token profiles.

Chatbot
1,024 in / 256 out
A typical conversational turn: short context (prior messages plus a new prompt) and a moderate-length reply. This is the lightest workload and represents most chat-style applications. Prefill is fast, so TTFT is low and decode speed is at its peak.
RAG / QA
4,096 in / 256 out
Retrieval-augmented generation: a user question plus several retrieved document chunks injected into the prompt. The 4x larger input means longer prefill and higher TTFT, while output stays short. This is the bread-and-butter pattern for knowledge-base assistants and search-grounded QA.
Agentic
16,384 in / 256 out
An autonomous agent step: the prompt carries a long scratchpad of prior reasoning, tool outputs, and instructions. The heavy input context stresses prefill throughput and KV-cache memory. Decode output is still short (a single action or thought), but TTFT can spike significantly under concurrency.
Tool Calling Agentic
32,768 in / 256 out
The heaviest scenario: a multi-step agent with full conversation history, tool schemas, and prior tool results packed into the context. At 32K input tokens the prefill phase dominates, TTFT is at its highest, and the GPU can serve far fewer concurrent requests before quality degrades.
Key insight: As input tokens grow from 1K to 32K, prefill cost rises dramatically. This means higher TTFT, more GPU memory consumed per request, and fewer concurrent users the system can handle at acceptable quality. Output tokens (256 across all scenarios) are held constant so the differences you see are driven entirely by input context length.
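The scenario profiles above can be compared directly. A minimal sketch, assuming prefill cost is roughly linear in input tokens (attention adds a superlinear term, but the linear term dominates at these lengths); the scenario names and token counts are taken from this report:

```python
# Relative prefill work per scenario, using the report's input-token profiles.
# Assumes cost ~ linear in input tokens (a simplification).
scenarios = {
    "Chatbot": 1024,
    "RAG / QA": 4096,
    "Agentic": 16384,
    "Tool Calling Agentic": 32768,
}

baseline = scenarios["Chatbot"]
for name, tokens in scenarios.items():
    # Prefill work relative to the Chatbot baseline.
    print(f"{name:22s} {tokens:6d} in  ~{tokens // baseline}x prefill work")
```

Under this assumption, Tool Calling Agentic pays roughly 32x the prefill cost of Chatbot per request, which is why its TTFT and concurrency headroom degrade the most.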

Metric Glossary

Quick definitions for the metrics used throughout this report. All timing metrics are measured on a live streaming endpoint.

Context Length
Input size
The number of tokens provided as input (prompt + chat history). Longer context increases prefill cost and often increases TTFT.
Prefill
Compute phase
The "prompt processing" phase where the model ingests the full context and builds KV cache. Prefill cost scales roughly with context length.
Decode
Compute phase
The "generation" phase where the model produces new output tokens after the first token. Decode speed is usually reported as tok/s.
TTFT
Latency
Time To First Token: time from request start to the first streamed token. Dominated by prefill + scheduling/queueing overhead.
ITL
Latency
Inter-Token Latency: average time between successive streamed tokens during decode, often reported in ms/token. Lower values feel snappier.
Decode Speed
Throughput
Output tokens per second. Per-user = tokens/sec for one request stream, system = sum across all concurrent streams.
Per-User vs System
Interpretation
Per-user shows the experience of a single client (speed/latency). System total shows total server capacity across concurrent requests.
Scaling Efficiency
Concurrency
How close the system is to perfect scaling as concurrency increases: system_throughput / (bs1_throughput x batch_size). 100% = no loss.
E2E Latency
Latency
End-to-end time for the request to finish streaming: TTFT + decode_time. This is the user's perceived time to full completion.
Batch Size / Concurrency
Load
Number of concurrent requests in flight. Higher concurrency typically improves system tok/s but reduces per-user tok/s.
Rule of thumb: TTFT is mostly about prefill + queueing; ITL is mostly about decode smoothness. "Per-user" metrics reflect UX; "system" metrics reflect capacity.
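The glossary metrics can all be derived from the wall-clock arrival times of streamed tokens. A minimal sketch (the helper name `stream_metrics` and the toy timing values are assumptions, not part of this report's harness):

```python
def stream_metrics(request_start, token_times):
    """Compute TTFT, mean ITL, decode speed, and E2E latency from the
    arrival timestamps (seconds) of streamed tokens."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)                        # mean inter-token latency (s)
    decode_time = token_times[-1] - token_times[0]
    decode_tps = (len(token_times) - 1) / decode_time  # tokens generated after the first
    e2e = token_times[-1] - request_start              # TTFT + decode_time
    return {"ttft": ttft, "itl": itl, "decode_tps": decode_tps, "e2e": e2e}

# Toy stream: first token at 0.3 s, then one token every 10.7 ms (257 tokens).
times = [0.3 + i * 0.0107 for i in range(257)]
m = stream_metrics(0.0, times)
```

With these toy numbers, decode speed works out to 1/0.0107 s ≈ 93.5 tok/s, which is in the same ballpark as the report's Chatbot single-user figure.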
Peak Decode Speed
93.3 tok/s
Chatbot at batch_size=1
Max System Throughput
1439 tok/s
RAG / QA at 32 users
Best TTFT
0.301 s
Chatbot at batch_size=1
Max Safe Queue
151 reqs
Chatbot — TTFT ≤ 3.0s

When Am I the Fastest?

How fast is this model when the GPU is fully dedicated to my requests? We run exactly N requests simultaneously (1, 2, 4, … up to 32) with no queuing — every request gets immediate GPU attention. This isolates the hardware’s raw capability from any scheduling overhead, giving you the best-case per-user experience at each concurrency level. Use these numbers to understand the ceiling of what’s possible before real-world traffic patterns come into play.
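A closed-loop measurement like the one described above can be sketched with asyncio: launch exactly N requests at once and wait for all of them, so no request ever queues. The request function here is a placeholder (its name, signature, and timing are assumptions; a real harness would stream from the serving endpoint):

```python
import asyncio
import time

async def one_request(prompt_tokens, output_tokens):
    """Placeholder for a single streaming request; replace the sleep with a
    real call to your serving endpoint (hypothetical stand-in)."""
    await asyncio.sleep(0.01)  # stand-in for network + generation time
    return {"ttft": 0.01, "tokens": output_tokens}

async def run_at_concurrency(n, prompt_tokens=1024, output_tokens=256):
    # Exactly n requests in flight, no queueing: isolates peak per-user
    # performance at this concurrency level.
    start = time.perf_counter()
    results = await asyncio.gather(
        *(one_request(prompt_tokens, output_tokens) for _ in range(n))
    )
    elapsed = time.perf_counter() - start
    system_tps = sum(r["tokens"] for r in results) / elapsed
    return results, system_tps

results, tps = asyncio.run(run_at_concurrency(4))
```

Sweeping n over 1, 2, 4, ... 32 and recording per-request TTFT/ITL alongside `system_tps` reproduces the shape of the charts in this section.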

Decode Speed per Scenario

How many tokens per second each user receives as more users share the GPU. A single user gets the fastest decode, and speed drops as concurrency grows because the GPU’s memory bandwidth is split among them.
Decode Speed per Scenario
Per-user decode speed (tok/s) at each concurrent-user level. One line per context-length scenario.
Per-user decode speed ranges from 93.3 tok/s (Chatbot, single user) down to 30.4 tok/s (Tool Calling Agentic, 32 concurrent).

Inter-Token Latency per Scenario

The delay between consecutive tokens arriving in the stream — this is what determines whether the output feels “fluid” or “choppy” to the user. Under ~30ms feels instant (like fast typing), 30–70ms feels noticeably slower but acceptable, and above 70ms feels sluggish.
Inter-Token Latency per Scenario
Average inter-token latency (ms) at each concurrent-user level. 70ms threshold marks MoE-class ceiling.
Best inter-token latency is 10.7ms (Chatbot, single user), rising to 32.9ms under maximum load (Tool Calling Agentic, 32 concurrent). The 70ms threshold reflects the memory-bandwidth ceiling for large MoE models.
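Inter-token latency and per-user decode speed are reciprocals, so the ITL and decode-speed charts can be cross-checked against each other. A small sketch using the report's endpoint values:

```python
def itl_ms_to_tps(itl_ms):
    """Convert inter-token latency (ms/token) to decode speed (tok/s)."""
    return 1000.0 / itl_ms

def tps_to_itl_ms(tps):
    """Convert decode speed (tok/s) to inter-token latency (ms/token)."""
    return 1000.0 / tps

# Cross-check: 10.7 ms/token ~ 93 tok/s (single user),
# and 30.4 tok/s ~ 32.9 ms/token (32 concurrent, Tool Calling Agentic).
single_user_tps = itl_ms_to_tps(10.7)
loaded_itl_ms = tps_to_itl_ms(30.4)
```

The 70ms ceiling therefore corresponds to a per-user decode floor of about 14 tok/s.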

Scaling Efficiency

The GPU utilization trade-off. At 100% efficiency, doubling concurrency would double total throughput (every user gets the same speed as a single user). In practice, efficiency drops because memory bandwidth is shared.
Scaling Efficiency
Scaling efficiency (%) at each concurrent-user level relative to ideal linear scaling.
At 32 concurrent, scaling efficiency ranges from 35% (Tool Calling Agentic) to 49% (RAG / QA). Values above 90% are excellent; below 50% indicates severe contention.
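The glossary's definition translates directly into code. A minimal sketch; the example mixes the report's headline numbers (93.3 tok/s single-user, 1439 tok/s system at 32 users) purely for illustration, since those two figures come from different scenarios:

```python
def scaling_efficiency(system_tps, bs1_tps, batch_size):
    """Efficiency relative to perfect linear scaling, per the glossary:
    system_throughput / (bs1_throughput * batch_size). 1.0 = no loss."""
    return system_tps / (bs1_tps * batch_size)

# Illustrative combination of the report's headline numbers (not a single
# measured scenario): ~48% efficiency at 32 concurrent users.
eff = scaling_efficiency(1439.0, 93.3, 32)
```

The result lands near the 35-49% band reported above, consistent with memory-bandwidth contention at high concurrency.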

Per-User Throughput Range

The band between best-case (short context) and worst-case (long context) per-user decode speed. A narrow band means context length doesn’t matter much — users get a consistent experience.
Per-User Throughput Range
Per-user decode speed range across scenarios at each concurrent-user level.
Peak per-user decode speed is 93.3 tok/s (single user, Chatbot). At 32 concurrent, per-user speed ranges from 30.4 to 45.0 tok/s.

TTFT Range

How long users wait before the first token appears. TTFT is dominated by the prefill phase — longer contexts mean longer prefill times, especially under concurrency.
TTFT Range
TTFT range across scenarios at each concurrent-user level.
Best TTFT is 301ms (Chatbot), worst case 7.74s (Tool Calling Agentic, 32 concurrent).

When to Scale Up

What do your users actually experience when requests pile up? Instead of holding a fixed number of requests on the GPU, we send all requests at once — mimicking a burst of simultaneous traffic. New requests must queue behind in-progress work, so TTFT climbs dramatically compared to peak conditions. The key insight: even moderate queue depths cause major delays for long-context scenarios because each prefill operation blocks the GPU for a significant duration. Use these numbers for SLA planning and capacity decisions.
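The queueing effect described above can be illustrated with a deliberately simple model: all requests arrive at t=0 and prefills run one at a time, so each request's TTFT includes every prefill ahead of it. Real engines batch and interleave prefills, so treat this strictly as an upper-bound sketch; the 0.5 s per-prefill figure is an assumption, not a measurement from this report:

```python
def burst_ttft(queue_depth, prefill_time_s):
    """Toy serial-prefill model: the i-th request's TTFT is the sum of all
    earlier prefills plus its own. An upper bound, not a measurement."""
    return [(i + 1) * prefill_time_s for i in range(queue_depth)]

# Illustrative: if one long-context prefill takes ~0.5 s (assumed), a burst
# of just 8 requests pushes the last TTFT past a 3 s SLA even though the
# first request is served quickly.
ttfts = burst_ttft(8, 0.5)
```

This is why the long-context scenarios hit their safe-queue limits at single-digit depths while Chatbot sustains over a hundred.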

TTFT vs Queue Depth

The most important chart for production planning. TTFT explodes under queue saturation because arriving requests must wait while the GPU finishes earlier prefills and decodes. The dashed red line marks the 3.0s threshold beyond which users perceive the system as slow.
TTFT vs Queue Depth
TTFT (seconds) at each queue depth, measured under simultaneous burst load.
TTFT under queue saturation: best case 0.04s (Chatbot), worst case 7.00s (Chatbot, 256 queue depth). TTFT increases with queue depth because new prefills queue behind active decode operations. This is what users experience under production burst load.

TTFT Range

The spread between best-case (short context) and worst-case (long context) TTFT under queue saturation. Unlike peak, these numbers include queue wait time on top of prefill compute.
Queue TTFT Range
TTFT range across scenarios at each queue depth.
Best TTFT is 0.04s (Chatbot), worst case 7.00s (Chatbot, 256 queue depth).

Per-User Throughput Range

Per-user decode speed under queued traffic. Compare to peak section — the gap reveals how much decode performance you lose to scheduling overhead. If the band stays flat, decode isn’t the bottleneck; prefill and queuing are.
Queue Per-User Throughput Range
Per-user decode speed range across scenarios at each queue depth.
Peak per-user decode speed is 93.5 tok/s (Chatbot). At 256 queue depth, per-user speed converges to 35.0 tok/s across scenarios.

Inter-Token Latency vs Queue Depth

The streaming “smoothness” your users experience under queued traffic. Unlike peak conditions where ITL rises steadily with concurrency, queued ITL can plateau if the serving engine’s continuous batching effectively shares the GPU’s decode pipeline across all active sequences.
Inter-Token Latency vs Queue Depth
Average inter-token latency (ms) at each queue depth. 70ms threshold marks MoE-class ceiling.
Best inter-token latency is 10.7ms (Chatbot), rising to 36.9ms under maximum load (RAG / QA, 128 queue depth). The 70ms threshold reflects the memory-bandwidth ceiling for large MoE models.

Per-User Decode Speed vs Queue Depth

The per-scenario breakdown of decode speed under queued load. While the throughput range chart above shows the envelope, this chart lets you compare individual context lengths side by side. The key insight: decode speed (ITL) is more robust than TTFT when the system is heavily queued.
Per-User Decode Speed vs Queue Depth
Per-user decode speed (tok/s) at each queue depth. One line per context-length scenario.
Per-user decode speed ranges from 93.5 tok/s (Chatbot) down to 27.1 tok/s (RAG / QA, 128 queue depth).
Max Safe Queue Depth
151 reqs
Chatbot — TTFT ≤ 3.0s
Max Safe Queue Depth
122 reqs
RAG / QA — TTFT ≤ 3.0s
Max Safe Queue Depth
15 reqs
Agentic — TTFT ≤ 3.0s
Max Safe Queue Depth
5 reqs
Tool Calling Agentic — TTFT ≤ 3.0s
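The "max safe queue depth" figures above can be derived from a measured TTFT-vs-depth curve by finding where it crosses the SLO. A minimal sketch with linear interpolation between samples; the toy curve below is illustrative, not this report's measured data:

```python
def max_safe_queue(depths, ttfts, slo_s=3.0):
    """Largest queue depth whose (linearly interpolated) TTFT stays within
    the SLO. depths must be sorted ascending."""
    best = depths[0] if ttfts[0] <= slo_s else 0
    for i in range(len(depths) - 1):
        d0, d1 = depths[i], depths[i + 1]
        t0, t1 = ttfts[i], ttfts[i + 1]
        if t1 <= slo_s:
            best = d1
        elif t0 <= slo_s < t1:
            # Interpolate the depth at which TTFT crosses the SLO.
            frac = (slo_s - t0) / (t1 - t0)
            best = int(d0 + frac * (d1 - d0))
    return best

# Toy curve: TTFT grows with depth and crosses the 3.0 s SLO between the
# 128 and 256 samples, yielding a safe depth between them.
depth = max_safe_queue([32, 64, 128, 256], [0.5, 1.0, 2.5, 7.0], slo_s=3.0)
```

Applying the same interpolation to each scenario's measured curve is how per-scenario limits like "151 reqs" and "5 reqs" are typically obtained.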