Inference Benchmark

Qwen/Qwen3.5-35B-A3B-FP8 on 1xH100

Model: Qwen/Qwen3.5-35B-A3B-FP8
GPU: 1xH100
Architecture: 35B MoE, 3B active params, FP8

What Do the Scenarios Represent?

Each benchmark scenario simulates a different real-world usage pattern with distinct input and output token profiles.

Chatbot (1,024 in / 256 out)
A typical conversational turn: short context (prior messages plus a new prompt) and a moderate-length reply. This is the lightest workload and represents most chat-style applications. Prefill is fast, so TTFT is low and decode speed is at its peak.
RAG / QA (4,096 in / 256 out)
Retrieval-augmented generation: a user question plus several retrieved document chunks injected into the prompt. The 4x larger input means longer prefill and higher TTFT, while output stays short. This is the bread-and-butter pattern for knowledge-base assistants and search-grounded QA.
Agentic (16,384 in / 256 out)
An autonomous agent step: the prompt carries a long scratchpad of prior reasoning, tool outputs, and instructions. The heavy input context stresses prefill throughput and KV-cache memory. Decode output is still short (a single action or thought), but TTFT can spike significantly under concurrency.
Tool Calling Agentic (32,768 in / 256 out)
The heaviest scenario: a multi-step agent with full conversation history, tool schemas, and prior tool results packed into the context. At 32K input tokens the prefill phase dominates, TTFT is at its highest, and the GPU can serve far fewer concurrent requests before latency degrades.
Key insight: As input tokens grow from 1K to 32K, prefill cost rises dramatically: higher TTFT, more GPU memory consumed per request, and fewer concurrent users served at acceptable latency. Output length (256 tokens in every scenario) is held constant, so the differences you see are driven entirely by input context length.
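The per-request memory cost can be estimated from the model's attention geometry. The layer count, KV-head count, and head dimension below are illustrative placeholders (the exact Qwen3.5-35B-A3B-FP8 configuration is not given in this report), but the formula itself is the standard KV-cache sizing arithmetic:

```python
# Rough KV-cache size per request as input context grows.
# NOTE: NUM_LAYERS, NUM_KV_HEADS, and HEAD_DIM are illustrative
# placeholders, NOT the actual Qwen3.5-35B-A3B-FP8 config.
NUM_LAYERS = 48
NUM_KV_HEADS = 4      # GQA: far fewer KV heads than query heads
HEAD_DIM = 128
DTYPE_BYTES = 1       # FP8 KV cache

def kv_cache_mib(context_tokens: int) -> float:
    """Bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes * tokens."""
    b = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES * context_tokens
    return b / 2**20

for name, tokens in [("Chatbot", 1024), ("RAG / QA", 4096),
                     ("Agentic", 16384), ("Tool Calling Agentic", 32768)]:
    print(f"{name:>22}: {kv_cache_mib(tokens):8.1f} MiB per request")
```

Whatever the real constants are, the scaling is linear in context length: a 32K-token request holds 32x the KV cache of a 1K-token request, which is exactly why concurrency headroom shrinks in the heavier scenarios.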

Metric Glossary

Quick definitions for the metrics used throughout this report. All timing metrics are measured on a live streaming endpoint.

Context Length (input size): The number of tokens provided as input (prompt + chat history). Longer context increases prefill cost and often increases TTFT.
Prefill (compute phase): The "prompt processing" phase, where the model ingests the full context and builds the KV cache. Prefill cost scales roughly with context length.
Decode (compute phase): The "generation" phase, where the model produces new output tokens after the first. Decode speed is usually reported in tok/s.
TTFT (latency): Time To First Token: the time from request start to the first streamed token. Dominated by prefill plus scheduling/queueing overhead.
ITL (latency): Inter-Token Latency: the average time between successive streamed tokens during decode, often shown in ms/token. Lower values feel snappier.
Decode Speed (throughput): Output tokens per second. Per-user is the rate of a single request stream; system is the sum across all concurrent streams.
Per-User vs System (interpretation): Per-user metrics show the experience of a single client (speed/latency); system totals show overall server capacity across concurrent requests.
Scaling Efficiency (concurrency): How close the system is to perfect scaling as concurrency increases: system_throughput / (bs1_throughput x batch_size). 100% = no loss.
E2E Latency (latency): End-to-end time for the request to finish streaming: TTFT + decode time. This is the "full completion time" a user feels.
Batch Size / Concurrency (load): The number of concurrent requests in flight. Higher concurrency typically improves system tok/s but reduces per-user tok/s.
Rule of thumb: TTFT is mostly about prefill + queueing; ITL is mostly about decode smoothness. "Per-user" metrics reflect UX; "system" metrics reflect capacity.
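The glossary formulas can be checked directly against numbers reported later in this document (172.5 tok/s at batch size 1 and 70.6 tok/s per user at 32 concurrent, both for Chatbot):

```python
def scaling_efficiency(per_user_tps: float, bs1_tps: float, batch_size: int) -> float:
    """system_throughput / (bs1_throughput * batch_size), as a percentage."""
    system_tps = per_user_tps * batch_size
    return 100.0 * system_tps / (bs1_tps * batch_size)

def e2e_latency(ttft_s: float, output_tokens: int, per_user_tps: float) -> float:
    """TTFT plus decode time for the tokens after the first."""
    return ttft_s + (output_tokens - 1) / per_user_tps

# Chatbot at 32 concurrent users, numbers taken from this report:
eff = scaling_efficiency(per_user_tps=70.6, bs1_tps=172.5, batch_size=32)
print(f"scaling efficiency: {eff:.0f}%")   # 41%, matching the chart below
```

Note that when system throughput is derived from per-user throughput, batch_size cancels algebraically; it matters only when system throughput is measured directly at the server.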
Peak Decode Speed: 172.5 tok/s (Chatbot at batch_size=1)
Max System Throughput: 9774 tok/s (Chatbot at 256 users)
Best TTFT: 0.189 s (Chatbot at batch_size=1)
Max Safe Queue: 131 reqs (Chatbot, TTFT ≤ 3.0s)

When Am I the Fastest?

How fast is this model when the GPU is fully dedicated to my requests? We run exactly N requests simultaneously (1, 2, 4, … up to 32) with no queuing — every request gets immediate GPU attention. This isolates the hardware’s raw capability from any scheduling overhead, giving you the best-case per-user experience at each concurrency level.
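The measurement itself is straightforward: timestamp every streamed token and derive TTFT and ITL from the gaps. The sketch below runs against a simulated token stream so it is self-contained; a real harness would consume a streaming HTTP endpoint instead, with the same `measure` logic:

```python
import asyncio
import time

async def fake_stream(n_tokens: int, prefill_s: float, itl_s: float):
    """Stand-in for a streaming endpoint: a prefill delay, then steady decode."""
    await asyncio.sleep(prefill_s)          # simulated prefill phase
    for i in range(n_tokens):
        if i:
            await asyncio.sleep(itl_s)      # simulated inter-token gap
        yield f"tok{i}"

async def measure(stream):
    """Return (TTFT, average ITL) in seconds from token arrival timestamps."""
    t0 = time.perf_counter()
    stamps = []
    async for _ in stream:
        stamps.append(time.perf_counter())
    ttft = stamps[0] - t0
    itl = (stamps[-1] - stamps[0]) / (len(stamps) - 1)
    return ttft, itl

ttft, itl = asyncio.run(measure(fake_stream(50, prefill_s=0.2, itl_s=0.006)))
print(f"TTFT={ttft*1000:.0f}ms  ITL={itl*1000:.1f}ms  decode={1/itl:.0f} tok/s")
```

Running exactly N such coroutines concurrently (with no extra arrivals) reproduces the closed-loop condition described above.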

Decode Speed per Scenario

How many tokens per second each user receives as more users share the GPU. A single user gets the fastest decode, and speed drops as concurrency grows because the GPU’s memory bandwidth is split among them.
Per-user decode speed (tok/s) at each concurrent-user level. One line per context-length scenario.
Per-user decode speed ranges from 172.5 tok/s (Chatbot, single user) down to 13.2 tok/s (Tool Calling Agentic, 32 concurrent). Peak speed is excellent, matching top-tier single-user performance.

Inter-Token Latency per Scenario

The delay between consecutive tokens arriving in the stream — this is what determines whether the output feels “fluid” or “choppy” to the user. Under ~20ms feels instant, 20–70ms feels noticeably slower but acceptable, and above 70ms feels sluggish.
Average inter-token latency (ms) at each concurrent-user level. The 20ms threshold marks the boundary of perceptually instant streaming.
Best inter-token latency is 5.8ms (Chatbot, single user), rising to 75.8ms under maximum load (Tool Calling Agentic, 32 concurrent). This is excellent at low concurrency. For reference, top-tier H200 SXM setups achieve 4–8ms at short contexts.
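ITL and per-user decode speed are two views of the same quantity: tok/s ≈ 1000 / ITL_ms. The endpoints of this chart are consistent under that identity with the decode-speed chart above:

```python
def itl_ms_to_tps(itl_ms: float) -> float:
    """Per-user decode speed (tok/s) implied by average inter-token latency."""
    return 1000.0 / itl_ms

print(itl_ms_to_tps(5.8))   # ≈ 172.4 tok/s (Chatbot, single user)
print(itl_ms_to_tps(75.8))  # ≈ 13.2 tok/s (Tool Calling Agentic, 32 concurrent)
```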

Scaling Efficiency

The GPU utilization trade-off. At 100% efficiency, doubling concurrency would double total throughput (every user gets the same speed as a single user). In practice, efficiency drops because memory bandwidth is shared.
Scaling efficiency (%) at each concurrent-user level relative to ideal linear scaling.
At 32 concurrent, scaling efficiency ranges from 41% (Chatbot) to 8% (Tool Calling Agentic). Scaling drops significantly at longer contexts; GPU memory bandwidth is the likely bottleneck.

Per-User Throughput Range

The band between best-case (short context) and worst-case (long context) per-user decode speed. A narrow band means context length doesn’t matter much — users get a consistent experience.
Per-user decode speed range across 4 scenarios at each concurrent-user level.
Peak per-user decode speed is 172.5 tok/s (single user, Chatbot). At 32 concurrent, per-user speed ranges from 13.2 to 70.6 tok/s across context lengths.

TTFT Range

How long users wait before the first token appears. TTFT is dominated by the prefill phase — longer contexts mean longer prefill times, especially under concurrency.
TTFT range across 4 scenarios at each concurrent-user level.
Best TTFT is 211ms (Chatbot), worst case 6.18s (Tool Calling Agentic, 32 concurrent).

When to Scale Up

What do your users actually experience when requests pile up? Instead of holding a fixed number of requests on the GPU, we send all requests at once — mimicking a burst of simultaneous traffic (up to 256 concurrent). New requests must queue behind in-progress work, so TTFT climbs dramatically compared to peak conditions. Use these numbers for SLA planning and capacity decisions.

TTFT vs Queue Depth

The most important chart for production planning. TTFT explodes under queue saturation because arriving requests must wait while the GPU finishes earlier prefills and decodes. The dashed red line marks the 3.0s threshold beyond which users perceive the system as slow.
TTFT (seconds) at each queue depth, measured under simultaneous burst load.
TTFT under queue saturation: best case 0.19s (Chatbot, 1 user), worst case 8.25s (Chatbot, 256 queue depth). TTFT increases with queue depth because new prefills queue behind active decode operations. This is what users experience under production burst load.
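The "max safe queue depth" figures later in this report follow from this curve: find where TTFT crosses the 3.0s threshold. In the sketch below, only the 0.19s and 8.25s endpoints come from this report; the intermediate Chatbot points are illustrative stand-ins, but the interpolation method is the general one:

```python
def max_safe_queue(points, threshold_s=3.0):
    """Largest queue depth whose (linearly interpolated) TTFT stays <= threshold.
    points: ascending list of (queue_depth, ttft_seconds) measurements."""
    safe = points[0][0]
    for (d0, t0), (d1, t1) in zip(points, points[1:]):
        if t1 <= threshold_s:
            safe = d1               # whole segment is under the threshold
        else:
            if t0 < threshold_s:    # threshold crossed inside this segment
                safe = int(d0 + (threshold_s - t0) / (t1 - t0) * (d1 - d0))
            break
    return safe

# Endpoints from this report; intermediate points are ILLUSTRATIVE only.
chatbot = [(1, 0.19), (32, 0.8), (64, 1.6), (128, 2.9), (256, 8.25)]
print(max_safe_queue(chatbot))  # 130 with these stand-in points,
                                # close to the reported 131
```

Because TTFT grows super-linearly near saturation, linear interpolation between measured points slightly overestimates the safe depth; measuring at finer queue-depth steps near the threshold tightens the estimate.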

Inter-Token Latency vs Queue Depth

The streaming “smoothness” your users experience under queued traffic. Unlike peak conditions where ITL rises steadily with concurrency, queued ITL can plateau if the serving engine’s continuous batching effectively shares the GPU’s decode pipeline across all active sequences.
Average inter-token latency (ms) at each queue depth. The 70ms threshold marks the typical ceiling for MoE-class models.
Best inter-token latency is 5.8ms (Chatbot), rising to 38.9ms under maximum load (RAG / QA, 128 concurrent). The 70ms threshold reflects the memory-bandwidth ceiling for large MoE models.

Per-User Decode Speed vs Queue Depth

The per-scenario breakdown of decode speed under queued load. The key insight: decode speed (ITL) is more robust than TTFT when the system is heavily queued.
Per-user decode speed (tok/s) at each queue depth. One line per context-length scenario.
Per-user decode speed ranges from 172.4 tok/s (Chatbot) down to 25.7 tok/s (RAG / QA, 128 concurrent).
Max Safe Queue Depth (TTFT ≤ 3.0s)
Chatbot: 131 reqs
RAG / QA: 84 reqs
Agentic: 19 reqs
Tool Calling Agentic: 8 reqs