How fast is this model when the GPU is fully dedicated to my requests? We run exactly N requests simultaneously (1, 2, 4, … up to 32) with no queuing — every request gets immediate GPU attention. This isolates the hardware’s raw capability from any scheduling overhead, giving you the best-case per-user experience at each concurrency level.
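The sweep described above can be sketched as a closed-slot loop: for each level N, launch exactly N requests at once and average their per-user decode speeds. This is a minimal sketch; `run_request` is a placeholder that would call the real model server (the token counts and timings here are simulated, not measured).

```python
from concurrent.futures import ThreadPoolExecutor

def run_request(_):
    """Placeholder for one streamed generation request.

    A real benchmark would call the model server here and return
    (generated_tokens, decode_seconds); we return fixed numbers.
    """
    return 256, 2.0  # 256 tokens in 2.0 s -> 128 tok/s per user

def sweep(concurrency_levels):
    """Run exactly N requests simultaneously (no queuing) for each N."""
    results = {}
    for n in concurrency_levels:
        with ThreadPoolExecutor(max_workers=n) as pool:
            outcomes = list(pool.map(run_request, range(n)))
        per_user_tps = [tokens / seconds for tokens, seconds in outcomes]
        results[n] = sum(per_user_tps) / n  # mean per-user decode speed
    return results

print(sweep([1, 2, 4, 8, 16, 32]))
```

Because every slot is occupied for the full run, no request ever waits in a queue, which is what isolates hardware capability from scheduler behavior.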
Decode Speed per Scenario
How many tokens per second each user receives as more users share the GPU. A single user gets the fastest decode, and speed drops as concurrency grows because the GPU’s memory bandwidth is split among them.

Per-user decode speed (tok/s) at each concurrent-user level. One line per context-length scenario.
Per-user decode speed ranges from 172.5 tok/s (Chatbot, single user) down to 13.2 tok/s (Tool Calling Agentic, 32 concurrent). Peak speed is excellent, matching top-tier single-user performance.
Inter-Token Latency per Scenario
The delay between consecutive tokens arriving in the stream — this is what determines whether the output feels “fluid” or “choppy” to the user. Under ~20ms feels instant, 20–70ms feels noticeably slower but acceptable, and above 70ms feels sluggish.

Average inter-token latency (ms) at each concurrent-user level. The 20ms threshold marks the boundary of perceptually instant streaming.
Best inter-token latency is 5.8ms (Chatbot, single user), rising to 75.8ms under maximum load (Tool Calling Agentic, 32 concurrent). This is excellent at low concurrency. For reference, top-tier H200 SXM setups achieve 4–8ms at short contexts.
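Inter-token latency and per-user decode speed are two views of the same measurement: ITL in milliseconds is 1000 divided by tok/s. A small sketch, with the perceptual thresholds taken from the paragraph above:

```python
def itl_ms(tokens_per_second: float) -> float:
    """Inter-token latency (ms) is the reciprocal of per-user decode speed."""
    return 1000.0 / tokens_per_second

def feel(itl: float) -> str:
    """Map ITL to the perceptual buckets described above."""
    if itl < 20:
        return "instant"
    if itl <= 70:
        return "acceptable"
    return "sluggish"

# Consistency check against this report's endpoints:
# 172.5 tok/s -> ~5.8 ms ("instant"); 13.2 tok/s -> ~75.8 ms ("sluggish").
```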
Scaling Efficiency
The GPU utilization trade-off. At 100% efficiency, doubling concurrency would double total throughput (every user gets the same speed as a single user). In practice, efficiency drops because memory bandwidth is shared.

Scaling efficiency (%) at each concurrent-user level relative to ideal linear scaling.
At 32 concurrent, scaling efficiency ranges from 41% (Chatbot) to 8% (Tool Calling Agentic). Scaling drops sharply at longer contexts; GPU memory bandwidth is the likely bottleneck.
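Since ideal linear scaling means N users each running at single-user speed, the N factors cancel and efficiency reduces to a simple ratio of per-user speeds. A sketch using the Chatbot numbers reported above:

```python
def scaling_efficiency(per_user_tps_at_n: float, single_user_tps: float) -> float:
    """Total throughput at N users relative to ideal linear scaling.

    Ideal = N * single_user_tps; actual = N * per_user_tps_at_n.
    The N factors cancel, leaving the per-user speed ratio.
    """
    return per_user_tps_at_n / single_user_tps

# Chatbot: 70.6 tok/s per user at 32 concurrent vs 172.5 tok/s single-user.
print(f"{scaling_efficiency(70.6, 172.5):.0%}")  # matches the ~41% above
```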
Per-User Throughput Range
The band between best-case (short context) and worst-case (long context) per-user decode speed. A narrow band means context length doesn’t matter much — users get a consistent experience.

Per-user decode speed range across 4 scenarios at each concurrent-user level.
Peak per-user decode speed is 172.5 tok/s (single user, Chatbot). At 32 concurrent, per-user speed ranges from 13.2 to 70.6 tok/s across context lengths.
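One way to quantify how wide the band is: the ratio between its best and worst edges. Using the 32-concurrent figures above:

```python
def band_spread(worst_tps: float, best_tps: float) -> float:
    """Ratio of best-case to worst-case per-user decode speed.

    Close to 1.0 means context length barely matters; large values
    mean long-context users get a much slower experience.
    """
    return best_tps / worst_tps

# At 32 concurrent: 70.6 / 13.2 -> roughly a 5x gap between the
# short-context and long-context scenarios in this report.
print(round(band_spread(13.2, 70.6), 1))
```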
TTFT Range
How long users wait before the first token appears. TTFT is dominated by the prefill phase — longer contexts mean longer prefill times, especially under concurrency.

TTFT range across 4 scenarios at each concurrent-user level.
Best TTFT is 211ms (Chatbot); the worst case is 6.18s (Tool Calling Agentic, 32 concurrent).
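Because prefill dominates, TTFT can be approximated as queue time plus prompt length divided by prefill throughput. This is a rough back-of-envelope model; the `prefill_tps` value below is a hypothetical illustration, not a figure measured in this report.

```python
def estimated_ttft_s(prompt_tokens: int, prefill_tps: float,
                     queue_s: float = 0.0) -> float:
    """Rough TTFT model: time to first token ~= queueing + prefill.

    prefill_tps (prompt tokens processed per second) is a hypothetical
    input; real values depend on GPU, batching, and context length.
    """
    return queue_s + prompt_tokens / prefill_tps

# Illustration: a 16k-token prompt at a hypothetical 10k tok/s prefill
# rate takes about 1.6 s before the first token appears.
print(estimated_ttft_s(16_000, 10_000.0))
```

The model also shows why long-context scenarios fare worst under concurrency: prompt length grows TTFT linearly, and shared prefill compute lowers the effective `prefill_tps` per request.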