Inference Benchmark

Qwen/Qwen3-Coder-30B-A3B-Instruct on NVIDIA H100 PCIe

Model
Qwen/Qwen3-Coder-30B-A3B-Instruct
GPU
NVIDIA H100 PCIe
Scenarios
Chatbot, RAG / QA, Agentic, Tool Calling Agentic

Metric Glossary

Quick definitions for the metrics used throughout this report. All timing metrics are measured on a live streaming endpoint.

Context Length
Input size
The number of tokens provided as input (prompt + chat history). Longer context increases prefill cost and often increases TTFT.
Prefill
Compute phase
The "prompt processing" phase where the model ingests the full context and builds KV cache. Prefill cost scales roughly with context length.
Decode
Compute phase
The "generation" phase where the model produces new output tokens after the first token. Decode speed is usually reported as tok/s.
TTFT
Latency
Time To First Token: time from request start to the first streamed token. Dominated by prefill + scheduling/queueing overhead.
ITL
Latency
Inter-Token Latency: average time between successive streamed tokens during decode. Often shown in ms/token. Lower feels more "snappy".
Decode Speed
Throughput
Output tokens per second. Per-user = tokens/sec for one request stream, system = sum across all concurrent streams.
Per-User vs System
Interpretation
Per-user shows the experience of a single client (speed/latency). System total shows total server capacity across concurrent requests.
Scaling Efficiency
Concurrency
How close the system is to perfect scaling as concurrency increases: system_throughput / (bs1_throughput x batch_size). 100% = no loss.
E2E Latency
Latency
End-to-end time for the request to finish streaming: TTFT + decode_time. This is what a user feels for "full completion time".
Batch Size / Concurrency
Load
Number of concurrent requests in flight. Higher concurrency typically improves system tok/s but reduces per-user tok/s.
Rule of thumb: TTFT is mostly about prefill + queueing; ITL is mostly about decode smoothness. "Per-user" metrics reflect UX; "system" metrics reflect capacity.
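The definitions above can be made concrete with a small helper that derives every timing metric from a single token-arrival trace. This is a sketch, not the benchmark's actual code, and the trace below is synthetic:

```python
def stream_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive the glossary metrics from one streamed request.

    request_start: wall-clock time the request was sent.
    token_times:   wall-clock arrival time of each streamed token.
    """
    ttft = token_times[0] - request_start            # Time To First Token
    decode_time = token_times[-1] - token_times[0]   # first token -> last token
    n_decoded = len(token_times) - 1                 # tokens produced after the first
    itl_ms = decode_time / n_decoded * 1000          # Inter-Token Latency (ms/token)
    decode_tok_s = n_decoded / decode_time           # per-user decode speed
    e2e = ttft + decode_time                         # end-to-end latency
    return {"ttft_s": ttft, "itl_ms": itl_ms,
            "decode_tok_s": decode_tok_s, "e2e_s": e2e}

# Synthetic trace: first token 30 ms after request start,
# then one token every 6.5 ms (129 tokens total).
start = 100.0
times = [start + 0.030 + i * 0.0065 for i in range(129)]
m = stream_metrics(start, times)
```

Per-user decode speed and ITL are reciprocals of each other, which is why the report can present either one for the same underlying measurement.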
Peak Decode Speed
152.9 tok/s
Single user (batch size 1)
Max System Throughput
2961 tok/s
Decode at batch size 24
Best TTFT
0.03 s
1024 in, single user
Scaling Efficiency
39% @ bs=32
vs ideal linear scaling

Performance Charts

Performance across context lengths at different concurrency levels. The top edge of each band shows single-user performance; the bottom edge shows performance at maximum tested concurrency.

Per-User Throughput Range

Per-user generation speed across context lengths. Band shows range from single-user to max concurrency.
Per-user throughput range chart
Per-user decode throughput (tok/s) from single-user to 32-user concurrency across 4 context-length scenarios.
Peak per-user decode speed is 152.9 tok/s (single user, Chatbot). At maximum concurrency (32 users), per-user speed ranges from 59.2 to 81.7 tok/s across context lengths — the shaded band shows how much individual throughput degrades under load.

Time to First Token Range

TTFT across context lengths. Lower is better for responsive user experience.
Time to first token range chart
Time to first token (seconds) from single-user to 32-user concurrency across context lengths.
Best TTFT is 29ms (Chatbot, 2 concurrent); the worst case is 0.97s (Tool Calling Agentic, 32 concurrent). Best-case TTFT is well under 100ms. Industry references: ~50ms at 1K context, ~2s at 32K, ~18s at 128K on top hardware.

System Throughput

Aggregate decode throughput (tok/s) across all concurrent requests at each context length.
System throughput chart
Aggregate system decode throughput (tok/s) at each concurrency level across 4 context-length scenarios.
Condition | Peak System Throughput (tok/s) | Peak Per-User (tok/s) | Tokens/Hour
Single user | 153 | 152.9 | 550,800
Mid concurrency (8 reqs) | 1,121 | 140.2 | 4,035,600
Max concurrency (32 reqs) | 2,614 | 81.7 | 9,410,400
At peak throughput (24 concurrent requests, Chatbot), this configuration produces approximately 10.7 million tokens per hour. Under the heaviest load, per-user decode speed drops as low as 38.8 tok/s (Tool Calling Agentic, 24 concurrent). Single-user performance reaches 152.9 tok/s, and scaling efficiency at 32x concurrency is 39%.
Peak system throughput is 2961.0 tok/s (Chatbot, 24 concurrent users), equivalent to ~10,659,600 tokens/hour. Higher concurrency increases total throughput as the GPU processes more requests in parallel, even though per-user speed decreases.
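Scaling efficiency and tokens-per-hour follow directly from the glossary formulas. A quick check using the Chatbot figures from this report (152.9 tok/s single-user, 59.2 tok/s per user at 32 concurrent) and the max-concurrency row of the table above:

```python
def scaling_efficiency(per_user_tok_s: float, bs1_tok_s: float) -> float:
    """system / (bs1 * batch) reduces to per_user / bs1,
    since system throughput = per_user * batch."""
    return per_user_tok_s / bs1_tok_s * 100

def tokens_per_hour(system_tok_s: float) -> float:
    """Aggregate decode throughput converted to an hourly budget."""
    return system_tok_s * 3600

eff_32 = scaling_efficiency(59.2, 152.9)  # Chatbot at 32 concurrent users
tph_max = tokens_per_hour(2614)           # max-concurrency row of the table
```

Both results line up with the report: roughly 39% efficiency at 32x concurrency and about 9.4M tokens/hour at 2,614 tok/s.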

Technical Analysis

Deep dive into per-user metrics across context lengths at different concurrency levels.

What good looks like

Inter-Token Latency: <10ms excellent, <20ms good, >70ms poor. Under 20ms feels instantaneous to users.
Per-User Decode: >150 tok/s excellent (short ctx), >80 tok/s good (long ctx). Below 30 tok/s noticeably slow.
Scaling Efficiency: >90% excellent, >70% good, <50% poor. Measures how well throughput scales with concurrency.
TTFT: <100ms excellent (short ctx), <2s good (32K ctx), <20s acceptable (128K ctx). Grows with context length.

Inter-Token Latency

Average inter-token latency across context lengths at each concurrency level. Lower is better — under 20ms feels instantaneous.
Inter-token latency chart
Average inter-token latency (ms) derived from per-user decode speed. The 20ms threshold marks the boundary of perceptually instant streaming.
Best inter-token latency is 6.5ms (Chatbot, single user), rising to 25.8ms under maximum load (Tool Calling Agentic, 24 concurrent). The single-user figure is well under the 20ms threshold where streaming feels instantaneous; under the heaviest load the system slips just above it.
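ITL and per-user decode speed are two views of the same quantity: average ITL in milliseconds is simply 1000 divided by tokens per second. A quick check against the figures in this section:

```python
def itl_ms_from_decode(tok_s: float) -> float:
    """Average inter-token latency (ms) implied by a per-user decode speed."""
    return 1000.0 / tok_s

single_user = itl_ms_from_decode(152.9)  # Chatbot, single user
max_load = itl_ms_from_decode(38.8)      # Tool Calling Agentic, 24 concurrent
```

The conversion reproduces the 6.5ms and 25.8ms values quoted above, which is why the chart can be "derived from per-user decode speed".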

Per-User Decode Speed

Per-user decode speed across context lengths at each concurrency level. Higher is better.
Per-user decode speed chart
Per-user decode speed (tok/s) at each concurrency level. Higher is better for interactive use.
Per-user decode speed ranges from 152.9 tok/s (Chatbot, single user) down to 38.8 tok/s (Tool Calling Agentic, 24 concurrent). Peak speed matches top-tier single-user performance.

Scaling Efficiency

How well throughput scales with concurrency. 100% means no per-user speed loss when adding users.
Scaling efficiency chart
Scaling efficiency (%) at each concurrency level relative to ideal linear scaling.
At 32x concurrency, scaling efficiency ranges from 39% (Chatbot) to 54% (RAG / QA). Values above 90% are considered excellent; below 50% indicates severe resource contention.

Capacity Analysis

How many concurrent requests can the system handle before quality degrades below acceptable thresholds? Each scenario shows measured data at tested concurrency levels.
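The capacity question can be framed as: given measured (concurrency, ITL, TTFT) points, supported capacity is the highest tested concurrency at which both metrics stay within thresholds. A sketch, with hypothetical measurements (not this report's data) and threshold values borrowed from the "What good looks like" section:

```python
def max_supported_concurrency(points, itl_limit_ms: float, ttft_limit_s: float) -> int:
    """points: list of (concurrency, itl_ms, ttft_s) measured on the endpoint.
    Returns the highest tested concurrency meeting both quality thresholds,
    or 0 if none does."""
    ok = [c for c, itl, ttft in points
          if itl <= itl_limit_ms and ttft <= ttft_limit_s]
    return max(ok, default=0)

# Hypothetical measurements shaped like a long-context scenario:
measured = [(1, 8.0, 0.3), (4, 15.0, 0.5), (8, 30.0, 0.8),
            (16, 80.0, 1.5), (32, 120.0, 3.0)]
cap = max_supported_concurrency(measured, itl_limit_ms=70, ttft_limit_s=2.0)
```

Here the 16-user point already exceeds the 70ms ITL limit, so the reported capacity would be 8 users, mirroring how the Tool Calling Agentic scenario caps out below the maximum tested concurrency.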

Chatbot Capacity
32 users
1K in / 128 out
RAG / QA Capacity
32 users
4K in / 512 out
Agentic Capacity
32 users
16K in / 2K out
Tool Calling Agentic Capacity
8 users
32K in / 4K out

Chatbot

Context length: 1,024 input + 128 output tokens.
Chatbot capacity chart
ITL (left axis)
TTFT (right axis)
Quality thresholds
This scenario stays within quality thresholds at every tested level, up to the maximum tested concurrency of 32 requests; higher concurrency was not tested.

RAG / QA

Context length: 4,096 input + 512 output tokens.
RAG / QA capacity chart
ITL (left axis)
TTFT (right axis)
Quality thresholds
This scenario stays within quality thresholds at every tested level, up to the maximum tested concurrency of 32 requests; higher concurrency was not tested.

Agentic

Context length: 16,384 input + 2,048 output tokens.
Agentic capacity chart
ITL (left axis)
TTFT (right axis)
Quality thresholds
This scenario stays within quality thresholds at every tested level, up to the maximum tested concurrency of 32 requests; higher concurrency was not tested.

Tool Calling Agentic

Context length: 32,768 input + 4,096 output tokens.
Tool Calling Agentic capacity chart
ITL (left axis)
TTFT (right axis)
Quality thresholds
This scenario supports up to 8 concurrent users within quality thresholds. Beyond that, ITL or TTFT exceeds acceptable limits. The system was tested up to 32 concurrent requests.

Methodology

Benchmarks were run against a live endpoint using streaming completions. TTFT, inter-token latency, and decode speed are measured directly from the token stream. All capacity charts show only measured data points at tested concurrency levels.

TTFT (Time to First Token)
Measured from request start to first streamed token
Decode speed
Measured from token stream: output_tokens / decode_time
Scaling efficiency
actual_throughput / (bs1_throughput x batch_size) x 100
E2E latency
TTFT + decode_time (measured end-to-end)
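The measurement loop described above can be sketched as follows, with a simulated token stream standing in for the live endpoint (the generator is a stand-in, not the benchmark's actual client; timings are illustrative):

```python
import time

def fake_stream(n_tokens: int, ttft_s: float, itl_s: float):
    """Stand-in for a streaming completions endpoint."""
    time.sleep(ttft_s)            # prefill + scheduling delay
    for _ in range(n_tokens):
        yield "tok"
        time.sleep(itl_s)         # decode pacing between tokens

def measure(stream):
    """Time a token stream exactly as the methodology describes."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now           # first streamed token -> TTFT
        count += 1
    end = time.perf_counter()
    ttft = first - start
    decode_time = end - first
    return {
        "ttft_s": ttft,
        # decode speed: output_tokens / decode_time (tokens after the first)
        "decode_tok_s": (count - 1) / decode_time if decode_time > 0 else 0.0,
        "e2e_s": end - start,     # E2E latency = TTFT + decode_time
    }

m = measure(fake_stream(n_tokens=20, ttft_s=0.05, itl_s=0.005))
```

Against a real endpoint, the same `measure` loop would consume the server's streamed chunks instead of `fake_stream`; only the source of tokens changes, not the timing logic.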