Quick definitions for the metrics used throughout this report. All timing metrics are measured on a live streaming endpoint.
Metrics covered: Context Length, Prefill, Decode, TTFT, ITL, Decode Speed, Per-User vs System, Scaling Efficiency, E2E Latency, Batch Size / Concurrency.

Performance across context lengths at different concurrency levels. The top edge of each band shows single-user performance; the bottom edge shows performance at maximum tested concurrency.



| Condition | Peak System Throughput (tok/s) | Peak Per-User (tok/s) | Tokens/Hour |
|---|---|---|---|
| Single user | 153 | 152.9 | 550,800 |
| Mid concurrency (8 reqs) | 1,121 | 140.2 | 4,035,600 |
| Max concurrency (32 reqs) | 2,614 | 81.7 | 9,410,400 |
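
The Tokens/Hour column is peak system throughput multiplied by 3,600 seconds. The Python sketch below reproduces that arithmetic from the table values. It also computes how much per-user speed is retained under load. That retention ratio (`per_user_retention`) is an illustrative definition assumed here, not necessarily the formula this report uses for Scaling Efficiency.

```python
# Values copied from the summary table: (concurrency, system tok/s, per-user tok/s).
ROWS = {
    "Single user": (1, 153, 152.9),
    "Mid concurrency (8 reqs)": (8, 1121, 140.2),
    "Max concurrency (32 reqs)": (32, 2614, 81.7),
}

SECONDS_PER_HOUR = 3600


def tokens_per_hour(system_tok_s: float) -> float:
    """Sustained hourly output if peak system throughput were held for an hour."""
    return system_tok_s * SECONDS_PER_HOUR


def per_user_retention(per_user_tok_s: float, single_user_tok_s: float) -> float:
    """Fraction of single-user speed each user still sees (assumed definition)."""
    return per_user_tok_s / single_user_tok_s


if __name__ == "__main__":
    baseline = ROWS["Single user"][2]
    for label, (concurrency, system, per_user) in ROWS.items():
        print(
            f"{label}: {tokens_per_hour(system):,.0f} tok/h, "
            f"{per_user_retention(per_user, baseline):.1%} of single-user speed"
        )
```

Under this definition, each user at 32 concurrent requests keeps a little over half of the single-user decode speed while system throughput rises roughly seventeenfold.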

Deep dive into per-user metrics across context lengths at different concurrency levels.



How many concurrent requests can the system handle before per-user performance degrades below acceptable thresholds? Each scenario shows measured data at tested concurrency levels.
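
One way to read such a capacity scenario is as a lookup over measured points: pick the highest tested concurrency whose per-user speed still meets a threshold. The sketch below does that with the peak per-user figures from the summary table. The 100 tok/s threshold and the function name `max_concurrency_meeting` are hypothetical, chosen only for illustration, and no interpolation is done between tested levels.

```python
from typing import Mapping, Optional

# Peak per-user decode speed (tok/s) at each tested concurrency level,
# copied from the summary table.
MEASURED_PER_USER = {1: 152.9, 8: 140.2, 32: 81.7}


def max_concurrency_meeting(
    measured: Mapping[int, float], min_tok_s: float
) -> Optional[int]:
    """Return the highest tested concurrency whose per-user speed is at least
    min_tok_s, or None if no tested level qualifies. Only measured points are
    considered; values between tested levels are not estimated."""
    passing = [level for level, speed in measured.items() if speed >= min_tok_s]
    return max(passing) if passing else None


if __name__ == "__main__":
    # Hypothetical threshold: each user should see at least 100 tok/s.
    print(max_concurrency_meeting(MEASURED_PER_USER, 100.0))
```

With a 100 tok/s floor, the answer from the tested levels is 8; the true limit lies somewhere between 8 and 32 and would need additional measurements to pin down.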




Benchmarks were run against a live endpoint using streaming completions. TTFT, inter-token latency, and decode speed are measured directly from the token stream. All capacity charts show only measured data points at tested concurrency levels.
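
As a rough illustration of measuring these metrics from a token stream, the sketch below derives them from the request start time and the arrival timestamp of each streamed token. The exact definitions are assumptions, not taken from this report: TTFT as the time to the first token, ITL as the mean gap between consecutive tokens, and decode speed as tokens after the first divided by the time between first and last token. The names `StreamMetrics` and `stream_metrics` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float        # time to first token
    mean_itl_s: float    # mean inter-token latency
    decode_tok_s: float  # decode speed, excluding the prefill wait
    e2e_s: float         # end-to-end latency


def stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Compute timing metrics from per-token arrival timestamps (seconds).

    Timestamps should come from a monotonic clock such as time.perf_counter(),
    recorded as each token arrives on the stream.
    """
    if not token_times:
        raise ValueError("no tokens were received")

    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = mean(gaps) if gaps else 0.0

    decode_window = token_times[-1] - token_times[0]
    decode_speed = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0

    return StreamMetrics(
        ttft_s=ttft,
        mean_itl_s=mean_itl,
        decode_tok_s=decode_speed,
        e2e_s=token_times[-1] - request_start,
    )


if __name__ == "__main__":
    # Synthetic example: first token after 0.5 s, then one token every 0.25 s.
    print(stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
```

Separating the first-token wait from the decode window keeps prefill time out of the decode speed figure, which matters more as context length grows.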