Methodology

Every number on this site comes from a reproducible automated benchmark. This page documents exactly what we measure, how we measure it, and what we do not claim. The same method is applied to every provider, so numbers are directly comparable across providers.

What one benchmark run looks like

The worker sends a single streaming chat-completion request to the provider's OpenAI-compatible API endpoint. The model is asked to write a 400-word prose explanation of HTTP request routing. max_tokens is capped at 300 — enough to produce a full streaming response without waiting for a very long generation, and consistent across every model so comparisons are fair. The prompt and cap are fixed; they never change between runs or providers.

TTFT — time to first token

TTFT (milliseconds) is measured from the moment the HTTP request is dispatched to the moment the first non-empty content chunk arrives in the stream. It captures network round-trip plus the provider's prompt-processing time. It does not include DNS or TLS handshake if a keep-alive connection is reused, but those costs are typical of real API usage.

TPS — tokens per second (hybrid measurement)

TPS measures generation throughput — how fast the model emits output tokens, excluding the initial wait (TTFT). We use a hybrid approach that picks the right method per response, because providers stream tokens very differently.

Inter-token timing (streamed delivery)

When a response is streamed smoothly — many small chunks, each carrying one or a few tokens — we measure the time between consecutive tokens and compute the rate directly. This is the decode-phase throughput the user actually experiences while the answer is streaming in.

The inter-token path is used when the response arrives as ≥ 8 chunks and the mean inter-token gap exceeds 50 ms. Below those thresholds the stream is effectively a burst and inter-token timing becomes noisy or meaningless, so we fall back to the wall-clock path.

Wall-clock fallback (burst / chunked delivery)

Some providers flush output in large bursts (for example when a response is generated server-side and then streamed out in a few big chunks, or when an upstream proxy buffers). For those responses we use a wall-clock rate over the whole generation:

TPS = output_tokens ÷ (end_time − first_token_time)

This path is taken when the response has fewer than 8 chunks or the mean inter-token gap is ≤ 50 ms. A run is discarded as malformed if no token count or generation time can be derived.

Why hybrid

A single method breaks for one of the two delivery styles: a naïve client-side stopwatch wildly inflates burst-streamed models (the whole answer arrives at once), while inter-token timing collapses for chunked delivery (few timestamps, large variance). The threshold (≥ 8 chunks AND > 50 ms) is conservative — it only trusts inter-token timing when there is enough signal — so burst models fall through to the wall-clock rate by default.

NVIDIA NIM alignment

The hybrid paths map cleanly onto the two throughput metrics NVIDIA documents for NIM benchmarking:

TPS-per-user = output_tokens ÷ e2e_latency — corresponds to our wall-clock path (whole-generation rate per request).
Decode TPS = 1 ÷ ITL (inter-token latency) — corresponds to our inter-token path (steady-state decode rate).

We report a single TPS number per run, selecting the path that fits the response's delivery shape. See NVIDIA's NIM benchmarking documentation for the underlying definitions.

Timeout

Each request has a hard timeout of 120 seconds. If no complete response arrives in that window the run is recorded as a timeout error and counted against reliability.

Error taxonomy

Every failed run is classified into one of six error kinds:

auth — HTTP 401 or 403 (bad or expired API key)
rate_limit — HTTP 429 (provider throttle)
server — HTTP 5xx (provider-side error)
timeout — no complete response within 120 s
network — TCP/fetch-level failure (ECONNREFUSED, etc.)
malformed — response arrived but was unparseable or too short

Failed runs are stored with ok = false and excluded from TPS and TTFT statistics. They are counted in the reliability percentage (success rate = successful runs ÷ total runs in the window).

Benchmark cadence

The worker runs benchmarks continuously using a round-robin priority queue. Each (provider, model) pair has a target interval of ~10 minutes. The scheduler always picks the most-overdue pair next, so the order naturally staggers across models without fixed cron slots.

Circuit breaker: if a model records 3 consecutive failures, it is dropped to 30-minute probe intervals until it recovers (a successful run resets the counter). This prevents a failing model from flooding the queue.

Rate-limit backoff: a 429 response pushes that provider's next benchmark slot back by 5 minutes, giving the provider time to recover without hammering a quota.

Data retention

Raw samples — kept for 60 days, then deleted.
Hourly rollups (avg/min/max/p50/p95 TPS, avg TTFT, success rate) — kept forever.
Daily rollups — kept forever.

Chart windows pick the right table automatically: 24 h and 7 d windows use raw samples; 30 d and 1 y windows use hourly or daily rollups.

Providers benchmarked

TokenDyno benchmarks three providers on the same engine, same prompt, same measurement method:

Ollama — hosted API. Full model catalog coverage on the premium plan.
OpenCode Zen — pay-per-use API (Zen endpoint). Benchmarked when an active key is configured.
OpenCode Go — monthly subscription plan (Go endpoint). Different model set and endpoint from Zen; benchmarked separately.

Provider pricing tiers can change the models available and the speed a given model runs at. If you are on a free or lower tier you may see different throughput. Our numbers are not a ceiling — they reflect our specific plan and the state of the provider's infrastructure at measurement time.

Sequential, not parallel

Benchmarks run sequentially — one request at a time, waiting for the full response before the next. This mirrors realistic single-client usage and avoids inflating TPS numbers by running requests in parallel (which would share provider capacity).

Why no concurrency / throughput metrics

We deliberately do not report concurrent-request throughput, requests per second, or aggregate bandwidth. TokenDyno answers the consumer question — "how fast is this model for me, on a single request?" — not the server-capacity question. This matches the approach taken by Artificial Analysis and OpenRouter, and the per-user metrics in NVIDIA NIM benchmarking. Concurrency numbers depend heavily on the provider's load balancing and quota shape, which a single external client cannot measure fairly.

What we do not claim

Absolute throughput — numbers depend on network path, time of day, and provider load. Treat them as relative indicators, not hardware specs.
Batch or parallel throughput — if your workload sends many concurrent requests, throughput per request will differ.
Internal SLA compliance — we measure from outside the provider's network; TTFT includes our egress latency.

Open questions and feedback

If you notice a measurement that looks wrong, or want to suggest an improvement to the methodology, open an issue or start a discussion in the project repository. Accuracy and transparency are the point.