Measuring quality with perplexity
Speed is easy to measure. Quality — how much you give up
when you run a model at int4 instead of bf16 —
is not. millrace measures it with perplexity, end to end,
through the same serving path a request takes.
What perplexity tells you
Perplexity (PPL) is how surprised a model is by real text it didn't write. Feed it a held-out passage and, at every position, ask how much probability it put on the token that actually came next. Average that surprise and exponentiate it once:
PPL = exp( − mean log P(token | preceding tokens) )
Lower is better. A model that confidently predicts the real next token scores near the low single digits; one whose distribution is mushy or wrong scores high. Crucially, PPL is sensitive to the whole probability distribution, not just the top choice, so it catches quality loss that a greedy "does it still answer correctly?" check sails right past. That makes it the right tool for the question we actually care about: how much does int4 cost?
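As a minimal sketch of the formula, here is the computation in Python. This is illustrative only, not millrace's own code; the helper name `perplexity` is hypothetical, and the log-probabilities are assumed to be natural logs, one per scored token.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities, one per scored token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that spreads probability evenly over 4 candidates at every step
# is exactly as uncertain as a fair 4-way guess.
uniform = [math.log(1 / 4)] * 10
# A model that puts 0.9 on the true next token at every step.
confident = [math.log(0.9)] * 10

print(perplexity(uniform))
print(perplexity(confident))
```

The uniform case comes out at about 4 and the confident case at about 1.11, which is one way to read a perplexity: roughly the number of equally likely choices the model is hesitating between at each step.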
How it's measured
The engine already speaks an OpenAI-compatible API, so scoring rides on it
directly. A /v1/completions request with
echo + logprobs runs a single teacher-forced
forward over the input and returns the log-probability of every token
given its context — no text generated, just the model's own confidence
read back out. A small GPU kernel gathers the per-token log-probs in one
pass, so the full tokens × vocab logit tensor never leaves the
GPU.
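A rough sketch of what such a scoring call could look like from the client side, in Python. The field names (`prompt`, `max_tokens`, `echo`, `logprobs`, and `choices[0].logprobs.token_logprobs` in the reply) follow the OpenAI legacy completions API; that millrace accepts exactly these values, and that it reports the first token's log-prob as null, are assumptions. The helper names are hypothetical.

```python
import json


def build_scoring_request(model, text):
    """Request body for a score-only call: echo the prompt, generate nothing."""
    return {
        "model": model,
        "prompt": text,
        "max_tokens": 0,   # assumption: 0 means "score only, no generation"
        "echo": True,      # return the prompt tokens in the response
        "logprobs": 1,     # assumption: any value that enables per-token log-probs
    }


def prompt_logprobs(response):
    """Per-token log-probs of the echoed prompt.

    The first token has no preceding context, so OpenAI-style servers report
    it as null; it is dropped here.
    """
    values = response["choices"][0]["logprobs"]["token_logprobs"]
    return [v for v in values if v is not None]


# Canned response in the assumed shape, standing in for a real server reply.
example_response = json.loads("""
{"choices": [{"logprobs": {
    "tokens": ["The", " cat", " sat"],
    "token_logprobs": [null, -2.5, -0.75]
}}]}
""")

body = json.dumps(build_scoring_request("Qwen3-8B", "The cat sat"))
print(body)
print(prompt_logprobs(example_response))
```

The body would be POSTed to `/v1/completions`; the canned response keeps the example runnable without a server.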
- Corpus. A held-out public-domain English text, split into non-overlapping windows. The same corpus and windowing are used for every model, so comparisons are apples-to-apples.
- Aggregate once. Negative log-likelihood is summed across the whole corpus and exponentiated a single time — not averaged per-window — so long and short windows are weighted by their token count.
- The real path. Scoring uses the production tokenizer, forward pass, and KV session the server uses to answer chat — so the number reflects what you'd actually deploy, and the measurement doubles as a serving-stack regression test.
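The windowing and aggregation rules above can be sketched as follows, again in illustrative Python with hypothetical helper names. The window size, the handling of a shorter final window, and the log-prob values are made up for the example.

```python
import math


def split_windows(tokens, size):
    """Non-overlapping windows; the last one may be shorter (an assumption)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def corpus_perplexity(window_logprobs):
    """Sum NLL over every window, divide by total tokens, exponentiate once."""
    total_nll = 0.0
    total_tokens = 0
    for logprobs in window_logprobs:
        total_nll -= sum(logprobs)
        total_tokens += len(logprobs)
    return math.exp(total_nll / total_tokens)


def mean_of_window_perplexities(window_logprobs):
    """The alternative the text rejects: every window counts equally."""
    per_window = [math.exp(-sum(lp) / len(lp)) for lp in window_logprobs]
    return sum(per_window) / len(per_window)


# Made-up log-probs: one short, hard window and one longer, easier window.
scored = [[-2.0], [-1.0, -1.0, -1.0]]

print(split_windows(list(range(7)), 3))
print(corpus_perplexity(scored))
print(mean_of_window_perplexities(scored))
```

With these numbers the corpus-level figure is about 3.49, while averaging per-window perplexities gives about 5.05: the single hard token in the short window would count as much as the three tokens in the long one.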
One subtlety: the score is relative. Each model is read through its own tokenizer, so a model is strictly comparable to itself (int4 vs bf16). Across different model families the numbers are a guide, not a photo-finish ranking.
The harness
The whole loop is a small pure-Mojo tool, a sibling to the engine. It downloads the corpus over flare's TLS client, then for each model: launches the server on that checkpoint (one model resident at a time — the GPU holds one), waits for the endpoint, scores perplexity, and tears the process group down before the next. No Python on the path.
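The shape of that loop, sketched in Python rather than Mojo purely for illustration. Every name here is hypothetical: `server_cmd` stands in for whatever command line launches the server on a checkpoint, and `wait_ready` and `score` stand in for the readiness check and the perplexity scoring. Process-group handling assumes a POSIX system.

```python
import os
import signal
import subprocess
import time
import urllib.error
import urllib.request


def wait_for_endpoint(url, timeout_s=120.0, poll_s=0.5):
    """Poll `url` until the server answers at all, or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=2.0).close()
            return True
        except urllib.error.HTTPError:
            return True  # the server is up, even if this path returns an error
        except OSError:
            time.sleep(poll_s)  # connection refused or timed out: not up yet
    return False


def score_models(models, server_cmd, wait_ready, score):
    """Launch, wait, score, tear down; one model resident at a time."""
    results = {}
    for model in models:
        # start_new_session puts the server in its own process group (POSIX).
        proc = subprocess.Popen(server_cmd(model), start_new_session=True)
        try:
            if not wait_ready():
                raise RuntimeError(f"server for {model} never became ready")
            results[model] = score(model)
        finally:
            # Signal the whole group so helper processes do not outlive the run.
            try:
                os.killpg(proc.pid, signal.SIGTERM)
            except ProcessLookupError:
                pass
            proc.wait()
    return results
```

Tearing down in a `finally` block means a failed score still frees the GPU before the next model is launched.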
pixi run assay # download corpus, score every model
pixi run -- ./build/assay --only Qwen3-8B
What int4 costs
Perplexity on the held-out corpus, measured through the millrace serving
path. The bf16 reference exists where the unquantized model
fits the GPU; the gap to it is the price of int4.
| chat model | engine | perplexity |
|---|---|---|
| Qwen2.5-3B | bf16 (reference) | 9.3 |
| Qwen2.5-3B | millrace int4 | 13.6 |
| Qwen3-8B | millrace int4 | 11.6 |
| Qwen3-14B | millrace int4 | 8.9 |
Two things fall out. First, group-128 int4 costs Qwen2.5-3B
about +46% perplexity versus bf16 (13.6 vs 9.3): real,
and worth knowing when you trade it for the memory. Second, capacity still
wins: at the same int4, the bigger models score better
(Qwen3-14B 8.9 < Qwen3-8B 11.6 < Qwen2.5-3B 13.6),
as size predicts. The 3B is from the Qwen2.5 family, so only the 14B-versus-8B step is a strict within-family comparison; the ordering is still a sanity check that the quantized models are
behaving like models, not like noise.
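The Qwen2.5-3B gap can be re-derived from the two table entries. Because perplexity is the exponential of the mean negative log-likelihood, the log of the ratio gives the extra loss per token directly. A small Python check (the variable names are just for illustration):

```python
import math

bf16_ppl = 9.3   # Qwen2.5-3B, bf16 reference (from the table)
int4_ppl = 13.6  # Qwen2.5-3B, millrace int4 (from the table)

# Relative increase in perplexity.
relative_increase = int4_ppl / bf16_ppl - 1.0
# Same gap expressed as extra mean negative log-likelihood, in nats per token.
extra_nats_per_token = math.log(int4_ppl / bf16_ppl)

print(f"{relative_increase:.1%}")
print(f"{extra_nats_per_token:.3f}")
```

This prints 46.2% and 0.380, so the int4 model pays about 0.38 nats more per token than the bf16 reference on this corpus.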
Absolute numbers run higher than the familiar single-digit WikiText figures because the windows are short and start cold, so no token is scored with a long-running context behind it; the deltas are the signal. Quality validation of the Gemma 4 models is ongoing and not yet listed here.