deep dive

Measuring quality with perplexity

Speed is easy to measure. Quality — how much you give up when you run a model at int4 instead of bf16 — is not. millrace measures it with perplexity, end to end, through the same serving path a request takes.

What perplexity tells you

Perplexity (PPL) is how surprised a model is by real text it didn't write. Feed it a held-out passage and, at every position, ask how much probability it put on the token that actually came next. Average that surprise (the negative log-probability) over all tokens and exponentiate it once:

PPL = exp( − mean log P(token | preceding tokens) )

Lower is better. A model that confidently predicts the real next token scores in the low single digits; one whose distribution is mushy or wrong scores high. Crucially, PPL is sensitive to the whole probability distribution, not just the top choice, so it catches quality loss that a greedy "does it still answer correctly?" check sails right past. That makes it the right tool for the question we actually care about: how much does int4 cost?
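As a minimal sketch of the formula above (the function name and the example probabilities are illustrative, not taken from millrace), the snippet below computes perplexity from a list of natural-log token probabilities. The two example models both rank the correct token first, so a greedy check would call them equal, yet their perplexities differ.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability of the observed tokens."""
    if not token_logprobs:
        raise ValueError("need at least one scored token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical numbers: in both cases the true token is the model's top choice,
# but the second model spreads more probability onto the alternatives.
sharp = [math.log(0.9)] * 3   # 0.9 on the true token at every position
soft = [math.log(0.4)] * 3    # 0.4 on the true token at every position

print(f"sharp: {perplexity(sharp):.2f}")  # about 1.11
print(f"soft:  {perplexity(soft):.2f}")   # about 2.50
```

A constant probability p on the true token gives a perplexity of 1/p, which is a handy way to read the number: a PPL of 10 is roughly "as uncertain as picking among 10 equally likely tokens".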

How it's measured

The engine already speaks an OpenAI-compatible API, so scoring rides on it directly. A /v1/completions request with echo + logprobs runs a single teacher-forced forward over the input and returns the log-probability of every token given its context — no text generated, just the model's own confidence read back out. A small GPU kernel gathers the per-token log-probs in one pass, so the full tokens × vocab logit tensor never leaves the GPU.
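A scoring request of this kind might look like the sketch below. It assumes the server follows the legacy OpenAI completions conventions: `echo`, `logprobs` and `max_tokens` request fields, and a `choices[0].logprobs.token_logprobs` array whose first entry is null because the first token has no context. Whether millrace accepts `max_tokens: 0` or uses exactly these field names is an assumption; the helper names are hypothetical.

```python
import json
import urllib.request

def build_scoring_request(model, text):
    """Request body asking the server to score `text` rather than continue it."""
    return {
        "model": model,
        "prompt": text,
        "max_tokens": 0,   # assumption: generate nothing, only score the prompt
        "echo": True,      # return the prompt tokens themselves
        "logprobs": 1,     # include per-token log-probabilities
    }

def extract_token_logprobs(response):
    """Pull per-token log-probs out of an OpenAI-style completions response."""
    logprobs = response["choices"][0]["logprobs"]["token_logprobs"]
    # The first token has no preceding context, so its entry is null; skip it.
    return [lp for lp in logprobs if lp is not None]

def score_text(base_url, model, text):
    """POST one scoring request and return the scored token log-probs."""
    body = json.dumps(build_scoring_request(model, text)).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return extract_token_logprobs(json.load(resp))
```

Only `score_text` touches the network; the other two functions can be exercised against a saved response to check the parsing.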

Corpus. A held-out public-domain English text, split into non-overlapping windows. The same corpus and windowing are used for every model, so comparisons are apples-to-apples.
Aggregate once. Negative log-likelihood is summed across the whole corpus and exponentiated a single time — not averaged per-window — so long and short windows are weighted by their token count.
The real path. Scoring uses the production tokenizer, forward pass, and KV session the server uses to answer chat — so the number reflects what you'd actually deploy, and the measurement doubles as a serving-stack regression test.
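The windowing and aggregate-once rules can be sketched as follows. The function names and the toy log-probabilities are illustrative assumptions; the point is the contrast between summing negative log-likelihood over the whole corpus and averaging per-window perplexities.

```python
import math

def split_windows(token_ids, window):
    """Non-overlapping windows; the final window may be shorter."""
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]

def corpus_perplexity(window_logprobs):
    """Sum NLL and token counts across all windows, exponentiate once."""
    total_nll = 0.0
    total_tokens = 0
    for logprobs in window_logprobs:
        total_nll -= sum(logprobs)
        total_tokens += len(logprobs)
    return math.exp(total_nll / total_tokens)

def mean_of_window_perplexities(window_logprobs):
    """The alternative to avoid: every window counts equally, whatever its length."""
    ppls = [math.exp(-sum(lps) / len(lps)) for lps in window_logprobs]
    return sum(ppls) / len(ppls)

# Toy data: one long, easy window and one short, hard window.
windows = [[math.log(0.5)] * 8, [math.log(0.1)] * 2]
print(f"aggregate once:  {corpus_perplexity(windows):.2f}")            # about 2.76
print(f"per-window mean: {mean_of_window_perplexities(windows):.2f}")  # about 6.00
```

In the toy data the short window holds 2 of the 10 tokens, yet it drives half of the per-window mean. Aggregating once gives it the 20% weight its token count deserves.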

One subtlety: the score is relative. Each model is read through its own tokenizer, so the same text splits into a different number of tokens for each family, and a model is strictly comparable only to itself (int4 vs bf16). Across different model families the numbers are a guide, not a photo-finish ranking.

The harness

The whole loop is a small pure-Mojo tool, a sibling to the engine. It downloads the corpus over flare's TLS client, then for each model: launches the server on that checkpoint (one model resident at a time — the GPU holds one), waits for the endpoint, scores perplexity, and tears the process group down before the next. No Python on the path.
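The real tool is pure Mojo; purely as an illustration of the control flow, the loop can be sketched in Python. Everything here is hypothetical: the `server_cmd`, `probe` and `score` callables stand in for the actual server launch command, readiness check and perplexity scorer, and the timeouts are arbitrary. The sketch assumes a POSIX system, since it relies on process groups.

```python
import os
import signal
import subprocess
import time

def wait_until(probe, timeout_s=120.0, interval_s=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False

def stop_process_group(proc, grace_s=30.0):
    """SIGTERM the whole group, escalating to SIGKILL if it lingers."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
        proc.wait(timeout=grace_s)
    except ProcessLookupError:
        pass  # the group is already gone
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()

def score_each(models, server_cmd, probe, score):
    """One model resident at a time: launch, wait, score, tear down."""
    results = {}
    for model in models:
        # start_new_session=True makes the child its own process-group leader,
        # so its pid doubles as the group id used for teardown.
        proc = subprocess.Popen(server_cmd(model), start_new_session=True)
        try:
            if not wait_until(probe):
                raise RuntimeError(f"server for {model} never became ready")
            results[model] = score(model)
        finally:
            stop_process_group(proc)
    return results
```

Tearing down the process group rather than the single process matters when the server spawns helpers: the GPU has to be fully released before the next checkpoint loads.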

pixi run assay                # download corpus, score every model
pixi run -- ./build/assay --only Qwen3-8B

What int4 costs

Perplexity on the held-out corpus, measured through the millrace serving path. The bf16 reference exists where the unquantized model fits the GPU; the gap to it is the price of int4.

chat model	engine	perplexity
Qwen2.5-3B	bf16 (reference)	9.3
Qwen2.5-3B	millrace int4	13.6
Qwen3-8B	millrace int4	11.6
Qwen3-14B	millrace int4	8.9

Two things fall out. First, group-128 int4 costs Qwen2.5-3B about 46% higher perplexity than bf16 (13.6 vs 9.3): real, and worth knowing when you trade it for the memory. Second, capacity still wins: at the same int4, the bigger models score better (Qwen3-14B 8.9 < Qwen3-8B 11.6 < Qwen2.5-3B 13.6), exactly as size predicts. The 3B is a Qwen2.5 model, so only the 8B-to-14B step is a same-family comparison; the ordering as a whole is a sanity check that the quantized models are behaving like models, not like noise.
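The 46% figure is simple arithmetic on the two Qwen2.5-3B rows of the table. The sketch below (helper names are illustrative) reproduces it and also expresses the same gap as extra negative log-likelihood per token, which follows directly from PPL being the exponential of the mean NLL over the same tokenized corpus.

```python
import math

def relative_increase(quantized_ppl, reference_ppl):
    """Fractional perplexity increase of the quantized model over the reference."""
    return quantized_ppl / reference_ppl - 1.0

def nats_per_token_gap(quantized_ppl, reference_ppl):
    """Difference in mean NLL per token; valid when both use the same tokenizer and corpus."""
    return math.log(quantized_ppl) - math.log(reference_ppl)

# Qwen2.5-3B, int4 vs bf16, from the table above.
print(f"{relative_increase(13.6, 9.3):+.0%}")              # +46%
print(f"{nats_per_token_gap(13.6, 9.3):.2f} nats/token")   # 0.38 nats/token
```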

Absolute numbers run higher than the familiar single-digit WikiText figures because the windows are short and start cold, so each token is predicted from relatively little preceding context; the deltas are the signal. Quality validation of the Gemma 4 models is ongoing and not yet listed here.