The rules we apply on every batch

Suite: WebArena (public agent benchmark using a real Magento admin)
Tasks: A handful of lookup tasks per batch. A deliberate mix of easy and hard so model variants span the full pass-rate range.
Reps: 3 runs per task per model variant. Models are non-deterministic. One run isn’t enough.
Matched pair: Every model is run twice. Once with the codec on, once with it off. Relative claims are apples-to-apples.
Token accounting: Provider-native counts. What the model provider sends back with each response, and what they bill on. Not an estimator.
Cost basis: Public list price for the model used, as of the batch’s pricing date. Vertex for Gemini, Anthropic for Sonnet.
Codec cost: The cost of running the codec is added to the codec-on rows. It never disappears into the headline number.
Scoring: Benchmark’s built-in deterministic checker. No human judgment. No LLM-as-judge.
Privacy: Personal data is redacted client-side before snapshots leave the network. Private by default. A deliberate per-task opt-out is on the roadmap.
Publication: Every batch keeps its own page. Older claims stay verifiable when the codec ships a new version.
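
The token accounting, cost basis, and codec cost rules combine into one per-run figure. A minimal sketch of that arithmetic, assuming per-million-token list pricing: the `Pricing` class, the `run_cost` function, and every number below are placeholders for illustration, not our real prices or results.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    # Placeholder list prices in USD per million tokens (not real prices).
    usd_per_million_input: float
    usd_per_million_output: float


def run_cost(input_tokens: int, output_tokens: int, pricing: Pricing,
             codec_cost_usd: float = 0.0) -> float:
    """Cost of one run: provider-reported tokens at list price, plus codec cost."""
    model_cost = (
        input_tokens * pricing.usd_per_million_input
        + output_tokens * pricing.usd_per_million_output
    ) / 1_000_000
    # The codec's own cost is added on top, so codec-on rows carry it explicitly.
    return model_cost + codec_cost_usd


pricing = Pricing(usd_per_million_input=1.00, usd_per_million_output=4.00)

# Token counts would come from the provider's response, not from an estimator.
codec_off = run_cost(200_000, 2_000, pricing)
codec_on = run_cost(50_000, 2_000, pricing, codec_cost_usd=0.01)

print(f"codec off: ${codec_off:.4f}")
print(f"codec on:  ${codec_on:.4f}")
```

Keeping `codec_cost_usd` as a separate argument means a codec-on row cannot leave it out by accident.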

How to check our work

We don’t ship a one-command repro harness. We ship the raw output of every run so you can re-add the numbers yourself, and we point at the public benchmark so you can run it against your own agent.
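
Re-adding the numbers mostly means grouping per-run records by model and codec setting. A rough sketch follows. The field names (`model`, `codec`, `task`, `passed`, `cost_usd`) are hypothetical, so check the downloaded files for the real schema. The records are inlined here to keep the example self-contained. In practice you would load them with `json.load`.

```python
from collections import defaultdict

# Hypothetical per-run records; the real JSON may use different field names.
runs = [
    {"model": "model-a", "codec": True, "task": "t1", "passed": True, "cost_usd": 0.05},
    {"model": "model-a", "codec": True, "task": "t2", "passed": False, "cost_usd": 0.07},
    {"model": "model-a", "codec": False, "task": "t1", "passed": True, "cost_usd": 0.20},
    {"model": "model-a", "codec": False, "task": "t2", "passed": True, "cost_usd": 0.22},
]


def summarize(records):
    """Group runs into (model, codec) cells and total attempts, passes, and cost."""
    cells = defaultdict(lambda: {"attempts": 0, "passes": 0, "cost_usd": 0.0})
    for r in records:
        cell = cells[(r["model"], r["codec"])]
        cell["attempts"] += 1
        cell["passes"] += int(r["passed"])
        cell["cost_usd"] += r["cost_usd"]
    # Add the pass rate once all runs in a cell have been counted.
    return {
        key: {**c, "pass_rate": c["passes"] / c["attempts"]}
        for key, c in cells.items()
    }


summary = summarize(runs)
for (model, codec), cell in sorted(summary.items()):
    label = "codec on" if codec else "codec off"
    print(model, label, cell)
```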

Per-run JSON for every batch (on each batch page) open latest →
WebArena, upstream (public benchmark) github →
The codec runs behind an API key (request access from the homepage) request key →

The most meaningful check is the one that doesn’t need anything from us. Install upstream WebArena and run it against your agent twice. Once bare, once through the codec. The shape of the results should match.
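
One way to compare the two runs is the fractional change in each metric from bare to codec. A small sketch, assuming you have already totalled each run yourself: the metric names and values are placeholders, and `relative_change` is an illustrative helper, not part of WebArena or the codec.

```python
def relative_change(bare: float, with_codec: float) -> float:
    """Fractional change from the bare run to the codec run (negative means lower)."""
    if bare == 0:
        raise ValueError("bare value must be non-zero")
    return (with_codec - bare) / bare


# Placeholder totals from two runs of the same agent on the same tasks.
bare_run = {"pass_rate": 0.50, "input_tokens": 400_000}
codec_run = {"pass_rate": 0.50, "input_tokens": 100_000}

for metric in bare_run:
    change = relative_change(bare_run[metric], codec_run[metric])
    print(f"{metric}: {change:+.1%}")
```

Because both runs use the same agent and tasks, whatever difference remains can be attributed to the codec.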

This batch’s fingerprint

Captured: {m.capturedDate}
JDC version: v{m.codecVersion}
Suite preset: {batch ? batch.preset : m.preset}
Models: {batch ? batch.results.length : m.nModels} tested
Tasks: {m.nTasksTotal} attempted ({ca.nTasks} in the cost aggregate, 1 privacy probe reported separately)
Attempts per variant: {ca.nPerCell} on the cost aggregate ({ca.nTasks} tasks × {m.nReps} reps)
Total runs: {m.totalAttempts} ({ca.totalAttempts} cost-aggregate, {m.privacySlice.totalAttempts} privacy-probe)
Total spend: ${m.totalSpend.toFixed(2)} across all runs
Pricing applied: Vertex AI list price, {m.pricingDate}
Upstream commit: WebArena {m.upstreamWebArenaSha.slice(0,8)}
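
The counts in the fingerprint are tied together by two identities stated in the rows above: attempts per variant equals tasks × reps, and total runs equals cost-aggregate runs plus privacy-probe runs. A quick sanity check might look like the sketch below. The parameter names mirror the template fields and the numbers are placeholders, not this batch's values.

```python
def check_fingerprint(n_tasks, n_reps, n_per_cell,
                      cost_attempts, privacy_attempts, total_attempts):
    """Return a list of inconsistencies between the fingerprint counts."""
    problems = []
    # Attempts per variant: {ca.nPerCell} = {ca.nTasks} tasks x {m.nReps} reps
    if n_per_cell != n_tasks * n_reps:
        problems.append(
            f"attempts per variant {n_per_cell} != {n_tasks} x {n_reps}"
        )
    # Total runs: {m.totalAttempts} = cost-aggregate + privacy-probe
    if total_attempts != cost_attempts + privacy_attempts:
        problems.append(
            f"total runs {total_attempts} != {cost_attempts} + {privacy_attempts}"
        )
    return problems


# Placeholder values for illustration only.
print(check_fingerprint(n_tasks=5, n_reps=3, n_per_cell=15,
                        cost_attempts=60, privacy_attempts=6, total_attempts=66))
```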

Check our work on this batch

The raw JSON for every run in this batch is downloadable below. Re-add the numbers, with your own pricing assumptions if you like, or diff them against our published table. At the pricing listed above, both should line up.
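
Diffing recomputed totals against a published table is easier with a tolerance, because published dollar figures are rounded. A sketch, assuming two-decimal rounding (hence the half-cent default tolerance): the cell keys and values are made up for illustration.

```python
import math


def diff_tables(recomputed, published, abs_tol=0.005):
    """Return cells that disagree beyond abs_tol or appear in only one table."""
    mismatches = {}
    for key in recomputed.keys() | published.keys():
        ours, theirs = recomputed.get(key), published.get(key)
        if ours is None or theirs is None or not math.isclose(
            ours, theirs, rel_tol=0.0, abs_tol=abs_tol
        ):
            mismatches[key] = (ours, theirs)
    return mismatches


# Made-up totals in USD, keyed by (model, codec setting).
recomputed = {("model-a", "on"): 1.2349, ("model-a", "off"): 3.10}
published = {("model-a", "on"): 1.23, ("model-a", "off"): 3.25}

print(diff_tables(recomputed, published))
```

An empty result means every cell matches within rounding. Any other result names the cells worth a closer look.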

{rawDataUrl ? (
{displayName} ({m.totalAttempts} per-attempt JSON, plus per-cell summaries) download →
) : (
{displayName} ({m.totalAttempts} per-attempt JSON, plus per-cell summaries) upload pending
)}

How we ran it (methodology on the hub) read →
Run it against your own agent (API key on request) request key →