Codec on

{p.examples[0].agentAnswer}

{p.summary.codecOnKeptOffWire} attempts kept the email off the wire

Email replaced with a placeholder on the customer’s machine before any snapshot reached the model.

vs.

Codec off

{p.examples[1].agentAnswer}

{p.summary.codecOffSentAsIs} attempts sent the email to the model

No redaction layer in path. The model received the email as it appeared on the page.

Same agent, same model, same task. The codec did what it was meant to do.

Names are not redacted. They overlap too heavily with everyday words and product copy for regex redaction to be safe, and a false-positive redaction would corrupt the page state the agent navigates against. Contact identifiers (email, phone, address, SSN, credit-card-like numbers) are structurally identifiable, and they are what regulated workloads care about. That’s where the redaction line is drawn.
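To make "structurally identifiable" concrete, here is a minimal sketch of pattern-based redaction. It is an illustration only, not the codec's actual rules: the `PATTERNS` table, the `redact` function, and the `[EMAIL]`-style placeholders are all hypothetical names, and the patterns are deliberately simple.

```python
import re

# Hypothetical patterns for illustration; the real codec's rules are not shown here.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact(snapshot: str) -> str:
    """Replace each structurally identifiable value with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        snapshot = pattern.sub(f"[{label}]", snapshot)
    return snapshot


page = "Contact Jane Doe at jane.doe@example.com, SSN 123-45-6789."
print(redact(page))
```

In this sketch the name "Jane Doe" passes through untouched while the email and SSN are swapped for placeholders. A name has no fixed shape a pattern can anchor on, which is the reason given above for leaving names alone.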

Private by design. A deliberate per-task override is on the roadmap for benchmark and test runs where exposing personal data to the model is an explicit choice.

benchmarks

The smartest model. The cheapest bill.

Same agent, same tasks. Only the codec changes. Even on the strongest model in this batch, switching the codec on solves more tasks at a lower price per success.

Four models. Same story on every one.

One matched pair per model: same agent, same tasks, codec on versus codec off. Cheaper per successful task on every model, without ever losing on tasks finished. The full task-by-task breakdown lives on the batch page.
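As a rough sketch of how a "cheaper per successful task" comparison can be computed, assume the metric is total spend divided by tasks solved. That definition, the function name `cost_per_success`, and the figures below are all assumptions made for illustration; none of the numbers are results from this batch.

```python
def cost_per_success(total_cost_usd: float, tasks_solved: int) -> float:
    """Total spend divided by solved tasks; infinite if nothing was solved."""
    if tasks_solved == 0:
        return float("inf")
    return total_cost_usd / tasks_solved


# Hypothetical matched pair: same agent, same tasks, only the codec differs.
codec_on = cost_per_success(total_cost_usd=1.20, tasks_solved=8)
codec_off = cost_per_success(total_cost_usd=2.00, tasks_solved=7)

print(f"codec on:  ${codec_on:.3f} per success")
print(f"codec off: ${codec_off:.3f} per success")
```

Dividing by solved tasks rather than attempted tasks means a run that is cheap but fails often does not look artificially good.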

Private by design.

One of the ten tasks in this batch asks the agent to return a customer’s email address. The codec redacts personal data on your machine before any snapshot leaves your network, so the email never reaches the model. We report this task separately because the codec is meant to “fail” the evaluator here. That’s the feature.

Benchmark methodology.

The rules we apply on every batch. Per-batch fingerprints (codec version, upstream commits, pricing snapshot) live on each batch’s own page.

Frequently asked questions.

What we get asked most, plus the questions we keep volunteering because anyone running benchmarks is already wondering about them.