Session: Building an Agent Eval Harness


Prompted by a Nate Jones video on preparing for model upgrades, this session builds a proper evaluation harness for the multi-agent infrastructure pipeline.

The goal: a reproducible test suite that can tell us, before and after a model change, whether the agent is better or worse at diagnosing infrastructure failures.

What we built:

A 6-scenario eval suite — 2 happy path (LXC + VM), 2 LXC failure scenarios (simple/complex), 2 VM failure scenarios (simple/complex) — instrumented with Prometheus metrics and a baseline JSON output format.

The interesting finding:

Happy path scenarios produce exactly 0 LLM calls. The LLM only activates in the DIAGNOSING state, which only fires on failure. This means the eval harness is almost entirely a measure of failure diagnosis quality, not general capability.

Model comparison (2026-04-01 baseline, 5 models):

ModelAvg retriesTotal costCost/failureNotable behavior
claude-haiku-4-52.0$0.0069$0.0017Stable, correct on simple failures; topology speculation on VM scenarios
claude-sonnet-4-61.25$0.0863$0.0216Fewest retries; 13× Haiku’s cost for similar hypothesis quality
gpt-4o1.5$0.0748$0.018727% fewer tokens than Sonnet; only model to find both root causes in fail-vm-complex
gpt-4o-mini2.25$0.0024$0.0006Near-identical token count to GPT-4o at 31× lower cost; shallower hypotheses
claude-opus-4-62.25$0.394$0.099Same retry behavior as Haiku; drifts to topology speculation; 57× Sonnet’s cost

Opus 4.6 was run after the initial 4-model baseline as a sanity check on the top of the Anthropic lineup. It did not clear the Mythos Bar — retry behavior was identical to Haiku and GPT-4o-mini, it still speculated about cluster topology on fail-vm-complex by retry 3, and it costs $0.099 per failure scenario vs Haiku’s $0.0017. There is no justification for using it here.

Haiku stays as the production diagnosis model.

Full architectural writeup: eval-harness-architecture.md