Session: Building an Agent Eval Harness

Prompted by a Nate B. Jones video on preparing for model upgrades, this session builds a proper evaluation harness for the multi-agent infrastructure pipeline.

The goal: a reproducible test suite that can tell us, before and after a model change, whether the agent is better or worse at diagnosing infrastructure failures.

What we built:

A 6-scenario eval suite — 2 happy path (LXC + VM), 2 LXC failure scenarios (simple/complex), 2 VM failure scenarios (simple/complex) — instrumented with Prometheus metrics and a baseline JSON output format.

The interesting finding:

Happy path scenarios produce exactly 0 LLM calls. The LLM only activates in the DIAGNOSING state, which only fires on failure. This means the eval harness is almost entirely a measure of failure diagnosis quality, not general capability.

Model comparison (2026-04-01 baseline, 5 models):

Model	Avg retries	Total cost	Cost/failure	Notable behavior
claude-haiku-4-5	2.0	$0.0069	$0.0017	Stable, correct on simple failures; topology speculation on VM scenarios
claude-sonnet-4-6	1.25	$0.0863	$0.0216	Fewest retries; 13× Haiku’s cost for similar hypothesis quality
gpt-4o	1.5	$0.0748	$0.0187	27% fewer tokens than Sonnet; only model to find both root causes in fail-vm-complex
gpt-4o-mini	2.25	$0.0024	$0.0006	Near-identical token count to GPT-4o at 31× lower cost; shallower hypotheses
claude-opus-4-6	2.25	$0.394	$0.099	Same retry behavior as Haiku; drifts to topology speculation; 57× Sonnet’s cost

Opus 4.6 was run after the initial 4-model baseline as a sanity check on the top of the Anthropic lineup. It did not clear the Mythos Bar — retry behavior was identical to Haiku and GPT-4o-mini, it still speculated about cluster topology on fail-vm-complex by retry 3, and it costs $0.099 per failure scenario vs Haiku’s $0.0017. There is no justification for using it here.

Haiku stays as the production diagnosis model.

Full architectural writeup: eval-harness-architecture.md