Project: Agent Eval Harness


When a new model drops — Claude Mythos, GPT-5, whatever’s next — how do you know if your agent got better or worse?

The agent eval harness answers that question with reproducible data.

Scenario suite (6 fixed scenarios):

Scenario          Type                                Expected
lxc-happy-path    LXC provision                       0 LLM calls, SUCCEEDED
vm-happy-path     VM clone                            0 LLM calls, SUCCEEDED
fail-simple       Wrong storage pool                  SUCCEEDED via diagnosis
fail-complex      Storage + template + bridge wrong   SUCCEEDED via diagnosis
fail-vm-simple    Bad template VMID                   SUCCEEDED via diagnosis
fail-vm-complex   Bad VMID + bad storage              SUCCEEDED via diagnosis
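The suite above can be encoded as data so a run either passes or fails mechanically. A minimal sketch; the `Scenario` dataclass, field names, and `passed` helper are hypothetical, not the harness's actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    fault: str              # injected misconfiguration ("" = happy path)
    expect_llm_calls: bool  # fail-* scenarios succeed via LLM diagnosis
    expected_status: str

# The six fixed scenarios from the table above.
SUITE = [
    Scenario("lxc-happy-path", "", False, "SUCCEEDED"),
    Scenario("vm-happy-path", "", False, "SUCCEEDED"),
    Scenario("fail-simple", "wrong storage pool", True, "SUCCEEDED"),
    Scenario("fail-complex", "storage + template + bridge wrong", True, "SUCCEEDED"),
    Scenario("fail-vm-simple", "bad template VMID", True, "SUCCEEDED"),
    Scenario("fail-vm-complex", "bad VMID + bad storage", True, "SUCCEEDED"),
]

def passed(s: Scenario, status: str, llm_calls: int) -> bool:
    # Happy paths must succeed with 0 LLM calls; fail-* scenarios
    # must still reach SUCCEEDED, via diagnosis.
    if status != s.expected_status:
        return False
    if not s.expect_llm_calls and llm_calls > 0:
        return False
    return True
```

Keeping the expectations in data rather than test code is what makes reruns against a new model directly comparable.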

Metrics (Prometheus):

  • eval_scenario_runs_total — labeled by scenario, model, status
  • eval_scenario_tokens_total — input/output token counts
  • eval_scenario_cost_dollars_total — USD cost per scenario
  • eval_scenario_retries_total — retry count (proxy for diagnosis difficulty)

2026-04-01 baseline results (5 models):

Model               Avg retries   Total cost   Cost/failure   Notable
claude-haiku-4-5    2.0           $0.0069      $0.0017        Stable baseline; production model
claude-sonnet-4-6   1.25          $0.0863      $0.0216        Fewest retries; 13× Haiku cost
gpt-4o              1.5           $0.0748      $0.0187        Only model to find both root causes in fail-vm-complex
gpt-4o-mini         2.25          $0.0024      $0.0006        Cheapest; hypothesis depth slightly shallower
claude-opus-4-6     2.25          $0.394       $0.099         Did not clear Mythos Bar; 57× Haiku cost, same retry behavior

Opus 4.6 was added after the initial 4-model run. It matched Haiku's retry count, reproduced the same cluster-topology speculation failure on fail-vm-complex, and cost 57× more than Haiku. Haiku remains the production diagnosis model.
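The quoted multipliers follow directly from the Total cost column, and the Cost/failure column is consistent with averaging total cost over the four fail-* scenarios (an inference from the numbers, not stated in the source). A quick arithmetic check:

```python
# Total cost per model, copied from the baseline table above.
total_cost = {
    "claude-haiku-4-5": 0.0069,
    "claude-sonnet-4-6": 0.0863,
    "gpt-4o": 0.0748,
    "gpt-4o-mini": 0.0024,
    "claude-opus-4-6": 0.394,
}

def cost_multiple(model: str, baseline: str = "claude-haiku-4-5") -> int:
    """Rounded cost ratio of `model` vs the Haiku baseline."""
    return round(total_cost[model] / total_cost[baseline])

def cost_per_failure(model: str) -> float:
    # Assumes Cost/failure = total cost averaged over the 4 fail-* scenarios.
    return total_cost[model] / 4
```

Sonnet comes out at roughly 13× Haiku and Opus at roughly 57×, matching the Notable column.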
