Failure Detection Dashboard — Observability for AI Agent Failures

When you build autonomous AI agents that interact with infrastructure, things go wrong in predictable patterns. My Failure Detection Dashboard makes those patterns visible — and teaches you which failures are cheap to detect and which ones are expensive lessons.

The Problem

The multi-agent infrastructure deployment system I built in Week 2 runs Terraform jobs through an LLM diagnosis loop: if terraform plan/apply fails, the LLM proposes a fix, we retry, and repeat up to a cap. This works well, but visibility is poor. You know whether it failed, but not how or why — and more importantly, which failure modes are burning budget.

Early runs covered 228 eval runs across 8 scenarios. With 125 failures and only $3.94 in total cost, something didn’t add up. Digging into the metrics revealed the culprit: one scenario (OCR geometry degraded) consumed 2/3 of the entire cost.

That observation led to this dashboard and the infrastructure-wide monitoring setup it validates.

Architecture: 6 Failure Modes, Fully Instrumented

I designed the dashboard around 6 distinct failure modes from real agent diagnosis patterns:

Failure Mode	Symptom	Cost Signal
Context Degradation	LLM hypothesis drifts further from root cause with each retry	High input tokens on later retries
Specification Drift	Same error, different models propose different fixes (Haiku vs Sonnet)	Cost variance across model lines
Cascading Failure	3+ independent config errors; LLM must enumerate all of them sequentially	High retry count (2-3 per scenario)
Silent Failure	Happy path: 0 LLM calls, $0.00 cost	Target metric: should be 100% of happy runs
Sycophantic Confirmation	LLM confidently proposes the same fix repeatedly despite apply-time failures	Identical hypothesis repeatedly in logs
Tool Selection Errors	LLM proposes unavailable providers/tools not in registry	LLM proposals vs ToolRegistry enforcement

Each mode is instrumented via Prometheus counters pushed from the eval harness:

eval_scenario_runs_total{scenario, model, status}        — outcomes
eval_scenario_tokens_total{scenario, model, token_type}  — input/output tokens
eval_scenario_cost_dollars_total{scenario, model}        — cumulative cost
eval_scenario_retries_total{scenario, model}             — retry attempts

The Pushgateway pattern (from ADR-003) fits well here: batch jobs push metrics after completion, Prometheus scrapes the gateway every 15s, and Grafana refreshes panels every 30s. The short-lived eval jobs never have to expose a scrape endpoint of their own.
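A minimal sketch of the push step, assuming the harness renders the Prometheus text exposition format by hand and sends it to the Pushgateway's /metrics/job/<job> endpoint with the standard library. The job name eval_harness, the label value haiku, and the helper names are assumptions for illustration; the real harness may well use a client library instead.

```python
import urllib.request


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one counter family: a TYPE line followed by one line per sample."""
    lines = [f"# TYPE {name} counter"]
    for labels, value in samples:
        label_text = ",".join(
            f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"  # the body must end with a newline


def push(body: str, gateway: str, job: str) -> int:
    """POST the rendered metrics to the Pushgateway and return the HTTP status."""
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


body = render_counter(
    "eval_scenario_runs_total",
    [({"scenario": "fail-simple", "model": "haiku", "status": "FAILED"}, 1)],
)
# Hypothetical job name; uncomment to push against a live gateway.
# push(body, "http://monitoring.mcmahon.home:9091", "eval_harness")
```

With POST, the Pushgateway replaces only the metrics with the same names in that job group; PUT would replace the whole group.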

The Dashboard: 10 Panels, 228 Data Points

Summary Stats (Top Row)

Success Rate Gauge: Currently 45.2% (103 SUCCESS, 125 FAILED) — Room for improvement
Total Cost: $3.94 across all 228 runs — Surprisingly low, but let’s dig in
Total Runs: 228 eval scenarios executed — Good coverage of all failure types

Failure Breakdown (Panel 2: Pie Chart)

Shows SUCCESS vs FAILED distribution by status. Key insight: Pie chart by status is less useful than breakdown by scenario — you need to know which scenarios fail most, not just aggregate success rate.

Failure Modes by Scenario (Panel 3: Stacked Bar)

sum by(scenario, status)(eval_scenario_runs_total)

lxc-happy-path, vm-happy-path: Majority SUCCESS (good — these are baselines)
fail-simple, fail-vm-simple: Mostly FAILED (expected — single error scenarios)
fail-complex, fail-vm-complex: All FAILED (expected — cascading failures hit retry cap)
ocr-handwritten-biology, ocr-geometry-degraded: Mixed (vision OCR quality varies)
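To make the aggregation concrete, here is a rough Python equivalent of sum by(scenario, status) over a handful of samples. The sample values and model labels are invented for illustration and are not the dashboard's data.

```python
from collections import defaultdict

# Each sample mirrors one eval_scenario_runs_total series: labels plus a counter value.
# These numbers are invented for illustration.
samples = [
    ({"scenario": "fail-simple", "model": "haiku", "status": "FAILED"}, 10),
    ({"scenario": "fail-simple", "model": "sonnet", "status": "FAILED"}, 4),
    ({"scenario": "fail-simple", "model": "sonnet", "status": "SUCCESS"}, 2),
    ({"scenario": "lxc-happy-path", "model": "haiku", "status": "SUCCESS"}, 12),
]


def sum_by(samples, keys):
    """Group samples by the given label keys and sum their values, dropping other labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        group = tuple(labels[key] for key in keys)
        totals[group] += value
    return dict(totals)


by_scenario_status = sum_by(samples, ("scenario", "status"))
```

The model label disappears from the result, which is why the stacked bar shows one segment per scenario and status rather than one per model.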

LLM Calls Per Scenario (Panel 4: Bar Chart)

max by(scenario)(eval_scenario_retries_total)

fail-simple: ~1 retry (single error, clear signal)
fail-complex: ~2-3 retries (enumeration across 3 axes)
fail-vm-complex: ~2-3 retries (template + storage, both invalid)
OCR scenarios: ~1 retry (mostly pass/fail, limited recovery)

Token Efficiency (Panel 5: Table)

Raw data: input_tokens, output_tokens, cost_usd per scenario+model.

Surprising finding: OCR geometry has low token count (~500-1000 total) but high cost. Why?

→ Sonnet model at $5/1M input, $15/1M output. Haiku costs ~5x less. Geometry used Sonnet (handwritten-like degradation), running 40 pages through vision OCR.

Cost Trend Over Time (Panel 6: Line Chart)

sum by(scenario)(eval_scenario_cost_dollars_total)

The smoking gun: One line dominates the chart.

The Case Study: Why OCR Geometry Degraded Is So Expensive

Let me trace the cost:

Ground truth: 40-page geometry practice test, poor scan quality
Strategy: Evaluate OCR recovery rate — how much text can we extract despite degradation?
Model choice: Tested Haiku, Sonnet, Opus, GPT-4o
Result: Only Sonnet achieved 0.846 recovery rate (threshold 0.80)
Cost: Sonnet @ $5/1M input, $15/1M output. 40 pages × ~5000 tokens/page = 200K input tokens
The bill: (200,000 / 1,000,000) × $5 = $1.00 per run

Run it twice (comparative eval), and you’re at $2.00, just over half the total eval budget, for one 40-page PDF.
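The arithmetic above, as a small sketch. Prices and token counts are the figures quoted in this post; output tokens are left out for the geometry run, as in the trace, so this is an input-side estimate rather than an exact bill.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one call given token counts and per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Figures quoted in this post (Sonnet: $5 per 1M input tokens, $15 per 1M output tokens).
pages, tokens_per_page = 40, 5_000
geometry_run = cost_usd(pages * tokens_per_page, 0, 5.0, 15.0)  # input side only

total_eval_cost = 3.94  # total across all 228 runs
two_run_share = (2 * geometry_run) / total_eval_cost

print(f"geometry: ${geometry_run:.2f} per run")
print(f"two runs: {two_run_share:.0%} of the total eval cost")
```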

Compare to fail-simple: a single Terraform error, Haiku at ~800 input tokens and ~100 output tokens, comes to $0.0005 per run. That is about 2,000x cheaper per run.

Lessons From the Numbers

1. Vision OCR is expensive at scale

Handwritten-1page (biology notes): Sonnet $0.015/run
Degraded-40pages (geometry): Sonnet $1.00/run
Solution: Haiku for printed/clean, Sonnet only for handwritten, direct extraction for text-layer PDFs

2. Model size drives hypothesis complexity

Haiku: Narrow focus, ~300-500 output tokens, clear recommendations
Sonnet: Broader hypothesis space, ~500-1000 output tokens, multi-axis reasoning
Cost multiplier: 5-10x depending on input
Solution: Route by error class. Simple errors → Haiku. Cascading → Sonnet.
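One way the routing rule could look, as a sketch. The error-class names, the distinct-message heuristic, and the model labels are all hypothetical; the post only establishes the rule itself (simple errors to Haiku, cascading errors to Sonnet).

```python
from enum import Enum


class ErrorClass(Enum):
    SIMPLE = "simple"        # one config error, clear signal
    CASCADING = "cascading"  # 3+ independent config errors


# Hypothetical routing table mirroring the rule above.
MODEL_FOR_CLASS = {
    ErrorClass.SIMPLE: "haiku",
    ErrorClass.CASCADING: "sonnet",
}


def classify(error_messages: list[str]) -> ErrorClass:
    """Treat 3+ distinct error messages as cascading, anything else as simple.

    Counting distinct messages is an assumed heuristic, not the author's classifier.
    """
    distinct = {message.strip() for message in error_messages if message.strip()}
    return ErrorClass.CASCADING if len(distinct) >= 3 else ErrorClass.SIMPLE


def route(error_messages: list[str]) -> str:
    """Pick the model assigned to this error class."""
    return MODEL_FOR_CLASS[classify(error_messages)]
```

Keeping the table separate from the classifier means the model assignment can change without touching the classification logic.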

3. Eval harness amplifies real-world costs

We ran multiple scenarios with multiple models (cross-product testing)
If you ship a feature that uses Sonnet everywhere, multiply by production volume
The geometry scenario is a warning: don’t OCR 40-page PDFs page-by-page through Claude

4. Happy path is cheap (and you need it)

103 successful runs, 0 LLM calls, $0.00 cost
That’s the baseline. Any degradation from there is visible.
Alert on regression: if happy-path success rate drops below 99%, something is wrong
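A sketch of that regression check, assuming you already have success and total counts for the happy-path scenarios (for example from eval_scenario_runs_total). The function name and the way counts are supplied are assumptions.

```python
def happy_path_regressed(success: int, total: int, threshold: float = 0.99) -> bool:
    """Return True when the happy-path success rate falls below the threshold."""
    if total == 0:
        return False  # no runs yet, nothing to alert on
    return success / total < threshold
```

In practice the same condition would more likely live in a Prometheus alerting rule; the Python form just pins down the comparison.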

How to Read the Dashboard

When Something Costs Too Much:

Check the Cost Trend panel — which scenario line is growing?
Jump to Token Efficiency table — input/output tokens per model
Check LLM Calls: is the retry count high (2-3)? If so, suspect a cascading failure, or context degradation if input tokens also climb on later retries
Look at Runbooks for diagnosis steps

When Something Fails Unexpectedly:

Navigate to the Failure Modes by Scenario panel: which scenario failed?
Read corresponding runbook: e.g., cascading-failure if fail-complex has high retry count
Check Prometheus: http://monitoring.mcmahon.home:9090/graph
Query: eval_scenario_runs_total{scenario="fail-complex", status="FAILED"}
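The same query can be run from a script through the Prometheus HTTP API (/api/v1/query). A minimal sketch using only the standard library; the helper names are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://monitoring.mcmahon.home:9090"


def build_query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload: dict) -> dict[tuple, float]:
    """Map each series' sorted label pairs to its current value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        tuple(sorted(series["metric"].items())): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def query(promql: str, base: str = PROMETHEUS) -> dict[tuple, float]:
    """Run an instant query and return the parsed vector."""
    with urllib.request.urlopen(build_query_url(base, promql)) as response:
        return parse_vector(json.load(response))


url = build_query_url(
    PROMETHEUS, 'eval_scenario_runs_total{scenario="fail-complex", status="FAILED"}'
)
# failed = query('eval_scenario_runs_total{scenario="fail-complex", status="FAILED"}')
```

Prometheus returns sample values as strings, which is why parse_vector converts them with float().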

Infrastructure Setup

Monitoring stack (Prometheus + Pushgateway + Grafana) lives in the home-lab monitoring/ directory and is version controlled along with dashboard definitions and runbooks.

To deploy:

cd monitoring/docker
docker-compose up -d

Verify:

Prometheus: http://monitoring.mcmahon.home:9090/targets
Grafana: http://monitoring.mcmahon.home:3000/d/failure-detection-main
Pushgateway: http://monitoring.mcmahon.home:9091/metrics

What’s Next

Week 3B: Token Economics Calculator

Real cost data is in hand (228 eval runs across 6 models)
Next: CLI tool to predict costs for new scenarios
Input: task type + complexity, Output: projected cost across Haiku/Sonnet/GPT-4o/etc

Beyond Week 3:

Automated alerts: “OCR scenario exceeds $X per run”
A/B testing harness: Compare model A vs B on same task
Cost optimization: Auto-route to cheapest viable model per failure mode
Runbook automation: Alert links to relevant runbook + remediation steps

The Key Insight

Failure detection isn’t about preventing failures — it’s about making them visible and cheap. When your LLM-powered agent hits an error, it should cost $0.0005 to diagnose (like fail-simple), not $1.00 (like OCR geometry). The dashboard makes that difference unmissable. And once it’s visible, you can fix it: use cheaper models, add pre-flight validation, route intelligently.

That one scenario consumed 2/3 of the cost wasn’t bad luck; it was inevitable, because vision OCR is expensive. The surprise was that the dashboard made it obvious in 15 seconds of looking at one chart.

That’s the value of instrumentation.

Update (post-migration): the monitoring stack has since moved from the standalone Docker host to Kubernetes. The Prometheus, Grafana, and Pushgateway URLs in the Verify and access steps above now point at the monitoring.mcmahon.home VIP accordingly.

Project: Failure Detection Dashboard