Project: Multi-Agent Infrastructure Automation
The multi-agent-lab is a production-grade experiment in agentic infrastructure automation. It provisions real infrastructure on a Proxmox homelab cluster — LXC containers and VM clones — using a 3-agent pipeline where the LLM only activates when something goes wrong.
Architecture
┌─────────────────────────────────────────┐
│ Planner │
│ (deterministic state machine) │
│ │
│ INIT → RUNNING → VALIDATING → SUCCESS │
│ ↓ ↓ │
│ DIAGNOSING ← ─ ─ │
│ ↓ │
│ RUNNING (retry) or FAILED │
└─────────────────────────────────────────┘
│ │
TerraformRequest ValidationRequest
│ │
┌──────▼──────┐ ┌────────▼────────┐
│ Terraform │ │ Validator │
│ sub-agent │ │ sub-agent │
│ plan+apply │ │ 4 checks │
│ (no LLM) │ │ (no LLM) │
└──────┬──────┘ └────────┬─────────┘
│ │
┌──────▼────────────────────▼──┐
│ diagnosis/llm.py │
│ (called ONLY on failure) │
│ Anthropic or OpenAI API │
└───────────────────────────────┘
The key design insight: infrastructure provisioning is almost entirely deterministic. There’s no reason to burn tokens on clean runs. The LLM activates exclusively in the DIAGNOSING state — which only fires on failure.
State machine
| Transition | Trigger |
|---|---|
| INIT → RUNNING | Task received |
| RUNNING → VALIDATING | Terraform apply succeeds |
| VALIDATING → SUCCESS | All 4 checks pass |
| RUNNING → DIAGNOSING | Terraform apply fails |
| VALIDATING → DIAGNOSING | Any check fails |
| DIAGNOSING → RUNNING | New hypothesis, retry count < cap |
| DIAGNOSING → FAILED | Retry cap hit or repeated hypothesis |
Hypothesis deduplication prevents circular reasoning loops. If the LLM proposes the same fix twice, the run terminates rather than spinning.
Infrastructure modules
Both modules use the bpg/proxmox provider (~0.73). The telmate/proxmox provider (~2.9) was evaluated and rejected — unmaintained, with known permission-check bugs against Proxmox VE 9.x.
| Module | What it deploys |
|---|---|
proxmox-lxc | Debian 13 LXC containers, DHCP, count support |
proxmox-vm | Ubuntu 24.04 VM clones from template VMID 9000, cloud-init, SSH |
Terraform remote state is stored on MinIO running on a QNAP NAS (StanzaLab) — S3-compatible, self-hosted, no cloud dependency.
Observability
Every LLM diagnosis call pushes metrics to a Prometheus Pushgateway:
| Metric | Labels |
|---|---|
agent_llm_cost_dollars_total | model |
agent_llm_input_tokens_total | model |
agent_llm_output_tokens_total | model |
agent_llm_calls_total | model |
agent_llm_latency_seconds | model |
Production results
All real-world failures to date were diagnosed correctly on the first hypothesis. LLM first-hypothesis success rate: 100%.
| Run | Result | Retries | LLM cost |
|---|---|---|---|
| LXC single container | SUCCESS | 0 | $0.00 |
| LXC 5-container fleet | SUCCESS | 0 | $0.00 |
| Ubuntu VM clone | SUCCESS | 0 | $0.00 |
Eval harness
A fixed 6-scenario eval suite measures diagnosis quality across models. 2026-04-01 baseline:
| Model | Avg retries (failure scenarios) | Avg cost |
|---|---|---|
| claude-haiku-4-5 | 2.0 | $0.0003 |
| claude-sonnet-4-6 | 1.25 | $0.0021 |
| gpt-4o | 1.5 | $0.0089 |
| gpt-4o-mini | 2.25 | $0.0004 |
Sonnet diagnoses failures in fewer retries than Haiku. GPT-4o matches Sonnet’s retry efficiency at 4× the cost.
Session notes: Multi-Agent System · Agent Eval Harness