Project: Multi-Agent Infrastructure Automation


The multi-agent-lab is a production-grade experiment in agentic infrastructure automation. It provisions real infrastructure on a Proxmox homelab cluster — LXC containers and VM clones — using a 3-agent pipeline where the LLM only activates when something goes wrong.

Architecture

                ┌─────────────────────────────────────────┐
                │               Planner                    │
                │   (deterministic state machine)          │
                │                                          │
                │  INIT → RUNNING → VALIDATING → SUCCESS   │
                │              ↓           ↓               │
                │          DIAGNOSING ← ─ ─                │
                │              ↓                           │
                │          RUNNING (retry) or FAILED       │
                └─────────────────────────────────────────┘
                       │                    │
           TerraformRequest        ValidationRequest
                       │                    │
                ┌──────▼──────┐    ┌────────▼─────────┐
                │  Terraform  │    │    Validator     │
                │  sub-agent  │    │    sub-agent     │
                │ plan+apply  │    │  4 checks        │
                │  (no LLM)   │    │  (no LLM)        │
                └──────┬──────┘    └────────┬─────────┘
                       │                    │
                ┌──────▼────────────────────▼───┐
                │         diagnosis/llm.py      │
                │   (called ONLY on failure)    │
                │   Anthropic or OpenAI API     │
                └───────────────────────────────┘

The key design insight: infrastructure provisioning is almost entirely deterministic. There’s no reason to burn tokens on clean runs. The LLM activates exclusively in the DIAGNOSING state — which only fires on failure.

State machine

Transition                  Trigger
INIT → RUNNING              Task received
RUNNING → VALIDATING        Terraform apply succeeds
VALIDATING → SUCCESS        All 4 checks pass
RUNNING → DIAGNOSING        Terraform apply fails
VALIDATING → DIAGNOSING     Any check fails
DIAGNOSING → RUNNING        New hypothesis, retry count < cap
DIAGNOSING → FAILED         Retry cap hit or repeated hypothesis
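
The transition table above maps naturally onto a lookup keyed by (state, event). The following is a minimal sketch of that idea, not the project's actual Planner code: the state names come from the table, while the event names (task_received, apply_failed, and so on) and the next_state helper are hypothetical labels chosen for illustration.

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    RUNNING = "RUNNING"
    VALIDATING = "VALIDATING"
    DIAGNOSING = "DIAGNOSING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


# Event names are hypothetical labels for the triggers listed in the table.
TRANSITIONS = {
    (State.INIT, "task_received"): State.RUNNING,
    (State.RUNNING, "apply_succeeded"): State.VALIDATING,
    (State.RUNNING, "apply_failed"): State.DIAGNOSING,
    (State.VALIDATING, "checks_passed"): State.SUCCESS,
    (State.VALIDATING, "check_failed"): State.DIAGNOSING,
    (State.DIAGNOSING, "retry"): State.RUNNING,
    (State.DIAGNOSING, "give_up"): State.FAILED,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or raise if the table has no such transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None
```

Because SUCCESS and FAILED never appear as a source state, they are terminal by construction, and DIAGNOSING is reachable only from the two failure events.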

Hypothesis deduplication prevents circular reasoning loops. If the LLM proposes the same fix twice, the run terminates rather than spinning.
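
One way to implement this guard is to remember every hypothesis already tried and refuse to retry on a repeat. The sketch below is an assumption about how such a gate could look: the DiagnosisGate class, the default cap of 3, and the whitespace/case normalization are all illustrative choices, not details taken from the project.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosisGate:
    """Decides whether DIAGNOSING goes back to RUNNING or ends in FAILED."""

    max_retries: int = 3  # assumed cap; the real value is not stated
    seen: set = field(default_factory=set)
    retries: int = 0

    def decide(self, hypothesis: str) -> str:
        # Normalize case and whitespace so trivially reworded repeats match
        # (an illustrative choice; exact-match comparison would also work).
        key = " ".join(hypothesis.lower().split())
        if key in self.seen or self.retries >= self.max_retries:
            return "FAILED"
        self.seen.add(key)
        self.retries += 1
        return "RUNNING"
```

A repeated hypothesis and an exhausted retry budget both end the run, which matches the two conditions on the DIAGNOSING → FAILED transition.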

Infrastructure modules

Both modules use the bpg/proxmox provider (~0.73). The telmate/proxmox provider (~2.9) was evaluated and rejected — unmaintained, with known permission-check bugs against Proxmox VE 9.x.

Module        What it deploys
proxmox-lxc   Debian 13 LXC containers, DHCP, count support
proxmox-vm    Ubuntu 24.04 VM clones from template VMID 9000, cloud-init, SSH

Terraform remote state is stored on MinIO running on a QNAP NAS (StanzaLab) — S3-compatible, self-hosted, no cloud dependency.
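
Pointing Terraform's S3 backend at a self-hosted MinIO endpoint typically means overriding the endpoint and skipping the AWS-specific validation steps. The sketch below generates such a backend block in Terraform's JSON syntax; the endpoint URL, bucket, and key are placeholders, the s3_backend_config helper is hypothetical, and the option names assume a recent Terraform release (1.6 or later) where `endpoints` and `use_path_style` are available.

```python
import json


def s3_backend_config(endpoint: str, bucket: str, key: str) -> dict:
    """Build a Terraform JSON backend block for an S3-compatible store."""
    return {
        "terraform": {
            "backend": {
                "s3": {
                    "bucket": bucket,
                    "key": key,
                    # Terraform requires a region; it is a placeholder here.
                    "region": "us-east-1",
                    "endpoints": {"s3": endpoint},
                    # MinIO is usually addressed path-style, not by subdomain.
                    "use_path_style": True,
                    # Skip checks that only make sense against real AWS.
                    "skip_credentials_validation": True,
                    "skip_metadata_api_check": True,
                    "skip_region_validation": True,
                    "skip_requesting_account_id": True,
                }
            }
        }
    }


if __name__ == "__main__":
    # Placeholder values; substitute the real MinIO endpoint and bucket.
    config = s3_backend_config(
        "http://minio.example.internal:9000", "tfstate", "lxc/terraform.tfstate"
    )
    print(json.dumps(config, indent=2))
```

Written to a file such as backend.tf.json, the output is read by Terraform alongside the regular .tf files, with credentials supplied separately (for example through the standard AWS environment variables).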

Observability

Every LLM diagnosis call pushes metrics to a Prometheus Pushgateway:

Metric                           Labels
agent_llm_cost_dollars_total     model
agent_llm_input_tokens_total     model
agent_llm_output_tokens_total    model
agent_llm_calls_total            model
agent_llm_latency_seconds        model
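
As a rough illustration of what such a push involves, the sketch below renders the five metrics in the Prometheus text exposition format and POSTs them to a Pushgateway using only the standard library. The metric names and the model label come from the table; the metric types (counter versus gauge), the job name, and the render_metrics and push helpers are assumptions, and the project may well use a client library instead.

```python
import urllib.request

COUNTERS = (
    "agent_llm_cost_dollars_total",
    "agent_llm_input_tokens_total",
    "agent_llm_output_tokens_total",
    "agent_llm_calls_total",
)
GAUGES = ("agent_llm_latency_seconds",)  # treating latency as a gauge is an assumption


def render_metrics(model: str, values: dict) -> str:
    """Render one sample per metric in the Prometheus text format."""
    lines = []
    for name in COUNTERS + GAUGES:
        kind = "counter" if name in COUNTERS else "gauge"
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f'{name}{{model="{model}"}} {values[name]}')
    return "\n".join(lines) + "\n"


def push(gateway_url: str, job: str, body: str) -> int:
    """POST the rendered metrics to the gateway's /metrics/job/<job> endpoint."""
    req = urllib.request.Request(
        f"{gateway_url}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

One caveat with this approach: the Pushgateway stores whatever value was pushed last rather than adding to it, so the caller has to keep the running totals for the `_total` metrics itself.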

Production results

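All real-world failures to date were diagnosed correctly on the first hypothesis, for an LLM first-hypothesis success rate of 100%. The runs listed below completed with zero retries, so no diagnosis call was made and the LLM cost was $0.00.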

Run                      Result    Retries   LLM cost
LXC single container     SUCCESS   0         $0.00
LXC 5-container fleet    SUCCESS   0         $0.00
Ubuntu VM clone          SUCCESS   0         $0.00

Eval harness

A fixed 6-scenario eval suite measures diagnosis quality across models. 2026-04-01 baseline:

Model               Avg retries (failure scenarios)   Avg cost
claude-haiku-4-5    2.0                               $0.0003
claude-sonnet-4-6   1.25                              $0.0021
gpt-4o              1.5                               $0.0089
gpt-4o-mini         2.25                              $0.0004

Sonnet diagnoses failures in fewer retries than Haiku (1.25 vs. 2.0 on average). GPT-4o comes close to Sonnet's retry efficiency (1.5 vs. 1.25) at roughly 4× the cost ($0.0089 vs. $0.0021).
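
The comparison can be checked directly from the baseline table. The short sketch below hard-codes the four rows and computes the retry difference and cost ratio between two models; the compare helper is hypothetical and exists only to make the arithmetic explicit.

```python
# (avg retries on failure scenarios, avg cost in dollars), from the baseline table
BASELINE = {
    "claude-haiku-4-5": (2.0, 0.0003),
    "claude-sonnet-4-6": (1.25, 0.0021),
    "gpt-4o": (1.5, 0.0089),
    "gpt-4o-mini": (2.25, 0.0004),
}


def compare(model: str, reference: str) -> tuple:
    """Return (extra retries, cost ratio) of `model` relative to `reference`."""
    retries, cost = BASELINE[model]
    ref_retries, ref_cost = BASELINE[reference]
    return retries - ref_retries, cost / ref_cost


if __name__ == "__main__":
    extra, ratio = compare("gpt-4o", "claude-sonnet-4-6")
    print(f"gpt-4o vs sonnet: {extra:+.2f} retries, {ratio:.1f}x cost")
```

On these numbers GPT-4o needs 0.25 more retries on average than Sonnet and costs about 4.2 times as much per run.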
