Session: Local Inference — Diagnosing the Lockups
The Problem
After Phase 1 of the local inference project was standing — Ollama running, both models pulled, compare_models.py working — I started hitting hard system lockups. Keyboard unresponsive, no crash log, full reboot required. Happened twice.
The pattern: lockups always occurred during or shortly after sessions where I was actively using the llama3.3:70b-instruct-q4_K_M model.
Root Cause: Memory Exhaustion
llama3.3:70b-instruct-q4_K_M is a 40GB model. Loading it on a 48GB machine consumes ~83% of the available unified memory pool, leaving roughly 8GB for the OS, active applications, and Metal GPU overhead.
That 8GB isn’t enough. The machine locks.
The subtlety with Apple Silicon: unified memory is shared between CPU, GPU (Metal), and OS. It’s not like discrete GPU VRAM where the model sits in one pool and system memory is separate. A model that fits on paper may still cause instability because Metal-mapped buffers can’t be compressed or swapped out under pressure. There’s no overflow path — when memory is exhausted, the system doesn’t gracefully degrade. It locks.
Ollama has no built-in guard against this. It will attempt to load whatever you ask, regardless of available RAM.
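The arithmetic behind that 83% figure can be written as a small pre-flight check. This is a minimal sketch, not something Ollama provides: the function name memory_budget is hypothetical, and the sizes are the figures quoted in this post (weights only, before any context or KV cache overhead).

```python
def memory_budget(model_gb: float, total_gb: float) -> dict:
    """Return utilization and remaining headroom for a single loaded model."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return {
        # Share of the unified pool taken by the model weights alone
        "utilization_pct": round(100 * model_gb / total_gb, 1),
        # What is left for the OS, applications, and Metal overhead
        "headroom_gb": total_gb - model_gb,
    }


budget = memory_budget(model_gb=40, total_gb=48)
print(budget)
```

For the 70B on this machine it reports 83.3% utilization and 8GB of headroom, which is the margin that proved too thin.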
The Fix: Three Environment Variables
Three env vars added to start_ollama.sh:
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_KEEP_ALIVE=5m
export OLLAMA_NUM_PARALLEL=1
OLLAMA_MAX_LOADED_MODELS=1: prevents both models from being resident at once. Without it, a session that uses both llama3.3:70b and deepseek-r1:32b could try to hold 40 + 18 = 58GB in memory, well beyond the 48GB pool.
OLLAMA_KEEP_ALIVE=5m: controls how long a model stays in memory after its last request. Ollama’s documented default is already five minutes, so this line pins the behavior explicitly rather than changing it. A longer or indefinite keepalive (a negative value keeps the model loaded forever) would let an idle 70B sit in memory until a later comparison run tried to load the 32B next to it. Five minutes ensures idle models are evicted before they compound pressure.
OLLAMA_NUM_PARALLEL=1: limits each loaded model to one request at a time. Parallel requests enlarge the context allocation, which can push memory use above the model’s baseline footprint.
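To see why the loaded-model cap matters, here is a rough sketch of the worst-case resident memory under a given cap. The helper worst_case_resident_gb is hypothetical, the sizes are the ones quoted above, and the estimate counts model weights only (no context or KV cache).

```python
def worst_case_resident_gb(model_sizes_gb: dict, max_loaded: int) -> float:
    """Sum of the `max_loaded` largest models: the most that can be resident at once."""
    largest_first = sorted(model_sizes_gb.values(), reverse=True)
    return sum(largest_first[:max_loaded])


SIZES_GB = {"llama3.3:70b-instruct-q4_K_M": 40, "deepseek-r1:32b": 18}
POOL_GB = 48  # unified memory on this machine

for cap in (2, 1):
    resident = worst_case_resident_gb(SIZES_GB, cap)
    print(f"max_loaded={cap}: {resident}GB resident, fits={resident <= POOL_GB}")
```

With a cap of 2 the worst case is 58GB, which cannot fit. With a cap of 1 it drops to 40GB, which fits on paper but still leaves only 8GB of headroom.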
70B Off the Default Toolchain
llama3.3:70b-instruct-q4_K_M is retained on disk but removed from compare_models.py defaults and the smoke check.
It’s still available:
python scripts/compare_models.py --models llama3.3:70b-instruct-q4_K_M
But only for intentional, monitored sessions — not routine use.
deepseek-r1:32b (~18GB) is the safe operational model for this machine. Safe rule: no model larger than ~30GB for routine use on 48GB hardware. The 70B needs another 16+ GB of headroom the machine doesn’t reliably have once the OS, apps, and Metal overhead are accounted for.
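One way to encode "available, but not by default" is sketched below. It is an illustration only and not the actual contents of compare_models.py: the size table, the constant names, and the over_limit helper are all assumptions, with the numbers taken from this post.

```python
import argparse

# Sizes as quoted in this post; this table is illustrative, not read from Ollama.
MODEL_SIZES_GB = {
    "llama3.3:70b-instruct-q4_K_M": 40,
    "deepseek-r1:32b": 18,
}
DEFAULT_MODELS = ["deepseek-r1:32b"]  # the 70B is deliberately absent
ROUTINE_LIMIT_GB = 30  # the ~30GB rule of thumb for 48GB hardware


def parse_models(argv=None) -> list:
    """Return the requested models, falling back to the safe defaults."""
    parser = argparse.ArgumentParser(description="Compare local models")
    parser.add_argument("--models", nargs="+", default=DEFAULT_MODELS,
                        help="models to run; the default list excludes the 70B")
    return list(parser.parse_args(argv).models)


def over_limit(models, limit_gb=ROUTINE_LIMIT_GB) -> list:
    """Models whose known size exceeds the routine-use limit."""
    return [m for m in models if MODEL_SIZES_GB.get(m, 0) > limit_gb]
```

A script built this way could print a warning for anything returned by over_limit before loading it, so the 70B stays an explicit, deliberate choice.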
Real Inference Validation
The existing health check only pinged /api/tags — Ollama’s model list endpoint. That tells you the server is running. It tells you nothing about whether inference actually works.
scripts/smoke_check.py sends two known-answer probes:
probes = [
    {"prompt": "What is the capital of France?", "expected": "paris"},
    {"prompt": "What is 15 + 27?", "expected": "42"},
]
Validates:
- Response contains the expected answer (case-insensitive)
- Response is at least 5 characters (not an empty or truncated response)
- Response time is under 120 seconds
Exit 0 on pass, exit 1 on failure — designed for CI use. Both probes confirmed passing post-fix on the live Ollama instance with the env vars active.
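The checks above might look roughly like this. It is a minimal sketch and not the actual smoke_check.py: the function names are hypothetical, the URL assumes Ollama's default local address, and the request uses the /api/generate endpoint with streaming disabled.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default; adjust if yours differs

PROBES = [
    {"prompt": "What is the capital of France?", "expected": "paris"},
    {"prompt": "What is 15 + 27?", "expected": "42"},
]


def check_response(text: str, expected: str, elapsed_s: float,
                   min_chars: int = 5, max_seconds: float = 120.0) -> bool:
    """Apply the three checks: expected answer present, long enough, fast enough."""
    return (
        expected.lower() in text.lower()
        and len(text.strip()) >= min_chars
        and elapsed_s < max_seconds
    )


def run_probe(model: str, probe: dict) -> bool:
    """Send one prompt to /api/generate and validate the reply."""
    body = json.dumps(
        {"model": model, "prompt": probe["prompt"], "stream": False}
    ).encode()
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    started = time.monotonic()
    with urllib.request.urlopen(request, timeout=120) as reply:
        text = json.loads(reply.read())["response"]
    return check_response(text, probe["expected"], time.monotonic() - started)


def main(model: str = "deepseek-r1:32b") -> int:
    """Return 0 if every probe passes, 1 otherwise (suitable for sys.exit)."""
    try:
        return 0 if all(run_probe(model, p) for p in PROBES) else 1
    except OSError:  # connection refused, timeout, or HTTP error
        return 1
```

Calling sys.exit(main()) from a script entry point gives the 0/1 exit code described above. Note that a bare "42" would fail the five-character check; in practice the models answer in full sentences, so the length check mainly catches empty or truncated output.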
What the Lockups Were Actually Telling Me
The lockups weren’t a bug — they were the hardware enforcing a constraint I’d ignored. The M5 Max is genuinely capable of running a 70B model, but “capable” and “stable for routine use” are different things. At 40GB of model weight, you’re running at ~83% memory utilization before the first token is generated. There’s no slack for anything else.
The right mental model for unified memory: it’s a shared pool with no overflow. Every GB of model weight is a GB unavailable to the OS and Metal. The lockup isn’t a crash — it’s the hardware running out of headroom.
Process Notes
This session also formalized the git workflow for the project:
- Branch protection on main: direct push blocked at the GitHub level
- All changes go through feature branch → PR → merge
- /commit skill for Claude Code sessions enforces this consistently
What’s Next
Phase 2: the RAG layer.
- Local vector database (Chroma or Qdrant — evaluating which fits the OpenBrain schema better)
- Ingest pipeline mirroring OpenBrain’s source, source_type, subject, topic structure
- Hybrid retrieval: keyword + vector search with RRF scoring, wired to Ollama inference
- Validation: RAG-augmented vs raw Ollama on the same query set, measuring response quality
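For reference, the RRF (reciprocal rank fusion) step in the hybrid retrieval item can be sketched in a few lines. Phase 2 is not built yet, so this is a generic illustration rather than project code: rrf_fuse and the document IDs are hypothetical, and k=60 is the commonly used constant, not a tuned value.

```python
from collections import defaultdict


def rrf_fuse(rankings, k: int = 60) -> list:
    """Fuse ranked lists of document IDs; returns IDs by descending RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every document it contains
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties on the ID so the output order is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from keyword search
vector_hits = ["doc_b", "doc_c", "doc_a"]   # e.g. from vector search
fused = rrf_fuse([keyword_hits, vector_hits])
print(fused)
```

Here doc_b comes out on top because it ranks near the top of both lists, even though it is first in only one of them. Only ranks are combined, so the keyword and vector scores never need to be on the same scale.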
The end goal of Phase 2 is a like-for-like comparison: local+RAG vs cloud+RAG. That comparison isolates the inference layer as the variable — which is the professionally relevant question.
Project overview: Local Inference Project