Closes #1123. Implements all three phases of the local LLM standardization:
PHASE 1 — Deployment:
- docs/local-llm.md: full deployment guide (build, model download, health check,
model path convention /opt/models/llama/, hardware recommendations)
- systemd/llama-server.service: hardened unit with resource limits and auto-restart
- Health check: /health endpoint plus model-loaded verification (sketched below)
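
A minimal verification sketch, assuming the llama.cpp server's built-in `/health` route, its OpenAI-compatible `/v1/models` listing, and the fleet default endpoint; the shipped check may differ in detail:

```python
import sys
import requests

BASE_URL = "http://localhost:11435"  # fleet default endpoint

def server_ready(timeout: float = 5.0) -> bool:
    """True when /health answers OK and at least one model is loaded."""
    try:
        if not requests.get(f"{BASE_URL}/health", timeout=timeout).ok:
            return False
        # /v1/models lists the loaded GGUF; an empty list means nothing is ready yet.
        models = requests.get(f"{BASE_URL}/v1/models", timeout=timeout).json()
        return bool(models.get("data"))
    except requests.RequestException:
        return False

if __name__ == "__main__":
    sys.exit(0 if server_ready() else 1)
```
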
PHASE 2 — Hermes Integration:
- bin/llama_client.py: OpenAI-compatible Python client wrapping llama.cpp HTTP API
(chat completions, streaming, raw completions, health check, model listing,
benchmarking, full CLI interface; see the call sketch below)
- nexus/llama_provider.py: Hermes inference router provider adapter
- Activates when external APIs fail, when LOCAL_ONLY=true, or on an explicit local request
- Response format normalized to OpenAI-compatible chat completions
- Token usage estimated and logged
- Health-check results cached with a TTL to avoid redundant probes (see provider sketch below)
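
For orientation, a minimal sketch of the kind of call `bin/llama_client.py` wraps, assuming the server's OpenAI-compatible `/v1/chat/completions` route; the function name and defaults here are illustrative, not the client's actual API:

```python
import requests

BASE_URL = "http://localhost:11435"

def chat(messages: list[dict], model: str = "Qwen2.5-7B-Instruct-Q4_K_M.gguf",
         temperature: float = 0.7, max_tokens: int = 512) -> dict:
    """POST an OpenAI-style chat completion to the local llama.cpp server."""
    resp = requests.post(
        f"{BASE_URL}/v1/chat/completions",
        json={"model": model, "messages": messages,
              "temperature": temperature, "max_tokens": max_tokens},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()  # OpenAI-shaped: choices[0].message.content, usage, ...

# Example:
# reply = chat([{"role": "user", "content": "ping"}])
# print(reply["choices"][0]["message"]["content"])
```
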
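And a rough sketch of the provider-side routing and health caching described above; class and method names are assumptions, only `LOCAL_ONLY` and the endpoint come from this PR:

```python
import os
import time
import requests

BASE_URL = "http://localhost:11435"
HEALTH_TTL = 30.0  # seconds; cache the /health result instead of probing every request

class LocalLlamaProvider:
    def __init__(self) -> None:
        self._healthy = False
        self._checked_at = 0.0

    def is_available(self) -> bool:
        """Probe /health, reusing the cached result inside the TTL window."""
        now = time.monotonic()
        if now - self._checked_at >= HEALTH_TTL:
            try:
                self._healthy = requests.get(f"{BASE_URL}/health", timeout=3).ok
            except requests.RequestException:
                self._healthy = False
            self._checked_at = now
        return self._healthy

    def should_route_local(self, external_failed: bool, explicit_local: bool) -> bool:
        """Route locally on external-API failure, LOCAL_ONLY=true, or explicit request."""
        local_only = os.getenv("LOCAL_ONLY", "").lower() == "true"
        return (external_failed or local_only or explicit_local) and self.is_available()
```
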
PHASE 3 — Optimization & Ops:
- Benchmarking: client.benchmark() + CLI benchmark command (throughput sketch below)
- Quantization guide: Q4_K_M recommended for the fleet, Q6_K for high-RAM hosts, Q3_K for low-RAM hosts
- Model recommendations for VPS Beta (3B), VPS Alpha (7B), Mac (7B Q6_K)
- Night watch integration: health probe script with auto-restart
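
The throughput sketch below is illustrative only (not `client.benchmark()`'s actual signature); it assumes the OpenAI-compatible `usage` block in responses and includes prompt processing in the timing, so treat the result as a rough rate:

```python
import time
import requests

BASE_URL = "http://localhost:11435"

def rough_tokens_per_second(prompt: str = "Explain GGUF quantization briefly.",
                            runs: int = 3) -> float:
    """Average completion tokens per second over a few runs."""
    rates = []
    for _ in range(runs):
        start = time.monotonic()
        resp = requests.post(
            f"{BASE_URL}/v1/chat/completions",
            json={"messages": [{"role": "user", "content": prompt}], "max_tokens": 256},
            timeout=300,
        ).json()
        elapsed = time.monotonic() - start
        rates.append(resp["usage"]["completion_tokens"] / elapsed)
    return sum(rates) / len(rates)

# print(f"{rough_tokens_per_second():.1f} tok/s")
```
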
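And a minimal night-watch probe sketch, assuming the unit name from systemd/llama-server.service; the shipped probe script may differ:

```python
import subprocess
import requests

BASE_URL = "http://localhost:11435"
UNIT = "llama-server.service"  # unit added in this PR

def probe_and_restart() -> None:
    """Restart the server via systemd when /health is unreachable or unhealthy."""
    try:
        ok = requests.get(f"{BASE_URL}/health", timeout=5).ok
    except requests.RequestException:
        ok = False
    if not ok:
        subprocess.run(["systemctl", "restart", UNIT], check=False)

if __name__ == "__main__":
    probe_and_restart()
```
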
Fleet standard model: Qwen2.5-7B-Instruct-Q4_K_M.gguf
Default endpoint: http://localhost:11435
22 tests pass.