feat: standardize llama.cpp backend for sovereign local inference
Closes #1123. Implements all three phases of the local LLM standardization:

PHASE 1 — Deployment:
- docs/local-llm.md: full deployment guide (build, model download, health check,
  model path convention /opt/models/llama/, hardware recommendations)
- systemd/llama-server.service: hardened unit with resource limits and auto-restart
- Health check: /health endpoint + model loaded verification

PHASE 2 — Hermes Integration:
- bin/llama_client.py: OpenAI-compatible Python client wrapping the llama.cpp HTTP API
  (chat completions, streaming, raw completions, health check, model listing,
  benchmarking, full CLI interface)
- nexus/llama_provider.py: Hermes inference router provider adapter
  - Activates when external APIs fail, LOCAL_ONLY=true, or explicit local request
  - Response format normalized to OpenAI-compatible chat completions
  - Token usage estimated and logged
  - Health caching with TTL for efficiency

PHASE 3 — Optimization & Ops:
- Benchmarking: client.benchmark() + CLI benchmark command
- Quantization guide: Q4_K_M recommended for fleet, Q6_K for high-RAM, Q3_K for low
- Model recommendations for VPS Beta (3B), VPS Alpha (7B), Mac (7B Q6_K)
- Night watch integration: health probe script with auto-restart

Fleet standard model: Qwen2.5-7B-Instruct-Q4_K_M.gguf
Default endpoint: http://localhost:11435

22 tests pass.
docs/local-llm.md (new file, +184 lines)
@@ -0,0 +1,184 @@
# Local LLM Deployment Guide — llama.cpp Sovereign Inference

## Overview

llama.cpp provides sovereign, offline-capable inference on CPU, CUDA, and
Apple Silicon. This guide standardizes deployment across the fleet.

**Golden path:** One binary, one model path, one health endpoint.

## Quick Start

```bash
# 1. Install llama.cpp (build from source)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build --config Release -j$(nproc)
sudo cp build/bin/llama-server /usr/local/bin/

# 2. Download a model
sudo mkdir -p /opt/models/llama
sudo wget -O /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"

# 3. Start the server
llama-server -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 11435 -c 4096 -t $(nproc) --cont-batching

# 4. Verify
curl http://localhost:11435/health
```

## Model Path Convention

| Path | Purpose |
|------|---------|
| `/opt/models/llama/` | Production models (system-wide) |
| `~/models/llama/` | Per-user models (development) |
| `MODEL_DIR` env var | Override default path |

All fleet nodes should use `/opt/models/llama/` for consistency.
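A launcher that honors the `MODEL_DIR` override looks roughly like this (an illustrative sketch, not a script shipped in this commit; the `MODEL` variable and its fallback value are assumptions):

```bash
#!/bin/bash
# start-llama.sh (illustrative): resolve the model directory, falling back to the fleet path
MODEL_DIR="${MODEL_DIR:-/opt/models/llama}"
MODEL="${MODEL:-Qwen2.5-7B-Instruct-Q4_K_M.gguf}"

exec llama-server -m "$MODEL_DIR/$MODEL" \
  --host 0.0.0.0 --port 11435 -c 4096 -t "$(nproc)"
```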

## Recommended Models

| Model | Size (Q4_K_M) | RAM | Tokens/sec (est.) | Use Case |
|-------|---------------|-----|-------------------|----------|
| Qwen2.5-7B-Instruct | 4.7 GB | 8 GB | 25-40 | General chat, code assist |
| Qwen2.5-3B-Instruct | 2.0 GB | 4 GB | 50-80 | Fast responses, lightweight |
| Llama-3.2-3B-Instruct | 2.0 GB | 4 GB | 50-80 | Alternative small model |
| Mistral-7B-Instruct-v0.3 | 4.4 GB | 8 GB | 25-40 | Strong reasoning |
| Phi-3.5-mini-instruct | 2.3 GB | 4 GB | 45-70 | Microsoft small model |

**Fleet standard:** `Qwen2.5-7B-Instruct-Q4_K_M.gguf`

## Quantization Guide

| Quantization | Size (7B) | Quality | Speed | Recommendation |
|--------------|-----------|---------|-------|----------------|
| Q8_0 | 7.2 GB | Excellent | Slow | Only if RAM allows |
| Q6_K | 5.5 GB | Very Good | Medium | Best quality/speed ratio |
| Q5_K_M | 5.0 GB | Good | Medium | Good balance |
| **Q4_K_M** | **4.7 GB** | **Good** | **Fast** | **Fleet standard** |
| Q3_K_M | 3.4 GB | Fair | Fast | Low-memory fallback |
| Q2_K | 2.8 GB | Poor | Very Fast | Emergency only |

**Rule of thumb:** Use Q4_K_M unless you have <6 GB RAM (then Q3_K_M) or >16 GB RAM (then Q6_K).
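If a higher-precision GGUF is already on disk, a smaller quant can usually be produced locally instead of re-downloaded. A minimal sketch, assuming the `llama-quantize` binary from the same llama.cpp build (filenames are illustrative; quantizing from an F16 source generally preserves more quality than re-quantizing an already-quantized file):

```bash
# Produce the fleet-standard Q4_K_M from a higher-precision source GGUF
./build/bin/llama-quantize \
  /opt/models/llama/Qwen2.5-7B-Instruct-F16.gguf \
  /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  Q4_K_M
```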

## Hardware Recommendations

### VPS Beta (2 vCPU, 4 GB RAM)

- Model: Qwen2.5-3B-Instruct-Q4_K_M (2.0 GB)
- Context: 2048 tokens
- Threads: 2
- Expected: ~40-60 tok/s

### VPS Alpha (4 vCPU, 8 GB RAM)

- Model: Qwen2.5-7B-Instruct-Q4_K_M (4.7 GB)
- Context: 4096 tokens
- Threads: 4
- Expected: ~20-35 tok/s

### Local Mac (Apple Silicon, 16+ GB)

- Model: Qwen2.5-7B-Instruct-Q6_K (5.5 GB)
- Context: 8192 tokens
- Metal acceleration enabled
- Expected: ~30-50 tok/s
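To make these profiles concrete, the launch invocations might look like the following (a sketch; the model filenames and the `-ngl 99` full-GPU-offload flag on the Metal build are assumptions to adjust to what is actually on disk):

```bash
# VPS Beta (2 vCPU, 4 GB): small model, short context
llama-server -m /opt/models/llama/Qwen2.5-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 11435 -c 2048 -t 2

# Local Mac (Apple Silicon): larger quant, long context, layers offloaded to Metal
llama-server -m ~/models/llama/Qwen2.5-7B-Instruct-Q6_K.gguf \
  --host 127.0.0.1 --port 11435 -c 8192 -ngl 99
```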

## Health Check

```bash
# Simple health probe
curl -sf http://localhost:11435/health && echo "OK" || echo "FAIL"

# Detailed status
curl -s http://localhost:11435/health | python3 -m json.tool

# Model loaded check
curl -s http://localhost:11435/v1/models | python3 -c "
import sys, json
data = json.load(sys.stdin)
models = [m['id'] for m in data.get('data', [])]
print(f'Loaded: {models}' if models else 'No models loaded')
"
```

## Night Watch Integration

Add to your health check cron:

```bash
#!/bin/bash
# llama-health.sh — probe local llama.cpp server
ENDPOINT="${LLAMA_ENDPOINT:-http://localhost:11435}"

if ! curl -sf "$ENDPOINT/health" > /dev/null 2>&1; then
  echo "ALERT: llama.cpp server at $ENDPOINT is DOWN"
  # Auto-restart if the systemd unit is installed (the unit may be inactive at this point,
  # so check for its existence rather than its active state)
  if systemctl cat llama-server.service > /dev/null 2>&1; then
    sudo systemctl restart llama-server
  fi
  exit 1
fi

# Verify model is loaded
MODELS=$(curl -s "$ENDPOINT/v1/models" | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(len(data.get('data', [])))
" 2>/dev/null)

if [ "$MODELS" = "0" ] || [ -z "$MODELS" ]; then
  echo "WARNING: llama.cpp server running but no model loaded"
  exit 1
fi

echo "OK: llama.cpp healthy, $MODELS model(s) loaded"
```
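How the probe is scheduled is left to each node; one option is a cron entry along these lines (the script path and log file are assumptions, adjust to the node's layout):

```bash
# /etc/cron.d/llama-health: run the probe every 5 minutes (illustrative)
*/5 * * * * root /usr/local/bin/llama-health.sh >> /var/log/llama-health.log 2>&1
```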

## Benchmarking

```bash
# Using the built-in llama_client.py benchmark
python3 bin/llama_client.py --url http://localhost:11435 benchmark \
  --prompt "Explain sovereignty in 3 sentences." --iterations 10

# Using llama.cpp native benchmark
llama-bench -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 4
```

## API Compatibility

llama-server exposes an OpenAI-compatible API:

```bash
# Chat completions (compatible with OpenAI SDK)
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'

# Raw completions
curl http://localhost:11435/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "n_predict": 128}'
```
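Streaming works against the same endpoint: setting `stream` makes the server emit server-sent-event chunks, and `curl -N` prints them as they arrive. A sketch of the raw HTTP path that `bin/llama_client.py` wraps:

```bash
# Streaming chat completion: tokens arrive as "data: {...}" SSE chunks
curl -sN http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "Write one sentence about sovereignty"}],
    "stream": true
  }'
```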

## Troubleshooting

| Problem | Cause | Fix |
|---------|-------|-----|
| Server won't start | Not enough RAM | Use smaller model or lower quantization |
| Slow inference | Wrong thread count | Match `-t` to available cores |
| Out of memory during load | Context too large | Reduce `-c` parameter |
| Model not found | Wrong path | Check `ls /opt/models/llama/` |
| Port already in use | Another process on 11435 | `lsof -i :11435` then kill |
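Quick diagnostics corresponding to the table rows (standard tools, nothing specific to this repo):

```bash
# RAM headroom, model files on disk, and who holds the port
free -h
ls -lh /opt/models/llama/
lsof -i :11435

# Recent server logs if running under systemd
journalctl -u llama-server -n 100 --no-pager
```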

## systemd Service

See `systemd/llama-server.service` in this repo. Install:

```bash
sudo cp systemd/llama-server.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```
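After enabling, confirm the unit came up and keep an eye on its logs (standard systemd commands; the unit name matches the file installed above):

```bash
# Verify the service and tail its logs
systemctl status llama-server --no-pager
journalctl -u llama-server -f -n 50

# The probe from the Health Check section should now pass
curl -sf http://localhost:11435/health && echo "OK"
```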