[LOCAL-LLM] Standardize llama.cpp Backend for Sovereign Inference #1123

Open
opened 2026-04-07 21:17:07 +00:00 by Timmy · 0 comments

Objective

Standardize local LLM inference across the fleet using llama.cpp as a sovereign, offline-capable backend for Hermes.

Background

We currently rely on external APIs (OpenAI, Anthropic, Kimi). Alpha has an ad hoc local fallback at 127.0.0.1:11435. llama.cpp is the gold standard for efficient local inference on CPU, CUDA, and Apple Silicon; a standardized local backend gives us resilience, privacy, and cost control.

Acceptance Criteria

Phase 1 — Deployment (1 week)

  • llama.cpp server (llama-server) is installed and running on Beta and Alpha
  • One standard GGUF model is deployed on both nodes (suggestion: Qwen2.5-7B-Instruct-Q4_K_M.gguf)
  • Health check endpoint (/health) is probed by Night Watch; failure triggers alert
  • Model files are stored in a predictable path (/opt/models/ or ~/models/)
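A minimal deployment sketch for the steps above (model name, port, and paths come from this issue; the download URL placeholder and server flags are assumptions to check against upstream llama.cpp docs):

```shell
# Deployment sketch for Beta/Alpha. The fetch/start commands are shown as
# comments because they require the llama-server binary and a multi-GB download.
MODEL_DIR=/opt/models
MODEL="Qwen2.5-7B-Instruct-Q4_K_M.gguf"

# Fetch the GGUF and start llama-server bound to loopback:
#   curl -L -o "$MODEL_DIR/$MODEL" "https://huggingface.co/<repo>/resolve/main/$MODEL"
#   llama-server -m "$MODEL_DIR/$MODEL" --host 127.0.0.1 --port 11435 -c 4096 &

# Health probe in the shape Night Watch would run; llama-server exposes
# GET /health out of the box.
probe() {
  curl -sf --max-time 2 "http://$1/health" >/dev/null && echo healthy || echo unhealthy
}
probe 127.0.0.1:11435
```

With no server running, the probe prints `unhealthy`, which is exactly the condition that should trigger the alert.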

Phase 2 — Hermes Integration (1 week)

  • Hermes inference router can fall back to the local llama.cpp server when:
    • an external API rate-limits, or
    • a config flag LOCAL_ONLY=true is set, or
    • the user explicitly requests a local model
  • Response format from llama.cpp is normalized to match OpenAI-compatible chat completions
  • Token usage is estimated and logged (even if approximate)
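The fallback conditions and the response normalization above can be sketched as pure functions. Names (`should_use_local`, `normalize_chat_response`) are illustrative, not existing Hermes APIs; the llama.cpp reply fields (`content`, `tokens_evaluated`, `tokens_predicted`, `stopped_eos`) are those of its plain /completion endpoint — note llama-server also exposes an OpenAI-compatible /v1/chat/completions natively:

```python
import os


def should_use_local(rate_limited: bool, user_requested_local: bool) -> bool:
    """True when any of the three fallback conditions from this issue hold."""
    if os.environ.get("LOCAL_ONLY") == "true":
        return True
    return rate_limited or user_requested_local


def normalize_chat_response(raw: dict) -> dict:
    """Map a llama.cpp /completion-style reply onto the OpenAI chat shape.

    Token counts come straight from the server reply, satisfying the
    'estimated and logged (even if approximate)' criterion.
    """
    prompt_toks = raw.get("tokens_evaluated", 0)
    completion_toks = raw.get("tokens_predicted", 0)
    return {
        "object": "chat.completion",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": raw.get("content", "")},
            "finish_reason": "stop" if raw.get("stopped_eos") else "length",
        }],
        "usage": {
            "prompt_tokens": prompt_toks,
            "completion_tokens": completion_toks,
            "total_tokens": prompt_toks + completion_toks,
        },
    }
```

Keeping these pure (no I/O) lets the router unit-test the fallback policy without a live server.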

Phase 3 — Optimization & Ops (1 week)

  • Benchmark: measure tokens/sec on Beta hardware and document in the-nexus/docs/local-llm.md
  • Quantization guide: which GGUF quantization levels (e.g. Q4_K_M vs. Q5_K_M vs. Q8_0) run well on our VPS specs
  • Auto-restart systemd service or supervisor config for llama-server
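The auto-restart bullet could look like the following systemd unit sketch (binary path, context size, and restart policy are assumptions; model path and port are from this issue):

```ini
# /etc/systemd/system/llama-server.service (sketch)
[Unit]
Description=llama.cpp inference server
After=network.target

[Service]
ExecStart=/usr/local/bin/llama-server -m /opt/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf --host 127.0.0.1 --port 11435 -c 4096
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`Restart=on-failure` with a short `RestartSec` covers crashes without masking a persistently broken config, since repeated failures will still surface in `systemctl status`.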

Suggested Implementation Path

  1. Download or build the llama.cpp server binary (llama-server) on Beta
  2. Add tools/llama_client.py wrapping the llama.cpp HTTP API
  3. Modify inference router in Hermes to treat it as a first-class provider
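A stdlib-only sketch of what tools/llama_client.py could look like (the module path is from this issue; the class name, defaults, and method names are assumptions). It targets llama-server's OpenAI-compatible /v1/chat/completions endpoint so the Hermes router can treat it like any external provider:

```python
import json
import urllib.request


class LlamaClient:
    """Thin wrapper over the llama.cpp HTTP API (hypothetical interface)."""

    def __init__(self, base_url: str = "http://127.0.0.1:11435"):
        self.base_url = base_url.rstrip("/")

    def build_payload(self, messages, temperature=0.7, max_tokens=512) -> dict:
        # Same request shape OpenAI-compatible providers expect, which is
        # what lets the router treat this as a first-class provider.
        return {
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }

    def chat(self, messages, **kwargs) -> dict:
        req = urllib.request.Request(
            f"{self.base_url}/v1/chat/completions",
            data=json.dumps(self.build_payload(messages, **kwargs)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.load(resp)
```

Separating `build_payload` from the network call keeps the request shape testable offline.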

Owner

Bezalel

Linked Epic

#1120

bezalel was assigned by Timmy 2026-04-08 14:00:54 +00:00

Reference: Timmy_Foundation/the-nexus#1123