[LOCAL-LLM] Standardize llama.cpp Backend for Sovereign Inference #1123

Open
opened 2026-04-07 21:17:07 +00:00 by Timmy · 0 comments

Objective

Standardize local LLM inference across the fleet using llama.cpp as a sovereign, offline-capable backend for Hermes.

Background

We currently rely on external APIs (OpenAI, Anthropic, Kimi). Alpha has an ad hoc local fallback at 127.0.0.1:11435. llama.cpp is the gold standard for efficient local inference on CPU, CUDA, and Apple Silicon; a standardized local backend gives us resilience, privacy, and cost control.

Acceptance Criteria

Phase 1 — Deployment (1 week)

  • llama.cpp server (llama-server) is installed and running on Beta and Alpha
  • One standard GGUF model is deployed on both nodes (suggestion: Qwen2.5-7B-Instruct-Q4_K_M.gguf)
  • Health check endpoint (/health) is probed by Night Watch; failure triggers alert
  • Model files are stored in a predictable path (/opt/models/ or ~/models/)
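A minimal deployment sketch for the steps above (model name, port, and paths come from this issue; the download URL placeholder and server flags are assumptions to check against upstream llama.cpp docs):

```shell
# Deployment sketch for Beta/Alpha. The fetch/start commands are shown as
# comments because they require the llama-server binary and a multi-GB download.
MODEL_DIR=/opt/models
MODEL="Qwen2.5-7B-Instruct-Q4_K_M.gguf"

# Fetch the GGUF and start llama-server bound to loopback:
#   curl -L -o "$MODEL_DIR/$MODEL" "https://huggingface.co/<repo>/resolve/main/$MODEL"
#   llama-server -m "$MODEL_DIR/$MODEL" --host 127.0.0.1 --port 11435 -c 4096 &

# Health probe in the shape Night Watch would run; llama-server exposes
# GET /health out of the box.
probe() {
  curl -sf --max-time 2 "http://$1/health" >/dev/null && echo healthy || echo unhealthy
}
probe 127.0.0.1:11435
```

With no server running, the probe prints `unhealthy`, which is exactly the condition that should trigger the alert.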

Phase 2 — Hermes Integration (1 week)

  • Hermes inference router can fall back to the local llama.cpp server when:
    • an external API rate-limits, or
    • a config flag LOCAL_ONLY=true is set, or
    • the user explicitly requests a local model
  • Response format from llama.cpp is normalized to match OpenAI-compatible chat completions
  • Token usage is estimated and logged (even if approximate)
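The fallback conditions and the response normalization above can be sketched as pure functions. Names (`should_use_local`, `normalize_chat_response`) are illustrative, not existing Hermes APIs; the llama.cpp reply fields (`content`, `tokens_evaluated`, `tokens_predicted`, `stopped_eos`) are those of its plain /completion endpoint — note llama-server also exposes an OpenAI-compatible /v1/chat/completions natively:

```python
import os


def should_use_local(rate_limited: bool, user_requested_local: bool) -> bool:
    """True when any of the three fallback conditions from this issue hold."""
    if os.environ.get("LOCAL_ONLY") == "true":
        return True
    return rate_limited or user_requested_local


def normalize_chat_response(raw: dict) -> dict:
    """Map a llama.cpp /completion-style reply onto the OpenAI chat shape.

    Token counts come straight from the server reply, satisfying the
    'estimated and logged (even if approximate)' criterion.
    """
    prompt_toks = raw.get("tokens_evaluated", 0)
    completion_toks = raw.get("tokens_predicted", 0)
    return {
        "object": "chat.completion",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": raw.get("content", "")},
            "finish_reason": "stop" if raw.get("stopped_eos") else "length",
        }],
        "usage": {
            "prompt_tokens": prompt_toks,
            "completion_tokens": completion_toks,
            "total_tokens": prompt_toks + completion_toks,
        },
    }
```

Keeping these pure (no I/O) lets the router unit-test the fallback policy without a live server.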

Phase 3 — Optimization & Ops (1 week)

  • Benchmark: measure tokens/sec on Beta hardware and document in the-nexus/docs/local-llm.md
  • Quantization guide: which GGUF quantization levels (e.g. Q4_K_M vs. Q5_K_M vs. Q8_0) run well on our VPS specs
  • Auto-restart systemd service or supervisor config for llama-server
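The auto-restart bullet could look like the following systemd unit sketch (binary path, context size, and restart policy are assumptions; model path and port are from this issue):

```ini
# /etc/systemd/system/llama-server.service (sketch)
[Unit]
Description=llama.cpp inference server
After=network.target

[Service]
ExecStart=/usr/local/bin/llama-server -m /opt/models/Qwen2.5-7B-Instruct-Q4_K_M.gguf --host 127.0.0.1 --port 11435 -c 4096
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`Restart=on-failure` with a short `RestartSec` covers crashes without masking a persistently broken config, since repeated failures will still surface in `systemctl status`.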

Suggested Implementation Path

  1. Download or build the llama.cpp server binary (llama-server) on Beta
  2. Add tools/llama_client.py wrapping the llama.cpp HTTP API
  3. Modify inference router in Hermes to treat it as a first-class provider
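A stdlib-only sketch of what tools/llama_client.py could look like (the module path is from this issue; the class name, defaults, and method names are assumptions). It targets llama-server's OpenAI-compatible /v1/chat/completions endpoint so the Hermes router can treat it like any external provider:

```python
import json
import urllib.request


class LlamaClient:
    """Thin wrapper over the llama.cpp HTTP API (hypothetical interface)."""

    def __init__(self, base_url: str = "http://127.0.0.1:11435"):
        self.base_url = base_url.rstrip("/")

    def build_payload(self, messages, temperature=0.7, max_tokens=512) -> dict:
        # Same request shape OpenAI-compatible providers expect, which is
        # what lets the router treat this as a first-class provider.
        return {
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }

    def chat(self, messages, **kwargs) -> dict:
        req = urllib.request.Request(
            f"{self.base_url}/v1/chat/completions",
            data=json.dumps(self.build_payload(messages, **kwargs)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.load(resp)
```

Separating `build_payload` from the network call keeps the request shape testable offline.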

Owner

Bezalel

Linked Epic

#1120

bezalel was assigned by Timmy 2026-04-08 14:00:54 +00:00

Reference: Timmy_Foundation/the-nexus#1123