Closes #1123. Implements all three phases of the local LLM standardization.

**Phase 1 — Deployment**
- `docs/local-llm.md`: full deployment guide (build, model download, health check, model path convention `/opt/models/llama/`, hardware recommendations)
- `systemd/llama-server.service`: hardened unit with resource limits and auto-restart
- Health check: `/health` endpoint plus model-loaded verification

**Phase 2 — Hermes Integration**
- `bin/llama_client.py`: OpenAI-compatible Python client wrapping the llama.cpp HTTP API (chat completions, streaming, raw completions, health check, model listing, benchmarking, full CLI interface)
- `nexus/llama_provider.py`: Hermes inference-router provider adapter
  - Activates when external APIs fail, `LOCAL_ONLY=true`, or on an explicit local request
  - Response format normalized to OpenAI-compatible chat completions
  - Token usage estimated and logged
  - Health caching with TTL for efficiency

**Phase 3 — Optimization & Ops**
- Benchmarking: `client.benchmark()` plus a CLI `benchmark` command
- Quantization guide: Q4_K_M recommended for the fleet, Q6_K for high-RAM hosts, Q3_K for low-RAM hosts
- Model recommendations for VPS Beta (3B), VPS Alpha (7B), Mac (7B Q6_K)
- Night watch integration: health probe script with auto-restart

Fleet standard model: `Qwen2.5-7B-Instruct-Q4_K_M.gguf`
Default endpoint: `http://localhost:11435`

22 tests pass.
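The Phase 2 activation conditions can be sketched as a single routing predicate. This is a minimal sketch, not the shipped `nexus/llama_provider.py` code; the function name `should_route_local` is illustrative:

```python
import os

def should_route_local(external_ok: bool, explicit_local: bool = False) -> bool:
    """Decide whether a request should go to the local llama.cpp provider.

    Mirrors the three activation conditions listed above: an explicit
    local request, LOCAL_ONLY=true in the environment, or external
    API failure.
    """
    if explicit_local:
        return True
    if os.environ.get("LOCAL_ONLY", "").lower() == "true":
        return True
    return not external_ok
```

In this sketch the explicit request wins unconditionally, so a caller can force local inference even while external APIs are healthy.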
# Local LLM Deployment Guide — llama.cpp Sovereign Inference

## Overview

llama.cpp provides sovereign, offline-capable inference on CPU, CUDA, and Apple Silicon. This guide standardizes deployment across the fleet.

**Golden path:** one binary, one model path, one health endpoint.
## Quick Start

```bash
# 1. Install llama.cpp (build from source)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build --config Release -j$(nproc)
sudo cp build/bin/llama-server /usr/local/bin/

# 2. Download a model
mkdir -p /opt/models/llama
wget -O /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"

# 3. Start the server
llama-server -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 11435 -c 4096 -t $(nproc) --cont-batching

# 4. Verify
curl http://localhost:11435/health
```
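For scripted verification, the same `/health` probe can be done from Python with only the standard library. A minimal sketch (the function name is illustrative, not part of `bin/llama_client.py`):

```python
import urllib.request
import urllib.error

def server_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the llama-server /health endpoint answers HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat all as unhealthy.
        return False
```

Any transport-level failure is collapsed to `False`, so callers only need a boolean check before routing traffic to the server.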
## Model Path Convention

| Path | Purpose |
|---|---|
| `/opt/models/llama/` | Production models (system-wide) |
| `~/models/llama/` | Per-user models (development) |
| `MODEL_DIR` env var | Override default path |

All fleet nodes should use `/opt/models/llama/` for consistency.
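The convention above can be expressed as a small resolver. `resolve_model_path` is a hypothetical helper written for illustration, not a documented part of the client:

```python
import os
from pathlib import Path

def resolve_model_path(filename: str) -> Path:
    """Resolve a model filename against the fleet path convention:
    the MODEL_DIR environment variable overrides the default,
    otherwise /opt/models/llama/ is used."""
    base = os.environ.get("MODEL_DIR", "/opt/models/llama")
    return Path(base) / filename
```

Per-user development setups would point `MODEL_DIR` at `~/models/llama` rather than hardcoding the alternative path.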
## Recommended Models
| Model | Size (Q4_K_M) | RAM | Tokens/sec (est.) | Use Case |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 4.7 GB | 8 GB | 25-40 | General chat, code assist |
| Qwen2.5-3B-Instruct | 2.0 GB | 4 GB | 50-80 | Fast responses, lightweight |
| Llama-3.2-3B-Instruct | 2.0 GB | 4 GB | 50-80 | Alternative small model |
| Mistral-7B-Instruct-v0.3 | 4.4 GB | 8 GB | 25-40 | Strong reasoning |
| Phi-3.5-mini-instruct | 2.3 GB | 4 GB | 45-70 | Microsoft small model |
**Fleet standard:** `Qwen2.5-7B-Instruct-Q4_K_M.gguf`
## Quantization Guide
| Quantization | Size (7B) | Quality | Speed | Recommendation |
|---|---|---|---|---|
| Q8_0 | 7.2 GB | Excellent | Slow | Only if RAM allows |
| Q6_K | 5.5 GB | Very Good | Medium | Best quality/speed ratio |
| Q5_K_M | 5.0 GB | Good | Medium | Good balance |
| Q4_K_M | 4.7 GB | Good | Fast | Fleet standard |
| Q3_K_M | 3.4 GB | Fair | Fast | Low-memory fallback |
| Q2_K | 2.8 GB | Poor | Very Fast | Emergency only |
**Rule of thumb:** use Q4_K_M unless you have < 6 GB RAM (then Q3_K_M) or > 16 GB RAM (then Q6_K).
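The rule of thumb reduces to a three-way branch on available RAM. A sketch (the helper name is illustrative):

```python
def pick_quantization(ram_gb: float) -> str:
    """Apply the rule of thumb above: Q3_K_M under 6 GB RAM,
    Q6_K above 16 GB, and the fleet-standard Q4_K_M otherwise."""
    if ram_gb < 6:
        return "Q3_K_M"
    if ram_gb > 16:
        return "Q6_K"
    return "Q4_K_M"
```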
## Hardware Recommendations

### VPS Beta (2 vCPU, 4 GB RAM)
- Model: Qwen2.5-3B-Instruct-Q4_K_M (2.0 GB)
- Context: 2048 tokens
- Threads: 2
- Expected: ~40-60 tok/s

### VPS Alpha (4 vCPU, 8 GB RAM)
- Model: Qwen2.5-7B-Instruct-Q4_K_M (4.7 GB)
- Context: 4096 tokens
- Threads: 4
- Expected: ~20-35 tok/s

### Local Mac (Apple Silicon, 16+ GB)
- Model: Qwen2.5-7B-Instruct-Q6_K (5.5 GB)
- Context: 8192 tokens
- Metal acceleration enabled
- Expected: ~30-50 tok/s
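The per-host profiles above can be captured in a lookup table that emits the matching `llama-server` command line. A sketch under assumptions: the host-class keys and the Mac thread count of 8 are illustrative, not fleet-confirmed values:

```python
# Host-class names and the Mac thread count are assumptions for this sketch.
FLEET_PROFILES = {
    "vps-beta":  {"model": "Qwen2.5-3B-Instruct-Q4_K_M.gguf", "ctx": 2048, "threads": 2},
    "vps-alpha": {"model": "Qwen2.5-7B-Instruct-Q4_K_M.gguf", "ctx": 4096, "threads": 4},
    "mac":       {"model": "Qwen2.5-7B-Instruct-Q6_K.gguf",   "ctx": 8192, "threads": 8},
}

def server_args(host_class: str, model_dir: str = "/opt/models/llama") -> list:
    """Build the llama-server argv for a given host class."""
    p = FLEET_PROFILES[host_class]
    return ["llama-server", "-m", f"{model_dir}/{p['model']}",
            "-c", str(p["ctx"]), "-t", str(p["threads"]), "--cont-batching"]
```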
## Health Check

```bash
# Simple health probe
curl -sf http://localhost:11435/health && echo "OK" || echo "FAIL"

# Detailed status
curl -s http://localhost:11435/health | python3 -m json.tool

# Model loaded check
curl -s http://localhost:11435/v1/models | python3 -c "
import sys, json
data = json.load(sys.stdin)
models = [m['id'] for m in data.get('data', [])]
print(f'Loaded: {models}' if models else 'No models loaded')
"
```
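The provider adapter caches health results with a TTL rather than probing `/health` on every request. A minimal sketch of that pattern, with the probe injected as a callable so the cache logic stays self-contained (class name illustrative):

```python
import time

class CachedHealth:
    """Cache a health probe result for ttl seconds, avoiding a
    round-trip to /health on every inference request."""

    def __init__(self, probe, ttl: float = 30.0):
        self.probe = probe          # callable returning truthy when healthy
        self.ttl = ttl
        self._value = None
        self._checked_at = 0.0

    def healthy(self) -> bool:
        now = time.monotonic()
        if self._value is None or now - self._checked_at > self.ttl:
            self._value = bool(self.probe())
            self._checked_at = now
        return self._value
```

Using `time.monotonic()` keeps the cache immune to wall-clock adjustments; a 30-second TTL means at most two probes per minute regardless of request volume.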
## Night Watch Integration

Add to your health check cron:

```bash
#!/bin/bash
# llama-health.sh — probe local llama.cpp server
ENDPOINT="${LLAMA_ENDPOINT:-http://localhost:11435}"

if ! curl -sf "$ENDPOINT/health" > /dev/null 2>&1; then
    echo "ALERT: llama.cpp server at $ENDPOINT is DOWN"
    # Auto-restart if the systemd unit exists (systemctl cat fails for unknown units)
    systemctl cat llama-server > /dev/null 2>&1 && sudo systemctl restart llama-server
    exit 1
fi

# Verify a model is loaded
MODELS=$(curl -s "$ENDPOINT/v1/models" | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(len(data.get('data', [])))
" 2>/dev/null)

if [ "$MODELS" = "0" ] || [ -z "$MODELS" ]; then
    echo "WARNING: llama.cpp server running but no model loaded"
    exit 1
fi

echo "OK: llama.cpp healthy, $MODELS model(s) loaded"
```
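A matching crontab entry might look like this; the install path and five-minute interval are assumptions, adjust to your night watch schedule:

```cron
*/5 * * * * /usr/local/bin/llama-health.sh >> /var/log/llama-health.log 2>&1
```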
## Benchmarking

```bash
# Using the built-in llama_client.py benchmark
python3 bin/llama_client.py --url http://localhost:11435 benchmark \
    --prompt "Explain sovereignty in 3 sentences." --iterations 10

# Using llama.cpp's native benchmark
llama-bench -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 4
```
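The throughput figure a benchmark reports is simply generated tokens over wall-clock generation time. As a sketch (the helper name is illustrative, not the actual `client.benchmark()` internals):

```python
def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput metric: generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return completion_tokens / elapsed_s
```

Averaging this over several iterations, as the `--iterations 10` flag suggests, smooths out warm-up effects like the first-request prompt-cache miss.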
## API Compatibility

llama-server exposes an OpenAI-compatible API:

```bash
# Chat completions (compatible with OpenAI SDK)
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'

# Raw completions (llama.cpp-native endpoint)
curl http://localhost:11435/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "n_predict": 128}'
```
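The same chat-completion body can be built from Python before POSTing it to `/v1/chat/completions`. A standard-library sketch; `build_chat_payload` is a hypothetical helper showing the request shape, not part of the shipped client:

```python
import json
import urllib.request

def build_chat_payload(messages, model="qwen2.5-7b",
                       max_tokens=256, temperature=0.7) -> bytes:
    """Serialize an OpenAI-style chat completion request body."""
    return json.dumps({
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode()

def chat(base_url: str, messages) -> str:
    """POST the payload to llama-server and return the assistant reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_chat_payload(messages),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the response follows the OpenAI chat-completions schema, the reply text always sits at `choices[0].message.content`.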
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| Server won't start | Not enough RAM | Use a smaller model or lower quantization |
| Slow inference | Wrong thread count | Match `-t` to available cores |
| Out of memory during load | Context too large | Reduce the `-c` parameter |
| Model not found | Wrong path | Check `ls /opt/models/llama/` |
| Port already in use | Another process on 11435 | `lsof -i :11435`, then kill it |
## systemd Service

See `systemd/llama-server.service` in this repo. Install:

```bash
sudo cp systemd/llama-server.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```