Files
the-nexus/docs/local-llm.md
Timmy (WHIP) ac2ec40657
feat: standardize llama.cpp backend for sovereign local inference
Closes #1123. Implements all three phases of the local LLM standardization:

PHASE 1 — Deployment:
- docs/local-llm.md: full deployment guide (build, model download, health check,
  model path convention /opt/models/llama/, hardware recommendations)
- systemd/llama-server.service: hardened unit with resource limits and auto-restart
- Health check: /health endpoint + model loaded verification

PHASE 2 — Hermes Integration:
- bin/llama_client.py: OpenAI-compatible Python client wrapping llama.cpp HTTP API
  (chat completions, streaming, raw completions, health check, model listing,
  benchmarking, full CLI interface)
- nexus/llama_provider.py: Hermes inference router provider adapter
  - Activates when external APIs fail, LOCAL_ONLY=true, or explicit local request
  - Response format normalized to OpenAI-compatible chat completions
  - Token usage estimated and logged
  - Health caching with TTL for efficiency

PHASE 3 — Optimization & Ops:
- Benchmarking: client.benchmark() + CLI benchmark command
- Quantization guide: Q4_K_M recommended for fleet, Q6_K for high-RAM, Q3_K for low
- Model recommendations for VPS Beta (3B), VPS Alpha (7B), Mac (7B Q6_K)
- Night watch integration: health probe script with auto-restart

Fleet standard model: Qwen2.5-7B-Instruct-Q4_K_M.gguf
Default endpoint: http://localhost:11435

22 tests pass.
2026-04-13 21:16:31 -04:00


# Local LLM Deployment Guide — llama.cpp Sovereign Inference
## Overview
llama.cpp provides sovereign, offline-capable inference on CPUs, CUDA GPUs, and
Apple Silicon (Metal). This guide standardizes deployment across the fleet.

**Golden path:** One binary, one model path, one health endpoint.
## Quick Start
```bash
# 1. Install llama.cpp (build from source)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build --config Release -j$(nproc)
sudo cp build/bin/llama-server /usr/local/bin/
# 2. Download a model
mkdir -p /opt/models/llama
wget -O /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"
# 3. Start the server
llama-server -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 11435 -c 4096 -t $(nproc) --cont-batching
# 4. Verify
curl http://localhost:11435/health
```
## Model Path Convention
| Path | Purpose |
|------|---------|
| `/opt/models/llama/` | Production models (system-wide) |
| `~/models/llama/` | Per-user models (development) |
| `MODEL_DIR` env var | Override default path |

All fleet nodes should use `/opt/models/llama/` for consistency.
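
A launcher that honors the `MODEL_DIR` override from the table above might look like the sketch below; the resolution logic itself is illustrative, only the paths and flags come from this guide.
```bash
# Resolve the model directory: MODEL_DIR overrides the fleet default
MODEL_DIR="${MODEL_DIR:-/opt/models/llama}"
MODEL_FILE="$MODEL_DIR/Qwen2.5-7B-Instruct-Q4_K_M.gguf"

if [ ! -f "$MODEL_FILE" ]; then
  echo "Model not found: $MODEL_FILE" >&2
  exit 1
fi

llama-server -m "$MODEL_FILE" --host 0.0.0.0 --port 11435 -c 4096 -t "$(nproc)"
```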
## Recommended Models
| Model | Size (Q4_K_M) | RAM | Tokens/sec (est.) | Use Case |
|-------|---------------|-----|-------------------|----------|
| Qwen2.5-7B-Instruct | 4.7 GB | 8 GB | 25-40 | General chat, code assist |
| Qwen2.5-3B-Instruct | 2.0 GB | 4 GB | 50-80 | Fast responses, lightweight |
| Llama-3.2-3B-Instruct | 2.0 GB | 4 GB | 50-80 | Alternative small model |
| Mistral-7B-Instruct-v0.3 | 4.4 GB | 8 GB | 25-40 | Strong reasoning |
| Phi-3.5-mini-instruct | 2.3 GB | 4 GB | 45-70 | Microsoft small model |

**Fleet standard:** `Qwen2.5-7B-Instruct-Q4_K_M.gguf`
## Quantization Guide
| Quantization | Size (7B) | Quality | Speed | Recommendation |
|-------------|-----------|---------|-------|----------------|
| Q8_0 | 7.2 GB | Excellent | Slow | Only if RAM allows |
| Q6_K | 5.5 GB | Very Good | Medium | Best quality/speed ratio |
| Q5_K_M | 5.0 GB | Good | Medium | Good balance |
| **Q4_K_M** | **4.7 GB** | **Good** | **Fast** | **Fleet standard** |
| Q3_K_M | 3.4 GB | Fair | Fast | Low-memory fallback |
| Q2_K | 2.8 GB | Poor | Very Fast | Emergency only |

**Rule of thumb:** Use Q4_K_M unless you have <6GB RAM (then Q3_K_M) or >16GB RAM (then Q6_K).
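
When provisioning a new node, the rule of thumb can be scripted; a minimal Linux-only sketch (thresholds mirror the table above, the script itself is illustrative):
```bash
# Pick a quantization level from available RAM, per the rule of thumb above
RAM_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
if [ "$RAM_MB" -lt 6144 ]; then
  QUANT="Q3_K_M"   # under ~6 GB RAM: low-memory fallback
elif [ "$RAM_MB" -gt 16384 ]; then
  QUANT="Q6_K"     # plenty of RAM: take the quality win
else
  QUANT="Q4_K_M"   # fleet standard
fi
echo "Recommended quantization for ${RAM_MB} MB RAM: $QUANT"
```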
## Hardware Recommendations
### VPS Beta (2 vCPU, 4 GB RAM)
- Model: Qwen2.5-3B-Instruct-Q4_K_M (2.0 GB)
- Context: 2048 tokens
- Threads: 2
- Expected: ~40-60 tok/s
### VPS Alpha (4 vCPU, 8 GB RAM)
- Model: Qwen2.5-7B-Instruct-Q4_K_M (4.7 GB)
- Context: 4096 tokens
- Threads: 4
- Expected: ~20-35 tok/s
### Local Mac (Apple Silicon, 16+ GB)
- Model: Qwen2.5-7B-Instruct-Q6_K (5.5 GB)
- Context: 8192 tokens
- Metal acceleration enabled
- Expected: ~30-50 tok/s
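
For concreteness, illustrative launch commands for the three profiles above. The flags follow the Quick Start invocation; the exact `.gguf` filenames for the 3B and Q6_K builds are assumptions, and `-ngl 99` (GPU layer offload) only applies to a Metal or CUDA build.
```bash
# VPS Beta (2 vCPU, 4 GB): small model, short context
llama-server -m /opt/models/llama/Qwen2.5-3B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 11435 -c 2048 -t 2 --cont-batching

# VPS Alpha (4 vCPU, 8 GB): fleet-standard model
llama-server -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 11435 -c 4096 -t 4 --cont-batching

# Local Mac (Apple Silicon): higher-quality quant, offload all layers to Metal
llama-server -m ~/models/llama/Qwen2.5-7B-Instruct-Q6_K.gguf \
  --host 127.0.0.1 --port 11435 -c 8192 -ngl 99 --cont-batching
```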
## Health Check
```bash
# Simple health probe
curl -sf http://localhost:11435/health && echo "OK" || echo "FAIL"
# Detailed status
curl -s http://localhost:11435/health | python3 -m json.tool
# Model loaded check
curl -s http://localhost:11435/v1/models | python3 -c "
import sys, json
data = json.load(sys.stdin)
models = [m['id'] for m in data.get('data', [])]
print(f'Loaded: {models}' if models else 'No models loaded')
"
```
## Night Watch Integration
Add to your health check cron:
```bash
#!/bin/bash
# llama-health.sh — probe local llama.cpp server
ENDPOINT="${LLAMA_ENDPOINT:-http://localhost:11435}"
if ! curl -sf "$ENDPOINT/health" > /dev/null 2>&1; then
  echo "ALERT: llama.cpp server at $ENDPOINT is DOWN"
  # Auto-restart via systemd if the llama-server unit is active but unresponsive
  systemctl is-active --quiet llama-server && sudo systemctl restart llama-server
  exit 1
fi
# Verify a model is loaded
MODELS=$(curl -s "$ENDPOINT/v1/models" | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(len(data.get('data', [])))
" 2>/dev/null)
if [ "$MODELS" = "0" ] || [ -z "$MODELS" ]; then
  echo "WARNING: llama.cpp server running but no model loaded"
  exit 1
fi
echo "OK: llama.cpp healthy, $MODELS model(s) loaded"
```
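A possible crontab entry for the probe above; the script path `/opt/nexus/bin/llama-health.sh` and the log location are placeholders, not a convention from this guide.
```bash
# Probe every 5 minutes; append output so the night watch can review it
*/5 * * * * /opt/nexus/bin/llama-health.sh >> /var/log/llama-health.log 2>&1
```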
## Benchmarking
```bash
# Using the built-in llama_client.py benchmark
python3 bin/llama_client.py --url http://localhost:11435 \
  benchmark --prompt "Explain sovereignty in 3 sentences." --iterations 10
# Using llama.cpp native benchmark
llama-bench -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf -t 4
```
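If `bin/llama_client.py` is not available on a node, a rough tokens/sec figure can be read from the server's own timing stats. A minimal sketch against the `/completion` endpoint; the `timings` field names reflect current llama-server responses and may differ between versions.
```bash
curl -s http://localhost:11435/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain sovereignty in 3 sentences.", "n_predict": 128}' \
  | python3 -c "
import sys, json
data = json.load(sys.stdin)
t = data.get('timings', {})
print('predicted tokens:', t.get('predicted_n'))
print('tokens/sec      :', t.get('predicted_per_second'))
"
```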
## API Compatibility
llama-server exposes an OpenAI-compatible API:
```bash
# Chat completions (compatible with OpenAI SDK)
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
    "temperature": 0.7
  }'
# Raw completions
curl http://localhost:11435/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "n_predict": 128}'
```
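Because the API is OpenAI-compatible, the official `openai` Python package (v1+) can be pointed at the local server. A minimal sketch, assuming the package is installed; the `api_key` value is a placeholder, since llama-server does not require one unless started with `--api-key`.
```bash
python3 <<'EOF'
from openai import OpenAI

# Point the SDK at the local llama-server; the key is a placeholder
client = OpenAI(base_url="http://localhost:11435/v1", api_key="sk-local")
resp = client.chat.completions.create(
    model="qwen2.5-7b",  # llama-server serves whichever model it loaded, regardless of this name
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=256,
    temperature=0.7,
)
print(resp.choices[0].message.content)
EOF
```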
## Troubleshooting
| Problem | Cause | Fix |
|---------|-------|-----|
| Server won't start | Not enough RAM | Use smaller model or lower quantization |
| Slow inference | Wrong thread count | Match `-t` to available cores |
| Out of memory during load | Context too large | Reduce `-c` parameter |
| Model not found | Wrong path | Check `ls /opt/models/llama/` |
| Port already in use | Another process on 11435 | `lsof -i :11435` then kill |
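
For the last row, a generic way to inspect and free the port (check what owns it before killing anything):
```bash
lsof -i :11435               # see which process holds the port
lsof -ti :11435 | xargs kill # stop it if it is safe to do so
```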
## systemd Service
See `systemd/llama-server.service` in this repo. Install:
```bash
sudo cp systemd/llama-server.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```