the-nexus/systemd/llama-server.service
Timmy (WHIP) ac2ec40657
feat: standardize llama.cpp backend for sovereign local inference
Closes #1123. Implements all three phases of the local LLM standardization:

PHASE 1 — Deployment:
- docs/local-llm.md: full deployment guide (build, model download, health check,
  model path convention /opt/models/llama/, hardware recommendations)
- systemd/llama-server.service: hardened unit with resource limits and auto-restart
- Health check: /health endpoint + model loaded verification
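
A minimal sketch of that health check against the default endpoint (assumes the server from the unit below is listening on localhost:11435; llama.cpp's /health route answers 200 once the model is loaded, and the helper name llama_healthy is illustrative):

    import urllib.error
    import urllib.request

    def llama_healthy(base_url: str = "http://localhost:11435") -> bool:
        """Return True if llama-server answers /health with 200 (model loaded)."""
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-200 status (e.g. still loading).
            return False

    if __name__ == "__main__":
        print("healthy" if llama_healthy() else "unhealthy")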

PHASE 2 — Hermes Integration:
- bin/llama_client.py: OpenAI-compatible Python client wrapping llama.cpp HTTP API
  (chat completions, streaming, raw completions, health check, model listing,
  benchmarking, full CLI interface)
- nexus/llama_provider.py: Hermes inference router provider adapter
  - Activates when external APIs fail, LOCAL_ONLY=true, or explicit local request
  - Response format normalized to OpenAI-compatible chat completions
  - Token usage estimated and logged
  - Health caching with TTL for efficiency
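
Roughly, the activation and fallback behaviour above in Python (the /v1/chat/completions route is llama.cpp's OpenAI-compatible endpoint; the names local_chat, route and call_external are illustrative, not the actual llama_provider.py API):

    import json
    import os
    import urllib.request

    LLAMA_URL = "http://localhost:11435"  # default endpoint from this change

    def local_chat(messages, temperature=0.7, max_tokens=512):
        """POST to llama-server's OpenAI-compatible chat completions endpoint."""
        payload = json.dumps({
            # 'model' is informational; llama-server serves the model it was launched with.
            "model": "Qwen2.5-7B-Instruct-Q4_K_M",
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        }).encode()
        req = urllib.request.Request(
            f"{LLAMA_URL}/v1/chat/completions",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=120) as resp:
            return json.loads(resp.read())

    def route(messages, call_external):
        """Prefer the external API unless LOCAL_ONLY is set or the call fails."""
        if os.environ.get("LOCAL_ONLY", "").lower() == "true":
            return local_chat(messages)
        try:
            return call_external(messages)
        except Exception:
            return local_chat(messages)  # external API failed: fall back locally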

PHASE 3 — Optimization & Ops:
- Benchmarking: client.benchmark() + CLI benchmark command
- Quantization guide: Q4_K_M recommended for the fleet, Q6_K for high-RAM hosts,
  Q3_K for low-RAM hosts
- Model recommendations for VPS Beta (3B), VPS Alpha (7B), Mac (7B Q6_K)
- Night watch integration: health probe script with auto-restart
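
The night-watch probe could look roughly like this (a sketch, not the shipped script; it assumes the probe runs with enough privilege to restart the llama-server unit):

    import subprocess
    import urllib.error
    import urllib.request

    HEALTH_URL = "http://localhost:11435/health"

    def probe_and_restart() -> bool:
        """Probe llama-server's /health; restart the unit if it is unresponsive."""
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass
        # Unhealthy or unreachable: let systemd bring the service back up.
        subprocess.run(["systemctl", "restart", "llama-server"], check=False)
        return False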

Fleet standard model: Qwen2.5-7B-Instruct-Q4_K_M.gguf
Default endpoint: http://localhost:11435

22 tests pass.
2026-04-13 21:16:31 -04:00

[Unit]
Description=llama.cpp Local LLM Server
After=network-online.target
Wants=network-online.target
StartLimitBurst=3
StartLimitIntervalSec=60

[Service]
Type=simple
User=root
Group=root

# Model and server configuration
Environment=MODEL_PATH=/opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Environment=LLAMA_HOST=0.0.0.0
Environment=LLAMA_PORT=11435
Environment=LLAMA_CTX_SIZE=4096
Environment=LLAMA_THREADS=4

ExecStart=/usr/local/bin/llama-server \
    -m ${MODEL_PATH} \
    --host ${LLAMA_HOST} \
    --port ${LLAMA_PORT} \
    -c ${LLAMA_CTX_SIZE} \
    -t ${LLAMA_THREADS} \
    --cont-batching

Restart=on-failure
RestartSec=10

# Resource limits
MemoryMax=12G
CPUQuota=90%

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/opt/models
PrivateTmp=true
ProtectKernelTunables=true
ProtectControlGroups=true
RestrictSUIDSGID=true

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=llama-server

[Install]
WantedBy=multi-user.target
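
Once the unit is enabled (systemctl daemon-reload && systemctl enable --now llama-server), the model-loaded check can be confirmed through the server's OpenAI-compatible model listing; a small sketch against the port configured above:

    import json
    import urllib.request

    with urllib.request.urlopen("http://localhost:11435/v1/models", timeout=5) as resp:
        models = json.loads(resp.read())

    # Expect the fleet-standard Qwen2.5-7B-Instruct-Q4_K_M.gguf to appear here.
    for entry in models.get("data", []):
        print(entry.get("id"))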