feat: standardize llama.cpp backend for sovereign local inference
Closes #1123. Implements all three phases of the local LLM standardization:

PHASE 1 — Deployment:
- docs/local-llm.md: full deployment guide (build, model download, health check, model path convention /opt/models/llama/, hardware recommendations)
- systemd/llama-server.service: hardened unit with resource limits and auto-restart
- Health check: /health endpoint + model-loaded verification

PHASE 2 — Hermes Integration:
- bin/llama_client.py: OpenAI-compatible Python client wrapping the llama.cpp HTTP API (chat completions, streaming, raw completions, health check, model listing, benchmarking, full CLI interface)
- nexus/llama_provider.py: Hermes inference router provider adapter
  - Activates when external APIs fail, LOCAL_ONLY=true, or an explicit local request is made
  - Response format normalized to OpenAI-compatible chat completions
  - Token usage estimated and logged
  - Health caching with TTL for efficiency

PHASE 3 — Optimization & Ops:
- Benchmarking: client.benchmark() + CLI benchmark command
- Quantization guide: Q4_K_M recommended for the fleet, Q6_K for high-RAM hosts, Q3_K for low-RAM hosts
- Model recommendations for VPS Beta (3B), VPS Alpha (7B), Mac (7B Q6_K)
- Night watch integration: health probe script with auto-restart

Fleet standard model: Qwen2.5-7B-Instruct-Q4_K_M.gguf
Default endpoint: http://localhost:11435

22 tests pass.
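For context, a minimal sketch of the OpenAI-compatible surface that bin/llama_client.py wraps. The client's actual class and function names are not shown in this diff; the snippet below talks to the llama.cpp server directly with the `requests` library, using the fleet default endpoint and model named above. `health_ok` and `chat` are illustrative names, not the real client API.

import requests

BASE_URL = "http://localhost:11435"  # fleet default endpoint

def health_ok(timeout: float = 2.0) -> bool:
    """Return True if llama-server answers /health with HTTP 200."""
    try:
        r = requests.get(f"{BASE_URL}/health", timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

def chat(prompt: str) -> str:
    """Send a single-turn chat completion and return the assistant text."""
    payload = {
        "model": "Qwen2.5-7B-Instruct-Q4_K_M.gguf",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=120)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    if health_ok():
        print(chat("Say hello in one sentence."))
    else:
        print("llama-server is not healthy; check the systemd unit below")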
systemd/llama-server.service (new file, 51 lines)
@@ -0,0 +1,51 @@
[Unit]
Description=llama.cpp Local LLM Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=root
Group=root

# Model and server configuration
Environment=MODEL_PATH=/opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf
Environment=LLAMA_HOST=0.0.0.0
Environment=LLAMA_PORT=11435
Environment=LLAMA_CTX_SIZE=4096
Environment=LLAMA_THREADS=4

ExecStart=/usr/local/bin/llama-server \
    -m ${MODEL_PATH} \
    --host ${LLAMA_HOST} \
    --port ${LLAMA_PORT} \
    -c ${LLAMA_CTX_SIZE} \
    -t ${LLAMA_THREADS} \
    --cont-batching

Restart=on-failure
RestartSec=10
StartLimitBurst=3
StartLimitIntervalSec=60

# Resource limits
MemoryMax=12G
CPUQuota=90%

# Security hardening
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=read-only
ReadWritePaths=/opt/models
PrivateTmp=true
ProtectKernelTunables=true
ProtectControlGroups=true
RestrictSUIDSGID=true

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=llama-server

[Install]
WantedBy=multi-user.target
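The night-watch health probe mentioned in the commit message is not part of this diff. A minimal sketch of the pattern it describes, assuming the default /health endpoint and the unit name above; `HEALTH_URL`, `SERVICE`, `probe`, and `main` are illustrative names only.

import subprocess
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:11435/health"  # default fleet endpoint
SERVICE = "llama-server"                      # unit defined above

def probe(timeout: float = 3.0) -> bool:
    """Return True if the /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def main() -> None:
    if probe():
        print("llama-server healthy")
        return
    # Unhealthy: restart the hardened unit and let Restart=on-failure take over.
    print("llama-server unhealthy, restarting unit")
    subprocess.run(["systemctl", "restart", SERVICE], check=False)

if __name__ == "__main__":
    main()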