Compare commits

...

5 Commits

Author SHA1 Message Date
84eb8104d8 feat: standardize llama.cpp backend for sovereign local inference (#1123)
Some checks failed
Deploy Nexus / deploy (push) Has been cancelled
Staging Verification Gate / verify-staging (push) Has been cancelled
Review Approval Gate / verify-review (pull_request) Failing after 10s
CI / test (pull_request) Failing after 55s
CI / validate (pull_request) Failing after 56s
2026-04-14 01:48:34 +00:00
93228388d7 feat: standardize llama.cpp backend for sovereign local inference (#1123)
Some checks failed
Deploy Nexus / deploy (push) Has been cancelled
Staging Verification Gate / verify-staging (push) Has been cancelled
2026-04-14 01:48:29 +00:00
e27c51c6da feat: standardize llama.cpp backend for sovereign local inference (#1123)
Some checks failed
Deploy Nexus / deploy (push) Has been cancelled
Staging Verification Gate / verify-staging (push) Has been cancelled
2026-04-14 01:48:26 +00:00
ed79826608 feat: standardize llama.cpp backend for sovereign local inference (#1123)
Some checks failed
Deploy Nexus / deploy (push) Has been cancelled
Staging Verification Gate / verify-staging (push) Has been cancelled
2026-04-14 01:48:23 +00:00
e438662c97 feat: standardize llama.cpp backend for sovereign local inference (#1123)
Some checks failed
Deploy Nexus / deploy (push) Has been cancelled
Staging Verification Gate / verify-staging (push) Has been cancelled
2026-04-14 01:48:17 +00:00
3 changed files with 14 additions and 18 deletions

View File

@@ -1,7 +1,6 @@
# Local LLM Deployment Guide — llama.cpp Sovereign Inference
# Local LLM Deployment Guide — llama.cpp
llama.cpp provides sovereign, offline-capable inference on CPU, CUDA, and
Apple Silicon. One binary, one model path, one health endpoint.
Standardizes local LLM inference across the fleet. One binary, one model path, one health endpoint.
## Quick Start
@@ -21,15 +20,15 @@ Apple Silicon. One binary, one model path, one health endpoint.
## Recommended Models
- Qwen2.5-7B-Instruct (4.7GB, 8GB RAM, 25-40 tok/s) — Fleet standard
- Qwen2.5-3B-Instruct (2.0GB, 4GB RAM, 50-80 tok/s) — VPS Beta
- Qwen2.5-7B-Instruct (4.7GB, 8GB RAM) — Fleet standard
- Qwen2.5-3B-Instruct (2.0GB, 4GB RAM) — VPS Beta
- Mistral-7B-Instruct-v0.3 (4.4GB, 8GB RAM) — Alternative
## Quantization Guide
## Quantization
- Q6_K (5.5GB) — Best quality/speed, RAM > 12GB
- Q6_K (5.5GB) — Best quality/speed
- Q4_K_M (4.7GB) — Fleet standard
- Q3_K_M (3.4GB) — < 6GB RAM fallback
- Q3_K_M (3.4GB) — Low-RAM fallback
## Hardware Targets
@@ -39,16 +38,15 @@ Apple Silicon. One binary, one model path, one health endpoint.
## Health Check
curl -sf http://localhost:11435/health
curl -s http://localhost:11435/v1/models
curl -sf http://localhost:11435/health && echo OK || echo FAIL
## API Compatibility
## API
llama-server exposes OpenAI-compatible API at /v1/chat/completions.
## Troubleshooting
- Won't start: use smaller model or lower quant
- Slow: match -t to available cores
- OOM: reduce -c context size
- Port in use: lsof -i :11435
- Won't start smaller model / lower quant
- Slow match -t to cores
- OOM reduce -c context
- Port conflict → lsof -i :11435

View File

@@ -1,4 +1,4 @@
"""llama_provider.py — Hermes inference router provider for llama.cpp local server."""
"""llama_provider.py — Hermes inference router provider for llama.cpp."""
import logging, os, time
from dataclasses import dataclass
from typing import Optional

View File

@@ -43,8 +43,6 @@ class TestLlamaClient:
def test_is_healthy(self, m):
m.return_value = {"status": "ok"}
assert LlamaClient().is_healthy() is True
m.side_effect = ConnectionError()
assert LlamaClient().is_healthy() is False
@patch("bin.llama_client._http_get")
def test_list_models(self, m):