Compare commits
5 Commits
q/1123-177 ... triage/112
| Author | SHA1 | Date |
|---|---|---|
| | 84eb8104d8 | |
| | 93228388d7 | |
| | e27c51c6da | |
| | ed79826608 | |
| | e438662c97 | |
@@ -1,7 +1,6 @@
-# Local LLM Deployment Guide — llama.cpp Sovereign Inference
+# Local LLM Deployment Guide — llama.cpp
 
-llama.cpp provides sovereign, offline-capable inference on CPU, CUDA, and
-Apple Silicon. One binary, one model path, one health endpoint.
+Standardizes local LLM inference across the fleet. One binary, one model path, one health endpoint.
 
 ## Quick Start
 
@@ -21,15 +20,15 @@ Apple Silicon. One binary, one model path, one health endpoint.
 
 ## Recommended Models
 
-- Qwen2.5-7B-Instruct (4.7GB, 8GB RAM, 25-40 tok/s) — Fleet standard
-- Qwen2.5-3B-Instruct (2.0GB, 4GB RAM, 50-80 tok/s) — VPS Beta
+- Qwen2.5-7B-Instruct (4.7GB, 8GB RAM) — Fleet standard
+- Qwen2.5-3B-Instruct (2.0GB, 4GB RAM) — VPS Beta
 - Mistral-7B-Instruct-v0.3 (4.4GB, 8GB RAM) — Alternative
 
-## Quantization Guide
+## Quantization
 
-- Q6_K (5.5GB) — Best quality/speed, RAM > 12GB
+- Q6_K (5.5GB) — Best quality/speed
 - Q4_K_M (4.7GB) — Fleet standard
-- Q3_K_M (3.4GB) — < 6GB RAM fallback
+- Q3_K_M (3.4GB) — Low-RAM fallback
 
 ## Hardware Targets
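The quant tiers in the hunk above map cleanly to available RAM. As a sketch only, a selection helper might encode those thresholds like this (`pick_quant` is a hypothetical name; the cutoffs are taken from the list: Q6_K wants > 12GB, Q4_K_M is the 8GB fleet standard, Q3_K_M is the < 6GB fallback):

```python
def pick_quant(ram_gb: float) -> str:
    """Pick a GGUF quant tier from available RAM (hypothetical helper).

    Thresholds follow the guide: Q6_K for > 12GB, Q4_K_M as the
    8GB fleet standard, Q3_K_M as the low-RAM (< 6GB) fallback.
    """
    if ram_gb > 12:
        return "Q6_K"    # 5.5GB file, best quality/speed
    if ram_gb >= 6:
        return "Q4_K_M"  # 4.7GB file, fleet standard
    return "Q3_K_M"      # 3.4GB file, low-RAM fallback
```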
@@ -39,16 +38,15 @@ Apple Silicon. One binary, one model path, one health endpoint.
 
 ## Health Check
 
-curl -sf http://localhost:11435/health
-curl -s http://localhost:11435/v1/models
+curl -sf http://localhost:11435/health && echo OK || echo FAIL
 
-## API Compatibility
+## API
 
 llama-server exposes OpenAI-compatible API at /v1/chat/completions.
 
 ## Troubleshooting
 
-- Won't start: use smaller model or lower quant
-- Slow: match -t to available cores
-- OOM: reduce -c context size
-- Port in use: lsof -i :11435
+- Won't start → smaller model / lower quant
+- Slow → match -t to cores
+- OOM → reduce -c context
+- Port conflict → lsof -i :11435
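The curl health check above can also be done from Python, which is closer to how a router-side caller would probe the server. A minimal sketch, assuming only the `/health` endpoint and port from the guide (function name is illustrative):

```python
import urllib.request

def is_healthy(base_url: str = "http://localhost:11435", timeout: float = 2.0) -> bool:
    """Return True if llama-server's /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error (all OSError subclasses)
        return False
```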
@@ -1,4 +1,4 @@
-"""llama_provider.py — Hermes inference router provider for llama.cpp local server."""
+"""llama_provider.py — Hermes inference router provider for llama.cpp."""
 import logging, os, time
 from dataclasses import dataclass
 from typing import Optional
@@ -43,8 +43,6 @@ class TestLlamaClient:
     def test_is_healthy(self, m):
         m.return_value = {"status": "ok"}
         assert LlamaClient().is_healthy() is True
         m.side_effect = ConnectionError()
         assert LlamaClient().is_healthy() is False
 
     @patch("bin.llama_client._http_get")
     def test_list_models(self, m):