Local LLM Deployment Guide — llama.cpp
This guide standardizes local LLM inference across the fleet: one binary, one model path, one health endpoint.
Quick Start
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build --config Release -j$(nproc)
sudo cp build/bin/llama-server /usr/local/bin/
sudo mkdir -p /opt/models/llama
sudo wget -O /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"
llama-server -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 11435 -c 4096 -t $(nproc) --cont-batching
curl http://localhost:11435/health
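For production hosts, the server can run under systemd instead of a foreground shell. A minimal unit sketch; the unit name, service user, and restart policy are assumptions, not fleet policy:

```ini
# /etc/systemd/system/llama-server.service  (hypothetical unit name)
[Unit]
Description=llama.cpp inference server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/llama-server -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 11435 -c 4096 --cont-batching
Restart=on-failure
User=llama

[Install]
WantedBy=multi-user.target
```

Enable with `sudo systemctl enable --now llama-server`.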
Model Path Convention
- /opt/models/llama/ — Production (system-wide)
- ~/models/llama/ — Per-user (dev)
- MODEL_DIR env var — Override (takes precedence over both)
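The convention above can be encoded as a small resolver for fleet scripts. A sketch; the helper name and the dev-before-production fallback order are assumptions:

```shell
#!/bin/sh
# Resolve the model directory: MODEL_DIR env var wins, then the
# per-user dev path if it exists, then the production default.
resolve_model_dir() {
  if [ -n "${MODEL_DIR:-}" ]; then
    echo "$MODEL_DIR"
  elif [ -d "$HOME/models/llama" ]; then
    echo "$HOME/models/llama"
  else
    echo "/opt/models/llama"
  fi
}

resolve_model_dir
```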
Recommended Models
- Qwen2.5-7B-Instruct (4.7GB, 8GB RAM) — Fleet standard
- Qwen2.5-3B-Instruct (2.0GB, 4GB RAM) — VPS Beta
- Mistral-7B-Instruct-v0.3 (4.4GB, 8GB RAM) — Alternative
Quantization
- Q6_K (5.5GB) — Best quality/speed trade-off
- Q4_K_M (4.7GB) — Fleet standard
- Q3_K_M (3.4GB) — Low-RAM fallback
Hardware Targets
- VPS Beta (2 vCPU, 4GB): Qwen2.5-3B-Q4_K_M, ctx 2048, ~40-60 tok/s
- VPS Alpha (4 vCPU, 8GB): Qwen2.5-7B-Q4_K_M, ctx 4096, ~20-35 tok/s
- Mac Apple Silicon: Qwen2.5-7B-Q6_K, Metal, ~30-50 tok/s
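For Linux provisioning scripts, the table above can be reduced to a RAM check. A sketch; the 8GB threshold and the 3B filename are assumptions (the Quick Start only downloads the 7B model):

```shell
#!/bin/sh
# Pick model + context size from total RAM (Linux /proc/meminfo).
ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
if [ "$ram_kb" -ge $((8 * 1024 * 1024)) ]; then
  model="Qwen2.5-7B-Instruct-Q4_K_M.gguf"; ctx=4096
else
  model="Qwen2.5-3B-Instruct-Q4_K_M.gguf"; ctx=2048
fi
echo "model=$model ctx=$ctx"
```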
Health Check
curl -sf http://localhost:11435/health && echo OK || echo FAIL
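Deploy scripts usually want to wait for the server to come up rather than check once. A sketch; the function name, retry count, and 1s interval are arbitrary:

```shell
#!/bin/sh
# Poll /health until the server answers or we give up.
wait_for_llama() {
  port="${1:-11435}"; tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf "http://localhost:${port}/health" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "not ready after ${tries}s"
  return 1
}
```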
API
llama-server exposes an OpenAI-compatible API at /v1/chat/completions.
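A request sketch against that endpoint; assumes the Quick Start server is running on 11435 (llama-server serves whichever model it loaded, so the "model" field here is informational):

```shell
#!/bin/sh
# Build the request body once so it can be inspected or reused.
payload='{"model":"qwen2.5-7b-instruct","messages":[{"role":"user","content":"Say hello in five words."}],"max_tokens":32}'
# Send it; prints a fallback message if the server is down.
curl -s http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$payload" || echo "server not reachable"
```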
Troubleshooting
- Won't start → try a smaller model or a lower quant
- Slow → match -t to physical core count
- OOM → reduce context size (-c)
- Port conflict → lsof -i :11435 to find the owning process
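The checks above can be bundled into one diagnostic pass. A sketch; assumes a Linux host and degrades gracefully when lsof is absent:

```shell
#!/bin/sh
# Quick diagnostics for the troubleshooting list above.
echo "cores: $(nproc)"
if command -v lsof >/dev/null 2>&1; then
  lsof -i :11435 || echo "port 11435: no listener"
else
  echo "lsof not installed; try: ss -ltn"
fi
```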