# Local LLM Deployment Guide: llama.cpp

This guide standardizes local LLM inference across the fleet using llama.cpp.

## Quick Start

    # Build llama-server from source and install it
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp && cmake -B build && cmake --build build --config Release -j"$(nproc)"
    sudo cp build/bin/llama-server /usr/local/bin/

    # Download the fleet-standard model (/opt is usually root-owned, hence sudo)
    sudo mkdir -p /opt/models/llama
    sudo wget -O /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf "https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GGUF/resolve/main/qwen2.5-7b-instruct-q4_k_m.gguf"

    # Serve on port 11435 with a 4096-token context and one thread per available CPU
    llama-server -m /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf --host 0.0.0.0 --port 11435 -c 4096 -t "$(nproc)" --cont-batching

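
A failed or redirected download can leave an HTML error page where the model file should be, and the server will then fail to load it. The sketch below is one way to sanity-check the file before starting the server. It relies on GGUF files beginning with the four ASCII bytes `GGUF`; the function name `is_gguf` is hypothetical and not part of llama.cpp.

```shell
# is_gguf FILE: succeed only if FILE exists and starts with the GGUF magic bytes.
# Hypothetical helper for illustration; it is not shipped with llama.cpp.
is_gguf() {
  [ -f "$1" ] || return 1
  # Read the first four bytes and compare them to the ASCII magic "GGUF".
  [ "$(head -c 4 "$1")" = "GGUF" ]
}
```

For example, `is_gguf /opt/models/llama/Qwen2.5-7B-Instruct-Q4_K_M.gguf || echo "not a GGUF file"` could run between the download and the server start.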
## Model Paths

- Production: /opt/models/llama/
- Dev: ~/models/llama/
- Override: MODEL_DIR env var

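
The list does not say how the three locations interact, so the following is a minimal sketch under one assumed precedence: `MODEL_DIR` wins when set, then the production directory if it exists, then the dev directory. The function name `resolve_model_dir` is hypothetical.

```shell
# resolve_model_dir: print the directory to load models from.
# Assumed precedence (not stated in this guide): MODEL_DIR, production, dev.
resolve_model_dir() {
  if [ -n "${MODEL_DIR:-}" ]; then
    printf '%s\n' "$MODEL_DIR"            # explicit override
  elif [ -d /opt/models/llama ]; then
    printf '%s\n' /opt/models/llama       # production path, if present
  else
    printf '%s\n' "$HOME/models/llama"    # dev fallback
  fi
}
```

A launch script could then use `llama-server -m "$(resolve_model_dir)/Qwen2.5-7B-Instruct-Q4_K_M.gguf"` instead of hard-coding the production path.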
## Models

- Qwen2.5-7B-Instruct-Q4_K_M (4.7 GB): fleet standard, VPS Alpha
- Qwen2.5-3B-Instruct-Q4_K_M (2.0 GB): VPS Beta
- Mistral-7B-Instruct-v0.3-Q4_K_M (4.4 GB): alternative

## Quantization

- Q6_K (5.5 GB): best quality-to-speed balance, needs 12 GB+ RAM
- Q4_K_M (4.7 GB): fleet standard, needs 8 GB RAM
- Q3_K_M (3.4 GB): low-RAM fallback, for 4 GB RAM

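
The RAM thresholds above can be turned into a small selection helper. This is a sketch only: `pick_quant` and `total_ram_gb` are hypothetical names, and `total_ram_gb` assumes a Linux host with `/proc/meminfo` (it rounds to the nearest GiB, because an "8 GB" machine usually reports slightly less).

```shell
# pick_quant RAM_GB: print the quantization level for a host with RAM_GB of memory.
# Thresholds follow the list above; hosts under 8 GB all get the Q3_K_M fallback.
pick_quant() {
  local ram_gb="$1"
  if [ "$ram_gb" -ge 12 ]; then
    echo Q6_K
  elif [ "$ram_gb" -ge 8 ]; then
    echo Q4_K_M
  else
    echo Q3_K_M
  fi
}

# total_ram_gb: total memory in GiB, rounded to the nearest integer (Linux only).
total_ram_gb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 + 0.5 }' /proc/meminfo
}
```

On a Linux host, `pick_quant "$(total_ram_gb)"` prints the suggested level for that machine.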
## Hardware

- VPS Beta (2 cores / 4 GB): 3B-Q4_K_M, ctx 2048, ~40-60 tok/s
- VPS Alpha (4 cores / 8 GB): 7B-Q4_K_M, ctx 4096, ~20-35 tok/s
- Mac (Apple Silicon / 16 GB+): 7B-Q6_K, Metal, ~30-50 tok/s

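
The per-host settings above map onto `llama-server` flags roughly as follows. This is a sketch under stated assumptions: `profile_args` is a hypothetical helper, the 3B file name is assumed to follow the same naming as the 7B file in Quick Start, and the Mac profile is left out because the list gives no context size for it.

```shell
# profile_args PROFILE: print model, context, and thread flags for a fleet host.
# Hypothetical helper; file names assume the Quick Start naming convention.
profile_args() {
  local dir="${MODEL_DIR:-/opt/models/llama}"
  case "$1" in
    beta)  echo "-m $dir/Qwen2.5-3B-Instruct-Q4_K_M.gguf -c 2048 -t 2" ;;  # 2 cores / 4 GB
    alpha) echo "-m $dir/Qwen2.5-7B-Instruct-Q4_K_M.gguf -c 4096 -t 4" ;;  # 4 cores / 8 GB
    *)     echo "unknown profile: $1" >&2; return 1 ;;
  esac
}
```

A launch line might then read `llama-server $(profile_args alpha) --host 0.0.0.0 --port 11435`. The substitution is left unquoted on purpose so the flags split into separate words, which only works while the model path contains no spaces.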
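## Health

    # Exits non-zero unless the server answers with a success status
    curl -sf http://localhost:11435/health
    # Lists the model the server exposes through its OpenAI-compatible API
    curl -s http://localhost:11435/v1/models
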
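
Loading a model takes a while, so a single health request right after startup may fail even though nothing is wrong. A deploy script could poll instead; the sketch below is one way to do that. `wait_for_health` is a hypothetical helper, and the default of 30 attempts spaced 2 seconds apart is an arbitrary assumption, not a fleet setting.

```shell
# wait_for_health [URL] [ATTEMPTS] [DELAY_SECONDS]
# Poll the health endpoint until it returns a success status or attempts run out.
# Hypothetical helper; the defaults are illustrative only.
wait_for_health() {
  local url="${1:-http://localhost:11435/health}"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # curl -f turns HTTP error statuses into a non-zero exit code.
    if curl -sf "$url" >/dev/null; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Typical use would be `wait_for_health || { echo "llama-server did not become healthy" >&2; exit 1; }` after starting the server in the background.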
## Troubleshooting

- Won't start → use a smaller model or a lower quant
- Slow → set -t to the core count
- OOM → reduce the context size (-c)
- Port conflict → find the process holding the port with lsof -i :11435
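
Reducing `-c` helps with OOM because the KV cache grows linearly with context length: it holds one key and one value vector per layer for every token in the context. The sketch below estimates that cache size only, not the model weights or compute buffers. `kv_cache_mib` is a hypothetical helper, and the example figures (28 layers, 4 KV heads, head dimension 128 for Qwen2.5-7B, 2 bytes per element for an f16 cache) are assumptions that should be checked against the model's published configuration.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX [BYTES_PER_ELEM]
# Rough KV cache size in MiB: 2 (K and V) * layers * kv_heads * head_dim * bytes * ctx.
# Hypothetical helper; BYTES_PER_ELEM defaults to 2, i.e. an f16 cache.
kv_cache_mib() {
  local layers="$1"
  local kv_heads="$2"
  local head_dim="$3"
  local ctx="$4"
  local bytes="${5:-2}"
  echo $(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1048576 ))
}

# Assumed Qwen2.5-7B shape: halving the context halves the cache.
kv_cache_mib 28 4 128 4096   # prints 224
kv_cache_mib 28 4 128 2048   # prints 112
```

Under these assumptions the cache is small next to the 4.7 GB of weights, so on a host that is already near its limit a smaller model or lower quant may free more memory than a shorter context.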