Server Deployment Guide
Production deployment of llama.cpp server with OpenAI-compatible API.
Server Modes
llama-server
# Basic server
./llama-server \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 4096 # Context size
# With GPU acceleration
./llama-server \
-m models/llama-2-70b.Q4_K_M.gguf \
-ngl 40 # Offload 40 layers to GPU
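These flags can be combined; a fuller invocation for a dedicated host might look like the following (model path and values are illustrative):
./llama-server \
-m models/llama-2-70b.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 4096 \
-ngl 40 \
-t 8 # CPU threads for layers not offloaded to the GPU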
OpenAI-Compatible API
Chat completions
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2",
"messages": [
{"role": "system", "content": "You are helpful"},
{"role": "user", "content": "Hello"}
],
"temperature": 0.7,
"max_tokens": 100
}'
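The response follows the OpenAI chat-completions schema, so the reply text can be extracted with jq (assuming jq is installed):
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-2", "messages": [{"role": "user", "content": "Hello"}]}' \
| jq -r '.choices[0].message.content'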
Streaming
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2",
"messages": [{"role": "user", "content": "Count to 10"}],
"stream": true
}'
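With "stream": true the server emits server-sent events (lines of the form data: {...}, terminated by data: [DONE]). One way to print only the generated text as it arrives, assuming GNU tools and jq are available (a sketch):
curl -sN http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-2", "messages": [{"role": "user", "content": "Count to 10"}], "stream": true}' \
| sed -u -n 's/^data: //p' \
| grep --line-buffered -v '^\[DONE\]' \
| jq -rj '.choices[0].delta.content // empty'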
Docker Deployment
Dockerfile:
# A CUDA development image is required to build with LLAMA_CUDA=1 (nvcc must be available)
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y git build-essential
RUN git clone https://github.com/ggerganov/llama.cpp
WORKDIR /llama.cpp
RUN make LLAMA_CUDA=1
COPY models/ /models/
EXPOSE 8080
CMD ["./llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
Build and run:
docker build -t llama-cpp:latest .
docker run --gpus all -p 8080:8080 llama-cpp:latest # --gpus requires the NVIDIA Container Toolkit
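If the model files are large, mounting them from the host at runtime keeps the image small (host path is illustrative):
docker run --gpus all -p 8080:8080 \
-v "$(pwd)/models:/models" \
llama-cpp:latest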
Monitoring
# Server metrics endpoint
curl http://localhost:8080/metrics
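# Note: depending on the build, the Prometheus-style metrics endpoint may need to be enabled with --metrics at server startup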
# Health check
curl http://localhost:8080/health
Metrics:
- requests_total
- tokens_generated
- prompt_tokens
- completion_tokens
- kv_cache_tokens
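The /health endpoint can double as a readiness probe in deployment scripts, since on most builds it returns a non-2xx status while the model is still loading. A minimal wait loop (sketch):
until curl -sf http://localhost:8080/health > /dev/null; do
echo "waiting for llama-server to become ready..."
sleep 2
done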
Load Balancing
NGINX:
upstream llama_cpp {
    server llama1:8080;
    server llama2:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://llama_cpp;
        proxy_read_timeout 300s;
        proxy_buffering off; # pass streamed (SSE) responses through without buffering
    }
}
Performance Tuning
Parallel requests:
./llama-server \
-m model.gguf \
-np 4 # 4 parallel slots
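Note that the context window is shared across slots, so -c is usually scaled together with -np (a sketch; exact behavior depends on the build):
./llama-server \
-m model.gguf \
-c 16384 \
-np 4 # roughly 4096 tokens of context per slot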
Continuous batching:
./llama-server \
-m model.gguf \
--cont-batching # Enable continuous batching
Context caching: the server can reuse the KV cache from an earlier request when the new prompt shares the same prefix, which avoids reprocessing long system prompts. This is typically requested per call via the cache_prompt field of the native completion endpoint rather than a CLI flag (a sketch; parameter support varies by version):
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "You are helpful.\nUser: Hello\nAssistant:",
"n_predict": 64,
"cache_prompt": true
}'