# Server Deployment Guide

Production deployment of the llama.cpp server with an OpenAI-compatible API.

## Server Modes

### llama-server
```bash
# Basic server
./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 4096  # Context size

# With GPU acceleration
./llama-server \
  -m models/llama-2-70b.Q4_K_M.gguf \
  -ngl 40  # Offload 40 layers to GPU
```
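Once the server is up, a quick smoke test confirms it is reachable before wiring anything else to it. A minimal sketch using the health endpoint described under Monitoring below; the `/v1/models` listing is part of the OpenAI-compatible surface, but if your build does not expose it the health check alone is enough:

```bash
# Poll until the server reports healthy, then list the loaded model
until curl -sf http://localhost:8080/health > /dev/null; do sleep 1; done
curl -s http://localhost:8080/v1/models
```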
## OpenAI-Compatible API

### Chat completions
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2",
    "messages": [
      {"role": "system", "content": "You are helpful"},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
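The response follows the OpenAI chat-completion schema, so the assistant's reply sits in the first choice. A small sketch for extracting just the text, assuming `jq` is installed:

```bash
# Print only the assistant's reply from the first choice
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-2", "messages": [{"role": "user", "content": "Hello"}]}' \
  | jq -r '.choices[0].message.content'
```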
### Streaming

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'
```
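With `"stream": true` the server emits server-sent events in the OpenAI style: one `data:` line per chunk, terminated by `data: [DONE]`. A rough sketch that strips the SSE framing and concatenates the streamed tokens (assumes `jq`; for live token-by-token output you would additionally line-buffer `grep`/`sed` with GNU flags):

```bash
# Keep only the JSON event lines, drop the "data: " prefix, join the deltas
curl -sN http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-2", "messages": [{"role": "user", "content": "Count to 10"}], "stream": true}' \
  | grep '^data: {' \
  | sed 's/^data: //' \
  | jq -rj '.choices[0].delta.content // empty'
echo
```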
## Docker Deployment

**Dockerfile**:
```dockerfile
# A CUDA build needs the CUDA toolkit at compile time, so start from a devel image
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y git build-essential
RUN git clone https://github.com/ggerganov/llama.cpp
WORKDIR /llama.cpp
RUN make LLAMA_CUDA=1
COPY models/ /models/
EXPOSE 8080
CMD ["./llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
```
**Run**:

```bash
# Build the image, then run it with GPU access
docker build -t llama-cpp:latest .
docker run --gpus all -p 8080:8080 llama-cpp:latest
```
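Copying weights into the image makes it very large and ties a rebuild to every model change. A common alternative, sketched here with the same image and paths as above, is to mount the models directory at run time instead of (or in addition to) the `COPY` step:

```bash
# Serve whatever is in ./models without rebuilding the image
docker run --gpus all -p 8080:8080 \
  -v "$(pwd)/models:/models" \
  llama-cpp:latest
```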
## Monitoring
```bash
# Server metrics endpoint
curl http://localhost:8080/metrics

# Health check
curl http://localhost:8080/health
```
**Metrics**:

- requests_total
- tokens_generated
- prompt_tokens
- completion_tokens
- kv_cache_tokens
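Exact metric names vary between llama.cpp builds, and recent servers only expose the Prometheus endpoint when started with the `--metrics` flag, so treat the list above as illustrative. A quick check that the endpoint is live and serving data:

```bash
# -f makes curl fail on an HTTP error status; print the first few metric lines
curl -sf http://localhost:8080/metrics | head -n 20
```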
## Load Balancing

**NGINX**:
```nginx
upstream llama_cpp {
    server llama1:8080;
    server llama2:8080;
}

server {
    location / {
        proxy_pass http://llama_cpp;
        proxy_read_timeout 300s;
    }
}
```
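Generation requests are long-lived and each backend slot serves one at a time, so it is worth confirming that concurrent traffic is actually spread across backends rather than queued behind one. A rough concurrency smoke test against the balancer (hypothetical address `lb.example.com`; assumes `xargs -P` is available):

```bash
# Fire 8 short chat requests in parallel and print only the HTTP status codes
seq 8 | xargs -P 8 -I{} curl -s -o /dev/null -w "%{http_code}\n" \
  http://lb.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-2", "max_tokens": 8, "messages": [{"role": "user", "content": "Request {}"}]}'
```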
## Performance Tuning

**Parallel requests**:
```bash
./llama-server \
  -m model.gguf \
  -np 4  # 4 parallel slots
```
**Continuous batching**:

```bash
./llama-server \
  -m model.gguf \
  --cont-batching  # Enable continuous batching
```
**Context caching**:

```bash
./llama-server \
  -m model.gguf \
  --cache-prompt  # Cache processed prompts
```
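These options are typically combined in one invocation. A sketch of a tuned production start, assuming the flags above exist in your build; note that the total context set by `-c` is shared across the `-np` parallel slots, so size it per slot:

```bash
# 16384 total context / 4 slots = 4096 tokens available per request
./llama-server \
  -m model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 16384 \
  -np 4 \
  -ngl 99 \
  --cont-batching
```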