teknium f172f7d4aa Add skills tools and enhance model integration

- Introduced new skills tools: `skills_categories`, `skills_list`, and `skill_view` in `model_tools.py`, allowing for better organization and access to skill-related functionalities.
- Updated `toolsets.py` to include a new `skills` toolset, providing a dedicated space for skill tools.
- Enhanced `batch_runner.py` to recognize and validate skills tools during batch processing.
- Added comprehensive tool definitions for skills tools, ensuring compatibility with OpenAI's expected format.
- Created new shell script `test_skills_kimi.sh` for testing skills tool functionality with Kimi K2.5.
- Added example skill files demonstrating the structure and usage of skills within the Hermes-Agent framework, including `SKILL.md` for example and audiocraft skills.
- Improved documentation for skills tools and their integration into the existing tool framework, ensuring clarity for future development and usage.

2026-01-30 07:39:55 +00:00

2.2 KiB

Server Deployment Guide

This guide covers production deployment of the llama.cpp server (llama-server), which exposes an OpenAI-compatible HTTP API.

Server Modes

llama-server

# Basic server
./llama-server \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096  # Context size

# With GPU acceleration
./llama-server \
    -m models/llama-2-70b.Q4_K_M.gguf \
    -ngl 40  # Offload 40 layers to GPU

OpenAI-Compatible API

Chat completions

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2",
    "messages": [
      {"role": "system", "content": "You are helpful"},
      {"role": "user", "content": "Hello"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
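Hand-writing the JSON body breaks as soon as the prompt contains a quote or a backslash. A minimal sketch of building the request body in POSIX shell follows. The helper names json_escape and chat_payload are hypothetical, and the escaping assumes single-line input with no control characters; for arbitrary text a real JSON tool is the safer choice.

```shell
# Escape backslashes and double quotes so the text can sit inside a JSON string.
# Assumption: the input is a single line with no control characters.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a one-message chat request body.
# $1 = user message, $2 = max_tokens (defaults to 100, as in the curl example).
chat_payload() {
    printf '{"model":"llama-2","messages":[{"role":"user","content":"%s"}],"max_tokens":%d}\n' \
        "$(json_escape "$1")" "${2:-100}"
}
```

Usage: curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d "$(chat_payload 'Say "hello"' 50)"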

Streaming

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'
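With "stream": true the response arrives as server-sent events. The sketch below assumes the usual OpenAI-style framing, where each chunk is a line of the form "data: {json}" and the stream ends with "data: [DONE]". The function name sse_chunks is hypothetical; it prints one JSON chunk per line and stops at the end marker.

```shell
# Read an SSE stream on stdin and print the JSON payload of each data line.
# Assumption: chunks look like "data: {...}" and the stream ends with "data: [DONE]".
sse_chunks() {
    cr=$(printf '\r')
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%"$cr"}                 # tolerate CRLF line endings
        case $line in
            "data: [DONE]") break ;;       # end-of-stream marker
            "data: "*) printf '%s\n' "${line#data: }" ;;
        esac                               # blank separator lines are skipped
    done
}
```

Usage: add -N (--no-buffer) to the streaming curl command above and pipe its output into sse_chunks.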

Docker Deployment

Dockerfile:

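# A CUDA devel base image provides nvcc, which plain ubuntu:22.04 lacks
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y git build-essential cmake libcurl4-openssl-dev
RUN git clone https://github.com/ggerganov/llama.cpp
WORKDIR /llama.cpp
# Recent llama.cpp builds with CMake; GGML_CUDA=ON replaces the old make LLAMA_CUDA=1
RUN cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF \
        -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined \
    && cmake --build build --config Release -j"$(nproc)"
COPY models/ /models/
EXPOSE 8080
CMD ["./build/bin/llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]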

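Build and run:

docker build -t llama-cpp:latest .

# --gpus all requires the NVIDIA Container Toolkit on the host
docker run --gpus all -p 8080:8080 llama-cpp:latest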

Monitoring

# Server metrics endpoint (Prometheus format; start llama-server with --metrics to enable it)
curl http://localhost:8080/metrics

# Health check
curl http://localhost:8080/health
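Loading a large model takes time, so a deployment script usually has to wait until the health check passes before sending traffic. A minimal sketch, assuming /health returns a non-2xx status until the model is ready; wait_for_health is a hypothetical helper, and the retry count and delay are illustrative defaults.

```shell
# Poll the /health endpoint until it succeeds or the attempts run out.
# $1 = base URL, $2 = max attempts (default 30), $3 = delay in seconds (default 1).
wait_for_health() {
    url=$1
    tries=${2:-30}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl exit non-zero on HTTP errors (status >= 400)
        if curl -sf -o /dev/null "$url/health"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Usage: wait_for_health http://localhost:8080 60 2 || echo "server did not become healthy" >&2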

Metrics (reported in Prometheus text format; exact names and prefixes depend on the server version):

requests_total
tokens_generated
prompt_tokens
completion_tokens
kv_cache_tokens
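For quick checks or alert scripts, a single value can be pulled out of the metrics output without a full Prometheus setup. The sketch below assumes the plain Prometheus text format, one "name value" pair per line with comment lines starting with #. metric_value is a hypothetical helper, and the metric name passed to it must match whatever your server version actually reports.

```shell
# Print the value of the first sample whose name matches $1 exactly.
# Reads Prometheus text format on stdin; "# HELP" and "# TYPE" lines never match.
metric_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}
```

Usage: curl -s http://localhost:8080/metrics | metric_value kv_cache_tokens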

Load Balancing

NGINX:

upstream llama_cpp {
    server llama1:8080;
    server llama2:8080;
}

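server {
    listen 80;

    location / {
        proxy_pass http://llama_cpp;
        proxy_read_timeout 300s;  # Allow long generations
        proxy_buffering off;      # Pass streamed tokens through as they arrive
    }
}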
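Before putting the proxy in front of the backends, it helps to confirm that each upstream answers its health check directly. A minimal sketch; check_backends is a hypothetical helper, and the host:port values are the ones from the upstream block.

```shell
# Report "up" or "down" for each host:port argument based on its /health endpoint.
# Returns non-zero if any backend is down.
check_backends() {
    failed=0
    for backend in "$@"; do
        if curl -sf -o /dev/null --max-time 5 "http://$backend/health"; then
            echo "$backend up"
        else
            echo "$backend down"
            failed=1
        fi
    done
    return "$failed"
}
```

Usage: check_backends llama1:8080 llama2:8080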

Performance Tuning

Parallel requests:

./llama-server \
    -m model.gguf \
    -np 4  # 4 parallel slots
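Parallel slots share the context window. Assuming the server divides the total context (-c) evenly among slots, which is how llama-server has historically behaved, the per-slot budget can be worked out with simple arithmetic. Both helper names below are hypothetical.

```shell
# Context available to each slot: total context divided by slot count.
# $1 = total context (-c), $2 = number of slots (-np).
ctx_per_slot() {
    echo $(( $1 / $2 ))
}

# Total context to request so that each slot gets the desired budget.
# $1 = desired context per slot, $2 = number of slots (-np).
total_ctx_for() {
    echo $(( $1 * $2 ))
}
```

Under that assumption, -c 4096 with -np 4 leaves 1024 tokens per slot, and giving each of 4 slots 4096 tokens needs -c 16384.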

Continuous batching:

./llama-server \
    -m model.gguf \
    --cont-batching  # Enable continuous batching (on by default in recent versions)

Context caching:

./llama-server \
    -m model.gguf \
    --cache-prompt  # Cache processed prompts
