
# GGUF Quantization Guide

Complete guide to GGUF quantization formats and model conversion.

## Quantization Overview

GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp.

## Format Comparison

| Format | Perplexity        | Size (7B) | Tokens/sec | Notes                 |
|--------|-------------------|-----------|------------|-----------------------|
| FP16   | 5.9565 (baseline) | 13.0 GB   | 15 tok/s   | Original quality      |
| Q8_0   | 5.9584 (+0.03%)   | 7.0 GB    | 25 tok/s   | Nearly lossless       |
| Q6_K   | 5.9642 (+0.13%)   | 5.5 GB    | 30 tok/s   | Best quality/size     |
| Q5_K_M | 5.9796 (+0.39%)   | 4.8 GB    | 35 tok/s   | Balanced              |
| Q4_K_M | 6.0565 (+1.68%)   | 4.1 GB    | 40 tok/s   | Recommended           |
| Q4_K_S | 6.1125 (+2.62%)   | 3.9 GB    | 42 tok/s   | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.07%)   | 3.3 GB    | 45 tok/s   | Small models only     |
| Q2_K   | 6.8673 (+15.3%)   | 2.7 GB    | 50 tok/s   | Not recommended       |

Recommendation: Use Q4_K_M for the best balance of quality and speed.
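The exact set of available formats depends on your llama.cpp build; running the quantize tool with no arguments prints its usage text, which includes the allowed quantization types (output details vary between versions):

```bash
# Print usage, including the list of quantization types this build supports
./llama-quantize 2>&1 | head -n 40
```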

## Converting Models

### HuggingFace to GGUF

```bash
# 1. Download HuggingFace model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
    --local-dir models/llama-2-7b-chat/

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
    models/llama-2-7b-chat/ \
    --outtype f16 \
    --outfile models/llama-2-7b-chat-f16.gguf

# 3. Quantize to Q4_K_M
./llama-quantize \
    models/llama-2-7b-chat-f16.gguf \
    models/llama-2-7b-chat-Q4_K_M.gguf \
    Q4_K_M
```
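After quantizing, a quick smoke test is worth doing before deleting the FP16 intermediate. A minimal check, assuming the `llama-cli` binary from the same llama.cpp build (the prompt is only illustrative):

```bash
# 4. Sanity check: generate a few tokens from the quantized model
./llama-cli \
    -m models/llama-2-7b-chat-Q4_K_M.gguf \
    -p "Explain GGUF quantization in one sentence." \
    -n 32
```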

### Batch quantization

```bash
# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
    ./llama-quantize \
        model-f16.gguf \
        model-${quant}.gguf \
        $quant
done
```

## K-Quantization Methods

K-quants use mixed precision for better quality:

- Attention weights: higher precision
- Feed-forward weights: lower precision

Variants:

- `_S` (Small): faster, lower quality
- `_M` (Medium): balanced (recommended)
- `_L` (Large): better quality, larger size

Example: Q4_K_M

- `Q4`: 4-bit quantization
- `K`: mixed precision method
- `M`: medium quality
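One way to see the mixed precision directly is to dump a quantized file's tensor list. The `gguf` Python package from llama.cpp's `gguf-py` provides a `gguf-dump` utility for this (utility name and output format may differ between versions):

```bash
# Install llama.cpp's GGUF tooling (provides gguf-dump)
pip install gguf

# Dump header metadata and per-tensor info; in a Q4_K_M file the tensor list
# shows different layers stored at different precisions (e.g. q6_K vs q4_K)
gguf-dump model-Q4_K_M.gguf
```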

## Quality Testing

```bash
# Calculate perplexity (quality metric)
./llama-perplexity \
    -m model.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512

# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
```
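To compare several quantization levels on the same corpus, the perplexity run can be looped over the files produced by the batch quantization example (a sketch; the results file and `tail` filtering are just one way to collect the final PPL lines):

```bash
# Measure perplexity for each quantized variant and keep the end of the output,
# which contains the final perplexity estimate
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
    echo "=== ${quant} ===" >> perplexity-results.txt
    ./llama-perplexity \
        -m model-${quant}.gguf \
        -f wikitext-2-raw/wiki.test.raw \
        -c 512 2>&1 | tail -n 5 >> perplexity-results.txt
done
```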

## Use Case Guide

### General purpose (chatbots, assistants)

- Q4_K_M - Best balance
- Q5_K_M - If you have extra RAM

### Code generation

- Q5_K_M or Q6_K - Higher precision helps with code

### Creative writing

- Q4_K_M - Sufficient quality
- Q3_K_M - Acceptable for draft generation

### Technical/medical

- Q6_K or Q8_0 - Maximum accuracy

### Edge devices (Raspberry Pi)

- Q2_K or Q3_K_S - Fit in limited RAM

## Model Size Scaling

### 7B parameter models

| Format | Size   | RAM needed |
|--------|--------|------------|
| Q2_K   | 2.7 GB | 5 GB       |
| Q3_K_M | 3.3 GB | 6 GB       |
| Q4_K_M | 4.1 GB | 7 GB       |
| Q5_K_M | 4.8 GB | 8 GB       |
| Q6_K   | 5.5 GB | 9 GB       |
| Q8_0   | 7.0 GB | 11 GB      |

### 13B parameter models

| Format | Size    | RAM needed |
|--------|---------|------------|
| Q2_K   | 5.1 GB  | 8 GB       |
| Q3_K_M | 6.2 GB  | 10 GB      |
| Q4_K_M | 7.9 GB  | 12 GB      |
| Q5_K_M | 9.2 GB  | 14 GB      |
| Q6_K   | 10.7 GB | 16 GB      |

### 70B parameter models

| Format | Size  | RAM needed |
|--------|-------|------------|
| Q2_K   | 26 GB | 32 GB      |
| Q3_K_M | 32 GB | 40 GB      |
| Q4_K_S | 39 GB | 46 GB      |
| Q4_K_M | 41 GB | 48 GB      |
| Q5_K_M | 48 GB | 56 GB      |

Recommendation for 70B: Use Q3_K_M or Q4_K_S to fit in consumer hardware.
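These figures follow roughly from parameter count × bits per weight. A back-of-the-envelope check (the ~4.8 bits/weight for Q4_K_M is approximate, and real files add metadata plus some higher-precision tensors):

```bash
# Rough GGUF size estimate: parameters * bits-per-weight / 8 bytes
# Q4_K_M averages roughly 4.8 bits per weight (approximate)
awk 'BEGIN { params = 70e9; bpw = 4.8; printf "%.1f GiB\n", params * bpw / 8 / 2^30 }'
# -> ~39 GiB, the same ballpark as the 41 GB listed above
```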

## Finding Pre-Quantized Models

Many popular models are published pre-quantized as GGUF on HuggingFace, e.g. under TheBloke.

Example:

```bash
# Download pre-quantized Llama 2-7B
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/
```
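If you do not know the exact filename, recent versions of `huggingface-cli download` also accept `--include` glob patterns, so you can fetch just one quantization from a multi-file repo (a sketch; check that your CLI version supports the flag):

```bash
# Download only the Q4_K_M file from the repo
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    --include "*Q4_K_M.gguf" \
    --local-dir models/
```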

## Importance Matrices (imatrix)

What: Calibration data to improve quantization quality.

Benefits:

- 10-20% perplexity improvement with Q4
- Essential for Q3 and below

Usage:

```bash
# 1. Generate importance matrix
./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o model.imatrix

# 2. Quantize with imatrix
./llama-quantize \
    --imatrix model.imatrix \
    model-f16.gguf \
    model-Q4_K_M.gguf \
    Q4_K_M
```

Calibration data:

- Use domain-specific text (e.g., code for code models)
- ~100MB of representative text
- Higher quality data = better quantization
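As a sketch, a calibration file can be assembled by concatenating representative documents; the corpus path here is a placeholder:

```bash
# Build a domain-specific calibration file from representative text
# (./my-corpus is a placeholder; use text matching your target workload)
find ./my-corpus -name '*.txt' -print0 | xargs -0 cat > calibration-data.txt

# Aim for roughly 100 MB of text
du -h calibration-data.txt
```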

## Troubleshooting

Model outputs gibberish:

- Quantization too aggressive (Q2_K)
- Try Q4_K_M or Q5_K_M
- Verify the model converted correctly

Out of memory:

- Use a lower quantization (Q4_K_S instead of Q5_K_M)
- Offload fewer layers to the GPU (`-ngl`)
- Use a smaller context (`-c 2048`), as in the example below
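For example, a memory-constrained launch might combine all three (a sketch; `llama-server` is one of several front-ends, and the right `-ngl` value depends on your GPU):

```bash
# Smaller quant + partial GPU offload + shorter context to reduce memory use
./llama-server \
    -m models/llama-2-7b-chat-Q4_K_S.gguf \
    -ngl 20 \
    -c 2048
```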

Slow inference:

- Higher-precision quantizations use more compute and memory bandwidth
- Q8_0 is much slower than Q4_K_M
- Consider the speed vs. quality trade-off