# GGUF Quantization Guide

Complete guide to GGUF quantization formats and model conversion.

## Quantization Overview

**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.

### Format Comparison

| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
|--------|------------|-----------|------------|-------|
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
| **Q6_K** | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
| **Q5_K_M** | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
| **Q4_K_M** | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | **Recommended** |
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.08%) | 3.3 GB | 45 tok/s | Small models only |
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |

**Recommendation**: Use **Q4_K_M** for the best balance of quality, size, and speed. Tokens/sec figures depend on hardware; read them as relative, not absolute.

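The percentages in the perplexity column are the relative increase over the FP16 baseline. A minimal sketch of that arithmetic, assuming `awk` is available; the function name `ppl_delta` is hypothetical and not part of llama.cpp:

```bash
# Percentage change in perplexity relative to a baseline.
# Usage: ppl_delta BASELINE_PPL QUANT_PPL
ppl_delta() {
  awk -v base="$1" -v quant="$2" \
    'BEGIN { printf "%+.2f%%\n", (quant - base) / base * 100 }'
}

ppl_delta 5.9565 6.0565   # FP16 vs Q4_K_M from the table: prints +1.68%
```

The same helper can be pointed at your own measurements to check whether a format stays within a degradation budget you choose.
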
## Converting Models

### HuggingFace to GGUF

```bash
# 1. Download HuggingFace model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
  --local-dir models/llama-2-7b-chat/

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
  models/llama-2-7b-chat/ \
  --outtype f16 \
  --outfile models/llama-2-7b-chat-f16.gguf

# 3. Quantize to Q4_K_M
./llama-quantize \
  models/llama-2-7b-chat-f16.gguf \
  models/llama-2-7b-chat-Q4_K_M.gguf \
  Q4_K_M
```

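After step 2 or 3 it can be worth confirming that the output really is a GGUF file before moving on. GGUF files begin with the four ASCII bytes `GGUF`, so a quick header check catches truncated or mis-named outputs. This is only a sketch: the helper name `is_gguf` is hypothetical, and the check looks at the magic bytes only, not at the tensors.

```bash
# Return success if FILE exists and starts with the 4-byte GGUF magic.
# "GGUF" in hex is 47 47 55 46.
is_gguf() {
  [ -f "$1" ] || return 1
  [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "47475546" ]
}

# Example (path taken from the conversion steps above):
#   is_gguf models/llama-2-7b-chat-f16.gguf || echo "not a GGUF file" >&2
```
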
### Batch quantization

```bash
# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./llama-quantize \
    model-f16.gguf \
    "model-${quant}.gguf" \
    "$quant"
done
```

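Quantizing a large model to several formats takes time and disk space, so it can help to preview the commands first. The sketch below only prints them; `print_quantize_cmds` is a hypothetical helper, and it assumes the input file name ends in `-f16.gguf` as in the loop above (otherwise the full input name is reused as the stem).

```bash
# Print (do not run) one llama-quantize command per requested format.
# Usage: print_quantize_cmds INPUT_F16_GGUF FORMAT...
print_quantize_cmds() {
  local input="$1"; shift
  local stem="${input%-f16.gguf}"   # "model-f16.gguf" -> "model"
  local quant
  for quant in "$@"; do
    printf './llama-quantize %s %s-%s.gguf %s\n' \
      "$input" "$stem" "$quant" "$quant"
  done
}

print_quantize_cmds model-f16.gguf Q4_K_M Q5_K_M Q6_K Q8_0
```

Once the list looks right, the printed lines can be pasted into a shell or piped to `sh`.
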
## K-Quantization Methods

**K-quants** mix precisions within one model, keeping the tensors most sensitive to quantization error at higher precision:
- Attention weights: Generally higher precision
- Feed-forward weights: Generally lower precision

**Variants**:
- `_S` (Small): Smaller and faster, lower quality
- `_M` (Medium): Balanced (recommended)
- `_L` (Large): Better quality, larger size

**Example**: `Q4_K_M`
- `Q4`: 4-bit quantization
- `K`: K-quant (mixed precision) method
- `M`: Medium variant

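The naming scheme above is regular enough to take apart in a script, for example when sorting or filtering a directory of quantized files. A minimal sketch, assuming bash (it uses `[[ =~ ]]`); `parse_kquant` is a hypothetical name, and it only recognises the `Q<bits>_K[_S|_M|_L]` pattern described here:

```bash
# Split a k-quant name such as "Q4_K_M" into bit width and size variant.
# Returns 1 for names that do not follow the k-quant pattern (e.g. Q8_0).
parse_kquant() {
  if [[ "$1" =~ ^Q([0-9]+)_K(_([SML]))?$ ]]; then
    printf 'bits=%s variant=%s\n' \
      "${BASH_REMATCH[1]}" "${BASH_REMATCH[3]:-none}"
  else
    return 1
  fi
}

parse_kquant Q4_K_M                                  # prints: bits=4 variant=M
parse_kquant Q6_K                                    # prints: bits=6 variant=none
parse_kquant Q8_0 || echo "Q8_0 is not a k-quant name"
```
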
## Quality Testing

```bash
# Calculate perplexity (quality metric)
./llama-perplexity \
  -m model.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  -c 512

# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
```

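When comparing several formats it is convenient to save each run's output and pull out just the final number. The sketch below assumes the saved log contains a summary line of the form `Final estimate: PPL = 5.9565 +/- 0.03342`; that line format is an assumption about your llama.cpp build, so check one log by hand first. `final_ppl` and the log file name are hypothetical.

```bash
# Extract the final perplexity value from a saved llama-perplexity log.
# ASSUMPTION: the log has a line like "Final estimate: PPL = 5.9565 +/- 0.03342".
# Prints nothing if no such line is found.
final_ppl() {
  sed -n 's/.*Final estimate: PPL = \([0-9.]*\).*/\1/p' "$1" | tail -n 1
}

# Example (hypothetical log file, written with: ./llama-perplexity ... 2>&1 | tee ppl-Q4_K_M.log):
#   final_ppl ppl-Q4_K_M.log
```
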
## Use Case Guide

### General purpose (chatbots, assistants)
```
Q4_K_M - Best balance
Q5_K_M - If you have extra RAM
```

### Code generation
```
Q5_K_M or Q6_K - Higher precision helps with code
```

### Creative writing
```
Q4_K_M - Sufficient quality
Q3_K_M - Acceptable for draft generation
```

### Technical/medical
```
Q6_K or Q8_0 - Maximum accuracy
```

### Edge devices (Raspberry Pi)
```
Q2_K or Q3_K_S - Fit in limited RAM
```

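If the choice of format is made inside a script, the guide above can be encoded as a small lookup. This is a sketch only: the use-case keywords (`general`, `code`, and so on) and the function name `recommend_quant` are hypothetical, and each branch simply returns the first option listed for that use case.

```bash
# Map a use-case keyword to the first format listed for it in the guide above.
recommend_quant() {
  case "$1" in
    general)   echo "Q4_K_M" ;;
    code)      echo "Q5_K_M" ;;
    creative)  echo "Q4_K_M" ;;
    technical) echo "Q6_K" ;;
    edge)      echo "Q2_K" ;;
    *)         echo "unknown use case: $1" >&2; return 1 ;;
  esac
}

# Example:
#   quant="$(recommend_quant code)"
#   ./llama-quantize model-f16.gguf "model-${quant}.gguf" "$quant"
```
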
## Model Size Scaling

### 7B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 2.7 GB | 5 GB |
| Q3_K_M | 3.3 GB | 6 GB |
| Q4_K_M | 4.1 GB | 7 GB |
| Q5_K_M | 4.8 GB | 8 GB |
| Q6_K | 5.5 GB | 9 GB |
| Q8_0 | 7.0 GB | 11 GB |

### 13B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 5.1 GB | 8 GB |
| Q3_K_M | 6.2 GB | 10 GB |
| Q4_K_M | 7.9 GB | 12 GB |
| Q5_K_M | 9.2 GB | 14 GB |
| Q6_K | 10.7 GB | 16 GB |

### 70B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 26 GB | 32 GB |
| Q3_K_M | 32 GB | 40 GB |
| Q4_K_M | 41 GB | 48 GB |
| Q4_K_S | 39 GB | 46 GB |
| Q5_K_M | 48 GB | 56 GB |

**Recommendation for 70B**: Use Q3_K_M or Q4_K_S to fit in consumer hardware.

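The "RAM needed" column can be turned into a simple selection rule: pick the largest format whose requirement fits the memory you have. A minimal sketch for the 7B table only; `pick_7b_format` is a hypothetical helper, it takes RAM in whole gigabytes, and the thresholds are copied directly from the table above.

```bash
# Largest 7B format whose "RAM needed" figure fits in RAM_GB (whole GB).
# Usage: pick_7b_format RAM_GB
pick_7b_format() {
  local ram_gb="$1"
  if   [ "$ram_gb" -ge 11 ]; then echo "Q8_0"
  elif [ "$ram_gb" -ge 9 ];  then echo "Q6_K"
  elif [ "$ram_gb" -ge 8 ];  then echo "Q5_K_M"
  elif [ "$ram_gb" -ge 7 ];  then echo "Q4_K_M"
  elif [ "$ram_gb" -ge 6 ];  then echo "Q3_K_M"
  elif [ "$ram_gb" -ge 5 ];  then echo "Q2_K"
  else
    echo "no 7B format in the table fits in ${ram_gb} GB" >&2
    return 1
  fi
}

pick_7b_format 8    # prints: Q5_K_M
```

Picking the largest format that fits maximises quality; if speed matters more, step down one row.
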
## Finding Pre-Quantized Models

**TheBloke** on HuggingFace:
- https://huggingface.co/TheBloke
- Most models are available in multiple GGUF quantization formats
- No conversion needed

**Example**:
```bash
# Download pre-quantized Llama 2 7B Chat
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/
```

## Importance Matrices (imatrix)

**What**: Importance statistics computed from calibration data, used to improve quantization quality.

**Benefits**:
- 10-20% perplexity improvement with Q4
- Essential for Q3 and below

**Usage**:
```bash
# 1. Generate importance matrix
./llama-imatrix \
  -m model-f16.gguf \
  -f calibration-data.txt \
  -o model.imatrix

# 2. Quantize with imatrix
./llama-quantize \
  --imatrix model.imatrix \
  model-f16.gguf \
  model-Q4_K_M.gguf \
  Q4_K_M
```

**Calibration data**:
- Use domain-specific text (e.g., code for code models)
- ~100MB of representative text
- Higher quality data = better quantization

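One way to assemble a calibration file of roughly the suggested size is to concatenate representative text files and cap the result. A minimal sketch, assuming a `head` that supports `-c` (GNU and BSD both do); `make_calibration` and the `corpus/` directory are hypothetical. Note that a byte cap can cut the last line, or a multi-byte character, in half.

```bash
# Concatenate text files and keep at most MAX_BYTES bytes of the result.
# Usage: make_calibration OUTPUT MAX_BYTES FILE...
make_calibration() {
  local out="$1" max_bytes="$2"; shift 2
  cat -- "$@" | head -c "$max_bytes" > "$out"
}

# Example: cap at ~100 MB, matching the guideline above.
#   make_calibration calibration-data.txt $((100 * 1024 * 1024)) corpus/*.txt
```
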
## Troubleshooting

**Model outputs gibberish**:
- Quantization may be too aggressive (e.g. Q2_K)
- Try Q4_K_M or Q5_K_M
- Verify that the model converted correctly

**Out of memory**:
- Use a smaller quantization format (Q4_K_S instead of Q5_K_M)
- If GPU memory is the limit, offload fewer layers to the GPU (`-ngl`)
- Use a smaller context (`-c 2048`)

**Slow inference**:
- Higher-precision formats are larger and slower to run
- Q8_0 is much slower than Q4_K_M (25 vs 40 tok/s in the format comparison table)
- Consider the speed vs quality trade-off