
GGUF Quantization Guide

Complete guide to GGUF quantization formats and model conversion.

Quantization Overview

GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp.

Format Comparison

| Format | Perplexity (change vs FP16) | Size (7B) | Tokens/sec | Notes |
|--------|-----------------------------|-----------|------------|-------|
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
| Q6_K | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
| Q5_K_M | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
| Q4_K_M | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | Recommended |
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.08%) | 3.3 GB | 45 tok/s | Only when memory is tight |
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |

Recommendation: Use Q4_K_M for best balance of quality and speed.

Converting Models

HuggingFace to GGUF

# 1. Download HuggingFace model (gated repo: accept the license and run `huggingface-cli login` first)
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
    --local-dir models/llama-2-7b-chat/

# 2. Convert to FP16 GGUF (the script ships in the llama.cpp repository)
python convert_hf_to_gguf.py \
    models/llama-2-7b-chat/ \
    --outtype f16 \
    --outfile models/llama-2-7b-chat-f16.gguf

# 3. Quantize to Q4_K_M
./llama-quantize \
    models/llama-2-7b-chat-f16.gguf \
    models/llama-2-7b-chat-Q4_K_M.gguf \
    Q4_K_M

Batch quantization

# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
    ./llama-quantize \
        model-f16.gguf \
        "model-${quant}.gguf" \
        "$quant"
done
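
The loop above can be wrapped in a small function so that repeated runs skip files that already exist. This is a minimal sketch, not part of llama.cpp: `quantize_all`, `QUANTIZE_BIN`, and `DRY_RUN` are hypothetical names introduced here, and the function assumes the input file name ends in `-f16.gguf`.

```shell
# quantize_all SRC_GGUF QUANT...
# Quantize one FP16 GGUF file into several formats.
# QUANTIZE_BIN and DRY_RUN are hypothetical helper variables, not llama.cpp options.
quantize_all() (
    src=$1
    shift
    bin=${QUANTIZE_BIN:-./llama-quantize}
    base=${src%-f16.gguf}              # model-f16.gguf -> model
    for quant in "$@"; do
        out="${base}-${quant}.gguf"
        if [ -e "$out" ]; then
            echo "skip $out (already exists)"
            continue
        fi
        if [ -n "${DRY_RUN:-}" ]; then
            # Print the command instead of running it
            echo "$bin $src $out $quant"
        else
            "$bin" "$src" "$out" "$quant" || exit 1
        fi
    done
)
```

Setting `DRY_RUN=1` before calling `quantize_all model-f16.gguf Q4_K_M Q5_K_M Q6_K Q8_0` prints the commands without running them, which is a cheap way to check the output names first.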

K-Quantization Methods

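K-quants quantize weights in blocks and mix precisions across tensors for better quality:

  • Sensitive tensors (for example, part of the attention value and feed-forward down projections): Higher precision
  • Remaining weights: The base precision named in the format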

Variants:

  • _S (Small): Smallest file, lowest quality of the three
  • _M (Medium): Balanced (recommended)
  • _L (Large): Better quality, larger size

Example: Q4_K_M

  • Q4: 4-bit base quantization
  • K: K-quant method
  • M: Medium variant
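
The naming scheme can be unpacked mechanically. The sketch below is a hypothetical helper (`parse_quant` is not a llama.cpp tool) that handles only the name shapes used in this guide: `Qn_K_S/M/L`, `Qn_K`, and non-K names such as `Q8_0`.

```shell
# parse_quant NAME
# Split a quantization name into base bit width, method, and variant.
# Hypothetical helper covering only the name shapes used in this guide.
parse_quant() (
    name=$1
    bits=${name#Q}        # Q4_K_M -> 4_K_M
    bits=${bits%%_*}      # 4_K_M  -> 4
    case $name in
        Q[0-9]_K_[SML])
            echo "bits=$bits method=K variant=${name##*_}"
            ;;
        Q[0-9]_K)
            echo "bits=$bits method=K variant=none"
            ;;
        Q[0-9]_[0-9])
            echo "bits=$bits method=non-K variant=none"
            ;;
        *)
            echo "unrecognized quant name: $name" >&2
            exit 1
            ;;
    esac
)
```

For example, `parse_quant Q4_K_M` prints `bits=4 method=K variant=M`.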

Quality Testing

# Calculate perplexity (quality metric)
./llama-perplexity \
    -m model.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512

# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
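
The percentages quoted in this guide are relative increases over the FP16 baseline, (quantized - baseline) / baseline * 100. A minimal sketch of that arithmetic, assuming `awk` is available; `ppl_delta` is a hypothetical helper name:

```shell
# ppl_delta BASELINE_PPL QUANT_PPL
# Print the relative perplexity increase in percent, to two decimals.
ppl_delta() {
    awk -v base="$1" -v quant="$2" \
        'BEGIN { printf "%.2f\n", (quant - base) / base * 100 }'
}

ppl_delta 5.9565 6.0565   # Q4_K_M vs FP16 -> 1.68
ppl_delta 5.9565 6.8673   # Q2_K vs FP16   -> 15.29
```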

Use Case Guide

General purpose (chatbots, assistants)

Q4_K_M - Best balance
Q5_K_M - If you have extra RAM

Code generation

Q5_K_M or Q6_K - Higher precision helps with code

Creative writing

Q4_K_M - Sufficient quality
Q3_K_M - Acceptable for draft generation

Technical/medical

Q6_K or Q8_0 - Maximum accuracy

Edge devices (Raspberry Pi)

Q2_K or Q3_K_S - Fit in limited RAM
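
The guide above can be encoded as a lookup for use in scripts. This is only a sketch: `recommend_quant` and its use-case keywords are hypothetical, and the values simply repeat the recommendations listed in this section, in the order given.

```shell
# recommend_quant USE_CASE
# Print the formats this guide suggests for a use case, in the order listed.
# The keywords (general, code, creative, technical, edge) are hypothetical.
recommend_quant() (
    case $1 in
        general)   echo "Q4_K_M Q5_K_M" ;;
        code)      echo "Q5_K_M Q6_K" ;;
        creative)  echo "Q4_K_M Q3_K_M" ;;
        technical) echo "Q6_K Q8_0" ;;
        edge)      echo "Q2_K Q3_K_S" ;;
        *)
            echo "unknown use case: $1" >&2
            exit 1
            ;;
    esac
)
```

For example, `recommend_quant code` prints `Q5_K_M Q6_K`.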

Model Size Scaling

7B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 2.7 GB | 5 GB |
| Q3_K_M | 3.3 GB | 6 GB |
| Q4_K_M | 4.1 GB | 7 GB |
| Q5_K_M | 4.8 GB | 8 GB |
| Q6_K | 5.5 GB | 9 GB |
| Q8_0 | 7.0 GB | 11 GB |

13B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 5.1 GB | 8 GB |
| Q3_K_M | 6.2 GB | 10 GB |
| Q4_K_M | 7.9 GB | 12 GB |
| Q5_K_M | 9.2 GB | 14 GB |
| Q6_K | 10.7 GB | 16 GB |

70B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 26 GB | 32 GB |
| Q3_K_M | 32 GB | 40 GB |
| Q4_K_S | 39 GB | 46 GB |
| Q4_K_M | 41 GB | 48 GB |
| Q5_K_M | 48 GB | 56 GB |

Recommendation for 70B: Use Q3_K_M or Q4_K_S (40-46 GB of RAM per the table above) to fit on high-memory consumer hardware.
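
Choosing a format from these tables amounts to picking the largest entry whose RAM requirement fits the machine. The sketch below does this for the 7B figures only; `pick_quant_7b` is a hypothetical helper and the embedded numbers are copied from the 7B table in this guide, so treat them as estimates.

```shell
# pick_quant_7b RAM_GB
# Print the highest-quality 7B format whose "RAM needed" fits in RAM_GB.
# Exits non-zero if even Q2_K does not fit.
pick_quant_7b() (
    # Rows are sorted by RAM needed, so the last matching row wins
    awk -v ram="$1" '
        $2 + 0 <= ram + 0 { best = $1 }
        END { if (best == "") exit 1; print best }
    ' <<'EOF'
Q2_K 5
Q3_K_M 6
Q4_K_M 7
Q5_K_M 8
Q6_K 9
Q8_0 11
EOF
)
```

For example, `pick_quant_7b 8` prints `Q5_K_M`, while `pick_quant_7b 4` prints nothing and returns a non-zero status.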

Finding Pre-Quantized Models

TheBloke on HuggingFace publishes pre-quantized GGUF builds of many models. Example:

# Download pre-quantized Llama 2 7B Chat (Q4_K_M)
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/

Importance Matrices (imatrix)

What: Importance statistics, collected by running calibration text through the model, that tell the quantizer which weights matter most.

Benefits:

  • Recovers part of the perplexity lost at Q4 (the original text cites a 10-20% improvement)
  • Strongly recommended for Q3 and below, where quantization loss is largest

Usage:

# 1. Generate importance matrix
./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o model.imatrix

# 2. Quantize with imatrix
./llama-quantize \
    --imatrix model.imatrix \
    model-f16.gguf \
    model-Q4_K_M.gguf \
    Q4_K_M

Calibration data:

  • Use domain-specific text (e.g., code for code models)
  • ~100MB of representative text
  • Higher quality data = better quantization
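
One way to assemble such a calibration file is to concatenate representative text files and cap the result at a target size. This is a minimal sketch under stated assumptions: `build_calibration` is a hypothetical helper, the source directory is assumed to hold `.txt` files, and the output file should live outside that directory so it is not read back in.

```shell
# build_calibration SRC_DIR OUT_FILE MAX_BYTES
# Concatenate SRC_DIR/*.txt (in glob order) and truncate to MAX_BYTES.
# Write OUT_FILE outside SRC_DIR so it is not picked up by the glob.
build_calibration() (
    src_dir=$1
    out=$2
    max_bytes=$3
    cat "$src_dir"/*.txt | head -c "$max_bytes" > "$out"
)
```

For example, `build_calibration corpus/ calibration-data.txt 100000000` caps the file at roughly the 100 MB suggested above; the resulting file is what `llama-imatrix` receives through `-f`.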

Troubleshooting

Model outputs gibberish:

  • Quantization too aggressive (Q2_K)
  • Try Q4_K_M or Q5_K_M
  • Verify model converted correctly
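
A quick first check that a conversion or download produced a GGUF file at all is to look at the first four bytes, which spell `GGUF` in a valid file. The sketch below tests only that magic header, not the tensors themselves, so a passing file can still be truncated or damaged; `is_gguf` is a hypothetical helper name.

```shell
# is_gguf FILE
# Succeed if FILE starts with the 4-byte GGUF magic.
# This checks only the header, not the integrity of the tensor data.
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}
```

For example, `is_gguf models/llama-2-7b-chat-Q4_K_M.gguf || echo "not a GGUF file"` flags an HTML error page or partial download saved under a `.gguf` name.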

Out of memory:

  • Use a lower-bit quantization (Q4_K_S instead of Q5_K_M)
  • If GPU memory is the limit, offload fewer layers to the GPU (lower -ngl)
  • Use a smaller context (-c 2048)

Slow inference:

  • Higher-bit formats read more data per token, so generation is slower
  • Q8_0 is much slower than Q4_K_M (25 vs 40 tok/s in the 7B comparison above)
  • Weigh speed against quality when choosing a format