GGUF Quantization Guide
Complete guide to GGUF quantization formats and model conversion.
Quantization Overview
GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp, the successor to the older GGML format.
Format Comparison
| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
|---|---|---|---|---|
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
| Q6_K | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
| Q5_K_M | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
| Q4_K_M | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | Recommended |
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |
Recommendation: Use Q4_K_M for best balance of quality and speed.
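The percentage column is just the relative change from the FP16 baseline. As a quick sanity check of the Q4_K_M row (values taken from the table above):
# Relative perplexity increase vs. the FP16 baseline
awk 'BEGIN { base = 5.9565; q4 = 6.0565; printf "+%.2f%%\n", (q4 - base) / base * 100 }'
# prints: +1.68%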
Converting Models
HuggingFace to GGUF
# 1. Download HuggingFace model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
--local-dir models/llama-2-7b-chat/
# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
models/llama-2-7b-chat/ \
--outtype f16 \
--outfile models/llama-2-7b-chat-f16.gguf
# 3. Quantize to Q4_K_M
./llama-quantize \
models/llama-2-7b-chat-f16.gguf \
models/llama-2-7b-chat-Q4_K_M.gguf \
Q4_K_M
Batch quantization
# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
./llama-quantize \
model-f16.gguf \
model-${quant}.gguf \
$quant
done
K-Quantization Methods
K-quants use mixed precision for better quality:
- Attention weights: Higher precision
- Feed-forward weights: Lower precision
Variants:
- _S (Small): Faster, lower quality
- _M (Medium): Balanced (recommended)
- _L (Large): Better quality, larger size
Example: Q4_K_M
- Q4: 4-bit quantization
- K: Mixed-precision (K-quant) method
- M: Medium quality
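To see which quantization types your llama.cpp build supports, run llama-quantize with no arguments; its usage text should include the list of allowed types (exact output varies by version):
# Print usage, including the list of supported quantization types
./llama-quantize 2>&1 | head -40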
Quality Testing
# Calculate perplexity (quality metric)
./llama-perplexity \
-m model.gguf \
-f wikitext-2-raw/wiki.test.raw \
-c 512
# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
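To compare formats head-to-head, loop the same perplexity run over several quantizations of one model (a minimal sketch, assuming the files from the batch-quantization step above exist):
# Measure perplexity of each quantized variant on the same test set
for q in Q4_K_M Q5_K_M Q6_K Q8_0; do
  echo "== ${q} =="
  ./llama-perplexity -m model-${q}.gguf -f wikitext-2-raw/wiki.test.raw -c 512
done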
Use Case Guide
General purpose (chatbots, assistants)
Q4_K_M - Best balance
Q5_K_M - If you have extra RAM
Code generation
Q5_K_M or Q6_K - Higher precision helps with code
Creative writing
Q4_K_M - Sufficient quality
Q3_K_M - Acceptable for draft generation
Technical/medical
Q6_K or Q8_0 - Maximum accuracy
Edge devices (Raspberry Pi)
Q2_K or Q3_K_S - Fit in limited RAM
Model Size Scaling
7B parameter models
| Format | Size | RAM needed |
|---|---|---|
| Q2_K | 2.7 GB | 5 GB |
| Q3_K_M | 3.3 GB | 6 GB |
| Q4_K_M | 4.1 GB | 7 GB |
| Q5_K_M | 4.8 GB | 8 GB |
| Q6_K | 5.5 GB | 9 GB |
| Q8_0 | 7.0 GB | 11 GB |
13B parameter models
| Format | Size | RAM needed |
|---|---|---|
| Q2_K | 5.1 GB | 8 GB |
| Q3_K_M | 6.2 GB | 10 GB |
| Q4_K_M | 7.9 GB | 12 GB |
| Q5_K_M | 9.2 GB | 14 GB |
| Q6_K | 10.7 GB | 16 GB |
70B parameter models
| Format | Size | RAM needed |
|---|---|---|
| Q2_K | 26 GB | 32 GB |
| Q3_K_M | 32 GB | 40 GB |
| Q4_K_M | 41 GB | 48 GB |
| Q4_K_S | 39 GB | 46 GB |
| Q5_K_M | 48 GB | 56 GB |
Recommendation for 70B: Use Q3_K_M or Q4_K_S to fit on consumer hardware.
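These sizes follow roughly from parameters x effective bits-per-weight / 8. The helper below is a rough, hypothetical estimator only (K-quants mix precisions, so the effective bits-per-weight, around 4.8 for Q4_K_M, and the real file sizes vary somewhat):
# Back-of-the-envelope GGUF size: params (billions) * effective bits-per-weight / 8
estimate_gguf_size() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "~%.1f GB\n", p * b / 8 }'
}
estimate_gguf_size 7 4.8    # Q4_K_M at 7B  -> ~4.2 GB (table: 4.1 GB)
estimate_gguf_size 70 4.8   # Q4_K_M at 70B -> ~42.0 GB (table: 41 GB)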
Finding Pre-Quantized Models
TheBloke on HuggingFace:
- https://huggingface.co/TheBloke
- Most models available in all GGUF formats
- No conversion needed
Example:
# Download pre-quantized Llama 2-7B
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
Importance Matrices (imatrix)
What: Calibration data to improve quantization quality.
Benefits:
- 10-20% perplexity improvement with Q4
- Essential for Q3 and below
Usage:
# 1. Generate importance matrix
./llama-imatrix \
-m model-f16.gguf \
-f calibration-data.txt \
-o model.imatrix
# 2. Quantize with imatrix
./llama-quantize \
--imatrix model.imatrix \
model-f16.gguf \
model-Q4_K_M.gguf \
Q4_K_M
Calibration data:
- Use domain-specific text (e.g., code for code models)
- ~100MB of representative text
- Higher-quality data means better quantization (see the sketch below)
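A minimal sketch of assembling a calibration file from your own domain text (paths are placeholders; any representative plain-text corpus works):
# Concatenate domain-specific text and cap it at ~100 MB
cat my-corpus/*.txt | head -c 100000000 > calibration-data.txt
wc -c calibration-data.txt    # verify the size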
Troubleshooting
Model outputs gibberish:
- Quantization too aggressive (Q2_K)
- Try Q4_K_M or Q5_K_M
- Verify model converted correctly
Out of memory:
- Use lower quantization (Q4_K_S instead of Q5_K_M)
- Offload fewer layers to GPU (-ngl)
- Use smaller context (-c 2048); see the example below
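For example, with llama.cpp's llama-cli (flag values are illustrative; tune them to your hardware):
# Fit into limited memory: offload fewer layers to the GPU, shrink the context
./llama-cli -m model-Q4_K_M.gguf -ngl 20 -c 2048 -p "Hello"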
Slow inference:
- Higher-bit formats (e.g. Q8_0) read more data per token, so they run slower
- Q8_0 is noticeably slower than Q4_K_M on the same hardware
- Consider the speed vs. quality trade-off