The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure (see the sketch below).

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
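For illustration, a minimal sketch of the path-based category detection described above. This is hypothetical: the actual prompt_builder.py logic may differ, and `scan_skills` is an invented name.

```python
from pathlib import Path

def scan_skills(root: Path) -> dict[str, list[str]]:
    """Map category names to skill directory names under `root`.

    skills/mlops/training/axolotl/SKILL.md -> category 'mlops/training'
    skills/mlops/axolotl/SKILL.md          -> category 'mlops' (old flat layout)
    """
    categories: dict[str, list[str]] = {}
    for skill_md in root.rglob("SKILL.md"):
        skill_dir = skill_md.parent
        # Category = every path component between the skills root and the
        # skill's own directory; the old flat layout yields a single component.
        category = "/".join(skill_dir.parent.relative_to(root).parts)
        categories.setdefault(category or "uncategorized", []).append(skill_dir.name)
    return categories
```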
# Performance Optimization Guide

Maximize llama.cpp inference speed and efficiency.

## CPU Optimization

### Thread tuning
```bash
# Set threads (default: physical cores)
./llama-cli -m model.gguf -t 8

# For AMD Ryzen 9 7950X (16 cores, 32 threads)
-t 16  # Best: physical cores

# Avoid hyperthreading (slower for matrix ops)
```

### BLAS acceleration
```bash
# OpenBLAS (faster matrix ops)
make LLAMA_OPENBLAS=1

# BLAS gives 2-3× speedup
```

## GPU Offloading

### Layer offloading
```bash
# Offload 35 layers to GPU (hybrid mode)
./llama-cli -m model.gguf -ngl 35

# Offload all layers
./llama-cli -m model.gguf -ngl 999

# Find optimal value:
# Start with -ngl 999
# If OOM, reduce by 5 until it fits
```
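
The trial-and-error loop above can be scripted. The sketch below is a hypothetical helper, not part of llama.cpp: it assumes `llama-cli` exits with a nonzero status when GPU allocation fails, and the model path, probe prompt, and step size are placeholders.

```python
#!/usr/bin/env python3
"""Find the largest -ngl value that still fits in VRAM (illustrative only)."""
import subprocess

MODEL = "model.gguf"  # placeholder path

for ngl in range(999, -1, -5):  # start high, step down by 5
    # Tiny 8-token generation as a probe run
    result = subprocess.run(
        ["./llama-cli", "-m", MODEL, "-ngl", str(ngl), "-n", "8", "-p", "hi"],
        capture_output=True,
    )
    if result.returncode == 0:
        print(f"Largest working -ngl: {ngl}")
        break
```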

### Memory usage
```bash
# Check VRAM usage
nvidia-smi dmon

# Reduce context if needed
./llama-cli -m model.gguf -c 2048  # 2K context instead of 4K
```

## Batch Processing

```bash
# Increase batch size for throughput
./llama-cli -m model.gguf -b 512  # Default: 512

# Physical batch (GPU)
--ubatch-size 128  # Process 128 tokens at once
```

## Context Management

```bash
# Default context (512 tokens)
-c 512

# Longer context (slower, more memory)
-c 4096

# Very long context (if the model supports it)
-c 32768
```
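
Context length drives KV-cache memory roughly linearly. Below is a back-of-the-envelope estimate, assuming an f16 K/V cache and Llama-2-7B's dimensions (32 layers, hidden size 4096, full multi-head attention); adjust the numbers for other models or quantized caches.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 32, n_embd: int = 4096,
                   bytes_per_elem: int = 2) -> int:
    # 2x for keys and values, one vector per layer per cached token
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem

for ctx in (512, 4096, 32768):
    print(f"-c {ctx}: ~{kv_cache_bytes(ctx) / 2**30:.2f} GiB")
# -c 512:   ~0.25 GiB
# -c 4096:  ~2.00 GiB
# -c 32768: ~16.00 GiB
```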

## Benchmarks

### CPU Performance (Llama 2-7B Q4_K_M)

| Setup | Speed | Notes |
|-------|-------|-------|
| Apple M3 Max | 50 tok/s | Metal acceleration |
| AMD 7950X (16c) | 35 tok/s | OpenBLAS |
| Intel i9-13900K | 30 tok/s | AVX2 |

### GPU Offloading (RTX 4090)

| Layers on GPU (-ngl) | Speed | VRAM |
|----------------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |