Performance Optimization

Contents

  • PagedAttention explained
  • Continuous batching mechanics
  • Prefix caching strategies
  • Speculative decoding setup
  • Benchmark results and comparisons
  • Performance tuning guide

PagedAttention explained

Traditional attention problem:

  • KV cache stored in contiguous memory
  • Wastes ~50% GPU memory due to fragmentation
  • Cannot dynamically reallocate for varying sequence lengths

PagedAttention solution:

  • Divides KV cache into fixed-size blocks (like OS virtual memory)
  • Dynamic allocation from free block queue
  • Shares blocks across sequences (for prefix caching)

Memory savings example:

Traditional: 70B model needs 160GB KV cache → OOM on 8x A100
PagedAttention: 70B model needs 80GB KV cache → Fits on 4x A100
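
To make the block-table idea concrete, here is a minimal Python sketch of paged allocation (a free-block queue plus a per-sequence block table). It only illustrates the bookkeeping; vLLM's real allocator is considerably more involved:

# Toy illustration of PagedAttention-style bookkeeping -- not vLLM's
# implementation: a pool of fixed-size blocks plus a per-sequence block table.
BLOCK_SIZE = 16  # tokens per block (vLLM's default --block-size)

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))           # free-block queue
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical blocks

    def append_token(self, seq_id: str, tokens_so_far: int) -> None:
        """Allocate a new block only when the current block is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if tokens_so_far % BLOCK_SIZE == 0:           # first token, or block boundary
            if not self.free:
                raise MemoryError("KV cache exhausted -- vLLM would preempt here")
            table.append(self.free.pop())

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free queue."""
        self.free.extend(self.block_tables.pop(seq_id, []))

pool = BlockPool(num_blocks=8)
for i in range(40):                      # a 40-token sequence needs ceil(40/16) = 3 blocks
    pool.append_token("req-1", i)
print(len(pool.block_tables["req-1"]))   # -> 3
pool.release("req-1")                    # blocks immediately reusable by other requests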

Configuration:

# Block size (default: 16 tokens)
vllm serve MODEL --block-size 16

# Number of GPU blocks (auto-calculated)
# Controlled by --gpu-memory-utilization
vllm serve MODEL --gpu-memory-utilization 0.9

Continuous batching mechanics

Traditional batching:

  • Wait for all sequences in batch to finish
  • GPU idle while waiting for longest sequence
  • Low GPU utilization (~40-60%)

Continuous batching:

  • Add new requests as slots become available
  • Mix prefill (new requests) and decode (ongoing) in same batch
  • High GPU utilization (>90%)

Throughput improvement:

Traditional batching: 50 req/sec @ 50% GPU util
Continuous batching: 200 req/sec @ 90% GPU util
= 4x throughput improvement
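
A simplified sketch of the scheduling loop (greedy slot-filling on each step; not vLLM's actual scheduler):

# Simplified continuous-batching loop: the running batch is refilled every
# decode step instead of waiting for the whole batch to drain. Illustration only.
import random
from collections import deque

MAX_NUM_SEQS = 4                        # plays the role of --max-num-seqs
waiting = deque(f"req-{i}" for i in range(10))
running: dict[str, int] = {}            # request -> tokens left to generate
random.seed(0)

steps = 0
while waiting or running:
    # Admit new requests as soon as slots free up (no waiting for the batch to end)
    while waiting and len(running) < MAX_NUM_SEQS:
        running[waiting.popleft()] = random.randint(5, 20)

    # One "forward pass": every running sequence decodes one token
    for req in list(running):
        running[req] -= 1
        if running[req] == 0:           # finished -> slot is reused on the next step
            del running[req]
    steps += 1

print(f"served 10 requests in {steps} decode steps")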

Tuning parameters:

# Max concurrent sequences (higher = more batching)
vllm serve MODEL --max-num-seqs 256

# Prefill/decode schedule (auto-balanced by default)
# No manual tuning needed

Prefix caching strategies

Reuse computed KV cache for common prompt prefixes.

Use cases:

  • System prompts repeated across requests
  • Few-shot examples in every prompt
  • RAG contexts with overlapping chunks

Example savings:

Prompt: [System: 500 tokens] + [User: 100 tokens]

Without caching: Compute 600 tokens every request
With caching: Compute 500 tokens once, then 100 tokens/request
= 83% faster TTFT

Enable prefix caching:

vllm serve MODEL --enable-prefix-caching

Automatic prefix detection:

  • vLLM detects common prefixes automatically
  • No code changes required
  • Works with OpenAI-compatible API
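
Client-side, nothing changes. For example, requests that repeat the same long system prompt reuse its KV cache after the first call; the base URL, model name, and prompt below are placeholder assumptions for this sketch:

# Illustrative client: the long system prompt is identical across requests, so
# with --enable-prefix-caching its KV cache is computed once and reused.
# The base URL, model name, and prompt are assumptions for this sketch.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
system_prompt = "You are a support assistant. " + "Policy details... " * 100  # shared prefix

for question in ["How do I reset my password?", "What is the refund policy?"]:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": system_prompt},  # cached after the first request
            {"role": "user", "content": question},
        ],
        max_tokens=64,
    )
    print(f"{time.perf_counter() - start:.2f}s  {resp.choices[0].message.content[:60]}")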

Cache hit rate monitoring:

curl http://localhost:9090/metrics | grep cache_hit
# vllm_cache_hit_rate: 0.75  (75% hit rate)

Speculative decoding setup

Use smaller "draft" model to propose tokens, larger model to verify.

Speed improvement:

Standard: Generate 1 token per forward pass
Speculative: Generate 3-5 tokens per forward pass
= 2-3x faster generation

How it works:

  1. Draft model proposes K tokens (fast)
  2. Target model verifies all K tokens in parallel (one pass)
  3. Accept verified tokens, restart from first rejection
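
A toy version of the propose-and-verify loop (greedy acceptance for clarity; production implementations use rejection sampling so the output distribution matches the target model):

# Toy propose-and-verify loop. draft_next/target_next stand in for the two
# models' greedy next-token functions; here they are simple stubs.
from typing import Callable, List

def speculative_step(prefix: List[int], k: int,
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int]) -> List[int]:
    # 1. Draft model proposes k tokens autoregressively (cheap).
    ctx, proposed = list(prefix), []
    for _ in range(k):
        proposed.append(draft_next(ctx))
        ctx.append(proposed[-1])

    # 2. Target model checks each position (in practice: one batched forward pass).
    ctx, accepted = list(prefix), []
    for tok in proposed:
        expected = target_next(ctx)
        if expected != tok:              # first mismatch: keep the target's token, stop
            accepted.append(expected)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Stub "models": the draft agrees with the target except at positions divisible by 4.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) + (1 if len(ctx) % 4 == 0 else 0)
print(speculative_step(prefix=[0], k=5, draft_next=draft, target_next=target))  # -> [1, 2, 3, 4]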

Setup with separate draft model:

vllm serve meta-llama/Llama-3-70B-Instruct \
  --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-speculative-tokens 5

Setup with n-gram draft (no separate model):

vllm serve MODEL \
  --speculative-method ngram \
  --num-speculative-tokens 3

When to use:

  • Output length > 100 tokens
  • Draft model 5-10x smaller than target
  • Acceptable 2-3% accuracy trade-off

Benchmark results

vLLM vs HuggingFace Transformers (Llama 3 8B, A100):

Metric                | HF Transformers | vLLM  | Improvement
----------------------|-----------------|-------|------------
Throughput (req/sec)  | 12              | 280   | 23x
TTFT (ms)             | 850             | 120   | 7x
Tokens/sec            | 45              | 2,100 | 47x
GPU Memory (GB)       | 28              | 16    | 1.75x less

vLLM vs TensorRT-LLM (Llama 2 70B, 4x A100):

Metric                | TensorRT-LLM | vLLM         | Notes
----------------------|--------------|--------------|--------------------
Throughput (req/sec)  | 320          | 285          | TRT 12% faster
Setup complexity      | High         | Low          | vLLM much easier
NVIDIA-only           | Yes          | No           | vLLM multi-platform
Quantization support  | FP8, INT8    | AWQ/GPTQ/FP8 | vLLM more options

Performance tuning guide

Step 1: Measure baseline

# vLLM ships a built-in benchmark CLI (vllm bench); no separate tool needed

# Run baseline benchmark
vllm bench throughput \
  --model MODEL \
  --input-tokens 128 \
  --output-tokens 256 \
  --num-prompts 1000

# Record: throughput, TTFT, tokens/sec
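
Optionally, a small client-side script against the OpenAI-compatible endpoint can record TTFT per request via streaming; the base URL, model name, and prompt are assumptions for illustration:

# Rough client-side TTFT measurement against a running vLLM server
# (assumes the server is already up on localhost:8000 serving MODEL).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
stream = client.completions.create(
    model="MODEL",
    prompt="Explain PagedAttention in one paragraph.",
    max_tokens=256,
    stream=True,
)
chunks = 0
for chunk in stream:
    if first_token_at is None:
        first_token_at = time.perf_counter()    # time to first streamed token
    chunks += 1
total = time.perf_counter() - start
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms, total: {total:.2f} s, chunks: {chunks}")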

Step 2: Tune memory utilization

# Try different values: 0.7, 0.85, 0.9, 0.95
vllm serve MODEL --gpu-memory-utilization 0.9

Higher = more batch capacity = higher throughput, but risk OOM.
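
As a rough sizing check, you can estimate how many KV-cache tokens a given memory budget holds from the model's dimensions. The sketch below assumes Llama-3-8B-like dimensions (32 layers, 8 KV heads, head dim 128, FP16) on an 80 GB GPU; the authoritative capacity figure is the one vLLM reports in its startup logs:

# Ballpark KV-cache capacity estimate (assumed Llama-3-8B-like dimensions).
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2            # FP16
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # K and V
# 2 * 32 * 8 * 128 * 2 = 131072 bytes = 128 KiB per cached token

gpu_mem_gib, util, weights_gib = 80, 0.9, 16                  # 80 GB GPU, ~8B FP16 weights
kv_budget = (gpu_mem_gib * util - weights_gib) * 1024**3      # bytes left for KV cache
print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{kv_budget / bytes_per_token / 1000:.0f}K tokens fit")   # -> 128 KiB, ~459K tokens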

Step 3: Tune concurrency

# Try values: 128, 256, 512, 1024
vllm serve MODEL --max-num-seqs 256

Higher = more batching opportunity, but may increase latency.

Step 4: Enable optimizations

# --enable-prefix-caching: for repeated prompts
# --enable-chunked-prefill: for long prompts
vllm serve MODEL \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 512

Step 5: Re-benchmark and compare

Target improvements:

  • Throughput: +30-100%
  • TTFT: 20-50% lower
  • GPU utilization: >85%

Common performance issues:

Low throughput (<50 req/sec):

  • Increase --max-num-seqs
  • Enable --enable-prefix-caching
  • Check GPU utilization (should be >80%)

High TTFT (>1 second):

  • Enable --enable-chunked-prefill
  • Reduce --max-model-len if possible
  • Check if model is too large for GPU

OOM errors:

  • Reduce --gpu-memory-utilization to 0.7
  • Reduce --max-model-len
  • Use quantization (--quantization awq)