
teknium1 732c66b0f3 refactor: reorganize skills into sub-categories

The skills directory was getting disorganized — mlops alone had 40
skills in a flat list, and 12 categories were singletons with just
one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner.
  skills/mlops/training/axolotl/SKILL.md now shows as category
  'mlops/training' instead of just 'mlops'. Backwards-compatible
  with existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention,
  grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning,
  simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp,
  obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything,
  stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers,
  lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.

2026-03-09 03:35:53 -07:00


TensorRT-LLM Optimization Guide

Comprehensive guide to optimizing LLM inference with TensorRT-LLM.

Quantization

FP8 Quantization (Recommended for H100)

Benefits:

Up to 2× faster inference than FP16
50% less weight memory than FP16
Minimal accuracy loss (<1% perplexity degradation)

Usage:

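from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# FP8 quantization is requested through QuantConfig
# (the LLM API's dtype argument does not take "fp8")
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
)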

Performance (Llama 3-70B on 8× H100):

FP16: 5,000 tokens/sec
FP8: 10,000 tokens/sec (2× speedup)
Memory: 140GB → 70GB
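
The 140GB → 70GB figure follows from simple arithmetic on weight storage: parameter count times bytes per parameter. The sketch below reproduces it; it counts weights only and ignores KV cache, activations, and quantization scales, and the helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB (10^9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


llama3_70b_params = 70e9  # nominal parameter count

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(llama3_70b_params, bits):.0f} GB")
# FP16: 140 GB
# FP8: 70 GB
# INT4: 35 GB
```

Spread across 8 GPUs with tensor parallelism, the per-GPU share of the weights is roughly one eighth of each total.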

INT4 Quantization (Maximum compression)

Benefits:

4× memory reduction
3-4× faster inference
Fits larger models on same hardware

Usage:

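from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# INT4 weights with AWQ calibration
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ),
)

# INT4 weights with GPTQ
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.W4A16_GPTQ),
)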

Trade-offs:

Accuracy: 1-3% perplexity increase
Speed: 3-4× faster than FP16
Use case: When memory is critical

In-Flight Batching

What it does: Dynamically batches requests during generation instead of waiting for all sequences to finish.

Configuration:

# Server configuration
#   --max_batch_size          maximum concurrent sequences
#   --max_num_tokens          total tokens in a batch
#   --enable_chunked_context  split long prompts
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --max_batch_size 256 \
    --max_num_tokens 4096 \
    --enable_chunked_context \
    --scheduler_policy max_utilization

Performance:

Throughput: 4-8× higher vs static batching
Latency: Lower P50/P99 for mixed workloads
GPU utilization: 80-95% vs 40-60%
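
To see why refilling slots mid-generation helps, here is a toy simulation, not TensorRT-LLM code. It counts decoding steps for a queue of requests under both policies, assuming one token per sequence per step and ignoring prefill cost. The function names and the request mix are invented for the example.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest sequence has finished."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def inflight_batching_steps(output_lens, batch_size):
    """A finished sequence frees its slot for the next queued request."""
    queue = deque(output_lens)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while queue or active:
        # Fill free slots before every decoding step
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Sequences with one token left finish on this step
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Mixed workload: two long generations among many short ones
output_lens = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_batching_steps(output_lens, batch_size=4))    # 16
print(inflight_batching_steps(output_lens, batch_size=4))  # 9
```

In the static case, three slots sit idle for seven steps in each batch while the long sequence finishes; the in-flight scheduler starts the second long request at step 2 instead of step 9.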

Paged KV Cache

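What it does: Manages KV cache memory the way an operating system manages virtual memory: the cache is split into fixed-size blocks (pages) that are allocated on demand rather than reserved up front per sequence.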

Benefits:

40-60% higher throughput
No memory fragmentation
Supports longer sequences

Configuration:

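from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Paged KV cache is on by default; KvCacheConfig tunes it
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,  # Use 90% of free GPU memory for the cache
        enable_block_reuse=True,       # Reuse cached blocks for shared prefixes
    ),
)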
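
The paging idea can be sketched with a toy block allocator. This is an illustration of the concept only, not how TensorRT-LLM implements it; the class `PagedKVCache` and its methods are hypothetical. Each sequence holds a block table, and a new block is taken from the free list only when the current one is full.

```python
class PagedKVCache:
    """Toy allocator: sequences own lists of fixed-size cache blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            block = self.free_blocks.pop()
            self.block_tables.setdefault(seq_id, []).append(block)
        self.seq_lens[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):
    cache.append_token("a")  # 40 tokens need 3 blocks
for _ in range(10):
    cache.append_token("b")  # 10 tokens need 1 block
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]), len(cache.free_blocks))  # 3 1 4
cache.release("a")
print(len(cache.free_blocks))  # 7
```

Because blocks need not be contiguous, memory freed by one sequence is immediately usable by any other, which is where the "no fragmentation" benefit comes from.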

Speculative Decoding

What it does: Uses a small draft model to propose several tokens ahead, which the target model then verifies in a single parallel pass.

Speedup: 2-3× faster for long generations

Usage:

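from tensorrt_llm import LLM

# Target model (Llama 3-70B) paired with a smaller draft model.
# Keyword names for speculative decoding differ between TensorRT-LLM
# releases; check the LLM API reference for the installed version.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    speculative_model="meta-llama/Meta-Llama-3-8B",  # Draft model
    num_speculative_tokens=5                          # Tokens to predict ahead
)

# Same generate() API as without a draft model
prompts = ["Explain paged attention in one paragraph."]
outputs = llm.generate(prompts)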

Best models for drafting:

Target: Llama 3-70B → Draft: Llama 3-8B
Target: Qwen2-72B → Draft: Qwen2-7B
Same family, 8-10× smaller
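
The draft-then-verify loop can be illustrated with a toy greedy version. This is a sketch under simplifying assumptions, not the TensorRT-LLM implementation: the "models" are plain Python functions, decoding is greedy, and the target's verification is written as a loop even though a real engine scores all positions in one batched pass. All names here are hypothetical.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=5):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap)
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Target checks the proposals (one pass in a real engine)
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if i == k or draft[i] != expected:
                break  # all drafts used, or first mismatch corrected
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes


def target_next(seq):
    return (seq[-1] + 1) % 10  # toy target: counts 0..9 cyclically


def draft_next(seq):
    # Toy draft: agrees with the target except right after a 4
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


out, passes = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
print(out)     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
print(passes)  # 3
```

The output matches what the target alone would produce, but with 3 target passes instead of 12. The gain depends on how often the draft agrees with the target, which is why a same-family draft model is recommended.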

CUDA Graphs

What it does: Reduces kernel launch overhead by recording GPU operations.

Benefits:

10-20% lower latency
More stable P99 latency
Better for small batch sizes

Configuration (automatic by default):

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    enable_cuda_graph=True,  # Default: True
    cuda_graph_cache_size=2  # Cache 2 graph variants
)

Chunked Context

What it does: Splits long prompts into chunks to reduce memory spikes.

Use case: Prompts >8K tokens with limited GPU memory

Configuration:

trtllm-serve meta-llama/Meta-Llama-3-8B \
    --max_num_tokens 4096 \
    --enable_chunked_context \
    --max_chunked_prefill_length 2048  # Process 2K tokens at a time
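
Conceptually, chunking means the prefill for a long prompt is processed in fixed-size slices rather than all at once. The snippet below only shows the slicing arithmetic; `chunk_prompt` is a made-up helper, and the real scheduler also interleaves these chunks with decode steps from other requests.

```python
def chunk_prompt(token_ids, max_chunk_len=2048):
    """Split a prompt into consecutive chunks of at most max_chunk_len tokens."""
    return [
        token_ids[i:i + max_chunk_len]
        for i in range(0, len(token_ids), max_chunk_len)
    ]


prompt = list(range(10_000))  # stand-in for a 10K-token prompt
chunks = chunk_prompt(prompt)
print([len(c) for c in chunks])  # [2048, 2048, 2048, 2048, 1808]
```

Peak activation memory during prefill then scales with the chunk length instead of the full prompt length.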

Overlap Scheduling

What it does: Overlaps compute and memory operations.

Benefits:

15-25% higher throughput
Better GPU utilization
Default in v1.2.0+

No configuration needed - enabled automatically.

Quantization Comparison Table

Method	Memory (vs FP16)	Speed (vs FP16)	Accuracy	Use Case
FP16	1× (baseline)	1×	Best	High accuracy needed
FP8	0.5×	2×	~0.5% ppl increase	H100 default
INT4 AWQ	0.25×	3-4×	~1.5% ppl increase	Memory critical
INT4 GPTQ	0.25×	3-4×	~2% ppl increase	Maximum speed

Tuning Workflow

Start with defaults:

llm = LLM(model="meta-llama/Meta-Llama-3-70B")

Enable FP8 (if H100):
```
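# QuantConfig and QuantAlgo are imported from tensorrt_llm.llmapi
llm = LLM(model="...", quant_config=QuantConfig(quant_algo=QuantAlgo.FP8))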
```

Tune batch size:

# Increase until OOM, then reduce 20%
trtllm-serve ... --max_batch_size 256
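
The "increase until OOM, then reduce 20%" rule can be written as a small search loop. This is a sketch only: `fits` is a hypothetical callback that would, in practice, start the server with a given `--max_batch_size`, run a short load test, and report whether it survived without an out-of-memory error.

```python
def tune_max_batch_size(fits, start=32, step=32, backoff_percent=20, limit=4096):
    """Grow the batch size until `fits` fails, then back off from that size."""
    batch = start
    while batch <= limit and fits(batch):
        batch += step
    # `batch` is now the first size that failed (or exceeded `limit`)
    return batch * (100 - backoff_percent) // 100


# Stand-in for a real load test: pretend anything above 300 runs out of memory
def fake_fits(batch_size):
    return batch_size <= 300


print(tune_max_batch_size(fake_fits))  # 256
```

Here 320 is the first failing size, and 80% of 320 gives 256. The backoff leaves headroom for long sequences and memory spikes that a short test run may not trigger.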

Enable chunked context (if long prompts):

--enable_chunked_context --max_chunked_prefill_length 2048

Try speculative decoding (if latency critical):

llm = LLM(model="...", speculative_model="...")

Benchmarking

# Install benchmark tool
pip install "tensorrt_llm[benchmark]"

# Run benchmark
python benchmarks/python/benchmark.py \
    --model meta-llama/Meta-Llama-3-8B \
    --batch_size 64 \
    --input_len 128 \
    --output_len 256 \
    --dtype fp8

Metrics to track:

Throughput (tokens/sec)
Latency P50/P90/P99 (ms)
GPU memory usage (GB)
GPU utilization (%)
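
If the benchmark output is a list of per-request latencies, the throughput and percentile metrics above can be derived as follows. This is a minimal sketch with made-up sample numbers; it uses the nearest-rank percentile definition, and other tools may interpolate and report slightly different values.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (p in 1..100) of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def throughput_tokens_per_sec(total_output_tokens, elapsed_seconds):
    return total_output_tokens / elapsed_seconds


# Illustrative numbers only, not measured results
latencies_ms = [120, 95, 110, 105, 300, 100, 98, 102, 115, 108]
print(percentile(latencies_ms, 50))  # 105
print(percentile(latencies_ms, 90))  # 120
print(percentile(latencies_ms, 99))  # 300
print(throughput_tokens_per_sec(64 * 256, 4.0))  # 4096.0
```

The single 300 ms outlier leaves P50 and P90 unchanged but dominates P99, which is why tail percentiles are tracked separately from the median.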

Common Issues

OOM errors:

Reduce max_batch_size
Reduce max_num_tokens
Enable INT4 quantization
Increase tensor_parallel_size

Low throughput:

Increase max_batch_size
Enable in-flight batching
Verify CUDA graphs enabled
Check GPU utilization

High latency:

Try speculative decoding
Reduce max_batch_size (less queueing)
Use FP8 instead of FP16
