# TensorRT-LLM Optimization Guide

Comprehensive guide to optimizing LLM inference with TensorRT-LLM.
## Quantization

### FP8 Quantization (Recommended for H100)

**Benefits**:

- 2× faster inference
- 50% memory reduction
- Minimal accuracy loss (<1% perplexity degradation)

**Usage**:

```python
from tensorrt_llm import LLM

# Automatic FP8 quantization
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    quantization="fp8"
)
```

**Performance** (Llama 3-70B on 8× H100):

- FP16: 5,000 tokens/sec
- FP8: **10,000 tokens/sec** (2× speedup)
- Memory: 140 GB → 70 GB
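The memory figure is just parameter count times bytes per weight. A quick sanity check (weights only; KV cache and activations come on top):

```python
# Weight memory for a 70B-parameter model: bytes/param × param count.
# FP16 stores 2 bytes per parameter, FP8 stores 1 byte.
params = 70e9
fp16_gb = params * 2 / 1e9  # 140.0
fp8_gb = params * 1 / 1e9   # 70.0
print(f"FP16: {fp16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB")
```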
### INT4 Quantization (Maximum compression)

**Benefits**:

- 4× memory reduction
- 3-4× faster inference
- Fits larger models on the same hardware

**Usage**:

```python
from tensorrt_llm import LLM

# INT4 with AWQ calibration
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    dtype="int4_awq",
    quantization="awq"
)

# INT4 with GPTQ calibration
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    dtype="int4_gptq",
    quantization="gptq"
)
```

**Trade-offs**:

- Accuracy: 1-3% perplexity increase
- Speed: 3-4× faster than FP16
- Use case: When memory is critical
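The "fits larger models" point is plain arithmetic: at roughly 0.5 bytes per weight, a 405B model's weights shrink from about 810 GB in FP16 to about 202 GB, which fits on a single 8× H100-80GB node:

```python
# INT4 stores 0.5 bytes per parameter (plus small scale/zero-point
# overhead, ignored here). Weight-only estimate for a 405B model:
params = 405e9
fp16_gb = params * 2.0 / 1e9   # 810.0
int4_gb = params * 0.5 / 1e9   # 202.5
node_gb = 8 * 80               # 8× H100-80GB node
print(fp16_gb > node_gb, int4_gb < node_gb)  # → True True
```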
## In-Flight Batching

**What it does**: Dynamically batches requests during generation instead of waiting for all sequences in a batch to finish.

**Configuration**:

```bash
# Server configuration:
#   --max_batch_size          maximum concurrent sequences
#   --max_num_tokens          total tokens in a batch
#   --enable_chunked_context  split long prompts into chunks
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --max_batch_size 256 \
  --max_num_tokens 4096 \
  --enable_chunked_context \
  --scheduler_policy max_utilization
```

**Performance**:

- Throughput: **4-8× higher** vs static batching
- Latency: lower P50/P99 for mixed workloads
- GPU utilization: 80-95% vs 40-60%
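A toy model (not TensorRT-LLM code) shows where the gain comes from: under static batching a batch holds the GPU until its longest sequence finishes, while in-flight batching refills a slot the moment any sequence completes. Assuming one token per sequence per step and a fixed batch capacity:

```python
import math

# Output lengths of 64 requests with a mix of short and long generations
lengths = [32, 64, 128, 256] * 16
capacity = 8  # sequences processed per step

# Static batching: each batch of `capacity` runs until its LONGEST
# sequence finishes, so slots of short sequences sit idle.
batches = [lengths[i:i + capacity] for i in range(0, len(lengths), capacity)]
static_steps = sum(max(b) for b in batches)

# In-flight batching: finished slots are refilled immediately, so total
# steps ≈ total generated tokens / capacity.
inflight_steps = math.ceil(sum(lengths) / capacity)

print(static_steps, inflight_steps, round(static_steps / inflight_steps, 2))
```

With this workload the toy model already gives a >2× throughput gap; real gains also depend on prompt lengths and arrival patterns.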
## Paged KV Cache

**What it does**: Manages KV-cache memory the way an OS manages virtual memory (paging).

**Benefits**:

- 40-60% higher throughput
- No memory fragmentation
- Supports longer sequences

**Configuration**:

```python
# Automatic paged KV cache (default)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    kv_cache_free_gpu_mem_fraction=0.9,  # use 90% GPU mem for cache
    enable_prefix_caching=True           # cache common prefixes
)
```
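For a sense of scale, the per-token KV-cache footprint can be estimated from the model config. The sketch below assumes Llama 3-8B's published shape (32 layers, 8 KV heads under GQA, head dim 128) and FP16 cache entries:

```python
# Per-token KV-cache bytes: 2 (K and V) × layers × kv_heads × head_dim × bytes/elem
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # bytes
seq_8k_gib = per_token * 8192 / 1024**3
print(f"{per_token / 1024:.0f} KiB/token, {seq_8k_gib:.0f} GiB for one 8K sequence")
```

At 128 KiB per token, a single 8K-token sequence consumes a full GiB of cache, which is why paging and a tuned `kv_cache_free_gpu_mem_fraction` matter.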
## Speculative Decoding

**What it does**: Uses a small draft model to predict multiple tokens ahead, which the target model verifies in parallel.

**Speedup**: 2-3× faster for long generations

**Usage**:

```python
from tensorrt_llm import LLM

# Target model (Llama 3-70B)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    speculative_model="meta-llama/Meta-Llama-3-8B",  # draft model
    num_speculative_tokens=5  # tokens to predict ahead
)

# Same API, 2-3× faster
outputs = llm.generate(prompts)
```

**Best models for drafting**:

- Target: Llama 3-70B → Draft: Llama 3-8B
- Target: Qwen2-72B → Draft: Qwen2-7B
- Same model family, 8-10× smaller
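The quoted 2-3× follows from the standard speculative-decoding analysis: with k draft tokens and per-token acceptance rate a, each target-model forward pass accepts (1 − a^(k+1)) / (1 − a) tokens in expectation (a geometric series; the acceptance rate here is an illustrative assumption, not a measured figure):

```python
def expected_tokens_per_pass(a: float, k: int) -> float:
    """Expected accepted tokens per target forward pass (geometric series)."""
    return (1 - a ** (k + 1)) / (1 - a)

# With num_speculative_tokens=5 and an assumed 80% acceptance rate,
# each expensive 70B forward pass yields ~3.7 tokens instead of 1.
print(round(expected_tokens_per_pass(0.8, 5), 2))  # → 3.69
```

This is also why same-family draft models work best: higher agreement with the target raises a, and the expected tokens per pass with it.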
## CUDA Graphs

**What it does**: Reduces kernel launch overhead by recording GPU operations into a replayable graph.

**Benefits**:

- 10-20% lower latency
- More stable P99 latency
- Better for small batch sizes

**Configuration** (automatic by default):

```python
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    enable_cuda_graph=True,   # default: True
    cuda_graph_cache_size=2   # cache 2 graph variants
)
```
## Chunked Context

**What it does**: Splits long prompts into chunks during prefill to reduce memory spikes.

**Use case**: Prompts >8K tokens with limited GPU memory

**Configuration**:

```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --max_num_tokens 4096 \
  --enable_chunked_context \
  --max_chunked_prefill_length 2048  # process 2K tokens at a time
```
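The chunk size bounds peak prefill memory; the number of prefill passes is a simple ceiling division:

```python
import math

# With a 2048-token chunk size, a 10,000-token prompt is prefilled
# in ceil(10000 / 2048) passes of at most 2048 tokens each.
prompt_len, chunk_len = 10_000, 2048
passes = math.ceil(prompt_len / chunk_len)
print(passes)  # → 5
```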
## Overlap Scheduling

**What it does**: Overlaps compute and memory operations.

**Benefits**:

- 15-25% higher throughput
- Better GPU utilization
- Default in v1.2.0+

**No configuration needed** - enabled automatically.
## Quantization Comparison Table

| Method | Memory | Speed | Accuracy | Use Case |
|--------|--------|-------|----------|----------|
| FP16 | 1× (baseline) | 1× | Best | High accuracy needed |
| FP8 | 0.5× | 2× | -0.5% ppl | **H100 default** |
| INT4 AWQ | 0.25× | 3-4× | -1.5% ppl | Memory critical |
| INT4 GPTQ | 0.25× | 3-4× | -2% ppl | Maximum speed |
## Tuning Workflow

1. **Start with defaults**:

   ```python
   llm = LLM(model="meta-llama/Meta-Llama-3-70B")
   ```

2. **Enable FP8** (if H100):

   ```python
   llm = LLM(model="...", dtype="fp8")
   ```

3. **Tune batch size**:

   ```bash
   # Increase until OOM, then reduce 20%
   trtllm-serve ... --max_batch_size 256
   ```

4. **Enable chunked context** (if long prompts):

   ```bash
   --enable_chunked_context --max_chunked_prefill_length 2048
   ```

5. **Try speculative decoding** (if latency critical):

   ```python
   llm = LLM(model="...", speculative_model="...")
   ```
## Benchmarking

```bash
# Install benchmark tool
pip install tensorrt_llm[benchmark]

# Run benchmark
python benchmarks/python/benchmark.py \
  --model meta-llama/Meta-Llama-3-8B \
  --batch_size 64 \
  --input_len 128 \
  --output_len 256 \
  --dtype fp8
```

**Metrics to track**:

- Throughput (tokens/sec)
- Latency P50/P90/P99 (ms)
- GPU memory usage (GB)
- GPU utilization (%)
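The latency percentiles can be computed from raw per-request timings with the standard library alone (the sample data below is illustrative, not benchmark output):

```python
import statistics

# Per-request latencies in milliseconds (made-up sample data)
latencies_ms = [12, 15, 14, 90, 13, 16, 14, 15, 200, 14] * 10

# quantiles(n=100) returns 99 cut points; index i is the (i+1)th percentile
qs = statistics.quantiles(latencies_ms, n=100)
p50, p90, p99 = qs[49], qs[89], qs[98]
print(f"P50={p50:.1f} ms  P90={p90:.1f} ms  P99={p99:.1f} ms")
```

Tracking P99 alongside P50 matters because a few slow requests (long prompts, queueing) dominate tail latency without moving the median.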
## Common Issues

**OOM errors**:

- Reduce `max_batch_size`
- Reduce `max_num_tokens`
- Enable INT4 quantization
- Increase `tensor_parallel_size`

**Low throughput**:

- Increase `max_batch_size`
- Enable in-flight batching
- Verify CUDA graphs are enabled
- Check GPU utilization

**High latency**:

- Try speculative decoding
- Reduce `max_batch_size` (less queueing)
- Use FP8 instead of FP16