skills/mlops/tensorrt-llm/SKILL.md

---
name: tensorrt-llm
description: Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [tensorrt-llm, torch]
metadata:
  hermes:
    tags: [Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU]

---

# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

## When to use TensorRT-LLM

**Use TensorRT-LLM when:**
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

**Use vLLM instead when:**
- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware

**Use llama.cpp instead when:**
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want simpler GGUF quantization format

## Quick start

### Installation

```bash
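# Docker (recommended): release images are published on NGC
# The tag is assumed to match the pip version below; check the NGC catalog for available tags
docker pull nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3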

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```

### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result holds one or more completions; print the first
    print(output.outputs[0].text)
```

### Serving with trtllm-serve

```bash
# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
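
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000` and returns an OpenAI-style response body. `build_chat_request` and `send` are hypothetical helper names, not part of TensorRT-LLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="meta-llama/Meta-Llama-3-8B",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 100,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (needs a running server):
# reply = send(build_chat_request("Hello!"))
```

Building the request separately from sending it makes the payload easy to inspect or log before it reaches the server.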

## Key features

### Performance optimizations
- **In-flight batching**: Dynamic batching during generation
- **Paged KV cache**: Efficient memory management
- **Flash Attention**: Optimized attention kernels
- **Quantization**: FP8, INT4, FP4 for 2-4× faster inference
- **CUDA graphs**: Reduced kernel launch overhead
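
To see why in-flight batching helps, the toy simulation below counts decode steps for static batching (a batch runs until its longest request finishes) versus in-flight batching (a finished request's slot is refilled right away). It is a simplified model with hypothetical function names, not TensorRT-LLM's scheduler, and it ignores prefill cost and memory limits.

```python
def static_batching_steps(output_lengths, max_batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), max_batch_size):
        batch = output_lengths[start:start + max_batch_size]
        steps += max(batch)
    return steps


def inflight_batching_steps(output_lengths, max_batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    pending = list(output_lengths)  # requests not yet scheduled
    active = []                     # tokens still to generate per running request
    steps = 0
    while pending or active:
        # Fill free slots before each decode step.
        while pending and len(active) < max_batch_size:
            active.append(pending.pop(0))
        steps += 1
        # Every active request emits one token; drop the ones that just finished.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [2, 10, 2, 10, 2, 10]  # output tokens per request
print(static_batching_steps(lengths, max_batch_size=2))    # 30
print(inflight_batching_steps(lengths, max_batch_size=2))  # 22
```

The gap grows when output lengths within a batch differ widely, because short requests no longer wait for long ones.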

### Parallelism
- **Tensor parallelism (TP)**: Split model across GPUs
- **Pipeline parallelism (PP)**: Layer-wise distribution
- **Expert parallelism**: For Mixture-of-Experts models
- **Multi-node**: Scale beyond single machine
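
As a rough illustration of tensor parallelism, the sketch below splits a weight matrix column-wise across shards, lets each shard compute its slice of the output, and concatenates the slices. It uses plain Python lists and hypothetical helper names. A real deployment runs the shards on separate GPUs and uses collective communication to combine results.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split w column-wise into num_shards equal blocks, one per device."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def column_parallel_linear(x, w, num_shards):
    """Each shard computes its slice of the output; slices are concatenated."""
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(column_parallel_linear(x, w, num_shards=2))  # [11.0, 14.0, 17.0, 20.0]
```

Each shard holds only `1 / num_shards` of the weights, which is what lets a model too large for one GPU fit across several.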

### Advanced features
- **Speculative decoding**: Faster generation with draft models
- **LoRA serving**: Efficient multi-adapter deployment
- **Disaggregated serving**: Separate prefill and generation

## Common patterns

### Quantized model (FP8)

```python
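from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# FP8 needs FP8-capable GPUs (Hopper/Ada or newer); weights take roughly half the memory of FP16.
# Quantization is requested via quant_config; dtype sets the compute precision (e.g. "bfloat16").
# Depending on version and backend, a pre-quantized FP8 checkpoint may be required instead.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    max_num_tokens=8192
)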

# Inference same as before
outputs = llm.generate(["Summarize this article..."])
```

### Multi-GPU deployment

```python
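from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# Tensor parallelism across 8 GPUs (the 405B model was released as Llama 3.1)
llm = LLM(
    model="meta-llama/Llama-3.1-405B",
    tensor_parallel_size=8,
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8)
)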
```

### Batch inference

```python
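from tensorrt_llm import SamplingParams

# Process 100 prompts in one call, reusing an `llm` object created as in the earlier examples
prompts = [f"Question {i}: ..." for i in range(100)]

outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200)
)

# In-flight batching is applied automatically; no manual batching is needed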
```

## Performance benchmarks

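Indicative figures; actual results depend on batch size, sequence lengths, precision, and library version.

**Meta Llama 3-8B** (H100 GPU):
- Throughput: 24,000 tokens/sec (aggregate across batched requests)
- Latency: ~10ms per token
- vs PyTorch: **up to 100× faster**

**Llama 3-70B** (FP8 vs FP16; FP8 requires Hopper- or Ada-class GPUs such as H100, since A100 has no FP8 Tensor Cores):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8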
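
The 50% memory figure follows from bytes per parameter: FP16 stores 2 bytes per weight and FP8 stores 1. The helper below is a back-of-the-envelope sketch with a hypothetical name. It counts weights only and ignores the KV cache, activations, and quantization scale factors, so real usage is higher.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision, tensor_parallel_size=1):
    """Approximate weight memory per GPU in GB (10**9 bytes)."""
    total_bytes = num_params * BYTES_PER_PARAM[precision]
    return total_bytes / tensor_parallel_size / 1e9


print(weight_memory_gb(70e9, "fp16"))                         # 140.0
print(weight_memory_gb(70e9, "fp8"))                          # 70.0
print(weight_memory_gb(70e9, "fp8", tensor_parallel_size=8))  # 8.75
```

Dividing by `tensor_parallel_size` assumes weights are sharded evenly across GPUs, which is a simplification.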

## Supported models

- **LLaMA family**: Llama 2, Llama 3, CodeLlama
- **GPT family**: GPT-2, GPT-J, GPT-NeoX
- **Qwen**: Qwen, Qwen2, QwQ
- **DeepSeek**: DeepSeek-V2, DeepSeek-V3
- **Mixtral**: Mixtral-8x7B, Mixtral-8x22B
- **Vision**: LLaVA, Phi-3-vision
- **100+ models** on Hugging Face

## References

- **[Optimization Guide](references/optimization.md)** - Quantization, batching, KV cache tuning
- **[Multi-GPU Setup](references/multi-gpu.md)** - Tensor/pipeline parallelism, multi-node
- **[Serving Guide](references/serving.md)** - Production deployment, monitoring, autoscaling

## Resources

- **Docs**: https://nvidia.github.io/TensorRT-LLM/
- **GitHub**: https://github.com/NVIDIA/TensorRT-LLM
- **Models**: https://huggingface.co/models?library=tensorrt_llm