---
name: tensorrt-llm
description: Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [tensorrt-llm, torch]
metadata:
  hermes:
    tags: [Inference Serving, TensorRT-LLM, NVIDIA, Inference Optimization, High Throughput, Low Latency, Production, FP8, INT4, In-Flight Batching, Multi-GPU]

---

# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.

## When to use TensorRT-LLM

**Use TensorRT-LLM when:**
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3-8B on an H100)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

**Use vLLM instead when:**
- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware

**Use llama.cpp instead when:**
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want simpler GGUF quantization format

## Quick start

### Installation

```bash
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```

### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each result holds one or more completions; print the first one
    print(output.outputs[0].text)
```

### Serving with trtllm-serve

```bash
# Start server (automatic model download and compilation)
# --tp_size 4 enables tensor parallelism across 4 GPUs
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100
    }'
```
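
The same client request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server from the example above is listening on `localhost:8000` and that the response follows the OpenAI chat-completions schema. `build_chat_request` and `chat` are hypothetical helper names, not part of TensorRT-LLM.

```python
import json
import urllib.request

# Assumed to match the trtllm-serve example: host, port, and model name.
BASE_URL = "http://localhost:8000"
MODEL = "meta-llama/Meta-Llama-3-8B"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL,
                       temperature=0.7, max_tokens=100):
    """Build (but do not send) a POST request to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60):
    """Send the request and return the text of the first choice."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response with choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply. Building the request separately from sending it makes the payload easy to inspect without a live server.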

## Key features

### Performance optimizations
- **In-flight batching**: Dynamic batching during generation
- **Paged KV cache**: Efficient memory management
- **Flash Attention**: Optimized attention kernels
- **Quantization**: FP8, INT4, FP4 for 2-4× faster inference
- **CUDA graphs**: Reduced kernel launch overhead
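
To see why in-flight batching helps, the toy simulation below counts decoding steps for a queue of requests with different output lengths. It is an illustrative sketch, not TensorRT-LLM's scheduler: it ignores prefill and memory limits and assumes every sequence emits one token per step. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths, batch_size):
    """In-flight batching: a finished sequence frees its slot immediately."""
    queue = deque(lengths)   # output lengths (>= 1) of waiting requests
    active = []              # tokens still to generate per running sequence
    steps = 0
    while queue or active:
        # Fill any free slots before each decoding step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Sequences with one token left finish on this step and leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print("static:   ", static_batching_steps(lengths, batch_size=4))
print("in-flight:", inflight_batching_steps(lengths, batch_size=4))
```

For this workload the static schedule needs 16 steps and the in-flight schedule 10, because short requests no longer wait for the longest sequence in their batch.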

### Parallelism
- **Tensor parallelism (TP)**: Split model across GPUs
- **Pipeline parallelism (PP)**: Layer-wise distribution
- **Expert parallelism**: For Mixture-of-Experts models
- **Multi-node**: Scale beyond single machine
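
As a rough picture of the layer-wise distribution used by pipeline parallelism, the sketch below assigns contiguous layer ranges to stages as evenly as possible. `split_layers` is a hypothetical helper for illustration; the partitioning TensorRT-LLM actually applies may differ.

```python
def split_layers(num_layers, pp_size):
    """Assign contiguous layer ranges to pipeline stages, as evenly as possible."""
    if not 1 <= pp_size <= num_layers:
        raise ValueError("pp_size must be between 1 and num_layers")
    base, extra = divmod(num_layers, pp_size)
    stages, start = [], 0
    for stage in range(pp_size):
        # The first `extra` stages take one additional layer.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for stage, layers in enumerate(split_layers(num_layers=32, pp_size=4)):
    print(f"stage {stage}: layers {layers.start}-{layers.stop - 1}")
```

Tensor parallelism instead splits the weights inside each layer, so the two can be combined: a deployment with TP size `t` and PP size `p` occupies `t * p` GPUs.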

### Advanced features
- **Speculative decoding**: Faster generation with draft models
- **LoRA serving**: Efficient multi-adapter deployment
- **Disaggregated serving**: Separate prefill and generation

## Common patterns

### Quantized model (FP8)

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# Quantize to FP8 at load time (roughly 2× faster, about half the memory of FP16)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
    max_num_tokens=8192,
)

# Inference is the same as before
outputs = llm.generate(["Summarize this article..."])
```
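
The memory saving follows from the bytes stored per parameter. The sketch below estimates the weight footprint only; it ignores quantization scales, the KV cache, and activations, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
# Bytes per parameter for each precision (weights only, scales ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5, "fp4": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate weight footprint in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp16", "fp8", "int4"):
    print(f"70B @ {precision}: {weight_memory_gb(70e9, precision):.0f} GB")
```

For a 70B-parameter model this gives about 140 GB at FP16, 70 GB at FP8, and 35 GB at INT4, which is where the 50% figure for FP8 comes from.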

### Multi-GPU deployment

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

# Tensor parallelism across 8 GPUs with FP8 weights
llm = LLM(
    model="meta-llama/Llama-3.1-405B",
    tensor_parallel_size=8,
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
)
```

### Batch inference

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]

# In-flight batching is applied automatically across the prompts
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200),
)
```

## Performance benchmarks

**Meta Llama 3-8B** (H100 GPU):
- Throughput: 24,000 tokens/sec
- Latency: ~10ms per token
- vs PyTorch: up to **100× faster**

**Llama 3-70B** (8× A100 80GB):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8
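
The Llama 3-8B throughput and latency figures can be related with simple arithmetic. The sketch below assumes both numbers were measured under the same load and that the latency is per token for a single sequence; under those assumptions it shows how many sequences must decode concurrently to reach the aggregate throughput. `implied_concurrency` is a hypothetical helper.

```python
def implied_concurrency(throughput_tok_s, latency_s_per_token):
    """Sequences that must decode in parallel to reach the aggregate throughput."""
    per_sequence_tok_s = 1.0 / latency_s_per_token
    return throughput_tok_s / per_sequence_tok_s


# 24,000 tokens/sec in aggregate at ~10 ms per token per sequence
print(round(implied_concurrency(24_000, 0.010)))
```

At 10 ms per token a single sequence produces 100 tokens/sec, so 24,000 tokens/sec corresponds to roughly 240 sequences in flight. Peak throughput therefore depends on keeping large batches busy.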

## Supported models

- **LLaMA family**: Llama 2, Llama 3, CodeLlama
- **GPT family**: GPT-2, GPT-J, GPT-NeoX
- **Qwen**: Qwen, Qwen2, QwQ
- **DeepSeek**: DeepSeek-V2, DeepSeek-V3
- **Mixtral**: Mixtral-8x7B, Mixtral-8x22B
- **Vision**: LLaVA, Phi-3-vision
- **100+ models** on HuggingFace

## References

- **[Optimization Guide](references/optimization.md)** - Quantization, batching, KV cache tuning
- **[Multi-GPU Setup](references/multi-gpu.md)** - Tensor/pipeline parallelism, multi-node
- **[Serving Guide](references/serving.md)** - Production deployment, monitoring, autoscaling

## Resources

- **Docs**: https://nvidia.github.io/TensorRT-LLM/
- **GitHub**: https://github.com/NVIDIA/TensorRT-LLM
- **Models**: https://huggingface.co/models?library=tensorrt_llm