The skills directory was getting disorganized — mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each. Code change: - prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure. Split mlops (40 skills) into 7 sub-categories: - mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth - mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm - mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper - mlops/vector-databases (4): chroma, faiss, pinecone, qdrant - mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases - mlops/cloud (2): lambda-labs, modal - mlops/research (1): dspy Merged singleton categories: - gifs → media (gif-search joins youtube-content) - music-creation → media (heartmula, songsee) - diagramming → creative (excalidraw joins ascii-art) - ocr-and-documents → productivity - domain → research (domain-intel) - feeds → research (blogwatcher) - market-data → research (polymarket) Fixed misplaced skills: - mlops/code-review → software-development (not ML-specific) - mlops/ml-paper-writing → research (academic writing) Added DESCRIPTION.md files for all new/updated categories.
4.9 KiB
4.9 KiB
name, description, version, author, license, dependencies, metadata
| name | description | version | author | license | dependencies | metadata | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tensorrt-llm | Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling. | 1.0.0 | Orchestra Research | MIT |
|
|
TensorRT-LLM
NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
When to use TensorRT-LLM
Use TensorRT-LLM when:
- Deploying on NVIDIA GPUs (A100, H100, GB200)
- Need maximum throughput (24,000+ tokens/sec on Llama 3)
- Require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes
Use vLLM instead when:
- Need simpler setup and Python-first API
- Want PagedAttention without TensorRT compilation
- Working with AMD GPUs or non-NVIDIA hardware
Use llama.cpp instead when:
- Deploying on CPU or Apple Silicon
- Need edge deployment without NVIDIA GPUs
- Want simpler GGUF quantization format
Quick start
Installation
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest
# pip install
pip install tensorrt_llm==1.2.0rc3
# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
Basic inference
from tensorrt_llm import LLM, SamplingParams
# Initialize model
llm = LLM(model="meta-llama/Meta-Llama-3-8B")
# Configure sampling
sampling_params = SamplingParams(
max_tokens=100,
temperature=0.7,
top_p=0.9
)
# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(output.text)
Serving with trtllm-serve
# Start server (automatic model download and compilation)
trtllm-serve meta-llama/Meta-Llama-3-8B \
--tp_size 4 \ # Tensor parallelism (4 GPUs)
--max_batch_size 256 \
--max_num_tokens 4096
# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
Key features
Performance optimizations
- In-flight batching: Dynamic batching during generation
- Paged KV cache: Efficient memory management
- Flash Attention: Optimized attention kernels
- Quantization: FP8, INT4, FP4 for 2-4× faster inference
- CUDA graphs: Reduced kernel launch overhead
Parallelism
- Tensor parallelism (TP): Split model across GPUs
- Pipeline parallelism (PP): Layer-wise distribution
- Expert parallelism: For Mixture-of-Experts models
- Multi-node: Scale beyond single machine
Advanced features
- Speculative decoding: Faster generation with draft models
- LoRA serving: Efficient multi-adapter deployment
- Disaggregated serving: Separate prefill and generation
Common patterns
Quantized model (FP8)
from tensorrt_llm import LLM
# Load FP8 quantized model (2× faster, 50% memory)
llm = LLM(
model="meta-llama/Meta-Llama-3-70B",
dtype="fp8",
max_num_tokens=8192
)
# Inference same as before
outputs = llm.generate(["Summarize this article..."])
Multi-GPU deployment
# Tensor parallelism across 8 GPUs
llm = LLM(
model="meta-llama/Meta-Llama-3-405B",
tensor_parallel_size=8,
dtype="fp8"
)
Batch inference
# Process 100 prompts efficiently
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(
prompts,
sampling_params=SamplingParams(max_tokens=200)
)
# Automatic in-flight batching for maximum throughput
Performance benchmarks
Meta Llama 3-8B (H100 GPU):
- Throughput: 24,000 tokens/sec
- Latency: ~10ms per token
- vs PyTorch: 100× faster
Llama 3-70B (8× A100 80GB):
- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8
Supported models
- LLaMA family: Llama 2, Llama 3, CodeLlama
- GPT family: GPT-2, GPT-J, GPT-NeoX
- Qwen: Qwen, Qwen2, QwQ
- DeepSeek: DeepSeek-V2, DeepSeek-V3
- Mixtral: Mixtral-8x7B, Mixtral-8x22B
- Vision: LLaVA, Phi-3-vision
- 100+ models on HuggingFace
References
- Optimization Guide - Quantization, batching, KV cache tuning
- Multi-GPU Setup - Tensor/pipeline parallelism, multi-node
- Serving Guide - Production deployment, monitoring, autoscaling