# Multi-GPU Deployment Guide

A comprehensive guide to scaling TensorRT-LLM across multiple GPUs and nodes.

## Parallelism Strategies

### Tensor Parallelism (TP)

**What it does**: Splits each layer's weight matrices across GPUs (an intra-layer, "horizontal" split).

**Use case**:
- Model fits in total GPU memory but not on a single GPU
- Need low latency (single forward pass)
- GPUs on the same node (NVLink required for best performance)

**Example** (Llama 3-70B on 4× A100):
```python
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,  # Split across 4 GPUs
    dtype="fp16"
)

# Model automatically sharded across GPUs
# Single forward pass, low latency
```

**Performance**:
- Latency: About the same as a single GPU, or lower
- Throughput: Up to 4× higher with 4 GPUs; communication overhead keeps real scaling below linear (see Performance Scaling)
- Communication: High (activations synced every layer)

### Pipeline Parallelism (PP)

**What it does**: Assigns contiguous groups of layers to different GPUs (an inter-layer, "vertical" split).

**Use case**:
- Very large models (175B+)
- Can tolerate higher latency
- GPUs across multiple nodes

**Example** (Llama 3.1-405B on 8× H100, 2 nodes × 4 GPUs):
```python
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-405B",
    tensor_parallel_size=4,    # TP=4 within nodes
    pipeline_parallel_size=2,  # PP=2 across nodes
    dtype="fp8"
)

# Total: 8 GPUs (4×2)
# The model has 126 layers, split into two pipeline stages:
# Layers 0-62: Node 1 (4 GPUs with TP)
# Layers 63-125: Node 2 (4 GPUs with TP)
```

**Performance**:
- Latency: Higher (sequential through pipeline)
- Throughput: High with micro-batching
- Communication: Lower than TP

### Expert Parallelism (EP)

**What it does**: Distributes MoE experts across GPUs.

**Use case**: Mixture-of-Experts models (Mixtral, DeepSeek-V2)

**Example** (Mixtral-8x22B on 8× A100):
```python
from tensorrt_llm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x22B-v0.1",
    tensor_parallel_size=8,      # 8 GPUs in total
    moe_expert_parallel_size=2,  # Distribute 8 experts across 2 groups (EP=2 × MoE TP=4 = TP size)
    dtype="fp16"                 # A100 has no native FP8 support
)
```

## Configuration Examples

### Small model (7-13B) - Single GPU

```python
from tensorrt_llm import LLM

# Llama 3-8B on 1× A100 80GB
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    dtype="fp16"  # or fp8 for H100
)
```

**Resources**:
- GPU: 1× A100 80GB
- Memory: ~16GB model + 30GB KV cache
- Throughput: 3,000-5,000 tokens/sec

### Medium model (70B) - Multi-GPU same node

```python
from tensorrt_llm import LLM

# Llama 3-70B on 4× A100 80GB (NVLink)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=4,
    dtype="fp16"  # ~140GB of weights → ~35GB per GPU
)
```

**Resources**:
- GPU: 4× A100 80GB with NVLink
- Memory: ~35GB of weights per GPU (FP16)
- Throughput: 10,000-15,000 tokens/sec
- Latency: 15-20ms per token

### Large model (405B) - Multi-node

```python
from tensorrt_llm import LLM

# Llama 3.1-405B on 2 nodes × 8 H100 = 16 GPUs
llm = LLM(
    model="meta-llama/Llama-3.1-405B",
    tensor_parallel_size=8,    # TP within each node
    pipeline_parallel_size=2,  # PP across 2 nodes
    dtype="fp8"
)
```

**Resources**:
- GPU: 2 nodes × 8 H100 80GB
- Memory: ~25GB of weights per GPU (FP8)
- Throughput: 20,000-30,000 tokens/sec
- Network: InfiniBand recommended

## Server Deployment

### Single-node multi-GPU

```bash
# Llama 3-70B on 4 GPUs (automatic TP)
trtllm-serve meta-llama/Meta-Llama-3-70B \
    --tp_size 4 \
    --max_batch_size 256 \
    --dtype fp8

# Listens on http://localhost:8000
```

### Multi-node with Ray

```bash
# Node 1 (head node)
ray start --head --port=6379

# Node 2 (worker)
ray start --address='node1:6379'

# Deploy across cluster (--num_workers 2: one worker per node)
trtllm-serve meta-llama/Llama-3.1-405B \
    --tp_size 8 \
    --pp_size 2 \
    --num_workers 2 \
    --dtype fp8
```

### Kubernetes deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm-llama3-70b
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value here is illustrative
  selector:
    matchLabels:
      app: tensorrt-llm-llama3-70b
  template:
    metadata:
      labels:
        app: tensorrt-llm-llama3-70b
    spec:
      containers:
      - name: trtllm
        image: nvidia/tensorrt_llm:latest
        command:
        - trtllm-serve
        - meta-llama/Meta-Llama-3-70B
        - --tp_size=4
        - --max_batch_size=256
        resources:
          limits:
            nvidia.com/gpu: 4  # Request 4 GPUs
```

## Parallelism Decision Tree

```
Model size < 20GB?
├─ YES: Single GPU (no parallelism)
└─ NO: Model size < 80GB?
    ├─ YES: TP=2 or TP=4 (same node)
    └─ NO: Model size < 320GB?
        ├─ YES: TP=4 or TP=8 (same node, NVLink required)
        └─ NO: TP=8 + PP=2 (multi-node)
```

## Communication Optimization

### NVLink vs PCIe

**NVLink** (DGX A100, HGX H100):
- Bandwidth: 600 GB/s (A100), 900 GB/s (H100)
- Ideal for TP (high communication)
- **Recommended for all multi-GPU setups**

**PCIe**:
- Bandwidth: 64 GB/s bidirectional (PCIe 4.0 x16)
- Roughly 10× lower bandwidth than NVLink
- Avoid TP, use PP instead

### InfiniBand for multi-node

**HDR InfiniBand** (200 Gb/s):
- Strongly recommended for multi-node TP or PP
- Latency: <1μs
- **Especially important for 405B+ models**

## Monitoring Multi-GPU

```bash
# Monitor GPU utilization
nvidia-smi dmon -s u

# Monitor memory
nvidia-smi dmon -s m

# Check NVLink link status
nvidia-smi nvlink --status

# TensorRT-LLM built-in metrics
curl http://localhost:8000/metrics
```

**Key metrics**:
- GPU utilization: Target 80-95%
- Memory usage: Should be balanced across GPUs
- NVLink traffic: High for TP, low for PP
- Throughput: Tokens/sec across all GPUs

## Common Issues

### Imbalanced GPU memory

**Symptom**: GPU 0 is at 90% memory usage while GPU 3 is at 40%

**Solutions**:
- Verify TP/PP configuration
- Check model sharding (should be equal)
- Restart server to reset state

### Low NVLink utilization

**Symptom**: NVLink bandwidth <100 GB/s with TP=4

**Solutions**:
- Verify NVLink topology: `nvidia-smi topo -m`
- Check for PCIe fallback
- Ensure GPUs are on the same NVSwitch

### OOM with multi-GPU

**Solutions**:
- Increase TP size (more GPUs)
- Reduce batch size
- Enable FP8 quantization
- Use pipeline parallelism

## Performance Scaling

### TP Scaling (Llama 3-70B, FP8)

| GPUs | TP Size | Throughput | Latency (per token) | Efficiency |
|------|---------|------------|---------------------|------------|
| 1 | 1 | OOM | - | - |
| 2 | 2 | 6,000 tok/s | 18ms | 85% |
| 4 | 4 | 11,000 tok/s | 16ms | 78% |
| 8 | 8 | 18,000 tok/s | 15ms | 64% |

**Note**: Efficiency is measured against ideal linear scaling and drops with more GPUs due to communication overhead.

### PP Scaling (Llama 3.1-405B, FP8)

| Nodes | TP | PP | Total GPUs | Throughput |
|-------|----|----|------------|------------|
| 1 | 8 | 1 | 8 | OOM |
| 2 | 8 | 2 | 16 | 25,000 tok/s |
| 4 | 8 | 4 | 32 | 45,000 tok/s |

## Best Practices

1. **Prefer TP over PP** when possible (lower latency)
2. **Use NVLink** for all TP deployments
3. **Use InfiniBand** for multi-node deployments
4. **Start with smallest TP** that fits model in memory
5. **Monitor GPU balance** - all GPUs should have similar utilization
6. **Test with benchmark** before production
7. **Use FP8** on H100 for up to 2× speedup
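The advice to start with the smallest TP that fits the model in memory can be turned into a rough sizing check. The sketch below is not part of TensorRT-LLM: the helper names `weights_gb_per_gpu` and `smallest_tp` are hypothetical, it counts weight memory only, and the `weight_fraction` budget of 60% of GPU memory (leaving the rest for KV cache and activations) is an assumption to tune for your workload.

```python
from typing import Optional


def weights_gb_per_gpu(params_billion: float, bytes_per_param: float,
                       tp: int = 1, pp: int = 1) -> float:
    """Approximate weight memory per GPU in GB, assuming an even split over TP x PP."""
    return params_billion * bytes_per_param / (tp * pp)


def smallest_tp(params_billion: float, bytes_per_param: float,
                gpu_gb: float = 80.0, weight_fraction: float = 0.6,
                max_tp: int = 8) -> Optional[int]:
    """Return the smallest power-of-two TP whose weight shard fits the budget.

    weight_fraction is an assumed share of GPU memory reserved for weights;
    the remainder is left for KV cache and activations.
    Returns None when even max_tp does not fit on a single node.
    """
    budget_gb = gpu_gb * weight_fraction
    tp = 1
    while tp <= max_tp:
        if weights_gb_per_gpu(params_billion, bytes_per_param, tp) <= budget_gb:
            return tp
        tp *= 2
    return None  # consider adding pipeline parallelism across nodes


if __name__ == "__main__":
    # bytes_per_param: 2 for FP16, 1 for FP8
    print(weights_gb_per_gpu(70, 2, tp=4))         # 70B FP16 on 4 GPUs
    print(weights_gb_per_gpu(405, 1, tp=8, pp=2))  # 405B FP8 on 16 GPUs
    print(smallest_tp(8, 2))    # 8B FP16
    print(smallest_tp(70, 2))   # 70B FP16
    print(smallest_tp(405, 1))  # 405B FP8
```

Under these assumptions, a 70B model in FP16 lands on TP=4 and a 405B model in FP8 does not fit on a single 8-GPU node, which lines up with the configurations used in this guide. Treat the result as a starting point and confirm it with a benchmark, since KV cache size depends on batch size and sequence length.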