# Production Serving Guide

Comprehensive guide to deploying TensorRT-LLM in production environments.

## Server Modes

### trtllm-serve (Recommended)

**Features**:
- OpenAI-compatible API
- Automatic model download and compilation
- Built-in load balancing
- Prometheus metrics
- Health checks

**Basic usage**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 1 \
    --max_batch_size 256 \
    --port 8000
```

**Advanced configuration**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-70B \
    --tp_size 4 \
    --dtype fp8 \
    --max_batch_size 256 \
    --max_num_tokens 4096 \
    --enable_chunked_context \
    --scheduler_policy max_utilization \
    --port 8000 \
    --api_key "$API_KEY"  # Optional authentication
```

### Python LLM API (For embedding in your own service)

```python
from tensorrt_llm import LLM, SamplingParams


class LLMService:
    def __init__(self):
        # Load the model once at startup; construction is expensive.
        self.llm = LLM(
            model="meta-llama/Meta-Llama-3-8B",
            dtype="fp8"
        )

    def generate(self, prompt, max_tokens=100):
        params = SamplingParams(
            max_tokens=max_tokens,
            temperature=0.7
        )
        outputs = self.llm.generate([prompt], params)
        # Each request result holds a list of completions; return the first.
        return outputs[0].outputs[0].text


# Use in FastAPI, Flask, etc.
from fastapi import FastAPI

app = FastAPI()
service = LLMService()


@app.post("/generate")
def generate(prompt: str):
    return {"response": service.generate(prompt)}
```

## OpenAI-Compatible API

### Chat Completions

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": false
  }'
```

**Response**:
```json
{
  "id": "chat-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "meta-llama/Meta-Llama-3-8B",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Quantum computing is..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
```

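The same request can be issued from Python with only the standard library. The following is a minimal sketch, assuming the server from the examples above is listening on `localhost:8000`; the helper names `build_chat_request`, `extract_reply`, and `chat` are illustrative and not part of TensorRT-LLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server address from the examples above


def build_chat_request(model, user_message, max_tokens=500, temperature=0.7):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return (assistant_text, total_tokens) from a chat.completion JSON body."""
    data = json.loads(response_body)
    choice = data["choices"][0]
    return choice["message"]["content"], data["usage"]["total_tokens"]


def chat(model, user_message, timeout=300):
    """Send one chat request and return the parsed reply (needs a running server)."""
    request = build_chat_request(model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(resp.read())
```

Calling `chat("meta-llama/Meta-Llama-3-8B", "Explain quantum computing")` would return the assistant text together with the `total_tokens` count, which is useful for per-request cost tracking.
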
### Streaming

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'
```

**Response** (SSE stream):
```
data: {"choices":[{"delta":{"content":"1"}}]}

data: {"choices":[{"delta":{"content":", 2"}}]}

data: {"choices":[{"delta":{"content":", 3"}}]}

data: [DONE]
```

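A client has to reassemble the text from these events. The sketch below assumes the stream has exactly the shape shown above (`data:` lines carrying JSON chunks, terminated by `data: [DONE]`); `collect_stream_text` is an illustrative helper name, and it works on any iterable of decoded lines, such as the lines read from an open HTTP response.

```python
import json


def collect_stream_text(lines):
    """Concatenate the delta content from an iterable of SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank event separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

In a chat UI the same loop would typically yield each `content` fragment as it arrives instead of joining them at the end.
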
### Completions

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "The capital of France is",
    "max_tokens": 10,
    "temperature": 0.0
  }'
```

## Monitoring

### Prometheus Metrics

**Enable metrics**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --enable_metrics \
    --metrics_port 9090
```

**Key metrics**:
```bash
# Scrape metrics
curl http://localhost:9090/metrics

# Important metrics:
# - trtllm_request_success_total - Total successful requests
# - trtllm_request_latency_seconds - Request latency histogram
# - trtllm_tokens_generated_total - Total tokens generated
# - trtllm_active_requests - Current active requests
# - trtllm_queue_size - Requests waiting in queue
# - trtllm_gpu_memory_usage_bytes - GPU memory usage
# - trtllm_kv_cache_usage_ratio - KV cache utilization
```

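For ad-hoc checks outside Prometheus, the scraped text can be parsed directly. This is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format without per-sample timestamps; `parse_metrics` is an illustrative helper, and the metric names are the ones listed above.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {sample_name_with_labels: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumption: no trailing timestamp, so the value is the last field.
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples
```

For example, `parse_metrics(body)["trtllm_queue_size"]` gives the current queue depth, which can feed a quick alert or a smoke test after deployment.
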
### Health Checks

```bash
# Readiness probe
curl http://localhost:8000/health/ready

# Liveness probe
curl http://localhost:8000/health/live

# Model info
curl http://localhost:8000/v1/models
```

**Kubernetes probes**:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5
```

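Outside Kubernetes (for example in a deployment script or CI smoke test), the readiness probe can be imitated with a polling loop. The sketch below is an assumption-laden illustration: `http_ready` and `wait_until_ready` are hypothetical helper names, and the probe is passed in as a callable so the loop can be exercised without a server.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout=2.0):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts=60, period=5.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `period` seconds after each failure.

    Returns the 1-based attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(period)
    return None
```

A typical call would be `wait_until_ready(lambda: http_ready("http://localhost:8000/health/ready"))`, which mirrors the 5-second `periodSeconds` of the readiness probe.
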
## Production Deployment

### Docker Deployment

**Dockerfile**:
```dockerfile
FROM nvidia/tensorrt_llm:latest

# Copy any custom configs
COPY config.yaml /app/config.yaml

# Expose ports
EXPOSE 8000 9090

# Start server
CMD ["trtllm-serve", "meta-llama/Meta-Llama-3-8B", \
     "--tp_size", "4", \
     "--dtype", "fp8", \
     "--max_batch_size", "256", \
     "--enable_metrics", \
     "--metrics_port", "9090"]
```

**Run container**:
```bash
# Build the image from the Dockerfile above, then run it
docker build -t tensorrt-llm:latest .
docker run --gpus all -p 8000:8000 -p 9090:9090 \
    tensorrt-llm:latest
```

### Kubernetes Deployment

**Complete deployment**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm
spec:
  replicas: 2  # Multiple replicas for HA
  selector:
    matchLabels:
      app: tensorrt-llm
  template:
    metadata:
      labels:
        app: tensorrt-llm
    spec:
      containers:
      - name: trtllm
        image: nvidia/tensorrt_llm:latest
        command:
        - trtllm-serve
        - meta-llama/Meta-Llama-3-70B
        - --tp_size=4
        - --dtype=fp8
        - --max_batch_size=256
        - --enable_metrics
        - --metrics_port=9090  # Matches the "metrics" containerPort below
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 9090
          name: metrics
        resources:
          limits:
            nvidia.com/gpu: 4
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8000
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: tensorrt-llm
spec:
  selector:
    app: tensorrt-llm
  ports:
  - name: http
    port: 80
    targetPort: 8000
  - name: metrics
    port: 9090
    targetPort: 9090
  type: LoadBalancer
```

### Load Balancing

**NGINX configuration**:
```nginx
upstream tensorrt_llm {
    least_conn;  # Route to least busy server
    server trtllm-1:8000 max_fails=3 fail_timeout=30s;
    server trtllm-2:8000 max_fails=3 fail_timeout=30s;
    server trtllm-3:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://tensorrt_llm;
        proxy_read_timeout 300s;  # Long timeout for slow generations
        proxy_connect_timeout 10s;
    }
}
```

## Autoscaling

### Horizontal Pod Autoscaler (HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tensorrt-llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tensorrt-llm
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: trtllm_active_requests
      target:
        type: AverageValue
        averageValue: "50"  # Scale when avg >50 active requests
```

### Custom Metrics

```yaml
# Scale based on queue size (add under spec.metrics of the HPA).
# Pods metrics require a custom metrics adapter, e.g. Prometheus Adapter.
- type: Pods
  pods:
    metric:
      name: trtllm_queue_size
    target:
      type: AverageValue
      averageValue: "10"
```

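To reason about how these targets translate into replica counts, it helps to apply the scaling rule from the Kubernetes HPA documentation by hand: `desired = ceil(current_replicas × current_value / target_value)`, skipped when the ratio is within a tolerance (0.1 by default) and clamped to `minReplicas`/`maxReplicas`. The sketch below is a simplified model of that rule, not the controller's full behavior (it ignores stabilization windows and unready pods); `desired_replicas` is an illustrative name.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for an AverageValue target."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 2 replicas averaging 80 active requests against a target of 50
print(desired_replicas(2, 80, 50))  # 4
```

With the queue-size target of 10, the same function shows that 3 replicas averaging 25 queued requests would be scaled to 8.
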
## Cost Optimization

### GPU Selection

**A100 80GB** ($3-4/hour):
- Use for: 70B models with INT8 or INT4 quantization (A100 has no native FP8 support)
- Throughput: 10,000-15,000 tok/s (TP=4)
- Cost per 1M tokens: $0.20-0.30

**H100 80GB** ($6-8/hour):
- Use for: 70B models with FP8, 405B models
- Throughput: 20,000-30,000 tok/s (TP=4)
- Cost per 1M tokens: $0.15-0.25 (2× faster = lower cost)

**L4** ($0.50-1/hour):
- Use for: 7-8B models
- Throughput: 1,000-2,000 tok/s
- Cost per 1M tokens: $0.25-0.50

### Batch Size Tuning

**Impact on cost** (assuming a $3/hour GPU):
- Batch size 1: 1,000 tok/s = 3.6M tokens/hour → $3 ÷ 3.6M ≈ $0.83 per 1M tokens
- Batch size 64: 5,000 tok/s = 18M tokens/hour → $3 ÷ 18M ≈ $0.17 per 1M tokens
- **5× cost reduction** with batching

**Recommendation**: Target batch size 32-128 for cost efficiency.

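Cost per million tokens follows directly from the hourly price and the sustained throughput: hourly cost divided by tokens generated per hour (`tok/s × 3600`), scaled to one million tokens. The helper below is a small sketch of that arithmetic; the function name and the `num_gpus` parameter are illustrative, and real costs also depend on utilization, which this ignores.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second, num_gpus=1):
    """Return USD per 1M generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


# A $3/hour GPU at two throughput levels
print(round(cost_per_million_tokens(3.0, 1_000), 2))  # 0.83
print(round(cost_per_million_tokens(3.0, 5_000), 2))  # 0.17
```

Because cost is inversely proportional to throughput, a 5× throughput gain from batching gives a 5× lower cost per token at the same hourly price.
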
## Security

### API Authentication

```bash
# Generate API key
export API_KEY=$(openssl rand -hex 32)

# Start server with authentication
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --api_key "$API_KEY"

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "...", "messages": [...]}'
```

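On the client side, the key is best read from the environment instead of being hard-coded. A minimal sketch, assuming the key is exported as `API_KEY` as in the shell example; `auth_headers` is a hypothetical helper, and the `environ` parameter exists only to make the function easy to test.

```python
import os


def auth_headers(env_var="API_KEY", environ=os.environ):
    """Build request headers carrying the bearer token from the environment."""
    key = environ.get(env_var)
    if not key:
        # Fail early instead of sending an unauthenticated request.
        raise RuntimeError(f"{env_var} is not set")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```

The returned dictionary can be passed as the `headers` argument of most HTTP clients, including `urllib.request.Request`.
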
### Network Policies

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tensorrt-llm-policy
spec:
  podSelector:
    matchLabels:
      app: tensorrt-llm
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway  # Only allow from gateway
    ports:
    - protocol: TCP
      port: 8000
```

## Troubleshooting

### High latency

**Diagnosis**:
```bash
# Check queue size
curl http://localhost:9090/metrics | grep queue_size

# Check active requests
curl http://localhost:9090/metrics | grep active_requests
```

**Solutions**:
- Scale horizontally (more replicas)
- Increase batch size (if GPU underutilized)
- Enable chunked context (if long prompts)
- Use FP8 quantization

### OOM crashes

**Solutions**:
- Reduce `max_batch_size`
- Reduce `max_num_tokens`
- Enable FP8 or INT4 quantization
- Increase tensor parallelism (`--tp_size`, called `tensor_parallel_size` in the Python LLM API)

### Timeout errors

**NGINX config**:
```nginx
proxy_read_timeout 600s;  # 10 minutes for very long generations
proxy_send_timeout 600s;
```

## Best Practices

1. **Use FP8 on H100** for up to ~2× speedup and a correspondingly lower cost per token
2. **Monitor metrics** - Set up Prometheus + Grafana
3. **Set readiness probes** - Prevent routing to unhealthy pods
4. **Use load balancing** - Distribute load across replicas
5. **Tune batch size** - Balance latency and throughput
6. **Enable streaming** - Better UX for chat applications
7. **Set up autoscaling** - Handle traffic spikes
8. **Use persistent volumes** - Cache compiled models
9. **Implement retries** - Handle transient failures
10. **Monitor costs** - Track cost per token

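As an illustration of the retry practice above, the following sketch wraps any client call in exponential backoff with jitter. It is one common pattern, not something TensorRT-LLM provides: `call_with_retries` and its parameters are hypothetical, and the set of retryable exceptions should be adapted to the HTTP client in use.

```python
import random
import time


def call_with_retries(fn, attempts=4, base_delay=0.5, max_delay=8.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so clients do not stampede a recovering server.
            sleep(delay * random.uniform(0.5, 1.0))
```

Only idempotent or safely repeatable requests should be retried this way; a generation request that timed out may still be consuming GPU time on the server.
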