skills/mlops/tensorrt-llm/references/serving.md

# Production Serving Guide

Comprehensive guide to deploying TensorRT-LLM in production environments.

## Server Modes

### trtllm-serve (Recommended)

**Features**:
- OpenAI-compatible API
- Automatic model download and compilation
- Built-in load balancing
- Prometheus metrics
- Health checks

**Basic usage**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 1 \
    --max_batch_size 256 \
    --port 8000
```

**Advanced configuration**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-70B \
    --tp_size 4 \
    --dtype fp8 \
    --max_batch_size 256 \
    --max_num_tokens 4096 \
    --enable_chunked_context \
    --scheduler_policy max_utilization \
    --port 8000 \
    --api_key $API_KEY  # Optional authentication
```

### Python LLM API (For embedding)

```python
from tensorrt_llm import LLM

class LLMService:
    def __init__(self):
        self.llm = LLM(
            model="meta-llama/Meta-Llama-3-8B",
            dtype="fp8"
        )

    def generate(self, prompt, max_tokens=100):
        from tensorrt_llm import SamplingParams

        params = SamplingParams(
            max_tokens=max_tokens,
            temperature=0.7
        )
        outputs = self.llm.generate([prompt], params)
        return outputs[0].text

# Use in FastAPI, Flask, etc
from fastapi import FastAPI
app = FastAPI()
service = LLMService()

@app.post("/generate")
def generate(prompt: str):
    return {"response": service.generate(prompt)}
```

## OpenAI-Compatible API

### Chat Completions

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing"}
    ],
    "temperature": 0.7,
    "max_tokens": 500,
    "stream": false
  }'
```

**Response**:
```json
{
  "id": "chat-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "meta-llama/Meta-Llama-3-8B",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Quantum computing is..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 150,
    "total_tokens": 175
  }
}
```

### Streaming

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "messages": [{"role": "user", "content": "Count to 10"}],
    "stream": true
  }'
```

**Response** (SSE stream):
```
data: {"choices":[{"delta":{"content":"1"}}]}

data: {"choices":[{"delta":{"content":", 2"}}]}

data: {"choices":[{"delta":{"content":", 3"}}]}

data: [DONE]
```

### Completions

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "The capital of France is",
    "max_tokens": 10,
    "temperature": 0.0
  }'
```

## Monitoring

### Prometheus Metrics

**Enable metrics**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --enable_metrics \
    --metrics_port 9090
```

**Key metrics**:
```bash
# Scrape metrics
curl http://localhost:9090/metrics

# Important metrics:
# - trtllm_request_success_total - Total successful requests
# - trtllm_request_latency_seconds - Request latency histogram
# - trtllm_tokens_generated_total - Total tokens generated
# - trtllm_active_requests - Current active requests
# - trtllm_queue_size - Requests waiting in queue
# - trtllm_gpu_memory_usage_bytes - GPU memory usage
# - trtllm_kv_cache_usage_ratio - KV cache utilization
```

### Health Checks

```bash
# Readiness probe
curl http://localhost:8000/health/ready

# Liveness probe
curl http://localhost:8000/health/live

# Model info
curl http://localhost:8000/v1/models
```

**Kubernetes probes**:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5
```

## Production Deployment

### Docker Deployment

**Dockerfile**:
```dockerfile
FROM nvidia/tensorrt_llm:latest

# Copy any custom configs
COPY config.yaml /app/config.yaml

# Expose ports
EXPOSE 8000 9090

# Start server
CMD ["trtllm-serve", "meta-llama/Meta-Llama-3-8B", \
     "--tp_size", "4", \
     "--dtype", "fp8", \
     "--max_batch_size", "256", \
     "--enable_metrics", \
     "--metrics_port", "9090"]
```

**Run container**:
```bash
docker run --gpus all -p 8000:8000 -p 9090:9090 \
    tensorrt-llm:latest
```

### Kubernetes Deployment

**Complete deployment**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm
spec:
  replicas: 2  # Multiple replicas for HA
  selector:
    matchLabels:
      app: tensorrt-llm
  template:
    metadata:
      labels:
        app: tensorrt-llm
    spec:
      containers:
      - name: trtllm
        image: nvidia/tensorrt_llm:latest
        command:
          - trtllm-serve
          - meta-llama/Meta-Llama-3-70B
          - --tp_size=4
          - --dtype=fp8
          - --max_batch_size=256
          - --enable_metrics
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 9090
          name: metrics
        resources:
          limits:
            nvidia.com/gpu: 4
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8000
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: tensorrt-llm
spec:
  selector:
    app: tensorrt-llm
  ports:
  - name: http
    port: 80
    targetPort: 8000
  - name: metrics
    port: 9090
    targetPort: 9090
  type: LoadBalancer
```

### Load Balancing

**NGINX configuration**:
```nginx
upstream tensorrt_llm {
    least_conn;  # Route to least busy server
    server trtllm-1:8000 max_fails=3 fail_timeout=30s;
    server trtllm-2:8000 max_fails=3 fail_timeout=30s;
    server trtllm-3:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://tensorrt_llm;
        proxy_read_timeout 300s;  # Long timeout for slow generations
        proxy_connect_timeout 10s;
    }
}
```

## Autoscaling

### Horizontal Pod Autoscaler (HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tensorrt-llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tensorrt-llm
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: trtllm_active_requests
      target:
        type: AverageValue
        averageValue: "50"  # Scale when avg >50 active requests
```

### Custom Metrics

```yaml
# Scale based on queue size
- type: Pods
  pods:
    metric:
      name: trtllm_queue_size
    target:
      type: AverageValue
      averageValue: "10"
```

## Cost Optimization

### GPU Selection

**A100 80GB** ($3-4/hour):
- Use for: 70B models with FP8
- Throughput: 10,000-15,000 tok/s (TP=4)
- Cost per 1M tokens: $0.20-0.30

**H100 80GB** ($6-8/hour):
- Use for: 70B models with FP8, 405B models
- Throughput: 20,000-30,000 tok/s (TP=4)
- Cost per 1M tokens: $0.15-0.25 (2× faster = lower cost)

**L4** ($0.50-1/hour):
- Use for: 7-8B models
- Throughput: 1,000-2,000 tok/s
- Cost per 1M tokens: $0.25-0.50

### Batch Size Tuning

**Impact on cost**:
- Batch size 1: 1,000 tok/s → $3/hour per 1M = $3/M tokens
- Batch size 64: 5,000 tok/s → $3/hour per 5M = $0.60/M tokens
- **5× cost reduction** with batching

**Recommendation**: Target batch size 32-128 for cost efficiency.

## Security

### API Authentication

```bash
# Generate API key
export API_KEY=$(openssl rand -hex 32)

# Start server with authentication
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --api_key $API_KEY

# Client request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "...", "messages": [...]}'
```

### Network Policies

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tensorrt-llm-policy
spec:
  podSelector:
    matchLabels:
      app: tensorrt-llm
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway  # Only allow from gateway
    ports:
    - protocol: TCP
      port: 8000
```

## Troubleshooting

### High latency

**Diagnosis**:
```bash
# Check queue size
curl http://localhost:9090/metrics | grep queue_size

# Check active requests
curl http://localhost:9090/metrics | grep active_requests
```

**Solutions**:
- Scale horizontally (more replicas)
- Increase batch size (if GPU underutilized)
- Enable chunked context (if long prompts)
- Use FP8 quantization

### OOM crashes

**Solutions**:
- Reduce `max_batch_size`
- Reduce `max_num_tokens`
- Enable FP8 or INT4 quantization
- Increase `tensor_parallel_size`

### Timeout errors

**NGINX config**:
```nginx
proxy_read_timeout 600s;  # 10 minutes for very long generations
proxy_send_timeout 600s;
```

## Best Practices

1. **Use FP8 on H100** for 2× speedup and 50% cost reduction
2. **Monitor metrics** - Set up Prometheus + Grafana
3. **Set readiness probes** - Prevent routing to unhealthy pods
4. **Use load balancing** - Distribute load across replicas
5. **Tune batch size** - Balance latency and throughput
6. **Enable streaming** - Better UX for chat applications
7. **Set up autoscaling** - Handle traffic spikes
8. **Use persistent volumes** - Cache compiled models
9. **Implement retries** - Handle transient failures
10. **Monitor costs** - Track cost per token
-												init: Hermes config, skills, memories, cron

Sovereign backup of all Hermes Agent configuration and data.
Excludes: secrets, auth tokens, sessions, caches, code (separate repo).

Tracked:
- config.yaml (model, fallback chain, toolsets, display prefs)
- SOUL.md (Timmy personality charter)
- memories/ (persistent MEMORY.md + USER.md)
- skills/ (371 files — full skill library)
- cron/jobs.json (scheduled tasks)
- channel_directory.json (platform channels)
- hooks/ (custom hooks)

											
										
										
											2026-03-14 14:42:33 -04:00
+								# Production Serving Guide
 								Comprehensive guide to deploying TensorRT-LLM in production environments.
 								## Server Modes
 								### trtllm-serve (Recommended)
 								**Features**:
 								- OpenAI-compatible API
 								- Automatic model download and compilation
 								- Built-in load balancing
 								- Prometheus metrics
 								- Health checks
 								**Basic usage**:
 								```bash
 								trtllm-serve meta-llama/Meta-Llama-3-8B \
 								    --tp_size 1 \
 								    --max_batch_size 256 \
 								    --port 8000
 								```
 								**Advanced configuration**:
 								```bash
 								trtllm-serve meta-llama/Meta-Llama-3-70B \
 								    --tp_size 4 \
 								    --dtype fp8 \
 								    --max_batch_size 256 \
 								    --max_num_tokens 4096 \
 								    --enable_chunked_context \
 								    --scheduler_policy max_utilization \
 								    --port 8000 \
 								    --api_key $API_KEY  # Optional authentication
 								```
 								### Python LLM API (For embedding)
 								```python
 								from tensorrt_llm import LLM
 								class LLMService:
 								    def __init__(self):
 								        self.llm = LLM(
 								            model="meta-llama/Meta-Llama-3-8B",
 								            dtype="fp8"
 								        )
 								    def generate(self, prompt, max_tokens=100):
 								        from tensorrt_llm import SamplingParams
 								        params = SamplingParams(
 								            max_tokens=max_tokens,
 								            temperature=0.7
 								        )
 								        outputs = self.llm.generate([prompt], params)
 								        return outputs[0].text
 								# Use in FastAPI, Flask, etc
 								from fastapi import FastAPI
 								app = FastAPI()
 								service = LLMService()
 								@app.post("/generate")
 								def generate(prompt: str):
 								    return {"response": service.generate(prompt)}
 								```
 								## OpenAI-Compatible API
 								### Chat Completions
 								```bash
 								curl -X POST http://localhost:8000/v1/chat/completions \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "meta-llama/Meta-Llama-3-8B",
 								    "messages": [
 								      {"role": "system", "content": "You are a helpful assistant."},
 								      {"role": "user", "content": "Explain quantum computing"}
 								    ],
 								    "temperature": 0.7,
 								    "max_tokens": 500,
 								    "stream": false
 								  }'
 								```
 								**Response**:
 								```json
 								{
 								  "id": "chat-abc123",
 								  "object": "chat.completion",
 								  "created": 1234567890,
 								  "model": "meta-llama/Meta-Llama-3-8B",
 								  "choices": [{
 								    "index": 0,
 								    "message": {
 								      "role": "assistant",
 								      "content": "Quantum computing is..."
 								    },
 								    "finish_reason": "stop"
 								  }],
 								  "usage": {
 								    "prompt_tokens": 25,
 								    "completion_tokens": 150,
 								    "total_tokens": 175
 								  }
 								}
 								```
 								### Streaming
 								```bash
 								curl -X POST http://localhost:8000/v1/chat/completions \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "meta-llama/Meta-Llama-3-8B",
 								    "messages": [{"role": "user", "content": "Count to 10"}],
 								    "stream": true
 								  }'
 								```
 								**Response** (SSE stream):
 								```
 								data: {"choices":[{"delta":{"content":"1"}}]}
 								data: {"choices":[{"delta":{"content":", 2"}}]}
 								data: {"choices":[{"delta":{"content":", 3"}}]}
 								data: [DONE]
 								```
 								### Completions
 								```bash
 								curl -X POST http://localhost:8000/v1/completions \
 								  -H "Content-Type: application/json" \
 								  -d '{
 								    "model": "meta-llama/Meta-Llama-3-8B",
 								    "prompt": "The capital of France is",
 								    "max_tokens": 10,
 								    "temperature": 0.0
 								  }'
 								```
 								## Monitoring
 								### Prometheus Metrics
 								**Enable metrics**:
 								```bash
 								trtllm-serve meta-llama/Meta-Llama-3-8B \
 								    --enable_metrics \
 								    --metrics_port 9090
 								```
 								**Key metrics**:
 								```bash
 								# Scrape metrics
 								curl http://localhost:9090/metrics
 								# Important metrics:
 								# - trtllm_request_success_total - Total successful requests
 								# - trtllm_request_latency_seconds - Request latency histogram
 								# - trtllm_tokens_generated_total - Total tokens generated
 								# - trtllm_active_requests - Current active requests
 								# - trtllm_queue_size - Requests waiting in queue
 								# - trtllm_gpu_memory_usage_bytes - GPU memory usage
 								# - trtllm_kv_cache_usage_ratio - KV cache utilization
 								```
 								### Health Checks
 								```bash
 								# Readiness probe
 								curl http://localhost:8000/health/ready
 								# Liveness probe
 								curl http://localhost:8000/health/live
 								# Model info
 								curl http://localhost:8000/v1/models
 								```
 								**Kubernetes probes**:
 								```yaml
 								livenessProbe:
 								  httpGet:
 								    path: /health/live
 								    port: 8000
 								  initialDelaySeconds: 60
 								  periodSeconds: 10
 								readinessProbe:
 								  httpGet:
 								    path: /health/ready
 								    port: 8000
 								  initialDelaySeconds: 30
 								  periodSeconds: 5
 								```
 								## Production Deployment
 								### Docker Deployment
 								**Dockerfile**:
 								```dockerfile
 								FROM nvidia/tensorrt_llm:latest
 								# Copy any custom configs
 								COPY config.yaml /app/config.yaml
 								# Expose ports
 								EXPOSE 8000 9090
 								# Start server
 								CMD ["trtllm-serve", "meta-llama/Meta-Llama-3-8B", \
 								     "--tp_size", "4", \
 								     "--dtype", "fp8", \
 								     "--max_batch_size", "256", \
 								     "--enable_metrics", \
 								     "--metrics_port", "9090"]
 								```
 								**Run container**:
 								```bash
 								docker run --gpus all -p 8000:8000 -p 9090:9090 \
 								    tensorrt-llm:latest
 								```
 								### Kubernetes Deployment
 								**Complete deployment**:
 								```yaml
 								apiVersion: apps/v1
 								kind: Deployment
 								metadata:
 								  name: tensorrt-llm
 								spec:
 								  replicas: 2  # Multiple replicas for HA
 								  selector:
 								    matchLabels:
 								      app: tensorrt-llm
 								  template:
 								    metadata:
 								      labels:
 								        app: tensorrt-llm
 								    spec:
 								      containers:
 								      - name: trtllm
 								        image: nvidia/tensorrt_llm:latest
 								        command:
 								          - trtllm-serve
 								          - meta-llama/Meta-Llama-3-70B
 								          - --tp_size=4
 								          - --dtype=fp8
 								          - --max_batch_size=256
 								          - --enable_metrics
 								        ports:
 								        - containerPort: 8000
 								          name: http
 								        - containerPort: 9090
 								          name: metrics
 								        resources:
 								          limits:
 								            nvidia.com/gpu: 4
 								        livenessProbe:
 								          httpGet:
 								            path: /health/live
 								            port: 8000
 								        readinessProbe:
 								          httpGet:
 								            path: /health/ready
 								            port: 8000
 								---
 								apiVersion: v1
 								kind: Service
 								metadata:
 								  name: tensorrt-llm
 								spec:
 								  selector:
 								    app: tensorrt-llm
 								  ports:
 								  - name: http
 								    port: 80
 								    targetPort: 8000
 								  - name: metrics
 								    port: 9090
 								    targetPort: 9090
 								  type: LoadBalancer
 								```
 								### Load Balancing
 								**NGINX configuration**:
 								```nginx
 								upstream tensorrt_llm {
 								    least_conn;  # Route to least busy server
 								    server trtllm-1:8000 max_fails=3 fail_timeout=30s;
 								    server trtllm-2:8000 max_fails=3 fail_timeout=30s;
 								    server trtllm-3:8000 max_fails=3 fail_timeout=30s;
 								}
 								server {
 								    listen 80;
 								    location / {
 								        proxy_pass http://tensorrt_llm;
 								        proxy_read_timeout 300s;  # Long timeout for slow generations
 								        proxy_connect_timeout 10s;
 								    }
 								}
 								```
 								## Autoscaling
 								### Horizontal Pod Autoscaler (HPA)
 								```yaml
 								apiVersion: autoscaling/v2
 								kind: HorizontalPodAutoscaler
 								metadata:
 								  name: tensorrt-llm-hpa
 								spec:
 								  scaleTargetRef:
 								    apiVersion: apps/v1
 								    kind: Deployment
 								    name: tensorrt-llm
 								  minReplicas: 2
 								  maxReplicas: 10
 								  metrics:
 								  - type: Pods
 								    pods:
 								      metric:
 								        name: trtllm_active_requests
 								      target:
 								        type: AverageValue
 								        averageValue: "50"  # Scale when avg >50 active requests
 								```
 								### Custom Metrics
 								```yaml
 								# Scale based on queue size
 								- type: Pods
 								  pods:
 								    metric:
 								      name: trtllm_queue_size
 								    target:
 								      type: AverageValue
 								      averageValue: "10"
 								```
 								## Cost Optimization
 								### GPU Selection
 								**A100 80GB** ($3-4/hour):
 								- Use for: 70B models with FP8
 								- Throughput: 10,000-15,000 tok/s (TP=4)
 								- Cost per 1M tokens: $0.20-0.30
 								**H100 80GB** ($6-8/hour):
 								- Use for: 70B models with FP8, 405B models
 								- Throughput: 20,000-30,000 tok/s (TP=4)
 								- Cost per 1M tokens: $0.15-0.25 (2× faster = lower cost)
 								**L4** ($0.50-1/hour):
 								- Use for: 7-8B models
 								- Throughput: 1,000-2,000 tok/s
 								- Cost per 1M tokens: $0.25-0.50
 								### Batch Size Tuning
 								**Impact on cost**:
 								- Batch size 1: 1,000 tok/s → $3/hour per 1M = $3/M tokens
 								- Batch size 64: 5,000 tok/s → $3/hour per 5M = $0.60/M tokens
 								- **5× cost reduction** with batching
 								**Recommendation**: Target batch size 32-128 for cost efficiency.
 								## Security
 								### API Authentication
 								```bash
 								# Generate API key
 								export API_KEY=$(openssl rand -hex 32)
 								# Start server with authentication
 								trtllm-serve meta-llama/Meta-Llama-3-8B \
 								    --api_key $API_KEY
 								# Client request
 								curl -X POST http://localhost:8000/v1/chat/completions \
 								  -H "Authorization: Bearer $API_KEY" \
 								  -H "Content-Type: application/json" \
 								  -d '{"model": "...", "messages": [...]}'
 								```
 								### Network Policies
 								```yaml
 								apiVersion: networking.k8s.io/v1
 								kind: NetworkPolicy
 								metadata:
 								  name: tensorrt-llm-policy
 								spec:
 								  podSelector:
 								    matchLabels:
 								      app: tensorrt-llm
 								  policyTypes:
 								  - Ingress
 								  ingress:
 								  - from:
 								    - podSelector:
 								        matchLabels:
 								          app: api-gateway  # Only allow from gateway
 								    ports:
 								    - protocol: TCP
 								      port: 8000
 								```
 								## Troubleshooting
 								### High latency
 								**Diagnosis**:
 								```bash
 								# Check queue size
 								curl http://localhost:9090/metrics | grep queue_size
 								# Check active requests
 								curl http://localhost:9090/metrics | grep active_requests
 								```
 								**Solutions**:
 								- Scale horizontally (more replicas)
 								- Increase batch size (if GPU underutilized)
 								- Enable chunked context (if long prompts)
 								- Use FP8 quantization
 								### OOM crashes
 								**Solutions**:
 								- Reduce `max_batch_size`
 								- Reduce `max_num_tokens`
 								- Enable FP8 or INT4 quantization
 								- Increase `tensor_parallel_size`
 								### Timeout errors
 								**NGINX config**:
 								```nginx
 								proxy_read_timeout 600s;  # 10 minutes for very long generations
 								proxy_send_timeout 600s;
 								```
 								## Best Practices
 . **Use FP8 on H100** for 2× speedup and 50% cost reduction
 . **Monitor metrics** - Set up Prometheus + Grafana
 . **Set readiness probes** - Prevent routing to unhealthy pods
 . **Use load balancing** - Distribute load across replicas
 . **Tune batch size** - Balance latency and throughput
 . **Enable streaming** - Better UX for chat applications
 . **Set up autoscaling** - Handle traffic spikes
 . **Use persistent volumes** - Cache compiled models
 . **Implement retries** - Handle transient failures
 . **Monitor costs** - Track cost per token