# Modal Troubleshooting Guide

## Installation Issues

### Authentication fails

**Error**: `modal setup` doesn't complete or token is invalid

**Solutions**:
```bash
# Re-authenticate
modal token new

# Check current token
modal config show

# Set token via environment
export MODAL_TOKEN_ID=ak-...
export MODAL_TOKEN_SECRET=as-...
```
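
When tokens are supplied through the environment, a quick sanity check can rule out typos before you re-authenticate. The sketch below is a hypothetical helper, not part of the Modal client; it assumes only the `ak-` and `as-` prefixes shown in the export example above.

```python
import os

# Prefixes taken from the export example above (an assumption about token format).
EXPECTED_PREFIXES = {"MODAL_TOKEN_ID": "ak-", "MODAL_TOKEN_SECRET": "as-"}


def check_modal_token_env(env=None):
    """Return a list of problems; an empty list means the variables look sane."""
    env = os.environ if env is None else env
    problems = []
    for name, prefix in EXPECTED_PREFIXES.items():
        value = env.get(name, "")
        if not value:
            problems.append(f"{name} is not set")
        elif not value.startswith(prefix):
            problems.append(f"{name} does not start with {prefix!r}")
    return problems
```

Calling `check_modal_token_env()` with no arguments inspects the current process environment. It checks only the shape of the values, not whether the token is still valid.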

### Package installation issues

**Error**: `pip install modal` fails

**Solutions**:
```bash
# Upgrade pip
pip install --upgrade pip

# Install with specific Python version
python3.11 -m pip install modal

# Prefer prebuilt wheels over source builds
pip install modal --prefer-binary
```

## Container Image Issues

### Image build fails

**Error**: `ImageBuilderError: Failed to build image`

**Solutions**:
```python
# Pin package versions to avoid conflicts
image = modal.Image.debian_slim().pip_install(
    "torch==2.1.0",
    "transformers==4.36.0",  # Pin versions
    "accelerate==0.25.0"
)

# Use compatible CUDA versions
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04",  # Match PyTorch CUDA
    add_python="3.11"
)
```

### Dependency conflicts

**Error**: `ERROR: Cannot install package due to conflicting dependencies`

**Solutions**:
```python
# Layer dependencies separately
base = modal.Image.debian_slim().pip_install("torch")
ml = base.pip_install("transformers")  # Install after torch

# Use uv for better resolution
image = modal.Image.debian_slim().uv_pip_install(
    "torch", "transformers"
)
```

### Large image builds timeout

**Error**: Image build exceeds time limit

**Solutions**:
```python
# Split into multiple layers (better caching)
base = modal.Image.debian_slim().pip_install("torch")  # Cached
ml = base.pip_install("transformers", "datasets")      # Cached
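# Recent Modal releases deprecate copy_local_dir() in favor of add_local_dir(..., copy=True)
image = ml.add_local_dir("./src", "/app", copy=True)   # Rebuilds on code change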

# Download models during build, not runtime
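# AutoModel needs a backend such as torch, and the model ID must be a full Hub name
image = modal.Image.debian_slim().pip_install("transformers", "torch").run_commands(
    "python -c 'from transformers import AutoModel; AutoModel.from_pretrained(\"bert-base-uncased\")'"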
)
```

## GPU Issues

### GPU not available

**Error**: `RuntimeError: CUDA not available`

**Solutions**:
```python
# Ensure GPU is specified
@app.function(gpu="T4")  # Must specify GPU
def my_function():
    import torch
    assert torch.cuda.is_available()

# Check CUDA compatibility in image
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11"
).pip_install(
    "torch",
    index_url="https://download.pytorch.org/whl/cu121"  # Match CUDA
)
```

### GPU out of memory

**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`

**Solutions**:
```python
# Use larger GPU
@app.function(gpu="A100-80GB")  # More VRAM
def train():
    pass

# Enable memory optimization
# `model` and `inputs` below stand for your own Hugging Face model and input batch
@app.function(gpu="A100")
def memory_optimized():
    import torch
    torch.backends.cuda.enable_flash_sdp(True)

    # Use gradient checkpointing (trades extra compute for lower activation memory)
    model.gradient_checkpointing_enable()

    # Mixed precision
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(**inputs)
```
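
Before picking a larger GPU, a rough estimate of how much memory the weights alone need can help. The sketch below is simple arithmetic (parameter count times bytes per parameter) and is only a lower bound: activations, optimizer state, and the KV cache are not included. The function name is hypothetical.

```python
def weights_gib(num_params, bytes_per_param):
    """Memory needed to hold the model weights, in GiB (weights only)."""
    return num_params * bytes_per_param / 2**30


# Example: a 7-billion-parameter model in float16 (2 bytes per parameter)
fp16_7b = weights_gib(7_000_000_000, 2)
print(f"{fp16_7b:.2f} GiB")  # prints "13.04 GiB"
```

Training typically needs several times this figure, so treat the result as a floor when choosing between variants such as `A100` and `A100-80GB`.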

### Wrong GPU allocated

**Error**: Got different GPU than requested

**Solutions**:
```python
# Use strict GPU selection
@app.function(gpu="H100!")  # H100! prevents auto-upgrade to H200
def strict_h100():
    pass

# Specify exact memory variant
@app.function(gpu="A100-80GB")  # Not just "A100"
def a100_80gb():
    pass

# Check GPU at runtime
@app.function(gpu="A100")
def check_gpu():
    import subprocess
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print(result.stdout)
```

## Cold Start Issues

### Slow cold starts

**Problem**: First request takes too long

**Solutions**:
```python
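# Keep containers warm
# (recent Modal releases rename these options to scaledown_window and min_containers)
@app.function(
    container_idle_timeout=600,  # Keep idle containers alive for 10 min
    keep_warm=1                  # Always keep 1 container ready
)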
def low_latency():
    pass

# Load model during container start
@app.cls(gpu="A100")
class Model:
    @modal.enter()
    def load(self):
        # This runs once at container start, not per request
        self.model = load_heavy_model()

# Cache model in volume
# (load_from_disk, download_model, and save_to_disk are placeholders for your own helpers)
import os

volume = modal.Volume.from_name("models", create_if_missing=True)

@app.function(volumes={"/cache": volume})
def cached_model():
    if os.path.exists("/cache/model"):
        model = load_from_disk("/cache/model")
    else:
        model = download_model()
        save_to_disk(model, "/cache/model")
        volume.commit()  # Persist the cached files for later containers
```

### Container keeps restarting

**Problem**: Containers are killed and restarted frequently

**Solutions**:
```python
# Increase memory
@app.function(memory=32768)  # 32 GiB RAM (value is in MiB)
def memory_heavy():
    pass

# Increase timeout
@app.function(timeout=3600)  # 1 hour
def long_running():
    pass

# Handle signals gracefully
import signal
import sys

def handler(signum, frame):
    cleanup()  # Placeholder for your own teardown logic
    sys.exit(0)

signal.signal(signal.SIGTERM, handler)
```
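
Exiting directly from a signal handler can interrupt work mid-write. An alternative is to record the signal and let the work loop stop at a safe point. The following is a minimal sketch of that pattern using only the standard library; `GracefulShutdown` and `run_batches` are hypothetical names, and it assumes the container receives `SIGTERM` before it is stopped.

```python
import signal


class GracefulShutdown:
    """Record SIGTERM so a work loop can stop between units of work."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.requested = True


def run_batches(batches, shutdown, process):
    """Process batches in order, stopping early once shutdown is requested."""
    done = []
    for batch in batches:
        if shutdown.requested:
            break
        process(batch)
        done.append(batch)
    return done
```

The function returns the batches that completed, which makes it straightforward to checkpoint progress before the container goes away.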

## Volume Issues

### Volume changes not persisting

**Error**: Data written to volume disappears

**Solutions**:
```python
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_data():
    with open("/data/file.txt", "w") as f:
        f.write("data")

    # CRITICAL: Commit changes!
    volume.commit()
```

### Volume read shows stale data

**Error**: Reading outdated data from volume

**Solutions**:
```python
@app.function(volumes={"/data": volume})
def read_data():
    # Reload to get latest
    volume.reload()

    with open("/data/file.txt", "r") as f:
        return f.read()
```

### Volume mount fails

**Error**: `VolumeError: Failed to mount volume`

**Solutions**:
```python
# Ensure volume exists
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

# Use absolute path
@app.function(volumes={"/data": volume})  # Not "./data"
def my_function():
    pass

# Confirm the volume exists from the CLI (or in the dashboard)
# modal volume list
```

## Web Endpoint Issues

### Endpoint returns 502

**Error**: `502 Bad Gateway`, or a gateway timeout on long-running requests

**Solutions**:
```python
# Increase timeout
@app.function(timeout=300)  # 5 min
@modal.web_endpoint()  # Renamed to modal.fastapi_endpoint() in recent releases
def slow_endpoint():
    pass

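# Return streaming response for long operations
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

stream_app = FastAPI()

@stream_app.get("/stream")
async def stream():
    async def generate():
        for i in range(100):
            yield f"data: {i}\n\n"
            await process_chunk(i)  # Placeholder for your own async work
    return StreamingResponse(generate(), media_type="text/event-stream")

@app.function()
@modal.asgi_app()
def streaming_app():
    return stream_app  # An asgi_app function must return the ASGI app itself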
```

### Endpoint not accessible

**Error**: 404 or cannot reach endpoint

**Solutions**:
```bash
# Check deployment status
modal app list

# Redeploy
modal deploy my_app.py

# Check logs
modal app logs my-app
```

### CORS errors

**Error**: Cross-origin request blocked

**Solutions**:
```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

web_app = FastAPI()
web_app.add_middleware(
    CORSMiddleware,
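    allow_origins=["*"],     # Development only; list explicit origins in production
    allow_credentials=True,  # Avoid combining credentials with a wildcard origin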
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.function()
@modal.asgi_app()
def cors_enabled():
    return web_app
```

## Secret Issues

### Secret not found

**Error**: `SecretNotFound: Secret 'my-secret' not found`

**Solutions**:
```bash
# Create secret via CLI
modal secret create my-secret KEY=value

# List secrets
modal secret list

# Check secret name matches exactly
```

### Secret value not accessible

**Error**: Environment variable is empty

**Solutions**:
```python
# Ensure secret is attached
@app.function(secrets=[modal.Secret.from_name("my-secret")])
def use_secret():
    import os
    value = os.environ.get("KEY")  # Use get() to handle missing
    if not value:
        raise ValueError("KEY not set in secret")
```
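
When a secret defines several keys, checking them all at once gives a clearer error than failing on the first missing one. The sketch below is a hypothetical helper built on `os.environ` only; the key names in the usage note are illustrative.

```python
import os


def require_env(*names, env=None):
    """Return the requested variables, or raise listing every missing or empty one."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing or empty environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in names}
```

Inside a function with the secret attached, `require_env("KEY")` would return `{"KEY": ...}` or raise with the names that need to be added to the secret.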

## Scheduling Issues

### Scheduled job not running

**Error**: Cron job doesn't execute

**Solutions**:
```python
# Verify cron syntax
@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight UTC
def daily_job():
    pass

# Check timezone (Modal uses UTC)
# "0 8 * * *" = 8am UTC, not local time

# Ensure app is deployed
# modal deploy my_app.py
```
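
Because schedules are evaluated in UTC, a local wall-clock time has to be converted before it goes into the cron string. The sketch below does that for a daily job with a fixed UTC offset. It is a hypothetical helper and deliberately ignores daylight saving time, so the offset must be updated when the local offset changes.

```python
from datetime import datetime, timedelta, timezone


def daily_cron_in_utc(local_hour, local_minute, utc_offset_hours):
    """Build a daily cron string in UTC for a local time at a fixed UTC offset."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    # The date is arbitrary; only the time of day matters for a daily schedule.
    local = datetime(2024, 1, 1, local_hour, local_minute, tzinfo=local_tz)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"


print(daily_cron_in_utc(8, 0, -5))  # prints "0 13 * * *"
```

The result can be passed to `modal.Cron(...)`. Schedules that restrict the day of week or day of month need extra care, since the conversion can cross midnight.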

### Job runs multiple times

**Problem**: Scheduled job executes more than expected

**Solutions**:
```python
# Implement idempotency
@app.function(schedule=modal.Cron("0 * * * *"))
def hourly_job():
    job_id = get_current_hour_id()
    if already_processed(job_id):
        return
    process()
    mark_processed(job_id)
```
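
The helpers in the example above are left undefined. One possible way to implement them is with marker files, sketched below. This is an assumption rather than a Modal feature: the extra `marker_dir` parameter is hypothetical, and on Modal the directory would need to live on a mounted volume that is committed after each run. The check and the mark are not atomic, so this guards against repeated runs, not simultaneous ones.

```python
from datetime import datetime, timezone
from pathlib import Path


def get_current_hour_id(now=None):
    """Identify the current schedule slot, e.g. '2024-05-01T13' (UTC)."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%dT%H")


def already_processed(job_id, marker_dir):
    """True if a marker file for this slot already exists."""
    return (Path(marker_dir) / f"{job_id}.done").exists()


def mark_processed(job_id, marker_dir):
    """Create the marker file for this slot."""
    Path(marker_dir).mkdir(parents=True, exist_ok=True)
    (Path(marker_dir) / f"{job_id}.done").touch()
```

Marking the slot only after `process()` succeeds means a failed run is retried on the next invocation instead of being skipped.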

## Debugging Tips

### Enable debug logging

```python
import logging
logging.basicConfig(level=logging.DEBUG)

@app.function()
def debug_function():
    logging.debug("Debug message")
    logging.info("Info message")
```

### View container logs

```bash
# Stream logs
modal app logs my-app

# View specific function
# (available flags vary by client version; confirm with: modal app logs --help)
modal app logs my-app --function my_function

# View historical logs
modal app logs my-app --since 1h
```

### Test locally

```python
# Run the function body on your machine instead of in a Modal container
if __name__ == "__main__":
    result = my_function.local()  # Runs on your machine
    print(result)
```

### Inspect container

```python
@app.function(gpu="T4")
def debug_environment():
    import subprocess
    import sys

    # System info
    print(f"Python: {sys.version}")
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
    print(subprocess.run(["pip", "list"], capture_output=True, text=True).stdout)

    # CUDA info
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```

## Common Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| `FunctionTimeoutError` | Function exceeded timeout | Increase `timeout` parameter |
| `ContainerMemoryExceeded` | OOM killed | Increase `memory` parameter |
| `ImageBuilderError` | Build failed | Check dependencies, pin versions |
| `ResourceExhausted` | No GPUs available | Use GPU fallbacks, try later |
| `AuthenticationError` | Invalid token | Run `modal token new` |
| `VolumeNotFound` | Volume doesn't exist | Use `create_if_missing=True` |
| `SecretNotFound` | Secret doesn't exist | Create secret via CLI |
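
For transient failures such as exhausted GPU capacity, "try later" can be automated with a retry loop. The following is a generic sketch using exponential backoff. It is not a Modal API; `retry_with_backoff` is a hypothetical name, and the exception types worth retrying depend on what your call actually raises.

```python
import time


def retry_with_backoff(fn, retryable=(Exception,), attempts=4,
                       base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with doubling delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A typical use would wrap the remote call in a lambda, for example `retry_with_backoff(lambda: train.remote(), retryable=(SomeTransientError,))`, where both names are placeholders for your own function and exception type.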

## Getting Help

1. **Documentation**: https://modal.com/docs
2. **Examples**: https://github.com/modal-labs/modal-examples
3. **Discord**: https://discord.gg/modal
4. **Status**: https://status.modal.com

### Reporting Issues

Include:
- Modal client version: `modal --version`
- Python version: `python --version`
- Full error traceback
- Minimal reproducible code
- GPU type if relevant
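
The version details in the list above can be gathered in one step. This is a small sketch using only the standard library; it assumes the client is installed under the distribution name `modal`, and the function name is hypothetical.

```python
import platform
import sys
from importlib import metadata


def collect_report_info():
    """Gather version details to paste into an issue report."""
    try:
        modal_version = metadata.version("modal")
    except metadata.PackageNotFoundError:
        modal_version = "not installed"
    return {
        "modal": modal_version,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in collect_report_info().items():
        print(f"{key}: {value}")
```

Run it with the same interpreter you use for `modal` so the reported versions match the failing environment.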