# Modal Troubleshooting Guide
## Installation Issues

### Authentication fails

**Error:** `modal setup` doesn't complete, or the token is invalid.

Solutions:

```bash
# Re-authenticate
modal token new

# Check the current token
modal config show

# Set the token via environment variables
export MODAL_TOKEN_ID=ak-...
export MODAL_TOKEN_SECRET=as-...
```
### Package installation issues

**Error:** `pip install modal` fails.

Solutions:

```bash
# Upgrade pip
pip install --upgrade pip

# Install with a specific Python version
python3.11 -m pip install modal

# Prefer prebuilt wheels over building from source
pip install modal --prefer-binary
```
## Container Image Issues

### Image build fails

**Error:** `ImageBuilderError: Failed to build image`

Solutions:

```python
# Pin package versions to avoid resolver conflicts
image = modal.Image.debian_slim().pip_install(
    "torch==2.1.0",
    "transformers==4.36.0",
    "accelerate==0.25.0",
)

# Use a base image whose CUDA version matches your PyTorch build
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04",
    add_python="3.11",
)
```
### Dependency conflicts

**Error:** `ERROR: Cannot install package due to conflicting dependencies`

Solutions:

```python
# Layer dependencies separately so each resolves on its own
base = modal.Image.debian_slim().pip_install("torch")
ml = base.pip_install("transformers")  # installed after torch

# Or use uv for faster, stricter resolution
image = modal.Image.debian_slim().uv_pip_install(
    "torch", "transformers"
)
```
### Large image builds time out

**Error:** the image build exceeds the time limit.

Solutions:

```python
# Split installs into layers so unchanged layers stay cached
base = modal.Image.debian_slim().pip_install("torch")  # cached
ml = base.pip_install("transformers", "datasets")      # cached
app_image = ml.copy_local_dir("./src", "/app")         # rebuilds only on code change

# Download model weights at build time, not at runtime
image = modal.Image.debian_slim().pip_install("transformers").run_commands(
    "python -c 'from transformers import AutoModel; AutoModel.from_pretrained(\"bert-base-uncased\")'"
)
```
## GPU Issues

### GPU not available

**Error:** `RuntimeError: CUDA not available`

Solutions:

```python
# A GPU must be requested explicitly
@app.function(gpu="T4")
def my_function():
    import torch
    assert torch.cuda.is_available()

# Make sure the image's CUDA toolkit matches the PyTorch wheel
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11",
).pip_install(
    "torch",
    index_url="https://download.pytorch.org/whl/cu121",  # cu121 wheel for CUDA 12.1
)
```
### GPU out of memory

**Error:** `torch.cuda.OutOfMemoryError: CUDA out of memory`

Solutions:

```python
# Use a GPU with more VRAM
@app.function(gpu="A100-80GB")
def train():
    pass

# Enable memory optimizations
@app.function(gpu="A100")
def memory_optimized():
    import torch
    torch.backends.cuda.enable_flash_sdp(True)  # flash attention kernels

    # Trade compute for memory with gradient checkpointing
    model.gradient_checkpointing_enable()

    # Mixed precision halves activation memory
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(**inputs)
```
### Wrong GPU allocated

**Problem:** got a different GPU than requested.

Solutions:

```python
# "!" pins the exact GPU type (e.g. prevents auto-upgrade from H100 to H200)
@app.function(gpu="H100!")
def pinned():
    pass

# Specify the exact memory variant, not just the family
@app.function(gpu="A100-80GB")  # not just "A100"
def big_memory():
    pass

# Verify the GPU at runtime
@app.function(gpu="A100")
def check_gpu():
    import subprocess
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print(result.stdout)
```
## Cold Start Issues

### Slow cold starts

**Problem:** the first request takes too long.

Solutions:

```python
# Keep containers warm between requests
@app.function(
    container_idle_timeout=600,  # keep idle containers for 10 min
    keep_warm=1,                 # always keep 1 container ready
)
def low_latency():
    pass

# Load the model once per container, not per request
@app.cls(gpu="A100")
class Model:
    @modal.enter()
    def load(self):
        self.model = load_heavy_model()  # runs once at container start

# Cache model weights in a Volume
import os

volume = modal.Volume.from_name("models", create_if_missing=True)

@app.function(volumes={"/cache": volume})
def cached_model():
    if os.path.exists("/cache/model"):
        model = load_from_disk("/cache/model")
    else:
        model = download_model()
        save_to_disk(model, "/cache/model")
        volume.commit()  # persist the cached weights
```
### Container keeps restarting

**Problem:** containers are killed and restarted frequently.

Solutions:

```python
# Give the container more memory
@app.function(memory=32768)  # 32 GiB RAM
def memory_heavy():
    pass

# Raise the execution timeout
@app.function(timeout=3600)  # 1 hour
def long_running():
    pass

# Handle SIGTERM so shutdowns are graceful
import signal
import sys

def handler(signum, frame):
    cleanup()  # your cleanup logic
    sys.exit(0)

signal.signal(signal.SIGTERM, handler)
```
## Volume Issues

### Volume changes not persisting

**Problem:** data written to a Volume disappears.

Solutions:

```python
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_data():
    with open("/data/file.txt", "w") as f:
        f.write("data")
    volume.commit()  # CRITICAL: changes only persist after commit
```
### Volume reads show stale data

**Problem:** reading outdated data from a Volume.

Solutions:

```python
@app.function(volumes={"/data": volume})
def read_data():
    volume.reload()  # fetch the latest committed state
    with open("/data/file.txt", "r") as f:
        return f.read()
```
### Volume mount fails

**Error:** `VolumeError: Failed to mount volume`

Solutions:

```python
# Ensure the volume exists
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

# Mount at an absolute path
@app.function(volumes={"/data": volume})  # not "./data"
def my_function():
    pass
```

You can also check the volume in the dashboard, or list volumes from the CLI:

```bash
modal volume list
```
## Web Endpoint Issues

### Endpoint returns 502

**Error:** gateway timeout or bad gateway.

Solutions:

```python
# Raise the timeout for slow handlers
@app.function(timeout=300)  # 5 min
@modal.web_endpoint()
def slow_endpoint():
    pass

# Stream responses for long-running operations.
# Note: an @modal.asgi_app() function must return an ASGI app,
# not a response object.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

@app.function()
@modal.asgi_app()
def streaming_app():
    web_app = FastAPI()

    @web_app.get("/stream")
    async def stream():
        async def generate():
            for i in range(100):
                await process_chunk(i)  # placeholder for your work
                yield f"data: {i}\n\n"
        return StreamingResponse(generate(), media_type="text/event-stream")

    return web_app
```
### Endpoint not accessible

**Error:** 404, or the endpoint cannot be reached.

Solutions:

```bash
# Check deployment status
modal app list

# Redeploy
modal deploy my_app.py

# Check logs
modal app logs my-app
```
### CORS errors

**Error:** cross-origin request blocked.

Solutions:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

web_app = FastAPI()
web_app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # list explicit origins in production,
                          # especially when allow_credentials=True
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.function()
@modal.asgi_app()
def cors_enabled():
    return web_app
```
## Secret Issues

### Secret not found

**Error:** `SecretNotFound: Secret 'my-secret' not found`

Solutions:

```bash
# Create the secret via the CLI
modal secret create my-secret KEY=value

# List existing secrets
modal secret list
```

Also check that the secret name in code matches the name in the dashboard exactly.
### Secret value not accessible

**Error:** the environment variable is empty.

Solutions:

```python
# The secret must be attached to the function
@app.function(secrets=[modal.Secret.from_name("my-secret")])
def use_secret():
    import os
    value = os.environ.get("KEY")  # .get() avoids a KeyError if unset
    if not value:
        raise ValueError("KEY not set in secret")
```
## Scheduling Issues

### Scheduled job not running

**Problem:** the cron job doesn't execute.

Solutions:

```python
# Verify the cron syntax
@app.function(schedule=modal.Cron("0 0 * * *"))  # daily at midnight UTC
def daily_job():
    pass
```

- Check the timezone: Modal cron schedules run in UTC, so `"0 8 * * *"` fires at 8am UTC, not local time.
- Ensure the app is deployed (`modal deploy my_app.py`); schedules only run for deployed apps.
### Job runs multiple times

**Problem:** a scheduled job executes more often than expected.

Solutions:

```python
# Make the job idempotent so repeated runs are harmless
@app.function(schedule=modal.Cron("0 * * * *"))
def hourly_job():
    job_id = get_current_hour_id()  # placeholder helpers
    if already_processed(job_id):
        return
    process()
    mark_processed(job_id)
```
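A minimal sketch of those placeholder helpers using marker files (in production the marker directory would live on a committed Modal Volume; all names here are illustrative):

```python
import os
from datetime import datetime, timezone

def current_hour_id() -> str:
    # One id per wall-clock hour, e.g. "2024-05-01T13"
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H")

def run_once(job_id: str, marker_dir: str, work) -> bool:
    """Run `work` only if no marker file exists for job_id.

    Returns True if the work ran, False if it was skipped."""
    marker = os.path.join(marker_dir, f"{job_id}.done")
    if os.path.exists(marker):
        return False
    work()
    with open(marker, "w") as f:  # write the marker only after success
        f.write("done")
    return True
```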
## Debugging Tips

### Enable debug logging

```python
import logging

logging.basicConfig(level=logging.DEBUG)

@app.function()
def debug_function():
    logging.debug("Debug message")
    logging.info("Info message")
```

### View container logs

```bash
# Stream logs
modal app logs my-app

# View a specific function
modal app logs my-app --function my_function

# View historical logs
modal app logs my-app --since 1h
```

### Test locally

```python
# Run the function on your machine, without Modal
if __name__ == "__main__":
    result = my_function.local()
    print(result)
```

### Inspect the container

```python
@app.function(gpu="T4")
def debug_environment():
    import subprocess
    import sys

    # System info
    print(f"Python: {sys.version}")
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
    print(subprocess.run(["pip", "list"], capture_output=True, text=True).stdout)

    # CUDA info
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```
## Common Error Messages

| Error | Cause | Solution |
|---|---|---|
| `FunctionTimeoutError` | Function exceeded timeout | Increase the `timeout` parameter |
| `ContainerMemoryExceeded` | OOM killed | Increase the `memory` parameter |
| `ImageBuilderError` | Build failed | Check dependencies, pin versions |
| `ResourceExhausted` | No GPUs available | Use GPU fallbacks, try later |
| `AuthenticationError` | Invalid token | Run `modal token new` |
| `VolumeNotFound` | Volume doesn't exist | Use `create_if_missing=True` |
| `SecretNotFound` | Secret doesn't exist | Create the secret via the CLI |
## Getting Help
- Documentation: https://modal.com/docs
- Examples: https://github.com/modal-labs/modal-examples
- Discord: https://discord.gg/modal
- Status: https://status.modal.com
## Reporting Issues

Include:

- Modal client version (`modal --version`)
- Python version (`python --version`)
- Full error traceback
- Minimal reproducible code
- GPU type, if relevant