
# Modal Troubleshooting Guide

## Installation Issues

### Authentication fails

**Error:** `modal setup` doesn't complete, or the token is invalid

Solutions:

```bash
# Re-authenticate
modal token new

# Check the current token
modal config show

# Set the token via environment variables
export MODAL_TOKEN_ID=ak-...
export MODAL_TOKEN_SECRET=as-...
```

### Package installation issues

**Error:** `pip install modal` fails

Solutions:

```bash
# Upgrade pip
pip install --upgrade pip

# Install with a specific Python version
python3.11 -m pip install modal

# Prefer prebuilt wheels
pip install modal --prefer-binary
```

## Container Image Issues

### Image build fails

**Error:** `ImageBuilderError: Failed to build image`

Solutions:

```python
# Pin package versions to avoid conflicts
image = modal.Image.debian_slim().pip_install(
    "torch==2.1.0",
    "transformers==4.36.0",  # Pin versions
    "accelerate==0.25.0",
)

# Use compatible CUDA versions
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04",  # Match PyTorch's CUDA build
    add_python="3.11",
)
```

### Dependency conflicts

**Error:** `ERROR: Cannot install package due to conflicting dependencies`

Solutions:

```python
# Layer dependencies separately
base = modal.Image.debian_slim().pip_install("torch")
ml = base.pip_install("transformers")  # Install after torch

# Use uv for better dependency resolution
image = modal.Image.debian_slim().uv_pip_install(
    "torch", "transformers"
)
```

### Large image builds time out

**Error:** Image build exceeds the time limit

Solutions:

```python
# Split into multiple layers (better caching)
base = modal.Image.debian_slim().pip_install("torch")  # Cached
ml = base.pip_install("transformers", "datasets")      # Cached
app_image = ml.copy_local_dir("./src", "/app")         # Rebuilds only on code changes

# Download models during the build, not at runtime
image = modal.Image.debian_slim().pip_install("transformers").run_commands(
    "python -c 'from transformers import AutoModel; AutoModel.from_pretrained(\"bert-base-uncased\")'"
)
```

## GPU Issues

### GPU not available

**Error:** `RuntimeError: CUDA not available`

Solutions:

```python
# Ensure a GPU is specified
@app.function(gpu="T4")  # Must specify a GPU explicitly
def my_function():
    import torch
    assert torch.cuda.is_available()

# Check CUDA compatibility in the image
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11",
).pip_install(
    "torch",
    index_url="https://download.pytorch.org/whl/cu121",  # Match the CUDA version
)
```

### GPU out of memory

**Error:** `torch.cuda.OutOfMemoryError: CUDA out of memory`

Solutions:

```python
# Use a larger GPU
@app.function(gpu="A100-80GB")  # More VRAM
def train():
    pass

# Enable memory optimizations
@app.function(gpu="A100")
def memory_optimized():
    import torch
    torch.backends.cuda.enable_flash_sdp(True)

    # Use gradient checkpointing
    model.gradient_checkpointing_enable()

    # Mixed precision
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(**inputs)
```

### Wrong GPU allocated

**Error:** Got a different GPU than requested

Solutions:

```python
# Use strict GPU selection
@app.function(gpu="H100!")  # "!" prevents auto-upgrade to H200
def strict_gpu():
    pass

# Specify the exact memory variant
@app.function(gpu="A100-80GB")  # Not just "A100"
def exact_variant():
    pass

# Check the GPU at runtime
@app.function(gpu="A100")
def check_gpu():
    import subprocess
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    print(result.stdout)
```

## Cold Start Issues

### Slow cold starts

**Problem:** The first request takes too long

Solutions:

```python
import os

# Keep containers warm
@app.function(
    container_idle_timeout=600,  # Keep warm for 10 min
    keep_warm=1,                 # Always keep 1 container ready
)
def low_latency():
    pass

# Load the model during container start
@app.cls(gpu="A100")
class Model:
    @modal.enter()
    def load(self):
        # Runs once at container start, not per request
        self.model = load_heavy_model()

# Cache the model in a volume
volume = modal.Volume.from_name("models", create_if_missing=True)

@app.function(volumes={"/cache": volume})
def cached_model():
    if os.path.exists("/cache/model"):
        model = load_from_disk("/cache/model")
    else:
        model = download_model()
        save_to_disk(model, "/cache/model")
        volume.commit()
```

### Container keeps restarting

**Problem:** Containers are killed and restarted frequently

Solutions:

```python
# Increase memory
@app.function(memory=32768)  # 32 GB RAM
def memory_heavy():
    pass

# Increase the timeout
@app.function(timeout=3600)  # 1 hour
def long_running():
    pass

# Handle shutdown signals gracefully
import signal
import sys

def handler(signum, frame):
    cleanup()
    sys.exit(0)

signal.signal(signal.SIGTERM, handler)
```

## Volume Issues

### Volume changes not persisting

**Error:** Data written to a volume disappears

Solutions:

```python
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_data():
    with open("/data/file.txt", "w") as f:
        f.write("data")

    # CRITICAL: Commit changes!
    volume.commit()
```

### Volume reads show stale data

**Error:** Reading outdated data from a volume

Solutions:

```python
@app.function(volumes={"/data": volume})
def read_data():
    # Reload to see the latest committed changes
    volume.reload()

    with open("/data/file.txt", "r") as f:
        return f.read()
```

### Volume mount fails

**Error:** `VolumeError: Failed to mount volume`

Solutions:

```python
# Ensure the volume exists
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

# Use an absolute mount path
@app.function(volumes={"/data": volume})  # Not "./data"
def my_function():
    pass

# Check the volume in the dashboard, or via the CLI:
# modal volume list
```

## Web Endpoint Issues

### Endpoint returns 502

**Error:** Gateway timeout or bad gateway

Solutions:

```python
# Increase the timeout
@app.function(timeout=300)  # 5 min
@modal.web_endpoint()
def slow_endpoint():
    pass

# Return a streaming response for long operations
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

@app.function()
@modal.asgi_app()
def streaming_app():
    # The asgi_app function must return an ASGI app, not a response
    web_app = FastAPI()

    @web_app.get("/stream")
    async def stream():
        async def generate():
            for i in range(100):
                # do the work for chunk i here
                yield f"data: {i}\n\n"
        return StreamingResponse(generate(), media_type="text/event-stream")

    return web_app
```

### Endpoint not accessible

**Error:** 404, or cannot reach the endpoint

Solutions:

```bash
# Check deployment status
modal app list

# Redeploy
modal deploy my_app.py

# Check logs
modal app logs my-app
```

### CORS errors

**Error:** Cross-origin request blocked

Solutions:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

web_app = FastAPI()
web_app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Use explicit origins in production; browsers reject "*" with credentials
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.function()
@modal.asgi_app()
def cors_enabled():
    return web_app
```

## Secret Issues

### Secret not found

**Error:** `SecretNotFound: Secret 'my-secret' not found`

Solutions:

```bash
# Create the secret via the CLI
modal secret create my-secret KEY=value

# List secrets
modal secret list

# Check that the secret name matches exactly
```

### Secret value not accessible

**Error:** The environment variable is empty

Solutions:

```python
# Ensure the secret is attached to the function
@app.function(secrets=[modal.Secret.from_name("my-secret")])
def use_secret():
    import os
    value = os.environ.get("KEY")  # Use get() to handle a missing key
    if not value:
        raise ValueError("KEY not set in secret")
```

## Scheduling Issues

### Scheduled job not running

**Error:** Cron job doesn't execute

Solutions:

```python
# Verify the cron syntax
@app.function(schedule=modal.Cron("0 0 * * *"))  # Daily at midnight UTC
def daily_job():
    pass

# Check the timezone (Modal cron schedules use UTC):
# "0 8 * * *" = 8am UTC, not local time

# Ensure the app is deployed:
# modal deploy my_app.py
```
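Since cron schedules are interpreted in UTC, a small helper can tell you which UTC hour matches a given local time. This is an illustrative sketch using Python's standard `zoneinfo` module; the function name is ours, not a Modal API:

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

def utc_hour_for_local(hour: int, tz_name: str, on_date: date) -> int:
    """Return the UTC hour matching `hour` local time in `tz_name` on `on_date`.

    The date matters because of daylight saving time.
    """
    local = datetime(on_date.year, on_date.month, on_date.day, hour,
                     tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC")).hour
```

For example, 8am in New York is 13:00 UTC in winter but 12:00 UTC in summer, so a fixed cron hour will drift by one local hour across DST transitions.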

### Job runs multiple times

**Problem:** A scheduled job executes more often than expected

Solutions:

```python
# Implement idempotency
@app.function(schedule=modal.Cron("0 * * * *"))
def hourly_job():
    job_id = get_current_hour_id()
    if already_processed(job_id):
        return
    process()
    mark_processed(job_id)
```

## Debugging Tips

### Enable debug logging

```python
import logging
logging.basicConfig(level=logging.DEBUG)

@app.function()
def debug_function():
    logging.debug("Debug message")
    logging.info("Info message")
```

### View container logs

```bash
# Stream logs
modal app logs my-app

# View a specific function
modal app logs my-app --function my_function

# View historical logs
modal app logs my-app --since 1h
```

### Test locally

```python
# Run the function locally, without Modal
if __name__ == "__main__":
    result = my_function.local()  # Runs on your machine
    print(result)
```

### Inspect the container

```python
@app.function(gpu="T4")
def debug_environment():
    import subprocess
    import sys

    # System info
    print(f"Python: {sys.version}")
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
    print(subprocess.run(["pip", "list"], capture_output=True, text=True).stdout)

    # CUDA info
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```

## Common Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| `FunctionTimeoutError` | Function exceeded its timeout | Increase the `timeout` parameter |
| `ContainerMemoryExceeded` | OOM-killed | Increase the `memory` parameter |
| `ImageBuilderError` | Build failed | Check dependencies, pin versions |
| `ResourceExhausted` | No GPUs available | Use GPU fallbacks, or retry later |
| `AuthenticationError` | Invalid token | Run `modal token new` |
| `VolumeNotFound` | Volume doesn't exist | Use `create_if_missing=True` |
| `SecretNotFound` | Secret doesn't exist | Create the secret via the CLI |
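For transient failures such as `ResourceExhausted`, retrying with exponential backoff is often enough. A generic sketch (this wrapper is not a Modal API; the function name, defaults, and retriable exception types are our assumptions):

```python
import time

def call_with_backoff(fn, attempts=5, base_delay=1.0, retriable=(RuntimeError,)):
    """Call fn(), retrying on `retriable` exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # Out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```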

## Getting Help

1. Documentation: https://modal.com/docs
2. Examples: https://github.com/modal-labs/modal-examples
3. Discord: https://discord.gg/modal
4. Status: https://status.modal.com

## Reporting Issues

Include:

- Modal client version: `modal --version`
- Python version: `python --version`
- The full error traceback
- Minimal reproducible code
- GPU type, if relevant