Compare commits: main...fix/issue- (1 commit, 752453de65)

docs/atlas-evaluation-runpod.md (new file, 112 lines)
@@ -0,0 +1,112 @@

# Atlas Inference Engine — RunPod L40S Evaluation

## Status: PENDING

Atlas benchmarks are on DGX Spark (Blackwell SM120/121). Our hardware is
RunPod L40S (Ada Lovelace SM89). This evaluation tests compatibility.

## Hardware

| Spec | Value |
|------|-------|
| GPU | NVIDIA L40S |
| VRAM | 48 GB |
| Architecture | Ada Lovelace (SM89) |
| CUDA Compute | 8.9 |
| Provider | RunPod |

## Expected Issues

1. **CUDA compatibility**: Atlas uses custom CUDA kernels for Blackwell SM120/121.
   L40S is SM89 — kernels may not compile or may have PTX fallback.
2. **Quantization**: Atlas uses NVFP4. L40S supports FP8 natively, but NVFP4
   may require Blackwell tensor cores.
3. **Performance**: Even if it works, L40S won't match Blackwell throughput.
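
The SM89 vs SM120/121 concern above can be checked up front on any pod. The classification below is only a sketch: the thresholds mirror the architectures named above, and whether Atlas actually ships a PTX fallback for pre-Blackwell GPUs is unknown until tested.

```bash
# Classify a compute capability against the architectures discussed above.
check_compute_cap() {
  local major="${1%%.*}"
  if [ "$major" -ge 12 ]; then
    echo "native"        # Blackwell SM120/121: the arch Atlas is built for
  elif [ "$major" -ge 8 ]; then
    echo "ptx-fallback"  # Ada (SM89): works only if a PTX fallback is shipped
  else
    echo "unsupported"
  fi
}

# On the pod: check_compute_cap "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader)"
check_compute_cap 8.9    # L40S
```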

## Test Procedure

### 1. Deploy on RunPod

```bash
# Start RunPod instance with:
# - Template: RunPod PyTorch 2.4
# - GPU: L40S
# - Volume: 100GB (model cache)

# SSH into pod
runpod ssh <pod-id>

# Pull and run Atlas
docker pull avarok/atlas-gb10:alpha-2.8
docker run -d --gpus all --ipc=host -p 8888:8888 \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  --name atlas \
  avarok/atlas-gb10:alpha-2.8 serve \
  Sehyo/Qwen3.5-35B-A3B-NVFP4 \
  --speculative --scheduling-policy slai \
  --max-seq-len 131072 --max-batch-size 1 \
  --max-prefill-tokens 0
```

### 2. Check Compatibility

```bash
# Watch for CUDA errors
docker logs -f atlas

# Expected success: "Model loaded" or similar
# Expected failure: "CUDA error" or "unsupported architecture"
```
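
The log check can also be scripted so the pod fails fast. This is a sketch: the error strings are guesses at common CUDA failure modes, not strings Atlas is known to emit.

```bash
# Scan log text on stdin for known CUDA incompatibility markers.
scan_logs() {
  if grep -qiE 'CUDA error|unsupported architecture|no kernel image'; then
    echo "incompatible"
  else
    echo "ok"
  fi
}

# Usage on the pod: docker logs atlas 2>&1 | scan_logs
echo "Model loaded" | scan_logs
```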

### 3. Run Benchmark

```bash
python3 scripts/atlas_benchmark.py --base-url http://localhost:8888/v1
```

### 4. Compare with vLLM

```bash
# Start vLLM on another port
docker run -d --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model Qwen/Qwen2.5-7B \
  --max-model-len 8192

# Run comparison
python3 scripts/atlas_benchmark.py \
  --base-url http://localhost:8888/v1 \
  --compare-vllm http://localhost:8000/v1
```

## Evaluation Checklist

- [ ] Atlas starts without CUDA errors on L40S
- [ ] Model loads successfully
- [ ] `/v1/models` returns model list
- [ ] Chat completions work
- [ ] Tool calls work (function calling)
- [ ] Cold start measured
- [ ] Throughput measured (tok/s)
- [ ] vLLM comparison completed
- [ ] Report saved to ~/.hermes/atlas-benchmark-report.json
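
For the API checklist items, a quick manual smoke test looks like the following. It is a sketch only: the model name is the one from the deploy step, and only the request side is shown since the response format depends on Atlas.

```bash
BASE=http://localhost:8888/v1

# Build a minimal chat-completions payload for the given model name.
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"Say OK"}],"max_tokens":5}' "$1"
}

# /v1/models returns model list:
#   curl "$BASE/models"
# Chat completions work:
#   curl "$BASE/chat/completions" -H 'Content-Type: application/json' \
#        -d "$(chat_body Sehyo/Qwen3.5-35B-A3B-NVFP4)"
chat_body Sehyo/Qwen3.5-35B-A3B-NVFP4
```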

## Results

(Fill in after evaluation)

| Metric | Atlas | vLLM | Notes |
|--------|-------|------|-------|
| Starts? | | | |
| CUDA compatible? | | | |
| Cold start | | | |
| tok/s (short) | | | |
| tok/s (code) | | | |
| tok/s (reasoning) | | | |
| tok/s (long) | | | |
| Tool calls work? | | | |
| Overall verdict | | | |

## Recommendation

(Pending evaluation results)

scripts/atlas_benchmark.py (new file, 403 lines)
@@ -0,0 +1,403 @@
#!/usr/bin/env python3
"""Atlas Inference Engine benchmark — RunPod L40S evaluation.

Tests Atlas on RunPod L40S (Ada Lovelace, SM89) and compares it to vLLM.
Atlas benchmarks are on DGX Spark (Blackwell SM120/121), so this validates
whether it works on our hardware.

Usage:
    python3 scripts/atlas_benchmark.py --base-url http://localhost:8888/v1
    python3 scripts/atlas_benchmark.py --base-url http://localhost:8888/v1 --compare-vllm http://localhost:8000/v1
    python3 scripts/atlas_benchmark.py --runpod-setup

Outputs a JSON report to stdout and saves it to ~/.hermes/atlas-benchmark-report.json.
"""

from __future__ import annotations

import argparse
import json
import sys
import time
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Any, Dict, List


# ---------------------------------------------------------------------------
# Benchmark prompts
# ---------------------------------------------------------------------------

BENCHMARK_PROMPTS = [
    {
        "name": "short_answer",
        "prompt": "What is the capital of France?",
        "max_tokens": 50,
    },
    {
        "name": "code_generation",
        "prompt": "Write a Python function that implements binary search on a sorted list.",
        "max_tokens": 200,
    },
    {
        "name": "reasoning",
        "prompt": "If a train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, what is the total distance traveled? Show your work step by step.",
        "max_tokens": 300,
    },
    {
        "name": "long_form",
        "prompt": "Explain the difference between TCP and UDP protocols. Include use cases, advantages, disadvantages, and when to choose each one.",
        "max_tokens": 500,
    },
    {
        "name": "tool_use_simulation",
        "prompt": "I need to find all Python files in the current directory that contain the word 'import'. What command would I use?",
        "max_tokens": 100,
    },
]


@dataclass
class BenchmarkResult:
    name: str
    model: str
    provider: str
    prompt_tokens: int
    completion_tokens: int
    total_time_ms: int
    time_to_first_token_ms: int
    tokens_per_second: float
    success: bool
    error: str = ""


@dataclass
class BenchmarkReport:
    provider: str
    base_url: str
    model: str
    gpu_info: str
    timestamp: str
    results: List[BenchmarkResult]
    summary: Dict[str, Any]

    def to_dict(self) -> dict:
        # asdict() already converts the nested BenchmarkResult entries recursively.
        return asdict(self)


# ---------------------------------------------------------------------------
# API calls
# ---------------------------------------------------------------------------

def call_openai_compat(
    base_url: str,
    model: str,
    messages: list,
    max_tokens: int = 200,
    api_key: str = "",
    timeout: int = 120,
) -> dict:
    """Call an OpenAI-compatible chat completions endpoint."""
    import urllib.request

    url = f"{base_url.rstrip('/')}/chat/completions"
    body = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": False,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def list_models(base_url: str, api_key: str = "") -> list:
    """List available models; returns [] if the endpoint is unreachable."""
    import urllib.request

    url = f"{base_url.rstrip('/')}/models"
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    req = urllib.request.Request(url, headers=headers, method="GET")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            data = json.loads(resp.read())
        return data.get("data", [])
    except Exception:
        # Treat connection errors the same as an empty model list so the
        # caller can print a friendly "not reachable" message.
        return []


def measure_cold_start(base_url: str, model: str, api_key: str = "") -> dict:
    """Measure cold start time (time to first token on first request)."""
    messages = [{"role": "user", "content": "Hello. Reply with just 'Ready.'"}]

    t0 = time.monotonic()
    try:
        result = call_openai_compat(base_url, model, messages, max_tokens=10, api_key=api_key)
        elapsed = time.monotonic() - t0
        return {
            "cold_start_ms": int(elapsed * 1000),
            "success": True,
            "model": result.get("model", model),
        }
    except Exception as exc:
        return {
            "cold_start_ms": int((time.monotonic() - t0) * 1000),
            "success": False,
            "error": str(exc),
        }


def run_benchmark(
    base_url: str,
    model: str,
    prompt_config: dict,
    api_key: str = "",
) -> BenchmarkResult:
    """Run a single benchmark prompt."""
    messages = [{"role": "user", "content": prompt_config["prompt"]}]
    max_tokens = prompt_config.get("max_tokens", 200)

    t0 = time.monotonic()
    try:
        result = call_openai_compat(
            base_url, model, messages,
            max_tokens=max_tokens, api_key=api_key,
        )
        elapsed = time.monotonic() - t0
        usage = result.get("usage", {})

        return BenchmarkResult(
            name=prompt_config["name"],
            model=result.get("model", model),
            provider="atlas" if "atlas" in base_url.lower() else "unknown",
            prompt_tokens=usage.get("prompt_tokens", 0),
            completion_tokens=usage.get("completion_tokens", 0),
            total_time_ms=int(elapsed * 1000),
            time_to_first_token_ms=int(elapsed * 1000),  # non-streaming, same as total
            tokens_per_second=round(
                usage.get("completion_tokens", 0) / elapsed, 1
            ) if elapsed > 0 else 0.0,
            success=True,
        )
    except Exception as exc:
        return BenchmarkResult(
            name=prompt_config["name"],
            model=model,
            provider="atlas",
            prompt_tokens=0,
            completion_tokens=0,
            total_time_ms=int((time.monotonic() - t0) * 1000),
            time_to_first_token_ms=0,
            tokens_per_second=0.0,
            success=False,
            error=str(exc),
        )


def get_gpu_info() -> str:
    """Get GPU info if available."""
    try:
        import subprocess
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=5,
        )
        if result.returncode == 0:
            return result.stdout.strip()
    except Exception:
        pass
    return "Unknown (nvidia-smi not available)"


# ---------------------------------------------------------------------------
# RunPod setup
# ---------------------------------------------------------------------------

RUNPOD_SETUP_COMMANDS = """# Atlas on RunPod L40S Setup

# 1. Start RunPod with L40S (48GB VRAM, Ada Lovelace SM89)
# Template: RunPod PyTorch 2.4
# GPU: L40S
# Volume: 50GB+ (for model cache)

# 2. Install Docker (if not present)
apt-get update && apt-get install -y docker.io

# 3. Pull Atlas image
docker pull avarok/atlas-gb10:alpha-2.8

# 4. Start Atlas with Qwen3.5-35B (smallest supported model)
docker run -d --gpus all --ipc=host -p 8888:8888 \\
  -v /root/.cache/huggingface:/root/.cache/huggingface \\
  --name atlas \\
  avarok/atlas-gb10:alpha-2.8 serve \\
  Sehyo/Qwen3.5-35B-A3B-NVFP4 \\
  --speculative --scheduling-policy slai \\
  --max-seq-len 131072 --max-batch-size 1 \\
  --max-prefill-tokens 0

# 5. Wait for model to load (watch logs)
docker logs -f atlas

# 6. Test endpoint
curl http://localhost:8888/v1/models

# 7. Run benchmark
python3 scripts/atlas_benchmark.py --base-url http://localhost:8888/v1

# 8. Compare with vLLM (if installed)
# Start vLLM:
#   docker run -d --gpus all -p 8000:8000 vllm/vllm-openai \\
#     --model Qwen/Qwen2.5-7B --max-model-len 8192
#   python3 scripts/atlas_benchmark.py --base-url http://localhost:8888/v1 --compare-vllm http://localhost:8000/v1

# NOTE: Atlas may NOT work on L40S (SM89). Benchmarks are on Blackwell (SM120/121).
# If you get CUDA errors, Atlas doesn't support your GPU architecture yet.
"""


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

def main():
    parser = argparse.ArgumentParser(description="Atlas Inference Engine benchmark")
    parser.add_argument("--base-url", default="http://localhost:8888/v1", help="Atlas API base URL")
    parser.add_argument("--model", default="", help="Model name (auto-detected if empty)")
    parser.add_argument("--api-key", default="", help="API key (if required)")
    parser.add_argument("--compare-vllm", default="", help="vLLM base URL for comparison")
    parser.add_argument("--runpod-setup", action="store_true", help="Print RunPod setup commands")
    parser.add_argument("--output", default="", help="Output file path")
    args = parser.parse_args()

    if args.runpod_setup:
        print(RUNPOD_SETUP_COMMANDS)
        return 0

    print("Atlas Benchmark")
    print("=" * 60)
    print(f"Base URL: {args.base_url}")
    print(f"GPU: {get_gpu_info()}")
    print()

    # Check availability
    print("Checking Atlas availability...", end=" ", flush=True)
    models = list_models(args.base_url, args.api_key)
    if not models:
        print("FAILED")
        print("Atlas is not running or not reachable at", args.base_url)
        print("Run with --runpod-setup for deployment instructions.")
        return 1
    print(f"OK ({len(models)} models)")

    model = args.model or models[0].get("id", "")
    if not model:
        print("No model specified and none detected.")
        return 1
    print(f"Model: {model}")
    print()

    # Cold start measurement
    print("Measuring cold start...", end=" ", flush=True)
    cold = measure_cold_start(args.base_url, model, args.api_key)
    print(f"{cold['cold_start_ms']}ms {'OK' if cold['success'] else 'FAILED'}")
    if not cold["success"]:
        print(f"  Error: {cold.get('error', 'unknown')}")
    print()

    # Run benchmarks
    results = []
    for pc in BENCHMARK_PROMPTS:
        print(f"Benchmark: {pc['name']}...", end=" ", flush=True)
        result = run_benchmark(args.base_url, model, pc, args.api_key)
        results.append(result)
        if result.success:
            print(f"{result.tokens_per_second} tok/s ({result.total_time_ms}ms)")
        else:
            print(f"FAILED: {result.error}")

    # Summary
    successful = [r for r in results if r.success]
    total_tokens = sum(r.completion_tokens for r in successful)
    total_time = sum(r.total_time_ms for r in successful) / 1000
    avg_tps = round(total_tokens / total_time, 1) if total_time > 0 else 0

    print()
    print("Summary:")
    print(f"  Successful: {len(successful)}/{len(results)}")
    print(f"  Total tokens: {total_tokens}")
    print(f"  Average throughput: {avg_tps} tok/s")

    # vLLM comparison
    vllm_results = []
    vllm_tps = 0
    speedup = None  # defined up front so the report never hits a NameError
    if args.compare_vllm:
        print()
        print(f"Comparing with vLLM at {args.compare_vllm}...")
        # The vLLM server may host a different model than Atlas, so detect its
        # model id rather than reusing the Atlas one.
        vllm_models = list_models(args.compare_vllm, args.api_key)
        vllm_model = vllm_models[0].get("id", model) if vllm_models else model
        for pc in BENCHMARK_PROMPTS:
            print(f"  vLLM: {pc['name']}...", end=" ", flush=True)
            result = run_benchmark(args.compare_vllm, vllm_model, pc, args.api_key)
            vllm_results.append(result)
            if result.success:
                print(f"{result.tokens_per_second} tok/s")
            else:
                print("FAILED")

        vllm_success = [r for r in vllm_results if r.success]
        vllm_tokens = sum(r.completion_tokens for r in vllm_success)
        vllm_time = sum(r.total_time_ms for r in vllm_success) / 1000
        vllm_tps = round(vllm_tokens / vllm_time, 1) if vllm_time > 0 else 0

        if avg_tps > 0 and vllm_tps > 0:
            speedup = round(avg_tps / vllm_tps, 2)
            print(f"\n  Atlas: {avg_tps} tok/s | vLLM: {vllm_tps} tok/s | Speedup: {speedup}x")

    # Build report
    import datetime
    report = BenchmarkReport(
        provider="atlas",
        base_url=args.base_url,
        model=model,
        gpu_info=get_gpu_info(),
        timestamp=datetime.datetime.now().isoformat(),
        results=results,
        summary={
            "successful_benchmarks": len(successful),
            "total_benchmarks": len(results),
            "total_completion_tokens": total_tokens,
            "average_tps": avg_tps,
            "cold_start_ms": cold.get("cold_start_ms", 0),
            "vllm_comparison": {
                "vllm_tps": vllm_tps,
                "speedup": speedup,
            } if vllm_results else None,
        },
    )

    # Save report
    output_path = args.output or str(Path.home() / ".hermes" / "atlas-benchmark-report.json")
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(report.to_dict(), f, indent=2)
    print(f"\nReport saved to: {output_path}")

    # Also print JSON to stdout
    print("\n" + json.dumps(report.to_dict(), indent=2))

    return 0


if __name__ == "__main__":
    sys.exit(main())