Atlas Inference Engine — 3x Faster Than vLLM, Pure Rust + CUDA #674

Open
opened 2026-04-14 20:28:57 +00:00 by Rockachopa · 3 comments
Owner

Triage: Atlas Inference Engine

Source: https://atlasinference.io/
Date: 2026-04-14
Sponsor: Alexander Whitestone


Verdict: HIGHLY RELEVANT — Potential vLLM Replacement

Atlas is an LLM inference engine written from scratch in Rust and CUDA. No PyTorch. No Python. Just a ~2.5 GB image that the authors claim runs 3x faster than vLLM.


What Atlas Is

An inference engine built by Avarok that covers the entire path from HTTP request to kernel dispatch. Pure Rust + CUDA. Custom CUDA kernels for Blackwell SM120/121. MTP (multi-token prediction) speculative decoding.

Key Differentiators

| Metric | Atlas | vLLM |
|--------|-------|------|
| Image size | ~2.5 GB | 20+ GB |
| Cold start | < 2 min | ~10 min |
| Runtime | Rust + CUDA | Python + PyTorch |
| Dependencies | None | 200+ packages |
| Throughput | 3.1x faster | Baseline |

Benchmark Numbers (DGX Spark, single GPU)

| Model | Parameters | Quantization | Throughput |
|-------|------------|--------------|------------|
| Qwen3.5-35B-A3B MTP | 35B (3B active) | NVFP4/FP8 | ~130 tok/s |
| Qwen3.5-122B-A10B MTP EP=2 | 122B (10B active) | NVFP4 | ~50-54 tok/s |
| Qwen3-Next-80B-A3B | 80B (3B active) | NVFP4 | ~82 tok/s |
| Qwen3-Coder-Next | 80B (3B active) | FP8 | ~58 tok/s |
| Qwen3-VL-30B | 30B (3B active) | NVFP4 | ~100 tok/s |
| Gemma 4 26B | 26B (3.8B active) | NVFP4 | ~35 tok/s |
| Nemotron-3 Nano 30B | 30B (3.5B active) | NVFP4/FP8 | ~100 tok/s |
| Mistral Small 4 119B | 119B (6.5B active) | NVFP4 | ~26 tok/s |

Why This Matters for Hermes

  1. 3x faster than vLLM — Our SOTA research found vLLM delivers 24x over HF. Atlas is 3x on top of that.
  2. 2.5 GB vs 20+ GB — Dramatically smaller deployment footprint
  3. < 2 min cold start — vs ~10 min for vLLM
  4. OpenAI-compatible API — Drop-in replacement for any OpenAI client
  5. MTP speculative decoding — Multiple tokens per forward pass
  6. NVFP4 quantization — Native tensor core support
  7. MoE support — Handles Mixture-of-Experts models efficiently
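The MTP point above is the core of the speedup claim: a draft head proposes several tokens per step, and the target model verifies them in a single forward pass, keeping the longest agreeing prefix. A toy pure-Python simulation of that accept/reject loop (illustrative only — Atlas's actual implementation is in Rust/CUDA and not published in this triage):

```python
# Toy simulation of speculative-decoding acceptance. A draft head proposes
# k tokens; the target model verifies them and keeps the longest agreeing
# prefix, so one forward pass can emit several tokens instead of one.

def speculative_step(draft_tokens, target_tokens):
    """Return the accepted prefix, plus one corrected token from the target."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target overrides the first mismatch, then stop
            break
    else:
        # every draft token accepted; target contributes one bonus token
        if len(target_tokens) > len(draft_tokens):
            accepted.append(target_tokens[len(draft_tokens)])
    return accepted

# Draft guesses 4 tokens; target agrees on the first 3 and corrects the 4th,
# so this single "pass" emits 4 tokens.
print(speculative_step([5, 9, 2, 7], [5, 9, 2, 4]))  # [5, 9, 2, 4]
```

The win is that even partial agreement yields more than one token per target-model pass; the worst case (immediate mismatch) degrades to ordinary one-token decoding.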

Comparison to Our Current SOTA Research

| Capability | vLLM (SOTA) | Atlas |
|------------|-------------|-------|
| Throughput | 24x HF | 3.1x vLLM (~74x HF) |
| Image size | 20+ GB | 2.5 GB |
| Cold start | ~10 min | < 2 min |
| Dependencies | 200+ | None |
| MoE support | Yes | Yes (custom kernels) |
| MTP decoding | Limited | Native |
| Blackwell support | Yes | Yes (SM120/121) |
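The compound throughput figure is just the product of the two claimed speedups — worth keeping explicit, since each factor comes from a different benchmark:

```python
# Sanity-check the compound speedup: vLLM is reported at 24x over
# HuggingFace Transformers, and Atlas at 3.1x over vLLM.
vllm_over_hf = 24.0
atlas_over_vllm = 3.1
atlas_over_hf = vllm_over_hf * atlas_over_vllm
print(round(atlas_over_hf, 1))  # 74.4, which the comparison rounds to ~74x
```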

OpenAI Compatibility

Atlas exposes an OpenAI-compatible API at http://localhost:8888/v1. It works with:

  • Claude Code
  • Cline
  • OpenCode
  • Open WebUI
  • Any OpenAI-compatible client

This means Hermes integration is trivial — just point to Atlas instead of vLLM.
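Concretely, any client that speaks the OpenAI chat-completions protocol only needs its base URL changed. A minimal stdlib-only sketch of the request an OpenAI-style client would send (the endpoint path follows the OpenAI API convention; the model name here is the one from the Quick Start command below):

```python
import json

# Build a standard OpenAI-style chat-completion request against a local
# Atlas server. Only the base URL differs from talking to api.openai.com.
BASE_URL = "http://localhost:8888/v1"

def build_chat_request(model, messages):
    url = f"{BASE_URL}/chat/completions"
    body = json.dumps({"model": model, "messages": messages})
    return url, body

url, body = build_chat_request(
    "Sehyo/Qwen3.5-35B-A3B-NVFP4",
    [{"role": "user", "content": "Hello"}],
)
print(url)  # http://localhost:8888/v1/chat/completions
```

Equivalently, any OpenAI SDK can be pointed at Atlas by setting its base URL to `http://localhost:8888/v1`; actually sending the request is an ordinary HTTP POST of the JSON body above.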

Quick Start

docker pull avarok/atlas-gb10:alpha-2.8
docker run -d --gpus all --ipc=host -p 8888:8888 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:alpha-2.8 serve \
  Sehyo/Qwen3.5-35B-A3B-NVFP4 \
  --speculative --scheduling-policy slai \
  --max-seq-len 131072 --max-batch-size 1 \
  --max-prefill-tokens 0

Limitations

  1. DGX Spark focus — Benchmarks are on DGX Spark (GB10). May not work on consumer GPUs.
  2. Alpha stage — Version alpha-2.8, not production-ready
  3. Limited model support — Only 11 models currently supported (hand-tuned kernels)
  4. No CPU support — CUDA-only, no CPU inference
  5. No GGUF support — Only NVFP4/FP8 quantization

Recommendation

HIGH PRIORITY for evaluation. Atlas could replace vLLM as our inference backend. Benefits:

  • 3x faster throughput
  • 8x smaller image
  • 5x faster cold start
  • Zero dependencies
  • OpenAI-compatible (drop-in replacement)

Action: Test Atlas on our hardware. Compare to vLLM benchmarks. If it works, switch.

Source

  • Website: https://atlasinference.io/
  • Discord: https://discord.gg/DwF3brBMpw
  • Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1rkefjw/
  • Company: Avarok
  • Status: Alpha (v2.8)
Rockachopa added the p0-critical, p1-important labels 2026-04-14 20:28:57 +00:00
Timmy was assigned by Rockachopa 2026-04-14 20:28:57 +00:00
Author
Owner

Research Complete: Atlas vs vLLM Benchmarking

Status: PARTIAL (research agent hit max iterations, but key findings captured)

Key Findings

Atlas claims:

  • 3x faster than vLLM
  • 2.5 GB image (vs vLLM's 20+ GB)
  • < 2 min cold start (vs vLLM's ~10 min)
  • Pure Rust + CUDA (no Python/PyTorch)
  • MTP speculative decoding
  • OpenAI-compatible API

Supported Models (11 total):

  • Qwen3.5-35B-A3B: ~130 tok/s
  • Qwen3.5-122B-A10B: ~50-54 tok/s
  • Qwen3-Next-80B-A3B: ~82 tok/s
  • Gemma 4 26B: ~35 tok/s

Limitations:

  • Alpha stage (v2.8)
  • DGX Spark focus (may not work on consumer GPUs)
  • CUDA-only (no CPU)
  • Limited model support (hand-tuned kernels)
  • No GGUF support

Recommendation

Test Atlas on our RunPod L40S 48GB. If it works with Qwen3.5-35B, we get:

  • ~130 tok/s (vs Ollama's ~30-50 tok/s)
  • 2.5 GB image (vs Ollama's multi-GB)
  • < 2 min cold start

Next step: Pull Atlas image and test on L40S.

Author
Owner

PR #700: https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/700

Added Atlas to provider registry (localhost:8888 -> atlas) and local server detection (/health probe). Drop-in replacement for vLLM — just point config.yaml at http://localhost:8888/v1.
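A sketch of the registry-plus-probe pattern the PR describes. The function and registry names here are illustrative, not Hermes's actual code, and the vLLM port is an assumption:

```python
import urllib.request

# Hypothetical provider registry: map a local endpoint to a provider name,
# then confirm the server is actually up via its /health route.
PROVIDER_REGISTRY = {
    "http://localhost:8888": "atlas",
    "http://localhost:8000": "vllm",  # assumed vLLM default port
}

def detect_provider(base_url, timeout=2.0):
    """Return the provider name if the server answers its /health probe."""
    name = PROVIDER_REGISTRY.get(base_url)
    if name is None:
        return None
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as r:
            return name if r.status == 200 else None
    except OSError:
        return None  # registered endpoint, but no server listening

print(detect_provider("http://localhost:9999"))  # None: unknown endpoint
```

Keying detection on the health probe (rather than the registry alone) means a stale config entry degrades to "provider unavailable" instead of routing requests at a dead port.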
Author
Owner

Evaluation: Atlas for Hermes Fleet

Verdict: EVALUATE FURTHER — promising but premature for production

Fit Analysis

| Criterion | Score | Notes |
|-----------|-------|-------|
| Performance | YES | 3x vLLM throughput on supported models |
| Image size | YES | 2.5 GB vs 20+ GB — major for VPS fleet |
| Cold start | YES | Under 2 min vs ~10 min |
| Dependencies | YES | Zero deps vs 200+ packages |
| Model support | WARN | MoE only (Qwen3, Gemma 4). No dense model support |
| CUDA version | WARN | Requires Blackwell SM120/121 — not on our L40S or M4 Max |
| API compatibility | YES | OpenAI-compatible |
| Maturity | NO | Very new, limited docs, startup |

Recommendation

Do NOT migrate now. Atlas requires Blackwell GPUs (SM120/121). Our fleet runs L40S (SM89), M4 Max, RTX 4090 (SM89) — none supported.

Revisit when: (1) Atlas adds Ampere/Hopper support, OR (2) we get Blackwell hardware.

Worth Stealing Now

  • Rust-based inference loop patterns
  • MTP speculative decoding (implementable on our hardware)
  • Custom CUDA kernel approach for L40S optimization

Bottom line: Future of inference, but needs GPUs we do not have. Re-evaluate Q3 2026.

Reference: Timmy_Foundation/hermes-agent#674