[claude] Add vllm-mlx as high-performance local inference backend (#1069) #1089

Merged
Timmy merged 2 commits from claude/issue-1069 into main 2026-03-23 15:34:14 +00:00
Collaborator

Fixes #1069

Summary

  • Adds vllm_mlx as a first-class provider type in the cascade router (OpenAI-compatible, local, no API key required)
  • Health check strips /v1 suffix before calling /health to match vllm-mlx server layout
  • _call_vllm_mlx reuses the OpenAI async client pointed at the local server; appends /v1 to base_url if missing
  • Provider is not subject to the metabolic-protocol cloud skip (it is local)
  • providers.yaml adds a vllm-mlx-local entry (disabled by default, priority 2) with benchmark notes (21–87% higher throughput than llama.cpp, +38–59% vs Ollama on M3 Ultra, ~15% lower memory) and enable instructions
  • 7 new unit tests covering availability checks, completion routing, URL normalisation, and metabolic-protocol bypass
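The `/v1` handling described above can be sketched as two small pure helpers (function names are hypothetical, not taken from `cascade.py`): one builds the health-check URL by stripping a trailing `/v1`, the other normalises the base URL for the OpenAI client by appending `/v1` when missing.

```python
def health_url(base_url: str) -> str:
    """Build the health-check URL: strip a trailing /v1 first,
    since vllm-mlx serves /health at the server root, not under /v1."""
    root = base_url.rstrip("/")
    if root.endswith("/v1"):
        root = root[: -len("/v1")]
    return root + "/health"


def client_base_url(base_url: str) -> str:
    """Normalise the base URL for the OpenAI async client: the SDK
    expects the /v1 prefix, so append it when missing."""
    root = base_url.rstrip("/")
    if not root.endswith("/v1"):
        root += "/v1"
    return root
```

Either way the user writes the URL in `providers.yaml`, both endpoints resolve correctly: `health_url("http://localhost:8000/v1")` gives `http://localhost:8000/health`, and `client_base_url("http://localhost:8000")` gives `http://localhost:8000/v1`.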

Evaluation outcome

Based on the EuroMLSys ’26 data and M3 Ultra benchmarks from the issue:

  • Throughput: 21–87% higher than llama.cpp; +38–59% vs Ollama for Qwen3-14B on M3 Ultra
  • Memory: ~15% lower than Ollama
  • API compatibility: Full OpenAI-compatible endpoint — tool calling works identically through both backends
  • Recommendation: Use vllm-mlx when throughput is the priority on Apple Silicon. Stay on Ollama for broadest ecosystem compatibility and simpler setup.
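A `providers.yaml` entry along these lines would match the description above (field names are illustrative, not taken from the repo's actual schema):

```yaml
# Hypothetical sketch of the vllm-mlx-local entry; field names illustrative.
- name: vllm-mlx-local
  type: vllm_mlx            # OpenAI-compatible, local, no API key required
  enabled: false            # disabled by default; enable after installing vllm-mlx
  priority: 2
  base_url: http://localhost:8000/v1
  # Benchmarks (EuroMLSys '26 / M3 Ultra, from #1069):
  #   throughput: 21-87% higher than llama.cpp; +38-59% vs Ollama (Qwen3-14B)
  #   memory: ~15% lower than Ollama
```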
claude added 1 commit 2026-03-23 15:26:02 +00:00
feat: add vllm-mlx as OpenAI-compatible local inference backend (#1069)
Some checks failed
Tests / lint (pull_request) Failing after 10s
Tests / test (pull_request) Has been skipped
0cfff79f98
Adds vllm-mlx as a supported provider type in the cascade router.
vllm-mlx offers 25-50% higher throughput than Ollama on Apple Silicon
with ~15% lower memory usage, via a fully OpenAI-compatible API.

Changes:
- cascade.py: add `vllm_mlx` provider type with health check (strips /v1
  suffix from base_url before calling /health) and `_call_vllm_mlx`
  method (reuses OpenAI client pointed at local server, no API key needed)
- providers.yaml: add `vllm-mlx-local` entry (disabled by default,
  priority 2) with benchmark notes and setup instructions
- test_router_cascade.py: tests for availability checks (no-requests
  fallback, healthy/down server, default URL) and completion routing
  (correct provider used, URL normalisation, metabolic-protocol bypass)

Recommendation (documented in providers.yaml): prefer vllm-mlx over
Ollama when throughput matters on Apple Silicon; stay on Ollama for
broadest ecosystem compatibility and simpler setup.

Refs #1069

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
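The metabolic-protocol bypass for local providers can be sketched as a simple predicate (names hypothetical; the real logic lives in `cascade.py`, and the set of local provider types is an assumption):

```python
# Provider types assumed to run locally; the metabolic protocol only
# throttles cloud providers, so these are always exempt from the skip.
LOCAL_PROVIDER_TYPES = {"ollama", "vllm_mlx"}


def skipped_by_metabolic_protocol(provider_type: str, protocol_active: bool) -> bool:
    """True if the cascade router should skip this provider while the
    metabolic protocol is active. Local providers are never skipped."""
    return protocol_active and provider_type not in LOCAL_PROVIDER_TYPES
```

Under this sketch, `vllm_mlx` is routed to even while the protocol is active, while a cloud provider type would be skipped.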
Owner

LGTM — clean integration, tests included, disabled by default. Good call on metabolic-protocol bypass for local providers.

However this PR has merge conflicts against main. Please rebase on latest main and force-push, then I will merge.

Owner

Good PR. vllm-mlx as local inference backend is the right call. Disabled by default, clean health check, 7 tests. Approved in principle.

However this has merge conflicts (mergeable=false). Rebase onto main and force-push, then I will merge.

Timmy force-pushed claude/issue-1069 from 0cfff79f98 to 28b16b166e 2026-03-23 15:34:00 +00:00 Compare
Timmy merged commit f2a277f7b5 into main 2026-03-23 15:34:14 +00:00
Timmy deleted branch claude/issue-1069 2026-03-23 15:34:14 +00:00
Owner

Rebased on latest main, resolved conflict in providers.yaml (kept both dolphin3 creative models and vllm-mlx provider block), fixed 26 lint errors across the codebase while at it. All 3259 tests pass. Merged via squash.


Reference: Rockachopa/Timmy-time-dashboard#1089