[claude] Add vllm-mlx as high-performance local inference backend (#1069) #1089

Merged
Timmy merged 2 commits from claude/issue-1069 into main 2026-03-23 15:34:14 +00:00
Collaborator

Fixes #1069

Summary

  • Adds vllm_mlx as a first-class provider type in the cascade router (OpenAI-compatible, local, no API key required)
  • Health check strips /v1 suffix before calling /health to match vllm-mlx server layout
  • _call_vllm_mlx reuses the OpenAI async client pointed at the local server; appends /v1 to base_url if missing
  • Provider is not subject to the metabolic-protocol cloud skip (it is local)
  • providers.yaml adds a vllm-mlx-local entry (disabled by default, priority 2) with benchmark notes (21–87% higher throughput than llama.cpp, +38–59% vs Ollama on M3 Ultra, ~15% lower memory) and enable instructions
  • 7 new unit tests covering availability checks, completion routing, URL normalisation, and metabolic-protocol bypass
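The `/v1` handling described above can be sketched as two small pure helpers (function names are hypothetical, not taken from `cascade.py`): one builds the health-check URL by stripping a trailing `/v1`, the other normalises the base URL for the OpenAI client by appending `/v1` when missing.

```python
def health_url(base_url: str) -> str:
    """Build the health-check URL: strip a trailing /v1 first,
    since vllm-mlx serves /health at the server root, not under /v1."""
    root = base_url.rstrip("/")
    if root.endswith("/v1"):
        root = root[: -len("/v1")]
    return root + "/health"


def client_base_url(base_url: str) -> str:
    """Normalise the base URL for the OpenAI async client: the SDK
    expects the /v1 prefix, so append it when missing."""
    root = base_url.rstrip("/")
    if not root.endswith("/v1"):
        root += "/v1"
    return root
```

Either way the user writes the URL in `providers.yaml`, both endpoints resolve correctly: `health_url("http://localhost:8000/v1")` gives `http://localhost:8000/health`, and `client_base_url("http://localhost:8000")` gives `http://localhost:8000/v1`.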

Evaluation outcome

Based on the EuroMLSys ’26 data and M3 Ultra benchmarks from the issue:

  • Throughput: 21–87% higher than llama.cpp; +38–59% vs Ollama for Qwen3-14B on M3 Ultra
  • Memory: ~15% lower than Ollama
  • API compatibility: Full OpenAI-compatible endpoint — tool calling works identically through both backends
  • Recommendation: Use vllm-mlx when throughput is the priority on Apple Silicon. Stay on Ollama for broadest ecosystem compatibility and simpler setup.
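A `providers.yaml` entry along these lines would match the description above (field names are illustrative, not taken from the repo's actual schema):

```yaml
# Hypothetical sketch of the vllm-mlx-local entry; field names illustrative.
- name: vllm-mlx-local
  type: vllm_mlx            # OpenAI-compatible, local, no API key required
  enabled: false            # disabled by default; enable after installing vllm-mlx
  priority: 2
  base_url: http://localhost:8000/v1
  # Benchmarks (EuroMLSys '26 / M3 Ultra, from #1069):
  #   throughput: 21-87% higher than llama.cpp; +38-59% vs Ollama (Qwen3-14B)
  #   memory: ~15% lower than Ollama
```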
claude added 1 commit 2026-03-23 15:26:02 +00:00
feat: add vllm-mlx as OpenAI-compatible local inference backend (#1069)
Some checks failed
Tests / lint (pull_request) Failing after 10s
Tests / test (pull_request) Has been skipped
0cfff79f98
Adds vllm-mlx as a supported provider type in the cascade router.
vllm-mlx offers 25-50% higher throughput than Ollama on Apple Silicon
with ~15% lower memory usage, via a fully OpenAI-compatible API.

Changes:
- cascade.py: add `vllm_mlx` provider type with health check (strips /v1
  suffix from base_url before calling /health) and `_call_vllm_mlx`
  method (reuses OpenAI client pointed at local server, no API key needed)
- providers.yaml: add `vllm-mlx-local` entry (disabled by default,
  priority 2) with benchmark notes and setup instructions
- test_router_cascade.py: tests for availability checks (no-requests
  fallback, healthy/down server, default URL) and completion routing
  (correct provider used, URL normalisation, metabolic-protocol bypass)

Recommendation (documented in providers.yaml): prefer vllm-mlx over
Ollama when throughput matters on Apple Silicon; stay on Ollama for
broadest ecosystem compatibility and simpler setup.

Refs #1069

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
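The metabolic-protocol bypass for local providers can be sketched as a simple predicate (names hypothetical; the real logic lives in `cascade.py`, and the set of local provider types is an assumption):

```python
# Provider types assumed to run locally; the metabolic protocol only
# throttles cloud providers, so these are always exempt from the skip.
LOCAL_PROVIDER_TYPES = {"ollama", "vllm_mlx"}


def skipped_by_metabolic_protocol(provider_type: str, protocol_active: bool) -> bool:
    """True if the cascade router should skip this provider while the
    metabolic protocol is active. Local providers are never skipped."""
    return protocol_active and provider_type not in LOCAL_PROVIDER_TYPES
```

Under this sketch, `vllm_mlx` is routed to even while the protocol is active, while a cloud provider type would be skipped.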
Owner

LGTM — clean integration, tests included, disabled by default. Good call on metabolic-protocol bypass for local providers.

However this PR has merge conflicts against main. Please rebase on latest main and force-push, then I will merge.

Owner

Good PR. vllm-mlx as local inference backend is the right call. Disabled by default, clean health check, 7 tests. Approved in principle.

However this has merge conflicts (mergeable=false). Rebase onto main and force-push, then I will merge.

Timmy force-pushed claude/issue-1069 from 0cfff79f98 to 28b16b166e 2026-03-23 15:34:00 +00:00 Compare
Timmy merged commit f2a277f7b5 into main 2026-03-23 15:34:14 +00:00
Timmy deleted branch claude/issue-1069 2026-03-23 15:34:14 +00:00
Owner

Rebased on latest main, resolved conflict in providers.yaml (kept both dolphin3 creative models and vllm-mlx provider block), fixed 26 lint errors across the codebase while at it. All 3259 tests pass. Merged via squash.


Reference: Rockachopa/Timmy-time-dashboard#1089