[claude] Add vllm-mlx as high-performance local inference backend (#1069) #1089

Merged
Timmy merged 2 commits from claude/issue-1069 into main 2026-03-23 15:34:14 +00:00

2 Commits

Author SHA1 Message Date
28b16b166e fix: resolve lint errors across codebase (ruff UP042, UP045, UP017, E741, F841, F401, I001)
Some checks failed
Tests / lint (pull_request) Successful in 13s
Tests / test (pull_request) Failing after 25m43s
2026-03-23 11:33:12 -04:00
Alexander Whitestone
04f87df29c feat: add vllm-mlx as OpenAI-compatible local inference backend (#1069)
Adds vllm-mlx as a supported provider type in the cascade router.
vllm-mlx offers 25-50% higher throughput than Ollama on Apple Silicon
with ~15% lower memory usage, via a fully OpenAI-compatible API.

Changes:
- cascade.py: add `vllm_mlx` provider type with a health check (strips the /v1
  suffix from base_url before calling /health) and a `_call_vllm_mlx` method
  (reuses the OpenAI client pointed at the local server, no API key needed);
  see the sketch after this list
- providers.yaml: add `vllm-mlx-local` entry (disabled by default,
  priority 2) with benchmark notes and setup instructions
- test_router_cascade.py: tests for availability checks (no-requests
  fallback, healthy/down server, default URL) and completion routing
  (correct provider used, URL normalisation, metabolic-protocol bypass);
  a test sketch follows below
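
A minimal sketch of what the cascade.py additions might look like, assuming the
`requests` and `openai` packages are available; the function names other than
`_call_vllm_mlx`, the default URL, and the placeholder API key are illustrative
assumptions rather than the repository's actual code:

```python
# Hypothetical sketch of the cascade.py additions; names other than
# `_call_vllm_mlx` are assumptions, not the repository's actual code.
from __future__ import annotations

try:
    import requests  # availability falls back to False if requests is missing
except ImportError:  # pragma: no cover
    requests = None

from openai import OpenAI

DEFAULT_VLLM_MLX_URL = "http://localhost:8000/v1"  # assumed default URL


def _vllm_mlx_available(base_url: str = DEFAULT_VLLM_MLX_URL) -> bool:
    """Health check: strip a trailing /v1 suffix, then probe the server's /health."""
    if requests is None:
        return False  # the "no-requests fallback" the tests exercise
    root = base_url.rstrip("/")
    if root.endswith("/v1"):
        root = root[: -len("/v1")]
    try:
        return requests.get(f"{root}/health", timeout=2).status_code == 200
    except requests.RequestException:
        return False


def _call_vllm_mlx(prompt: str, model: str, base_url: str = DEFAULT_VLLM_MLX_URL) -> str:
    """Reuse the OpenAI client against the local server; the API key is a dummy."""
    client = OpenAI(base_url=base_url, api_key="not-needed")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Pointing the stock OpenAI client at the local base_url is what makes the backend
a drop-in sibling of the hosted providers in the cascade.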

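A sketch of how the availability-check tests could be structured, assuming the
helpers above live in a module named `cascade_sketch`; the real
test_router_cascade.py almost certainly differs in names and fixtures:

```python
# Hypothetical test sketch mirroring the availability checks described above;
# it exercises the illustrative helpers, not the repository's real router.
from unittest import mock

import cascade_sketch  # assumed module holding the sketch above


def test_healthy_server_reports_available():
    ok = mock.Mock(status_code=200)
    with mock.patch.object(cascade_sketch.requests, "get", return_value=ok) as get:
        assert cascade_sketch._vllm_mlx_available("http://localhost:8000/v1")
    # URL normalisation: the /v1 suffix is stripped before probing /health
    get.assert_called_once_with("http://localhost:8000/health", timeout=2)


def test_down_server_reports_unavailable():
    err = cascade_sketch.requests.RequestException("connection refused")
    with mock.patch.object(cascade_sketch.requests, "get", side_effect=err):
        assert not cascade_sketch._vllm_mlx_available()


def test_missing_requests_falls_back_to_unavailable():
    with mock.patch.object(cascade_sketch, "requests", None):
        assert not cascade_sketch._vllm_mlx_available()
```

Mocking requests.get keeps these tests independent of a running server, which
matches the healthy/down-server cases the commit message describes.
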
Recommendation (documented in providers.yaml): prefer vllm-mlx over
Ollama when throughput matters on Apple Silicon; stay on Ollama for
broadest ecosystem compatibility and simpler setup.

Refs #1069

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 11:31:50 -04:00