[claude] Add vllm-mlx as high-performance local inference backend (#1069) #1089

Merged
Timmy merged 2 commits from claude/issue-1069 into main 2026-03-23 15:34:14 +00:00

2 Commits

Author SHA1 Message Date
28b16b166e fix: resolve lint errors across codebase (ruff UP042, UP045, UP017, E741, F841, F401, I001)
Some checks failed
Tests / lint (pull_request) Successful in 13s
Tests / test (pull_request) Failing after 25m43s
2026-03-23 11:33:12 -04:00
Alexander Whitestone
04f87df29c feat: add vllm-mlx as OpenAI-compatible local inference backend (#1069)
Adds vllm-mlx as a supported provider type in the cascade router.
vllm-mlx offers 25-50% higher throughput than Ollama on Apple Silicon
with ~15% lower memory usage, via a fully OpenAI-compatible API.

Changes:
- cascade.py: add `vllm_mlx` provider type with a health check (strips the /v1
  suffix from base_url before calling /health) and a `_call_vllm_mlx` method
  (reuses the OpenAI client pointed at the local server, no API key needed);
  see the sketch after this list
- providers.yaml: add `vllm-mlx-local` entry (disabled by default,
  priority 2) with benchmark notes and setup instructions
- test_router_cascade.py: tests for availability checks (no-requests
  fallback, healthy/down server, default URL) and completion routing
  (correct provider used, URL normalisation, metabolic-protocol bypass);
  a test sketch follows below
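
A minimal sketch of what the cascade.py additions might look like, assuming the
`requests` and `openai` packages are available; the function names other than
`_call_vllm_mlx`, the default URL, and the placeholder API key are illustrative
assumptions rather than the repository's actual code:

```python
# Hypothetical sketch of the cascade.py additions; names other than
# `_call_vllm_mlx` are assumptions, not the repository's actual code.
from __future__ import annotations

try:
    import requests  # availability falls back to False if requests is missing
except ImportError:  # pragma: no cover
    requests = None

from openai import OpenAI

DEFAULT_VLLM_MLX_URL = "http://localhost:8000/v1"  # assumed default URL


def _vllm_mlx_available(base_url: str = DEFAULT_VLLM_MLX_URL) -> bool:
    """Health check: strip a trailing /v1 suffix, then probe the server's /health."""
    if requests is None:
        return False  # the "no-requests fallback" the tests exercise
    root = base_url.rstrip("/")
    if root.endswith("/v1"):
        root = root[: -len("/v1")]
    try:
        return requests.get(f"{root}/health", timeout=2).status_code == 200
    except requests.RequestException:
        return False


def _call_vllm_mlx(prompt: str, model: str, base_url: str = DEFAULT_VLLM_MLX_URL) -> str:
    """Reuse the OpenAI client against the local server; the API key is a dummy."""
    client = OpenAI(base_url=base_url, api_key="not-needed")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Pointing the stock OpenAI client at the local base_url is what makes the backend
a drop-in sibling of the hosted providers in the cascade.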

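A sketch of how the availability-check tests could be structured, assuming the
helpers above live in a module named `cascade_sketch`; the real
test_router_cascade.py almost certainly differs in names and fixtures:

```python
# Hypothetical test sketch mirroring the availability checks described above;
# it exercises the illustrative helpers, not the repository's real router.
from unittest import mock

import cascade_sketch  # assumed module holding the sketch above


def test_healthy_server_reports_available():
    ok = mock.Mock(status_code=200)
    with mock.patch.object(cascade_sketch.requests, "get", return_value=ok) as get:
        assert cascade_sketch._vllm_mlx_available("http://localhost:8000/v1")
    # URL normalisation: the /v1 suffix is stripped before probing /health
    get.assert_called_once_with("http://localhost:8000/health", timeout=2)


def test_down_server_reports_unavailable():
    err = cascade_sketch.requests.RequestException("connection refused")
    with mock.patch.object(cascade_sketch.requests, "get", side_effect=err):
        assert not cascade_sketch._vllm_mlx_available()


def test_missing_requests_falls_back_to_unavailable():
    with mock.patch.object(cascade_sketch, "requests", None):
        assert not cascade_sketch._vllm_mlx_available()
```

Mocking requests.get keeps these tests independent of a running server, which
matches the healthy/down-server cases the commit message describes.
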
Recommendation (documented in providers.yaml): prefer vllm-mlx over
Ollama when throughput matters on Apple Silicon; stay on Ollama for
broadest ecosystem compatibility and simpler setup.

Refs #1069

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 11:31:50 -04:00