Alexander Whitestone 04f87df29c feat: add vllm-mlx as OpenAI-compatible local inference backend (#1069)
Adds vllm-mlx as a supported provider type in the cascade router.
vllm-mlx offers 25-50% higher throughput than Ollama on Apple Silicon
with ~15% lower memory usage, via a fully OpenAI-compatible API.

Changes:
- cascade.py: add `vllm_mlx` provider type with health check (strips /v1
  suffix from base_url before calling /health) and `_call_vllm_mlx`
  method (reuses OpenAI client pointed at local server, no API key needed)
- providers.yaml: add `vllm-mlx-local` entry (disabled by default,
  priority 2) with benchmark notes and setup instructions
- test_router_cascade.py: tests for availability checks (no-requests
  fallback, healthy/down server, default URL) and completion routing
  (correct provider used, URL normalisation, metabolic-protocol bypass)

Recommendation (documented in providers.yaml): prefer vllm-mlx over
Ollama when throughput matters on Apple Silicon; stay on Ollama for
broadest ecosystem compatibility and simpler setup.
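The `vllm-mlx-local` entry might look something like the fragment below. Only the entry name, the disabled-by-default state, and priority 2 are stated in the commit; the field names and default URL are assumptions about the providers.yaml schema, not the real file:

```yaml
# Illustrative sketch only -- field names are assumed, not the actual schema.
providers:
  - name: vllm-mlx-local
    type: vllm_mlx
    base_url: http://localhost:8000/v1   # assumed default; /v1 stripped for /health
    enabled: false                       # disabled by default
    priority: 2
    # Setup: start the vllm-mlx OpenAI-compatible server before enabling.
    # Benchmarks (per commit): ~25-50% higher throughput than Ollama on
    # Apple Silicon, ~15% lower memory usage.
```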

Refs #1069

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 11:31:50 -04:00