Adds vllm-mlx as a supported provider type in the cascade router.
vllm-mlx offers 25-50% higher throughput than Ollama on Apple Silicon
with ~15% lower memory usage, via a fully OpenAI-compatible API.
Changes:
- cascade.py: add `vllm_mlx` provider type with a health check (strips the
/v1 suffix from base_url before calling /health) and a `_call_vllm_mlx`
method (reuses the OpenAI client pointed at the local server; no API key
required)
- providers.yaml: add `vllm-mlx-local` entry (disabled by default,
priority 2) with benchmark notes and setup instructions
- test_router_cascade.py: tests for availability checks (no-requests
fallback, healthy/down server, default URL) and completion routing
(correct provider used, URL normalisation, metabolic-protocol bypass)
Recommendation (documented in providers.yaml): prefer vllm-mlx over
Ollama when throughput matters on Apple Silicon; stay on Ollama for the
broadest ecosystem compatibility and simpler setup.
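For reference, the disabled-by-default entry might look roughly like this
(field names are illustrative assumptions, not the actual providers.yaml
schema):

```yaml
# Illustrative sketch only; keys and layout are assumed, not the real schema.
vllm-mlx-local:
  type: vllm_mlx
  base_url: http://localhost:8000/v1
  enabled: false   # disabled by default; opt in after local setup
  priority: 2
  # Benchmark notes: ~25-50% higher throughput than Ollama on Apple
  # Silicon, ~15% lower memory usage. See the project docs for setup.
```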
Refs #1069
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>