[claude] Add vllm-mlx as high-performance local inference backend (#1069) #1089
Fixes #1069
Summary
- Adds `vllm_mlx` as a first-class provider type in the cascade router (OpenAI-compatible, local, no API key required)
- Health check strips the `/v1` suffix before calling `/health` to match the vllm-mlx server layout
- `_call_vllm_mlx` reuses the OpenAI async client pointed at the local server; appends `/v1` to `base_url` if missing
- `providers.yaml` adds a `vllm-mlx-local` entry (disabled by default, priority 2) with benchmark notes (21–87% higher throughput than llama.cpp, +38–59% vs Ollama on M3 Ultra, ~15% lower memory) and enable instructions

Evaluation outcome
Based on the EuroMLSys ’26 data and M3 Ultra benchmarks from the issue:
LGTM — clean integration, tests included, disabled by default. Good call on the metabolic-protocol bypass for local providers.
However, this PR has merge conflicts against main. Please rebase onto latest main and force-push, then I will merge.
Good PR. vllm-mlx as a local inference backend is the right call. Disabled by default, clean health check, 7 tests. Approved in principle.
However, this has merge conflicts (mergeable=false). Rebase onto main and force-push, then I will merge.
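For context on the review notes above, the `/v1` handling described in the PR summary can be sketched as follows. This is a minimal sketch, not the PR's actual code: `normalize_base_url` and `health_url` are illustrative names (the summary only names `_call_vllm_mlx`), and the exact suffix handling is inferred from the summary's description.

```python
def normalize_base_url(base_url: str) -> str:
    """Ensure the OpenAI-compatible endpoint ends in /v1, mirroring the
    summary's note that /v1 is appended to base_url if missing."""
    url = base_url.rstrip("/")
    return url if url.endswith("/v1") else url + "/v1"


def health_url(base_url: str) -> str:
    """Assumption from the summary: vllm-mlx serves /health at the server
    root, so any /v1 suffix is stripped before probing."""
    url = base_url.rstrip("/")
    if url.endswith("/v1"):
        url = url[: -len("/v1")]
    return url + "/health"
```

For example, `normalize_base_url("http://localhost:8000")` yields `http://localhost:8000/v1` for chat calls, while `health_url` of either form probes `http://localhost:8000/health`.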
Force-pushed from `0cfff79f98` to `28b16b166e`. Rebased on latest main, resolved the conflict in `providers.yaml` (kept both the dolphin3 creative models and the vllm-mlx provider block), and fixed 26 lint errors across the codebase while at it. All 3259 tests pass. Merged via squash.
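The rebase-and-conflict-resolution flow described in the merge note can be reproduced in a throwaway sandbox. Everything below is illustrative (file contents, commit messages, the local bare repo standing in for the GitHub remote); only `providers.yaml`, the branch name, and the "keep both blocks" resolution come from the thread.

```python
import os
import subprocess
import tempfile

def git(repo, *args, check=True):
    """Run a git command inside `repo`, capturing output."""
    return subprocess.run(["git", "-C", repo, *args],
                          check=check, capture_output=True, text=True)

tmp = tempfile.mkdtemp()
origin = os.path.join(tmp, "origin.git")  # stands in for the GitHub remote
work = os.path.join(tmp, "work")

subprocess.run(["git", "init", "-q", "--bare", origin], check=True)
subprocess.run(["git", "init", "-q", "-b", "main", work], check=True)
git(work, "config", "user.email", "dev@example.com")
git(work, "config", "user.name", "Dev")
git(work, "remote", "add", "origin", origin)

cfg = os.path.join(work, "providers.yaml")

def commit(text, msg):
    """Overwrite providers.yaml and commit it."""
    with open(cfg, "w") as f:
        f.write(text)
    git(work, "add", "providers.yaml")
    git(work, "commit", "-qm", msg)

commit("providers:\n  - dolphin3\n", "base config")
git(work, "push", "-q", "origin", "main")

# Feature branch adds the vllm-mlx provider.
git(work, "checkout", "-qb", "claude/issue-1069")
commit("providers:\n  - vllm-mlx-local\n", "add vllm-mlx provider")
git(work, "push", "-q", "origin", "claude/issue-1069")

# Meanwhile main moves ahead with a conflicting edit to the same file.
git(work, "checkout", "-q", "main")
commit("providers:\n  - dolphin3-creative\n", "add dolphin3 creative models")
git(work, "push", "-q", "origin", "main")

# Rebase stops on the providers.yaml conflict (non-zero exit, so check=False).
git(work, "checkout", "-q", "claude/issue-1069")
git(work, "rebase", "main", check=False)

# Resolve by keeping both blocks, as the merge note describes, then continue.
with open(cfg, "w") as f:
    f.write("providers:\n  - dolphin3-creative\n  - vllm-mlx-local\n")
git(work, "add", "providers.yaml")
os.environ["GIT_EDITOR"] = "true"
git(work, "rebase", "--continue")

# Force-push the rewritten branch, after which the PR becomes mergeable.
git(work, "push", "-qf", "origin", "claude/issue-1069")
```

After the force-push, the feature branch's history is linear on top of main (base commit, creative-models commit, vllm-mlx commit), which is what clears `mergeable=false` on the PR.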