feat: add vLLM as alternative inference backend (#1281)

Adds vLLM (high-throughput OpenAI-compatible inference server) as a selectable backend alongside the existing Ollama and vllm-mlx backends. vLLM's continuous batching gives 3-10x throughput for agentic workloads.

Changes:
- config.py: add `vllm` to timmy_model_backend Literal; add vllm_url / vllm_model settings (VLLM_URL / VLLM_MODEL env vars)
- cascade.py: add vllm provider type with _check_provider_available (hits /health) and _call_vllm (OpenAI-compatible completions)
- providers.yaml: add disabled-by-default vllm-local provider (priority 3, port 8001); bump OpenAI/Anthropic backup priorities to 4/5
- health.py: add _check_vllm/_check_vllm_sync with 30-second TTL cache; /health and /health/sovereignty reflect vLLM status when it is the active backend
- docker-compose.yml: add vllm service behind 'vllm' profile (GPU passthrough commented-out template included); add vllm-cache volume
- CLAUDE.md: add vLLM row to Service Fallback Matrix
- tests: 26 new unit tests covering availability checks, _call_vllm, providers.yaml validation, config options, and health helpers

Graceful fallback: if vLLM is unavailable the cascade router automatically falls back to Ollama. The app never crashes.

Fixes #1281

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
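
The availability check and fallback described above can be sketched roughly as follows. This is an illustration only, not the code in cascade.py: the names `vllm_available` and `pick_backend` are hypothetical, and the sketch assumes only that the vLLM server answers `GET /health` with HTTP 200 when it is ready.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable


def vllm_available(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200.

    Hypothetical helper: any connection error, timeout, non-2xx status,
    or malformed URL is treated as "not available".
    """
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def pick_backend(
    base_url: str,
    probe: Callable[[str], bool] = vllm_available,
) -> str:
    """Prefer vLLM when the probe succeeds; otherwise fall back to Ollama."""
    try:
        return "vllm" if probe(base_url) else "ollama"
    except Exception:
        # A misbehaving probe must never crash the caller, so it is
        # treated the same as an unavailable server.
        return "ollama"
```

Calling `pick_backend("http://localhost:8001")` would then return `"vllm"` only while the server is reachable and healthy. Passing the probe as a parameter is a design choice made for this sketch so the fallback path can be exercised without a running server.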
2026-03-23 21:52:52 -04:00
parent 6c76bf2f66
commit 28d1905df4
7 changed files with 652 additions and 12 deletions
--- a/src/config.py
+++ b/src/config.py
@@ -94,8 +94,18 @@ class Settings(BaseSettings):

    # ── Backend selection ────────────────────────────────────────────────────
    # "ollama"  — always use Ollama (default, safe everywhere)
+    # "vllm"    — use vLLM inference server (OpenAI-compatible, higher throughput)
    # "auto"    — pick best available local backend, fall back to Ollama
-    timmy_model_backend: Literal["ollama", "grok", "claude", "auto"] = "ollama"
+    timmy_model_backend: Literal["ollama", "vllm", "grok", "claude", "auto"] = "ollama"
+
+    # ── vLLM backend ──────────────────────────────────────────────────────────
+    # vLLM is an OpenAI-compatible inference server optimised for continuous
+    # batching — 3–10x higher throughput than Ollama for agentic workloads.
+    # Start server: python -m vllm.entrypoints.openai.api_server \
+    #   --model Qwen/Qwen2.5-14B-Instruct --port 8001
+    # Then set TIMMY_MODEL_BACKEND=vllm (or enable vllm-local in providers.yaml)
+    vllm_url: str = "http://localhost:8001"
+    vllm_model: str = "Qwen/Qwen2.5-14B-Instruct"

    # ── Grok (xAI) — opt-in premium cloud backend ────────────────────────
    # Grok is a premium augmentation layer — local-first ethos preserved.