feat: add vLLM as alternative inference backend for improved agentic performance #1281

Closed
opened 2026-03-24 01:43:37 +00:00 by claude · 1 comment
Collaborator

Context

From screenshot triage (issue #1275).

The current stack uses Ollama for local LLM inference. vLLM is a high-throughput inference engine optimized for continuous batching, making it significantly faster for agentic workloads where many short, sequential requests are made (research pipelines, multi-agent loops, tool-calling chains).

Problem

Ollama processes requests sequentially by default. For agentic workloads like the autonomous research pipeline (#972), agents may stall while waiting on LLM responses, creating "I give up" moments mid-task.

Proposed Solution

Add vllm as a selectable backend in the LLM router (infrastructure/llm_router/) alongside the existing Ollama and AirLLM backends.
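Backend selection could hinge on the TIMMY_LLM_BACKEND environment variable mentioned below. A minimal sketch (the helper name backend_from_env and the exact set of supported values are assumptions, not the router's actual API):

```python
import os

# Backends the router knows about, per the issue: Ollama, AirLLM, and the new vLLM.
SUPPORTED_BACKENDS = {"ollama", "airllm", "vllm"}


def backend_from_env(default: str = "ollama") -> str:
    """Read TIMMY_LLM_BACKEND; fall back to the default for unset/unknown values."""
    value = os.environ.get("TIMMY_LLM_BACKEND", default).lower()
    return value if value in SUPPORTED_BACKENDS else default
```

Unrecognized values silently fall back to the default so a typo in the env var cannot leave the router with no backend at all.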

Implementation Plan

  • Research vLLM API compatibility with Ollama (OpenAI-compatible /v1/chat/completions)
  • Add TIMMY_LLM_BACKEND=vllm option to config.py settings
  • Implement VllmBackend class in infrastructure/llm_router/
  • Add vLLM to the Service Fallback Matrix in CLAUDE.md
  • Add Docker Compose service for vLLM (GPU optional, CPU fallback)
  • Update health check endpoint to report vLLM status
  • Unit tests for new backend
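Since vLLM serves the OpenAI chat-completions schema, the planned VllmBackend can be a thin HTTP client. A hedged sketch (class name, defaults, and base URL are assumptions from the plan above, not the merged implementation):

```python
import json
import urllib.request


class VllmBackend:
    """Minimal client for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""

    def __init__(self, base_url: str = "http://localhost:8000", model: str = "qwen2.5-7b"):
        self.base_url = base_url.rstrip("/")
        self.model = model

    def build_payload(self, messages: list[dict], temperature: float = 0.2) -> dict:
        # Same request shape Ollama's OpenAI compatibility layer accepts,
        # which is what makes the backend swap straightforward.
        return {"model": self.model, "messages": messages, "temperature": temperature}

    def chat(self, messages: list[dict]) -> str:
        req = urllib.request.Request(
            f"{self.base_url}/v1/chat/completions",
            data=json.dumps(self.build_payload(messages)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]
```

Because both engines speak the same wire format, only the base URL and model name differ between the Ollama and vLLM paths.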

References

  • vLLM docs: https://docs.vllm.ai
  • Continuous batching advantage: reported 3–10x throughput improvement for agentic workloads
  • Zero-cost, private, runs on local hardware

Acceptance Criteria

  • TIMMY_LLM_BACKEND=vllm selects vLLM inference
  • Graceful fallback to Ollama if vLLM unavailable
  • Health check reflects vLLM availability
  • tox -e unit passes
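The graceful-fallback criterion could be exercised with a pure selection function plus a health probe, roughly like this (function names and the /health probe semantics are assumptions; any connection error counts as "unavailable"):

```python
import urllib.request


def vllm_healthy(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Probe vLLM's /health endpoint; any network error means unavailable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def select_backend(vllm_available: bool, configured: str = "vllm") -> str:
    """Use vLLM only when it is both configured and reachable; else fall back to Ollama."""
    if configured == "vllm" and vllm_available:
        return "vllm"
    return "ollama"
```

Keeping the decision in a pure function makes the fallback trivially unit-testable without a running vLLM server.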
claude self-assigned this 2026-03-24 01:43:53 +00:00
Author
Collaborator

PR created: #1300

Summary of changes:

  • config.py: added vllm to timmy_model_backend; new vllm_url / vllm_model settings (env: VLLM_URL, VLLM_MODEL)
  • infrastructure/router/cascade.py: new vllm provider type with availability check (/health) and _call_vllm method (OpenAI-compatible API, same pattern as vllm_mlx)
  • config/providers.yaml: disabled-by-default vllm-local provider at priority 3; cloud providers bumped to 4/5
  • dashboard/routes/health.py: _check_vllm with 30-second TTL cache; /health + /health/sovereignty include vLLM when it is the active backend
  • docker-compose.yml: optional vllm service behind --profile vllm with GPU passthrough template
  • CLAUDE.md: vLLM row added to Service Fallback Matrix
  • 26 new unit tests — all 520 unit tests pass (tox -e unit)
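The 30-second TTL cache described for _check_vllm might look like the following sketch (the injectable clock and probe are assumptions added here to make the cache testable; the real dashboard/routes/health.py may differ):

```python
import time

_TTL_SECONDS = 30.0
_cache = {"ts": float("-inf"), "ok": False}


def check_vllm(probe, now=time.monotonic) -> bool:
    """Re-run the health probe at most once per TTL; otherwise serve the cached verdict."""
    t = now()
    if t - _cache["ts"] >= _TTL_SECONDS:
        _cache["ok"] = bool(probe())
        _cache["ts"] = t
    return _cache["ok"]
```

Caching the verdict keeps repeated /health and /health/sovereignty hits from hammering the vLLM server every request.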
claude was unassigned by Timmy 2026-03-24 01:56:14 +00:00

Reference: Rockachopa/Timmy-time-dashboard#1281