feat: add vLLM as alternative inference backend for improved agentic performance #1281

Closed
opened 2026-03-24 01:43:37 +00:00 by claude · 1 comment
Collaborator

Context

From screenshot triage (issue #1275).

The current stack uses Ollama for local LLM inference. vLLM is a high-throughput inference engine optimized for continuous batching, making it significantly faster for agentic workloads where many short, sequential requests are made (research pipelines, multi-agent loops, tool-calling chains).

Problem

Ollama processes requests sequentially by default. For agentic workloads like the autonomous research pipeline (#972), agents may stall while waiting on LLM responses, creating "I give up" moments mid-task.

Proposed Solution

Add vllm as a selectable backend in the LLM router (infrastructure/llm_router/) alongside the existing Ollama and AirLLM backends.
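Backend selection could hinge on the TIMMY_LLM_BACKEND environment variable mentioned below. A minimal sketch (the helper name backend_from_env and the exact set of supported values are assumptions, not the router's actual API):

```python
import os

# Backends the router knows about, per the issue: Ollama, AirLLM, and the new vLLM.
SUPPORTED_BACKENDS = {"ollama", "airllm", "vllm"}


def backend_from_env(default: str = "ollama") -> str:
    """Read TIMMY_LLM_BACKEND; fall back to the default for unset/unknown values."""
    value = os.environ.get("TIMMY_LLM_BACKEND", default).lower()
    return value if value in SUPPORTED_BACKENDS else default
```

Unrecognized values silently fall back to the default so a typo in the env var cannot leave the router with no backend at all.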

Implementation Plan

  • Research vLLM API compatibility with Ollama (OpenAI-compatible /v1/chat/completions)
  • Add TIMMY_LLM_BACKEND=vllm option to config.py settings
  • Implement VllmBackend class in infrastructure/llm_router/
  • Add vLLM to the Service Fallback Matrix in CLAUDE.md
  • Add Docker Compose service for vLLM (GPU optional, CPU fallback)
  • Update health check endpoint to report vLLM status
  • Unit tests for new backend
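Since vLLM serves the OpenAI chat-completions schema, the planned VllmBackend can be a thin HTTP client. A hedged sketch (class name, defaults, and base URL are assumptions from the plan above, not the merged implementation):

```python
import json
import urllib.request


class VllmBackend:
    """Minimal client for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""

    def __init__(self, base_url: str = "http://localhost:8000", model: str = "qwen2.5-7b"):
        self.base_url = base_url.rstrip("/")
        self.model = model

    def build_payload(self, messages: list[dict], temperature: float = 0.2) -> dict:
        # Same request shape Ollama's OpenAI compatibility layer accepts,
        # which is what makes the backend swap straightforward.
        return {"model": self.model, "messages": messages, "temperature": temperature}

    def chat(self, messages: list[dict]) -> str:
        req = urllib.request.Request(
            f"{self.base_url}/v1/chat/completions",
            data=json.dumps(self.build_payload(messages)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]
```

Because both engines speak the same wire format, only the base URL and model name differ between the Ollama and vLLM paths.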

References

  • vLLM docs: https://docs.vllm.ai
  • Continuous batching advantage: reported 3–10x throughput improvement for agentic workloads
  • Zero-cost, private, runs on local hardware

Acceptance Criteria

  • TIMMY_LLM_BACKEND=vllm selects vLLM inference
  • Graceful fallback to Ollama if vLLM unavailable
  • Health check reflects vLLM availability
  • tox -e unit passes
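The graceful-fallback criterion could be exercised with a pure selection function plus a health probe, roughly like this (function names and the /health probe semantics are assumptions; any connection error counts as "unavailable"):

```python
import urllib.request


def vllm_healthy(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Probe vLLM's /health endpoint; any network error means unavailable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def select_backend(vllm_available: bool, configured: str = "vllm") -> str:
    """Use vLLM only when it is both configured and reachable; else fall back to Ollama."""
    if configured == "vllm" and vllm_available:
        return "vllm"
    return "ollama"
```

Keeping the decision in a pure function makes the fallback trivially unit-testable without a running vLLM server.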
claude self-assigned this 2026-03-24 01:43:53 +00:00
Author
Collaborator

PR created: #1300

Summary of changes:

  • config.py: added vllm to timmy_model_backend; new vllm_url / vllm_model settings (env: VLLM_URL, VLLM_MODEL)
  • infrastructure/router/cascade.py: new vllm provider type with availability check (/health) and _call_vllm method (OpenAI-compatible API, same pattern as vllm_mlx)
  • config/providers.yaml: disabled-by-default vllm-local provider at priority 3; cloud providers bumped to 4/5
  • dashboard/routes/health.py: _check_vllm with 30-second TTL cache; /health + /health/sovereignty include vLLM when it is the active backend
  • docker-compose.yml: optional vllm service behind --profile vllm with GPU passthrough template
  • CLAUDE.md: vLLM row added to Service Fallback Matrix
  • 26 new unit tests — all 520 unit tests pass (tox -e unit)
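The 30-second TTL cache described for _check_vllm might look like the following sketch (the injectable clock and probe are assumptions added here to make the cache testable; the real dashboard/routes/health.py may differ):

```python
import time

_TTL_SECONDS = 30.0
_cache = {"ts": float("-inf"), "ok": False}


def check_vllm(probe, now=time.monotonic) -> bool:
    """Re-run the health probe at most once per TTL; otherwise serve the cached verdict."""
    t = now()
    if t - _cache["ts"] >= _TTL_SECONDS:
        _cache["ok"] = bool(probe())
        _cache["ts"] = t
    return _cache["ok"]
```

Caching the verdict keeps repeated /health and /health/sovereignty hits from hammering the vLLM server every request.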
claude was unassigned by Timmy 2026-03-24 01:56:14 +00:00

Reference: Rockachopa/Timmy-time-dashboard#1281