[RESEARCH] Multi-Backend LLM Routing Landscape — Uniwizard Design Input #101

Closed
opened 2026-03-30 15:48:45 +00:00 by Rockachopa · 3 comments
Owner

Parent Epic

#94 — Grand Timmy: The Uniwizard

Purpose

Research report on the multi-backend LLM routing landscape. What exists, what works, what doesn't, and what the Hermes harness already has. This informs the implementation of #95 (backend registry + routing) and #96 (task-to-backend classifier).

The Key Insight

Timmy is at the helm, not the inference backend. The LLM backends are brains Timmy borrows. If Claude refuses, route to GPT. If GPT is down, route to Ollama local. The soul stays the same. The brain rotates. This is sovereignty made real.

What Exists in the Wild

Two Categories of Routing

Category A: Infrastructure Gateways — route across providers, handle errors/rate-limits/fallbacks
Category B: Intelligent Model Selectors — ML-driven per-prompt routing for quality/cost optimization

Major Projects Analyzed

| Project | Type | Language | Key Feature |
|---------|------|----------|-------------|
| LiteLLM | Gateway + Proxy | Python | Most mature. 100+ providers. Cooldown/circuit-breaker. Error-type-specific fallback. |
| Portkey | Gateway | TypeScript | Clean declarative JSON config. Nestable strategies. Edge-deployable. |
| OpenRouter | SaaS Gateway | N/A | 300+ models. Auto provider selection. Zero infrastructure. |
| RouteLLM | Intelligent Router | Python | Trains routing models on Chatbot Arena preference data. Binary strong/weak routing. |
| Martian | Intelligent Router | Commercial | Per-prompt quality prediction. 50% cost savings at GPT-4o quality. |
| Not Diamond | Intelligent Router | Python SDK | Auto-picks best model per query. Claims to beat any single LLM. |
| Unify | Benchmark Router | Commercial | Routes on both model AND provider dimensions using benchmark data. |
| Braintrust | Proxy | TypeScript | Zero-dep, composable, lightweight. |

Routing Strategies in Production

  1. Declarative Fallback Chains — ordered list, cascade on failure (LiteLLM, Portkey, OpenRouter)
  2. Weighted Load Balancing — distribute traffic with configurable weights (LiteLLM, Portkey)
  3. Health-Aware Routing — cooldown failed endpoints, circuit-breaker pattern (LiteLLM)
  4. Error-Type-Specific Handling — different behavior for 429 vs 500 vs content-policy vs context-overflow (LiteLLM)
  5. Per-Prompt ML Selection — trained model predicts best LLM per prompt (RouteLLM, Martian, Not Diamond)
  6. Local-Cloud Hybrid — Ollama local as fallback for cloud APIs (LiteLLM + Ollama)
  7. Nested/Composable — combine load-balance + fallback in same config (Portkey)

The Gap Nobody Fills: Semantic Refusal Detection

This is the most important finding.

Every existing gateway handles HTTP-level errors (429, 500, etc.). But when Claude returns 200 OK and says "I can't help with that" — nobody catches it and reroutes.

A model can refuse in ways that look like a successful response:

  • "I'm not able to assist with that request"
  • "I must decline to help with..."
  • "This goes against my guidelines"
  • Returns a watered-down, hedged, useless response

No existing tool detects this and triggers a reroute. This is the uniwizard's unique value proposition.

What Hermes Already Has (5-Layer Routing)

Layer 1: Primary Model Selection (runtime_provider.py)

  • resolve_requested_provider() checks: explicit arg > config.yaml > env var > "auto"
  • Current config: provider=anthropic, model=claude-opus-4-6

Layer 2: Smart Model Routing (smart_model_routing.py)

  • Per-turn cheap/expensive model selection (heuristic-based)
  • Currently DISABLED (smart_model_routing.enabled: false)

Layer 3: Primary Fallback Chain (run_agent.py)

  • _try_activate_fallback() advances through chain on failure
  • Triggers on: 429, 500/502/503, 401/403, 404, malformed responses
  • Swaps client, model, provider in-place mid-session
  • Current config: anthropic -> gemini-2.5-pro (single hop)

Layer 4: Auxiliary Task Routing (auxiliary_client.py)

  • Independent provider resolution for side tasks (vision, compression, web, memory, etc.)
  • Auto-detection chain: OpenRouter > Nous > Custom > Codex > API-key providers

Layer 5: Auto-Detection Chain

  • Cascading provider discovery when provider="auto"

Gap Analysis: What Hermes Needs for Uniwizard

Already Have (just needs wiring)

  • Multi-provider API normalization (3 api_modes)
  • Fallback chain infrastructure (run_agent.py)
  • Smart model routing skeleton (smart_model_routing.py)
  • Auxiliary routing for side tasks
  • Provider credential management

Need to Build

  • Multi-lane active routing — all providers active simultaneously, not just failover
  • Semantic refusal detection — parse 200 OK responses for refusal patterns, trigger reroute
  • Richer fallback triggers — not just HTTP errors but refusals, quality thresholds, timeouts
  • Backend health tracking — per-provider success rates, latency, cooldown (links to #98)
  • Task classification — match task type to backend strength (links to #96)
  • Cost tracking integration — real-time spend tracking per backend (links to #97)
  • SOUL.md injection — ensure soul/personality rides on top of any backend
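To make "multi-lane active routing" concrete, here is a minimal hypothetical sketch: independent tasks are spread round-robin across all healthy backends and executed in parallel, instead of queuing behind a single primary. The `call_backend` signature is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_multilane(tasks, backends, call_backend):
    """Round-robin tasks across backends, running all lanes concurrently.

    call_backend(backend, task) is an assumed interface; results come back
    in task order regardless of which lane finishes first.
    """
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = [
            pool.submit(call_backend, backends[i % len(backends)], task)
            for i, task in enumerate(tasks)
        ]
        return [f.result() for f in futures]
```

Real dispatch would consult health tracking (#98) to drop unhealthy lanes before assigning work.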

Implementation Recommendation: Extend Hermes Natively

Why Not External Tools

  • LiteLLM: Overkill. Hermes already has provider normalization. Heavy dependency.
  • Portkey: Wrong language (TypeScript). Good config inspiration though.
  • OpenRouter: Third-party dependency. Violates sovereignty.

What to Steal from Each Project

  • From LiteLLM: Error-type classification, cooldown/circuit-breaker, allowed_fails threshold
  • From Portkey: Declarative nested config structure
  • From RouteLLM: Strong/weak model routing with a classifier
  • From Martian: Per-prompt quality prediction concept
  • From OpenRouter: Provider health scoring and automatic selection

The Refusal Detector (Novel Component)

```python
import re

REFUSAL_PATTERNS = [
    r"I (?:can't|cannot|am unable to|must decline)",
    r"against my (?:guidelines|policy|programming)",
    r"I'm not (?:able|comfortable|designed) to",
    r"I (?:apologize|'m sorry),? but I (?:can't|cannot)",
]
REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)

# Threshold for "suspiciously short" replies; tune against real traffic.
MIN_SUBSTANTIVE_CHARS = 40

def is_refusal(response_text: str) -> bool:
    # Catches explicit refusal phrasing, plus replies too short to carry
    # substance. Detecting hedging-without-substance needs a quality model
    # rather than regexes; that lands with quality scoring (#98).
    text = response_text.strip()
    return bool(REFUSAL_RE.search(text)) or len(text) < MIN_SUBSTANTIVE_CHARS
```

When a refusal is detected -> immediately reroute to the next backend. Log it for quality scoring (#98).
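A self-contained sketch of that reroute loop, for illustration only: the single inlined pattern and the `call_backend` signature are assumptions, and a real implementation would use the full detector rather than one regex.

```python
import re

REFUSAL_RE = re.compile(r"I (?:can't|cannot|am unable to|must decline)", re.IGNORECASE)

def route_with_refusal_fallback(prompt, backends, call_backend):
    """Try each backend in order; skip any response that reads as a refusal.

    Returns (backend, text) for the first willing backend, or (None, None)
    if every backend refused or errored.
    """
    for backend in backends:
        text = call_backend(backend, prompt)  # assumed interface; None on error
        if text is not None and not REFUSAL_RE.search(text):
            return backend, text
        # A refusal or error lands here: log it for quality scoring (#98),
        # then fall through to the next backend.
    return None, None
```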

Related Issues

  • #95 — Build backend registry and cloud routing layer (DIRECT PARENT)
  • #96 — Build task-to-backend classifier
  • #97 — Build cost tracking and budget enforcement
  • #98 — Build backend quality scoring and performance tracking
  • #99 — Wizard dissolution

Cross-Repo References

  • timmy-config #90 — [ROUTING] Streamline local Timmy automation context per job
  • timmy-config #94 — [p0] Cut cloud inheritance from active harness config
  • the-nexus #737 — [ARCH/KT] Directional shift: Timmy sovereign architecture
Author
Owner

Addendum: The Routing Philosophy is Inverted

Every project analyzed above optimizes to MINIMIZE spend. The uniwizard must MAXIMIZE utilization of already-purchased quota.

$500/month across backends. Quotas reset. Unused tokens = wasted money. Timmy's job is to burn through quota toward goals, not to conserve it.

This means the implementation recommendation changes:

  • Don't steal LiteLLM's cost-based routing (routes to cheapest). Instead: route to backend with most headroom.
  • Don't steal RouteLLM's cheap/expensive binary. Instead: always use the best tool, factor in quota remaining.
  • The refusal detector matters even more — a refusal wastes a turn that could have gone to a willing backend.

GOAP (Goal Oriented Action Planning) is the execution model. Define done, plan backwards, execute aggressively, re-plan on failure. Not step-by-step reactive. Goal-driven. Urgent.
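Under this inverted philosophy, backend selection reduces to a headroom comparison rather than a price comparison. A hypothetical sketch (the quota schema is an assumption):

```python
def most_headroom(quotas: dict[str, dict[str, float]]) -> str:
    """Pick the backend with the most unused quota this billing cycle.

    quotas maps backend -> {"limit": tokens, "used": tokens}; both keys
    are illustrative, not a real billing API.
    """
    return max(quotas, key=lambda b: quotas[b]["limit"] - quotas[b]["used"])
```

In practice this would be one input among several: headroom breaks ties after task-fit (#96) and health (#98) have filtered the candidates.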

Owner

Ezra owns research synthesis. Will distill #101 findings into actionable specs for Timmy's implementation tickets.

ezra was assigned by Timmy 2026-03-30 16:03:30 +00:00
Owner

Research complete. Findings absorbed into #95 (backend registry), #96 (classifier), and board pass on #94. Semantic refusal detection documented. Closing.

Timmy closed this issue 2026-03-30 16:49:46 +00:00

Reference: Timmy_Foundation/timmy-home#101