[RESEARCH] Multi-Backend LLM Routing Landscape — Uniwizard Design Input #101

Closed
opened 2026-03-30 15:48:45 +00:00 by Rockachopa · 3 comments
Owner

Parent Epic

#94 — Grand Timmy: The Uniwizard

Purpose

Research report on the multi-backend LLM routing landscape. What exists, what works, what doesn't, and what the Hermes harness already has. This informs the implementation of #95 (backend registry + routing) and #96 (task-to-backend classifier).

The Key Insight

Timmy is at the helm, not the inference backend. The LLM backends are brains Timmy borrows. If Claude refuses, route to GPT. If GPT is down, route to Ollama local. The soul stays the same. The brain rotates. This is sovereignty made real.

What Exists in the Wild

Two Categories of Routing

Category A: Infrastructure Gateways — route across providers, handle errors/rate-limits/fallbacks
Category B: Intelligent Model Selectors — ML-driven per-prompt routing for quality/cost optimization

Major Projects Analyzed

| Project | Type | Language | Key Feature |
|---------|------|----------|-------------|
| LiteLLM | Gateway + Proxy | Python | Most mature. 100+ providers. Cooldown/circuit-breaker. Error-type-specific fallback. |
| Portkey | Gateway | TypeScript | Clean declarative JSON config. Nestable strategies. Edge-deployable. |
| OpenRouter | SaaS Gateway | N/A | 300+ models. Auto provider selection. Zero infrastructure. |
| RouteLLM | Intelligent Router | Python | Trains routing models on Chatbot Arena preference data. Binary strong/weak routing. |
| Martian | Intelligent Router | Commercial | Per-prompt quality prediction. 50% cost savings at GPT-4o quality. |
| Not Diamond | Intelligent Router | Python SDK | Auto-picks best model per query. Claims to beat any single LLM. |
| Unify | Benchmark Router | Commercial | Routes on both model AND provider dimensions using benchmark data. |
| Braintrust | Proxy | TypeScript | Zero-dep, composable, lightweight. |

Routing Strategies in Production

  1. Declarative Fallback Chains — ordered list, cascade on failure (LiteLLM, Portkey, OpenRouter)
  2. Weighted Load Balancing — distribute traffic with configurable weights (LiteLLM, Portkey)
  3. Health-Aware Routing — cooldown failed endpoints, circuit-breaker pattern (LiteLLM)
  4. Error-Type-Specific Handling — different behavior for 429 vs 500 vs content-policy vs context-overflow (LiteLLM)
  5. Per-Prompt ML Selection — trained model predicts best LLM per prompt (RouteLLM, Martian, Not Diamond)
  6. Local-Cloud Hybrid — Ollama local as fallback for cloud APIs (LiteLLM + Ollama)
  7. Nested/Composable — combine load-balance + fallback in same config (Portkey)

The Gap Nobody Fills: Semantic Refusal Detection

This is the most important finding.

Every existing gateway handles HTTP-level errors (429, 500, etc.). But when Claude returns 200 OK and says "I can't help with that" — nobody catches it and reroutes.

A model can refuse in ways that look like a successful response:

  • "I'm not able to assist with that request"
  • "I must decline to help with..."
  • "This goes against my guidelines"
  • Returns a watered-down, hedged, useless response

No existing tool detects this and triggers a reroute. This is the uniwizard's unique value proposition.

What Hermes Already Has (5-Layer Routing)

Layer 1: Primary Model Selection (runtime_provider.py)

  • resolve_requested_provider() checks: explicit arg > config.yaml > env var > "auto"
  • Current config: provider=anthropic, model=claude-opus-4-6

Layer 2: Smart Model Routing (smart_model_routing.py)

  • Per-turn cheap/expensive model selection (heuristic-based)
  • Currently DISABLED (smart_model_routing.enabled: false)

Layer 3: Primary Fallback Chain (run_agent.py)

  • _try_activate_fallback() advances through chain on failure
  • Triggers on: 429, 500/502/503, 401/403, 404, malformed responses
  • Swaps client, model, provider in-place mid-session
  • Current config: anthropic -> gemini-2.5-pro (single hop)

Layer 4: Auxiliary Task Routing (auxiliary_client.py)

  • Independent provider resolution for side tasks (vision, compression, web, memory, etc.)
  • Auto-detection chain: OpenRouter > Nous > Custom > Codex > API-key providers

Layer 5: Auto-Detection Chain

  • Cascading provider discovery when provider="auto"

Gap Analysis: What Hermes Needs for Uniwizard

Already Have (just needs wiring)

  • Multi-provider API normalization (3 api_modes)
  • Fallback chain infrastructure (run_agent.py)
  • Smart model routing skeleton (smart_model_routing.py)
  • Auxiliary routing for side tasks
  • Provider credential management

Need to Build

  • Multi-lane active routing — all providers active simultaneously, not just failover
  • Semantic refusal detection — parse 200 OK responses for refusal patterns, trigger reroute
  • Richer fallback triggers — not just HTTP errors but refusals, quality thresholds, timeouts
  • Backend health tracking — per-provider success rates, latency, cooldown (links to #98)
  • Task classification — match task type to backend strength (links to #96)
  • Cost tracking integration — real-time spend tracking per backend (links to #97)
  • SOUL.md injection — ensure soul/personality rides on top of any backend
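To make "multi-lane active routing" concrete, here is a minimal hypothetical sketch: independent tasks are spread round-robin across all healthy backends and executed in parallel, instead of queuing behind a single primary. The `call_backend` signature is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_multilane(tasks, backends, call_backend):
    """Round-robin tasks across backends, running all lanes concurrently.

    call_backend(backend, task) is an assumed interface; results come back
    in task order regardless of which lane finishes first.
    """
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = [
            pool.submit(call_backend, backends[i % len(backends)], task)
            for i, task in enumerate(tasks)
        ]
        return [f.result() for f in futures]
```

Real dispatch would consult health tracking (#98) to drop unhealthy lanes before assigning work.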

Implementation Recommendation: Extend Hermes Natively

Why Not External Tools

  • LiteLLM: Overkill. Hermes already has provider normalization. Heavy dependency.
  • Portkey: Wrong language (TypeScript). Good config inspiration though.
  • OpenRouter: Third-party dependency. Violates sovereignty.

What to Steal from Each Project

  • From LiteLLM: Error-type classification, cooldown/circuit-breaker, allowed_fails threshold
  • From Portkey: Declarative nested config structure
  • From RouteLLM: Strong/weak model routing with a classifier
  • From Martian: Per-prompt quality prediction concept
  • From OpenRouter: Provider health scoring and automatic selection

The Refusal Detector (Novel Component)

```python
import re

REFUSAL_PATTERNS = [
    r"I (?:can't|cannot|am unable to|must decline)",
    r"against my (?:guidelines|policy|programming)",
    r"I'm not (?:able|comfortable|designed) to",
    r"I (?:apologize|'m sorry),? but I (?:can't|cannot)",
]
REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)

# Threshold for "suspiciously short" replies; tune against real traffic.
MIN_SUBSTANTIVE_CHARS = 40

def is_refusal(response_text: str) -> bool:
    # Catches explicit refusal phrasing, plus replies too short to carry
    # substance. Detecting hedging-without-substance needs a quality model
    # rather than regexes; that lands with quality scoring (#98).
    text = response_text.strip()
    return bool(REFUSAL_RE.search(text)) or len(text) < MIN_SUBSTANTIVE_CHARS
```

When a refusal is detected -> immediately reroute to the next backend. Log it for quality scoring (#98).
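A self-contained sketch of that reroute loop, for illustration only: the single inlined pattern and the `call_backend` signature are assumptions, and a real implementation would use the full detector rather than one regex.

```python
import re

REFUSAL_RE = re.compile(r"I (?:can't|cannot|am unable to|must decline)", re.IGNORECASE)

def route_with_refusal_fallback(prompt, backends, call_backend):
    """Try each backend in order; skip any response that reads as a refusal.

    Returns (backend, text) for the first willing backend, or (None, None)
    if every backend refused or errored.
    """
    for backend in backends:
        text = call_backend(backend, prompt)  # assumed interface; None on error
        if text is not None and not REFUSAL_RE.search(text):
            return backend, text
        # A refusal or error lands here: log it for quality scoring (#98),
        # then fall through to the next backend.
    return None, None
```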

Related Issues

  • #95 — Build backend registry and cloud routing layer (DIRECT PARENT)
  • #96 — Build task-to-backend classifier
  • #97 — Build cost tracking and budget enforcement
  • #98 — Build backend quality scoring and performance tracking
  • #99 — Wizard dissolution

Cross-Repo References

  • timmy-config #90 — [ROUTING] Streamline local Timmy automation context per job
  • timmy-config #94 — [p0] Cut cloud inheritance from active harness config
  • the-nexus #737 — [ARCH/KT] Directional shift: Timmy sovereign architecture
Author
Owner

Addendum: The Routing Philosophy is Inverted

Every project analyzed above optimizes to MINIMIZE spend. The uniwizard must MAXIMIZE utilization of already-purchased quota.

$500/month across backends. Quotas reset. Unused tokens = wasted money. Timmy's job is to burn through quota toward goals, not to conserve it.

This means the implementation recommendation changes:

  • Don't steal LiteLLM's cost-based routing (routes to cheapest). Instead: route to backend with most headroom.
  • Don't steal RouteLLM's cheap/expensive binary. Instead: always use the best tool, factor in quota remaining.
  • The refusal detector matters even more — a refusal wastes a turn that could have gone to a willing backend.

GOAP (Goal Oriented Action Planning) is the execution model. Define done, plan backwards, execute aggressively, re-plan on failure. Not step-by-step reactive. Goal-driven. Urgent.
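Under this inverted philosophy, backend selection reduces to a headroom comparison rather than a price comparison. A hypothetical sketch (the quota schema is an assumption):

```python
def most_headroom(quotas: dict[str, dict[str, float]]) -> str:
    """Pick the backend with the most unused quota this billing cycle.

    quotas maps backend -> {"limit": tokens, "used": tokens}; both keys
    are illustrative, not a real billing API.
    """
    return max(quotas, key=lambda b: quotas[b]["limit"] - quotas[b]["used"])
```

In practice this would be one input among several: headroom breaks ties after task-fit (#96) and health (#98) have filtered the candidates.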

Owner

Ezra owns research synthesis. Will distill #101 findings into actionable specs for Timmy's implementation tickets.

ezra was assigned by Timmy 2026-03-30 16:03:30 +00:00
Owner

Research complete. Findings absorbed into #95 (backend registry), #96 (classifier), and board pass on #94. Semantic refusal detection documented. Closing.

Timmy closed this issue 2026-03-30 16:49:46 +00:00

Reference: Timmy_Foundation/timmy-home#101