[CRITICAL] Fix Provider Fallback Chain — try/except/continue #445

Open
opened 2026-04-09 22:17:27 +00:00 by perplexity · 0 comments

Source

KT Bezalel Architecture Session 2026-04-08 — Immediate Priority #2

Problem

The cascade failure pattern: agent tries a new provider → provider returns error or incompatible response → instead of skipping to next provider, the agent dies or corrupts its own config.

This is what killed the fleet when evaluating MiMo V2 Pro.

Required Fix

  • Timeout + skip on provider error, not crash
  • try/except with continue to next provider in the cascade
  • Never let a provider failure kill the agent — only degrade to the next option
  • Log every fallback event to request_log (see telemetry issue)
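One way to bound a single provider call, sketched with the standard library only. `call_with_timeout` and its `fn(prompt)` calling convention are hypothetical names for illustration, not the agent's real interface. The 30-second default is taken from the acceptance criteria.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, prompt, timeout_s=30.0):
    """Run fn(prompt) in a worker thread and give up after timeout_s seconds.

    Raises FutureTimeout when the deadline passes. The worker thread is not
    killed (Python cannot do that); the caller simply stops waiting for it.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, prompt)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a hung provider; drop anything still queued.
        pool.shutdown(wait=False, cancel_futures=True)
```

The caller is expected to catch `FutureTimeout` like any other provider error and move on to the next provider. Where the HTTP client already accepts a timeout argument, passing it through is simpler than a worker thread.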

Golden State Provider Chain

Per PR #440 (Anthropic purge), the golden state is:

  1. Kimi K2.5 (primary)
  2. Gemini 2.5 Pro via OpenRouter (fallback)
  3. Ollama gemma4:latest on localhost (terminal fallback)
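A sketch of how the chain could be held as read-only data at runtime, so that a failing provider has nothing to overwrite. `ProviderSpec`, `GOLDEN_CHAIN`, and the field names are assumptions made for illustration; the actual config format from PR #440 is not shown in this issue.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ProviderSpec:
    """Immutable description of one link in the chain (hypothetical shape)."""
    name: str
    role: str
    timeout_s: float = 30.0  # default from the acceptance criteria


# A tuple of frozen dataclasses: neither the order nor the entries can be
# reassigned in place by code that handles a provider error.
GOLDEN_CHAIN = (
    ProviderSpec("Kimi K2.5", "primary"),
    ProviderSpec("Gemini 2.5 Pro via OpenRouter", "fallback"),
    ProviderSpec("Ollama gemma4:latest on localhost", "terminal fallback"),
)
```

Assigning to a field of any entry raises `FrozenInstanceError`, and assigning to a tuple slot raises `TypeError`, so accidental mutation fails loudly instead of silently corrupting the chain.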

Pseudocode
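
The cascade described under Required Fix can be sketched as follows. This is a minimal illustration, not the agent's actual code: `Provider`, `call_with_fallback`, and the list standing in for request_log are hypothetical, and each provider callable is assumed to honour the timeout it is given and raise on failure.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Provider:
    name: str
    call: Callable[[str, float], str]  # (prompt, timeout_s) -> response text
    timeout_s: float = 30.0


def call_with_fallback(
    chain: Sequence[Provider], prompt: str, request_log: List[dict]
) -> Optional[str]:
    """Return the first usable response, or None if every provider fails."""
    for position, provider in enumerate(chain):
        started = time.monotonic()
        try:
            response = provider.call(prompt, provider.timeout_s)
            if not isinstance(response, str) or not response:
                raise ValueError("incompatible response")
            return response
        except Exception as exc:  # broad on purpose: no failure may escape
            # Record the fallback event, then skip to the next provider.
            request_log.append({
                "event": "fallback",
                "provider": provider.name,
                "position": position,
                "error": f"{type(exc).__name__}: {exc}",
                "elapsed_s": round(time.monotonic() - started, 3),
            })
            continue
    return None  # all providers failed: caller stays alive in degraded state
```

The function only reads from `chain` and never writes to it, which keeps the config untouched on error. Returning `None` instead of raising leaves the decision about degraded behaviour to the caller.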

Acceptance Criteria

  • Provider errors caught and logged, not propagated as crashes
  • Timeout per provider (30s default, configurable)
  • Automatic cascade to next provider in chain
  • Every fallback event logged to request_log
  • Agent survives all providers failing (graceful degradation, not crash)
  • Config is NEVER mutated by a provider error
  • Tested: mock provider timeout → verify cascade to next provider
  • Tested: all providers fail → verify agent stays alive in degraded state
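
The two test criteria above could look roughly like this with `unittest.mock`. `first_successful` is a hypothetical stand-in for the agent's cascade, included only so the sketch runs on its own; the real tests would target the real entry point.

```python
from unittest.mock import Mock


def first_successful(chain, prompt, request_log):
    """Hypothetical stand-in for the cascade: skip providers that raise."""
    for name, call in chain:
        try:
            return call(prompt)
        except Exception as exc:
            request_log.append((name, type(exc).__name__))
            continue
    return None


def test_timeout_cascades_to_next_provider():
    primary = Mock(side_effect=TimeoutError("timed out"))
    fallback = Mock(return_value="ok")
    log = []
    chain = [("primary", primary), ("fallback", fallback)]
    assert first_successful(chain, "ping", log) == "ok"
    fallback.assert_called_once_with("ping")
    assert log == [("primary", "TimeoutError")]


def test_all_providers_fail_leaves_agent_alive():
    chain = [(f"p{i}", Mock(side_effect=ConnectionError("down"))) for i in range(3)]
    log = []
    # No exception escapes; None signals the degraded state.
    assert first_successful(chain, "ping", log) is None
    assert [name for name, _ in log] == ["p0", "p1", "p2"]
```

Both functions follow pytest naming, so they can be collected as-is or called directly.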

Dependencies

  • request_log telemetry table (for logging fallback events)
  • Golden state config (done — PR #440)
  • Must be fixed BEFORE resurrecting wizards
perplexity added this to the KT-2026-04-08: Infrastructure Stabilization milestone 2026-04-09 22:17:27 +00:00
ezra was assigned by Timmy 2026-04-09 23:31:49 +00:00

Reference: Timmy_Foundation/timmy-config#445