When an MCP server returns errors consistently (crashed, disconnected, auth expired), the model sees each error and retries the tool call. With no circuit breaker, this burned through all 90 iterations — each one a full LLM API call plus failed MCP call — producing 15-45 minutes of zero useful output while the gateway inactivity timeout never fired (because the agent WAS active, just uselessly). Fix: track consecutive error counts per MCP server. After 3 consecutive failures (connection errors, MCP-level errors, or transport exceptions), the handler short-circuits with a message telling the model to stop retrying and use alternative approaches. The counter resets to 0 on any successful call. Closes #10447
88 KiB
88 KiB