[RESILIENCE] Define per-agent fallback portfolios and routing doctrine (#170)

2026-04-04 21:40:36 +00:00
parent 2142d20129
commit ff7e22dcc8
3 changed files with 537 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -13,6 +13,7 @@ timmy-config/
 ├── FALSEWORK.md               ← API cost management strategy
 ├── DEPRECATED.md              ← What was removed and why
 ├── config.yaml                ← Hermes harness configuration
+├── fallback-portfolios.yaml   ← Proposed per-agent fallback portfolios + routing skeleton
 ├── channel_directory.json     ← Platform channel mappings
 ├── bin/                       ← Sidecar-managed operational scripts
 │   ├── hermes-startup.sh      ← Dormant startup path (audit before enabling)
@@ -28,14 +29,15 @@ timmy-config/
 ├── docs/
 │   ├── automation-inventory.md ← Live automation + stale-state inventory
 │   ├── ipc-hub-and-spoke-doctrine.md ← Coordinator-first, transport-agnostic fleet IPC doctrine
-│   └── coordinator-first-protocol.md ← Coordinator doctrine: intake → triage → route → track → verify → report
+│   ├── coordinator-first-protocol.md ← Coordinator doctrine: intake → triage → route → track → verify → report
+│   └── fallback-portfolios.md ← Routing and degraded-authority doctrine
 └── training/                  ← Transitional training recipes, not canonical lived data
 ```
 ## Boundary
-`timmy-config` owns identity, conscience, memories, skins, playbooks, channel
+`timmy-config` owns identity, conscience, memories, skins, playbooks, routing doctrine,
-maps, and harness-side orchestration glue.
+channel maps, fallback portfolio declarations, and harness-side orchestration glue.
 `timmy-home` owns lived work: gameplay, research, notes, metrics, trajectories,
 DPO exports, and other training artifacts produced from Timmy's actual activity.
--- a/docs/fallback-portfolios.md
+++ b/docs/fallback-portfolios.md
@@ -0,0 +1,248 @@
 # Per-Agent Fallback Portfolios and Task-Class Routing
 Status: proposed doctrine for issue #155
 Scope: policy and sidecar structure only; no runtime wiring in `tasks.py` or live loops yet
 ## Why this exists
 Timmy already has multiple model paths declared in `config.yaml`, multiple task surfaces in `playbooks/`, and multiple live automation lanes documented in `docs/automation-inventory.md`.
 What is missing is a declared resilience doctrine for how specific agents degrade when a provider, quota, or model family fails. Without that doctrine, the whole fleet tends to collapse onto the same fallback chain, which means one outage turns into synchronized fleet degradation.
 This spec makes the fallback graph explicit before runtime wiring lands.
 ## Timmy ownership boundary
 `timmy-config` owns:
 - routing doctrine for Timmy-side task classes
 - sidecar-readable fallback portfolio declarations
 - capability floors and degraded-mode authority restrictions
 - the mapping between current playbooks and future resilient agent lanes
 `timmy-config` does not own:
 - live queue state or issue truth outside Gitea
 - launchd state, loop resurrection, or stale runtime reuse
 - ad hoc worktree history or hidden queue mutation
 That split matters. This repo should declare how routing is supposed to work. Runtime surfaces should consume that declaration instead of inventing their own fallback orderings.
 ## Non-goals
 This issue does not:
 - fully wire portfolio selection into `tasks.py`, launch agents, or live loops
 - bless human-token or operator-token fallbacks as part of an automated chain
 - allow degraded agents to keep full authority just because they are still producing output
 ## Role classes
 ### 1. Judgment
 Use for work where the main risk is a bad decision, not a missing patch.
 Current Timmy surfaces:
 - `playbooks/issue-triager.yaml`
 - `playbooks/pr-reviewer.yaml`
 - `playbooks/verified-logic.yaml`
 Typical task classes:
 - issue triage
 - queue routing
 - PR review
 - proof / consistency checks
 - governance-sensitive review
 Judgment lanes may read broadly, but they lose authority earlier than builder lanes when degraded.
 ### 2. Builder
 Use for work where the main risk is producing or verifying a change.
 Current Timmy surfaces:
 - `playbooks/bug-fixer.yaml`
 - `playbooks/test-writer.yaml`
 - `playbooks/refactor-specialist.yaml`
 Typical task classes:
 - bug fixes
 - test writing
 - bounded refactors
 - narrow docs or code repairs with verification
 Builder lanes keep patch-producing usefulness longer than judgment lanes, but they must lose control-plane authority as they degrade.
 ### 3. Wolf / bulk
 Use for repetitive, high-volume, bounded, reversible work.
 Current Timmy world-state:
 - bulk and sweep behavior is still represented more by live ops reality in `docs/automation-inventory.md` than by a dedicated sidecar playbook
 - this class covers the work shape currently associated with queue hygiene, inventory refresh, docs sweeps, log summarization, and repetitive small-diff passes
 Typical task classes:
 - docs inventory refresh
 - log summarization
 - queue hygiene
 - repetitive small diffs
 - research or extraction sweeps
 Wolf / bulk lanes are throughput-first and deliberately lower-authority.
 ## Routing policy
 1. If the task touches a sensitive control surface, route to judgment first even if the edit is small.
 2. If the task is primarily about merge authority, routing authority, proof, or governance, route to judgment.
 3. If the task is primarily about producing a patch with local verification, route to builder.
 4. If the task is repetitive, bounded, reversible, and low-authority, route to wolf / bulk.
 5. If a wolf / bulk task expands beyond its size or authority envelope, promote it upward; do not let it keep grinding forward through scope creep.
 6. If a builder task becomes architecture, multi-repo coordination, or control-plane review, promote it to judgment.
 7. If a lane reaches terminal fallback, it must still land in a usable degraded mode. Dead silence is not an acceptable terminal state.
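The routing rules above can be sketched as a small classifier. This is a minimal illustration, not the runtime selector: the `Task` fields (`touches_sensitive_surface`, `exceeds_envelope`) and the `route` function are hypothetical names, the task-class strings are taken from the `routing` map in `fallback-portfolios.yaml`, and sending unknown task classes and over-envelope wolf / bulk work to judgment is an assumed conservative default.

```python
from dataclasses import dataclass

# Task-class names mirror the `routing` map in fallback-portfolios.yaml.
JUDGMENT_TASKS = {"issue-triage", "queue-routing", "pr-review",
                  "proof-check", "governance-review"}
BUILDER_TASKS = {"bug-fix", "test-writing", "refactor", "bounded-docs-change"}
WOLF_BULK_TASKS = {"docs-inventory", "log-summarization", "queue-hygiene",
                   "repetitive-small-diff", "research-sweep"}


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; field names are illustrative only."""
    task_class: str
    touches_sensitive_surface: bool = False
    exceeds_envelope: bool = False  # scope grew past the lane's size/authority box


def route(task: Task) -> str:
    """Return the role class for a task, following routing rules 1-6."""
    if task.touches_sensitive_surface:      # rule 1: sensitive surface wins
        return "judgment"
    if task.task_class in JUDGMENT_TASKS:   # rule 2
        return "judgment"
    if task.task_class in BUILDER_TASKS:    # rules 3 and 6
        return "judgment" if task.exceeds_envelope else "builder"
    if task.task_class in WOLF_BULK_TASKS:  # rules 4 and 5
        # Assumption: "promote upward" is modeled as promotion to judgment.
        return "judgment" if task.exceeds_envelope else "wolf_bulk"
    # Assumption: unknown task classes get the most conservative lane.
    return "judgment"
```

Under these assumptions, `route(Task("bug-fix"))` yields `"builder"`, while the same task with `touches_sensitive_surface=True` yields `"judgment"`.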
 ## Sensitive control surfaces
 These paths stay judgment-routed unless explicitly reviewed otherwise:
 - `SOUL.md`
 - `config.yaml`
 - `deploy.sh`
 - `tasks.py`
 - `playbooks/`
 - `cron/`
 - `memories/`
 - `skins/`
 - `training/`
 This mirrors the current PR-review doctrine and keeps degraded builder or bulk lanes away from Timmy's control plane.
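A path check against this list could look like the sketch below. It assumes repo-relative POSIX paths, treats entries ending in `/` as directory prefixes, and treats the rest as exact root-level files; the function name `touches_sensitive_surface` is hypothetical.

```python
import posixpath

# Copied from the list above; entries ending in "/" are directory prefixes.
SENSITIVE_CONTROL_SURFACES = (
    "SOUL.md", "config.yaml", "deploy.sh", "tasks.py",
    "playbooks/", "cron/", "memories/", "skins/", "training/",
)


def touches_sensitive_surface(path: str) -> bool:
    """True if a repo-relative path falls on a sensitive control surface."""
    normalized = posixpath.normpath(path)  # "./config.yaml" -> "config.yaml"
    for surface in SENSITIVE_CONTROL_SURFACES:
        if surface.endswith("/"):
            # Directory surface: match the directory itself or anything under it.
            if normalized == surface.rstrip("/") or normalized.startswith(surface):
                return True
        elif normalized == surface:
            # File surface: exact match at the repo root only (an assumption).
            return True
    return False
```

A task would be judgment-routed if any of its touched paths returns `True` here.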
 ## Portfolio design rules
 The sidecar portfolio declaration in `fallback-portfolios.yaml` follows these rules:
 1. Every critical agent gets four slots:
   - primary
   - fallback1
   - fallback2
   - terminal fallback
 2. No two critical agents may share the same `primary + fallback1` pair.
 3. Provider families should be anti-correlated across critical lanes whenever practical.
 4. Terminal fallbacks must end in a usable degraded lane, not a null lane.
 5. At least one critical lane must end on a local-capable path.
 6. No human-token fallback patterns are allowed in automated chains.
 7. Degraded mode reduces authority before it removes usefulness.
 8. A terminal lane that cannot safely produce an artifact is not a valid terminal lane.
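Several of these rules are mechanically checkable. The sketch below covers rules 1, 2, 4, and 5 against a dict shaped like the `agents` section of `fallback-portfolios.yaml` after parsing; `validate_portfolios` is a hypothetical helper, and treating a non-empty `usable_output` list as proof of a usable terminal lane is an assumption.

```python
REQUIRED_SLOTS = ("primary", "fallback1", "fallback2", "terminal")


def validate_portfolios(agents: dict) -> list:
    """Return a list of rule violations; an empty list means the checks passed."""
    errors = []
    seen_pairs = {}          # (primary, fallback1) -> first agent that used it
    critical_count = 0
    any_local_terminal = False

    for name, agent in agents.items():
        if not agent.get("critical"):
            continue
        critical_count += 1
        portfolio = agent.get("portfolio", {})

        # Rule 1: every critical agent declares all four slots.
        missing = [slot for slot in REQUIRED_SLOTS if slot not in portfolio]
        if missing:
            errors.append(f"{name}: missing slots {missing}")
            continue

        # Rule 2: no shared primary + fallback1 pair across critical agents.
        pair = tuple((portfolio[s]["provider"], portfolio[s]["model"])
                     for s in ("primary", "fallback1"))
        if pair in seen_pairs:
            errors.append(f"{name}: shares primary+fallback1 with {seen_pairs[pair]}")
        else:
            seen_pairs[pair] = name

        # Rule 4: the terminal slot must declare something it can still produce.
        terminal = portfolio["terminal"]
        if not terminal.get("usable_output"):
            errors.append(f"{name}: terminal lane declares no usable output")
        if terminal.get("local_capable"):
            any_local_terminal = True

    # Rule 5: at least one critical lane ends on a local-capable path.
    if critical_count and not any_local_terminal:
        errors.append("no critical agent ends on a local-capable terminal lane")
    return errors
```

Rules about provider-family anti-correlation and human-token bans need extra metadata that the portfolio file does not declare yet, so they are left out of this sketch.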
 ## Explicit ban: synchronized fleet degradation
 Synchronized fleet degradation is forbidden.
 That means:
 - do not point every critical agent at the same fallback stack
 - do not let all judgment agents converge on the same first backup if avoidable
 - do not let all builder agents collapse onto the same weak terminal lane
 - do not treat "everyone fell back to the cheapest thing" as resilience
 A resilient fleet degrades unevenly on purpose. Some lanes should stay sharp while others become slower or narrower.
 ## Capability floors and degraded authority
 ### Shared slot semantics
 - `primary`: full role-class authority
 - `fallback1`: full task authority for normal work, but no silent broadening of scope
 - `fallback2`: bounded and reversible work only; no irreversible control-plane action
 - `terminal`: usable degraded lane only; must produce a machine-usable artifact but must not impersonate full authority
 ### Judgment floors
 Judgment agents lose authority earliest.
 At `fallback2` and `terminal`, judgment lanes must not:
 - merge PRs
 - close or rewrite governing issues or PRs
 - mutate sensitive control surfaces
 - bulk-reassign the fleet
 - silently change routing policy
 Their degraded usefulness is still real:
 - classify backlog
 - produce draft routing plans
 - summarize risk
 - leave bounded labels or comments with explicit evidence
 ### Builder floors
 Builder agents may continue doing useful narrow work deeper into degradation, but only inside a tighter box.
 At `fallback2`, builder lanes must be limited to:
 - single-issue work
 - reversible patches
 - narrow docs or test scaffolds
 - bounded file counts and small diff sizes
 At `terminal`, builder lanes must not:
 - touch sensitive control surfaces
 - merge or release
 - do multi-repo or architecture work
 - claim verification they did not run
 Their terminal usefulness may still include:
 - a small patch
 - a reproducer test
 - a docs fix
 - a draft branch or artifact for later review
 ### Wolf / bulk floors
 Wolf / bulk lanes stay useful as summarizers and sweepers, not as governors.
 At `fallback2` and `terminal`, wolf / bulk lanes must not:
 - fan out branch creation across repos
 - mass-assign agents
 - edit sensitive control surfaces
 - perform irreversible queue mutation
 Their degraded usefulness may still include:
 - gathering evidence
 - refreshing inventories
 - summarizing logs
 - proposing labels or routes
 - producing repetitive, low-risk artifacts inside explicit caps
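One way to make these floors enforceable is a deny table keyed by role class and slot, consulted before any action runs. The sketch below is illustrative only: the action identifiers are hypothetical names for the bullets above, and the builder `fallback2` deny set is inferred from its "limited to" list and the `denied` entries in `fallback-portfolios.yaml`, not stated verbatim in this doc.

```python
# Hypothetical action identifiers standing in for the deny bullets above.
JUDGMENT_DENIED = frozenset({
    "merge_pr", "close_or_rewrite_governing_item", "mutate_sensitive_surface",
    "bulk_reassign_fleet", "change_routing_policy",
})
WOLF_BULK_DENIED = frozenset({
    "multi_repo_branch_fanout", "mass_assign_agents",
    "mutate_sensitive_surface", "irreversible_queue_mutation",
})

DENIED = {
    ("judgment", "fallback2"): JUDGMENT_DENIED,
    ("judgment", "terminal"): JUDGMENT_DENIED,
    # Assumption: inferred from the builder fallback2 "limited to" list.
    ("builder", "fallback2"): frozenset({
        "multi_repo_work", "mutate_sensitive_surface", "merge_or_release",
    }),
    ("builder", "terminal"): frozenset({
        "mutate_sensitive_surface", "merge_or_release", "multi_repo_work",
        "architecture_work", "claim_unrun_verification",
    }),
    ("wolf_bulk", "fallback2"): WOLF_BULK_DENIED,
    ("wolf_bulk", "terminal"): WOLF_BULK_DENIED,
}


def is_allowed(role_class: str, slot: str, action: str) -> bool:
    """True unless the action is explicitly denied for this role class and slot."""
    return action not in DENIED.get((role_class, slot), frozenset())
```

This only encodes explicit denials. A stricter implementation would also enforce the allow-lists, for example the bounded file counts and diff sizes required of builder lanes at `fallback2`.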
 ## Usable terminal lanes
 A terminal fallback is only valid if it still does at least one of these safely:
 - classify and summarize a backlog
 - produce a bounded patch or test artifact
 - summarize a diff with explicit uncertainty
 - refresh an inventory or evidence bundle
 If the terminal lane can only say "model unavailable" and stop, the portfolio is incomplete.
 ## Current sidecar reference lanes
 `fallback-portfolios.yaml` defines the initial implementation-ready structure for four named lanes:
 - `triage-coordinator` — judgment
 - `pr-reviewer` — judgment
 - `builder-main` — builder
 - `wolf-sweeper` — wolf / bulk
 These are the canonical resilience lanes for the current Timmy world-state.
 Current playbooks should eventually map onto them like this:
 - `playbooks/issue-triager.yaml` -> `triage-coordinator`
 - `playbooks/pr-reviewer.yaml` -> `pr-reviewer`
 - `playbooks/verified-logic.yaml` -> judgment lane family, pending a dedicated proof profile if needed
 - `playbooks/bug-fixer.yaml`, `playbooks/test-writer.yaml`, and `playbooks/refactor-specialist.yaml` -> `builder-main`
 - future sidecar bulk playbooks should inherit from `wolf-sweeper` instead of inventing independent fallback chains
 Until runtime wiring lands, unmapped playbooks should be treated as policy-incomplete rather than inheriting an implicit fallback chain.
 ## Wiring contract for later implementation
 When this is wired into runtime selection, the selector should:
 - classify the incoming task into a role class
 - check whether the task touches a sensitive control surface
 - choose the named agent lane for that class
 - step through the declared portfolio slots in order
 - enforce the capability floor of the active slot before taking action
 - record when a fallback transition happened and what authority was still allowed
 The important part is not just choosing a different model. It is choosing a different authority envelope as the lane degrades.
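The slot-stepping and transition-recording parts of this contract might look like the sketch below. It is not the planned `tasks.py` wiring: `select_slot`, the `is_available` callback, and the shape of the log record are all hypothetical, and the portfolio dict is assumed to match one agent's `portfolio` block from `fallback-portfolios.yaml`.

```python
SLOT_ORDER = ("primary", "fallback1", "fallback2", "terminal")


def select_slot(portfolio: dict, is_available, log: list):
    """Walk the declared slots in order and return (slot_name, slot_entry).

    `is_available(provider, model)` is a caller-supplied health check.
    A record is appended to `log` whenever the active slot is not `primary`,
    so the authority envelope in force can be audited later.
    """
    skipped = []
    for slot in SLOT_ORDER:
        entry = portfolio[slot]
        if is_available(entry["provider"], entry["model"]):
            if skipped:
                log.append({
                    "event": "fallback_transition",
                    "active_slot": slot,
                    "lane": entry["lane"],
                    "skipped": list(skipped),
                })
            return slot, entry
        skipped.append(slot)

    # Even the terminal lane is down: surface it loudly instead of going silent.
    log.append({"event": "portfolio_exhausted", "skipped": list(skipped)})
    raise RuntimeError("all portfolio slots unavailable; portfolio is incomplete")
```

The caller would then apply the capability floor for the returned slot name before acting, so a switch of model always comes with the matching switch of authority.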
--- a/fallback-portfolios.yaml
+++ b/fallback-portfolios.yaml
@@ -0,0 +1,284 @@
 schema_version: 1
 status: proposed
 runtime_wiring: false
 owner: timmy-config
 ownership:
  owns:
    - routing doctrine for task classes
    - sidecar-readable per-agent fallback portfolios
    - degraded-mode capability floors
  does_not_own:
    - live queue state outside Gitea truth
    - launchd or loop process state
    - ad hoc worktree history
 policy:
  require_four_slots_for_critical_agents: true
  terminal_fallback_must_be_usable: true
  forbid_synchronized_fleet_degradation: true
  forbid_human_token_fallbacks: true
  anti_correlation_rule: no two critical agents may share the same primary+fallback1 pair
 sensitive_control_surfaces:
  - SOUL.md
  - config.yaml
  - deploy.sh
  - tasks.py
  - playbooks/
  - cron/
  - memories/
  - skins/
  - training/
 role_classes:
  judgment:
    current_surfaces:
      - playbooks/issue-triager.yaml
      - playbooks/pr-reviewer.yaml
      - playbooks/verified-logic.yaml
    task_classes:
      - issue-triage
      - queue-routing
      - pr-review
      - proof-check
      - governance-review
    degraded_mode:
      fallback2:
        allowed:
          - classify backlog
          - summarize risk
          - produce draft routing plans
          - leave bounded labels or comments with evidence
        denied:
          - merge pull requests
          - close or rewrite governing issues or PRs
          - mutate sensitive control surfaces
          - bulk-reassign the fleet
          - silently change routing policy
      terminal:
        lane: report-and-route
        allowed:
          - classify backlog
          - summarize risk
          - produce draft routing artifacts
        denied:
          - merge pull requests
          - bulk-reassign the fleet
          - mutate sensitive control surfaces
  builder:
    current_surfaces:
      - playbooks/bug-fixer.yaml
      - playbooks/test-writer.yaml
      - playbooks/refactor-specialist.yaml
    task_classes:
      - bug-fix
      - test-writing
      - refactor
      - bounded-docs-change
    degraded_mode:
      fallback2:
        allowed:
          - reversible single-issue changes
          - narrow docs fixes
          - test scaffolds and reproducers
        denied:
          - cross-repo changes
          - sensitive control-surface edits
          - merge or release actions
      terminal:
        lane: narrow-patch
        allowed:
          - single-issue small patch
          - reproducer test
          - docs-only repair
        denied:
          - sensitive control-surface edits
          - multi-file architecture work
          - irreversible actions
  wolf_bulk:
    current_surfaces:
      - docs/automation-inventory.md
      - FALSEWORK.md
    task_classes:
      - docs-inventory
      - log-summarization
      - queue-hygiene
      - repetitive-small-diff
      - research-sweep
    degraded_mode:
      fallback2:
        allowed:
          - gather evidence
          - refresh inventories
          - summarize logs
          - propose labels or routes
        denied:
          - multi-repo branch fanout
          - mass agent assignment
          - sensitive control-surface edits
          - irreversible queue mutation
      terminal:
        lane: gather-and-summarize
        allowed:
          - inventory refresh
          - evidence bundles
          - summaries
        denied:
          - multi-repo branch fanout
          - mass agent assignment
          - sensitive control-surface edits
 routing:
  issue-triage: judgment
  queue-routing: judgment
  pr-review: judgment
  proof-check: judgment
  governance-review: judgment
  bug-fix: builder
  test-writing: builder
  refactor: builder
  bounded-docs-change: builder
  docs-inventory: wolf_bulk
  log-summarization: wolf_bulk
  queue-hygiene: wolf_bulk
  repetitive-small-diff: wolf_bulk
  research-sweep: wolf_bulk
 promotion_rules:
  - If a wolf/bulk task touches a sensitive control surface, promote it to judgment.
  - If a builder task grows beyond 5 files, or turns into architecture review or multi-repo coordination, promote it to judgment.
  - If a terminal lane cannot produce a usable artifact, the portfolio is invalid and must be redesigned before wiring.
 agents:
  triage-coordinator:
    role_class: judgment
    critical: true
    current_playbooks:
      - playbooks/issue-triager.yaml
    portfolio:
      primary:
        provider: anthropic
        model: claude-opus-4-6
        lane: full-judgment
      fallback1:
        provider: openai-codex
        model: codex
        lane: high-judgment
      fallback2:
        provider: gemini
        model: gemini-2.5-pro
        lane: bounded-judgment
      terminal:
        provider: ollama
        model: hermes3:latest
        lane: report-and-route
        local_capable: true
        usable_output:
          - backlog classification
          - routing draft
          - risk summary
  pr-reviewer:
    role_class: judgment
    critical: true
    current_playbooks:
      - playbooks/pr-reviewer.yaml
    portfolio:
      primary:
        provider: anthropic
        model: claude-opus-4-6
        lane: full-review
      fallback1:
        provider: gemini
        model: gemini-2.5-pro
        lane: high-review
      fallback2:
        provider: grok
        model: grok-3-mini-fast
        lane: comment-only-review
      terminal:
        provider: openrouter
        model: openai/gpt-4.1-mini
        lane: low-stakes-diff-summary
        local_capable: false
        usable_output:
          - diff risk summary
          - explicit uncertainty notes
          - merge-block recommendation
  builder-main:
    role_class: builder
    critical: true
    current_playbooks:
      - playbooks/bug-fixer.yaml
      - playbooks/test-writer.yaml
      - playbooks/refactor-specialist.yaml
    portfolio:
      primary:
        provider: openai-codex
        model: codex
        lane: full-builder
      fallback1:
        provider: kimi-coding
        model: kimi-k2.5
        lane: bounded-builder
      fallback2:
        provider: groq
        model: llama-3.3-70b-versatile
        lane: small-patch-builder
      terminal:
        provider: custom_provider
        provider_name: Local llama.cpp
        model: hermes4:14b
        lane: narrow-patch
        local_capable: true
        usable_output:
          - small patch
          - reproducer test
          - docs repair
  wolf-sweeper:
    role_class: wolf_bulk
    critical: true
    current_world_state:
      - docs/automation-inventory.md
    portfolio:
      primary:
        provider: gemini
        model: gemini-2.5-flash
        lane: fast-bulk
      fallback1:
        provider: groq
        model: llama-3.3-70b-versatile
        lane: fast-bulk-backup
      fallback2:
        provider: openrouter
        model: openai/gpt-4.1-mini
        lane: bounded-bulk-summary
      terminal:
        provider: ollama
        model: hermes3:latest
        lane: gather-and-summarize
        local_capable: true
        usable_output:
          - inventory refresh
          - evidence bundle
          - summary comment
 cross_checks:
  unique_primary_fallback1_pairs:
    triage-coordinator:
      - anthropic/claude-opus-4-6
      - openai-codex/codex
    pr-reviewer:
      - anthropic/claude-opus-4-6
      - gemini/gemini-2.5-pro
    builder-main:
      - openai-codex/codex
      - kimi-coding/kimi-k2.5
    wolf-sweeper:
      - gemini/gemini-2.5-flash
      - groq/llama-3.3-70b-versatile