diff --git a/README.md b/README.md
index fe2a19e4..4fdf5dec 100644
--- a/README.md
+++ b/README.md
@@ -13,6 +13,7 @@ timmy-config/
 ├── FALSEWORK.md ← API cost management strategy
 ├── DEPRECATED.md ← What was removed and why
 ├── config.yaml ← Hermes harness configuration
+├── fallback-portfolios.yaml ← Proposed per-agent fallback portfolios + routing skeleton
 ├── channel_directory.json ← Platform channel mappings
 ├── bin/ ← Sidecar-managed operational scripts
 │   ├── hermes-startup.sh ← Dormant startup path (audit before enabling)
@@ -28,14 +29,15 @@ timmy-config/
 ├── docs/
 │   ├── automation-inventory.md ← Live automation + stale-state inventory
 │   ├── ipc-hub-and-spoke-doctrine.md ← Coordinator-first, transport-agnostic fleet IPC doctrine
-│   └── coordinator-first-protocol.md ← Coordinator doctrine: intake → triage → route → track → verify → report
+│   ├── coordinator-first-protocol.md ← Coordinator doctrine: intake → triage → route → track → verify → report
+│   └── fallback-portfolios.md ← Routing and degraded-authority doctrine
 └── training/ ← Transitional training recipes, not canonical lived data
 ```
 
 ## Boundary
 
-`timmy-config` owns identity, conscience, memories, skins, playbooks, channel
-maps, and harness-side orchestration glue.
+`timmy-config` owns identity, conscience, memories, skins, playbooks, routing doctrine,
+channel maps, fallback portfolio declarations, and harness-side orchestration glue.
 
 `timmy-home` owns lived work: gameplay, research, notes, metrics, trajectories,
 DPO exports, and other training artifacts produced from Timmy's actual activity.
diff --git a/docs/fallback-portfolios.md b/docs/fallback-portfolios.md
new file mode 100644
index 00000000..8c12a290
--- /dev/null
+++ b/docs/fallback-portfolios.md
@@ -0,0 +1,248 @@
+# Per-Agent Fallback Portfolios and Task-Class Routing
+
+Status: proposed doctrine for issue #155
+Scope: policy and sidecar structure only; no runtime wiring in `tasks.py` or live loops yet
+
+## Why this exists
+
+Timmy already has multiple model paths declared in `config.yaml`, multiple task surfaces in `playbooks/`, and multiple live automation lanes documented in `docs/automation-inventory.md`.
+
+What is missing is a declared resilience doctrine for how specific agents degrade when a provider, quota, or model family fails. Without that doctrine, the whole fleet tends to collapse onto the same fallback chain, which means one outage turns into synchronized fleet degradation.
+
+This spec makes the fallback graph explicit before runtime wiring lands.
+
+## Timmy ownership boundary
+
+`timmy-config` owns:
+- routing doctrine for Timmy-side task classes
+- sidecar-readable fallback portfolio declarations
+- capability floors and degraded-mode authority restrictions
+- the mapping between current playbooks and future resilient agent lanes
+
+`timmy-config` does not own:
+- live queue state or issue truth outside Gitea
+- launchd state, loop resurrection, or stale runtime reuse
+- ad hoc worktree history or hidden queue mutation
+
+That split matters. This repo should declare how routing is supposed to work. Runtime surfaces should consume that declaration instead of inventing their own fallback orderings.
+
+## Non-goals
+
+This issue does not:
+- fully wire portfolio selection into `tasks.py`, launch agents, or live loops
+- bless human-token or operator-token fallbacks as part of an automated chain
+- allow degraded agents to keep full authority just because they are still producing output
+
+## Role classes
+
+### 1. Judgment
+
+Use for work where the main risk is a bad decision, not a missing patch.
+
+Current Timmy surfaces:
+- `playbooks/issue-triager.yaml`
+- `playbooks/pr-reviewer.yaml`
+- `playbooks/verified-logic.yaml`
+
+Typical task classes:
+- issue triage
+- queue routing
+- PR review
+- proof / consistency checks
+- governance-sensitive review
+
+Judgment lanes may read broadly, but they lose authority earlier than builder lanes when degraded.
+
+### 2. Builder
+
+Use for work where the main risk is producing or verifying a change.
+
+Current Timmy surfaces:
+- `playbooks/bug-fixer.yaml`
+- `playbooks/test-writer.yaml`
+- `playbooks/refactor-specialist.yaml`
+
+Typical task classes:
+- bug fixes
+- test writing
+- bounded refactors
+- narrow docs or code repairs with verification
+
+Builder lanes keep patch-producing usefulness longer than judgment lanes, but they must lose control-plane authority as they degrade.
+
+### 3. Wolf / bulk
+
+Use for repetitive, high-volume, bounded, reversible work.
+
+Current Timmy world-state:
+- bulk and sweep behavior is still represented more by live ops reality in `docs/automation-inventory.md` than by a dedicated sidecar playbook
+- this class covers the work shape currently associated with queue hygiene, inventory refresh, docs sweeps, log summarization, and repetitive small-diff passes
+
+Typical task classes:
+- docs inventory refresh
+- log summarization
+- queue hygiene
+- repetitive small diffs
+- research or extraction sweeps
+
+Wolf / bulk lanes are throughput-first and deliberately lower-authority.
+
+## Routing policy
+
+1. If the task touches a sensitive control surface, route to judgment first even if the edit is small.
+2. If the task is primarily about merge authority, routing authority, proof, or governance, route to judgment.
+3. If the task is primarily about producing a patch with local verification, route to builder.
+4. If the task is repetitive, bounded, reversible, and low-authority, route to wolf / bulk.
+5. If a wolf / bulk task expands beyond its size or authority envelope, promote it upward; do not let it keep grinding forward through scope creep.
+6. If a builder task becomes architecture, multi-repo coordination, or control-plane review, promote it to judgment.
+7. If a lane reaches terminal fallback, it must still land in a usable degraded mode. Dead silence is not an acceptable terminal state.
+
+## Sensitive control surfaces
+
+These paths stay judgment-routed unless explicitly reviewed otherwise:
+- `SOUL.md`
+- `config.yaml`
+- `deploy.sh`
+- `tasks.py`
+- `playbooks/`
+- `cron/`
+- `memories/`
+- `skins/`
+- `training/`
+
+This mirrors the current PR-review doctrine and keeps degraded builder or bulk lanes away from Timmy's control plane.
+
+## Portfolio design rules
+
+The sidecar portfolio declaration in `fallback-portfolios.yaml` follows these rules:
+
+1. Every critical agent gets four slots:
+   - primary
+   - fallback1
+   - fallback2
+   - terminal fallback
+2. No two critical agents may share the same `primary + fallback1` pair.
+3. Provider families should be anti-correlated across critical lanes whenever practical.
+4. Terminal fallbacks must end in a usable degraded lane, not a null lane.
+5. At least one critical lane must end on a local-capable path.
+6. No human-token fallback patterns are allowed in automated chains.
+7. Degraded mode reduces authority before it removes usefulness.
+8. A terminal lane that cannot safely produce an artifact is not a valid terminal lane.
+
+## Explicit ban: synchronized fleet degradation
+
+Synchronized fleet degradation is forbidden.
+
+That means:
+- do not point every critical agent at the same fallback stack
+- do not let all judgment agents converge on the same first backup if avoidable
+- do not let all builder agents collapse onto the same weak terminal lane
+- do not treat "everyone fell back to the cheapest thing" as resilience
+
+A resilient fleet degrades unevenly on purpose. Some lanes should stay sharp while others become slower or narrower.
+
+## Capability floors and degraded authority
+
+### Shared slot semantics
+
+- `primary`: full role-class authority
+- `fallback1`: full task authority for normal work, but no silent broadening of scope
+- `fallback2`: bounded and reversible work only; no irreversible control-plane action
+- `terminal`: usable degraded lane only; must produce a machine-usable artifact but must not impersonate full authority
+
+### Judgment floors
+
+Judgment agents lose authority earliest.
+
+At `fallback2` and below, judgment lanes must not:
+- merge PRs
+- close or rewrite governing issues or PRs
+- mutate sensitive control surfaces
+- bulk-reassign the fleet
+- silently change routing policy
+
+Their degraded usefulness is still real:
+- classify backlog
+- produce draft routing plans
+- summarize risk
+- leave bounded labels or comments with explicit evidence
+
+### Builder floors
+
+Builder agents may continue doing useful narrow work deeper into degradation, but only inside a tighter box.
+
+At `fallback2`, builder lanes must be limited to:
+- single-issue work
+- reversible patches
+- narrow docs or test scaffolds
+- bounded file counts and small diff sizes
+
+At `terminal`, builder lanes must not:
+- touch sensitive control surfaces
+- merge or release
+- do multi-repo or architecture work
+- claim verification they did not run
+
+Their terminal usefulness may still include:
+- a small patch
+- a reproducer test
+- a docs fix
+- a draft branch or artifact for later review
+
+### Wolf / bulk floors
+
+Wolf / bulk lanes stay useful as summarizers and sweepers, not as governors.
+
+At `fallback2` and `terminal`, wolf / bulk lanes must not:
+- fan out branch creation across repos
+- mass-assign agents
+- edit sensitive control surfaces
+- perform irreversible queue mutation
+
+Their degraded usefulness may still include:
+- gathering evidence
+- refreshing inventories
+- summarizing logs
+- proposing labels or routes
+- producing repetitive, low-risk artifacts inside explicit caps
+
+## Usable terminal lanes
+
+A terminal fallback is only valid if it still does at least one of these safely:
+- classify and summarize a backlog
+- produce a bounded patch or test artifact
+- summarize a diff with explicit uncertainty
+- refresh an inventory or evidence bundle
+
+If the terminal lane can only say "model unavailable" and stop, the portfolio is incomplete.
+
+## Current sidecar reference lanes
+
+`fallback-portfolios.yaml` defines the initial implementation-ready structure for four named lanes:
+- `triage-coordinator` — judgment
+- `pr-reviewer` — judgment
+- `builder-main` — builder
+- `wolf-sweeper` — wolf / bulk
+
+These are the canonical resilience lanes for the current Timmy world-state.
+
+Current playbooks should eventually map onto them like this:
+- `playbooks/issue-triager.yaml` -> `triage-coordinator`
+- `playbooks/pr-reviewer.yaml` -> `pr-reviewer`
+- `playbooks/verified-logic.yaml` -> judgment lane family, pending a dedicated proof profile if needed
+- `playbooks/bug-fixer.yaml`, `playbooks/test-writer.yaml`, and `playbooks/refactor-specialist.yaml` -> `builder-main`
+- future sidecar bulk playbooks should inherit from `wolf-sweeper` instead of inventing independent fallback chains
+
+Until runtime wiring lands, unmapped playbooks should be treated as policy-incomplete rather than inheriting an implicit fallback chain.
+
+## Wiring contract for later implementation
+
+When this is wired into runtime selection, the selector should:
+- classify the incoming task into a role class
+- check whether the task touches a sensitive control surface
+- choose the named agent lane for that class
+- step through the declared portfolio slots in order
+- enforce the capability floor of the active slot before taking action
+- record when a fallback transition happened and what authority was still allowed
+
+The important part is not just choosing a different model. It is choosing a different authority envelope as the lane degrades.
diff --git a/fallback-portfolios.yaml b/fallback-portfolios.yaml
new file mode 100644
index 00000000..f507dc86
--- /dev/null
+++ b/fallback-portfolios.yaml
@@ -0,0 +1,284 @@
+schema_version: 1
+status: proposed
+runtime_wiring: false
+owner: timmy-config
+
+ownership:
+  owns:
+    - routing doctrine for task classes
+    - sidecar-readable per-agent fallback portfolios
+    - degraded-mode capability floors
+  does_not_own:
+    - live queue state outside Gitea truth
+    - launchd or loop process state
+    - ad hoc worktree history
+
+policy:
+  require_four_slots_for_critical_agents: true
+  terminal_fallback_must_be_usable: true
+  forbid_synchronized_fleet_degradation: true
+  forbid_human_token_fallbacks: true
+  anti_correlation_rule: no two critical agents may share the same primary+fallback1 pair
+
+sensitive_control_surfaces:
+  - SOUL.md
+  - config.yaml
+  - deploy.sh
+  - tasks.py
+  - playbooks/
+  - cron/
+  - memories/
+  - skins/
+  - training/
+
+role_classes:
+  judgment:
+    current_surfaces:
+      - playbooks/issue-triager.yaml
+      - playbooks/pr-reviewer.yaml
+      - playbooks/verified-logic.yaml
+    task_classes:
+      - issue-triage
+      - queue-routing
+      - pr-review
+      - proof-check
+      - governance-review
+    degraded_mode:
+      fallback2:
+        allowed:
+          - classify backlog
+          - summarize risk
+          - produce draft routing plans
+          - leave bounded labels or comments with evidence
+        denied:
+          - merge pull requests
+          - close or rewrite governing issues or PRs
+          - mutate sensitive control surfaces
+          - bulk-reassign the fleet
+          - silently change routing policy
+      terminal:
+        lane: report-and-route
+        allowed:
+          - classify backlog
+          - summarize risk
+          - produce draft routing artifacts
+        denied:
+          - merge pull requests
+          - bulk-reassign the fleet
+          - mutate sensitive control surfaces
+
+  builder:
+    current_surfaces:
+      - playbooks/bug-fixer.yaml
+      - playbooks/test-writer.yaml
+      - playbooks/refactor-specialist.yaml
+    task_classes:
+      - bug-fix
+      - test-writing
+      - refactor
+      - bounded-docs-change
+    degraded_mode:
+      fallback2:
+        allowed:
+          - reversible single-issue changes
+          - narrow docs fixes
+          - test scaffolds and reproducers
+        denied:
+          - cross-repo changes
+          - sensitive control-surface edits
+          - merge or release actions
+      terminal:
+        lane: narrow-patch
+        allowed:
+          - single-issue small patch
+          - reproducer test
+          - docs-only repair
+        denied:
+          - sensitive control-surface edits
+          - multi-file architecture work
+          - irreversible actions
+
+  wolf_bulk:
+    current_surfaces:
+      - docs/automation-inventory.md
+      - FALSEWORK.md
+    task_classes:
+      - docs-inventory
+      - log-summarization
+      - queue-hygiene
+      - repetitive-small-diff
+      - research-sweep
+    degraded_mode:
+      fallback2:
+        allowed:
+          - gather evidence
+          - refresh inventories
+          - summarize logs
+          - propose labels or routes
+        denied:
+          - multi-repo branch fanout
+          - mass agent assignment
+          - sensitive control-surface edits
+          - irreversible queue mutation
+      terminal:
+        lane: gather-and-summarize
+        allowed:
+          - inventory refresh
+          - evidence bundles
+          - summaries
+        denied:
+          - multi-repo branch fanout
+          - mass agent assignment
+          - sensitive control-surface edits
+
+routing:
+  issue-triage: judgment
+  queue-routing: judgment
+  pr-review: judgment
+  proof-check: judgment
+  governance-review: judgment
+  bug-fix: builder
+  test-writing: builder
+  refactor: builder
+  bounded-docs-change: builder
+  docs-inventory: wolf_bulk
+  log-summarization: wolf_bulk
+  queue-hygiene: wolf_bulk
+  repetitive-small-diff: wolf_bulk
+  research-sweep: wolf_bulk
+
+promotion_rules:
+  - If a wolf/bulk task touches a sensitive control surface, promote it to judgment.
+  - If a builder task expands beyond 5 files, architecture review, or multi-repo coordination, promote it to judgment.
+  - If a terminal lane cannot produce a usable artifact, the portfolio is invalid and must be redesigned before wiring.
+
+agents:
+  triage-coordinator:
+    role_class: judgment
+    critical: true
+    current_playbooks:
+      - playbooks/issue-triager.yaml
+    portfolio:
+      primary:
+        provider: anthropic
+        model: claude-opus-4-6
+        lane: full-judgment
+      fallback1:
+        provider: openai-codex
+        model: codex
+        lane: high-judgment
+      fallback2:
+        provider: gemini
+        model: gemini-2.5-pro
+        lane: bounded-judgment
+      terminal:
+        provider: ollama
+        model: hermes3:latest
+        lane: report-and-route
+    local_capable: true
+    usable_output:
+      - backlog classification
+      - routing draft
+      - risk summary
+
+  pr-reviewer:
+    role_class: judgment
+    critical: true
+    current_playbooks:
+      - playbooks/pr-reviewer.yaml
+    portfolio:
+      primary:
+        provider: anthropic
+        model: claude-opus-4-6
+        lane: full-review
+      fallback1:
+        provider: gemini
+        model: gemini-2.5-pro
+        lane: high-review
+      fallback2:
+        provider: grok
+        model: grok-3-mini-fast
+        lane: comment-only-review
+      terminal:
+        provider: openrouter
+        model: openai/gpt-4.1-mini
+        lane: low-stakes-diff-summary
+    local_capable: false
+    usable_output:
+      - diff risk summary
+      - explicit uncertainty notes
+      - merge-block recommendation
+
+  builder-main:
+    role_class: builder
+    critical: true
+    current_playbooks:
+      - playbooks/bug-fixer.yaml
+      - playbooks/test-writer.yaml
+      - playbooks/refactor-specialist.yaml
+    portfolio:
+      primary:
+        provider: openai-codex
+        model: codex
+        lane: full-builder
+      fallback1:
+        provider: kimi-coding
+        model: kimi-k2.5
+        lane: bounded-builder
+      fallback2:
+        provider: groq
+        model: llama-3.3-70b-versatile
+        lane: small-patch-builder
+      terminal:
+        provider: custom_provider
+        provider_name: Local llama.cpp
+        model: hermes4:14b
+        lane: narrow-patch
+    local_capable: true
+    usable_output:
+      - small patch
+      - reproducer test
+      - docs repair
+
+  wolf-sweeper:
+    role_class: wolf_bulk
+    critical: true
+    current_world_state:
+      - docs/automation-inventory.md
+    portfolio:
+      primary:
+        provider: gemini
+        model: gemini-2.5-flash
+        lane: fast-bulk
+      fallback1:
+        provider: groq
+        model: llama-3.3-70b-versatile
+        lane: fast-bulk-backup
+      fallback2:
+        provider: openrouter
+        model: openai/gpt-4.1-mini
+        lane: bounded-bulk-summary
+      terminal:
+        provider: ollama
+        model: hermes3:latest
+        lane: gather-and-summarize
+    local_capable: true
+    usable_output:
+      - inventory refresh
+      - evidence bundle
+      - summary comment
+
+cross_checks:
+  unique_primary_fallback1_pairs:
+    triage-coordinator:
+      - anthropic/claude-opus-4-6
+      - openai-codex/codex
+    pr-reviewer:
+      - anthropic/claude-opus-4-6
+      - gemini/gemini-2.5-pro
+    builder-main:
+      - openai-codex/codex
+      - kimi-coding/kimi-k2.5
+    wolf-sweeper:
+      - gemini/gemini-2.5-flash
+      - groq/llama-3.3-70b-versatile
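
The `policy` and `cross_checks` blocks in `fallback-portfolios.yaml` declare rules that could be checked mechanically before any runtime wiring lands. The following is a minimal sketch of such a check, not part of this change. It assumes the `agents` block has already been loaded into a plain dict (hard-coded here so the example needs no YAML library) and that `local_capable` sits at the agent level rather than inside the `terminal` slot. `SLOTS`, `AGENTS`, and `validate_portfolios` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Slot names taken from the "Portfolio design rules" section of the doc.
SLOTS = ("primary", "fallback1", "fallback2", "terminal")

# Hand-copied (provider, model) pairs from the `agents` block; assumed layout.
AGENTS = {
    "triage-coordinator": {
        "critical": True,
        "local_capable": True,
        "portfolio": {
            "primary": ("anthropic", "claude-opus-4-6"),
            "fallback1": ("openai-codex", "codex"),
            "fallback2": ("gemini", "gemini-2.5-pro"),
            "terminal": ("ollama", "hermes3:latest"),
        },
    },
    "pr-reviewer": {
        "critical": True,
        "local_capable": False,
        "portfolio": {
            "primary": ("anthropic", "claude-opus-4-6"),
            "fallback1": ("gemini", "gemini-2.5-pro"),
            "fallback2": ("grok", "grok-3-mini-fast"),
            "terminal": ("openrouter", "openai/gpt-4.1-mini"),
        },
    },
    "builder-main": {
        "critical": True,
        "local_capable": True,
        "portfolio": {
            "primary": ("openai-codex", "codex"),
            "fallback1": ("kimi-coding", "kimi-k2.5"),
            "fallback2": ("groq", "llama-3.3-70b-versatile"),
            "terminal": ("custom_provider", "hermes4:14b"),
        },
    },
    "wolf-sweeper": {
        "critical": True,
        "local_capable": True,
        "portfolio": {
            "primary": ("gemini", "gemini-2.5-flash"),
            "fallback1": ("groq", "llama-3.3-70b-versatile"),
            "fallback2": ("openrouter", "openai/gpt-4.1-mini"),
            "terminal": ("ollama", "hermes3:latest"),
        },
    },
}


def validate_portfolios(agents):
    """Return a list of rule violations; an empty list means the rules hold."""
    errors = []
    critical = {name: a for name, a in agents.items() if a.get("critical")}

    # Rule 1: every critical agent declares all four slots.
    for name, agent in critical.items():
        missing = [s for s in SLOTS if s not in agent["portfolio"]]
        if missing:
            errors.append(f"{name}: missing slots {missing}")

    # Rule 2: no two critical agents share the same primary+fallback1 pair.
    seen = defaultdict(list)
    for name, agent in critical.items():
        slots = agent["portfolio"]
        if "primary" in slots and "fallback1" in slots:
            seen[(slots["primary"], slots["fallback1"])].append(name)
    for pair, names in seen.items():
        if len(names) > 1:
            errors.append(f"shared primary+fallback1 pair {pair}: {sorted(names)}")

    # Rule 5: at least one critical lane ends on a local-capable path.
    if critical and not any(a.get("local_capable") for a in critical.values()):
        errors.append("no critical lane ends on a local-capable path")

    return errors


if __name__ == "__main__":
    problems = validate_portfolios(AGENTS)
    print("ok" if not problems else "\n".join(problems))
```

A real check would presumably parse the YAML file instead of using the hard-coded dict, and could also compare its result with the hand-maintained `cross_checks.unique_primary_fallback1_pairs` list so the two cannot drift apart.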