# Per-Agent Fallback Portfolios and Task-Class Routing

Status: proposed doctrine for issue #155
Scope: policy and sidecar structure only; no runtime wiring in `tasks.py` or live loops yet

## Why this exists

Timmy already has multiple model paths declared in `config.yaml`, multiple task surfaces in `playbooks/`, and multiple live automation lanes documented in `docs/automation-inventory.md`.

What is missing is a declared resilience doctrine for how specific agents degrade when a provider, quota, or model family fails. Without that doctrine, the whole fleet tends to collapse onto the same fallback chain, which means one outage turns into synchronized fleet degradation.

This spec makes the fallback graph explicit before runtime wiring lands.

## Timmy ownership boundary

`timmy-config` owns:
- routing doctrine for Timmy-side task classes
- sidecar-readable fallback portfolio declarations
- capability floors and degraded-mode authority restrictions
- the mapping between current playbooks and future resilient agent lanes

`timmy-config` does not own:
- live queue state or issue truth outside Gitea
- launchd state, loop resurrection, or stale runtime reuse
- ad hoc worktree history or hidden queue mutation

That split matters. This repo should declare how routing is supposed to work. Runtime surfaces should consume that declaration instead of inventing their own fallback orderings.

## Non-goals

This issue does not:
- fully wire portfolio selection into `tasks.py`, launch agents, or live loops
- bless human-token or operator-token fallbacks as part of an automated chain
- allow degraded agents to keep full authority just because they are still producing output

## Role classes

### 1. Judgment

Use for work where the main risk is a bad decision, not a missing patch.

Current Timmy surfaces:
- `playbooks/issue-triager.yaml`
- `playbooks/pr-reviewer.yaml`
- `playbooks/verified-logic.yaml`

Typical task classes:
- issue triage
- queue routing
- PR review
- proof / consistency checks
- governance-sensitive review

Judgment lanes may read broadly, but they lose authority earlier than builder lanes when degraded.

### 2. Builder

Use for work where the main risk is producing or verifying a change.

Current Timmy surfaces:
- `playbooks/bug-fixer.yaml`
- `playbooks/test-writer.yaml`
- `playbooks/refactor-specialist.yaml`

Typical task classes:
- bug fixes
- test writing
- bounded refactors
- narrow docs or code repairs with verification

Builder lanes keep patch-producing usefulness longer than judgment lanes, but they must lose control-plane authority as they degrade.

### 3. Wolf / bulk

Use for repetitive, high-volume, bounded, reversible work.

Current Timmy world-state:
- bulk and sweep behavior is still represented more by live ops reality in `docs/automation-inventory.md` than by a dedicated sidecar playbook
- this class covers the work shape currently associated with queue hygiene, inventory refresh, docs sweeps, log summarization, and repetitive small-diff passes

Typical task classes:
- docs inventory refresh
- log summarization
- queue hygiene
- repetitive small diffs
- research or extraction sweeps

Wolf / bulk lanes are throughput-first and deliberately lower-authority.

## Routing policy

1. If the task touches a sensitive control surface, route to judgment first even if the edit is small.
2. If the task is primarily about merge authority, routing authority, proof, or governance, route to judgment.
3. If the task is primarily about producing a patch with local verification, route to builder.
4. If the task is repetitive, bounded, reversible, and low-authority, route to wolf / bulk.
5. If a wolf / bulk task expands beyond its size or authority envelope, promote it upward; do not let it keep grinding forward through scope creep.
6. If a builder task becomes architecture, multi-repo coordination, or control-plane review, promote it to judgment.
7. If a lane reaches terminal fallback, it must still land in a usable degraded mode. Dead silence is not an acceptable terminal state.

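Rules 1 through 6 can be read as a small decision function. The following is a minimal sketch, not the runtime selector: the `Task` fields, the `route` function, and the `"wolf-bulk"` label are all hypothetical names introduced for illustration. It also assumes that "promote it upward" in rule 5 means one step up to builder, and that unclassifiable work defaults to judgment; the doctrine above does not state either.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; every field name here is an assumption."""
    touches_sensitive_surface: bool = False
    about_authority_or_governance: bool = False  # merge/routing authority, proof, governance
    needs_architecture_or_multi_repo: bool = False
    produces_verified_patch: bool = False
    repetitive_bounded_reversible: bool = False
    exceeds_bulk_envelope: bool = False


def route(task: Task) -> str:
    """Return 'judgment', 'builder', or 'wolf-bulk' following rules 1-6."""
    # Rules 1 and 2: sensitive surfaces and authority questions go to judgment.
    if task.touches_sensitive_surface or task.about_authority_or_governance:
        return "judgment"
    # Rule 6: work that grows into architecture or multi-repo scope is promoted.
    if task.needs_architecture_or_multi_repo:
        return "judgment"
    # Rule 4: bulk work stays in wolf / bulk only while inside its envelope.
    if task.repetitive_bounded_reversible and not task.exceeds_bulk_envelope:
        return "wolf-bulk"
    # Rule 3, plus rule 5's promotion (assumed here to land on builder).
    if task.produces_verified_patch or task.exceeds_bulk_envelope:
        return "builder"
    # Assumption: anything unclassified goes to the most cautious class.
    return "judgment"
```

Checking the sensitive-surface and governance conditions first is what makes rule 1 win over rule 4 when a small, repetitive edit lands on a control-plane file.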
## Sensitive control surfaces

These paths stay judgment-routed unless explicitly reviewed otherwise:
- `SOUL.md`
- `config.yaml`
- `deploy.sh`
- `tasks.py`
- `playbooks/`
- `cron/`
- `memories/`
- `skins/`
- `training/`

This mirrors the current PR-review doctrine and keeps degraded builder or bulk lanes away from Timmy's control plane.

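A path check against this list could look like the sketch below. It assumes repo-relative POSIX paths and that the listed files live at the repository root; `touches_sensitive_surface` is a hypothetical helper, not an existing function in `tasks.py`.

```python
from pathlib import PurePosixPath

# Taken from the list above; root-level placement is an assumption.
SENSITIVE_FILES = {"SOUL.md", "config.yaml", "deploy.sh", "tasks.py"}
SENSITIVE_DIRS = {"playbooks", "cron", "memories", "skins", "training"}


def touches_sensitive_surface(changed_paths):
    """Return True if any repo-relative path falls on a sensitive control surface."""
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if not parts:
            continue  # empty path string, nothing to check
        # Anything at or under a sensitive directory counts.
        if parts[0] in SENSITIVE_DIRS:
            return True
        # Sensitive files are matched only at the repository root.
        if len(parts) == 1 and parts[0] in SENSITIVE_FILES:
            return True
    return False
```

For example, a change set containing `playbooks/bug-fixer.yaml` would be flagged, while one touching only `docs/automation-inventory.md` would not.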
## Portfolio design rules

The sidecar portfolio declaration in `fallback-portfolios.yaml` follows these rules:

1. Every critical agent gets four slots:
   - `primary`
   - `fallback1`
   - `fallback2`
   - `terminal` (the terminal fallback)
2. No two critical agents may share the same `primary + fallback1` pair.
3. Provider families should be anti-correlated across critical lanes whenever practical.
4. Terminal fallbacks must end in a usable degraded lane, not a null lane.
5. At least one critical lane must end on a local-capable path.
6. No human-token fallback patterns are allowed in automated chains.
7. Degraded mode reduces authority before it removes usefulness.
8. A terminal lane that cannot safely produce an artifact is not a valid terminal lane.

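Several of these rules are mechanically checkable. The sketch below covers rules 1, 2, 4 (only to the extent that a slot is non-empty), and 5. It assumes the portfolios have already been loaded into a plain dict keyed by agent name, and that local-capable paths carry a `local/` prefix; the actual schema of `fallback-portfolios.yaml` is not shown in this document, and the model identifiers are placeholders.

```python
SLOTS = ("primary", "fallback1", "fallback2", "terminal")


def validate(portfolios):
    """Return a list of rule violations; an empty list means the checks passed."""
    errors = []
    seen_pairs = {}
    for agent, slots in portfolios.items():
        # Rules 1 and 4: all four slots must be present and non-empty.
        missing = [name for name in SLOTS if not slots.get(name)]
        if missing:
            errors.append(f"{agent}: missing slots {missing}")
            continue
        # Rule 2: the primary + fallback1 pair must be unique across agents.
        pair = (slots["primary"], slots["fallback1"])
        if pair in seen_pairs:
            errors.append(f"{agent}: same primary+fallback1 as {seen_pairs[pair]}")
        else:
            seen_pairs[pair] = agent
    # Rule 5: at least one lane must end on a local-capable path.
    # Assumption: local paths are marked with a "local/" prefix.
    terminals = [(slots.get("terminal") or "") for slots in portfolios.values()]
    if not any(t.startswith("local/") for t in terminals):
        errors.append("no lane ends on a local-capable path")
    return errors


# Placeholder model identifiers, not real configuration.
EXAMPLE = {
    "triage-coordinator": {"primary": "provider-a/large", "fallback1": "provider-b/large",
                           "fallback2": "provider-c/medium", "terminal": "local/small"},
    "builder-main": {"primary": "provider-b/large", "fallback1": "provider-a/large",
                     "fallback2": "provider-c/medium", "terminal": "provider-c/small"},
}
```

Rules 3, 6, 7, and 8 need provider-family and authority metadata that this sketch does not model, so they are left to review.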
## Explicit ban: synchronized fleet degradation

Synchronized fleet degradation is forbidden.

That means:
- do not point every critical agent at the same fallback stack
- do not let all judgment agents converge on the same first backup if avoidable
- do not let all builder agents collapse onto the same weak terminal lane
- do not treat "everyone fell back to the cheapest thing" as resilience

A resilient fleet degrades unevenly on purpose. Some lanes should stay sharp while others become slower or narrower.

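One way to surface convergence before it becomes an outage is to report which slot assignments are shared between agents. This is an illustrative sketch only; `shared_slots` and `fully_synchronized` are hypothetical helpers, and the portfolio dict shape (agent name to slot-to-model mapping) is an assumption.

```python
from collections import defaultdict


def shared_slots(portfolios):
    """Map (slot, model) -> sorted agent names, keeping only entries used by 2+ agents."""
    users = defaultdict(list)
    for agent, slots in portfolios.items():
        for slot, model in slots.items():
            users[(slot, model)].append(agent)
    return {key: sorted(agents) for key, agents in users.items() if len(agents) > 1}


def fully_synchronized(portfolios):
    """True when more than one agent exists and all declare the identical stack."""
    stacks = {tuple(sorted(slots.items())) for slots in portfolios.values()}
    return len(portfolios) > 1 and len(stacks) == 1
```

A non-empty `shared_slots` result is not automatically a violation, since some overlap may be unavoidable, but a shared `fallback1` across all judgment lanes or a shared `terminal` across all builder lanes is the pattern this section forbids.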
## Capability floors and degraded authority

### Shared slot semantics

- `primary`: full role-class authority
- `fallback1`: full task authority for normal work, but no silent broadening of scope
- `fallback2`: bounded and reversible work only; no irreversible control-plane action
- `terminal`: usable degraded lane only; must produce a machine-usable artifact but must not impersonate full authority

### Judgment floors

Judgment agents lose authority earliest.

At `fallback2` and below, judgment lanes must not:
- merge PRs
- close or rewrite governing issues or PRs
- mutate sensitive control surfaces
- bulk-reassign the fleet
- silently change routing policy

Their degraded usefulness is still real:
- classify backlog
- produce draft routing plans
- summarize risk
- leave bounded labels or comments with explicit evidence

### Builder floors

Builder agents may continue doing useful narrow work deeper into degradation, but only inside a tighter box.

At `fallback2`, builder lanes must be limited to:
- single-issue work
- reversible patches
- narrow docs or test scaffolds
- bounded file counts and small diff sizes

At `terminal`, builder lanes must not:
- touch sensitive control surfaces
- merge or release
- do multi-repo or architecture work
- claim verification they did not run

Their terminal usefulness may still include:
- a small patch
- a reproducer test
- a docs fix
- a draft branch or artifact for later review

### Wolf / bulk floors

Wolf / bulk lanes stay useful as summarizers and sweepers, not as governors.

At `fallback2` and `terminal`, wolf / bulk lanes must not:
- fan out branch creation across repos
- mass-assign agents
- edit sensitive control surfaces
- perform irreversible queue mutation

Their degraded usefulness may still include:
- gathering evidence
- refreshing inventories
- summarizing logs
- proposing labels or routes
- producing repetitive, low-risk artifacts inside explicit caps

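The explicit "must not" lists for the three role classes can be expressed as a table keyed by the slot at which each prohibition starts. The sketch below is a hedged illustration: the action labels and the `explicitly_denied` helper are hypothetical, and it models only the deny lists. The builder `fallback2` allow-list and the shared slot semantics would need separate enforcement.

```python
SLOT_ORDER = ("primary", "fallback1", "fallback2", "terminal")

# role class -> slot where the prohibition begins -> denied action labels.
# The labels are hypothetical names for the bullets listed above.
DENIED_FROM = {
    "judgment": {
        "fallback2": {"merge_pr", "close_or_rewrite_governing_item",
                      "mutate_sensitive_surface", "bulk_reassign_fleet",
                      "change_routing_policy"},
    },
    "builder": {
        "terminal": {"mutate_sensitive_surface", "merge_or_release",
                     "multi_repo_or_architecture", "claim_unrun_verification"},
    },
    "wolf-bulk": {
        "fallback2": {"fan_out_branches", "mass_assign_agents",
                      "mutate_sensitive_surface", "irreversible_queue_mutation"},
    },
}


def explicitly_denied(role, slot, action):
    """True if `action` is on a deny list that is active at `slot` for `role`."""
    active = SLOT_ORDER.index(slot)
    for start_slot, actions in DENIED_FROM[role].items():
        # A prohibition applies at its starting slot and at every slot below it.
        if active >= SLOT_ORDER.index(start_slot) and action in actions:
            return True
    return False
```

A `False` result only means the action is not on an explicit deny list; it does not by itself grant authority.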
## Usable terminal lanes

A terminal fallback is only valid if it still does at least one of these safely:
- classify and summarize a backlog
- produce a bounded patch or test artifact
- summarize a diff with explicit uncertainty
- refresh an inventory or evidence bundle

If the terminal lane can only say "model unavailable" and stop, the portfolio is incomplete.

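As a minimal sketch of that validity test, assume each terminal lane declares a list of capability labels. The labels and `terminal_is_valid` are hypothetical; the document does not define how capabilities are declared.

```python
# Hypothetical labels for the four usable terminal behaviors listed above.
USABLE_TERMINAL_CAPABILITIES = frozenset({
    "classify_and_summarize_backlog",
    "bounded_patch_or_test_artifact",
    "summarize_diff_with_uncertainty",
    "refresh_inventory_or_evidence",
})


def terminal_is_valid(declared_capabilities):
    """A terminal lane is valid only if it keeps at least one usable capability."""
    return bool(USABLE_TERMINAL_CAPABILITIES & set(declared_capabilities))
```

A lane that declares only something like `report_model_unavailable` would fail this check, which matches the "portfolio is incomplete" condition.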
## Current sidecar reference lanes

`fallback-portfolios.yaml` defines the initial implementation-ready structure for four named lanes:
- `triage-coordinator`: judgment
- `pr-reviewer`: judgment
- `builder-main`: builder
- `wolf-sweeper`: wolf / bulk

These are the canonical resilience lanes for the current Timmy world-state.

Current playbooks should eventually map onto them like this:
- `playbooks/issue-triager.yaml` -> `triage-coordinator`
- `playbooks/pr-reviewer.yaml` -> `pr-reviewer`
- `playbooks/verified-logic.yaml` -> judgment lane family, pending a dedicated proof profile if needed
- `playbooks/bug-fixer.yaml`, `playbooks/test-writer.yaml`, and `playbooks/refactor-specialist.yaml` -> `builder-main`
- future sidecar bulk playbooks should inherit from `wolf-sweeper` instead of inventing independent fallback chains

Until runtime wiring lands, unmapped playbooks should be treated as policy-incomplete rather than inheriting an implicit fallback chain.

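The mapping above, together with the policy-incomplete rule, can be sketched as a lookup that refuses to guess. `PLAYBOOK_TO_LANE`, `PolicyIncomplete`, and `lane_for` are hypothetical names. `playbooks/verified-logic.yaml` is left out on purpose because the document assigns it only to a lane family, not to a named lane.

```python
PLAYBOOK_TO_LANE = {
    "playbooks/issue-triager.yaml": "triage-coordinator",
    "playbooks/pr-reviewer.yaml": "pr-reviewer",
    "playbooks/bug-fixer.yaml": "builder-main",
    "playbooks/test-writer.yaml": "builder-main",
    "playbooks/refactor-specialist.yaml": "builder-main",
}


class PolicyIncomplete(LookupError):
    """Raised when a playbook has no declared fallback lane."""


def lane_for(playbook):
    """Return the named lane for a playbook, or raise instead of guessing one."""
    try:
        return PLAYBOOK_TO_LANE[playbook]
    except KeyError:
        raise PolicyIncomplete(f"{playbook} has no declared fallback lane") from None
```

Raising here, rather than returning a default lane, keeps an unmapped playbook from silently inheriting someone else's fallback chain.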
## Wiring contract for later implementation

When this is wired into runtime selection, the selector should:
- classify the incoming task into a role class
- check whether the task touches a sensitive control surface
- choose the named agent lane for that class
- step through the declared portfolio slots in order
- enforce the capability floor of the active slot before taking action
- record when a fallback transition happened and what authority was still allowed

The important part is not just choosing a different model. It is choosing a different authority envelope as the lane degrades.

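The slot-stepping and recording parts of this contract might look like the sketch below. It is an assumption-laden illustration, not the planned `tasks.py` wiring: `run_with_fallback` is a hypothetical name, the caller-supplied `attempt(slot, model)` callback is assumed to enforce the slot's capability floor and to raise `RuntimeError` on provider failure, and the log is a plain list of dicts.

```python
SLOT_ORDER = ("primary", "fallback1", "fallback2", "terminal")


def run_with_fallback(portfolio, attempt, log):
    """Try each declared slot in order and return (slot, result) from the first success.

    `attempt(slot, model)` receives the active slot so it can apply that
    slot's authority envelope before acting.
    """
    for slot in SLOT_ORDER:
        model = portfolio[slot]
        try:
            result = attempt(slot, model)
        except RuntimeError as exc:
            # Record the transition before moving to the next slot.
            log.append({"event": "fallback", "from_slot": slot,
                        "model": model, "reason": str(exc)})
            continue
        # Record which slot served the task; the slot name doubles as the
        # authority level that was in force.
        log.append({"event": "served", "slot": slot, "model": model})
        return slot, result
    # Reaching this point means even the terminal lane produced nothing.
    raise RuntimeError("terminal slot failed; portfolio is incomplete")
```

Passing the slot into `attempt` is the design point: the callback gets a different authority envelope at each step, not just a different model.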