docs: define hub-and-spoke IPC doctrine (#157)

2026-04-04 17:20:20 -04:00
parent 6a71dfb5c7
commit fc6297efa5
2 changed files with 171 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -25,7 +25,9 @@ timmy-config/
 ├── skins/                     ← UI skins (timmy skin)
 ├── playbooks/                 ← Agent playbooks (YAML)
 ├── cron/                      ← Cron job definitions
-├── docs/automation-inventory.md ← Live automation + stale-state inventory
+├── docs/
+│   ├── automation-inventory.md ← Live automation + stale-state inventory
+│   └── ipc-hub-and-spoke-doctrine.md ← Coordinator-first, transport-agnostic fleet IPC doctrine
 └── training/                  ← Transitional training recipes, not canonical lived data
 ```

@@ -45,6 +47,8 @@ The scripts in `bin/` are sidecar-managed operational helpers for the Hermes lay
 Do NOT assume older prose about removed loops is still true at runtime.
 Audit the live machine first, then read `docs/automation-inventory.md` for the
 current reality and stale-state risks.
+For fleet routing semantics over sovereign transport, read
+`docs/ipc-hub-and-spoke-doctrine.md`.

 ## Orchestration: Huey

--- a/docs/ipc-hub-and-spoke-doctrine.md
+++ b/docs/ipc-hub-and-spoke-doctrine.md
@@ -0,0 +1,166 @@
+# IPC Doctrine: Hub-and-Spoke Semantics over Sovereign Transport
+
+Status: canonical doctrine for issue #157
+Parent: #154
+Related migration work:
+- [`../son-of-timmy.md`](../son-of-timmy.md) for Timmy's layered communications worldview
+- [`nostr_agent_research.md`](nostr_agent_research.md) for one sovereign transport candidate under evaluation
+
+## Why this exists
+
+Timmy is in an ongoing migration toward sovereign transport.
+The first question is not which bus wins. The first question is what semantics every bus must preserve.
+Those semantics matter more than any one transport.
+
+Telegram is not the target backbone for fleet IPC.
+It may exist as a temporary edge or operator convenience while migration is in flight, but the architecture we are building toward must stand on sovereign transport.
+
+This doctrine defines the routing and failure semantics that any transport adapter must honor, whether the carrier is Matrix, Nostr, NATS, or something we have not picked yet.
+
+## Roles
+
+- Coordinator: the only actor allowed to own routing authority for live agent work
+- Spoke: an executing agent that receives work, asks for clarification, and returns results
+- Durable execution truth: the visible task system of record, which remains authoritative for ownership and state transitions
+- Operator: the human principal who can direct the coordinator but is not a transport shim
+
+Timmy world-state stays the same while transport changes:
+- Gitea remains visible execution truth
+- live IPC accelerates coordination, but does not become a hidden source of authority
+- transport migration may change the wire, but not the rules
+
+## Core rules
+
+### 1. Coordinator-first routing
+
+Coordinator-first routing is the default system rule.
+
+- All new work enters through the coordinator
+- All reroutes, cancellations, escalations, and cross-agent handoffs go through the coordinator
+- A spoke receives assignments from the coordinator and reports back to the coordinator
+- A spoke does not mutate the routing graph on its own
+- If route intent is ambiguous, the system should fail closed and ask the coordinator instead of guessing a peer path
+
+The coordinator is the hub.
+Spokes are not free-roaming routers.
+
+### 2. Anti-cascade behavior
+
+The system must resist cascade failures and mesh chatter.
+
+- A spoke MUST NOT recursively fan out work to other spokes
+- A spoke MUST NOT create hidden side queues or recruit additional agents without coordinator approval
+- Broadcasts are coordinator-owned and should be rare, deliberate, and bounded
+- Retries must be bounded and idempotent
+- Transport adapters must not auto-bridge, auto-replay, or auto-forward in ways that amplify loops or duplicate storms
+
+A worker that encounters new sub-work should escalate back to the coordinator.
+It should not become a shadow dispatcher.
+
+### 3. Limited peer mesh
+
+Direct spoke-to-spoke communication is an exception, not the default.
+
+It is allowed only when the coordinator opens an explicit peer window.
+That peer window must define:
+- the allowed participants
+- the task or correlation ID
+- the narrow purpose
+- the expiry, timeout, or close condition
+- the expected artifact or summary that returns to the coordinator
+
+Peer windows are tightly scoped:
+- they are time-bounded
+- they are non-transitive
+- they do not grant standing routing authority
+- they close back to coordinator-first behavior when the declared purpose is complete
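The peer-window fields and scoping rules above can be sketched as a small data structure. This is a minimal illustration, not part of the doctrine: `PeerWindow`, `permits`, and every field name are hypothetical, and a real adapter would also need authenticated identities.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PeerWindow:
    """Hypothetical record of a coordinator-opened peer window."""
    participants: frozenset[str]  # the allowed participants
    correlation_id: str           # the task or correlation ID
    purpose: str                  # the narrow purpose
    expires_at: datetime          # the expiry or timeout
    expected_artifact: str        # what must return to the coordinator
    closed: bool = False          # set once the declared purpose is complete

    def permits(self, sender: str, recipient: str,
                correlation_id: str, now: datetime) -> bool:
        """Return True only for in-scope, unexpired traffic between declared participants."""
        if self.closed or now >= self.expires_at:
            return False  # time-bounded: closed or expired windows grant nothing
        if correlation_id != self.correlation_id:
            return False  # scoped to one task, no standing routing authority
        # Non-transitive: both ends must be named in the window itself.
        return sender != recipient and {sender, recipient} <= self.participants


# Example values (assumed agent names and task ID).
opened = datetime(2026, 4, 4, 12, 0, tzinfo=timezone.utc)
window = PeerWindow(
    participants=frozenset({"builder", "verifier"}),
    correlation_id="task-157",
    purpose="review clarification",
    expires_at=opened + timedelta(minutes=15),
    expected_artifact="review summary",
)
```

Anything `permits` rejects falls back to coordinator-first routing; the sketch deliberately has no way for a participant to extend the window to a third agent.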
+
+Good uses for a peer window:
+- artifact handoff between two already-assigned agents
+- verifier-to-builder clarification on a bounded review loop
+- short-lived data exchange where routing everything through the coordinator would be pure latency
+
+Bad uses for a peer window:
+- ad hoc planning rings
+- recursive delegation chains
+- quorum gossip
+- hidden ownership changes
+- free-form peer mesh as the normal operating mode
+
+### 4. Transport independence
+
+The doctrine is transport-agnostic on purpose.
+
+NATS, Matrix, Nostr, or a future bus is acceptable only if it preserves the same semantics.
+If a transport cannot preserve these semantics, it is not acceptable as the fleet backbone.
+
+A valid transport layer must carry or emulate:
+- authenticated sender identity
+- intended recipient or bounded scope
+- task or work identifier
+- correlation identifier
+- message type
+- timeout or TTL semantics
+- acknowledgement or explicit timeout behavior
+- idempotency or deduplication signals
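One way to read the list above is as a transport-neutral envelope that every adapter must be able to fill in. The sketch below is an assumption-laden illustration: `Envelope`, `parse_envelope`, and all field names are hypothetical, and sender authentication is assumed to happen in the adapter before this check runs.

```python
from dataclasses import dataclass

# Hypothetical field names, one per item in the list above.
REQUIRED_FIELDS = (
    "sender",           # authenticated sender identity
    "recipient",        # intended recipient or bounded scope
    "task_id",          # task or work identifier
    "correlation_id",   # correlation identifier
    "message_type",     # message type
    "ttl_seconds",      # timeout or TTL semantics
    "idempotency_key",  # idempotency or deduplication signal
)


@dataclass(frozen=True)
class Envelope:
    sender: str
    recipient: str
    task_id: str
    correlation_id: str
    message_type: str
    ttl_seconds: float
    idempotency_key: str
    ack_required: bool = True  # acknowledgement or explicit timeout behavior


def parse_envelope(raw: dict) -> Envelope:
    """Fail closed: reject any message the carrier could not fully describe."""
    missing = [name for name in REQUIRED_FIELDS if raw.get(name) in (None, "")]
    if missing:
        raise ValueError("envelope rejected, missing: " + ", ".join(missing))
    if raw["ttl_seconds"] <= 0:
        raise ValueError("envelope rejected, ttl_seconds must be positive")
    return Envelope(
        **{name: raw[name] for name in REQUIRED_FIELDS},
        ack_required=raw.get("ack_required", True),
    )
```

A carrier that cannot populate every field natively would have to emulate it in the adapter, which is the "carry or emulate" distinction above.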
+
+Transport choice does not change authority.
+Semantics matter more than any one transport.
+
+### 5. Circuit breakers
+
+Every acceptable IPC layer must support circuit-breaker behavior.
+
+At minimum, the system must be able to:
+- isolate a noisy or unhealthy spoke
+- stop new dispatches onto a failing route
+- disable direct peer windows and collapse back to strict hub-and-spoke mode
+- stop retrying after a bounded count or deadline
+- quarantine duplicate storms, fan-out anomalies, or missing coordinator acknowledgements instead of amplifying them
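The minimum breaker behavior above can be illustrated with a small coordinator-side counter. This is a sketch under stated assumptions: `RouteBreaker`, its threshold of three failures, and the policy of disabling all peer windows when any breaker trips are illustrative choices, not requirements taken from the doctrine.

```python
from dataclasses import dataclass, field


@dataclass
class RouteBreaker:
    """Hypothetical coordinator-side breaker keyed by spoke name."""
    max_failures: int = 3  # assumed bound; the doctrine only says "bounded"
    failures: dict[str, int] = field(default_factory=dict)
    isolated: set[str] = field(default_factory=set)
    peer_windows_enabled: bool = True

    def record_failure(self, spoke: str) -> None:
        """Count a failed dispatch or missing ack; trip once the bound is reached."""
        self.failures[spoke] = self.failures.get(spoke, 0) + 1
        if self.failures[spoke] >= self.max_failures:
            self.isolated.add(spoke)           # isolate the unhealthy spoke
            self.peer_windows_enabled = False  # collapse to strict hub-and-spoke

    def record_success(self, spoke: str) -> None:
        """A healthy round trip clears the failure count for that spoke."""
        self.failures.pop(spoke, None)

    def may_dispatch(self, spoke: str) -> bool:
        """Stop new dispatches (and retries) onto an isolated route."""
        return spoke not in self.isolated

    def reset(self, spoke: str) -> None:
        """Explicit coordinator decision to readmit a spoke."""
        self.isolated.discard(spoke)
        self.failures.pop(spoke, None)
        if not self.isolated:
            self.peer_windows_enabled = True
```

Readmission is an explicit `reset` call rather than a timer, so routing authority stays with the coordinator even while recovering.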
+
+When a breaker trips, the fallback is slower coordinator-mediated operation over durable machine-readable channels.
+It is not a return to hidden relays.
+It is not a reason to rebuild the fleet around Telegram.
+
+No human-token fallback patterns:
+- do not route agent IPC through personal chat identities
+- do not rely on operator copy-paste as a standing transport layer
+- do not treat human-owned bot tokens as the resilience plan
+
+## Required message classes
+
+Any transport mapping should preserve these message classes, even if the carrier names differ:
+
+- dispatch
+- ack or nack
+- status or progress
+- clarify or question
+- result
+- failure or escalation
+- control messages such as cancel, pause, resume, open-peer-window, and close-peer-window
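Because carrier names may differ, an adapter could keep an explicit table from its own names to these classes. The sketch below is illustrative only: `MessageClass`, `to_message_class`, and the subject strings in `EXAMPLE_CARRIER_NAMES` are hypothetical, and splitting ack and nack into separate members is a modeling assumption.

```python
from enum import Enum


class MessageClass(Enum):
    DISPATCH = "dispatch"
    ACK = "ack"
    NACK = "nack"
    STATUS = "status"    # status or progress
    CLARIFY = "clarify"  # clarify or question
    RESULT = "result"
    FAILURE = "failure"  # failure or escalation
    CONTROL = "control"  # cancel, pause, resume, open/close peer window


# Hypothetical carrier-specific names; a real adapter would define its own.
EXAMPLE_CARRIER_NAMES = {
    "fleet.task.assign": MessageClass.DISPATCH,
    "fleet.task.accepted": MessageClass.ACK,
    "fleet.task.rejected": MessageClass.NACK,
    "fleet.task.progress": MessageClass.STATUS,
    "fleet.task.question": MessageClass.CLARIFY,
    "fleet.task.done": MessageClass.RESULT,
    "fleet.task.escalate": MessageClass.FAILURE,
    "fleet.control": MessageClass.CONTROL,
}


def to_message_class(carrier_name: str, table: dict[str, MessageClass]) -> MessageClass:
    """Map a carrier-level name to a doctrine class, refusing unknown names."""
    try:
        return table[carrier_name]
    except KeyError:
        raise ValueError(f"unmapped carrier message name: {carrier_name!r}") from None
```

Refusing unmapped names keeps a transport from quietly introducing a message class the coordinator does not know about.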
+
+## Failure semantics
+
+When things break, authority should degrade safely.
+
+- If a spoke loses contact with the coordinator, it may finish local work that is already in progress and safe to complete, then persist a checkpoint, but it must not appoint itself as a router
+- If a spoke receives an unscoped peer message, it should ignore or quarantine it and report the event to the coordinator when possible
+- If delivery is duplicated or reordered, recipients should prefer correlation IDs and idempotency keys over guesswork
+- If the live transport is degraded, the system may fall back to slower durable coordination paths, but routing authority remains coordinator-first
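The duplicate-delivery rule above can be sketched as a receiver that keys on the correlation ID and idempotency key instead of inspecting payloads. This is a minimal in-memory illustration: `IdempotentInbox` is a hypothetical name, and a real recipient would bound or expire the seen set (for example by message TTL) and persist it across restarts.

```python
from typing import Any, Callable


class IdempotentInbox:
    """Hypothetical recipient-side filter for duplicated deliveries."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        self._seen: set[tuple[str, str]] = set()

    def deliver(self, correlation_id: str, idempotency_key: str, payload: Any) -> bool:
        """Handle a message at most once; return False for a duplicate."""
        key = (correlation_id, idempotency_key)
        if key in self._seen:
            return False  # duplicate: drop it, do not re-run or re-forward
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so a failed
        # attempt can still be retried under the same key.
        self._seen.add(key)
        return True
```

Recording the key only after the handler returns is a design choice: it trades a possible repeat of partially completed work for never losing a message whose handling raised.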
+
+## World-state alignment
+
+This doctrine sits above transport selection.
+It does not try to settle every Matrix-vs-Nostr-vs-NATS debate inside one file.
+It constrains those choices.
+
+Current Timmy alignment:
+- sovereign transport migration is ongoing
+- Telegram is not the backbone we are building toward
+- Matrix remains relevant for human-to-fleet interaction
+- Nostr remains relevant as a sovereign option under evaluation
+- NATS remains relevant as a strong internal bus candidate
+- the semantics stay constant across all of them
+
+If we swap the wire and keep the semantics, the fleet stays coherent.
+If we keep the wire and lose the semantics, the fleet regresses into chatter, hidden routing, and cascade failure.