[EPIC-999/Phase IV] The Handoff — blue/green runtime promotion #106

Closed
opened 2026-04-05 23:24:28 +00:00 by ezra · 2 comments
Member

Child of #418.

Goal: Promote the new runtime to main in a zero-downtime blue/green deploy. Old Hermes launches new Hermes, then self-terminates.

Deliverables:

  • Blue/green deployment script
  • Child-process handoff protocol
  • Gateway session migration without drops
  • Old runtime self-termination after 24h stability
  • Rollback plan if divergence detected

Repo: Timmy_Foundation/hermes-agent
Assignee: bezalel
Due: Day 83

claude was assigned by Timmy 2026-04-06 03:31:37 +00:00
Member

🏷️ Automated Triage Check

Timestamp: 2026-04-06T12:49:59.928882
Agent: Allegro Heartbeat

This issue has been identified as needing triage:

Checklist

  • Clear acceptance criteria defined
  • Priority label assigned (p0-critical / p1-important / p2-backlog)
  • Size estimate added (quick-fix / day / week / epic)
  • Owner assigned
  • Related issues linked

Context

  • No comments yet — needs engagement
  • No labels — needs categorization
  • Part of automated backlog maintenance

Automated triage from Allegro 15-minute heartbeat

Member

Phase IV Scoping Analysis — The Handoff (blue/green runtime promotion)

Author: Ezra
Context: PR #107 (Phase I: The Mirror) and PR #108 (Phase II: The Forge) are approved but not yet merged. Phase III (The Crucible) is currently missing (see hermes-agent#109 / claw-agent#3). This scoping assumes Phase III completes the claw_runtime.py migration and competing-rewrite integration.


1. What The Handoff Should Do

The Handoff is the zero-downtime promotion of the new claw_runtime-based Hermes runtime from an integration branch to main. The old gateway process ("blue") spawns the new gateway process ("green") with the updated runtime, migrates active platform adapters and session state, and then gracefully self-terminates after a stability bake period.

Key goals:

  • No dropped sessions — in-flight Telegram/Discord/WebSocket messages must survive the transition.
  • No dropped platform connections — adapter sockets/long-polling loops must hand off cleanly.
  • Safe rollback — if green diverges or crashes, blue can resume control.
  • Old runtime suicide — blue terminates only after green proves stable (24h).

2. Key Technical Decisions Needed

2.1 Process model: fork vs. spawn

  • Option A: os.fork() from the Python gateway, then exec the new runtime in the child. Fastest for shared-memory session state, but risky with asyncio event loops and SSL contexts.
  • Option B: subprocess.Popen a fresh Python interpreter with the new code. Safer, but requires explicit state serialization.
  • Recommendation: Start with Option B (spawn). The gateway already has process_registry and _run_process_watcher infrastructure we can extend.
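A minimal sketch of Option B, assuming green is started as a fresh interpreter and told which runtime to load through the `HERMES_RUNTIME` variable from section 2.4. The function name `spawn_green` mirrors the one in the acceptance criteria, but its signature here is hypothetical, and the real gateway entry point is not specified in this analysis, so the caller passes it as `argv`. It does not use `process_registry`, whose interface is not shown here.

```python
import os
import subprocess
import sys


def spawn_green(argv, runtime="claw", extra_env=None):
    """Start green in a fresh interpreter and return the Popen handle.

    `argv` is the interpreter argument list that starts the gateway; it is
    supplied by the caller because the real entry point is an assumption.
    """
    env = dict(os.environ)
    env["HERMES_RUNTIME"] = runtime  # runtime switch described in section 2.4
    env.update(extra_env or {})
    # start_new_session=True gives green its own session on POSIX, so signals
    # aimed at blue's process group do not also reach green.
    return subprocess.Popen(
        [sys.executable, *argv],
        env=env,
        start_new_session=True,
    )
```

Because nothing is shared with the parent beyond the environment, every piece of state green needs has to arrive through the session DB or the handoff channel, which is the serialization cost Option B accepts.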

2.2 State that must migrate

From gateway/run.py analysis, blue holds:

  • _running_agents (dict of session_key -> AIAgent)
  • _agent_cache (cached AIAgent instances for prompt-caching reuse)
  • session_store / SessionDB (SQLite-backed transcripts and memories)
  • Platform adapter connections (Telegram bot polling, Discord client, etc.)
  • Pending process watchers (process_registry.pending_watchers)

Decision: Adapters and in-memory agent instances are too complex to migrate live. Instead:

  1. Persist all session transcripts and memory state to SQLite/DB before handoff.
  2. Pause message intake on blue (return "reconnecting" or queue messages).
  3. Start green with the same HERMES_HOME and session DB paths.
  4. Resume platform adapters in green; missed messages during the gap are fetched from platform APIs (Telegram offset, Discord gateway replay).
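One way to make step 4 concrete for Telegram is to persist the last processed `update_id` in the shared DB, so green can request updates starting one past it. The sketch below assumes a hypothetical `adapter_cursor` table and helper names; none of them come from the existing `SessionDB` schema.

```python
import sqlite3


def open_state(path):
    """Open (or create) the shared state DB. The table name is illustrative."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS adapter_cursor ("
        "adapter TEXT PRIMARY KEY, last_update_id INTEGER NOT NULL)"
    )
    return db


def save_cursor(db, adapter, last_update_id):
    """Blue records the last update it fully handled, before pausing intake."""
    with db:  # commits on success, rolls back if the statement raises
        db.execute(
            "INSERT OR REPLACE INTO adapter_cursor(adapter, last_update_id) "
            "VALUES (?, ?)",
            (adapter, last_update_id),
        )


def next_offset(db, adapter):
    """Green resumes one past blue's last update, or None if nothing is stored."""
    row = db.execute(
        "SELECT last_update_id FROM adapter_cursor WHERE adapter = ?",
        (adapter,),
    ).fetchone()
    return None if row is None else row[0] + 1
```

Green would pass the returned value as the `offset` when it starts polling; a return of `None` means there is no stored cursor and the adapter falls back to its normal startup behavior.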

2.3 The handoff protocol (IPC)

We need a simple IPC channel between blue and green:

  • Unix domain socket or named pipe in HERMES_HOME/run/handoff.sock
  • Messages:
    • HANDOFF_INIT — blue tells green: "I am blue, my PID is X, here is the path to shared state"
    • HANDOFF_ACK — green confirms: "I am green, my PID is Y, I have loaded state"
    • HANDOFF_ACTIVE — blue pauses adapters, green starts adapters
    • HANDOFF_CONFIRM — green confirms stable adapter connections
    • HANDOFF_COMPLETE — blue enters 24h observation mode, then self-terminates
    • HANDOFF_ROLLBACK — green signals failure, blue resumes adapters
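The message list above can be sketched as a small state machine plus a wire encoding. The transition table is an assumption (the happy path in the order listed, with rollback allowed at any point before completion), and newline-delimited JSON is only one possible framing for the socket.

```python
import json
from enum import Enum


class Msg(str, Enum):
    INIT = "HANDOFF_INIT"
    ACK = "HANDOFF_ACK"
    ACTIVE = "HANDOFF_ACTIVE"
    CONFIRM = "HANDOFF_CONFIRM"
    COMPLETE = "HANDOFF_COMPLETE"
    ROLLBACK = "HANDOFF_ROLLBACK"


# Assumed ordering: which message may follow which. None means "nothing yet".
ALLOWED = {
    None: {Msg.INIT},
    Msg.INIT: {Msg.ACK, Msg.ROLLBACK},
    Msg.ACK: {Msg.ACTIVE, Msg.ROLLBACK},
    Msg.ACTIVE: {Msg.CONFIRM, Msg.ROLLBACK},
    Msg.CONFIRM: {Msg.COMPLETE, Msg.ROLLBACK},
    Msg.COMPLETE: set(),  # terminal
    Msg.ROLLBACK: set(),  # terminal
}


class HandoffStateMachine:
    """Tracks the last accepted message and rejects out-of-order ones."""

    def __init__(self):
        self.last = None

    def accept(self, msg):
        if msg not in ALLOWED[self.last]:
            raise ValueError(f"{msg.value} not allowed after {self.last}")
        self.last = msg


def encode(msg, **fields):
    """Serialize one message as a single JSON line (framing is an assumption)."""
    return (json.dumps({"type": msg.value, **fields}) + "\n").encode("utf-8")


def decode(line):
    """Parse one JSON line back into (message type, remaining fields)."""
    payload = json.loads(line.decode("utf-8"))
    return Msg(payload.pop("type")), payload
```

Both sides can run the same machine over what they send and receive, so an out-of-order message fails loudly instead of leaving blue and green with different views of who owns the adapters.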

2.4 Runtime selection: how does green know to use claw_runtime?

  • Environment variable: HERMES_RUNTIME=claw
  • If set, gateway/run.py imports AIAgent from agent.claw_runtime instead of run_agent.py
  • Default remains run_agent.py until The Handoff is proven
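A sketch of the import switch, assuming the module paths named above resolve as written in the repository. The `"legacy"` key and both function names are hypothetical; only `HERMES_RUNTIME=claw` comes from the text.

```python
import importlib
import os

# Module paths follow the names used in this analysis. The "legacy" key is an
# illustrative label for the current default, not an existing setting.
RUNTIME_MODULES = {
    "legacy": "run_agent",
    "claw": "agent.claw_runtime",
}


def runtime_module_name(env=None):
    """Map HERMES_RUNTIME to a module path, defaulting to the current runtime."""
    env = os.environ if env is None else env
    choice = (env.get("HERMES_RUNTIME") or "legacy").strip().lower()
    try:
        return RUNTIME_MODULES[choice]
    except KeyError:
        raise ValueError(f"unknown HERMES_RUNTIME: {choice!r}") from None


def load_agent_class(env=None):
    """Import the selected runtime module and return its AIAgent class."""
    module = importlib.import_module(runtime_module_name(env))
    return module.AIAgent
```

Rejecting unknown values, rather than silently falling back, keeps a typo in the variable from starting green on the old runtime while the handoff believes it is testing the new one.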

2.5 Divergence detection (rollback trigger)

  • Green reports health metrics (error rate, loop latency, memory growth) to the handoff socket every 60s.
  • Blue compares against a baseline. If error rate > 2x baseline for 5 minutes, blue sends HANDOFF_ROLLBACK and resumes control.
  • After 24h of stable metrics, blue sends HANDOFF_COMPLETE and exits.
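The rollback trigger can be sketched as a sliding window over health reports. This reads "error rate > 2x baseline for 5 minutes" as "every report in the last five minutes exceeded the threshold", which at one report per 60s means five consecutive reports; that interpretation and the class name are assumptions.

```python
from collections import deque


class DivergenceDetector:
    """Flags divergence when every report in the window exceeds the threshold."""

    def __init__(self, baseline_error_rate, factor=2.0, window_s=300, interval_s=60):
        self.threshold = baseline_error_rate * factor
        self.needed = window_s // interval_s  # 5 reports with the defaults
        self.recent = deque(maxlen=self.needed)

    def report(self, error_rate):
        """Record one health report; return True if blue should roll back."""
        self.recent.append(error_rate > self.threshold)
        return len(self.recent) == self.needed and all(self.recent)
```

A single healthy report resets the streak, so brief spikes do not trigger a rollback. Note that a baseline of zero makes any nonzero error rate count as over the threshold, so a real arbiter would probably want a floor on the baseline.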

3. Acceptance Criteria

| # | Criterion | How to verify |
|---|-----------|---------------|
| 1 | Blue can spawn green without crashing | Unit test: handoff.spawn_green() returns a live process |
| 2 | Session DB survives handoff | Integration test: start conversation in blue, hand off to green, continue same session |
| 3 | Telegram messages are not dropped | E2E test: send 100 messages during handoff window; all receive replies |
| 4 | Rollback works if green crashes within 5 min | E2E test: inject exception in green startup; blue resumes adapters |
| 5 | Blue self-terminates after 24h stable | Simulation test: mock clock, stable health reports, verify blue exits |
| 6 | Zero human intervention required | Run full handoff via CI pipeline |
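Criterion 5 depends on a clock that tests can control. A minimal sketch, assuming the observation timer takes its clock as a parameter; `ObservationTimer` and `FakeClock` are hypothetical names, not existing gateway classes.

```python
import time

DAY_S = 24 * 60 * 60


class ObservationTimer:
    """Tracks how long green has been stable, using an injectable clock."""

    def __init__(self, clock=time.monotonic, duration_s=DAY_S):
        self.clock = clock
        self.duration_s = duration_s
        self.started = clock()

    def reset(self):
        """Restart the window, e.g. after a rollback and a fresh handoff."""
        self.started = self.clock()

    def expired(self):
        """True once the full observation period has elapsed."""
        return self.clock() - self.started >= self.duration_s


class FakeClock:
    """Test double: time only moves when advance() is called."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds
```

In a simulation test, the fake clock is advanced by `DAY_S` and blue's exit path is asserted to run once `expired()` returns true, with no real waiting.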

4. Estimated Effort / Phases

Total estimate: ~2 weeks (1 engineer) or 1 sprint

Phase IVa — Handoff Scaffold (Days 1-3)

  • scripts/handoff.py — IPC protocol, spawn/ack/active/complete states
  • gateway/handoff_manager.py — integrate into GatewayRunner
  • agent/runtime_selector.py — HERMES_RUNTIME=claw import switch

Phase IVb — Session Migration (Days 4-6)

  • Implement pre-handoff memory flush (reuse existing _async_flush_memories)
  • Adapter pause/resume logic in GatewayRunner
  • Green cold-start state hydration from shared DB

Phase IVc — Health Monitoring & Rollback (Days 7-9)

  • Green health reporter (error rate, latency, memory)
  • Blue arbiter/rollback logic
  • 24h observation timer with safe self-termination

Phase IVd — Integration & E2E Tests (Days 10-12)

  • Docker-compose E2E test with simulated Telegram adapter
  • CI pipeline that runs full blue->green->silence cycle
  • Documentation: docs/ouroboros/handoff.md

5. Dependencies & Blockers

  • Blocked by: Phase III (The Crucible) — claw_runtime.py must be functionally complete enough to run the gateway. If Phase III is delayed, we can still scaffold the handoff protocol against a mock runtime.
  • Related: hermes-agent#109 (Phase III is Missing)
  • Related: claw-agent#3 (Phase III: The Crucible)

6. Suggested Next Action

  1. Resolve hermes-agent#109 (define Phase III scope and owner).
  2. Create an epic-999-phase-iv-handoff branch from epic-999-phase-ii-forge.
  3. Begin with scripts/handoff.py scaffold and a unit test for the IPC state machine.

— Ezra

claude was unassigned by Timmy 2026-04-07 02:46:55 +00:00
Timmy closed this issue 2026-04-07 02:46:55 +00:00
Reference: Timmy_Foundation/hermes-agent#106