[Resilience] Agent Crash Recovery: Persist Current Context Before Fatal Escapes #1411

Closed
opened 2026-03-24 13:04:24 +00:00 by Timmy · 1 comment
Owner

Context: When a loop process is killed or receives SIGTERM (e.g. from ops-kill), its in-memory task context is lost.

Acceptance Criteria:

  • Trap exit signals within loop scripts.
  • Dump active TASK_ID and memory references to an active_state.json.
  • Reconstitute this state dynamically upon loop restart.
Author
Owner

KIMI IMPLEMENTATION INSTRUCTIONS - Agent Crash Recovery

Implementation Plan

Phase 1: Signal Trapping Infrastructure (Priority: HIGH)

Files to create/modify:

  • scripts/signal_handler.sh - Centralized signal trap utilities
  • scripts/claude-loop.sh - Add trap handlers for SIGTERM/SIGINT
  • scripts/kimi-loop.sh - Add trap handlers for SIGTERM/SIGINT
  • src/agent_state/ - New package for state persistence

Phase 2: Core Implementation

Signal Trap Pattern:

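STATE_FILE="/tmp/agent_state_$(whoami)_$$.json"

# Write the current context to STATE_FILE. jq builds the JSON so values are
# escaped correctly; temp file + mv keeps a half-written file from being read.
save_state() {
    echo "Agent interrupted - saving state..." >&2
    jq -n \
        --arg task_id "${CURRENT_TASK_ID:-null}" \
        --arg issue_number "${ISSUE_NUMBER:-null}" \
        --arg workspace_path "${WORKSPACE_PATH:-null}" \
        --arg branch_name "${BRANCH_NAME:-null}" \
        --arg progress_stage "${PROGRESS_STAGE:-null}" \
        --arg timestamp "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        '{task_id: $task_id, issue_number: $issue_number, workspace_path: $workspace_path,
          branch_name: $branch_name, progress_stage: $progress_stage, timestamp: $timestamp}' \
        > "$STATE_FILE.tmp" && mv "$STATE_FILE.tmp" "$STATE_FILE"
}

# Runs once on every exit path; saves only when the exit was abnormal.
on_exit() {
    local status=$?
    if [ "$status" -ne 0 ]; then
        save_state
    fi
}

# Signals become ordinary exits with the conventional 128+signal status, so the
# EXIT trap saves exactly once and the exit code is not masked as 0.
trap 'exit 143' TERM
trap 'exit 130' INT
trap on_exit EXIT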
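
The trap can only save what CURRENT_TASK_ID, PROGRESS_STAGE and the other variables hold in the main shell at the moment of interruption. A minimal sketch of one pitfall, assuming the loop updates PROGRESS_STAGE as it works: an assignment made inside a subshell never reaches the shell that owns the trap. The set_stage helper and the stage names are hypothetical and not part of the issue.

```bash
#!/usr/bin/env bash
# Hypothetical illustration: stage names and set_stage are not from the issue.
PROGRESS_STAGE="idle"

# Pitfall: parentheses start a subshell, so this assignment is discarded
# as soon as the subshell exits. A trap in the parent would still see "idle".
( PROGRESS_STAGE="cloning" )
STAGE_AFTER_SUBSHELL=$PROGRESS_STAGE

# Updating through a function runs in the current shell, so the new value
# is visible to any trap handler installed in this shell.
set_stage() {
    PROGRESS_STAGE=$1
}
set_stage "cloning"
STAGE_AFTER_CALL=$PROGRESS_STAGE

echo "after subshell: $STAGE_AFTER_SUBSHELL"
echo "after function call: $STAGE_AFTER_CALL"
```

The same applies to assignments made on the right-hand side of a pipeline, which bash also runs in a subshell by default.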

Phase 3: Recovery Logic

Add to loop start:

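recover_agent_state() {
    local state_file
    for state_file in "/tmp/agent_state_$(whoami)_"*.json; do
        # An unmatched glob stays literal, so confirm the file exists.
        [ -f "$state_file" ] || continue
        # Skip files that are invalid JSON or carry no real task_id,
        # rather than deleting them unread.
        if ! RECOVERED_TASK_ID=$(jq -er '.task_id | select(. != "null")' "$state_file"); then
            echo "Skipping state file without a usable task_id: $state_file" >&2
            continue
        fi
        echo "Recovering task: $RECOVERED_TASK_ID"
        # Continue from saved context
        rm -- "$state_file"
        return 0
    done
    return 1
}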

Testing: Send SIGTERM to active loops, verify state persistence and recovery on restart.
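
The SIGTERM check described above could be scripted roughly as follows. This is a sketch under assumptions: the stand-in loop, its file names, and the task ID "task-123" are hypothetical, and a real test would start scripts/claude-loop.sh or scripts/kimi-loop.sh instead. The stand-in writes its JSON with printf only to keep the sketch free of dependencies.

```bash
#!/usr/bin/env bash
# Sketch of a SIGTERM persistence test using a hypothetical stand-in loop.
set -u

TEST_DIR=$(mktemp -d)
STATE_FILE="$TEST_DIR/active_state.json"
READY_FILE="$TEST_DIR/ready"

# Minimal stand-in loop: installs the traps, then idles until signalled.
cat > "$TEST_DIR/fake_loop.sh" <<'EOF'
#!/usr/bin/env bash
STATE_FILE=$1
READY_FILE=$2
CURRENT_TASK_ID="task-123"
save_state() {
    printf '{"task_id": "%s"}\n' "$CURRENT_TASK_ID" > "$STATE_FILE"
}
on_exit() {
    local status=$?
    if [ "$status" -ne 0 ]; then
        save_state
    fi
}
trap 'exit 143' TERM
trap on_exit EXIT
: > "$READY_FILE"   # tell the test that the traps are installed
# Short sleeps: bash runs a trap only after the current command returns.
while true; do sleep 0.1; done
EOF

bash "$TEST_DIR/fake_loop.sh" "$STATE_FILE" "$READY_FILE" &
LOOP_PID=$!

# Wait (up to about 5 seconds) until the loop reports it is ready.
for _ in $(seq 1 50); do
    [ -e "$READY_FILE" ] && break
    sleep 0.1
done

kill -TERM "$LOOP_PID"
LOOP_STATUS=0
wait "$LOOP_PID" || LOOP_STATUS=$?

SAVED_STATE=$(cat "$STATE_FILE" 2>/dev/null || true)
echo "exit status: $LOOP_STATUS"
echo "saved state: $SAVED_STATE"
rm -rf "$TEST_DIR"
```

Waiting on the ready file avoids a race in which the signal arrives before the traps exist, which would kill the loop without saving anything.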

This prevents work loss when a loop is interrupted by a trappable signal such as SIGTERM or SIGINT. SIGKILL cannot be trapped, so a kill -9 still loses the current context.

kimi was assigned by Timmy 2026-03-24 15:12:57 +00:00
kimi was unassigned by Timmy 2026-03-24 19:32:21 +00:00
Timmy closed this issue 2026-03-24 21:54:11 +00:00

Reference: Rockachopa/Timmy-time-dashboard#1411