[Observability] loop_guard.py: Detect Deadlocks and Kill Zombie Python Processes Faster #1425

Closed
opened 2026-03-24 13:04:42 +00:00 by Timmy · 1 comment
Owner

Context: Zombie loop tests require manual ops-kill-zombies intervention.

Acceptance Criteria:

  • Optimize loop_guard to run consistently in a lightweight daemon.
  • Detect specific stuck pytest hooks surpassing 15 minutes of inactivity and execute SIGKILL directly.
Author
Owner

Implementation Plan for Deadlock Detection and Zombie Process Killer

Scope: Enhance loop_guard.py to detect and kill zombie Python processes that exceed inactivity thresholds.

Current State Analysis

  • scripts/loop_guard.py is 325 lines, handles idle detection with exponential backoff
  • Currently focuses on queue state, not process health
  • Manual ops-kill-zombies intervention required for stuck processes

Implementation Steps

  1. Add Process Monitoring Module

    • Create detect_zombie_processes() function
    • Use psutil to find Python processes with:
      • pytest in command line
      • CPU usage < 1% for > 15 minutes
      • Memory not growing for > 15 minutes
  2. Daemon Mode Enhancement

    • Add --daemon flag for continuous monitoring
    • Run lightweight checks every 60 seconds
    • Log process monitoring activity
  3. Process Termination Logic

    • Escalate: SIGTERM → wait 30s → SIGKILL
    • Log all termination actions with PID, command, and duration
    • Avoid killing active processes (check recent file I/O)
  4. Configuration

    • Add timeout thresholds to config (default 15 min)
    • Whitelist processes that should never be killed
    • Enable/disable zombie detection via flag
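The detection criteria in step 1 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual loop_guard.py code: `Sample`, `IdleTracker`, `sample_with_psutil`, `cpu_floor`, and `protected` are hypothetical names, and "CPU usage < 1%" is interpreted here as CPU seconds consumed divided by wall-clock seconds between two checks. The tracker itself is pure Python so it can be exercised without psutil; psutil is imported only inside the sampler.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence


@dataclass(frozen=True)
class Sample:
    """One observation of a process (hypothetical helper type)."""
    pid: int
    cmdline: Sequence[str]
    cpu_seconds: float  # cumulative user + system CPU time
    rss_bytes: int      # resident memory


@dataclass
class _Seen:
    cpu_seconds: float
    rss_bytes: int
    sampled_at: float
    last_active: float


class IdleTracker:
    """Tracks when each pytest process last showed CPU or memory activity."""

    def __init__(self, threshold_s: float = 15 * 60, cpu_floor: float = 0.01,
                 protected: Sequence[str] = ()):
        self.threshold_s = threshold_s    # default 15 min, per the plan
        self.cpu_floor = cpu_floor        # below 1% CPU counts as idle
        self.protected = tuple(protected)  # whitelist of command-line substrings
        self._seen: Dict[int, _Seen] = {}

    def update(self, samples: Iterable[Sample], now: float) -> List[int]:
        """Record one round of samples; return PIDs idle longer than the threshold."""
        stuck: List[int] = []
        live = set()
        for s in samples:
            cmd = " ".join(s.cmdline)
            if "pytest" not in cmd or any(w in cmd for w in self.protected):
                continue
            live.add(s.pid)
            prev = self._seen.get(s.pid)
            if prev is None:
                # First sighting: treat as active now. PID reuse is ignored here.
                self._seen[s.pid] = _Seen(s.cpu_seconds, s.rss_bytes, now, now)
                continue
            elapsed = max(now - prev.sampled_at, 1e-9)
            cpu_fraction = (s.cpu_seconds - prev.cpu_seconds) / elapsed
            active = cpu_fraction >= self.cpu_floor or s.rss_bytes > prev.rss_bytes
            last_active = now if active else prev.last_active
            self._seen[s.pid] = _Seen(s.cpu_seconds, s.rss_bytes, now, last_active)
            if now - last_active > self.threshold_s:
                stuck.append(s.pid)
        # Forget processes that have exited since the previous round.
        for pid in list(self._seen):
            if pid not in live:
                del self._seen[pid]
        return stuck


def sample_with_psutil() -> List[Sample]:
    """Collect samples for all visible processes using psutil."""
    import psutil  # third-party; imported lazily so the tracker has no hard dependency

    out: List[Sample] = []
    fields = ["pid", "cmdline", "cpu_times", "memory_info"]
    for proc in psutil.process_iter(attrs=fields):
        info = proc.info
        # Fields are None when access is denied; skip those processes.
        if not info["cmdline"] or info["cpu_times"] is None or info["memory_info"] is None:
            continue
        cpu = info["cpu_times"].user + info["cpu_times"].system
        out.append(Sample(info["pid"], info["cmdline"], cpu, info["memory_info"].rss))
    return out
```

In a daemon loop, each 60-second check would call something like `tracker.update(sample_with_psutil(), time.monotonic())` and pass the returned PIDs to the termination logic. The "recent file I/O" check from step 3 is not covered by this sketch.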

Files to Modify

  • scripts/loop_guard.py (primary - add process monitoring)
  • Add dependency: psutil to requirements

Testing

  • Create test for zombie process detection
  • Mock psutil for process simulation
  • Verify SIGTERM/SIGKILL escalation
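One way to make the SIGTERM/SIGKILL escalation easy to verify is to inject the signal sender, liveness check, sleep, and clock, so a test can substitute fakes and never touch a real process. The sketch below assumes a POSIX host; `terminate_with_escalation`, `pid_alive`, and all parameter names are hypothetical and not taken from loop_guard.py.

```python
import logging
import os
import signal
import time
from typing import Callable

log = logging.getLogger("loop_guard")


def pid_alive(pid: int) -> bool:
    """Best-effort liveness probe using signal 0 (POSIX)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def terminate_with_escalation(
    pid: int,
    command: str,
    idle_s: float,
    grace_s: float = 30.0,  # wait between SIGTERM and SIGKILL, per the plan
    poll_s: float = 1.0,
    send: Callable[[int, int], None] = os.kill,
    alive: Callable[[int], bool] = pid_alive,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Send SIGTERM, wait up to grace_s, then SIGKILL. Returns what ended the process."""
    log.warning("SIGTERM pid=%d idle=%.0fs cmd=%s", pid, idle_s, command)
    try:
        send(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "gone"  # exited before we could signal it
    deadline = clock() + grace_s
    while clock() < deadline:
        if not alive(pid):
            return "SIGTERM"
        sleep(poll_s)
    if not alive(pid):
        return "SIGTERM"
    log.error("SIGKILL pid=%d idle=%.0fs cmd=%s", pid, idle_s, command)
    try:
        send(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "SIGTERM"  # exited between the last check and the kill
    return "SIGKILL"
```

A test can then pass a list-appending `send`, an `alive` that always returns `True`, and a fake clock advanced by the fake `sleep`, and assert that exactly SIGTERM followed by SIGKILL was sent. Note that a process which has exited but has not been reaped by its parent still answers signal 0, so `pid_alive` can report it as alive.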

Acceptance Criteria

  • loop_guard.py --daemon runs continuously
  • Detects pytest processes stuck > 15 min
  • Executes SIGTERM → SIGKILL escalation
  • Logs all process termination actions
  • All existing tests still pass
  • No false positives (killing active processes)
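As a rough illustration of the first criterion, the daemon mode and configuration flags might be wired up along these lines. Only `--daemon`, the 60-second interval, and the 15-minute default come from the plan above; the other flag names (`--interval`, `--idle-timeout`, `--no-zombie-detection`, `--protect`) and the `run` helper are assumptions made for this sketch.

```python
import argparse
import time
from typing import Callable, Optional


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="loop_guard.py")
    p.add_argument("--daemon", action="store_true",
                   help="keep running and re-check on every interval")
    # The flags below are hypothetical names for the plan's configuration items.
    p.add_argument("--interval", type=float, default=60.0,
                   help="seconds between checks")
    p.add_argument("--idle-timeout", type=float, default=15 * 60,
                   help="seconds of inactivity before a process counts as stuck")
    p.add_argument("--no-zombie-detection", dest="zombie_detection",
                   action="store_false", help="disable the process checks")
    p.add_argument("--protect", action="append", default=[],
                   help="command-line substring that must never be killed")
    return p


def run(args: argparse.Namespace, check_once: Callable[[], None],
        sleep: Callable[[float], None] = time.sleep,
        max_cycles: Optional[int] = None) -> int:
    """Run check_once once, or repeatedly in daemon mode. Returns cycles completed."""
    cycles = 0
    while True:
        if args.zombie_detection:
            check_once()
        cycles += 1
        # max_cycles exists only so tests can stop the daemon loop.
        if not args.daemon or (max_cycles is not None and cycles >= max_cycles):
            return cycles
        sleep(args.interval)
```

Without `--daemon` the loop body runs exactly once, which keeps the existing one-shot behaviour available to callers that do not want a long-running process.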

Priority: HIGH - This addresses operational pain points and prevents manual intervention.

kimi was assigned by Timmy 2026-03-24 14:11:56 +00:00
kimi was unassigned by Timmy 2026-03-24 19:32:15 +00:00
Timmy closed this issue 2026-03-24 21:54:05 +00:00

Reference: Rockachopa/Timmy-time-dashboard#1425