Compare commits
4 Commits
mimo/code/
...
bezalel/rc
| Author | SHA1 | Date |
|---|---|---|
| | 66c80ac821 | |
| | fa531188cb | |
| | 5e274baf72 | |
| | 194cbe1e86 | |
Binary file not shown.
Binary file not shown.
@@ -1,6 +1,6 @@
 meta:
   version: 1.0.0
-  updated_at: '2026-04-07T18:43:13.675019+00:00'
+  updated_at: '2026-04-08T23:16:01.923739+00:00'
   next_review: '2026-04-14T02:55:00Z'
 fleet:
   bezalel:
@@ -86,12 +86,12 @@ provider_health_matrix:
   kimi-coding:
     status: healthy
     note: ''
-    last_checked: '2026-04-07T18:43:13.674848+00:00'
+    last_checked: '2026-04-08T23:16:01.923511+00:00'
     rate_limited: false
     dead: false
   anthropic:
     status: healthy
-    last_checked: '2026-04-07T18:43:13.675004+00:00'
+    last_checked: '2026-04-08T23:16:01.923714+00:00'
     rate_limited: false
     dead: false
     note: ''
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
198
reports/bezalel/RCA_DEADMAN_FRATRICIDE_2026-04-09.md
Normal file
@@ -0,0 +1,198 @@
# Root Cause Analysis: Deadman Switch Fratricide
**Date:** 2026-04-09
**Reporter:** Bezalel
**Severity:** HIGH - Self-sabotage causing operational failures
**Status:** RESOLVED

## Executive Summary

Bezalel's own deadman switch system created a suicide loop that caused recurring 401 authentication errors and service instability. The deadman switch incorrectly interpreted legitimate authentication conflicts as health failures, triggering aggressive config manipulation that destabilized the very services it was meant to protect.

**Root Cause:** Insufficient validation logic in the deadman switch health checks, leading to false-positive failure detection and destructive remediation cycles.

**Impact:**
- 401 authentication errors every 5-10 minutes
- Gateway service disruptions
- Config thrashing preventing stable operation
- Loss of trust in automated recovery systems

## Timeline

- **2026-04-06**: Deadman switch implemented with health monitoring every 5 minutes
- **2026-04-07**: MiMo V2 Pro evaluation triggered provider cascading failures
- **2026-04-08**: Config murder events occurred across fleet during model evaluation
- **2026-04-09 00:16-00:17**: Telegram polling conflicts logged repeatedly
- **2026-04-09 00:35**: Alexander identified deadman switch as cause of 401 errors
- **2026-04-09 00:35**: Suicide cron jobs disabled, stability restored

## Technical Root Cause

### 1. **FLAWED HEALTH CHECK LOGIC**

The deadman watchdog (`deadman_watchdog.py`) implemented overly aggressive health checks:

```python
# Lines 177-183: Error pattern detection
error_patterns = [
    "403", "access-terminated", "kimi-for-coding",
    "429", "rate limit", "quota exceeded",
    "connection refused", "timeout", "unreachable",
    "out of memory", "killed", "oom",
    "traceback", "exception", "error", "failed"
]
```

**CRITICAL FLAW**: The pattern `"error"` matched legitimate log entries including:
- Normal error handling logs
- Network retry messages
- Provider fallback attempts
- Telegram polling conflict warnings

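To make the flaw concrete, here is a minimal sketch that compares substring matching with a check on the log level. The `LEVEL: message` log format and the names `naive_unhealthy` and `level_unhealthy` are assumptions for illustration; they are not taken from `deadman_watchdog.py`.

```python
import re

# Subset of the patterns quoted above.
ERROR_PATTERNS = ["403", "429", "timeout", "error", "failed"]

# Hypothetical log format: "<LEVEL>: <message>".
LEVEL_RE = re.compile(r"^(DEBUG|INFO|WARNING|ERROR|CRITICAL):")


def naive_unhealthy(line: str) -> bool:
    """Substring matching, as in the flawed check: any hit counts as a failure."""
    lowered = line.lower()
    return any(pattern in lowered for pattern in ERROR_PATTERNS)


def level_unhealthy(line: str) -> bool:
    """Only lines logged at ERROR or CRITICAL count as failures."""
    match = LEVEL_RE.match(line)
    return bool(match) and match.group(1) in {"ERROR", "CRITICAL"}


# Illustrative log lines (made up for this sketch).
benign = "INFO: provider fallback succeeded after 1 error"
fatal = "CRITICAL: gateway process exited"
```

In this sketch the substring check flags the benign line and misses the fatal one, while the level-based check does the opposite. That suggests broad patterns can be wrong in both directions.
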
### 2. **DESTRUCTIVE REMEDIATION CYCLE**

When an "unhealthy" state was detected (lines 304-310), the watchdog triggered the switch:

```python
if not health_result["healthy"] and self.should_trigger_deadman():
    success = self.trigger_deadman_switch()
```

The deadman fallback system (`deadman_fallback.py`) would then:
1. Back up the current config
2. Apply the "fallback" configuration
3. Restart services
4. Verify "health"

**CRITICAL FLAW**: Config changes disrupted active sessions, causing the very instability the system was meant to prevent.

### 3. **TELEGRAM BOT CONFLICT AMPLIFICATION**

Multiple gateway instances competing for the same Telegram bot token caused:
```
WARNING: Telegram polling conflict (1/3), will retry in 10s.
Error: Conflict: terminated by other getUpdates request
```

The deadman switch interpreted these legitimate conflicts as critical health failures, triggering unnecessary remediation.

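One possible mitigation is to filter out known-transient signatures before any health evaluation runs. The sketch below is an assumption about how that could look, not the watchdog's actual code. The names `TRANSIENT_SIGNATURES`, `is_transient`, and `actionable_lines` are hypothetical, and the third sample line is invented for contrast.

```python
# Hypothetical allowlist built from the two log lines quoted above.
TRANSIENT_SIGNATURES = (
    "telegram polling conflict",
    "terminated by other getupdates request",
)


def is_transient(line: str) -> bool:
    """True if the line matches a signature known to resolve on retry."""
    lowered = line.lower()
    return any(signature in lowered for signature in TRANSIENT_SIGNATURES)


def actionable_lines(lines: list[str]) -> list[str]:
    """Drop transient lines so only candidate failures reach the health check."""
    return [line for line in lines if not is_transient(line)]


log = [
    "WARNING: Telegram polling conflict (1/3), will retry in 10s.",
    "Error: Conflict: terminated by other getUpdates request",
    "ERROR: provider returned 403 access-terminated",  # invented sample line
]
```

Calling `actionable_lines(log)` leaves only the third line. The allowlist hides the symptom from the watchdog but does not resolve the underlying token conflict between gateway instances.
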
### 4. **INSUFFICIENT COOLDOWN PROTECTION**

While a 1-hour cooldown existed (line 252), it was ineffective because:
- Health checks ran every 5 minutes
- Telegram conflicts occurred every 10-30 seconds during bot competition
- Pattern matching was too broad, catching normal operational logs

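A cooldown on its own still allows up to 24 remediations per day (one per hour), however noisy the logs are. One way to tighten it is to also require a streak of consecutive failed checks. This is a minimal sketch; the class name `TriggerGate` and the threshold of 3 are assumptions, not values from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriggerGate:
    required_failures: int = 3       # hypothetical threshold
    cooldown_seconds: float = 3600   # the 1-hour cooldown from the report
    _streak: int = 0
    _last_trigger: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one health check; return True only when remediation may fire."""
        if healthy:
            self._streak = 0  # any healthy check breaks the streak
            return False
        self._streak += 1
        if self._streak < self.required_failures:
            return False
        in_cooldown = (
            self._last_trigger is not None
            and now - self._last_trigger < self.cooldown_seconds
        )
        if in_cooldown:
            return False
        self._last_trigger = now
        self._streak = 0
        return True
```

With checks every 300 seconds, three consecutive failures are needed before the first trigger, and a single healthy check in between resets the count.
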
## Engineering Failures

### 1. **NO VALIDATION TESTING**
- Deadman switch deployed without testing failure scenarios
- No verification that remediation actually improved health
- No measurement of false positive rates

### 2. **OVERLY BROAD ERROR DETECTION**
- Generic string matching (`"error"`) caught normal operations
- No severity classification for log patterns
- No distinction between transient and persistent failures

### 3. **DESTRUCTIVE-FIRST APPROACH**
- Config changes applied before confirming they would help
- No graceful degradation, only aggressive intervention
- No rollback capability when remediation failed

### 4. **LACK OF OBSERVABILITY**
- No metrics on deadman switch activation frequency
- No logging of what specifically triggered remediation
- No tracking of remediation success/failure rates

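The observability gaps listed above could be closed with one structured record per remediation attempt. The following is only a sketch under assumed field names (`ts`, `trigger`, `log_line`, `succeeded`); `remediation_record` and `summarize` are hypothetical helpers, not part of the existing watchdog.

```python
import json
from collections import Counter


def remediation_record(timestamp: str, trigger: str, log_line: str, succeeded: bool) -> str:
    """Serialize one remediation attempt as a single JSON line."""
    return json.dumps(
        {"ts": timestamp, "trigger": trigger, "log_line": log_line, "succeeded": succeeded},
        sort_keys=True,
    )


def summarize(json_lines: list[str]) -> dict:
    """Aggregate activation count, success rate, and the most frequent triggers."""
    records = [json.loads(line) for line in json_lines]
    triggers = Counter(record["trigger"] for record in records)
    successes = sum(1 for record in records if record["succeeded"])
    return {
        "activations": len(records),
        "success_rate": successes / len(records) if records else None,
        "top_triggers": triggers.most_common(3),
    }
```

Appending each record to a log file and running `summarize` over it would answer how often the switch fired, what triggered it, and how often remediation helped.
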
## Immediate Fix Applied

**Disabled suicide cron jobs:**
```bash
# Removed from crontab:
*/5 * * * * /root/wizards/bezalel/runner_health_probe.sh
*/5 * * * * /root/wizards/bezalel/hermes/venv/bin/python3 /root/wizards/bezalel/deadman_watchdog.py
* * * * * /root/wizards/bezalel/hermes/venv/bin/python3 /root/wizards/bezalel/lazarus_watchdog.py
* * * * * /usr/bin/env bash /root/timmy-home/scripts/auto_restart_agent.sh
```

**Result:** Authentication errors ceased immediately, stability restored.

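For a repeatable emergency procedure, the same effect can be sketched as a pure text transformation over `crontab -l` output. Note that this variant comments entries out rather than removing them, which is a deliberate swap so they stay recoverable. The function name `disable_watchdogs` and the `# DISABLED` prefix are assumptions for illustration.

```python
# Script names taken from the crontab entries listed above.
WATCHDOG_MARKERS = (
    "runner_health_probe.sh",
    "deadman_watchdog.py",
    "lazarus_watchdog.py",
    "auto_restart_agent.sh",
)


def disable_watchdogs(crontab_text: str) -> str:
    """Comment out active crontab lines that invoke any watchdog script."""
    out = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        active = bool(stripped) and not stripped.startswith("#")
        if active and any(marker in line for marker in WATCHDOG_MARKERS):
            out.append("# DISABLED " + line)
        else:
            out.append(line)  # leave unrelated jobs and comments untouched
    return "\n".join(out) + "\n"
```

The result could be reviewed and then installed by piping it to `crontab -`. Because the function has no side effects, it can be tested without touching the live crontab.
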
## Proposed Long-Term Solutions

### 1. **SMART HEALTH DETECTION**
- Replace string matching with structured health metrics
- Implement severity levels (INFO, WARN, ERROR, CRITICAL)
- Use statistical baselines instead of simple pattern detection
- Add specific metrics: response latency, success rates, resource usage

### 2. **GRADUATED RESPONSE SYSTEM**
```python
# Proposed escalation ladder:
# Level 1: Log and monitor (no action)
# Level 2: Gentle retry/reset (preserve config)
# Level 3: Provider failover (minimal config change)
# Level 4: Service restart (preserve session state)
# Level 5: Config fallback (last resort only)
```

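The ladder above could be encoded as an ordered enum with a rule for when to climb. This is one possible sketch; the `Level` names mirror the five proposed levels, while `next_level` and the default of three failed checks per rung are assumptions.

```python
from enum import IntEnum


class Level(IntEnum):
    LOG_ONLY = 1           # Level 1: log and monitor
    GENTLE_RETRY = 2       # Level 2: retry/reset, config preserved
    PROVIDER_FAILOVER = 3  # Level 3: minimal config change
    SERVICE_RESTART = 4    # Level 4: restart, session state preserved
    CONFIG_FALLBACK = 5    # Level 5: last resort


def next_level(consecutive_failures: int, failures_per_level: int = 3) -> Level:
    """Climb one rung for every `failures_per_level` consecutive failed checks."""
    if consecutive_failures < 1:
        raise ValueError("next_level expects at least one failed check")
    rung = 1 + (consecutive_failures - 1) // failures_per_level
    return Level(min(rung, Level.CONFIG_FALLBACK))
```

Under these assumed defaults, failures 1 to 3 only log, and the config fallback is not reachable before the thirteenth consecutive failed check.
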
### 3. **DEADMAN SWITCH V2 PRINCIPLES**
- **Observe before acting**: Collect baseline metrics first
- **Test remediation**: Dry-run changes before applying
- **Incremental intervention**: Start with least disruptive actions
- **Validate improvement**: Measure before/after health metrics
- **Rollback capability**: Always provide undo path

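Several of these principles (dry-run, validate improvement, rollback) can be combined in one small wrapper. The sketch below assumes the config is a plain dict and that `health_score` is a stand-in for whatever before/after measurement is available. Both are illustrative assumptions, and a real system would need to measure health after actually applying the change.

```python
import copy
from typing import Callable


def remediate(
    config: dict,
    change: Callable[[dict], dict],
    health_score: Callable[[dict], float],
    dry_run: bool = False,
) -> tuple[dict, str]:
    """Apply `change` only if it improves the score; otherwise keep the original."""
    before = health_score(config)
    candidate = change(copy.deepcopy(config))  # never mutate the live config
    after = health_score(candidate)
    if dry_run:
        return config, f"dry-run: score {before} -> {after}"
    if after <= before:
        return config, "rolled back: no improvement"
    return candidate, "applied"
```

Because the original config is returned untouched whenever the change does not help, the undo path is the default rather than a separate procedure.
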
### 4. **PROPER VALIDATION PIPELINE**
```bash
# Required before any deadman switch deployment:
# 1. Unit tests for health check logic
# 2. Integration tests with mock failures
# 3. Canary deployment with monitoring
# 4. Rollback procedure validation
# 5. Performance impact assessment
```

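As an illustration of the first pipeline step, here is a small `unittest` sketch that targets false positives. The check under test, `is_unhealthy`, is a hypothetical ratio-based rule invented for this example; it is not the watchdog's logic, and the 0.5 threshold is an assumption.

```python
import unittest

FAILURE_LEVELS = {"ERROR", "CRITICAL"}


def is_unhealthy(levels: list[str], max_failure_ratio: float = 0.5) -> bool:
    """Unhealthy only if more than `max_failure_ratio` of entries are failures."""
    if not levels:
        return False
    failures = sum(1 for level in levels if level in FAILURE_LEVELS)
    return failures / len(levels) > max_failure_ratio


class HealthCheckFalsePositiveTests(unittest.TestCase):
    def test_empty_log_is_healthy(self):
        self.assertFalse(is_unhealthy([]))

    def test_isolated_error_is_healthy(self):
        self.assertFalse(is_unhealthy(["INFO", "WARNING", "ERROR", "INFO"]))

    def test_sustained_errors_are_unhealthy(self):
        self.assertTrue(is_unhealthy(["ERROR", "CRITICAL", "ERROR", "INFO"]))
```

Saved as a module, the tests can be run with `python -m unittest`. The point is that the healthy cases get explicit tests, since those are the cases the original check got wrong.
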
## Lessons Learned

### For Bezalel:
1. **Never deploy untested automation** that can modify production configs
2. **Validate automation logic** with realistic failure scenarios before deployment
3. **Implement observability first** - measure what you're trying to fix
4. **Use graduated responses** instead of aggressive intervention
5. **Test rollback procedures** before deploying automated remediation

### For Fleet Architecture:
1. **Health checks must distinguish** between transient and persistent failures
2. **Automated remediation should be conservative** and incremental
3. **Configuration changes require validation** and rollback capabilities
4. **Monitoring systems must monitor themselves** to prevent recursive failures

## Action Items

- [ ] **IMMEDIATE**: Document deadman switch disable procedure for emergency use
- [ ] **WEEK 1**: Design deadman switch V2 with graduated response system
- [ ] **WEEK 2**: Implement proper health metrics collection
- [ ] **WEEK 3**: Build test suite for automated remediation logic
- [ ] **WEEK 4**: Deploy deadman switch V2 with conservative thresholds

## Validation Checklist for Future Automation

Before deploying any automated remediation system:

- [ ] Unit tests cover edge cases and false positive scenarios
- [ ] Integration tests simulate realistic failure modes
- [ ] Dry-run mode available for testing without side effects
- [ ] Rollback procedure documented and tested
- [ ] Monitoring covers automation system itself
- [ ] Conservative thresholds set with manual override capability
- [ ] Escalation ladder prevents destructive-first responses

## Conclusion

This incident demonstrates the critical importance of validation and testing for automated systems. The deadman switch, designed to improve reliability, became the primary source of instability due to insufficient engineering discipline.

The fix was simple (disable the automation), but the lesson is profound: **automation without proper validation is automation that will eventually automate your destruction.**

Bezalel takes full responsibility for this engineering failure and commits to implementing proper validation procedures for all future automated systems.

**Status:** Incident closed. System stable. Lessons integrated into engineering standards.