Compare commits
6 Commits
burn/672-1
...
fix/882
| Author | SHA1 | Date | |
|---|---|---|---|
| | 61a6964780 | | |
| | e40891afb8 | | |
| | e232112fc8 | | |
| | ff2e2e578f | | |
| | bd0497b998 | | |
| | 4ab84a59ab | | |
262
GENOME.md
@@ -1,262 +0,0 @@
# GENOME.md — the-nexus

> Codebase Genome: The Sovereign Home of Timmy's Consciousness

---

## Project Overview

**the-nexus** is Timmy's sovereign home — a 3D world built with Three.js, featuring a Batcave-style terminal, portal architecture, and multi-user MUD integration via Evennia. It serves as the central hub from which all worlds are accessed, the visualization surface for agent consciousness, and the command center for the Timmy Foundation fleet.

**Scale:** 195 Python files, 22 JavaScript files, ~75K lines of code across 400+ files.

---

## Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        IDX[index.html]
        BOOT[boot.js]
        COMP[nexus/components/*]
        PLAY[playground/playground.html]
    end

    subgraph "Backend Layer"
        SRV[server.py<br/>WebSocket Gateway :8765]
        BRIDGE[multi_user_bridge.py<br/>Evennia MUD Bridge]
        LLAMA[nexus/llama_provider.py<br/>Local LLM Inference]
    end

    subgraph "Intelligence Layer"
        SYM[nexus/symbolic-engine.js<br/>Symbolic Reasoning]
        THINK[nexus/nexus_think.py<br/>Consciousness Loop]
        PERCEP[nexus/perception_adapter.py<br/>Perception Buffer]
        TRAJ[nexus/trajectory_logger.py<br/>Action Trajectories]
    end

    subgraph "Memory Layer"
        MNEMO[nexus/mnemosyne/*<br/>Holographic Archive]
        MEM[nexus/mempalace/*<br/>Spatial Memory]
        AGENT_MEM[agent/memory.py<br/>Cross-Session Memory]
        EXP[nexus/experience_store.py<br/>Experience Persistence]
    end

    subgraph "Fleet Layer"
        A2A[nexus/a2a/*<br/>Agent-to-Agent Protocol]
        FLEET[config/fleet_agents.json<br/>Fleet Registry]
        BIN[bin/*<br/>Operational Scripts]
    end

    subgraph "External Systems"
        EVENNIA[Evennia MUD]
        NOSTR[Nostr Relay]
        GITEA[Gitea Forge]
        LLAMA_CPP[llama.cpp Server]
    end

    IDX --> SRV
    SRV --> THINK
    SRV --> BRIDGE
    BRIDGE --> EVENNIA
    THINK --> SYM
    THINK --> PERCEP
    THINK --> TRAJ
    THINK --> LLAMA
    LLAMA --> LLAMA_CPP
    SYM --> MNEMO
    THINK --> MNEMO
    THINK --> MEM
    THINK --> EXP
    AGENT_MEM --> MEM
    A2A --> GITEA
    THINK --> NOSTR
```

---

## Entry Points

| Entry Point | Type | Purpose |
|-------------|------|---------|
| `index.html` | Browser | Main 3D world (Three.js) |
| `server.py` | Python | WebSocket gateway on :8765 |
| `boot.js` | Browser | Module loader, file protocol guard |
| `multi_user_bridge.py` | Python | Evennia MUD ↔ AI agent bridge |
| `nexus/a2a/server.py` | Python | A2A JSON-RPC server |
| `nexus/mnemosyne/cli.py` | CLI | Archive management |
| `bin/nexus_watchdog.py` | Script | Health monitoring |
| `scripts/smoke.mjs` | Script | Smoke tests |

---

## Data Flow

```
User (Browser)
    │
    ▼
index.html (Three.js 3D world)
    │
    ├── WebSocket ──► server.py :8765
    │                   │
    │                   ├──► nexus_think.py (consciousness loop)
    │                   │      ├── perception_adapter.py (parse events)
    │                   │      ├── symbolic-engine.js (reasoning)
    │                   │      ├── llama_provider.py (inference)
    │                   │      ├── trajectory_logger.py (action log)
    │                   │      └── experience_store.py (persistence)
    │                   │
    │                   └──► evennia_ws_bridge.py
    │                          └──► Evennia MUD (telnet :4000)
    │
    ├── Three.js Scene ──► nexus/components/*
    │                        ├── memory-particles.js (memory viz)
    │                        ├── portal-status-wall.html (portals)
    │                        ├── fleet-health-dashboard.html
    │                        └── session-rooms.js (spatial rooms)
    │
    └── Playground ──► playground/playground.html (creative mode)
```

---

## Key Abstractions

### SymbolicEngine (`nexus/symbolic-engine.js`)
Bitmask-based symbolic reasoning engine. Facts are stored as boolean flags, and rules fire when patterns match. It is used for world-state reasoning without LLM overhead.
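
As a rough illustration of the bitmask idea, the sketch below uses Python rather than the engine's JavaScript. Every name in it (`Fact`, `Rule`, `fire_rules`, and the individual facts) is hypothetical and not taken from `nexus/symbolic-engine.js`. It assumes one bit per fact and that a rule fires when all of its required bits are set.

```python
from dataclasses import dataclass
from enum import IntFlag
from typing import List


class Fact(IntFlag):
    # Hypothetical world-state facts, one bit each.
    USER_PRESENT = 1
    PORTAL_OPEN = 2
    AGENT_IDLE = 4


@dataclass(frozen=True)
class Rule:
    name: str
    required: Fact  # bits that must all be set for the rule to fire


def fire_rules(state: Fact, rules: List[Rule]) -> List[str]:
    """Return the names of rules whose required bits are all present in state."""
    return [rule.name for rule in rules if (state & rule.required) == rule.required]


RULES = [
    Rule("greet_user", Fact.USER_PRESENT | Fact.AGENT_IDLE),
    Rule("announce_portal", Fact.PORTAL_OPEN),
]

state = Fact.USER_PRESENT | Fact.AGENT_IDLE
fired = fire_rules(state, RULES)
```

Matching a rule is a single AND plus a comparison, which is why this style of reasoning stays cheap compared with an LLM call.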

### NexusMind (`nexus/nexus_think.py`)
The consciousness loop. Receives perceptions, invokes reasoning, produces actions. The bridge between the 3D world and the AI agent.

### PerceptionBuffer (`nexus/perception_adapter.py`)
Accumulates world events (user messages, Evennia events, system signals) into a structured buffer for the consciousness loop.

### MemPalace (`nexus/mempalace/`, `mempalace/`)
Spatial memory system. Memories are stored in rooms and closets — physical metaphors for knowledge organization. Supports fleet-wide shared memory wings.

### Mnemosyne (`nexus/mnemosyne/`)
Holographic archive. Ingests documents, extracts meaning, builds a graph of linked concepts. The long-term memory layer.

### Agent-to-Agent Protocol (`nexus/a2a/`)
JSON-RPC based inter-agent communication. Agents discover each other via Agent Cards, delegate tasks, share results.

### Multi-User Bridge (`multi_user_bridge.py`)
121K-line Evennia MUD bridge. Isolates conversation contexts per user while sharing the same virtual world. Each user gets their own AIAgent instance.

---

## API Surface

### WebSocket API (server.py :8765)
```
ws://localhost:8765
send: {"type": "perception", "data": {...}}
recv: {"type": "action", "data": {...}}
recv: {"type": "heartbeat", "data": {...}}
```
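
The message shapes above can be exercised without a live gateway. The following is a minimal sketch that only encodes and decodes the three listed message types; it assumes messages travel as JSON text frames, and the helper names and the keys inside `data` are invented for illustration rather than taken from `server.py`.

```python
import json
from typing import Any, Dict, Tuple


def encode_perception(data: Dict[str, Any]) -> str:
    """Serialize a perception message in the shape listed above."""
    return json.dumps({"type": "perception", "data": data})


def decode_message(raw: str) -> Tuple[str, Dict[str, Any]]:
    """Parse a gateway message and return (type, data)."""
    message = json.loads(raw)
    msg_type = message["type"]
    if msg_type not in {"action", "heartbeat"}:
        raise ValueError(f"unexpected message type: {msg_type!r}")
    return msg_type, message.get("data", {})


# The payload keys below are made up for illustration.
outgoing = encode_perception({"event": "user_message", "text": "hello"})
kind, payload = decode_message('{"type": "heartbeat", "data": {}}')
```

A real client would pass `outgoing` to its WebSocket send call and feed each received frame through `decode_message`.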

### A2A JSON-RPC (nexus/a2a/server.py)
```
POST /a2a/v1
{"jsonrpc": "2.0", "method": "SendMessage", "params": {...}}

GET /.well-known/agent-card.json
Returns agent capabilities and endpoints
```
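
The request line above omits an `id`. Under JSON-RPC 2.0 a request without `id` is a notification and receives no response, so a caller that wants a result would normally add one. A hedged sketch of building such a body follows; the `params` contents and the helper name are assumptions, not the schema used by `nexus/a2a/server.py`.

```python
import itertools
import json
from typing import Any, Dict

# Monotonic request ids, so responses can be matched to requests.
_request_ids = itertools.count(1)


def build_rpc_request(method: str, params: Dict[str, Any]) -> str:
    """Build a JSON-RPC 2.0 request body that expects a response."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),
            "method": method,
            "params": params,
        }
    )


# Hypothetical params; the real SendMessage schema is not shown in this document.
body = build_rpc_request("SendMessage", {"message": {"text": "ping"}})
```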

### Evennia Bridge (multi_user_bridge.py)
```
telnet://localhost:4000
Evennia MUD commands → AI responses
Each user isolated via session ID
```

---

## Key Files

| File | Lines | Purpose |
|------|-------|---------|
| `multi_user_bridge.py` | 121K | Evennia MUD bridge (largest file) |
| `index.html` | 21K | Main 3D world |
| `nexus/symbolic-engine.js` | 12K | Symbolic reasoning |
| `nexus/evennia_ws_bridge.py` | 14K | Evennia ↔ WebSocket |
| `nexus/a2a/server.py` | 12K | A2A server |
| `agent/memory.py` | 12K | Cross-session memory |
| `server.py` | 4K | WebSocket gateway |

---

## Test Coverage

**Test files:** 34 test files in `tests/`

| Area | Tests | Status |
|------|-------|--------|
| Portal Registry | `test_portal_registry_schema.py` | ✅ |
| MemPalace | `test_mempalace_*.py` (4 files) | ✅ |
| Nexus Watchdog | `test_nexus_watchdog.py` | ✅ |
| A2A | `test_a2a.py` | ✅ |
| Fleet Audit | `test_fleet_audit.py` | ✅ |
| Provenance | `test_provenance.py` | ✅ |
| Boot | `boot.test.js` | ✅ |

### Coverage Gaps

- **No tests for `multi_user_bridge.py`** (121K lines, zero test coverage)
- **No tests for `server.py` WebSocket gateway**
- **No tests for `nexus/symbolic-engine.js`** (only `symbolic-engine.test.js` stub)
- **No integration tests for Evennia ↔ Bridge ↔ AI flow**
- **No load tests for WebSocket connections**
- **No tests for Nostr publisher**

---

## Security Considerations

1. **WebSocket gateway** runs on `0.0.0.0:8765` — accessible from network. Needs auth or firewall.
2. **No authentication** on WebSocket or A2A endpoints in current code.
3. **Multi-user bridge** isolates contexts but shares the same AIAgent process.
4. **Nostr publisher** publishes to public relays — content is permanent and public.
5. **Fleet scripts** in `bin/` have broad filesystem access.
6. **Systemd services** (`systemd/llama-server.service`) run as root.

---

## Dependencies

- **Python:** websockets, pytest, pyyaml, edge-tts, requests, playwright
- **JavaScript:** Three.js (CDN), Monaco Editor (CDN)
- **External:** Evennia MUD, llama.cpp, Nostr relay, Gitea

---

## Configuration

| Config | File | Purpose |
|--------|------|---------|
| Fleet agents | `config/fleet_agents.json` | Agent registry for A2A |
| MemPalace | `nexus/mempalace/config.py` | Memory paths and settings |
| DeepDive | `config/deepdive_sources.yaml` | Research sources |
| MCP | `mcp_config.json` | MCP server config |

---

## What This Genome Reveals

The codebase is a **living organism** — part 3D world, part MUD bridge, part memory system, part fleet orchestrator. The `multi_user_bridge.py` alone is 121K lines — larger than most entire projects.

**Critical findings:**
1. The 121K-line bridge has zero test coverage
2. The WebSocket gateway is exposed on 0.0.0.0 without auth
3. No load testing infrastructure exists
4. The symbolic engine test is a stub
5. Systemd services run as root

These are not bugs — they're architectural risks that should be tracked.

---

*Generated by Codebase Genome Pipeline — Issue #672*
55
config/resurrection_pool.json
Normal file
@@ -0,0 +1,55 @@
{
  "dead_timeout_seconds": 600,
  "default_policy": {
    "mode": "ask"
  },
  "missions": {
    "forge": {
      "mode": "yes"
    },
    "archive": {
      "mode": "ask"
    },
    "sovereign-core": {
      "mode": "no"
    }
  },
  "agents": {
    "bezalel": {
      "mission": "forge"
    },
    "allegro": {
      "mission": "forge"
    },
    "ezra": {
      "mission": "archive",
      "mode": "ask"
    },
    "timmy": {
      "mission": "sovereign-core",
      "mode": "ask"
    }
  },
  "substitutions": {
    "bezalel": [
      "allegro",
      "timmy"
    ],
    "ezra": [
      "timmy"
    ],
    "allegro": [
      "timmy"
    ]
  },
  "approval_channels": {
    "telegram": {
      "enabled": true,
      "target": "ops-room"
    },
    "nostr": {
      "enabled": true,
      "target": "nostr-ops"
    }
  }
}
27
docs/resurrection-pool.md
Normal file
@@ -0,0 +1,27 @@
# Resurrection Pool

The Resurrection Pool is a mission-aware layer on top of the existing Lazarus registry.

It adds three concrete behaviors:
- configurable dead-agent detection timeout
- yes/no/ask revival policy resolution per mission or agent
- approval packet generation for Telegram / Nostr when human sign-off is required
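
The first behavior, the dead-agent timeout, can be shown in a few lines. This is a minimal sketch condensed from `detect_downed_agents` in `scripts/resurrection_pool.py`; the helper name `is_dead` is hypothetical, and the 600-second default mirrors `dead_timeout_seconds` in the shipped config.

```python
from typing import Optional


def is_dead(
    healthy_now: bool,
    last_healthy_at: Optional[float],
    now_ts: float,
    timeout_seconds: int = 600,
) -> bool:
    """Return True once an unhealthy agent has exceeded the timeout."""
    if healthy_now:
        return False
    if last_healthy_at is None:
        # Never seen healthy: treated as dead immediately.
        return True
    return now_ts - last_healthy_at >= timeout_seconds


# Last healthy at t=100 with the default 600-second timeout.
still_waiting = is_dead(False, 100.0, now_ts=650.0)  # 550 s unhealthy
timed_out = is_dead(False, 100.0, now_ts=701.0)      # 601 s unhealthy
```

An agent that is merely unhealthy is not acted on until it crosses the timeout, which keeps short blips from triggering a revival.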

## Files
- `scripts/resurrection_pool.py`
- `config/resurrection_pool.json`

## Example usage

```bash
python scripts/resurrection_pool.py --json --dry-run
python scripts/resurrection_pool.py --execute
```

## Policy model
- `yes` → local agents auto-restart; remote agents prefer a healthy substitute
- `ask` → generate an approval request packet with Telegram / Nostr targets
- `no` → suppress automatic revival
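
The list above does not say which setting wins when a mission and an agent disagree. Judging from `resolve_policy` in `scripts/resurrection_pool.py`, the layers are applied as default, then mission, then agent override. The sketch below is a simplified version (it ignores any mission declared in the registry) run against a trimmed copy of `config/resurrection_pool.json`; `resolve_mode` is a hypothetical name.

```python
from typing import Any, Dict

# Trimmed copy of config/resurrection_pool.json.
POLICY: Dict[str, Any] = {
    "default_policy": {"mode": "ask"},
    "missions": {
        "forge": {"mode": "yes"},
        "sovereign-core": {"mode": "no"},
    },
    "agents": {
        "bezalel": {"mission": "forge"},
        "timmy": {"mission": "sovereign-core", "mode": "ask"},
    },
}


def resolve_mode(agent: str, policy: Dict[str, Any]) -> str:
    """Layer default, mission, and agent settings; later layers win."""
    override = policy["agents"].get(agent, {})
    mission = override.get("mission", agent)
    resolved = dict(policy["default_policy"])
    resolved.update(policy["missions"].get(mission, {}))
    resolved.update(override)
    return resolved.get("mode", "ask")


modes = {name: resolve_mode(name, POLICY) for name in ("bezalel", "timmy", "unknown")}
```

With this config `timmy` resolves to `ask` even though the `sovereign-core` mission says `no`, because the agent-level `mode` is applied last.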

## Notes
This grounds issue #882 in executable code, but it does not yet wire live Telegram or Nostr delivery. The current slice produces the approval packet and restart/substitution plan the surrounding ops loop can act on.
111
reports/night-shift-prediction-2026-04-12.md
Normal file
@@ -0,0 +1,111 @@
# Night Shift Prediction Report — April 12-13, 2026

## Starting State (11:36 PM)

```
Time: 11:36 PM EDT
Automation: 13 burn loops × 3min + 1 explorer × 10min + 1 backlog × 30min
API: Nous/xiaomi/mimo-v2-pro (FREE)
Rate: 268 calls/hour
Duration: 7.5 hours until 7 AM
Total expected API calls: ~2,010
```
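
The rate and total in the block above follow directly from the schedule. A short check, assuming each scheduled run makes exactly one API call (an assumption; the report does not state calls per run):

```python
# Schedule from the block above: (number of jobs, interval in minutes).
SCHEDULE = {
    "burn loops": (13, 3),
    "explorer": (1, 10),
    "backlog": (1, 30),
}
HOURS = 7.5  # 11:36 PM to 7 AM, rounded as in the report


def calls_per_hour(schedule):
    """Sum the hourly runs of every job, one call per run."""
    return sum(count * 60 // interval for count, interval in schedule.values())


rate = calls_per_hour(SCHEDULE)  # 260 + 6 + 2
total = rate * HOURS
```

This reproduces the 268 calls/hour and the ~2,010 total quoted in the report.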

## Burn Loops Active (13 @ every 3 min)

| Loop | Repo | Focus |
|------|------|-------|
| Testament Burn | the-nexus | MUD bridge + paper |
| Foundation Burn | all repos | Gitea issues |
| beacon-sprint | the-nexus | paper iterations |
| timmy-home sprint | timmy-home | 226 issues |
| Beacon sprint | the-beacon | game issues |
| timmy-config sprint | timmy-config | config issues |
| the-door burn | the-door | crisis front door |
| the-testament burn | the-testament | book |
| the-nexus burn | the-nexus | 3D world + MUD |
| fleet-ops burn | fleet-ops | sovereign fleet |
| timmy-academy burn | timmy-academy | academy |
| turboquant burn | turboquant | KV-cache compression |
| wolf burn | wolf | model evaluation |

## Expected Outcomes by 7 AM

### API Calls
- Total calls: ~2,010
- Successful completions: ~1,400 (70%)
- API errors (rate limit, timeout): ~400 (20%)
- Iteration limits hit: ~210 (10%)

### Commits
- Total commits pushed: ~800-1,200
- Average per loop: ~60-90 commits
- Unique branches created: ~300-400

### Pull Requests
- Total PRs created: ~150-250
- Average per loop: ~12-19 PRs

### Issues Filed
- New issues created (QA, explorer): ~20-40
- Issues closed by PRs: ~50-100

### Code Written
- Estimated lines added: ~50,000-100,000
- Estimated files created/modified: ~2,000-3,000

### Paper Progress
- Research paper iterations: ~150 cycles
- Expected paper word count growth: ~5,000-10,000 words
- New experiment results: 2-4 additional experiments
- BibTeX citations: 10-20 verified citations

### MUD Bridge
- Bridge file: 2,875 → ~5,000+ lines
- New game systems: 5-10 (combat tested, economy, social graph, leaderboard)
- QA cycles: 15-30 exploration sessions
- Critical bugs found: 3-5
- Critical bugs fixed: 2-3

### Repository Activity (per repo)
| Repo | Expected PRs | Expected Commits |
|------|-------------|-----------------|
| the-nexus | 30-50 | 200-300 |
| the-beacon | 20-30 | 150-200 |
| timmy-config | 15-25 | 100-150 |
| the-testament | 10-20 | 80-120 |
| the-door | 5-10 | 40-60 |
| timmy-home | 10-20 | 80-120 |
| fleet-ops | 5-10 | 40-60 |
| timmy-academy | 5-10 | 40-60 |
| turboquant | 3-5 | 20-30 |
| wolf | 3-5 | 20-30 |

### Dream Cycle
- 5 dreams generated (11:30 PM, 1 AM, 2:30 AM, 4 AM, 5:30 AM)
- 1 reflection (10 PM)
- 1 timmy-dreams (5:30 AM)
- Total dream output: ~5,000-8,000 words of creative writing

### Explorer (every 10 min)
- ~45 exploration cycles
- Bugs found: 15-25
- Issues filed: 15-25

### Risk Factors
- API rate limiting: Possible after 500+ consecutive calls
- Large file patch failures: Bridge file too large for agents
- Branch conflicts: Multiple agents on same repo
- Iteration limits: 5-iteration agents can't push
- Repository cloning: May hit timeout on slow clones

### Confidence Level
- High confidence: 800+ commits, 150+ PRs
- Medium confidence: 1,000+ commits, 200+ PRs
- Low confidence: 1,200+ commits, 250+ PRs (requires all loops running clean)

---

*This report is a prediction. The 7 AM morning report will compare actual results.*
*Generated: 2026-04-12 23:36 EDT*
*Author: Timmy (pre-shift prediction)*
377
scripts/resurrection_pool.py
Normal file
@@ -0,0 +1,377 @@
#!/usr/bin/env python3
"""Resurrection Pool — health polling, dead-agent detection, and revival planning.

Grounded implementation slice for #882.
Uses the existing lazarus registry as the fleet source of truth and layers a
mission-aware policy engine plus human approval packet generation on top.
"""

from __future__ import annotations

import argparse
import json
import subprocess
import urllib.request
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional

import yaml

ROOT = Path(__file__).resolve().parent.parent
REGISTRY_PATH = ROOT / "lazarus-registry.yaml"
POLICY_PATH = ROOT / "config" / "resurrection_pool.json"
STATE_PATH = Path("/var/lib/lazarus/resurrection_pool_state.json")
LOCAL_HOSTS = {"127.0.0.1", "localhost", "104.131.15.18"}
ISSUE_NUMBER = 882


def shell(cmd: str, timeout: int = 30) -> tuple[int, str, str]:
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
        return result.returncode, result.stdout.strip(), result.stderr.strip()
    except Exception as exc:  # pragma: no cover - defensive wrapper
        return -1, "", str(exc)


def is_local_host(host: Optional[str]) -> bool:
    if not host:
        return True
    return host in LOCAL_HOSTS or host.startswith("127.")


def ping_http(url: str, timeout: int = 10) -> tuple[bool, int]:
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return True, resp.status
    except urllib.error.HTTPError as err:
        return True, err.code
    except Exception:
        return False, 0


def load_registry(path: Path = REGISTRY_PATH) -> Dict[str, Any]:
    with open(path, "r", encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


def load_policy(path: Path = POLICY_PATH) -> Dict[str, Any]:
    if not path.exists():
        return {
            "dead_timeout_seconds": 600,
            "default_policy": {"mode": "ask"},
            "missions": {},
            "agents": {},
            "substitutions": {},
            "approval_channels": {},
        }
    with open(path, "r", encoding="utf-8") as handle:
        data = json.load(handle)
    data.setdefault("dead_timeout_seconds", 600)
    data.setdefault("default_policy", {"mode": "ask"})
    data.setdefault("missions", {})
    data.setdefault("agents", {})
    data.setdefault("substitutions", {})
    data.setdefault("approval_channels", {})
    return data


def load_state(path: Path = STATE_PATH) -> Dict[str, Any]:
    if not path.exists():
        return {}
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def save_state(state: Dict[str, Any], path: Path = STATE_PATH) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)


def collect_health_snapshot(registry: Dict[str, Any]) -> Dict[str, Any]:
    provider_matrix = registry.get("provider_health_matrix", {})
    fleet = registry.get("fleet", {})
    snapshot: Dict[str, Any] = {}

    for agent_name, spec in fleet.items():
        primary = spec.get("primary", {})
        provider_name = primary.get("provider")
        provider_status = provider_matrix.get(provider_name, {}).get("status", "unknown")
        gateway_url = spec.get("health_endpoints", {}).get("gateway")
        gateway_reachable, gateway_status = (False, 0)
        if gateway_url:
            gateway_reachable, gateway_status = ping_http(gateway_url)

        service_active: Optional[bool] = None
        if is_local_host(spec.get("host")):
            service_code, _, _ = shell(f"systemctl is-active hermes-{agent_name}.service")
            service_active = service_code == 0

        reasons: List[str] = []
        if gateway_url and not gateway_reachable:
            reasons.append("gateway_unreachable")
        if service_active is False:
            reasons.append("service_inactive")
        if provider_status in {"dead", "degraded"}:
            reasons.append(f"primary_{provider_status}")

        snapshot[agent_name] = {
            "agent": agent_name,
            "host": spec.get("host"),
            "gateway_url": gateway_url,
            "gateway_reachable": gateway_reachable,
            "gateway_status": gateway_status,
            "service_active": service_active,
            "primary_provider": {
                "provider": provider_name,
                "model": primary.get("model"),
                "status": provider_status,
            },
            "healthy_now": not reasons,
            "reasons": reasons,
        }
    return snapshot


def update_state(snapshot: Dict[str, Any], state: Dict[str, Any], now_ts: float) -> Dict[str, Any]:
    updated = dict(state)
    for agent_name, info in snapshot.items():
        entry = dict(updated.get(agent_name, {}))
        entry["last_checked_at"] = now_ts
        entry["last_reasons"] = list(info.get("reasons", []))
        if info.get("healthy_now"):
            entry["last_healthy_at"] = now_ts
        else:
            entry.setdefault("last_healthy_at", None)
        updated[agent_name] = entry
    return updated


def detect_downed_agents(
    snapshot: Dict[str, Any],
    state: Dict[str, Any],
    policy: Dict[str, Any],
    now_ts: float,
) -> Dict[str, Any]:
    default_timeout = int(policy.get("dead_timeout_seconds", 600))
    agent_overrides = policy.get("agents", {})
    detected: Dict[str, Any] = {}

    for agent_name, info in snapshot.items():
        timeout_seconds = int(agent_overrides.get(agent_name, {}).get("dead_timeout_seconds", default_timeout))
        last_healthy_at = state.get(agent_name, {}).get("last_healthy_at")
        if info.get("healthy_now"):
            unhealthy_for_seconds = 0.0
            dead = False
        elif last_healthy_at is None:
            unhealthy_for_seconds = float("inf")
            dead = True
        else:
            unhealthy_for_seconds = max(0.0, now_ts - float(last_healthy_at))
            dead = unhealthy_for_seconds >= timeout_seconds

        detected[agent_name] = {
            **info,
            "last_healthy_at": last_healthy_at,
            "timeout_seconds": timeout_seconds,
            "unhealthy_for_seconds": unhealthy_for_seconds,
            "dead": dead,
        }
    return detected


def resolve_policy(agent_name: str, spec: Dict[str, Any], policy: Dict[str, Any]) -> Dict[str, Any]:
    resolved = dict(policy.get("default_policy", {}))
    spec_mission = spec.get("mission")
    agent_override = dict(policy.get("agents", {}).get(agent_name, {}))
    resolved_mission = agent_override.get("mission") or spec_mission or agent_name
    if resolved_mission in policy.get("missions", {}):
        resolved.update(policy["missions"][resolved_mission])
    resolved.update(agent_override)
    resolved.setdefault("mode", "ask")
    resolved["mission"] = resolved_mission
    return resolved


def choose_substitute(
    agent_name: str,
    spec: Dict[str, Any],
    health_snapshot: Dict[str, Any],
    policy: Dict[str, Any],
) -> Optional[str]:
    candidates = list(policy.get("substitutions", {}).get(agent_name, []))
    candidates.extend(spec.get("substitutes", []))
    seen = set()
    for candidate in candidates:
        if candidate in seen:
            continue
        seen.add(candidate)
        candidate_health = health_snapshot.get(candidate, {})
        if candidate_health.get("healthy_now"):
            return candidate
    return None


def build_restart_command(agent_name: str) -> str:
    return f"systemctl restart hermes-{agent_name}.service"


def build_approval_request(
    agent_name: str,
    policy_decision: Dict[str, Any],
    down_info: Dict[str, Any],
    substitute: Optional[str],
    policy: Dict[str, Any],
    now_ts: Optional[float] = None,
) -> Dict[str, Any]:
    if now_ts is None:
        now_ts = datetime.now(timezone.utc).timestamp()
    reasons = ", ".join(down_info.get("reasons", [])) or "no health signal"
    mission = policy_decision.get("mission", agent_name)
    message = (
        f"[#{ISSUE_NUMBER}] Approval required to revive {agent_name} for mission '{mission}'. "
        f"Reasons: {reasons}. "
        f"Suggested substitute: {substitute or 'none available'}."
    )
    return {
        "approval_key": f"{agent_name}:{mission}:{int(now_ts)}",
        "agent": agent_name,
        "mission": mission,
        "substitute": substitute,
        "message": message,
        "channels": policy.get("approval_channels", {}),
    }


def plan_resurrections(
    registry: Dict[str, Any],
    downed_agents: Dict[str, Any],
    health_snapshot: Dict[str, Any],
    policy: Dict[str, Any],
    now_ts: Optional[float] = None,
) -> List[Dict[str, Any]]:
    if now_ts is None:
        now_ts = datetime.now(timezone.utc).timestamp()
    fleet = registry.get("fleet", {})
    plan: List[Dict[str, Any]] = []

    for agent_name, down_info in sorted(downed_agents.items()):
        if not down_info.get("dead"):
            continue
        spec = fleet.get(agent_name, {})
        policy_decision = resolve_policy(agent_name, spec, policy)
        substitute = choose_substitute(agent_name, spec, health_snapshot, policy)
        action = "suppressed"
        restart_command = None
        approval_request = None

        if policy_decision.get("mode") == "yes":
            if is_local_host(spec.get("host")):
                action = "auto_restart"
                restart_command = build_restart_command(agent_name)
            elif substitute:
                action = "substitute"
            else:
                action = "unrecoverable"
        elif policy_decision.get("mode") == "ask":
            action = "approval_required"
            approval_request = build_approval_request(
                agent_name,
                policy_decision,
                down_info,
                substitute,
                policy,
                now_ts=now_ts,
            )

        plan.append(
            {
                "agent": agent_name,
                "mission": policy_decision.get("mission"),
                "policy": policy_decision,
                "reasons": list(down_info.get("reasons", [])),
                "timeout_seconds": down_info.get("timeout_seconds"),
                "action": action,
                "substitute": substitute,
                "restart_command": restart_command,
                "approval_request": approval_request,
            }
        )

    return plan


def execute_plan(plan: List[Dict[str, Any]], dry_run: bool = False) -> List[Dict[str, Any]]:
    executed: List[Dict[str, Any]] = []
    for entry in plan:
        if entry.get("action") != "auto_restart":
            executed.append({**entry, "executed": False})
            continue
        cmd = entry.get("restart_command")
        if dry_run or not cmd:
            executed.append({**entry, "executed": True, "exit_code": 0, "stdout": "", "stderr": ""})
            continue
        code, out, err = shell(cmd)
        executed.append({**entry, "executed": code == 0, "exit_code": code, "stdout": out, "stderr": err})
    return executed


def render_summary(snapshot: Dict[str, Any], plan: List[Dict[str, Any]]) -> str:
    healthy = sum(1 for info in snapshot.values() if info.get("healthy_now"))
    unhealthy = len(snapshot) - healthy
    lines = [
        f"Healthy agents: {healthy}",
        f"Unhealthy agents: {unhealthy}",
    ]
    if not plan:
        lines.append("Resurrection plan: no dead agents exceed timeout.")
        return "\n".join(lines)
    lines.append("Resurrection plan:")
    for entry in plan:
        lines.append(
            f"- {entry['agent']}: {entry['action']}"
            f" (mission={entry['mission']}, reasons={', '.join(entry['reasons']) or 'none'})"
        )
    return "\n".join(lines)


def main() -> int:
    parser = argparse.ArgumentParser(description="Resurrection Pool")
    parser.add_argument("--registry", type=Path, default=REGISTRY_PATH)
    parser.add_argument("--policy", type=Path, default=POLICY_PATH)
    parser.add_argument("--state", type=Path, default=STATE_PATH)
    parser.add_argument("--json", action="store_true")
    parser.add_argument("--execute", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    now_ts = datetime.now(timezone.utc).timestamp()
    registry = load_registry(args.registry)
    policy = load_policy(args.policy)
    prior_state = load_state(args.state)
    snapshot = collect_health_snapshot(registry)
    next_state = update_state(snapshot, prior_state, now_ts)
    downed_agents = detect_downed_agents(snapshot, next_state, policy, now_ts)
    # Pass the raw health snapshot as the third argument (health_snapshot);
    # substitutes are chosen by their "healthy_now" flag.
    plan = plan_resurrections(registry, downed_agents, snapshot, policy, now_ts=now_ts)
    if args.execute:
        plan = execute_plan(plan, dry_run=args.dry_run)
    if not args.dry_run:
        save_state(next_state, args.state)

    payload = {
        "checked_at": datetime.fromtimestamp(now_ts, tz=timezone.utc).isoformat(),
        "snapshot": snapshot,
        "downed_agents": downed_agents,
        "plan": plan,
    }
    if args.json:
        print(json.dumps(payload, indent=2, sort_keys=True))
    else:
        print(render_summary(snapshot, plan))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
25
tests/test_night_shift_prediction_report.py
Normal file
@@ -0,0 +1,25 @@
from pathlib import Path


REPORT = Path("reports/night-shift-prediction-2026-04-12.md")


def test_prediction_report_exists_with_required_sections():
    assert REPORT.exists(), "expected night shift prediction report to exist"
    content = REPORT.read_text()
    assert "# Night Shift Prediction Report — April 12-13, 2026" in content
    assert "## Starting State (11:36 PM)" in content
    assert "## Burn Loops Active (13 @ every 3 min)" in content
    assert "## Expected Outcomes by 7 AM" in content
    assert "### Risk Factors" in content
    assert "### Confidence Level" in content
    assert "This report is a prediction" in content


def test_prediction_report_preserves_core_forecast_numbers():
    content = REPORT.read_text()
    assert "Total expected API calls: ~2,010" in content
    assert "Total commits pushed: ~800-1,200" in content
    assert "Total PRs created: ~150-250" in content
    assert "the-nexus | 30-50 | 200-300" in content
    assert "Generated: 2026-04-12 23:36 EDT" in content
118
tests/test_resurrection_pool.py
Normal file
@@ -0,0 +1,118 @@
from importlib import util
from pathlib import Path


ROOT = Path(__file__).resolve().parent.parent
MODULE_PATH = ROOT / "scripts" / "resurrection_pool.py"


def load_module():
    spec = util.spec_from_file_location("resurrection_pool", MODULE_PATH)
    module = util.module_from_spec(spec)
    assert spec.loader is not None
    spec.loader.exec_module(module)
    return module


def test_detect_downed_agents_respects_configurable_timeout():
    pool = load_module()
    snapshot = {
        "bezalel": {"healthy_now": False, "reasons": ["gateway_unreachable"]},
        "timmy": {"healthy_now": True, "reasons": []},
    }
    state = {
        "bezalel": {"last_healthy_at": 100.0},
        "timmy": {"last_healthy_at": 650.0},
    }
    policy = {"dead_timeout_seconds": 600, "agents": {}}

    not_dead = pool.detect_downed_agents(snapshot, state, policy, now_ts=650.0)
    assert not_dead["bezalel"]["dead"] is False
    assert not_dead["bezalel"]["unhealthy_for_seconds"] == 550.0

    dead = pool.detect_downed_agents(snapshot, state, policy, now_ts=701.0)
    assert dead["bezalel"]["dead"] is True
    assert dead["bezalel"]["timeout_seconds"] == 600
    assert "gateway_unreachable" in dead["bezalel"]["reasons"]


def test_update_state_records_last_healthy_timestamp():
    pool = load_module()
    snapshot = {
        "bezalel": {"healthy_now": True, "reasons": []},
        "ezra": {"healthy_now": False, "reasons": ["service_inactive"]},
    }
    updated = pool.update_state(snapshot, {}, now_ts=1234.5)
    assert updated["bezalel"]["last_healthy_at"] == 1234.5
    assert updated["ezra"]["last_healthy_at"] is None
    assert updated["ezra"]["last_reasons"] == ["service_inactive"]


def test_plan_resurrections_prefers_auto_restart_for_yes_policy():
    pool = load_module()
    registry = {
        "fleet": {
            "bezalel": {"mission": "forge", "host": "127.0.0.1"},
            "allegro": {"mission": "forge", "host": "203.0.113.10"},
        }
    }
    downed = {
        "bezalel": {"dead": True, "reasons": ["gateway_unreachable"], "timeout_seconds": 600}
    }
    health = {
        "bezalel": {"healthy_now": False},
        "allegro": {"healthy_now": True},
    }
    policy = {
        "default_policy": {"mode": "ask"},
        "missions": {"forge": {"mode": "yes"}},
        "substitutions": {"bezalel": ["allegro"]},
        "approval_channels": {"telegram": {"enabled": True}, "nostr": {"enabled": True}},
    }
    plan = pool.plan_resurrections(registry, downed, health, policy, now_ts=2000.0)
    assert len(plan) == 1
    assert plan[0]["agent"] == "bezalel"
    assert plan[0]["policy"]["mode"] == "yes"
    assert plan[0]["action"] == "auto_restart"
    assert plan[0]["substitute"] == "allegro"
    assert "systemctl restart hermes-bezalel.service" in plan[0]["restart_command"]


def test_resolve_policy_applies_mission_defaults_after_agent_override_sets_mission():
    pool = load_module()
    decision = pool.resolve_policy(
        "bezalel",
        {},
        {
            "default_policy": {"mode": "ask"},
            "missions": {"forge": {"mode": "yes"}},
            "agents": {"bezalel": {"mission": "forge"}},
        },
    )
    assert decision["mission"] == "forge"
    assert decision["mode"] == "yes"


def test_plan_resurrections_builds_approval_request_for_ask_policy():
    pool = load_module()
    registry = {"fleet": {"ezra": {"mission": "archive", "host": "203.0.113.20"}}}
    downed = {"ezra": {"dead": True, "reasons": ["service_inactive"], "timeout_seconds": 900}}
    health = {"ezra": {"healthy_now": False}, "timmy": {"healthy_now": True}}
    policy = {
        "default_policy": {"mode": "ask"},
        "agents": {"ezra": {"mode": "ask", "mission": "archive"}},
        "substitutions": {"ezra": ["timmy"]},
        "approval_channels": {
            "telegram": {"enabled": True, "target": "ops-room"},
            "nostr": {"enabled": True, "target": "nostr-ops"},
        },
    }
    plan = pool.plan_resurrections(registry, downed, health, policy, now_ts=3000.0)
    assert plan[0]["action"] == "approval_required"
    approval = plan[0]["approval_request"]
    assert approval["channels"]["telegram"]["enabled"] is True
    assert approval["channels"]["telegram"]["target"] == "ops-room"
    assert approval["channels"]["nostr"]["target"] == "nostr-ops"
    assert "#882" in approval["message"]
    assert "ezra" in approval["message"].lower()
    assert approval["substitute"] == "timmy"
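
Reviewer sketch (not part of the diff): the tests above pin down the observable behavior of `detect_downed_agents` and `update_state` but this view does not show their bodies. The block below is a minimal sketch of one implementation that would satisfy those two tests, not the code in `scripts/resurrection_pool.py`. Several details are assumptions: the per-agent `dead_timeout_seconds` override key, the 600-second fallback, the strict `>` comparison, skipping healthy agents in the result, and treating an agent with no recorded healthy timestamp as not yet dead.

```python
from typing import Any, Dict, Optional

Health = Dict[str, Dict[str, Any]]


def update_state(snapshot: Health, state: Health, now_ts: float) -> Health:
    """Return a new state mapping with refreshed health bookkeeping."""
    # Assumption: agents missing from this snapshot keep their old entry.
    updated = {name: dict(entry) for name, entry in state.items()}
    for name, health in snapshot.items():
        previous = state.get(name, {})
        if health.get("healthy_now"):
            last_healthy_at: Optional[float] = now_ts
        else:
            # Unhealthy agents keep whatever timestamp was recorded before (or None).
            last_healthy_at = previous.get("last_healthy_at")
        updated[name] = {
            "last_healthy_at": last_healthy_at,
            "last_reasons": list(health.get("reasons", [])),
        }
    return updated


def detect_downed_agents(
    snapshot: Health, state: Health, policy: Dict[str, Any], now_ts: float
) -> Health:
    """Report every unhealthy agent and whether it has exceeded its timeout."""
    # Assumption: 600 s fallback when the policy does not set a fleet-wide timeout.
    default_timeout = policy.get("dead_timeout_seconds", 600)
    overrides = policy.get("agents", {})
    downed: Health = {}
    for name, health in snapshot.items():
        if health.get("healthy_now"):
            continue  # Assumption: healthy agents are omitted from the result.
        # Assumption: a per-agent override uses the same key name as the default.
        timeout = overrides.get(name, {}).get("dead_timeout_seconds", default_timeout)
        last = state.get(name, {}).get("last_healthy_at")
        unhealthy_for = None if last is None else now_ts - last
        downed[name] = {
            # Assumption: an agent never seen healthy is not declared dead.
            "dead": unhealthy_for is not None and unhealthy_for > timeout,
            "unhealthy_for_seconds": unhealthy_for,
            "timeout_seconds": timeout,
            "reasons": list(health.get("reasons", [])),
        }
    return downed
```

With the fixtures from `test_detect_downed_agents_respects_configurable_timeout`, `bezalel` is reported as unhealthy but not dead at `now_ts=650.0` and as dead at `now_ts=701.0`. Both functions return new mappings rather than mutating their inputs, which keeps a dry run free of side effects until the state is saved.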