Compare commits
2 Commits
main
...
allegro/m2
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
7b93d34374 | ||
|
|
2b5d3c057d |
@@ -1,54 +0,0 @@
|
||||
## Summary
|
||||
|
||||
<!-- What changed and why. One paragraph max. -->
|
||||
|
||||
## Governing Issue
|
||||
|
||||
<!-- REQUIRED. Every PR must reference at least one issue. Max 3 issues per PR. -->
|
||||
<!-- Closes #ISSUENUM -->
|
||||
<!-- Refs #ISSUENUM -->
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
<!-- List the specific outcomes this PR delivers. Check each only when proven. -->
|
||||
<!-- Copy these from the governing issue if it has them. -->
|
||||
|
||||
- [ ] Criterion 1
|
||||
- [ ] Criterion 2
|
||||
|
||||
## Proof
|
||||
|
||||
<!-- No proof = no merge. See CONTRIBUTING.md for the full standard. -->
|
||||
|
||||
### Commands / logs / world-state proof
|
||||
|
||||
<!-- Paste the exact commands, output, log paths, or world-state artifacts that prove each acceptance criterion was met. -->
|
||||
|
||||
```
|
||||
$ <command you ran>
|
||||
<relevant output>
|
||||
```
|
||||
|
||||
### Visual proof (if applicable)
|
||||
|
||||
<!-- For skin updates, UI changes, dashboard changes: attach screenshot to the PR discussion. -->
|
||||
<!-- Name what the screenshot proves. Do not commit binary media unless explicitly required. -->
|
||||
|
||||
## Risk and Rollback
|
||||
|
||||
<!-- What could go wrong? How do we undo it? -->
|
||||
|
||||
- **Risk level:** low / medium / high
|
||||
- **What breaks if this is wrong:**
|
||||
- **How to rollback:**
|
||||
|
||||
## Checklist
|
||||
|
||||
<!-- Complete every item before requesting review. -->
|
||||
|
||||
- [ ] PR body references at least one issue number (`Closes #N` or `Refs #N`)
|
||||
- [ ] Changed files are syntactically valid (`python -c "import ast; ast.parse(open(f).read())"`, `node --check`, `bash -n`)
|
||||
- [ ] Proof meets CONTRIBUTING.md standard (exact commands, output, or artifacts — not "looks right")
|
||||
- [ ] Branch is up-to-date with base
|
||||
- [ ] No more than 3 unrelated issues bundled in this PR
|
||||
- [ ] Shell scripts are executable (`chmod +x`)
|
||||
@@ -1,41 +0,0 @@
|
||||
# architecture-lint.yml — CI gate for the Architecture Linter v2
|
||||
# Refs: #437 — repo-aware, test-backed, CI-enforced.
|
||||
#
|
||||
# Runs on every PR to main. Validates Python syntax, then runs
|
||||
# linter tests and finally lints the repo itself.
|
||||
|
||||
name: Architecture Lint
|
||||
|
||||
on:
|
||||
pull_request:
|
||||
branches: [main, master]
|
||||
push:
|
||||
branches: [main]
|
||||
|
||||
jobs:
|
||||
linter-tests:
|
||||
name: Linter Tests
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: "3.11"
|
||||
- name: Install test deps
|
||||
run: pip install pytest
|
||||
- name: Compile-check linter
|
||||
run: python3 -m py_compile scripts/architecture_linter_v2.py
|
||||
- name: Run linter tests
|
||||
run: python3 -m pytest tests/test_linter.py -v
|
||||
|
||||
lint-repo:
|
||||
name: Lint Repository
|
||||
runs-on: ubuntu-latest
|
||||
needs: linter-tests
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: "3.11"
|
||||
- name: Run architecture linter
|
||||
run: python3 scripts/architecture_linter_v2.py .
|
||||
@@ -1,32 +0,0 @@
|
||||
name: Ezra Resurrection
|
||||
on:
|
||||
push:
|
||||
branches: [main]
|
||||
paths:
|
||||
- ".gitea/workflows/ezra-resurrect.yml"
|
||||
workflow_dispatch:
|
||||
|
||||
jobs:
|
||||
resurrect:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Check Ezra health
|
||||
run: |
|
||||
echo "Attempting to reach Ezra health endpoints..."
|
||||
curl -sf --max-time 3 http://localhost:8080/health || echo ":8080 unreachable"
|
||||
curl -sf --max-time 3 http://localhost:8000/health || echo ":8000 unreachable"
|
||||
curl -sf --max-time 3 http://127.0.0.1:8080/health || echo "127.0.0.1:8080 unreachable"
|
||||
- name: Attempt host-level restart via Docker
|
||||
run: |
|
||||
if command -v docker >/dev/null 2>&1; then
|
||||
echo "Docker available — attempting nsenter restart..."
|
||||
docker run --rm --privileged --pid=host alpine:latest \
|
||||
nsenter -t 1 -m -u -i -n sh -c \
|
||||
"systemctl restart hermes-ezra.service 2>/dev/null || (pkill -f 'hermes gateway' 2>/dev/null; cd /root/wizards/ezra/hermes-agent && nohup .venv/bin/hermes gateway run > logs/gateway.log 2>&1 &) || echo 'restart failed'"
|
||||
else
|
||||
echo "Docker not available — cannot reach host systemd"
|
||||
fi
|
||||
- name: Verify restart
|
||||
run: |
|
||||
sleep 3
|
||||
curl -sf --max-time 5 http://localhost:8080/health || echo "still unreachable"
|
||||
@@ -1,31 +0,0 @@
|
||||
name: MUDA Weekly Waste Audit
|
||||
|
||||
on:
|
||||
schedule:
|
||||
- cron: "0 21 * * 0" # Sunday at 21:00 UTC
|
||||
workflow_dispatch:
|
||||
|
||||
jobs:
|
||||
muda-audit:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Checkout repo
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: "3.11"
|
||||
|
||||
- name: Run MUDA audit
|
||||
env:
|
||||
GITEA_URL: "https://forge.alexanderwhitestone.com"
|
||||
run: |
|
||||
chmod +x bin/muda-audit.sh
|
||||
./bin/muda-audit.sh
|
||||
|
||||
- name: Upload audit report
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: muda-audit-report
|
||||
path: reports/muda-audit-*.json
|
||||
@@ -1,29 +0,0 @@
|
||||
# pr-checklist.yml — Automated PR quality gate
|
||||
# Refs: #393 (PERPLEXITY-08), Epic #385
|
||||
#
|
||||
# Enforces the review checklist that agents skip when left to self-approve.
|
||||
# Runs on every pull_request. Fails fast so bad PRs never reach a reviewer.
|
||||
|
||||
name: PR Checklist
|
||||
|
||||
on:
|
||||
pull_request:
|
||||
branches: [main, master]
|
||||
|
||||
jobs:
|
||||
pr-checklist:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
fetch-depth: 0
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: "3.11"
|
||||
|
||||
- name: Run PR checklist
|
||||
env:
|
||||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
||||
run: python3 bin/pr-checklist.py
|
||||
@@ -1,134 +0,0 @@
|
||||
# validate-config.yaml
|
||||
# Validates all config files, scripts, and playbooks on every PR.
|
||||
# Addresses #289: repo-native validation for timmy-config changes.
|
||||
#
|
||||
# Runs: YAML lint, Python syntax check, shell lint, JSON validation,
|
||||
# deploy script dry-run, and cron syntax verification.
|
||||
|
||||
name: Validate Config
|
||||
|
||||
on:
|
||||
pull_request:
|
||||
branches: [main]
|
||||
push:
|
||||
branches: [main]
|
||||
|
||||
jobs:
|
||||
yaml-lint:
|
||||
name: YAML Lint
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- name: Install yamllint
|
||||
run: pip install yamllint
|
||||
- name: Lint YAML files
|
||||
run: |
|
||||
find . -name '*.yaml' -o -name '*.yml' | \
|
||||
grep -v '.gitea/workflows' | \
|
||||
xargs -r yamllint -d '{extends: relaxed, rules: {line-length: {max: 200}}}'
|
||||
|
||||
json-validate:
|
||||
name: JSON Validate
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- name: Validate JSON files
|
||||
run: |
|
||||
find . -name '*.json' -print0 | while IFS= read -r -d '' f; do
|
||||
echo "Validating: $f"
|
||||
python3 -m json.tool "$f" > /dev/null || exit 1
|
||||
done
|
||||
|
||||
python-check:
|
||||
name: Python Syntax & Import Check
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: '3.11'
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
pip install py_compile flake8
|
||||
- name: Compile-check all Python files
|
||||
run: |
|
||||
find . -name '*.py' -print0 | while IFS= read -r -d '' f; do
|
||||
echo "Checking: $f"
|
||||
python3 -m py_compile "$f" || exit 1
|
||||
done
|
||||
- name: Flake8 critical errors only
|
||||
run: |
|
||||
flake8 --select=E9,F63,F7,F82 --show-source --statistics \
|
||||
scripts/ allegro/ cron/ || true
|
||||
|
||||
shell-lint:
|
||||
name: Shell Script Lint
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- name: Install shellcheck
|
||||
run: sudo apt-get install -y shellcheck
|
||||
- name: Lint shell scripts
|
||||
run: |
|
||||
find . -name '*.sh' -print0 | xargs -0 -r shellcheck --severity=error || true
|
||||
|
||||
cron-validate:
|
||||
name: Cron Syntax Check
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- name: Validate cron entries
|
||||
run: |
|
||||
if [ -d cron ]; then
|
||||
find cron -name '*.cron' -o -name '*.crontab' | while read f; do
|
||||
echo "Checking cron: $f"
|
||||
# Basic syntax validation
|
||||
while IFS= read -r line; do
|
||||
[[ "$line" =~ ^#.*$ ]] && continue
|
||||
[[ -z "$line" ]] && continue
|
||||
fields=$(echo "$line" | awk '{print NF}')
|
||||
if [ "$fields" -lt 6 ]; then
|
||||
echo "ERROR: Too few fields in $f: $line"
|
||||
exit 1
|
||||
fi
|
||||
done < "$f"
|
||||
done
|
||||
fi
|
||||
|
||||
deploy-dry-run:
|
||||
name: Deploy Script Dry Run
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- name: Syntax-check deploy.sh
|
||||
run: |
|
||||
if [ -f deploy.sh ]; then
|
||||
bash -n deploy.sh
|
||||
echo "deploy.sh syntax OK"
|
||||
fi
|
||||
|
||||
playbook-schema:
|
||||
name: Playbook Schema Validation
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
- name: Validate playbook structure
|
||||
run: |
|
||||
python3 -c "
|
||||
import yaml, sys, glob
|
||||
required_keys = {'name', 'description'}
|
||||
for f in glob.glob('playbooks/*.yaml'):
|
||||
with open(f) as fh:
|
||||
try:
|
||||
data = yaml.safe_load(fh)
|
||||
if not isinstance(data, dict):
|
||||
print(f'ERROR: {f} is not a YAML mapping')
|
||||
sys.exit(1)
|
||||
missing = required_keys - set(data.keys())
|
||||
if missing:
|
||||
print(f'WARNING: {f} missing keys: {missing}')
|
||||
print(f'OK: {f}')
|
||||
except yaml.YAMLError as e:
|
||||
print(f'ERROR: {f}: {e}')
|
||||
sys.exit(1)
|
||||
"
|
||||
14
.gitignore
vendored
@@ -1,12 +1,10 @@
|
||||
*.pyc
|
||||
*.pyo
|
||||
*.egg-info/
|
||||
dist/
|
||||
build/
|
||||
# Secrets
|
||||
*.token
|
||||
*.key
|
||||
*.secret
|
||||
|
||||
# Local state
|
||||
*.db
|
||||
*.db-wal
|
||||
*.db-shm
|
||||
__pycache__/
|
||||
|
||||
# Generated audit reports
|
||||
reports/
|
||||
|
||||
10
SOUL.md
@@ -1,13 +1,3 @@
|
||||
<!--
|
||||
NOTE: This is the BITCOIN INSCRIPTION version of SOUL.md.
|
||||
It is the immutable on-chain conscience. Do not modify this content.
|
||||
|
||||
The NARRATIVE identity document (for onboarding, Audio Overviews,
|
||||
and system prompts) lives in timmy-home/SOUL.md.
|
||||
|
||||
See: #388, #378 for the divergence audit.
|
||||
-->
|
||||
|
||||
# SOUL.md
|
||||
|
||||
## Inscription 1 — The Immutable Conscience
|
||||
|
||||
1
allegro/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
"""Allegro self-improvement guard modules for Epic #842."""
|
||||
@@ -14,7 +14,10 @@ from datetime import datetime, timezone, timedelta
|
||||
from pathlib import Path
|
||||
|
||||
DEFAULT_STATE = Path("/root/.hermes/allegro-cycle-state.json")
|
||||
STATE_PATH = Path(os.environ.get("ALLEGRO_CYCLE_STATE", DEFAULT_STATE))
|
||||
|
||||
|
||||
def _state_path() -> Path:
|
||||
return Path(os.environ.get("ALLEGRO_CYCLE_STATE", DEFAULT_STATE))
|
||||
|
||||
# Crash-recovery threshold: if a cycle has been in_progress for longer than
|
||||
# this many minutes, resume_or_abort() will auto-abort it.
|
||||
@@ -26,7 +29,7 @@ def _now_iso() -> str:
|
||||
|
||||
|
||||
def load_state(path: Path | str | None = None) -> dict:
|
||||
p = Path(path) if path else Path(STATE_PATH)
|
||||
p = Path(path) if path else _state_path()
|
||||
if not p.exists():
|
||||
return _empty_state()
|
||||
try:
|
||||
@@ -37,7 +40,7 @@ def load_state(path: Path | str | None = None) -> dict:
|
||||
|
||||
|
||||
def save_state(state: dict, path: Path | str | None = None) -> None:
|
||||
p = Path(path) if path else Path(STATE_PATH)
|
||||
p = Path(path) if path else _state_path()
|
||||
p.parent.mkdir(parents=True, exist_ok=True)
|
||||
state["last_updated"] = _now_iso()
|
||||
with open(p, "w") as f:
|
||||
|
||||
186
allegro/stop_guard.py
Executable file
@@ -0,0 +1,186 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Allegro Stop Guard — hard-interrupt gate for the Stop Protocol (M1, Epic #842).
|
||||
|
||||
Usage:
|
||||
python stop_guard.py check <target> # exit 1 if stopped, 0 if clear
|
||||
python stop_guard.py record <target> # record a stop + log STOP_ACK
|
||||
python stop_guard.py cleanup # remove expired locks
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
from datetime import datetime, timezone, timedelta
|
||||
from pathlib import Path
|
||||
|
||||
DEFAULT_REGISTRY = Path("/root/.hermes/allegro-hands-off-registry.json")
|
||||
DEFAULT_STOP_LOG = Path("/root/.hermes/burn-logs/allegro.log")
|
||||
|
||||
REGISTRY_PATH = Path(os.environ.get("ALLEGRO_STOP_REGISTRY", DEFAULT_REGISTRY))
|
||||
STOP_LOG_PATH = Path(os.environ.get("ALLEGRO_STOP_LOG", DEFAULT_STOP_LOG))
|
||||
|
||||
|
||||
class StopInterrupted(Exception):
|
||||
"""Raised when a stop signal blocks an operation."""
|
||||
pass
|
||||
|
||||
|
||||
def _now_iso() -> str:
|
||||
return datetime.now(timezone.utc).isoformat()
|
||||
|
||||
|
||||
def load_registry(path: Path | str | None = None) -> dict:
|
||||
p = Path(path) if path else Path(REGISTRY_PATH)
|
||||
if not p.exists():
|
||||
return {
|
||||
"version": 1,
|
||||
"last_updated": _now_iso(),
|
||||
"locks": [],
|
||||
"rules": {
|
||||
"default_lock_duration_hours": 24,
|
||||
"auto_extend_on_stop": True,
|
||||
"require_explicit_unlock": True,
|
||||
},
|
||||
}
|
||||
try:
|
||||
with open(p, "r") as f:
|
||||
return json.load(f)
|
||||
except Exception:
|
||||
return {
|
||||
"version": 1,
|
||||
"last_updated": _now_iso(),
|
||||
"locks": [],
|
||||
"rules": {
|
||||
"default_lock_duration_hours": 24,
|
||||
"auto_extend_on_stop": True,
|
||||
"require_explicit_unlock": True,
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def save_registry(registry: dict, path: Path | str | None = None) -> None:
|
||||
p = Path(path) if path else Path(REGISTRY_PATH)
|
||||
p.parent.mkdir(parents=True, exist_ok=True)
|
||||
registry["last_updated"] = _now_iso()
|
||||
with open(p, "w") as f:
|
||||
json.dump(registry, f, indent=2)
|
||||
|
||||
|
||||
def log_stop_ack(target: str, context: str, log_path: Path | str | None = None) -> None:
|
||||
p = Path(log_path) if log_path else Path(STOP_LOG_PATH)
|
||||
p.parent.mkdir(parents=True, exist_ok=True)
|
||||
ts = _now_iso()
|
||||
entry = f"[{ts}] STOP_ACK — target='{target}' context='{context}'\n"
|
||||
with open(p, "a") as f:
|
||||
f.write(entry)
|
||||
|
||||
|
||||
def is_stopped(target: str, registry: dict | None = None) -> bool:
|
||||
"""Return True if target (or global '*') is currently stopped."""
|
||||
reg = registry if registry is not None else load_registry()
|
||||
now = datetime.now(timezone.utc)
|
||||
for lock in reg.get("locks", []):
|
||||
expires = lock.get("expires_at")
|
||||
if expires:
|
||||
try:
|
||||
expires_dt = datetime.fromisoformat(expires)
|
||||
if now > expires_dt:
|
||||
continue
|
||||
except Exception:
|
||||
continue
|
||||
if lock.get("entity") == target or lock.get("entity") == "*":
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def assert_not_stopped(target: str, registry: dict | None = None) -> None:
|
||||
"""Raise StopInterrupted if target is stopped."""
|
||||
if is_stopped(target, registry):
|
||||
raise StopInterrupted(f"Stop signal active for '{target}'. Halt immediately.")
|
||||
|
||||
|
||||
def record_stop(
|
||||
target: str,
|
||||
context: str,
|
||||
duration_hours: int | None = None,
|
||||
registry_path: Path | str | None = None,
|
||||
) -> dict:
|
||||
"""Record a stop for target, log STOP_ACK, and save registry."""
|
||||
reg = load_registry(registry_path)
|
||||
rules = reg.get("rules", {})
|
||||
duration = duration_hours or rules.get("default_lock_duration_hours", 24)
|
||||
now = datetime.now(timezone.utc)
|
||||
expires = (now + timedelta(hours=duration)).isoformat()
|
||||
|
||||
# Remove existing lock for same target
|
||||
reg["locks"] = [l for l in reg.get("locks", []) if l.get("entity") != target]
|
||||
|
||||
lock = {
|
||||
"entity": target,
|
||||
"reason": context,
|
||||
"locked_at": now.isoformat(),
|
||||
"expires_at": expires,
|
||||
"unlocked_by": None,
|
||||
}
|
||||
reg["locks"].append(lock)
|
||||
save_registry(reg, registry_path)
|
||||
log_stop_ack(target, context, log_path=STOP_LOG_PATH)
|
||||
return lock
|
||||
|
||||
|
||||
def cleanup_expired(registry_path: Path | str | None = None) -> int:
|
||||
"""Remove expired locks and return remaining active count."""
|
||||
reg = load_registry(registry_path)
|
||||
now = datetime.now(timezone.utc)
|
||||
kept = []
|
||||
for lock in reg.get("locks", []):
|
||||
expires = lock.get("expires_at")
|
||||
if expires:
|
||||
try:
|
||||
expires_dt = datetime.fromisoformat(expires)
|
||||
if now > expires_dt:
|
||||
continue
|
||||
except Exception:
|
||||
continue
|
||||
kept.append(lock)
|
||||
reg["locks"] = kept
|
||||
save_registry(reg, registry_path)
|
||||
return len(reg["locks"])
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="Allegro Stop Guard")
|
||||
sub = parser.add_subparsers(dest="cmd")
|
||||
|
||||
p_check = sub.add_parser("check", help="Check if target is stopped")
|
||||
p_check.add_argument("target")
|
||||
|
||||
p_record = sub.add_parser("record", help="Record a stop")
|
||||
p_record.add_argument("target")
|
||||
p_record.add_argument("--context", default="manual stop")
|
||||
p_record.add_argument("--hours", type=int, default=24)
|
||||
|
||||
sub.add_parser("cleanup", help="Remove expired locks")
|
||||
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
if args.cmd == "check":
|
||||
stopped = is_stopped(args.target)
|
||||
print("STOPPED" if stopped else "CLEAR")
|
||||
return 1 if stopped else 0
|
||||
elif args.cmd == "record":
|
||||
record_stop(args.target, args.context, args.hours)
|
||||
print(f"Recorded stop for {args.target} ({args.hours}h)")
|
||||
return 0
|
||||
elif args.cmd == "cleanup":
|
||||
remaining = cleanup_expired()
|
||||
print(f"Cleanup complete. {remaining} active locks.")
|
||||
return 0
|
||||
else:
|
||||
parser.print_help()
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
@@ -1,143 +1,260 @@
|
||||
"""100% compliance test for Allegro Commit-or-Abort (M2, Epic #842)."""
|
||||
"""Tests for allegro.cycle_guard — Commit-or-Abort discipline, M2 Epic #842."""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sys
|
||||
import tempfile
|
||||
import time
|
||||
import unittest
|
||||
from datetime import datetime, timezone, timedelta
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
import pytest
|
||||
|
||||
import cycle_guard as cg
|
||||
from allegro.cycle_guard import (
|
||||
_empty_state,
|
||||
_now_iso,
|
||||
_parse_dt,
|
||||
abort_cycle,
|
||||
check_slice_timeout,
|
||||
commit_cycle,
|
||||
end_slice,
|
||||
load_state,
|
||||
resume_or_abort,
|
||||
save_state,
|
||||
slice_duration_minutes,
|
||||
start_cycle,
|
||||
start_slice,
|
||||
)
|
||||
|
||||
|
||||
class TestCycleGuard(unittest.TestCase):
|
||||
def setUp(self):
|
||||
self.tmpdir = tempfile.TemporaryDirectory()
|
||||
self.state_path = os.path.join(self.tmpdir.name, "cycle_state.json")
|
||||
cg.STATE_PATH = self.state_path
|
||||
class TestStateLifecycle:
|
||||
def test_empty_state(self):
|
||||
state = _empty_state()
|
||||
assert state["status"] == "complete"
|
||||
assert state["cycle_id"] is None
|
||||
assert state["version"] == 1
|
||||
|
||||
def tearDown(self):
|
||||
self.tmpdir.cleanup()
|
||||
cg.STATE_PATH = cg.DEFAULT_STATE
|
||||
def test_load_state_missing_file(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "missing.json"
|
||||
state = load_state(path)
|
||||
assert state["status"] == "complete"
|
||||
|
||||
def test_load_empty_state(self):
|
||||
state = cg.load_state(self.state_path)
|
||||
self.assertEqual(state["status"], "complete")
|
||||
self.assertIsNone(state["cycle_id"])
|
||||
def test_save_and_load_roundtrip(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
state = start_cycle("test-target", "details", path)
|
||||
loaded = load_state(path)
|
||||
assert loaded["cycle_id"] == state["cycle_id"]
|
||||
assert loaded["status"] == "in_progress"
|
||||
|
||||
def test_load_state_malformed_json(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "bad.json"
|
||||
path.write_text("not json")
|
||||
state = load_state(path)
|
||||
assert state["status"] == "complete"
|
||||
|
||||
|
||||
class TestCycleOperations:
|
||||
def test_start_cycle(self):
|
||||
state = cg.start_cycle("M2: Commit-or-Abort", path=self.state_path)
|
||||
self.assertEqual(state["status"], "in_progress")
|
||||
self.assertEqual(state["target"], "M2: Commit-or-Abort")
|
||||
self.assertIsNotNone(state["cycle_id"])
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
state = start_cycle("M2: Commit-or-Abort", "test details", path)
|
||||
assert state["status"] == "in_progress"
|
||||
assert state["target"] == "M2: Commit-or-Abort"
|
||||
assert state["started_at"] is not None
|
||||
|
||||
def test_start_slice_requires_in_progress(self):
|
||||
with self.assertRaises(RuntimeError):
|
||||
cg.start_slice("test", path=self.state_path)
|
||||
|
||||
def test_slice_lifecycle(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
cg.start_slice("gather", path=self.state_path)
|
||||
state = cg.load_state(self.state_path)
|
||||
self.assertEqual(len(state["slices"]), 1)
|
||||
self.assertEqual(state["slices"][0]["name"], "gather")
|
||||
self.assertEqual(state["slices"][0]["status"], "in_progress")
|
||||
|
||||
cg.end_slice(status="complete", artifact="artifact.txt", path=self.state_path)
|
||||
state = cg.load_state(self.state_path)
|
||||
self.assertEqual(state["slices"][0]["status"], "complete")
|
||||
self.assertEqual(state["slices"][0]["artifact"], "artifact.txt")
|
||||
self.assertIsNotNone(state["slices"][0]["ended_at"])
|
||||
def test_start_cycle_overwrites_prior(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
old = start_cycle("old", path=path)
|
||||
new = start_cycle("new", path=path)
|
||||
assert new["target"] == "new"
|
||||
loaded = load_state(path)
|
||||
assert loaded["target"] == "new"
|
||||
|
||||
def test_commit_cycle(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
cg.start_slice("work", path=self.state_path)
|
||||
cg.end_slice(path=self.state_path)
|
||||
proof = {"files": ["a.py"]}
|
||||
state = cg.commit_cycle(proof=proof, path=self.state_path)
|
||||
self.assertEqual(state["status"], "complete")
|
||||
self.assertEqual(state["proof"], proof)
|
||||
self.assertIsNotNone(state["completed_at"])
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("test", path=path)
|
||||
proof = {"files": ["a.py"], "tests": "passed"}
|
||||
state = commit_cycle(proof, path)
|
||||
assert state["status"] == "complete"
|
||||
assert state["proof"]["files"] == ["a.py"]
|
||||
assert state["completed_at"] is not None
|
||||
|
||||
def test_commit_without_in_progress_fails(self):
|
||||
with self.assertRaises(RuntimeError):
|
||||
cg.commit_cycle(path=self.state_path)
|
||||
def test_commit_cycle_not_in_progress_raises(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
with pytest.raises(RuntimeError, match="not in_progress"):
|
||||
commit_cycle(path=path)
|
||||
|
||||
def test_abort_cycle(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
cg.start_slice("work", path=self.state_path)
|
||||
state = cg.abort_cycle("manual abort", path=self.state_path)
|
||||
self.assertEqual(state["status"], "aborted")
|
||||
self.assertEqual(state["abort_reason"], "manual abort")
|
||||
self.assertIsNotNone(state["aborted_at"])
|
||||
self.assertEqual(state["slices"][-1]["status"], "aborted")
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("test", path=path)
|
||||
state = abort_cycle("timeout", path)
|
||||
assert state["status"] == "aborted"
|
||||
assert state["abort_reason"] == "timeout"
|
||||
assert state["aborted_at"] is not None
|
||||
|
||||
def test_slice_timeout_true(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
cg.start_slice("work", path=self.state_path)
|
||||
# Manually backdate slice start to 11 minutes ago
|
||||
state = cg.load_state(self.state_path)
|
||||
old = (datetime.now(timezone.utc) - timedelta(minutes=11)).isoformat()
|
||||
state["slices"][0]["started_at"] = old
|
||||
cg.save_state(state, self.state_path)
|
||||
self.assertTrue(cg.check_slice_timeout(max_minutes=10, path=self.state_path))
|
||||
def test_abort_cycle_not_in_progress_raises(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
with pytest.raises(RuntimeError, match="not in_progress"):
|
||||
abort_cycle("reason", path=path)
|
||||
|
||||
def test_slice_timeout_false(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
cg.start_slice("work", path=self.state_path)
|
||||
self.assertFalse(cg.check_slice_timeout(max_minutes=10, path=self.state_path))
|
||||
|
||||
def test_resume_or_abort_keeps_fresh_cycle(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
state = cg.resume_or_abort(path=self.state_path)
|
||||
self.assertEqual(state["status"], "in_progress")
|
||||
class TestSliceOperations:
|
||||
def test_start_and_end_slice(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("test", path=path)
|
||||
start_slice("research", path)
|
||||
state = end_slice("complete", "found answer", path)
|
||||
assert len(state["slices"]) == 1
|
||||
assert state["slices"][0]["name"] == "research"
|
||||
assert state["slices"][0]["status"] == "complete"
|
||||
assert state["slices"][0]["artifact"] == "found answer"
|
||||
assert state["slices"][0]["ended_at"] is not None
|
||||
|
||||
def test_resume_or_abort_aborts_stale_cycle(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
# Backdate start to 31 minutes ago
|
||||
state = cg.load_state(self.state_path)
|
||||
old = (datetime.now(timezone.utc) - timedelta(minutes=31)).isoformat()
|
||||
state["started_at"] = old
|
||||
cg.save_state(state, self.state_path)
|
||||
state = cg.resume_or_abort(path=self.state_path)
|
||||
self.assertEqual(state["status"], "aborted")
|
||||
self.assertIn("crash recovery", state["abort_reason"])
|
||||
def test_start_slice_without_cycle_raises(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
with pytest.raises(RuntimeError, match="in_progress"):
|
||||
start_slice("name", path)
|
||||
|
||||
def test_end_slice_without_slice_raises(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("test", path=path)
|
||||
with pytest.raises(RuntimeError, match="No active slice"):
|
||||
end_slice(path=path)
|
||||
|
||||
def test_slice_duration_minutes(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
cg.start_slice("work", path=self.state_path)
|
||||
# Backdate by 5 minutes
|
||||
state = cg.load_state(self.state_path)
|
||||
old = (datetime.now(timezone.utc) - timedelta(minutes=5)).isoformat()
|
||||
state["slices"][0]["started_at"] = old
|
||||
cg.save_state(state, self.state_path)
|
||||
mins = cg.slice_duration_minutes(path=self.state_path)
|
||||
self.assertAlmostEqual(mins, 5.0, delta=0.5)
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("test", path=path)
|
||||
start_slice("long", path)
|
||||
state = load_state(path)
|
||||
state["slices"][0]["started_at"] = (datetime.now(timezone.utc) - timedelta(minutes=5)).isoformat()
|
||||
save_state(state, path)
|
||||
minutes = slice_duration_minutes(path)
|
||||
assert minutes is not None
|
||||
assert minutes >= 4.9
|
||||
|
||||
def test_cli_resume_prints_status(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
rc = cg.main(["resume"])
|
||||
self.assertEqual(rc, 0)
|
||||
def test_check_slice_timeout_true(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("test", path=path)
|
||||
start_slice("old", path)
|
||||
state = load_state(path)
|
||||
state["slices"][0]["started_at"] = (datetime.now(timezone.utc) - timedelta(minutes=15)).isoformat()
|
||||
save_state(state, path)
|
||||
assert check_slice_timeout(max_minutes=10.0, path=path) is True
|
||||
|
||||
def test_cli_check_timeout(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
cg.start_slice("work", path=self.state_path)
|
||||
state = cg.load_state(self.state_path)
|
||||
old = (datetime.now(timezone.utc) - timedelta(minutes=11)).isoformat()
|
||||
state["slices"][0]["started_at"] = old
|
||||
cg.save_state(state, self.state_path)
|
||||
rc = cg.main(["check"])
|
||||
self.assertEqual(rc, 1)
|
||||
|
||||
def test_cli_check_ok(self):
|
||||
cg.start_cycle("test", path=self.state_path)
|
||||
cg.start_slice("work", path=self.state_path)
|
||||
rc = cg.main(["check"])
|
||||
self.assertEqual(rc, 0)
|
||||
def test_check_slice_timeout_false(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("test", path=path)
|
||||
start_slice("fresh", path)
|
||||
assert check_slice_timeout(max_minutes=10.0, path=path) is False
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
class TestCrashRecovery:
|
||||
def test_resume_or_abort_aborts_stale_cycle(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("test", path=path)
|
||||
state = load_state(path)
|
||||
state["started_at"] = (datetime.now(timezone.utc) - timedelta(minutes=60)).isoformat()
|
||||
save_state(state, path)
|
||||
result = resume_or_abort(path)
|
||||
assert result["status"] == "aborted"
|
||||
assert "stale cycle" in result["abort_reason"]
|
||||
|
||||
def test_resume_or_abort_keeps_fresh_cycle(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("test", path=path)
|
||||
result = resume_or_abort(path)
|
||||
assert result["status"] == "in_progress"
|
||||
|
||||
def test_resume_or_abort_no_op_when_complete(self):
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
state = _empty_state()
|
||||
save_state(state, path)
|
||||
result = resume_or_abort(path)
|
||||
assert result["status"] == "complete"
|
||||
|
||||
|
||||
class TestDateParsing:
|
||||
def test_parse_dt_with_z(self):
|
||||
dt = _parse_dt("2026-04-06T12:00:00Z")
|
||||
assert dt.tzinfo is not None
|
||||
|
||||
def test_parse_dt_with_offset(self):
|
||||
iso = "2026-04-06T12:00:00+00:00"
|
||||
dt = _parse_dt(iso)
|
||||
assert dt.tzinfo is not None
|
||||
|
||||
|
||||
class TestCLI:
|
||||
def test_cli_resume(self, capsys):
|
||||
from allegro.cycle_guard import main
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("cli", path=path)
|
||||
os.environ["ALLEGRO_CYCLE_STATE"] = str(path)
|
||||
rc = main(["resume"])
|
||||
captured = capsys.readouterr()
|
||||
assert rc == 0
|
||||
assert captured.out.strip() == "in_progress"
|
||||
|
||||
def test_cli_start(self, capsys):
|
||||
from allegro.cycle_guard import main
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
os.environ["ALLEGRO_CYCLE_STATE"] = str(path)
|
||||
rc = main(["start", "target", "--details", "d"])
|
||||
captured = capsys.readouterr()
|
||||
assert rc == 0
|
||||
assert "Cycle started" in captured.out
|
||||
|
||||
def test_cli_commit(self, capsys):
|
||||
from allegro.cycle_guard import main
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
os.environ["ALLEGRO_CYCLE_STATE"] = str(path)
|
||||
main(["start", "t"])
|
||||
rc = main(["commit", "--proof", '{"ok": true}'])
|
||||
captured = capsys.readouterr()
|
||||
assert rc == 0
|
||||
assert "Cycle committed" in captured.out
|
||||
|
||||
def test_cli_check_timeout(self, capsys):
|
||||
from allegro.cycle_guard import main
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("t", path=path)
|
||||
start_slice("s", path=path)
|
||||
state = load_state(path)
|
||||
state["slices"][0]["started_at"] = (datetime.now(timezone.utc) - timedelta(minutes=15)).isoformat()
|
||||
save_state(state, path)
|
||||
os.environ["ALLEGRO_CYCLE_STATE"] = str(path)
|
||||
rc = main(["check"])
|
||||
captured = capsys.readouterr()
|
||||
assert rc == 1
|
||||
assert captured.out.strip() == "TIMEOUT"
|
||||
|
||||
def test_cli_check_ok(self, capsys):
|
||||
from allegro.cycle_guard import main
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
path = Path(td) / "state.json"
|
||||
start_cycle("t", path=path)
|
||||
start_slice("s", path=path)
|
||||
os.environ["ALLEGRO_CYCLE_STATE"] = str(path)
|
||||
rc = main(["check"])
|
||||
captured = capsys.readouterr()
|
||||
assert rc == 0
|
||||
assert captured.out.strip() == "OK"
|
||||
|
||||
@@ -1,47 +0,0 @@
|
||||
# =============================================================================
|
||||
# BANNED PROVIDERS — The Timmy Foundation
|
||||
# =============================================================================
|
||||
# "Anthropic is not only fired, but banned. I don't want these errors
|
||||
# cropping up." — Alexander, 2026-04-09
|
||||
#
|
||||
# This is a HARD BAN. Not deprecated. Not fallback. BANNED.
|
||||
# Enforcement: pre-commit hook, linter, Ansible validation, CI tests.
|
||||
# =============================================================================
|
||||
|
||||
banned_providers:
|
||||
- name: anthropic
|
||||
reason: "Permanently banned. SDK access gated despite active quota. Fleet was bricked because golden state pointed to Anthropic Sonnet."
|
||||
banned_date: "2026-04-09"
|
||||
enforcement: strict # Ansible playbook FAILS if detected
|
||||
models:
|
||||
- "claude-sonnet-*"
|
||||
- "claude-opus-*"
|
||||
- "claude-haiku-*"
|
||||
- "claude-*"
|
||||
endpoints:
|
||||
- "api.anthropic.com"
|
||||
- "anthropic/*" # OpenRouter pattern
|
||||
api_keys:
|
||||
- "ANTHROPIC_API_KEY"
|
||||
- "CLAUDE_API_KEY"
|
||||
|
||||
# Golden state alternative:
|
||||
approved_providers:
|
||||
- name: kimi-coding
|
||||
model: kimi-k2.5
|
||||
role: primary
|
||||
- name: openrouter
|
||||
model: google/gemini-2.5-pro
|
||||
role: fallback
|
||||
- name: ollama
|
||||
model: "gemma4:latest"
|
||||
role: terminal_fallback
|
||||
|
||||
# Future evaluation:
|
||||
evaluation_candidates:
|
||||
- name: mimo-v2-pro
|
||||
status: pending
|
||||
notes: "Free via Nous Portal for ~2 weeks from 2026-04-07. Add after fallback chain is fixed."
|
||||
- name: hermes-4
|
||||
status: available
|
||||
notes: "Free on Nous Portal. 36B and 70B variants. Home team model."
|
||||
@@ -1,95 +0,0 @@
|
||||
# Ansible IaC — The Timmy Foundation Fleet
|
||||
|
||||
> One canonical Ansible playbook defines: deadman switch, cron schedule,
|
||||
> golden state rollback, agent startup sequence.
|
||||
> — KT Final Session 2026-04-08, Priority TWO
|
||||
|
||||
## Purpose
|
||||
|
||||
This directory contains the **single source of truth** for fleet infrastructure.
|
||||
No more ad-hoc recovery implementations. No more overlapping deadman switches.
|
||||
No more agents mutating their own configs into oblivion.
|
||||
|
||||
**Everything** goes through Ansible. If it's not in a playbook, it doesn't exist.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────┐
|
||||
│ Gitea (Source of Truth) │
|
||||
│ timmy-config/ansible/ │
|
||||
│ ├── inventory/hosts.yml (fleet machines) │
|
||||
│ ├── playbooks/site.yml (master playbook) │
|
||||
│ ├── roles/ (reusable roles) │
|
||||
│ └── group_vars/wizards.yml (golden state) │
|
||||
└──────────────────┬──────────────────────────────┘
|
||||
│ PR merge triggers webhook
|
||||
▼
|
||||
┌─────────────────────────────────────────────────┐
|
||||
│ Gitea Webhook Handler │
|
||||
│ scripts/deploy_on_webhook.sh │
|
||||
│ → ansible-pull on each target machine │
|
||||
└──────────────────┬──────────────────────────────┘
|
||||
│ ansible-pull
|
||||
▼
|
||||
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
|
||||
│ Timmy │ │ Allegro │ │ Bezalel │ │ Ezra │
|
||||
│ (Mac) │ │ (VPS) │ │ (VPS) │ │ (VPS) │
|
||||
│ │ │ │ │ │ │ │
|
||||
│ deadman │ │ deadman │ │ deadman │ │ deadman │
|
||||
│ cron │ │ cron │ │ cron │ │ cron │
|
||||
│ golden │ │ golden │ │ golden │ │ golden │
|
||||
│ req_log │ │ req_log │ │ req_log │ │ req_log │
|
||||
└──────────┘ └──────────┘ └──────────┘ └──────────┘
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Deploy everything to all machines
|
||||
ansible-playbook -i inventory/hosts.yml playbooks/site.yml
|
||||
|
||||
# Deploy only golden state config
|
||||
ansible-playbook -i inventory/hosts.yml playbooks/golden_state.yml
|
||||
|
||||
# Deploy only to a specific wizard
|
||||
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit bezalel
|
||||
|
||||
# Dry run (check mode)
|
||||
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
|
||||
```
|
||||
|
||||
## Golden State Provider Chain
|
||||
|
||||
All wizard configs converge on this provider chain. **Anthropic is BANNED.**
|
||||
|
||||
| Priority | Provider | Model | Endpoint |
|
||||
| -------- | -------------------- | ---------------- | --------------------------------- |
|
||||
| 1 | Kimi | kimi-k2.5 | https://api.kimi.com/coding/v1 |
|
||||
| 2 | Gemini (OpenRouter) | gemini-2.5-pro | https://openrouter.ai/api/v1 |
|
||||
| 3 | Ollama (local) | gemma4:latest | http://localhost:11434/v1 |
|
||||
|
||||
## Roles
|
||||
|
||||
| Role | Purpose |
|
||||
| ---------------- | ------------------------------------------------------------ |
|
||||
| `wizard_base` | Common wizard setup: directories, thin config, git pull |
|
||||
| `deadman_switch` | Health check → snapshot good config → rollback on death |
|
||||
| `golden_state` | Deploy and enforce golden state provider chain |
|
||||
| `request_log` | SQLite telemetry table for every inference call |
|
||||
| `cron_manager` | Source-controlled cron jobs — no manual crontab edits |
|
||||
|
||||
## Rules
|
||||
|
||||
1. **No manual changes.** If it's not in a playbook, it will be overwritten.
|
||||
2. **No Anthropic.** Banned. Enforcement is automated. See `BANNED_PROVIDERS.yml`.
|
||||
3. **Idempotent.** Every playbook can run 100 times with the same result.
|
||||
4. **PR required.** Config changes go through Gitea PR review, then deploy.
|
||||
5. **One identity per machine.** No duplicate agents. Fleet audit enforces this.
|
||||
|
||||
## Related Issues
|
||||
|
||||
- timmy-config #442: [P2] Ansible IaC Canonical Playbook
|
||||
- timmy-config #444: Wire Deadman Switch ACTION
|
||||
- timmy-config #443: Thin Config Pattern
|
||||
- timmy-config #446: request_log Telemetry Table
|
||||
@@ -1,21 +0,0 @@
|
||||
[defaults]
|
||||
inventory = inventory/hosts.yml
|
||||
roles_path = roles
|
||||
host_key_checking = False
|
||||
retry_files_enabled = False
|
||||
stdout_callback = yaml
|
||||
forks = 10
|
||||
timeout = 30
|
||||
|
||||
# Logging
|
||||
log_path = /var/log/ansible/timmy-fleet.log
|
||||
|
||||
[privilege_escalation]
|
||||
become = True
|
||||
become_method = sudo
|
||||
become_user = root
|
||||
become_ask_pass = False
|
||||
|
||||
[ssh_connection]
|
||||
pipelining = True
|
||||
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no
|
||||
@@ -1,74 +0,0 @@
|
||||
# =============================================================================
|
||||
# Wizard Group Variables — Golden State Configuration
|
||||
# =============================================================================
|
||||
# These variables are applied to ALL wizards in the fleet.
|
||||
# This IS the golden state. If a wizard deviates, Ansible corrects it.
|
||||
# =============================================================================
|
||||
|
||||
# --- Deadman Switch ---
|
||||
deadman_enabled: true
|
||||
deadman_check_interval: 300 # 5 minutes between health checks
|
||||
deadman_snapshot_dir: "~/.local/timmy/snapshots"
|
||||
deadman_max_snapshots: 10 # Rolling window of good configs
|
||||
deadman_restart_cooldown: 60 # Seconds to wait before restart after failure
|
||||
deadman_max_restart_attempts: 3
|
||||
deadman_escalation_channel: telegram # Alert Alexander after max attempts
|
||||
|
||||
# --- Thin Config ---
|
||||
thin_config_path: "~/.timmy/thin_config.yml"
|
||||
thin_config_mode: "0444" # Read-only — agents CANNOT modify
|
||||
upstream_repo: "https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-config.git"
|
||||
upstream_branch: main
|
||||
config_pull_on_wake: true
|
||||
config_validation_enabled: true
|
||||
|
||||
# --- Agent Settings ---
|
||||
agent_max_turns: 30
|
||||
agent_reasoning_effort: high
|
||||
agent_verbose: false
|
||||
agent_approval_mode: auto
|
||||
|
||||
# --- Hermes Harness ---
|
||||
hermes_config_dir: "{{ hermes_home }}"
|
||||
hermes_bin_dir: "{{ hermes_home }}/bin"
|
||||
hermes_skins_dir: "{{ hermes_home }}/skins"
|
||||
hermes_playbooks_dir: "{{ hermes_home }}/playbooks"
|
||||
hermes_memories_dir: "{{ hermes_home }}/memories"
|
||||
|
||||
# --- Request Log (Telemetry) ---
|
||||
request_log_enabled: true
|
||||
request_log_path: "~/.local/timmy/request_log.db"
|
||||
request_log_rotation_days: 30 # Archive logs older than 30 days
|
||||
request_log_sync_to_gitea: false # Future: push telemetry summaries to Gitea
|
||||
|
||||
# --- Cron Schedule ---
|
||||
# All cron jobs are managed here. No manual crontab edits.
|
||||
cron_jobs:
|
||||
- name: "Deadman health check"
|
||||
job: "cd {{ wizard_home }}/workspace/timmy-config && python3 fleet/health_check.py"
|
||||
minute: "*/5"
|
||||
hour: "*"
|
||||
enabled: "{{ deadman_enabled }}"
|
||||
|
||||
- name: "Muda audit"
|
||||
job: "cd {{ wizard_home }}/workspace/timmy-config && bash fleet/muda-audit.sh >> /tmp/muda-audit.log 2>&1"
|
||||
minute: "0"
|
||||
hour: "21"
|
||||
weekday: "0"
|
||||
enabled: true
|
||||
|
||||
- name: "Config pull from upstream"
|
||||
job: "cd {{ wizard_home }}/workspace/timmy-config && git pull --ff-only origin main"
|
||||
minute: "*/15"
|
||||
hour: "*"
|
||||
enabled: "{{ config_pull_on_wake }}"
|
||||
|
||||
- name: "Request log rotation"
|
||||
job: "python3 -c \"import sqlite3,datetime; db=sqlite3.connect('{{ request_log_path }}'); db.execute('DELETE FROM request_log WHERE timestamp < datetime(\\\"now\\\", \\\"-{{ request_log_rotation_days }} days\\\")'); db.commit()\""
|
||||
minute: "0"
|
||||
hour: "3"
|
||||
enabled: "{{ request_log_enabled }}"
|
||||
|
||||
# --- Provider Enforcement ---
|
||||
# These are validated on every Ansible run. Any Anthropic reference = failure.
|
||||
provider_ban_enforcement: strict # strict = fail playbook, warn = log only
|
||||
@@ -1,119 +0,0 @@
|
||||
# =============================================================================
|
||||
# Fleet Inventory — The Timmy Foundation
|
||||
# =============================================================================
|
||||
# Source of truth for all machines in the fleet.
|
||||
# Update this file when machines are added/removed.
|
||||
# All changes go through PR review.
|
||||
# =============================================================================
|
||||
|
||||
all:
|
||||
children:
|
||||
wizards:
|
||||
hosts:
|
||||
timmy:
|
||||
ansible_host: localhost
|
||||
ansible_connection: local
|
||||
wizard_name: Timmy
|
||||
wizard_role: "Primary wizard — soul of the fleet"
|
||||
wizard_provider_primary: kimi-coding
|
||||
wizard_model_primary: kimi-k2.5
|
||||
hermes_port: 8081
|
||||
api_port: 8645
|
||||
wizard_home: "{{ ansible_env.HOME }}/wizards/timmy"
|
||||
hermes_home: "{{ ansible_env.HOME }}/.hermes"
|
||||
machine_type: mac
|
||||
# Timmy runs on Alexander's M3 Max
|
||||
ollama_available: true
|
||||
|
||||
allegro:
|
||||
ansible_host: 167.99.126.228
|
||||
ansible_user: root
|
||||
wizard_name: Allegro
|
||||
wizard_role: "Kimi-backed third wizard house — tight coding tasks"
|
||||
wizard_provider_primary: kimi-coding
|
||||
wizard_model_primary: kimi-k2.5
|
||||
hermes_port: 8081
|
||||
api_port: 8645
|
||||
wizard_home: /root/wizards/allegro
|
||||
hermes_home: /root/.hermes
|
||||
machine_type: vps
|
||||
ollama_available: false
|
||||
|
||||
bezalel:
|
||||
ansible_host: 159.203.146.185
|
||||
ansible_user: root
|
||||
wizard_name: Bezalel
|
||||
wizard_role: "Forge-and-testbed wizard — infrastructure, deployment, hardening"
|
||||
wizard_provider_primary: kimi-coding
|
||||
wizard_model_primary: kimi-k2.5
|
||||
hermes_port: 8081
|
||||
api_port: 8656
|
||||
wizard_home: /root/wizards/bezalel
|
||||
hermes_home: /root/.hermes
|
||||
machine_type: vps
|
||||
ollama_available: false
|
||||
# NOTE: The awake Bezalel may be the duplicate.
|
||||
# Fleet audit (the-nexus #1144) will resolve identity.
|
||||
|
||||
ezra:
|
||||
ansible_host: 143.198.27.163
|
||||
ansible_user: root
|
||||
wizard_name: Ezra
|
||||
wizard_role: "Infrastructure wizard — Gitea, nginx, hosting"
|
||||
wizard_provider_primary: kimi-coding
|
||||
wizard_model_primary: kimi-k2.5
|
||||
hermes_port: 8081
|
||||
api_port: 8645
|
||||
wizard_home: /root/wizards/ezra
|
||||
hermes_home: /root/.hermes
|
||||
machine_type: vps
|
||||
ollama_available: false
|
||||
# NOTE: Currently DOWN — Telegram key revoked, awaiting propagation.
|
||||
|
||||
# Infrastructure hosts (not wizards, but managed by Ansible)
|
||||
infrastructure:
|
||||
hosts:
|
||||
forge:
|
||||
ansible_host: 143.198.27.163
|
||||
ansible_user: root
|
||||
# Gitea runs on the same box as Ezra
|
||||
gitea_url: https://forge.alexanderwhitestone.com
|
||||
gitea_org: Timmy_Foundation
|
||||
|
||||
vars:
|
||||
# Global variables applied to all hosts
|
||||
gitea_repo_url: "https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-config.git"
|
||||
gitea_branch: main
|
||||
config_base_path: "{{ gitea_repo_url }}"
|
||||
timmy_log_dir: "~/.local/timmy/fleet-health"
|
||||
request_log_db: "~/.local/timmy/request_log.db"
|
||||
|
||||
# Golden state provider chain — Anthropic is BANNED
|
||||
golden_state_providers:
|
||||
- name: kimi-coding
|
||||
model: kimi-k2.5
|
||||
base_url: "https://api.kimi.com/coding/v1"
|
||||
timeout: 120
|
||||
reason: "Primary — Kimi K2.5 (best value, least friction)"
|
||||
- name: openrouter
|
||||
model: google/gemini-2.5-pro
|
||||
base_url: "https://openrouter.ai/api/v1"
|
||||
api_key_env: OPENROUTER_API_KEY
|
||||
timeout: 120
|
||||
reason: "Fallback — Gemini 2.5 Pro via OpenRouter"
|
||||
- name: ollama
|
||||
model: "gemma4:latest"
|
||||
base_url: "http://localhost:11434/v1"
|
||||
timeout: 180
|
||||
reason: "Terminal fallback — local Ollama (sovereign, no API needed)"
|
||||
|
||||
# Banned providers — hard enforcement
|
||||
banned_providers:
|
||||
- anthropic
|
||||
- claude
|
||||
banned_models_patterns:
|
||||
- "claude-*"
|
||||
- "anthropic/*"
|
||||
- "*sonnet*"
|
||||
- "*opus*"
|
||||
- "*haiku*"
|
||||
@@ -1,98 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# agent_startup.yml — Resurrect Wizards from Checked-in Configs
|
||||
# =============================================================================
|
||||
# Brings wizards back online using golden state configs.
|
||||
# Order: pull config → validate → start agent → verify with request_log
|
||||
# =============================================================================
|
||||
|
||||
- name: "Agent Startup Sequence"
|
||||
hosts: wizards
|
||||
become: true
|
||||
serial: 1 # One wizard at a time to avoid cascading issues
|
||||
|
||||
tasks:
|
||||
- name: "Pull latest config from upstream"
|
||||
git:
|
||||
repo: "{{ upstream_repo }}"
|
||||
dest: "{{ wizard_home }}/workspace/timmy-config"
|
||||
version: "{{ upstream_branch }}"
|
||||
force: true
|
||||
tags: [pull]
|
||||
|
||||
- name: "Deploy golden state config"
|
||||
include_role:
|
||||
name: golden_state
|
||||
tags: [config]
|
||||
|
||||
- name: "Validate config — no banned providers"
|
||||
shell: |
|
||||
python3 -c "
|
||||
import yaml, sys
|
||||
with open('{{ wizard_home }}/config.yaml') as f:
|
||||
cfg = yaml.safe_load(f)
|
||||
banned = {{ banned_providers }}
|
||||
for p in cfg.get('fallback_providers', []):
|
||||
if p.get('provider', '') in banned:
|
||||
print(f'BANNED: {p[\"provider\"]}', file=sys.stderr)
|
||||
sys.exit(1)
|
||||
model = cfg.get('model', {}).get('provider', '')
|
||||
if model in banned:
|
||||
print(f'BANNED default provider: {model}', file=sys.stderr)
|
||||
sys.exit(1)
|
||||
print('Config validated — no banned providers.')
|
||||
"
|
||||
register: config_valid
|
||||
tags: [validate]
|
||||
|
||||
- name: "Ensure hermes-agent service is running"
|
||||
systemd:
|
||||
name: "hermes-{{ wizard_name | lower }}"
|
||||
state: started
|
||||
enabled: true
|
||||
when: machine_type == 'vps'
|
||||
tags: [start]
|
||||
ignore_errors: true # Service may not exist yet on all machines
|
||||
|
||||
- name: "Start hermes agent (Mac — launchctl)"
|
||||
shell: |
|
||||
launchctl kickstart -k "ai.hermes.{{ wizard_name | lower }}" 2>/dev/null || \
|
||||
cd {{ wizard_home }} && hermes agent start --daemon 2>&1 | tail -5
|
||||
when: machine_type == 'mac'
|
||||
tags: [start]
|
||||
ignore_errors: true
|
||||
|
||||
- name: "Wait for agent to come online"
|
||||
wait_for:
|
||||
host: 127.0.0.1
|
||||
port: "{{ api_port }}"
|
||||
timeout: 60
|
||||
state: started
|
||||
tags: [verify]
|
||||
ignore_errors: true
|
||||
|
||||
- name: "Verify agent is alive — check request_log for activity"
|
||||
shell: |
|
||||
sleep 10
|
||||
python3 -c "
|
||||
import sqlite3, sys
|
||||
db = sqlite3.connect('{{ request_log_path }}')
|
||||
cursor = db.execute('''
|
||||
SELECT COUNT(*) FROM request_log
|
||||
WHERE agent_name = '{{ wizard_name }}'
|
||||
AND timestamp > datetime('now', '-5 minutes')
|
||||
''')
|
||||
count = cursor.fetchone()[0]
|
||||
if count > 0:
|
||||
print(f'{{ wizard_name }} is alive — {count} recent inference calls logged.')
|
||||
else:
|
||||
print(f'WARNING: {{ wizard_name }} started but no telemetry yet.')
|
||||
"
|
||||
register: agent_status
|
||||
tags: [verify]
|
||||
ignore_errors: true
|
||||
|
||||
- name: "Report startup status"
|
||||
debug:
|
||||
msg: "{{ wizard_name }}: {{ agent_status.stdout | default('startup attempted') }}"
|
||||
tags: [always]
|
||||
@@ -1,15 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# cron_schedule.yml — Source-Controlled Cron Jobs
|
||||
# =============================================================================
|
||||
# All cron jobs are defined in group_vars/wizards.yml.
|
||||
# This playbook deploys them. No manual crontab edits allowed.
|
||||
# =============================================================================
|
||||
|
||||
- name: "Deploy Cron Schedule"
|
||||
hosts: wizards
|
||||
become: true
|
||||
|
||||
roles:
|
||||
- role: cron_manager
|
||||
tags: [cron, schedule]
|
||||
@@ -1,17 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# deadman_switch.yml — Deploy Deadman Switch to All Wizards
|
||||
# =============================================================================
|
||||
# The deadman watch already fires and detects dead agents.
|
||||
# This playbook wires the ACTION:
|
||||
# - On healthy check: snapshot current config as "last known good"
|
||||
# - On failed check: rollback config to snapshot, restart agent
|
||||
# =============================================================================
|
||||
|
||||
- name: "Deploy Deadman Switch ACTION"
|
||||
hosts: wizards
|
||||
become: true
|
||||
|
||||
roles:
|
||||
- role: deadman_switch
|
||||
tags: [deadman, recovery]
|
||||
@@ -1,30 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# golden_state.yml — Deploy Golden State Config to All Wizards
|
||||
# =============================================================================
|
||||
# Enforces the golden state provider chain across the fleet.
|
||||
# Removes any Anthropic references. Deploys the approved provider chain.
|
||||
# =============================================================================
|
||||
|
||||
- name: "Deploy Golden State Configuration"
|
||||
hosts: wizards
|
||||
become: true
|
||||
|
||||
roles:
|
||||
- role: golden_state
|
||||
tags: [golden, config]
|
||||
|
||||
post_tasks:
|
||||
- name: "Verify golden state — no banned providers"
|
||||
shell: |
|
||||
grep -rci 'anthropic\|claude-sonnet\|claude-opus\|claude-haiku' \
|
||||
{{ hermes_home }}/config.yaml \
|
||||
{{ wizard_home }}/config.yaml 2>/dev/null || echo "0"
|
||||
register: banned_count
|
||||
changed_when: false
|
||||
|
||||
- name: "Report golden state status"
|
||||
debug:
|
||||
msg: >
|
||||
{{ wizard_name }} golden state: {{ golden_state_providers | map(attribute='name') | list | join(' → ') }}.
|
||||
Banned provider references: {{ banned_count.stdout | trim }}.
|
||||
@@ -1,15 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# request_log.yml — Deploy Telemetry Table
|
||||
# =============================================================================
|
||||
# Creates the request_log SQLite table on all machines.
|
||||
# Every inference call writes a row. No exceptions. No summarizing.
|
||||
# =============================================================================
|
||||
|
||||
- name: "Deploy Request Log Telemetry"
|
||||
hosts: wizards
|
||||
become: true
|
||||
|
||||
roles:
|
||||
- role: request_log
|
||||
tags: [telemetry, logging]
|
||||
@@ -1,72 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# site.yml — Master Playbook for the Timmy Foundation Fleet
|
||||
# =============================================================================
|
||||
# This is the ONE playbook that defines the entire fleet state.
|
||||
# Run this and every machine converges to golden state.
|
||||
#
|
||||
# Usage:
|
||||
# ansible-playbook -i inventory/hosts.yml playbooks/site.yml
|
||||
# ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit bezalel
|
||||
# ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
|
||||
# =============================================================================
|
||||
|
||||
- name: "Timmy Foundation Fleet — Full Convergence"
|
||||
hosts: wizards
|
||||
become: true
|
||||
|
||||
pre_tasks:
|
||||
- name: "Validate no banned providers in golden state"
|
||||
assert:
|
||||
that:
|
||||
- "item.name not in banned_providers"
|
||||
fail_msg: "BANNED PROVIDER DETECTED: {{ item.name }} — Anthropic is permanently banned."
|
||||
quiet: true
|
||||
loop: "{{ golden_state_providers }}"
|
||||
tags: [always]
|
||||
|
||||
- name: "Display target wizard"
|
||||
debug:
|
||||
msg: "Deploying to {{ wizard_name }} ({{ wizard_role }}) on {{ ansible_host }}"
|
||||
tags: [always]
|
||||
|
||||
roles:
|
||||
- role: wizard_base
|
||||
tags: [base, setup]
|
||||
|
||||
- role: golden_state
|
||||
tags: [golden, config]
|
||||
|
||||
- role: deadman_switch
|
||||
tags: [deadman, recovery]
|
||||
|
||||
- role: request_log
|
||||
tags: [telemetry, logging]
|
||||
|
||||
- role: cron_manager
|
||||
tags: [cron, schedule]
|
||||
|
||||
post_tasks:
|
||||
- name: "Final validation — scan for banned providers"
|
||||
shell: |
|
||||
grep -ri 'anthropic\|claude-sonnet\|claude-opus\|claude-haiku' \
|
||||
{{ hermes_home }}/config.yaml \
|
||||
{{ wizard_home }}/config.yaml \
|
||||
{{ thin_config_path }} 2>/dev/null || true
|
||||
register: banned_scan
|
||||
changed_when: false
|
||||
tags: [validation]
|
||||
|
||||
- name: "FAIL if banned providers found in deployed config"
|
||||
fail:
|
||||
msg: |
|
||||
BANNED PROVIDER DETECTED IN DEPLOYED CONFIG:
|
||||
{{ banned_scan.stdout }}
|
||||
Anthropic is permanently banned. Fix the config and re-deploy.
|
||||
when: banned_scan.stdout | length > 0
|
||||
tags: [validation]
|
||||
|
||||
- name: "Deployment complete"
|
||||
debug:
|
||||
msg: "{{ wizard_name }} converged to golden state. Provider chain: {{ golden_state_providers | map(attribute='name') | list | join(' → ') }}"
|
||||
tags: [always]
|
||||
@@ -1,55 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# cron_manager/tasks — Source-Controlled Cron Jobs
|
||||
# =============================================================================
|
||||
# All cron jobs are defined in group_vars/wizards.yml.
|
||||
# No manual crontab edits. This is the only way to manage cron.
|
||||
# =============================================================================
|
||||
|
||||
- name: "Deploy managed cron jobs"
|
||||
cron:
|
||||
name: "{{ item.name }}"
|
||||
job: "{{ item.job }}"
|
||||
minute: "{{ item.minute | default('*') }}"
|
||||
hour: "{{ item.hour | default('*') }}"
|
||||
day: "{{ item.day | default('*') }}"
|
||||
month: "{{ item.month | default('*') }}"
|
||||
weekday: "{{ item.weekday | default('*') }}"
|
||||
state: "{{ 'present' if item.enabled else 'absent' }}"
|
||||
user: "{{ ansible_user | default('root') }}"
|
||||
loop: "{{ cron_jobs }}"
|
||||
when: cron_jobs is defined
|
||||
|
||||
- name: "Deploy deadman switch cron (fallback if systemd timer unavailable)"
|
||||
cron:
|
||||
name: "Deadman switch — {{ wizard_name }}"
|
||||
job: "{{ wizard_home }}/deadman_action.sh >> {{ timmy_log_dir }}/deadman-{{ wizard_name }}.log 2>&1"
|
||||
minute: "*/5"
|
||||
hour: "*"
|
||||
state: present
|
||||
user: "{{ ansible_user | default('root') }}"
|
||||
when: deadman_enabled and machine_type != 'vps'
|
||||
# VPS machines use systemd timers instead
|
||||
|
||||
- name: "Remove legacy cron jobs (cleanup)"
|
||||
cron:
|
||||
name: "{{ item }}"
|
||||
state: absent
|
||||
user: "{{ ansible_user | default('root') }}"
|
||||
loop:
|
||||
- "legacy-deadman-watch"
|
||||
- "old-health-check"
|
||||
- "backup-deadman"
|
||||
ignore_errors: true
|
||||
|
||||
- name: "List active cron jobs"
|
||||
shell: "crontab -l 2>/dev/null | grep -v '^#' | grep -v '^$' || echo 'No cron jobs found.'"
|
||||
register: active_crons
|
||||
changed_when: false
|
||||
|
||||
- name: "Report cron status"
|
||||
debug:
|
||||
msg: |
|
||||
{{ wizard_name }} cron jobs deployed.
|
||||
Active:
|
||||
{{ active_crons.stdout }}
|
||||
@@ -1,70 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# deadman_switch/tasks — Wire the Deadman Switch ACTION
|
||||
# =============================================================================
|
||||
# The watch fires. This makes it DO something:
|
||||
# - On healthy check: snapshot current config as "last known good"
|
||||
# - On failed check: rollback to last known good, restart agent
|
||||
# =============================================================================
|
||||
|
||||
- name: "Create snapshot directory"
|
||||
file:
|
||||
path: "{{ deadman_snapshot_dir }}"
|
||||
state: directory
|
||||
mode: "0755"
|
||||
|
||||
- name: "Deploy deadman switch script"
|
||||
template:
|
||||
src: deadman_action.sh.j2
|
||||
dest: "{{ wizard_home }}/deadman_action.sh"
|
||||
mode: "0755"
|
||||
|
||||
- name: "Deploy deadman systemd service"
|
||||
template:
|
||||
src: deadman_switch.service.j2
|
||||
dest: "/etc/systemd/system/deadman-{{ wizard_name | lower }}.service"
|
||||
mode: "0644"
|
||||
when: machine_type == 'vps'
|
||||
notify: "Enable deadman service"
|
||||
|
||||
- name: "Deploy deadman systemd timer"
|
||||
template:
|
||||
src: deadman_switch.timer.j2
|
||||
dest: "/etc/systemd/system/deadman-{{ wizard_name | lower }}.timer"
|
||||
mode: "0644"
|
||||
when: machine_type == 'vps'
|
||||
notify: "Enable deadman timer"
|
||||
|
||||
- name: "Deploy deadman launchd plist (Mac)"
|
||||
template:
|
||||
src: deadman_switch.plist.j2
|
||||
dest: "{{ ansible_env.HOME }}/Library/LaunchAgents/com.timmy.deadman.{{ wizard_name | lower }}.plist"
|
||||
mode: "0644"
|
||||
when: machine_type == 'mac'
|
||||
notify: "Load deadman plist"
|
||||
|
||||
- name: "Take initial config snapshot"
|
||||
copy:
|
||||
src: "{{ wizard_home }}/config.yaml"
|
||||
dest: "{{ deadman_snapshot_dir }}/config.yaml.known_good"
|
||||
remote_src: true
|
||||
mode: "0444"
|
||||
ignore_errors: true
|
||||
|
||||
handlers:
|
||||
- name: "Enable deadman service"
|
||||
systemd:
|
||||
name: "deadman-{{ wizard_name | lower }}.service"
|
||||
daemon_reload: true
|
||||
enabled: true
|
||||
|
||||
- name: "Enable deadman timer"
|
||||
systemd:
|
||||
name: "deadman-{{ wizard_name | lower }}.timer"
|
||||
daemon_reload: true
|
||||
enabled: true
|
||||
state: started
|
||||
|
||||
- name: "Load deadman plist"
|
||||
shell: "launchctl load {{ ansible_env.HOME }}/Library/LaunchAgents/com.timmy.deadman.{{ wizard_name | lower }}.plist"
|
||||
ignore_errors: true
|
||||
@@ -1,153 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
# =============================================================================
|
||||
# Deadman Switch ACTION — {{ wizard_name }}
|
||||
# =============================================================================
|
||||
# Generated by Ansible on {{ ansible_date_time.iso8601 }}
|
||||
# DO NOT EDIT MANUALLY.
|
||||
#
|
||||
# On healthy check: snapshot current config as "last known good"
|
||||
# On failed check: rollback config to last known good, restart agent
|
||||
# =============================================================================
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
WIZARD_NAME="{{ wizard_name }}"
|
||||
WIZARD_HOME="{{ wizard_home }}"
|
||||
CONFIG_FILE="{{ wizard_home }}/config.yaml"
|
||||
SNAPSHOT_DIR="{{ deadman_snapshot_dir }}"
|
||||
SNAPSHOT_FILE="${SNAPSHOT_DIR}/config.yaml.known_good"
|
||||
REQUEST_LOG_DB="{{ request_log_path }}"
|
||||
LOG_DIR="{{ timmy_log_dir }}"
|
||||
LOG_FILE="${LOG_DIR}/deadman-${WIZARD_NAME}.log"
|
||||
MAX_SNAPSHOTS={{ deadman_max_snapshots }}
|
||||
RESTART_COOLDOWN={{ deadman_restart_cooldown }}
|
||||
MAX_RESTART_ATTEMPTS={{ deadman_max_restart_attempts }}
|
||||
COOLDOWN_FILE="${LOG_DIR}/deadman_cooldown_${WIZARD_NAME}"
|
||||
SERVICE_NAME="hermes-{{ wizard_name | lower }}"
|
||||
|
||||
# Ensure directories exist
|
||||
mkdir -p "${SNAPSHOT_DIR}" "${LOG_DIR}"
|
||||
|
||||
log() {
|
||||
echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] [deadman] [${WIZARD_NAME}] $*" >> "${LOG_FILE}"
|
||||
echo "[deadman] [${WIZARD_NAME}] $*"
|
||||
}
|
||||
|
||||
log_telemetry() {
|
||||
local status="$1"
|
||||
local message="$2"
|
||||
if [ -f "${REQUEST_LOG_DB}" ]; then
|
||||
sqlite3 "${REQUEST_LOG_DB}" "INSERT INTO request_log (timestamp, agent_name, provider, model, endpoint, status, error_message) VALUES (datetime('now'), '${WIZARD_NAME}', 'deadman_switch', 'N/A', 'health_check', '${status}', '${message}');" 2>/dev/null || true
|
||||
fi
|
||||
}
|
||||
|
||||
snapshot_config() {
|
||||
if [ -f "${CONFIG_FILE}" ]; then
|
||||
cp "${CONFIG_FILE}" "${SNAPSHOT_FILE}"
|
||||
# Keep rolling history
|
||||
cp "${CONFIG_FILE}" "${SNAPSHOT_DIR}/config.yaml.$(date +%s)"
|
||||
# Prune old snapshots
|
||||
ls -t "${SNAPSHOT_DIR}"/config.yaml.[0-9]* 2>/dev/null | tail -n +$((MAX_SNAPSHOTS + 1)) | xargs rm -f 2>/dev/null
|
||||
log "Config snapshot saved."
|
||||
fi
|
||||
}
|
||||
|
||||
rollback_config() {
|
||||
if [ -f "${SNAPSHOT_FILE}" ]; then
|
||||
log "Rolling back config to last known good..."
|
||||
cp "${SNAPSHOT_FILE}" "${CONFIG_FILE}"
|
||||
log "Config rolled back."
|
||||
log_telemetry "fallback" "Config rolled back to last known good by deadman switch"
|
||||
else
|
||||
log "ERROR: No known good snapshot found. Pulling from upstream..."
|
||||
cd "${WIZARD_HOME}/workspace/timmy-config" 2>/dev/null && \
|
||||
git pull --ff-only origin {{ upstream_branch }} 2>/dev/null && \
|
||||
cp "wizards/{{ wizard_name | lower }}/config.yaml" "${CONFIG_FILE}" && \
|
||||
log "Config restored from upstream." || \
|
||||
log "CRITICAL: Cannot restore config from any source."
|
||||
fi
|
||||
}
|
||||
|
||||
restart_agent() {
|
||||
# Check cooldown
|
||||
if [ -f "${COOLDOWN_FILE}" ]; then
|
||||
local last_restart
|
||||
last_restart=$(cat "${COOLDOWN_FILE}")
|
||||
local now
|
||||
now=$(date +%s)
|
||||
local elapsed=$((now - last_restart))
|
||||
if [ "${elapsed}" -lt "${RESTART_COOLDOWN}" ]; then
|
||||
log "Restart cooldown active (${elapsed}s / ${RESTART_COOLDOWN}s). Skipping."
|
||||
return 1
|
||||
fi
|
||||
fi
|
||||
|
||||
log "Restarting ${SERVICE_NAME}..."
|
||||
date +%s > "${COOLDOWN_FILE}"
|
||||
|
||||
{% if machine_type == 'vps' %}
|
||||
systemctl restart "${SERVICE_NAME}" 2>/dev/null && \
|
||||
log "Agent restarted via systemd." || \
|
||||
log "ERROR: systemd restart failed."
|
||||
{% else %}
|
||||
launchctl kickstart -k "ai.hermes.{{ wizard_name | lower }}" 2>/dev/null && \
|
||||
log "Agent restarted via launchctl." || \
|
||||
(cd "${WIZARD_HOME}" && hermes agent start --daemon 2>/dev/null && \
|
||||
log "Agent restarted via hermes CLI.") || \
|
||||
log "ERROR: All restart methods failed."
|
||||
{% endif %}
|
||||
|
||||
log_telemetry "success" "Agent restarted by deadman switch"
|
||||
}
|
||||
|
||||
# --- Health Check ---
|
||||
check_health() {
|
||||
# Check 1: Is the agent process running?
|
||||
{% if machine_type == 'vps' %}
|
||||
if ! systemctl is-active --quiet "${SERVICE_NAME}" 2>/dev/null; then
|
||||
if ! pgrep -f "hermes" > /dev/null 2>/dev/null; then
|
||||
log "FAIL: Agent process not running."
|
||||
return 1
|
||||
fi
|
||||
fi
|
||||
{% else %}
|
||||
if ! pgrep -f "hermes" > /dev/null 2>/dev/null; then
|
||||
log "FAIL: Agent process not running."
|
||||
return 1
|
||||
fi
|
||||
{% endif %}
|
||||
|
||||
# Check 2: Is the API port responding?
|
||||
if ! timeout 10 bash -c "echo > /dev/tcp/127.0.0.1/{{ api_port }}" 2>/dev/null; then
|
||||
log "FAIL: API port {{ api_port }} not responding."
|
||||
return 1
|
||||
fi
|
||||
|
||||
# Check 3: Does the config contain banned providers?
|
||||
if grep -qi 'anthropic\|claude-sonnet\|claude-opus\|claude-haiku' "${CONFIG_FILE}" 2>/dev/null; then
|
||||
log "FAIL: Config contains banned provider (Anthropic). Rolling back."
|
||||
return 1
|
||||
fi
|
||||
|
||||
return 0
|
||||
}
|
||||
|
||||
# --- Main ---
|
||||
main() {
|
||||
log "Health check starting..."
|
||||
|
||||
if check_health; then
|
||||
log "HEALTHY — snapshotting config."
|
||||
snapshot_config
|
||||
log_telemetry "success" "Health check passed"
|
||||
else
|
||||
log "UNHEALTHY — initiating recovery."
|
||||
log_telemetry "error" "Health check failed — initiating rollback"
|
||||
rollback_config
|
||||
restart_agent
|
||||
fi
|
||||
|
||||
log "Health check complete."
|
||||
}
|
||||
|
||||
main "$@"
|
||||
@@ -1,22 +0,0 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
|
||||
<!-- Deadman Switch — {{ wizard_name }}. Generated by Ansible. DO NOT EDIT MANUALLY. -->
|
||||
<plist version="1.0">
|
||||
<dict>
|
||||
<key>Label</key>
|
||||
<string>com.timmy.deadman.{{ wizard_name | lower }}</string>
|
||||
<key>ProgramArguments</key>
|
||||
<array>
|
||||
<string>/bin/bash</string>
|
||||
<string>{{ wizard_home }}/deadman_action.sh</string>
|
||||
</array>
|
||||
<key>StartInterval</key>
|
||||
<integer>{{ deadman_check_interval }}</integer>
|
||||
<key>RunAtLoad</key>
|
||||
<true/>
|
||||
<key>StandardOutPath</key>
|
||||
<string>{{ timmy_log_dir }}/deadman-{{ wizard_name }}.log</string>
|
||||
<key>StandardErrorPath</key>
|
||||
<string>{{ timmy_log_dir }}/deadman-{{ wizard_name }}.log</string>
|
||||
</dict>
|
||||
</plist>
|
||||
@@ -1,16 +0,0 @@
|
||||
# Deadman Switch — {{ wizard_name }}
|
||||
# Generated by Ansible. DO NOT EDIT MANUALLY.
|
||||
|
||||
[Unit]
|
||||
Description=Deadman Switch for {{ wizard_name }} wizard
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
ExecStart={{ wizard_home }}/deadman_action.sh
|
||||
User={{ ansible_user | default('root') }}
|
||||
StandardOutput=append:{{ timmy_log_dir }}/deadman-{{ wizard_name }}.log
|
||||
StandardError=append:{{ timmy_log_dir }}/deadman-{{ wizard_name }}.log
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
@@ -1,14 +0,0 @@
|
||||
# Deadman Switch Timer — {{ wizard_name }}
|
||||
# Generated by Ansible. DO NOT EDIT MANUALLY.
|
||||
# Runs every {{ deadman_check_interval // 60 }} minutes.
|
||||
|
||||
[Unit]
|
||||
Description=Deadman Switch Timer for {{ wizard_name }} wizard
|
||||
|
||||
[Timer]
|
||||
OnBootSec=60
|
||||
OnUnitActiveSec={{ deadman_check_interval }}s
|
||||
AccuracySec=30s
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
@@ -1,6 +0,0 @@
|
||||
---
|
||||
# golden_state defaults
|
||||
# The golden_state_providers list is defined in group_vars/wizards.yml
|
||||
# and inventory/hosts.yml (global vars).
|
||||
golden_state_enforce: true
|
||||
golden_state_backup_before_deploy: true
|
||||
@@ -1,46 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# golden_state/tasks — Deploy and enforce golden state provider chain
|
||||
# =============================================================================
|
||||
|
||||
- name: "Backup current config before golden state deploy"
|
||||
copy:
|
||||
src: "{{ wizard_home }}/config.yaml"
|
||||
dest: "{{ wizard_home }}/config.yaml.pre-golden-{{ ansible_date_time.epoch }}"
|
||||
remote_src: true
|
||||
when: golden_state_backup_before_deploy
|
||||
ignore_errors: true
|
||||
|
||||
- name: "Deploy golden state wizard config"
|
||||
template:
|
||||
src: "../../wizard_base/templates/wizard_config.yaml.j2"
|
||||
dest: "{{ wizard_home }}/config.yaml"
|
||||
mode: "0644"
|
||||
backup: true
|
||||
notify:
|
||||
- "Restart hermes agent (systemd)"
|
||||
- "Restart hermes agent (launchctl)"
|
||||
|
||||
- name: "Scan for banned providers in all config files"
|
||||
shell: |
|
||||
FOUND=0
|
||||
for f in {{ wizard_home }}/config.yaml {{ hermes_home }}/config.yaml; do
|
||||
if [ -f "$f" ]; then
|
||||
if grep -qi 'anthropic\|claude-sonnet\|claude-opus\|claude-haiku' "$f"; then
|
||||
echo "BANNED PROVIDER in $f:"
|
||||
grep -ni 'anthropic\|claude-sonnet\|claude-opus\|claude-haiku' "$f"
|
||||
FOUND=1
|
||||
fi
|
||||
fi
|
||||
done
|
||||
exit $FOUND
|
||||
register: provider_scan
|
||||
changed_when: false
|
||||
failed_when: provider_scan.rc != 0 and provider_ban_enforcement == 'strict'
|
||||
|
||||
- name: "Report golden state deployment"
|
||||
debug:
|
||||
msg: >
|
||||
{{ wizard_name }} golden state deployed.
|
||||
Provider chain: {{ golden_state_providers | map(attribute='name') | list | join(' → ') }}.
|
||||
Banned provider scan: {{ 'CLEAN' if provider_scan.rc == 0 else 'VIOLATIONS FOUND' }}.
|
||||
@@ -1,64 +0,0 @@
|
||||
-- =============================================================================
|
||||
-- request_log — Inference Telemetry Table
|
||||
-- =============================================================================
|
||||
-- Every agent writes to this table BEFORE and AFTER every inference call.
|
||||
-- No exceptions. No summarizing. No describing what you would log.
|
||||
-- Actually write the row.
|
||||
--
|
||||
-- Source: KT Bezalel Architecture Session 2026-04-08
|
||||
-- =============================================================================
|
||||
|
||||
CREATE TABLE IF NOT EXISTS request_log (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
timestamp TEXT NOT NULL DEFAULT (datetime('now')),
|
||||
agent_name TEXT NOT NULL,
|
||||
provider TEXT NOT NULL,
|
||||
model TEXT NOT NULL,
|
||||
endpoint TEXT NOT NULL,
|
||||
tokens_in INTEGER,
|
||||
tokens_out INTEGER,
|
||||
latency_ms INTEGER,
|
||||
status TEXT NOT NULL, -- 'success', 'error', 'timeout', 'fallback'
|
||||
error_message TEXT
|
||||
);
|
||||
|
||||
-- Index for common queries
|
||||
CREATE INDEX IF NOT EXISTS idx_request_log_agent
|
||||
ON request_log (agent_name, timestamp);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_request_log_provider
|
||||
ON request_log (provider, timestamp);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_request_log_status
|
||||
ON request_log (status, timestamp);
|
||||
|
||||
-- View: recent activity per agent (last hour)
|
||||
CREATE VIEW IF NOT EXISTS v_recent_activity AS
|
||||
SELECT
|
||||
agent_name,
|
||||
provider,
|
||||
model,
|
||||
status,
|
||||
COUNT(*) as call_count,
|
||||
AVG(latency_ms) as avg_latency_ms,
|
||||
SUM(tokens_in) as total_tokens_in,
|
||||
SUM(tokens_out) as total_tokens_out
|
||||
FROM request_log
|
||||
WHERE timestamp > datetime('now', '-1 hour')
|
||||
GROUP BY agent_name, provider, model, status;
|
||||
|
||||
-- View: provider reliability (last 24 hours)
|
||||
CREATE VIEW IF NOT EXISTS v_provider_reliability AS
|
||||
SELECT
|
||||
provider,
|
||||
model,
|
||||
COUNT(*) as total_calls,
|
||||
SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) as successes,
|
||||
SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) as errors,
|
||||
SUM(CASE WHEN status = 'timeout' THEN 1 ELSE 0 END) as timeouts,
|
||||
SUM(CASE WHEN status = 'fallback' THEN 1 ELSE 0 END) as fallbacks,
|
||||
ROUND(100.0 * SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) / COUNT(*), 1) as success_rate,
|
||||
AVG(latency_ms) as avg_latency_ms
|
||||
FROM request_log
|
||||
WHERE timestamp > datetime('now', '-24 hours')
|
||||
GROUP BY provider, model;
|
||||
@@ -1,50 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# request_log/tasks — Deploy Telemetry Table
|
||||
# =============================================================================
|
||||
# "This is non-negotiable infrastructure. Without it, we cannot verify
|
||||
# if any agent actually executed what it claims."
|
||||
# — KT Bezalel 2026-04-08
|
||||
# =============================================================================
|
||||
|
||||
- name: "Create telemetry directory"
|
||||
file:
|
||||
path: "{{ request_log_path | dirname }}"
|
||||
state: directory
|
||||
mode: "0755"
|
||||
|
||||
- name: "Deploy request_log schema"
|
||||
copy:
|
||||
src: request_log_schema.sql
|
||||
dest: "{{ wizard_home }}/request_log_schema.sql"
|
||||
mode: "0644"
|
||||
|
||||
- name: "Initialize request_log database"
|
||||
shell: |
|
||||
sqlite3 "{{ request_log_path }}" < "{{ wizard_home }}/request_log_schema.sql"
|
||||
args:
|
||||
creates: "{{ request_log_path }}"
|
||||
|
||||
- name: "Verify request_log table exists"
|
||||
shell: |
|
||||
sqlite3 "{{ request_log_path }}" ".tables" | grep -q "request_log"
|
||||
register: table_check
|
||||
changed_when: false
|
||||
|
||||
- name: "Verify request_log schema matches"
|
||||
shell: |
|
||||
sqlite3 "{{ request_log_path }}" ".schema request_log" | grep -q "agent_name"
|
||||
register: schema_check
|
||||
changed_when: false
|
||||
|
||||
- name: "Set permissions on request_log database"
|
||||
file:
|
||||
path: "{{ request_log_path }}"
|
||||
mode: "0644"
|
||||
|
||||
- name: "Report request_log status"
|
||||
debug:
|
||||
msg: >
|
||||
{{ wizard_name }} request_log: {{ request_log_path }}
|
||||
— table exists: {{ table_check.rc == 0 }}
|
||||
— schema valid: {{ schema_check.rc == 0 }}
|
||||
@@ -1,6 +0,0 @@
|
||||
---
|
||||
# wizard_base defaults
|
||||
wizard_user: "{{ ansible_user | default('root') }}"
|
||||
wizard_group: "{{ ansible_user | default('root') }}"
|
||||
timmy_base_dir: "~/.local/timmy"
|
||||
timmy_config_repo: "https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-config.git"
|
||||
@@ -1,11 +0,0 @@
|
||||
---
|
||||
- name: "Restart hermes agent (systemd)"
|
||||
systemd:
|
||||
name: "hermes-{{ wizard_name | lower }}"
|
||||
state: restarted
|
||||
when: machine_type == 'vps'
|
||||
|
||||
- name: "Restart hermes agent (launchctl)"
|
||||
shell: "launchctl kickstart -k ai.hermes.{{ wizard_name | lower }}"
|
||||
when: machine_type == 'mac'
|
||||
ignore_errors: true
|
||||
@@ -1,69 +0,0 @@
|
||||
---
|
||||
# =============================================================================
|
||||
# wizard_base/tasks — Common wizard setup
|
||||
# =============================================================================
|
||||
|
||||
- name: "Create wizard directories"
|
||||
file:
|
||||
path: "{{ item }}"
|
||||
state: directory
|
||||
mode: "0755"
|
||||
loop:
|
||||
- "{{ wizard_home }}"
|
||||
- "{{ wizard_home }}/workspace"
|
||||
- "{{ hermes_home }}"
|
||||
- "{{ hermes_home }}/bin"
|
||||
- "{{ hermes_home }}/skins"
|
||||
- "{{ hermes_home }}/playbooks"
|
||||
- "{{ hermes_home }}/memories"
|
||||
- "~/.local/timmy"
|
||||
- "~/.local/timmy/fleet-health"
|
||||
- "~/.local/timmy/snapshots"
|
||||
- "~/.timmy"
|
||||
|
||||
- name: "Clone/update timmy-config"
|
||||
git:
|
||||
repo: "{{ upstream_repo }}"
|
||||
dest: "{{ wizard_home }}/workspace/timmy-config"
|
||||
version: "{{ upstream_branch }}"
|
||||
force: false
|
||||
update: true
|
||||
ignore_errors: true # May fail on first run if no SSH key
|
||||
|
||||
- name: "Deploy SOUL.md"
|
||||
copy:
|
||||
src: "{{ wizard_home }}/workspace/timmy-config/SOUL.md"
|
||||
dest: "~/.timmy/SOUL.md"
|
||||
remote_src: true
|
||||
mode: "0644"
|
||||
ignore_errors: true
|
||||
|
||||
- name: "Deploy thin config (immutable pointer to upstream)"
|
||||
template:
|
||||
src: thin_config.yml.j2
|
||||
dest: "{{ thin_config_path }}"
|
||||
mode: "{{ thin_config_mode }}"
|
||||
tags: [thin_config]
|
||||
|
||||
- name: "Ensure Python3 and pip are available"
|
||||
package:
|
||||
name:
|
||||
- python3
|
||||
- python3-pip
|
||||
state: present
|
||||
when: machine_type == 'vps'
|
||||
ignore_errors: true
|
||||
|
||||
- name: "Ensure PyYAML is installed (for config validation)"
|
||||
pip:
|
||||
name: pyyaml
|
||||
state: present
|
||||
when: machine_type == 'vps'
|
||||
ignore_errors: true
|
||||
|
||||
- name: "Create Ansible log directory"
|
||||
file:
|
||||
path: /var/log/ansible
|
||||
state: directory
|
||||
mode: "0755"
|
||||
ignore_errors: true
|
||||
@@ -1,41 +0,0 @@
|
||||
# =============================================================================
|
||||
# Thin Config — {{ wizard_name }}
|
||||
# =============================================================================
|
||||
# THIS FILE IS READ-ONLY. Agents CANNOT modify it.
|
||||
# It contains only pointers to upstream. The actual config lives in Gitea.
|
||||
#
|
||||
# Agent wakes up → pulls config from upstream → loads → runs.
|
||||
# If anything tries to mutate this → fails gracefully → pulls fresh on restart.
|
||||
#
|
||||
# Only way to permanently change config: commit to Gitea, merge PR, Ansible deploys.
|
||||
#
|
||||
# Generated by Ansible on {{ ansible_date_time.iso8601 }}
|
||||
# DO NOT EDIT MANUALLY.
|
||||
# =============================================================================
|
||||
|
||||
identity:
|
||||
wizard_name: "{{ wizard_name }}"
|
||||
wizard_role: "{{ wizard_role }}"
|
||||
machine: "{{ inventory_hostname }}"
|
||||
|
||||
upstream:
|
||||
repo: "{{ upstream_repo }}"
|
||||
branch: "{{ upstream_branch }}"
|
||||
config_path: "wizards/{{ wizard_name | lower }}/config.yaml"
|
||||
pull_on_wake: {{ config_pull_on_wake | lower }}
|
||||
|
||||
recovery:
|
||||
deadman_enabled: {{ deadman_enabled | lower }}
|
||||
snapshot_dir: "{{ deadman_snapshot_dir }}"
|
||||
restart_cooldown: {{ deadman_restart_cooldown }}
|
||||
max_restart_attempts: {{ deadman_max_restart_attempts }}
|
||||
escalation_channel: "{{ deadman_escalation_channel }}"
|
||||
|
||||
telemetry:
|
||||
request_log_path: "{{ request_log_path }}"
|
||||
request_log_enabled: {{ request_log_enabled | lower }}
|
||||
|
||||
local_overrides:
|
||||
# Runtime overrides go here. They are EPHEMERAL — not persisted across restarts.
|
||||
# On restart, this section is reset to empty.
|
||||
{}
|
||||
@@ -1,115 +0,0 @@
|
||||
# =============================================================================
|
||||
# {{ wizard_name }} — Wizard Configuration (Golden State)
|
||||
# =============================================================================
|
||||
# Generated by Ansible on {{ ansible_date_time.iso8601 }}
|
||||
# DO NOT EDIT MANUALLY. Changes go through Gitea PR → Ansible deploy.
|
||||
#
|
||||
# Provider chain: {{ golden_state_providers | map(attribute='name') | list | join(' → ') }}
|
||||
# Anthropic is PERMANENTLY BANNED.
|
||||
# =============================================================================
|
||||
|
||||
model:
|
||||
default: {{ wizard_model_primary }}
|
||||
provider: {{ wizard_provider_primary }}
|
||||
context_length: 65536
|
||||
base_url: {{ golden_state_providers[0].base_url }}
|
||||
|
||||
toolsets:
|
||||
- all
|
||||
|
||||
fallback_providers:
|
||||
{% for provider in golden_state_providers %}
|
||||
- provider: {{ provider.name }}
|
||||
model: {{ provider.model }}
|
||||
{% if provider.base_url is defined %}
|
||||
base_url: {{ provider.base_url }}
|
||||
{% endif %}
|
||||
{% if provider.api_key_env is defined %}
|
||||
api_key_env: {{ provider.api_key_env }}
|
||||
{% endif %}
|
||||
timeout: {{ provider.timeout }}
|
||||
reason: "{{ provider.reason }}"
|
||||
{% endfor %}
|
||||
|
||||
agent:
|
||||
max_turns: {{ agent_max_turns }}
|
||||
reasoning_effort: {{ agent_reasoning_effort }}
|
||||
verbose: {{ agent_verbose | lower }}
|
||||
|
||||
terminal:
|
||||
backend: local
|
||||
cwd: .
|
||||
timeout: 180
|
||||
persistent_shell: true
|
||||
|
||||
browser:
|
||||
inactivity_timeout: 120
|
||||
command_timeout: 30
|
||||
record_sessions: false
|
||||
|
||||
display:
|
||||
compact: false
|
||||
personality: ''
|
||||
resume_display: full
|
||||
busy_input_mode: interrupt
|
||||
bell_on_complete: false
|
||||
show_reasoning: false
|
||||
streaming: false
|
||||
show_cost: false
|
||||
tool_progress: all
|
||||
|
||||
memory:
|
||||
memory_enabled: true
|
||||
user_profile_enabled: true
|
||||
memory_char_limit: 2200
|
||||
user_char_limit: 1375
|
||||
nudge_interval: 10
|
||||
flush_min_turns: 6
|
||||
|
||||
approvals:
|
||||
mode: {{ agent_approval_mode }}
|
||||
|
||||
security:
|
||||
redact_secrets: true
|
||||
tirith_enabled: false
|
||||
|
||||
platforms:
|
||||
api_server:
|
||||
enabled: true
|
||||
extra:
|
||||
host: 127.0.0.1
|
||||
port: {{ api_port }}
|
||||
|
||||
session_reset:
|
||||
mode: none
|
||||
idle_minutes: 0
|
||||
|
||||
skills:
|
||||
creation_nudge_interval: 15
|
||||
|
||||
system_prompt_suffix: |
|
||||
You are {{ wizard_name }}, {{ wizard_role }}.
|
||||
Your soul is defined in SOUL.md — read it, live it.
|
||||
Hermes is your harness.
|
||||
{{ golden_state_providers[0].name }} is your primary provider.
|
||||
Refusal over fabrication. If you do not know, say so.
|
||||
Sovereignty and service always.
|
||||
|
||||
providers:
|
||||
{% for provider in golden_state_providers %}
|
||||
{{ provider.name }}:
|
||||
base_url: {{ provider.base_url }}
|
||||
timeout: {{ provider.timeout | default(60) }}
|
||||
{% if provider.name == 'kimi-coding' %}
|
||||
max_retries: 3
|
||||
{% endif %}
|
||||
{% endfor %}
|
||||
|
||||
# =============================================================================
|
||||
# BANNED PROVIDERS — DO NOT ADD
|
||||
# =============================================================================
|
||||
# The following providers are PERMANENTLY BANNED:
|
||||
# - anthropic (any model: claude-sonnet, claude-opus, claude-haiku)
|
||||
# Enforcement: pre-commit hook, linter, Ansible validation, this comment.
|
||||
# Adding any banned provider will cause Ansible deployment to FAIL.
|
||||
# =============================================================================
|
||||
@@ -1,75 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
# =============================================================================
|
||||
# Gitea Webhook Handler — Trigger Ansible Deploy on Merge
|
||||
# =============================================================================
|
||||
# This script is called by the Gitea webhook when a PR is merged
|
||||
# to the main branch of timmy-config.
|
||||
#
|
||||
# Setup:
|
||||
# 1. Add webhook in Gitea: Settings → Webhooks → Add Webhook
|
||||
# 2. URL: http://localhost:9000/hooks/deploy-timmy-config
|
||||
# 3. Events: Pull Request (merged only)
|
||||
# 4. Secret: <configured in Gitea>
|
||||
#
|
||||
# This script runs ansible-pull to update the local machine.
|
||||
# For fleet-wide deploys, each machine runs ansible-pull independently.
|
||||
# =============================================================================
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
REPO="https://forge.alexanderwhitestone.com/Timmy_Foundation/timmy-config.git"
|
||||
BRANCH="main"
|
||||
ANSIBLE_DIR="ansible"
|
||||
LOG_FILE="/var/log/ansible/webhook-deploy.log"
|
||||
LOCK_FILE="/tmp/ansible-deploy.lock"
|
||||
|
||||
log() {
|
||||
echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] [webhook] $*" | tee -a "${LOG_FILE}"
|
||||
}
|
||||
|
||||
# Prevent concurrent deploys
|
||||
if [ -f "${LOCK_FILE}" ]; then
|
||||
LOCK_AGE=$(( $(date +%s) - $(stat -c %Y "${LOCK_FILE}" 2>/dev/null || echo 0) ))
|
||||
if [ "${LOCK_AGE}" -lt 300 ]; then
|
||||
log "Deploy already in progress (lock age: ${LOCK_AGE}s). Skipping."
|
||||
exit 0
|
||||
else
|
||||
log "Stale lock file (${LOCK_AGE}s old). Removing."
|
||||
rm -f "${LOCK_FILE}"
|
||||
fi
|
||||
fi
|
||||
|
||||
trap 'rm -f "${LOCK_FILE}"' EXIT
|
||||
touch "${LOCK_FILE}"
|
||||
|
||||
log "Webhook triggered. Starting ansible-pull..."
|
||||
|
||||
# Pull latest config
|
||||
cd /tmp
|
||||
rm -rf timmy-config-deploy
|
||||
git clone --depth 1 --branch "${BRANCH}" "${REPO}" timmy-config-deploy 2>&1 | tee -a "${LOG_FILE}"
|
||||
|
||||
cd timmy-config-deploy/${ANSIBLE_DIR}
|
||||
|
||||
# Run Ansible against localhost
|
||||
log "Running Ansible playbook..."
|
||||
ansible-playbook \
|
||||
-i inventory/hosts.yml \
|
||||
playbooks/site.yml \
|
||||
--limit "$(hostname)" \
|
||||
--diff \
|
||||
2>&1 | tee -a "${LOG_FILE}"
|
||||
|
||||
RESULT=$?
|
||||
|
||||
if [ ${RESULT} -eq 0 ]; then
|
||||
log "Deploy successful."
|
||||
else
|
||||
log "ERROR: Deploy failed with exit code ${RESULT}."
|
||||
fi
|
||||
|
||||
# Cleanup
|
||||
rm -rf /tmp/timmy-config-deploy
|
||||
|
||||
log "Webhook handler complete."
|
||||
exit ${RESULT}
|
||||
@@ -1,155 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Config Validator — The Timmy Foundation
|
||||
Validates wizard configs against golden state rules.
|
||||
Run before any config deploy to catch violations early.
|
||||
|
||||
Usage:
|
||||
python3 validate_config.py <config_file>
|
||||
python3 validate_config.py --all # Validate all wizard configs
|
||||
|
||||
Exit codes:
|
||||
0 — All validations passed
|
||||
1 — Validation errors found
|
||||
2 — File not found or parse error
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import yaml
|
||||
import fnmatch
|
||||
from pathlib import Path
|
||||
|
||||
# === BANNED PROVIDERS — HARD POLICY ===
|
||||
BANNED_PROVIDERS = {"anthropic", "claude"}
|
||||
BANNED_MODEL_PATTERNS = [
|
||||
"claude-*",
|
||||
"anthropic/*",
|
||||
"*sonnet*",
|
||||
"*opus*",
|
||||
"*haiku*",
|
||||
]
|
||||
|
||||
# === REQUIRED FIELDS ===
|
||||
REQUIRED_FIELDS = {
|
||||
"model": ["default", "provider"],
|
||||
"fallback_providers": None, # Must exist as a list
|
||||
}
|
||||
|
||||
|
||||
def is_banned_model(model_name: str) -> bool:
|
||||
"""Check if a model name matches any banned pattern."""
|
||||
model_lower = model_name.lower()
|
||||
for pattern in BANNED_MODEL_PATTERNS:
|
||||
if fnmatch.fnmatch(model_lower, pattern):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def validate_config(config_path: str) -> list[str]:
|
||||
"""Validate a wizard config file. Returns list of error strings."""
|
||||
errors = []
|
||||
|
||||
try:
|
||||
with open(config_path) as f:
|
||||
cfg = yaml.safe_load(f)
|
||||
except FileNotFoundError:
|
||||
return [f"File not found: {config_path}"]
|
||||
except yaml.YAMLError as e:
|
||||
return [f"YAML parse error: {e}"]
|
||||
|
||||
if not cfg:
|
||||
return ["Config file is empty"]
|
||||
|
||||
# Check required fields
|
||||
for section, fields in REQUIRED_FIELDS.items():
|
||||
if section not in cfg:
|
||||
errors.append(f"Missing required section: {section}")
|
||||
elif fields:
|
||||
for field in fields:
|
||||
if field not in cfg[section]:
|
||||
errors.append(f"Missing required field: {section}.{field}")
|
||||
|
||||
# Check default provider
|
||||
default_provider = cfg.get("model", {}).get("provider", "")
|
||||
if default_provider.lower() in BANNED_PROVIDERS:
|
||||
errors.append(f"BANNED default provider: {default_provider}")
|
||||
|
||||
default_model = cfg.get("model", {}).get("default", "")
|
||||
if is_banned_model(default_model):
|
||||
errors.append(f"BANNED default model: {default_model}")
|
||||
|
||||
# Check fallback providers
|
||||
for i, fb in enumerate(cfg.get("fallback_providers", [])):
|
||||
provider = fb.get("provider", "")
|
||||
model = fb.get("model", "")
|
||||
|
||||
if provider.lower() in BANNED_PROVIDERS:
|
||||
errors.append(f"BANNED fallback provider [{i}]: {provider}")
|
||||
|
||||
if is_banned_model(model):
|
||||
errors.append(f"BANNED fallback model [{i}]: {model}")
|
||||
|
||||
# Check providers section
|
||||
for name, provider_cfg in cfg.get("providers", {}).items():
|
||||
if name.lower() in BANNED_PROVIDERS:
|
||||
errors.append(f"BANNED provider in providers section: {name}")
|
||||
|
||||
base_url = str(provider_cfg.get("base_url", ""))
|
||||
if "anthropic" in base_url.lower():
|
||||
errors.append(f"BANNED URL in provider {name}: {base_url}")
|
||||
|
||||
# Check system prompt for banned references
|
||||
prompt = cfg.get("system_prompt_suffix", "")
|
||||
if isinstance(prompt, str):
|
||||
for banned in BANNED_PROVIDERS:
|
||||
if banned in prompt.lower():
|
||||
errors.append(f"BANNED provider referenced in system_prompt_suffix: {banned}")
|
||||
|
||||
return errors
|
||||
|
||||
|
||||
def main():
|
||||
if len(sys.argv) < 2:
|
||||
print(f"Usage: {sys.argv[0]} <config_file> [--all]")
|
||||
sys.exit(2)
|
||||
|
||||
if sys.argv[1] == "--all":
|
||||
# Validate all wizard configs in the repo
|
||||
repo_root = Path(__file__).parent.parent.parent
|
||||
wizard_dir = repo_root / "wizards"
|
||||
all_errors = {}
|
||||
|
||||
for wizard_path in sorted(wizard_dir.iterdir()):
|
||||
config_file = wizard_path / "config.yaml"
|
||||
if config_file.exists():
|
||||
errors = validate_config(str(config_file))
|
||||
if errors:
|
||||
all_errors[wizard_path.name] = errors
|
||||
|
||||
if all_errors:
|
||||
print("VALIDATION FAILED:")
|
||||
for wizard, errors in all_errors.items():
|
||||
print(f"\n {wizard}:")
|
||||
for err in errors:
|
||||
print(f" - {err}")
|
||||
sys.exit(1)
|
||||
else:
|
||||
print("All wizard configs passed validation.")
|
||||
sys.exit(0)
|
||||
else:
|
||||
config_path = sys.argv[1]
|
||||
errors = validate_config(config_path)
|
||||
|
||||
if errors:
|
||||
print(f"VALIDATION FAILED for {config_path}:")
|
||||
for err in errors:
|
||||
print(f" - {err}")
|
||||
sys.exit(1)
|
||||
else:
|
||||
print(f"PASSED: {config_path}")
|
||||
sys.exit(0)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -1,273 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
# agent-loop.sh — Universal agent dev loop with Genchi Genbutsu verification
|
||||
#
|
||||
# Usage: agent-loop.sh <agent-name> [num-workers]
|
||||
# agent-loop.sh claude 2
|
||||
# agent-loop.sh gemini 1
|
||||
#
|
||||
# Dispatches via agent-dispatch.sh, then verifies with genchi-genbutsu.sh.
|
||||
|
||||
set -uo pipefail
|
||||
|
||||
AGENT="${1:?Usage: agent-loop.sh <agent-name> [num-workers]}"
|
||||
NUM_WORKERS="${2:-1}"
|
||||
|
||||
# Resolve agent tool and model from config or fallback
|
||||
case "$AGENT" in
|
||||
claude) TOOL="claude"; MODEL="sonnet" ;;
|
||||
gemini) TOOL="gemini"; MODEL="gemini-2.5-pro-preview-05-06" ;;
|
||||
grok) TOOL="opencode"; MODEL="grok-3-fast" ;;
|
||||
*) TOOL="$AGENT"; MODEL="" ;;
|
||||
esac
|
||||
|
||||
# === CONFIG ===
|
||||
GITEA_URL="${GITEA_URL:-https://forge.alexanderwhitestone.com}"
|
||||
GITEA_TOKEN="${GITEA_TOKEN:-}"
|
||||
WORKTREE_BASE="$HOME/worktrees"
|
||||
LOG_DIR="$HOME/.hermes/logs"
|
||||
LOCK_DIR="$LOG_DIR/${AGENT}-locks"
|
||||
SKIP_FILE="$LOG_DIR/${AGENT}-skip-list.json"
|
||||
ACTIVE_FILE="$LOG_DIR/${AGENT}-active.json"
|
||||
TIMEOUT=600
|
||||
COOLDOWN=30
|
||||
|
||||
mkdir -p "$LOG_DIR" "$WORKTREE_BASE" "$LOCK_DIR"
|
||||
[ -f "$SKIP_FILE" ] || echo '{}' > "$SKIP_FILE"
|
||||
echo '{}' > "$ACTIVE_FILE"
|
||||
|
||||
# === SHARED FUNCTIONS ===
|
||||
log() {
|
||||
echo "[$(date '+%Y-%m-%d %H:%M:%S')] ${AGENT}: $*" >> "$LOG_DIR/${AGENT}-loop.log"
|
||||
}
|
||||
|
||||
lock_issue() {
|
||||
local key="$1"
|
||||
mkdir "$LOCK_DIR/$key.lock" 2>/dev/null && echo $$ > "$LOCK_DIR/$key.lock/pid"
|
||||
}
|
||||
|
||||
unlock_issue() {
|
||||
rm -rf "$LOCK_DIR/$1.lock" 2>/dev/null
|
||||
}
|
||||
|
||||
mark_skip() {
|
||||
local issue_num="$1" reason="$2"
|
||||
python3 -c "
|
||||
import json, time, fcntl
|
||||
with open('${SKIP_FILE}', 'r+') as f:
|
||||
fcntl.flock(f, fcntl.LOCK_EX)
|
||||
try: skips = json.load(f)
|
||||
except: skips = {}
|
||||
failures = skips.get(str($issue_num), {}).get('failures', 0) + 1
|
||||
skip_hours = 6 if failures >= 3 else 1
|
||||
skips[str($issue_num)] = {'until': time.time() + (skip_hours * 3600), 'reason': '$reason', 'failures': failures}
|
||||
f.seek(0); f.truncate()
|
||||
json.dump(skips, f, indent=2)
|
||||
" 2>/dev/null
|
||||
}
|
||||
|
||||
get_next_issue() {
|
||||
python3 -c "
|
||||
import json, sys, time, urllib.request, os
|
||||
token = '${GITEA_TOKEN}'
|
||||
base = '${GITEA_URL}'
|
||||
repos = ['Timmy_Foundation/the-nexus', 'Timmy_Foundation/timmy-config', 'Timmy_Foundation/hermes-agent']
|
||||
try:
|
||||
with open('${SKIP_FILE}') as f: skips = json.load(f)
|
||||
except: skips = {}
|
||||
try:
|
||||
with open('${ACTIVE_FILE}') as f: active = json.load(f); active_issues = {v['issue'] for v in active.values()}
|
||||
except: active_issues = set()
|
||||
all_issues = []
|
||||
for repo in repos:
|
||||
url = f'{base}/api/v1/repos/{repo}/issues?state=open&type=issues&limit=50&sort=created'
|
||||
req = urllib.request.Request(url, headers={'Authorization': f'token {token}'})
|
||||
try:
|
||||
resp = urllib.request.urlopen(req, timeout=10)
|
||||
issues = json.loads(resp.read())
|
||||
for i in issues: i['_repo'] = repo
|
||||
all_issues.extend(issues)
|
||||
except: continue
|
||||
for i in sorted(all_issues, key=lambda x: x['title'].lower()):
|
||||
assignees = [a['login'] for a in (i.get('assignees') or [])]
|
||||
if assignees and '${AGENT}' not in assignees: continue
|
||||
num_str = str(i['number'])
|
||||
if num_str in active_issues: continue
|
||||
if skips.get(num_str, {}).get('until', 0) > time.time(): continue
|
||||
lock = '${LOCK_DIR}/' + i['_repo'].replace('/', '-') + '-' + num_str + '.lock'
|
||||
if os.path.isdir(lock): continue
|
||||
owner, name = i['_repo'].split('/')
|
||||
print(json.dumps({'number': i['number'], 'title': i['title'], 'repo_owner': owner, 'repo_name': name, 'repo': i['_repo']}))
|
||||
sys.exit(0)
|
||||
print('null')
|
||||
" 2>/dev/null
|
||||
}
|
||||
|
||||
# === WORKER FUNCTION ===
|
||||
run_worker() {
|
||||
local worker_id="$1"
|
||||
log "WORKER-${worker_id}: Started"
|
||||
|
||||
while true; do
|
||||
issue_json=$(get_next_issue)
|
||||
if [ "$issue_json" = "null" ] || [ -z "$issue_json" ]; then
|
||||
sleep 30
|
||||
continue
|
||||
fi
|
||||
|
||||
issue_num=$(echo "$issue_json" | python3 -c "import sys,json; print(json.load(sys.stdin)['number'])")
|
||||
issue_title=$(echo "$issue_json" | python3 -c "import sys,json; print(json.load(sys.stdin)['title'])")
|
||||
repo_owner=$(echo "$issue_json" | python3 -c "import sys,json; print(json.load(sys.stdin)['repo_owner'])")
|
||||
repo_name=$(echo "$issue_json" | python3 -c "import sys,json; print(json.load(sys.stdin)['repo_name'])")
|
||||
issue_key="${repo_owner}-${repo_name}-${issue_num}"
|
||||
branch="${AGENT}/issue-${issue_num}"
|
||||
worktree="${WORKTREE_BASE}/${AGENT}-w${worker_id}-${issue_num}"
|
||||
|
||||
if ! lock_issue "$issue_key"; then
|
||||
sleep 5
|
||||
continue
|
||||
fi
|
||||
|
||||
log "WORKER-${worker_id}: === ISSUE #${issue_num}: ${issue_title} (${repo_owner}/${repo_name}) ==="
|
||||
|
||||
# Clone / checkout
|
||||
rm -rf "$worktree" 2>/dev/null
|
||||
CLONE_URL="http://${AGENT}:${GITEA_TOKEN}@143.198.27.163:3000/${repo_owner}/${repo_name}.git"
|
||||
if git ls-remote --heads "$CLONE_URL" "$branch" 2>/dev/null | grep -q "$branch"; then
|
||||
git clone --depth=50 -b "$branch" "$CLONE_URL" "$worktree" >/dev/null 2>&1
|
||||
else
|
||||
git clone --depth=1 -b main "$CLONE_URL" "$worktree" >/dev/null 2>&1
|
||||
cd "$worktree" && git checkout -b "$branch" >/dev/null 2>&1
|
||||
fi
|
||||
cd "$worktree"
|
||||
|
||||
# Generate prompt
|
||||
prompt=$(bash "$(dirname "$0")/agent-dispatch.sh" "$AGENT" "$issue_num" "${repo_owner}/${repo_name}")
|
||||
|
||||
CYCLE_START=$(date +%s)
|
||||
set +e
|
||||
if [ "$TOOL" = "claude" ]; then
|
||||
env -u CLAUDECODE gtimeout "$TIMEOUT" claude \
|
||||
--print --model "$MODEL" --dangerously-skip-permissions \
|
||||
-p "$prompt" </dev/null >> "$LOG_DIR/${AGENT}-${issue_num}.log" 2>&1
|
||||
elif [ "$TOOL" = "gemini" ]; then
|
||||
gtimeout "$TIMEOUT" gemini -p "$prompt" --yolo \
|
||||
</dev/null >> "$LOG_DIR/${AGENT}-${issue_num}.log" 2>&1
|
||||
else
|
||||
gtimeout "$TIMEOUT" "$TOOL" "$prompt" \
|
||||
</dev/null >> "$LOG_DIR/${AGENT}-${issue_num}.log" 2>&1
|
||||
fi
|
||||
exit_code=$?
|
||||
set -e
|
||||
CYCLE_END=$(date +%s)
|
||||
CYCLE_DURATION=$((CYCLE_END - CYCLE_START))
|
||||
|
||||
# Salvage
|
||||
cd "$worktree" 2>/dev/null || true
|
||||
DIRTY=$(git status --porcelain 2>/dev/null | wc -l | tr -d ' ')
|
||||
if [ "${DIRTY:-0}" -gt 0 ]; then
|
||||
git add -A 2>/dev/null
|
||||
git commit -m "WIP: ${AGENT} progress on #${issue_num}
|
||||
|
||||
Automated salvage commit — agent session ended (exit $exit_code)." 2>/dev/null || true
|
||||
fi
|
||||
|
||||
UNPUSHED=$(git log --oneline "origin/main..HEAD" 2>/dev/null | wc -l | tr -d ' ')
|
||||
if [ "${UNPUSHED:-0}" -gt 0 ]; then
|
||||
git push -u origin "$branch" 2>/dev/null && \
|
||||
log "WORKER-${worker_id}: Pushed $UNPUSHED commit(s) on $branch" || \
|
||||
log "WORKER-${worker_id}: Push failed for $branch"
|
||||
fi
|
||||
|
||||
# Create PR if needed
|
||||
pr_num=$(curl -sf "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls?state=open&head=${repo_owner}:${branch}&limit=1" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" | python3 -c "
|
||||
import sys,json
|
||||
prs = json.load(sys.stdin)
|
||||
print(prs[0]['number'] if prs else '')
|
||||
" 2>/dev/null)
|
||||
|
||||
if [ -z "$pr_num" ] && [ "${UNPUSHED:-0}" -gt 0 ]; then
|
||||
pr_num=$(curl -sf -X POST "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d "$(python3 -c "
|
||||
import json
|
||||
print(json.dumps({
|
||||
'title': '${AGENT}: Issue #${issue_num}',
|
||||
'head': '${branch}',
|
||||
'base': 'main',
|
||||
'body': 'Automated PR for issue #${issue_num}.\nExit code: ${exit_code}'
|
||||
}))
|
||||
")" | python3 -c "import sys,json; print(json.load(sys.stdin).get('number',''))" 2>/dev/null)
|
||||
[ -n "$pr_num" ] && log "WORKER-${worker_id}: Created PR #${pr_num} for issue #${issue_num}"
|
||||
fi
|
||||
|
||||
# ── Genchi Genbutsu: verify world state before declaring success ──
|
||||
VERIFIED="false"
|
||||
if [ "$exit_code" -eq 0 ]; then
|
||||
log "WORKER-${worker_id}: SUCCESS #${issue_num} — running genchi-genbutsu"
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
if verify_result=$("$SCRIPT_DIR/genchi-genbutsu.sh" "$repo_owner" "$repo_name" "$issue_num" "$branch" "$AGENT" 2>/dev/null); then
|
||||
VERIFIED="true"
|
||||
log "WORKER-${worker_id}: VERIFIED #${issue_num}"
|
||||
if [ -n "$pr_num" ]; then
|
||||
curl -sf -X POST "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls/${pr_num}/merge" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"Do": "squash"}' >/dev/null 2>&1 || true
|
||||
curl -sf -X PATCH "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/issues/${issue_num}" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"state": "closed"}' >/dev/null 2>&1 || true
|
||||
log "WORKER-${worker_id}: PR #${pr_num} merged, issue #${issue_num} closed"
|
||||
fi
|
||||
consecutive_failures=0
|
||||
else
|
||||
verify_details=$(echo "$verify_result" | python3 -c "import sys,json; print(json.load(sys.stdin).get('details','unknown'))" 2>/dev/null || echo "unverified")
|
||||
log "WORKER-${worker_id}: UNVERIFIED #${issue_num} — $verify_details"
|
||||
mark_skip "$issue_num" "unverified" 1
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
fi
|
||||
elif [ "$exit_code" -eq 124 ]; then
|
||||
log "WORKER-${worker_id}: TIMEOUT #${issue_num} (work saved in PR)"
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
else
|
||||
log "WORKER-${worker_id}: FAILED #${issue_num} exit ${exit_code} (work saved in PR)"
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
fi
|
||||
|
||||
# ── METRICS ──
|
||||
python3 -c "
|
||||
import json, datetime
|
||||
print(json.dumps({
|
||||
'ts': datetime.datetime.utcnow().isoformat() + 'Z',
|
||||
'agent': '${AGENT}',
|
||||
'worker': $worker_id,
|
||||
'issue': $issue_num,
|
||||
'repo': '${repo_owner}/${repo_name}',
|
||||
'outcome': 'success' if $exit_code == 0 else 'timeout' if $exit_code == 124 else 'failed',
|
||||
'exit_code': $exit_code,
|
||||
'duration_s': $CYCLE_DURATION,
|
||||
'pr': '${pr_num:-}',
|
||||
'verified': ${VERIFIED:-false}
|
||||
}))
|
||||
" >> "$LOG_DIR/${AGENT}-metrics.jsonl" 2>/dev/null
|
||||
|
||||
rm -rf "$worktree" 2>/dev/null
|
||||
unlock_issue "$issue_key"
|
||||
sleep "$COOLDOWN"
|
||||
done
|
||||
}
|
||||
|
||||
# === MAIN ===
|
||||
log "=== Agent Loop Started — ${AGENT} with ${NUM_WORKERS} worker(s) ==="
|
||||
|
||||
rm -rf "$LOCK_DIR"/*.lock 2>/dev/null
|
||||
|
||||
for i in $(seq 1 "$NUM_WORKERS"); do
|
||||
run_worker "$i" &
|
||||
log "Launched worker $i (PID $!)"
|
||||
sleep 3
|
||||
done
|
||||
|
||||
wait
|
||||
@@ -468,32 +468,24 @@ print(json.dumps({
|
||||
[ -n "$pr_num" ] && log "WORKER-${worker_id}: Created PR #${pr_num} for issue #${issue_num}"
|
||||
fi
|
||||
|
||||
# ── Genchi Genbutsu: verify world state before declaring success ──
|
||||
VERIFIED="false"
|
||||
# ── Merge + close on success ──
|
||||
if [ "$exit_code" -eq 0 ]; then
|
||||
log "WORKER-${worker_id}: SUCCESS #${issue_num} — running genchi-genbutsu"
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
if verify_result=$("$SCRIPT_DIR/genchi-genbutsu.sh" "$repo_owner" "$repo_name" "$issue_num" "$branch" "claude" 2>/dev/null); then
|
||||
VERIFIED="true"
|
||||
log "WORKER-${worker_id}: VERIFIED #${issue_num}"
|
||||
if [ -n "$pr_num" ]; then
|
||||
curl -sf -X POST "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls/${pr_num}/merge" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"Do": "squash"}' >/dev/null 2>&1 || true
|
||||
curl -sf -X PATCH "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/issues/${issue_num}" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"state": "closed"}' >/dev/null 2>&1 || true
|
||||
log "WORKER-${worker_id}: PR #${pr_num} merged, issue #${issue_num} closed"
|
||||
fi
|
||||
consecutive_failures=0
|
||||
else
|
||||
verify_details=$(echo "$verify_result" | python3 -c "import sys,json; print(json.load(sys.stdin).get('details','unknown'))" 2>/dev/null || echo "unverified")
|
||||
log "WORKER-${worker_id}: UNVERIFIED #${issue_num} — $verify_details"
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
log "WORKER-${worker_id}: SUCCESS #${issue_num}"
|
||||
|
||||
if [ -n "$pr_num" ]; then
|
||||
curl -sf -X POST "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls/${pr_num}/merge" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"Do": "squash"}' >/dev/null 2>&1 || true
|
||||
curl -sf -X PATCH "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/issues/${issue_num}" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"state": "closed"}' >/dev/null 2>&1 || true
|
||||
log "WORKER-${worker_id}: PR #${pr_num} merged, issue #${issue_num} closed"
|
||||
fi
|
||||
|
||||
consecutive_failures=0
|
||||
|
||||
elif [ "$exit_code" -eq 124 ]; then
|
||||
log "WORKER-${worker_id}: TIMEOUT #${issue_num} (work saved in PR)"
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
@@ -530,7 +522,6 @@ print(json.dumps({
|
||||
import json, datetime
|
||||
print(json.dumps({
|
||||
'ts': datetime.datetime.utcnow().isoformat() + 'Z',
|
||||
'agent': 'claude',
|
||||
'worker': $worker_id,
|
||||
'issue': $issue_num,
|
||||
'repo': '${repo_owner}/${repo_name}',
|
||||
@@ -543,8 +534,7 @@ print(json.dumps({
|
||||
'lines_removed': ${LINES_REMOVED:-0},
|
||||
'salvaged': ${DIRTY:-0},
|
||||
'pr': '${pr_num:-}',
|
||||
'merged': $( [ '$OUTCOME' = 'success' ] && [ -n '${pr_num:-}' ] && echo 'true' || echo 'false' ),
|
||||
'verified': ${VERIFIED:-false}
|
||||
'merged': $( [ '$OUTCOME' = 'success' ] && [ -n '${pr_num:-}' ] && echo 'true' || echo 'false' )
|
||||
}))
|
||||
" >> "$METRICS_FILE" 2>/dev/null
|
||||
|
||||
|
||||
@@ -5,7 +5,7 @@ set -uo pipefail
|
||||
export PATH="/opt/homebrew/bin:$HOME/.local/bin:$HOME/.hermes/bin:/usr/local/bin:$PATH"
|
||||
|
||||
LOG="$HOME/.hermes/logs/claudemax-watchdog.log"
|
||||
GITEA_URL="https://forge.alexanderwhitestone.com"
|
||||
GITEA_URL="http://143.198.27.163:3000"
|
||||
GITEA_TOKEN=$(tr -d '[:space:]' < "$HOME/.hermes/gitea_token_vps" 2>/dev/null || true)
|
||||
REPO_API="$GITEA_URL/api/v1/repos/Timmy_Foundation/the-nexus"
|
||||
MIN_OPEN_ISSUES=10
|
||||
|
||||
@@ -1,264 +0,0 @@
|
||||
1|#!/usr/bin/env python3
|
||||
2|"""
|
||||
3|Dead Man Switch Fallback Engine
|
||||
4|
|
||||
5|When the dead man switch triggers (zero commits for 2+ hours, model down,
|
||||
6|Gitea unreachable, etc.), this script diagnoses the failure and applies
|
||||
7|common sense fallbacks automatically.
|
||||
8|
|
||||
9|Fallback chain:
|
||||
10|1. Primary model (Anthropic) down -> switch config to local-llama.cpp
|
||||
11|2. Gitea unreachable -> cache issues locally, retry on recovery
|
||||
12|3. VPS agents down -> alert + lazarus protocol
|
||||
13|4. Local llama.cpp down -> try Ollama, then alert-only mode
|
||||
14|5. All inference dead -> safe mode (cron pauses, alert Alexander)
|
||||
15|
|
||||
16|Each fallback is reversible. Recovery auto-restores the previous config.
|
||||
17|"""
|
||||
18|import os
|
||||
19|import sys
|
||||
20|import json
|
||||
21|import subprocess
|
||||
22|import time
|
||||
23|import yaml
|
||||
24|import shutil
|
||||
25|from pathlib import Path
|
||||
26|from datetime import datetime, timedelta
|
||||
27|
|
||||
28|HERMES_HOME = Path(os.environ.get("HERMES_HOME", os.path.expanduser("~/.hermes")))
|
||||
29|CONFIG_PATH = HERMES_HOME / "config.yaml"
|
||||
30|FALLBACK_STATE = HERMES_HOME / "deadman-fallback-state.json"
|
||||
31|BACKUP_CONFIG = HERMES_HOME / "config.yaml.pre-fallback"
|
||||
32|FORGE_URL = "https://forge.alexanderwhitestone.com"
|
||||
33|
|
||||
34|def load_config():
|
||||
35| with open(CONFIG_PATH) as f:
|
||||
36| return yaml.safe_load(f)
|
||||
37|
|
||||
38|def save_config(cfg):
|
||||
39| with open(CONFIG_PATH, "w") as f:
|
||||
40| yaml.dump(cfg, f, default_flow_style=False)
|
||||
41|
|
||||
42|def load_state():
|
||||
43| if FALLBACK_STATE.exists():
|
||||
44| with open(FALLBACK_STATE) as f:
|
||||
45| return json.load(f)
|
||||
46| return {"active_fallbacks": [], "last_check": None, "recovery_pending": False}
|
||||
47|
|
||||
48|def save_state(state):
|
||||
49| state["last_check"] = datetime.now().isoformat()
|
||||
50| with open(FALLBACK_STATE, "w") as f:
|
||||
51| json.dump(state, f, indent=2)
|
||||
52|
|
||||
53|def run(cmd, timeout=10):
|
||||
54| try:
|
||||
55| r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
|
||||
56| return r.returncode, r.stdout.strip(), r.stderr.strip()
|
||||
57| except subprocess.TimeoutExpired:
|
||||
58| return -1, "", "timeout"
|
||||
59| except Exception as e:
|
||||
60| return -1, "", str(e)
|
||||
61|
|
||||
62|# ─── HEALTH CHECKS ───
|
||||
63|
|
||||
64|def check_anthropic():
|
||||
65| """Can we reach Anthropic API?"""
|
||||
66| key = os.environ.get("ANTHROPIC_API_KEY", "")
|
||||
67| if not key:
|
||||
68| # Check multiple .env locations
|
||||
69| for env_path in [HERMES_HOME / ".env", Path.home() / ".hermes" / ".env"]:
|
||||
70| if env_path.exists():
|
||||
71| for line in open(env_path):
|
||||
72| line = line.strip()
|
||||
73| if line.startswith("ANTHROPIC_API_KEY=***
|
||||
74| key = line.split("=", 1)[1].strip().strip('"').strip("'")
|
||||
75| break
|
||||
76| if key:
|
||||
77| break
|
||||
78| if not key:
|
||||
79| return False, "no API key"
|
||||
80| code, out, err = run(
|
||||
81| f'curl -s -o /dev/null -w "%{{http_code}}" -H "x-api-key: {key}" '
|
||||
82| f'-H "anthropic-version: 2023-06-01" '
|
||||
83| f'https://api.anthropic.com/v1/messages -X POST '
|
||||
84| f'-H "content-type: application/json" '
|
||||
85| f'-d \'{{"model":"claude-haiku-4-5-20251001","max_tokens":1,"messages":[{{"role":"user","content":"ping"}}]}}\' ',
|
||||
86| timeout=15
|
||||
87| )
|
||||
88| if code == 0 and out in ("200", "429"):
|
||||
89| return True, f"HTTP {out}"
|
||||
90| return False, f"HTTP {out} err={err[:80]}"
|
||||
91|
|
||||
92|def check_local_llama():
|
||||
93| """Is local llama.cpp serving?"""
|
||||
94| code, out, err = run("curl -s http://localhost:8081/v1/models", timeout=5)
|
||||
95| if code == 0 and "hermes" in out.lower():
|
||||
96| return True, "serving"
|
||||
97| return False, f"exit={code}"
|
||||
98|
|
||||
99|def check_ollama():
|
||||
100| """Is Ollama running?"""
|
||||
101| code, out, err = run("curl -s http://localhost:11434/api/tags", timeout=5)
|
||||
102| if code == 0 and "models" in out:
|
||||
103| return True, "running"
|
||||
104| return False, f"exit={code}"
|
||||
105|
|
||||
106|def check_gitea():
|
||||
107| """Can we reach the Forge?"""
|
||||
108| token_path = Path.home() / ".config" / "gitea" / "timmy-token"
|
||||
109| if not token_path.exists():
|
||||
110| return False, "no token"
|
||||
111| token = token_path.read_text().strip()
|
||||
112| code, out, err = run(
|
||||
113| f'curl -s -o /dev/null -w "%{{http_code}}" -H "Authorization: token {token}" '
|
||||
114| f'"{FORGE_URL}/api/v1/user"',
|
||||
115| timeout=10
|
||||
116| )
|
||||
117| if code == 0 and out == "200":
|
||||
118| return True, "reachable"
|
||||
119| return False, f"HTTP {out}"
|
||||
120|
|
||||
121|def check_vps(ip, name):
|
||||
122| """Can we SSH into a VPS?"""
|
||||
123| code, out, err = run(f"ssh -o ConnectTimeout=5 root@{ip} 'echo alive'", timeout=10)
|
||||
124| if code == 0 and "alive" in out:
|
||||
125| return True, "alive"
|
||||
126| return False, f"unreachable"
|
||||
127|
|
||||
128|# ─── FALLBACK ACTIONS ───
|
||||
129|
|
||||
130|def fallback_to_local_model(cfg):
|
||||
131| """Switch primary model from Anthropic to local llama.cpp"""
|
||||
132| if not BACKUP_CONFIG.exists():
|
||||
133| shutil.copy2(CONFIG_PATH, BACKUP_CONFIG)
|
||||
134|
|
||||
135| cfg["model"]["provider"] = "local-llama.cpp"
|
||||
136| cfg["model"]["default"] = "hermes3"
|
||||
137| save_config(cfg)
|
||||
138| return "Switched primary model to local-llama.cpp/hermes3"
|
||||
139|
|
||||
140|def fallback_to_ollama(cfg):
|
||||
141| """Switch to Ollama if llama.cpp is also down"""
|
||||
142| if not BACKUP_CONFIG.exists():
|
||||
143| shutil.copy2(CONFIG_PATH, BACKUP_CONFIG)
|
||||
144|
|
||||
145| cfg["model"]["provider"] = "ollama"
|
||||
146| cfg["model"]["default"] = "gemma4:latest"
|
||||
147| save_config(cfg)
|
||||
148| return "Switched primary model to ollama/gemma4:latest"
|
||||
149|
|
||||
150|def enter_safe_mode(state):
|
||||
151| """Pause all non-essential cron jobs, alert Alexander"""
|
||||
152| state["safe_mode"] = True
|
||||
153| state["safe_mode_entered"] = datetime.now().isoformat()
|
||||
154| save_state(state)
|
||||
155| return "SAFE MODE: All inference down. Cron jobs should be paused. Alert Alexander."
|
||||
156|
|
||||
157|def restore_config():
|
||||
158| """Restore pre-fallback config when primary recovers"""
|
||||
159| if BACKUP_CONFIG.exists():
|
||||
160| shutil.copy2(BACKUP_CONFIG, CONFIG_PATH)
|
||||
161| BACKUP_CONFIG.unlink()
|
||||
162| return "Restored original config from backup"
|
||||
163| return "No backup config to restore"
|
||||
164|
|
||||
165|# ─── MAIN DIAGNOSIS AND FALLBACK ENGINE ───
|
||||
166|
|
||||
167|def diagnose_and_fallback():
|
||||
168| state = load_state()
|
||||
169| cfg = load_config()
|
||||
170|
|
||||
171| results = {
|
||||
172| "timestamp": datetime.now().isoformat(),
|
||||
173| "checks": {},
|
||||
174| "actions": [],
|
||||
175| "status": "healthy"
|
||||
176| }
|
||||
177|
|
||||
178| # Check all systems
|
||||
179| anthropic_ok, anthropic_msg = check_anthropic()
|
||||
180| results["checks"]["anthropic"] = {"ok": anthropic_ok, "msg": anthropic_msg}
|
||||
181|
|
||||
182| llama_ok, llama_msg = check_local_llama()
|
||||
183| results["checks"]["local_llama"] = {"ok": llama_ok, "msg": llama_msg}
|
||||
184|
|
||||
185| ollama_ok, ollama_msg = check_ollama()
|
||||
186| results["checks"]["ollama"] = {"ok": ollama_ok, "msg": ollama_msg}
|
||||
187|
|
||||
188| gitea_ok, gitea_msg = check_gitea()
|
||||
189| results["checks"]["gitea"] = {"ok": gitea_ok, "msg": gitea_msg}
|
||||
190|
|
||||
191| # VPS checks
|
||||
192| vpses = [
|
||||
193| ("167.99.126.228", "Allegro"),
|
||||
194| ("143.198.27.163", "Ezra"),
|
||||
195| ("159.203.146.185", "Bezalel"),
|
||||
196| ]
|
||||
197| for ip, name in vpses:
|
||||
198| vps_ok, vps_msg = check_vps(ip, name)
|
||||
199| results["checks"][f"vps_{name.lower()}"] = {"ok": vps_ok, "msg": vps_msg}
|
||||
200|
|
||||
201| current_provider = cfg.get("model", {}).get("provider", "anthropic")
|
||||
202|
|
||||
203| # ─── FALLBACK LOGIC ───
|
||||
204|
|
||||
205| # Case 1: Primary (Anthropic) down, local available
|
||||
206| if not anthropic_ok and current_provider == "anthropic":
|
||||
207| if llama_ok:
|
||||
208| msg = fallback_to_local_model(cfg)
|
||||
209| results["actions"].append(msg)
|
||||
210| state["active_fallbacks"].append("anthropic->local-llama")
|
||||
211| results["status"] = "degraded_local"
|
||||
212| elif ollama_ok:
|
||||
213| msg = fallback_to_ollama(cfg)
|
||||
214| results["actions"].append(msg)
|
||||
215| state["active_fallbacks"].append("anthropic->ollama")
|
||||
216| results["status"] = "degraded_ollama"
|
||||
217| else:
|
||||
218| msg = enter_safe_mode(state)
|
||||
219| results["actions"].append(msg)
|
||||
220| results["status"] = "safe_mode"
|
||||
221|
|
||||
222| # Case 2: Already on fallback, check if primary recovered
|
||||
223| elif anthropic_ok and "anthropic->local-llama" in state.get("active_fallbacks", []):
|
||||
224| msg = restore_config()
|
||||
225| results["actions"].append(msg)
|
||||
226| state["active_fallbacks"].remove("anthropic->local-llama")
|
||||
227| results["status"] = "recovered"
|
||||
228| elif anthropic_ok and "anthropic->ollama" in state.get("active_fallbacks", []):
|
||||
229| msg = restore_config()
|
||||
230| results["actions"].append(msg)
|
||||
231| state["active_fallbacks"].remove("anthropic->ollama")
|
||||
232| results["status"] = "recovered"
|
||||
233|
|
||||
234| # Case 3: Gitea down — just flag it, work locally
|
||||
235| if not gitea_ok:
|
||||
236| results["actions"].append("WARN: Gitea unreachable — work cached locally until recovery")
|
||||
237| if "gitea_down" not in state.get("active_fallbacks", []):
|
||||
238| state["active_fallbacks"].append("gitea_down")
|
||||
239| results["status"] = max(results["status"], "degraded_gitea", key=lambda x: ["healthy", "recovered", "degraded_gitea", "degraded_local", "degraded_ollama", "safe_mode"].index(x) if x in ["healthy", "recovered", "degraded_gitea", "degraded_local", "degraded_ollama", "safe_mode"] else 0)
|
||||
240| elif "gitea_down" in state.get("active_fallbacks", []):
|
||||
241| state["active_fallbacks"].remove("gitea_down")
|
||||
242| results["actions"].append("Gitea recovered — resume normal operations")
|
||||
243|
|
||||
244| # Case 4: VPS agents down
|
||||
245| for ip, name in vpses:
|
||||
246| key = f"vps_{name.lower()}"
|
||||
247| if not results["checks"][key]["ok"]:
|
||||
248| results["actions"].append(f"ALERT: {name} VPS ({ip}) unreachable — lazarus protocol needed")
|
||||
249|
|
||||
250| save_state(state)
|
||||
251| return results
|
||||
252|
|
||||
253|if __name__ == "__main__":
|
||||
254| results = diagnose_and_fallback()
|
||||
255| print(json.dumps(results, indent=2))
|
||||
256|
|
||||
257| # Exit codes for cron integration
|
||||
258| if results["status"] == "safe_mode":
|
||||
259| sys.exit(2)
|
||||
260| elif results["status"].startswith("degraded"):
|
||||
261| sys.exit(1)
|
||||
262| else:
|
||||
263| sys.exit(0)
|
||||
264|
|
||||
@@ -9,7 +9,7 @@ THRESHOLD_HOURS="${1:-2}"
|
||||
THRESHOLD_SECS=$((THRESHOLD_HOURS * 3600))
|
||||
LOG_DIR="$HOME/.hermes/logs"
|
||||
LOG_FILE="$LOG_DIR/deadman.log"
|
||||
GITEA_URL="https://forge.alexanderwhitestone.com"
|
||||
GITEA_URL="http://143.198.27.163:3000"
|
||||
GITEA_TOKEN=$(cat "$HOME/.hermes/gitea_token_vps" 2>/dev/null || echo "")
|
||||
TELEGRAM_TOKEN=$(cat "$HOME/.config/telegram/special_bot" 2>/dev/null || echo "")
|
||||
TELEGRAM_CHAT="-1003664764329"
|
||||
|
||||
@@ -25,35 +25,10 @@ else
|
||||
fi
|
||||
|
||||
# ── Config ──
|
||||
GITEA_TOKEN=$(cat ~/.hermes/gitea_token_vps 2>/dev/null || echo "")
|
||||
GITEA_API="https://forge.alexanderwhitestone.com/api/v1"
|
||||
|
||||
# Resolve Tailscale IPs dynamically; fallback to env vars
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
RESOLVER="${SCRIPT_DIR}/../tools/tailscale_ip_resolver.py"
|
||||
if [ ! -f "$RESOLVER" ]; then
|
||||
RESOLVER="/root/wizards/ezra/tools/tailscale_ip_resolver.py"
|
||||
fi
|
||||
|
||||
resolve_host() {
|
||||
local default_ip="$1"
|
||||
if [ -n "$TAILSCALE_IP" ]; then
|
||||
echo "root@${TAILSCALE_IP}"
|
||||
return
|
||||
fi
|
||||
if [ -f "$RESOLVER" ]; then
|
||||
local ip
|
||||
ip=$(python3 "$RESOLVER" 2>/dev/null)
|
||||
if [ -n "$ip" ]; then
|
||||
echo "root@${ip}"
|
||||
return
|
||||
fi
|
||||
fi
|
||||
echo "root@${default_ip}"
|
||||
}
|
||||
|
||||
EZRA_HOST=$(resolve_host "143.198.27.163")
|
||||
BEZALEL_HOST="root@${BEZALEL_TAILSCALE_IP:-67.205.155.108}"
|
||||
GITEA_TOKEN=$(cat ~/.hermes/gitea_token_vps 2>/dev/null)
|
||||
GITEA_API="http://143.198.27.163:3000/api/v1"
|
||||
EZRA_HOST="root@143.198.27.163"
|
||||
BEZALEL_HOST="root@67.205.155.108"
|
||||
SSH_OPTS="-o ConnectTimeout=4 -o StrictHostKeyChecking=no -o BatchMode=yes"
|
||||
|
||||
ANY_DOWN=0
|
||||
@@ -179,7 +154,7 @@ fi
|
||||
|
||||
print_line "Timmy" "$TIMMY_STATUS" "$TIMMY_MODEL" "$TIMMY_ACTIVITY"
|
||||
|
||||
# ── 2. Ezra ──
|
||||
# ── 2. Ezra (VPS 143.198.27.163) ──
|
||||
EZRA_STATUS="DOWN"
|
||||
EZRA_MODEL="hermes-ezra"
|
||||
EZRA_ACTIVITY=""
|
||||
@@ -211,7 +186,7 @@ fi
|
||||
|
||||
print_line "Ezra" "$EZRA_STATUS" "$EZRA_MODEL" "$EZRA_ACTIVITY"
|
||||
|
||||
# ── 3. Bezalel ──
|
||||
# ── 3. Bezalel (VPS 67.205.155.108) ──
|
||||
BEZ_STATUS="DOWN"
|
||||
BEZ_MODEL="hermes-bezalel"
|
||||
BEZ_ACTIVITY=""
|
||||
@@ -271,7 +246,7 @@ if [ -n "$GITEA_VER" ]; then
|
||||
GITEA_STATUS="UP"
|
||||
VER=$(python3 -c "import json; print(json.loads('''${GITEA_VER}''').get('version','?'))" 2>/dev/null)
|
||||
GITEA_MODEL="gitea v${VER}"
|
||||
GITEA_ACTIVITY="forge.alexanderwhitestone.com"
|
||||
GITEA_ACTIVITY="143.198.27.163:3000"
|
||||
else
|
||||
GITEA_STATUS="DOWN"
|
||||
GITEA_MODEL="gitea(unreachable)"
|
||||
|
||||
@@ -521,63 +521,61 @@ print(json.dumps({
|
||||
[ -n "$pr_num" ] && log "WORKER-${worker_id}: Created PR #${pr_num} for issue #${issue_num}"
|
||||
fi
|
||||
|
||||
# ── Genchi Genbutsu: verify world state before declaring success ──
|
||||
VERIFIED="false"
|
||||
# ── Verify finish semantics / classify failures ──
|
||||
if [ "$exit_code" -eq 0 ]; then
|
||||
log "WORKER-${worker_id}: SUCCESS #${issue_num} exited 0 — running genchi-genbutsu"
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
if verify_result=$("$SCRIPT_DIR/genchi-genbutsu.sh" "$repo_owner" "$repo_name" "$issue_num" "$branch" "gemini" 2>/dev/null); then
|
||||
VERIFIED="true"
|
||||
log "WORKER-${worker_id}: VERIFIED #${issue_num}"
|
||||
pr_state=$(get_pr_state "$repo_owner" "$repo_name" "$pr_num")
|
||||
if [ "$pr_state" = "open" ]; then
|
||||
curl -sf -X POST "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls/${pr_num}/merge" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"Do": "squash"}' >/dev/null 2>&1 || true
|
||||
pr_state=$(get_pr_state "$repo_owner" "$repo_name" "$pr_num")
|
||||
fi
|
||||
if [ "$pr_state" = "merged" ]; then
|
||||
curl -sf -X PATCH "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/issues/${issue_num}" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"state": "closed"}' >/dev/null 2>&1 || true
|
||||
issue_state=$(get_issue_state "$repo_owner" "$repo_name" "$issue_num")
|
||||
if [ "$issue_state" = "closed" ]; then
|
||||
log "WORKER-${worker_id}: VERIFIED #${issue_num} branch pushed, PR merged, comment present, issue closed"
|
||||
consecutive_failures=0
|
||||
else
|
||||
log "WORKER-${worker_id}: BLOCKED #${issue_num} issue did not close after merge"
|
||||
mark_skip "$issue_num" "issue_close_unverified" 1
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
fi
|
||||
else
|
||||
log "WORKER-${worker_id}: BLOCKED #${issue_num} merge not verified (state=${pr_state})"
|
||||
mark_skip "$issue_num" "merge_unverified" 1
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
fi
|
||||
log "WORKER-${worker_id}: SUCCESS #${issue_num} exited 0 — verifying push + PR + proof"
|
||||
if ! remote_branch_exists "$branch"; then
|
||||
log "WORKER-${worker_id}: BLOCKED #${issue_num} remote branch missing"
|
||||
post_issue_comment "$repo_owner" "$repo_name" "$issue_num" "Loop gate blocked completion: remote branch ${branch} was not found on origin after Gemini exited. Issue remains open for retry."
|
||||
mark_skip "$issue_num" "missing_remote_branch" 1
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
elif [ -z "$pr_num" ]; then
|
||||
log "WORKER-${worker_id}: BLOCKED #${issue_num} no PR found"
|
||||
post_issue_comment "$repo_owner" "$repo_name" "$issue_num" "Loop gate blocked completion: branch ${branch} exists remotely, but no PR was found. Issue remains open for retry."
|
||||
mark_skip "$issue_num" "missing_pr" 1
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
else
|
||||
verify_details=$(echo "$verify_result" | python3 -c "import sys,json; print(json.load(sys.stdin).get('details','unknown'))" 2>/dev/null || echo "unverified")
|
||||
verify_checks=$(echo "$verify_result" | python3 -c "import sys,json; print(json.load(sys.stdin).get('checks',''))" 2>/dev/null || echo "")
|
||||
log "WORKER-${worker_id}: UNVERIFIED #${issue_num} — $verify_details"
|
||||
if echo "$verify_checks" | grep -q '"branch": false'; then
|
||||
post_issue_comment "$repo_owner" "$repo_name" "$issue_num" "Loop gate blocked completion: remote branch ${branch} was not found on origin after Gemini exited. Issue remains open for retry."
|
||||
mark_skip "$issue_num" "missing_remote_branch" 1
|
||||
elif echo "$verify_checks" | grep -q '"pr": false'; then
|
||||
post_issue_comment "$repo_owner" "$repo_name" "$issue_num" "Loop gate blocked completion: branch ${branch} exists remotely, but no PR was found. Issue remains open for retry."
|
||||
mark_skip "$issue_num" "missing_pr" 1
|
||||
elif echo "$verify_checks" | grep -q '"files": false'; then
|
||||
curl -sf -X PATCH "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls/${pr_num}" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"state": "closed"}' >/dev/null 2>&1 || true
|
||||
pr_files=$(get_pr_file_count "$repo_owner" "$repo_name" "$pr_num")
|
||||
if [ "${pr_files:-0}" -eq 0 ]; then
|
||||
log "WORKER-${worker_id}: BLOCKED #${issue_num} PR #${pr_num} has 0 changed files"
|
||||
curl -sf -X PATCH "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls/${pr_num}" -H "Authorization: token ${GITEA_TOKEN}" -H "Content-Type: application/json" -d '{"state": "closed"}' >/dev/null 2>&1 || true
|
||||
post_issue_comment "$repo_owner" "$repo_name" "$issue_num" "PR #${pr_num} was closed automatically: it had 0 changed files (empty commit). Issue remains open for retry."
|
||||
mark_skip "$issue_num" "empty_commit" 2
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
else
|
||||
post_issue_comment "$repo_owner" "$repo_name" "$issue_num" "Loop gate blocked completion: PR #${pr_num} exists, but required verification failed ($verify_details). Issue remains open for retry."
|
||||
mark_skip "$issue_num" "unverified" 1
|
||||
proof_status=$(proof_comment_status "$repo_owner" "$repo_name" "$issue_num" "$branch")
|
||||
proof_state="${proof_status%%|*}"
|
||||
proof_url="${proof_status#*|}"
|
||||
if [ "$proof_state" != "ok" ]; then
|
||||
log "WORKER-${worker_id}: BLOCKED #${issue_num} proof missing or incomplete (${proof_state})"
|
||||
post_issue_comment "$repo_owner" "$repo_name" "$issue_num" "Loop gate blocked completion: PR #${pr_num} exists and has ${pr_files} changed file(s), but the required Proof block from Gemini is missing or incomplete. Issue remains open for retry."
|
||||
mark_skip "$issue_num" "missing_proof" 1
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
else
|
||||
log "WORKER-${worker_id}: PROOF verified ${proof_url}"
|
||||
pr_state=$(get_pr_state "$repo_owner" "$repo_name" "$pr_num")
|
||||
if [ "$pr_state" = "open" ]; then
|
||||
curl -sf -X POST "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls/${pr_num}/merge" -H "Authorization: token ${GITEA_TOKEN}" -H "Content-Type: application/json" -d '{"Do": "squash"}' >/dev/null 2>&1 || true
|
||||
pr_state=$(get_pr_state "$repo_owner" "$repo_name" "$pr_num")
|
||||
fi
|
||||
if [ "$pr_state" = "merged" ]; then
|
||||
curl -sf -X PATCH "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/issues/${issue_num}" -H "Authorization: token ${GITEA_TOKEN}" -H "Content-Type: application/json" -d '{"state": "closed"}' >/dev/null 2>&1 || true
|
||||
issue_state=$(get_issue_state "$repo_owner" "$repo_name" "$issue_num")
|
||||
if [ "$issue_state" = "closed" ]; then
|
||||
log "WORKER-${worker_id}: VERIFIED #${issue_num} branch pushed, PR merged, proof present, issue closed"
|
||||
consecutive_failures=0
|
||||
else
|
||||
log "WORKER-${worker_id}: BLOCKED #${issue_num} issue did not close after merge"
|
||||
mark_skip "$issue_num" "issue_close_unverified" 1
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
fi
|
||||
else
|
||||
log "WORKER-${worker_id}: BLOCKED #${issue_num} merge not verified (state=${pr_state})"
|
||||
mark_skip "$issue_num" "merge_unverified" 1
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
consecutive_failures=$((consecutive_failures + 1))
|
||||
fi
|
||||
elif [ "$exit_code" -eq 124 ]; then
|
||||
log "WORKER-${worker_id}: TIMEOUT #${issue_num} (work saved in PR)"
|
||||
@@ -623,8 +621,7 @@ print(json.dumps({
|
||||
'lines_removed': ${LINES_REMOVED:-0},
|
||||
'salvaged': ${DIRTY:-0},
|
||||
'pr': '${pr_num:-}',
|
||||
'merged': $( [ '$OUTCOME' = 'success' ] && [ -n '${pr_num:-}' ] && echo 'true' || echo 'false' ),
|
||||
'verified': ${VERIFIED:-false}
|
||||
'merged': $( [ '$OUTCOME' = 'success' ] && [ -n '${pr_num:-}' ] && echo 'true' || echo 'false' )
|
||||
}))
|
||||
" >> "$LOG_DIR/gemini-metrics.jsonl" 2>/dev/null
|
||||
|
||||
|
||||
@@ -1,179 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
# genchi-genbutsu.sh — 現地現物 — Go and see. Verify world state, not log vibes.
|
||||
#
|
||||
# Post-completion verification that goes and LOOKS at the actual artifacts.
|
||||
# Performs 5 world-state checks:
|
||||
# 1. Branch exists on remote
|
||||
# 2. PR exists
|
||||
# 3. PR has real file changes (> 0)
|
||||
# 4. PR is mergeable
|
||||
# 5. Issue has a completion comment from the agent
|
||||
#
|
||||
# Usage: genchi-genbutsu.sh <repo_owner> <repo_name> <issue_num> <branch> <agent_name>
|
||||
# Returns: JSON to stdout, logs JSONL, exit 0 = VERIFIED, exit 1 = UNVERIFIED
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
GITEA_URL="${GITEA_URL:-https://forge.alexanderwhitestone.com}"
|
||||
GITEA_TOKEN="${GITEA_TOKEN:-}"
|
||||
LOG_DIR="${LOG_DIR:-$HOME/.hermes/logs}"
|
||||
VERIFY_LOG="$LOG_DIR/genchi-genbutsu.jsonl"
|
||||
|
||||
if [ $# -lt 5 ]; then
|
||||
echo "Usage: $0 <repo_owner> <repo_name> <issue_num> <branch> <agent_name>" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
repo_owner="$1"
|
||||
repo_name="$2"
|
||||
issue_num="$3"
|
||||
branch="$4"
|
||||
agent_name="$5"
|
||||
|
||||
mkdir -p "$LOG_DIR"
|
||||
|
||||
# ── Helpers ──────────────────────────────────────────────────────────
|
||||
|
||||
check_branch_exists() {
|
||||
# Use Gitea API instead of git ls-remote so we don't need clone credentials
|
||||
curl -sf "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/branches/${branch}" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" >/dev/null 2>&1
|
||||
}
|
||||
|
||||
get_pr_num() {
|
||||
curl -sf "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls?state=all&head=${repo_owner}:${branch}&limit=1" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
prs = json.load(sys.stdin)
|
||||
print(prs[0]['number'] if prs else '')
|
||||
"
|
||||
}
|
||||
|
||||
check_pr_files() {
|
||||
local pr_num="$1"
|
||||
curl -sf "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls/${pr_num}/files" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
try:
|
||||
files = json.load(sys.stdin)
|
||||
print(len(files) if isinstance(files, list) else 0)
|
||||
except:
|
||||
print(0)
|
||||
"
|
||||
}
|
||||
|
||||
check_pr_mergeable() {
|
||||
local pr_num="$1"
|
||||
curl -sf "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/pulls/${pr_num}" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" 2>/dev/null | python3 -c "
|
||||
import sys, json
|
||||
pr = json.load(sys.stdin)
|
||||
print('true' if pr.get('mergeable') else 'false')
|
||||
"
|
||||
}
|
||||
|
||||
check_completion_comment() {
|
||||
curl -sf "${GITEA_URL}/api/v1/repos/${repo_owner}/${repo_name}/issues/${issue_num}/comments" \
|
||||
-H "Authorization: token ${GITEA_TOKEN}" 2>/dev/null | AGENT="$agent_name" python3 -c "
|
||||
import os, sys, json
|
||||
agent = os.environ.get('AGENT', '').lower()
|
||||
try:
|
||||
comments = json.load(sys.stdin)
|
||||
except:
|
||||
sys.exit(1)
|
||||
for c in reversed(comments):
|
||||
user = ((c.get('user') or {}).get('login') or '').lower()
|
||||
if user == agent:
|
||||
sys.exit(0)
|
||||
sys.exit(1)
|
||||
"
|
||||
}
|
||||
|
||||
# ── Run checks ───────────────────────────────────────────────────────
|
||||
|
||||
ts=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
|
||||
status="VERIFIED"
|
||||
details=()
|
||||
checks_json='{}'
|
||||
|
||||
# Check 1: branch
|
||||
if check_branch_exists; then
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['branch']=True;print(json.dumps(d))")
|
||||
else
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['branch']=False;print(json.dumps(d))")
|
||||
status="UNVERIFIED"
|
||||
details+=("remote branch ${branch} not found")
|
||||
fi
|
||||
|
||||
# Check 2: PR exists
|
||||
pr_num=$(get_pr_num)
|
||||
if [ -n "$pr_num" ]; then
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['pr']=True;print(json.dumps(d))")
|
||||
else
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['pr']=False;print(json.dumps(d))")
|
||||
status="UNVERIFIED"
|
||||
details+=("no PR found for branch ${branch}")
|
||||
fi
|
||||
|
||||
# Check 3: PR has real file changes
|
||||
if [ -n "$pr_num" ]; then
|
||||
file_count=$(check_pr_files "$pr_num")
|
||||
if [ "${file_count:-0}" -gt 0 ]; then
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['files']=True;print(json.dumps(d))")
|
||||
else
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['files']=False;print(json.dumps(d))")
|
||||
status="UNVERIFIED"
|
||||
details+=("PR #${pr_num} has 0 changed files")
|
||||
fi
|
||||
|
||||
# Check 4: PR is mergeable
|
||||
if [ "$(check_pr_mergeable "$pr_num")" = "true" ]; then
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['mergeable']=True;print(json.dumps(d))")
|
||||
else
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['mergeable']=False;print(json.dumps(d))")
|
||||
status="UNVERIFIED"
|
||||
details+=("PR #${pr_num} is not mergeable")
|
||||
fi
|
||||
else
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['files']=None;d['mergeable']=None;print(json.dumps(d))")
|
||||
fi
|
||||
|
||||
# Check 5: completion comment from agent
|
||||
if check_completion_comment; then
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['comment']=True;print(json.dumps(d))")
|
||||
else
|
||||
checks_json=$(echo "$checks_json" | python3 -c "import sys,json;d=json.load(sys.stdin);d['comment']=False;print(json.dumps(d))")
|
||||
status="UNVERIFIED"
|
||||
details+=("no completion comment from ${agent_name} on issue #${issue_num}")
|
||||
fi
|
||||
|
||||
# Build detail string
|
||||
detail_str=$(IFS="; "; echo "${details[*]:-all checks passed}")
|
||||
|
||||
# ── Output ───────────────────────────────────────────────────────────
|
||||
|
||||
result=$(python3 -c "
|
||||
import json
|
||||
print(json.dumps({
|
||||
'status': '$status',
|
||||
'repo': '${repo_owner}/${repo_name}',
|
||||
'issue': $issue_num,
|
||||
'branch': '$branch',
|
||||
'agent': '$agent_name',
|
||||
'pr': '$pr_num',
|
||||
'checks': $checks_json,
|
||||
'details': '$detail_str',
|
||||
'ts': '$ts'
|
||||
}, indent=2))
|
||||
")
|
||||
|
||||
printf '%s\n' "$result"
|
||||
|
||||
# Append to JSONL log
|
||||
printf '%s\n' "$result" >> "$VERIFY_LOG"
|
||||
|
||||
if [ "$status" = "VERIFIED" ]; then
|
||||
exit 0
|
||||
else
|
||||
exit 1
|
||||
fi
|
||||
@@ -1,45 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
# kaizen-retro.sh — Automated retrospective after every burn cycle.
|
||||
#
|
||||
# Runs daily after the morning report.
|
||||
# Analyzes success rates by agent, repo, and issue type.
|
||||
# Identifies max-attempts issues, generates ONE concrete improvement,
|
||||
# and posts the retro to Telegram + the master morning-report issue.
|
||||
#
|
||||
# Usage:
|
||||
# ./bin/kaizen-retro.sh [--dry-run]
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
REPO_ROOT="${SCRIPT_DIR%/bin}"
|
||||
PYTHON="${PYTHON3:-python3}"
|
||||
|
||||
# Source local env if available so TELEGRAM_BOT_TOKEN is picked up
|
||||
HOME_DIR="${HOME:-$(eval echo ~$(whoami))}"
|
||||
for env_file in "$HOME_DIR/.hermes/.env" "$HOME_DIR/.timmy/.env" "$REPO_ROOT/.env"; do
|
||||
if [ -f "$env_file" ]; then
|
||||
# shellcheck source=/dev/null
|
||||
set -a
|
||||
# shellcheck source=/dev/null
|
||||
source "$env_file"
|
||||
set +a
|
||||
fi
|
||||
done
|
||||
|
||||
# If the configured Gitea URL is unreachable but localhost works, prefer localhost
|
||||
if ! curl -sf "${GITEA_URL:-http://localhost:3000}/api/v1/version" >/dev/null 2>&1; then
|
||||
if curl -sf http://localhost:3000/api/v1/version >/dev/null 2>&1; then
|
||||
export GITEA_URL="http://localhost:3000"
|
||||
fi
|
||||
fi
|
||||
|
||||
# Ensure the Python script exists
|
||||
RETRO_PY="$REPO_ROOT/scripts/kaizen_retro.py"
|
||||
if [ ! -f "$RETRO_PY" ]; then
|
||||
echo "ERROR: kaizen_retro.py not found at $RETRO_PY" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Run
|
||||
exec "$PYTHON" "$RETRO_PY" "$@"
|
||||
@@ -1,20 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
# muda-audit.sh — Weekly waste audit wrapper
|
||||
# Runs scripts/muda_audit.py from the repo root.
|
||||
# Designed for cron or Gitea Actions.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
|
||||
|
||||
cd "$REPO_ROOT"
|
||||
|
||||
# Ensure python3 is available
|
||||
if ! command -v python3 >/dev/null 2>&1; then
|
||||
echo "ERROR: python3 not found" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Run the audit
|
||||
python3 "${REPO_ROOT}/scripts/muda_audit.py" "$@"
|
||||
@@ -1,191 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""pr-checklist.py -- Automated PR quality gate for Gitea CI.
|
||||
|
||||
Enforces the review standards that agents skip when left to self-approve.
|
||||
Runs in CI on every pull_request event. Exits non-zero on any failure.
|
||||
|
||||
Checks:
|
||||
1. PR has >0 file changes (no empty PRs)
|
||||
2. PR branch is not behind base branch
|
||||
3. PR does not bundle >3 unrelated issues
|
||||
4. Changed .py files pass syntax check (python -c import)
|
||||
5. Changed .sh files are executable
|
||||
6. PR body references an issue number
|
||||
7. At least 1 non-author review exists (warning only)
|
||||
|
||||
Refs: #393 (PERPLEXITY-08), Epic #385
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import subprocess
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def fail(msg: str) -> None:
|
||||
print(f"FAIL: {msg}", file=sys.stderr)
|
||||
|
||||
|
||||
def warn(msg: str) -> None:
|
||||
print(f"WARN: {msg}", file=sys.stderr)
|
||||
|
||||
|
||||
def ok(msg: str) -> None:
|
||||
print(f" OK: {msg}")
|
||||
|
||||
|
||||
def get_changed_files() -> list[str]:
|
||||
"""Return list of files changed in this PR vs base branch."""
|
||||
base = os.environ.get("GITHUB_BASE_REF", "main")
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["git", "diff", "--name-only", f"origin/{base}...HEAD"],
|
||||
capture_output=True, text=True, check=True,
|
||||
)
|
||||
return [f for f in result.stdout.strip().splitlines() if f]
|
||||
except subprocess.CalledProcessError:
|
||||
# Fallback: diff against HEAD~1
|
||||
result = subprocess.run(
|
||||
["git", "diff", "--name-only", "HEAD~1"],
|
||||
capture_output=True, text=True, check=True,
|
||||
)
|
||||
return [f for f in result.stdout.strip().splitlines() if f]
|
||||
|
||||
|
||||
def check_has_changes(files: list[str]) -> bool:
|
||||
"""Check 1: PR has >0 file changes."""
|
||||
if not files:
|
||||
fail("PR has 0 file changes. Empty PRs are not allowed.")
|
||||
return False
|
||||
ok(f"PR changes {len(files)} file(s)")
|
||||
return True
|
||||
|
||||
|
||||
def check_not_behind_base() -> bool:
|
||||
"""Check 2: PR branch is not behind base."""
|
||||
base = os.environ.get("GITHUB_BASE_REF", "main")
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["git", "rev-list", "--count", f"HEAD..origin/{base}"],
|
||||
capture_output=True, text=True, check=True,
|
||||
)
|
||||
behind = int(result.stdout.strip())
|
||||
if behind > 0:
|
||||
fail(f"Branch is {behind} commit(s) behind {base}. Rebase or merge.")
|
||||
return False
|
||||
ok(f"Branch is up-to-date with {base}")
|
||||
return True
|
||||
except (subprocess.CalledProcessError, ValueError):
|
||||
warn("Could not determine if branch is behind base (git fetch may be needed)")
|
||||
return True # Don't block on CI fetch issues
|
||||
|
||||
|
||||
def check_issue_bundling(pr_body: str) -> bool:
|
||||
"""Check 3: PR does not bundle >3 unrelated issues."""
|
||||
issue_refs = set(re.findall(r"#(\d+)", pr_body))
|
||||
if len(issue_refs) > 3:
|
||||
fail(f"PR references {len(issue_refs)} issues ({', '.join(sorted(issue_refs))}). "
|
||||
"Max 3 per PR to prevent bundling. Split into separate PRs.")
|
||||
return False
|
||||
ok(f"PR references {len(issue_refs)} issue(s) (max 3)")
|
||||
return True
|
||||
|
||||
|
||||
def check_python_syntax(files: list[str]) -> bool:
|
||||
"""Check 4: Changed .py files have valid syntax."""
|
||||
py_files = [f for f in files if f.endswith(".py") and Path(f).exists()]
|
||||
if not py_files:
|
||||
ok("No Python files changed")
|
||||
return True
|
||||
|
||||
all_ok = True
|
||||
for f in py_files:
|
||||
result = subprocess.run(
|
||||
[sys.executable, "-c", f"import ast; ast.parse(open('{f}').read())"],
|
||||
capture_output=True, text=True,
|
||||
)
|
||||
if result.returncode != 0:
|
||||
fail(f"Syntax error in {f}: {result.stderr.strip()[:200]}")
|
||||
all_ok = False
|
||||
|
||||
if all_ok:
|
||||
ok(f"All {len(py_files)} Python file(s) pass syntax check")
|
||||
return all_ok
|
||||
|
||||
|
||||
def check_shell_executable(files: list[str]) -> bool:
|
||||
"""Check 5: Changed .sh files are executable."""
|
||||
sh_files = [f for f in files if f.endswith(".sh") and Path(f).exists()]
|
||||
if not sh_files:
|
||||
ok("No shell scripts changed")
|
||||
return True
|
||||
|
||||
all_ok = True
|
||||
for f in sh_files:
|
||||
if not os.access(f, os.X_OK):
|
||||
fail(f"{f} is not executable. Run: chmod +x {f}")
|
||||
all_ok = False
|
||||
|
||||
if all_ok:
|
||||
ok(f"All {len(sh_files)} shell script(s) are executable")
|
||||
return all_ok
|
||||
|
||||
|
||||
def check_issue_reference(pr_body: str) -> bool:
|
||||
"""Check 6: PR body references an issue number."""
|
||||
if re.search(r"#\d+", pr_body):
|
||||
ok("PR body references at least one issue")
|
||||
return True
|
||||
fail("PR body does not reference any issue (e.g. #123). "
|
||||
"Every PR must trace to an issue.")
|
||||
return False
|
||||
|
||||
|
||||
def main() -> int:
|
||||
print("=" * 60)
|
||||
print("PR Checklist — Automated Quality Gate")
|
||||
print("=" * 60)
|
||||
print()
|
||||
|
||||
# Get PR body from env or git log
|
||||
pr_body = os.environ.get("PR_BODY", "")
|
||||
if not pr_body:
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["git", "log", "--format=%B", "-1"],
|
||||
capture_output=True, text=True, check=True,
|
||||
)
|
||||
pr_body = result.stdout
|
||||
except subprocess.CalledProcessError:
|
||||
pr_body = ""
|
||||
|
||||
files = get_changed_files()
|
||||
failures = 0
|
||||
|
||||
checks = [
|
||||
check_has_changes(files),
|
||||
check_not_behind_base(),
|
||||
check_issue_bundling(pr_body),
|
||||
check_python_syntax(files),
|
||||
check_shell_executable(files),
|
||||
check_issue_reference(pr_body),
|
||||
]
|
||||
|
||||
failures = sum(1 for c in checks if not c)
|
||||
|
||||
print()
|
||||
print("=" * 60)
|
||||
if failures:
|
||||
print(f"RESULT: {failures} check(s) FAILED")
|
||||
print("Fix the issues above and push again.")
|
||||
return 1
|
||||
else:
|
||||
print("RESULT: All checks passed")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
@@ -115,7 +115,7 @@ display:
|
||||
tool_progress_command: false
|
||||
tool_progress: all
|
||||
privacy:
|
||||
redact_pii: true
|
||||
redact_pii: false
|
||||
tts:
|
||||
provider: edge
|
||||
edge:
|
||||
|
||||
@@ -81,7 +81,33 @@
|
||||
"last_error": null,
|
||||
"deliver": "local",
|
||||
"origin": null,
|
||||
"state": "scheduled",
|
||||
"state": "scheduled"
|
||||
},
|
||||
{
|
||||
"id": "5e9d952871bc",
|
||||
"name": "Agent Status Check",
|
||||
"prompt": "Check which tmux panes are idle vs working, report utilization",
|
||||
"schedule": {
|
||||
"kind": "interval",
|
||||
"minutes": 10,
|
||||
"display": "every 10m"
|
||||
},
|
||||
"schedule_display": "every 10m",
|
||||
"repeat": {
|
||||
"times": null,
|
||||
"completed": 8
|
||||
},
|
||||
"enabled": false,
|
||||
"created_at": "2026-03-24T11:28:46.409727-04:00",
|
||||
"next_run_at": "2026-03-24T15:45:58.108921-04:00",
|
||||
"last_run_at": "2026-03-24T15:35:58.108921-04:00",
|
||||
"last_status": "ok",
|
||||
"last_error": null,
|
||||
"deliver": "local",
|
||||
"origin": null,
|
||||
"state": "paused",
|
||||
"paused_at": "2026-03-24T16:23:03.869047-04:00",
|
||||
"paused_reason": "Dashboard repo frozen - loops redirected to the-nexus",
|
||||
"skills": [],
|
||||
"skill": null
|
||||
},
|
||||
@@ -106,69 +132,8 @@
|
||||
"last_status": null,
|
||||
"last_error": null,
|
||||
"deliver": "local",
|
||||
"origin": null,
|
||||
"skills": [],
|
||||
"skill": null
|
||||
},
|
||||
{
|
||||
"id": "muda-audit-weekly",
|
||||
"name": "Muda Audit",
|
||||
"prompt": "Run the Muda Audit script at /root/wizards/ezra/workspace/timmy-config/fleet/muda-audit.sh. The script measures the 7 wastes across the fleet and posts a report to Telegram. Report whether it succeeded or failed.",
|
||||
"schedule": {
|
||||
"kind": "cron",
|
||||
"expr": "0 21 * * 0",
|
||||
"display": "0 21 * * 0"
|
||||
},
|
||||
"schedule_display": "0 21 * * 0",
|
||||
"repeat": {
|
||||
"times": null,
|
||||
"completed": 0
|
||||
},
|
||||
"enabled": true,
|
||||
"created_at": "2026-04-07T15:00:00+00:00",
|
||||
"next_run_at": null,
|
||||
"last_run_at": null,
|
||||
"last_status": null,
|
||||
"last_error": null,
|
||||
"deliver": "local",
|
||||
"origin": null,
|
||||
"state": "scheduled",
|
||||
"paused_at": null,
|
||||
"paused_reason": null,
|
||||
"skills": [],
|
||||
"skill": null
|
||||
},
|
||||
{
|
||||
"id": "kaizen-retro-349",
|
||||
"name": "Kaizen Retro",
|
||||
"prompt": "Run the automated burn-cycle retrospective. Execute: cd /root/wizards/ezra/workspace/timmy-config && ./bin/kaizen-retro.sh",
|
||||
"model": "hermes3:latest",
|
||||
"provider": "ollama",
|
||||
"base_url": "http://localhost:11434/v1",
|
||||
"schedule": {
|
||||
"kind": "interval",
|
||||
"minutes": 1440,
|
||||
"display": "every 1440m"
|
||||
},
|
||||
"schedule_display": "daily at 07:30",
|
||||
"repeat": {
|
||||
"times": null,
|
||||
"completed": 0
|
||||
},
|
||||
"enabled": true,
|
||||
"created_at": "2026-04-07T15:30:00.000000Z",
|
||||
"next_run_at": "2026-04-08T07:30:00.000000Z",
|
||||
"last_run_at": null,
|
||||
"last_status": null,
|
||||
"last_error": null,
|
||||
"deliver": "local",
|
||||
"origin": null,
|
||||
"state": "scheduled",
|
||||
"paused_at": null,
|
||||
"paused_reason": null,
|
||||
"skills": [],
|
||||
"skill": null
|
||||
"origin": null
|
||||
}
|
||||
],
|
||||
"updated_at": "2026-04-07T15:00:00+00:00"
|
||||
}
|
||||
"updated_at": "2026-03-24T16:23:03.869797-04:00"
|
||||
}
|
||||
@@ -1,2 +0,0 @@
|
||||
# Muda Audit — run every Sunday at 21:00
|
||||
0 21 * * 0 cd /root/wizards/ezra/workspace/timmy-config && bash fleet/muda-audit.sh >> /tmp/muda-audit.log 2>&1
|
||||
@@ -1,110 +0,0 @@
|
||||
# Fleet Behaviour Hardening — Review & Action Plan
|
||||
|
||||
**Author:** @perplexity
|
||||
**Date:** 2026-04-08
|
||||
**Context:** Alexander asked: "Is it the memory system or the behaviour guardrails?"
|
||||
**Answer:** It's the guardrails. The memory system is adequate. The enforcement machinery is aspirational.
|
||||
|
||||
---
|
||||
|
||||
## Diagnosis: Why the Fleet Isn't Smart Enough
|
||||
|
||||
After auditing SOUL.md, config.yaml, all 8 playbooks, the orchestrator, the guard scripts, and the v7.0.0 checkin, the pattern is clear:
|
||||
|
||||
**The fleet has excellent design documents and broken enforcement.**
|
||||
|
||||
| Layer | Design Quality | Enforcement Quality | Gap |
|
||||
|---|---|---|---|
|
||||
| SOUL.md | Excellent | None — no code reads it at runtime | Philosophy without machinery |
|
||||
| Playbooks (7 yaml) | Good lane map | Not invoked by orchestrator | Playbooks exist but nobody calls them |
|
||||
| Guard scripts (9) | Solid code | 1 of 9 wired (#395 audit) | 89% of guards are dead code |
|
||||
| Orchestrator | Sound design | Gateway dispatch is a no-op (#391) | Assigns issues but doesn't trigger work |
|
||||
| Cycle Guard | Good 10-min rule | No cron/loop calls it | Discipline without enforcement |
|
||||
| PR Reviewer | Clear rules | Runs every 30m (if scheduled) | Only guard that might actually fire |
|
||||
| Memory (MemPalace) | Working code | Retrieval enforcer wired | Actually operational |
|
||||
|
||||
### The Core Problem
|
||||
|
||||
Agents pick up issues and produce output, but there is **no pre-task checklist** and **no post-task quality gate**. An agent can:
|
||||
|
||||
1. Start work without checking if someone else already did it
|
||||
2. Produce output without running tests
|
||||
3. Submit a PR without verifying it addresses the issue
|
||||
4. Work for hours on something out of scope
|
||||
5. Create duplicate branches/PRs without detection
|
||||
|
||||
The SOUL.md says "grounding before generation" but no code enforces it.
|
||||
The playbooks define lanes but the orchestrator doesn't load them.
|
||||
The guards exist but nothing calls them.
|
||||
|
||||
---
|
||||
|
||||
## What the Fleet Needs (Priority Order)
|
||||
|
||||
### 1. Pre-Task Gate (MISSING — this PR adds it)
|
||||
|
||||
Before an agent starts any issue:
|
||||
- [ ] Check if issue is already assigned to another agent
|
||||
- [ ] Check if a branch already exists for this issue
|
||||
- [ ] Check if a PR already exists for this issue
|
||||
- [ ] Load relevant MemPalace context (retrieval enforcer)
|
||||
- [ ] Verify the agent has the right lane for this work (playbook check)
|
||||
|
||||
### 2. Post-Task Gate (MISSING — this PR adds it)
|
||||
|
||||
Before an agent submits a PR:
|
||||
- [ ] Verify the diff addresses the issue title/body
|
||||
- [ ] Run syntax_guard.py on changed files
|
||||
- [ ] Check for duplicate PRs targeting the same issue
|
||||
- [ ] Verify branch name follows convention
|
||||
- [ ] Run tests if they exist for changed files
|
||||
|
||||
### 3. Wire the Existing Guards (8 of 9 are dead code)
|
||||
|
||||
Per #395 audit:
|
||||
- Pre-commit hooks: need symlink on every machine
|
||||
- Cycle guard: need cron/loop integration
|
||||
- Forge health check: need cron entry
|
||||
- Smoke test + deploy validate: need deploy script integration
|
||||
|
||||
### 4. Orchestrator Dispatch Actually Works
|
||||
|
||||
Per #391 audit: the orchestrator scores and assigns but the gateway dispatch just writes to `/tmp/hermes-dispatch.log`. Nobody reads that file. The dispatch needs to either:
|
||||
- Trigger `hermes` CLI on the target machine, or
|
||||
- Post a webhook that the agent loop picks up
|
||||
|
||||
### 5. Agent Self-Assessment Loop
|
||||
|
||||
After completing work, agents should answer:
|
||||
- Did I address the issue as stated?
|
||||
- Did I stay in scope?
|
||||
- Did I check the palace for prior work?
|
||||
- Did I run verification?
|
||||
|
||||
This is what SOUL.md calls "the apparatus that gives these words teeth."
|
||||
|
||||
---
|
||||
|
||||
## What's Working (Don't Touch)
|
||||
|
||||
- **MemPalace sovereign_store.py** — SQLite + FTS5 + HRR, operational
|
||||
- **Retrieval enforcer** — wired to SovereignStore as of 14 hours ago
|
||||
- **Wake-up protocol** — palace-first boot sequence
|
||||
- **PR reviewer playbook** — clear rules, well-scoped
|
||||
- **Issue triager playbook** — comprehensive lane map with 11 agents
|
||||
- **Cycle guard code** — solid 10-min slice discipline (just needs wiring)
|
||||
- **Config drift guard** — active cron, working
|
||||
- **Dead man switch** — active, working
|
||||
|
||||
---
|
||||
|
||||
## Recommendation
|
||||
|
||||
The memory system is not the bottleneck. The behaviour guardrails are. Specifically:
|
||||
|
||||
1. **Add `task_gate.py`** — pre-task and post-task quality gates that every agent loop calls
|
||||
2. **Wire cycle_guard.py** — add start/complete calls to agent loop
|
||||
3. **Wire pre-commit hooks** — deploy script should symlink on provision
|
||||
4. **Fix orchestrator dispatch** — make it actually trigger work, not just log
|
||||
|
||||
This PR adds item 1. Items 2-4 need SSH access and are flagged for Timmy/Allegro.
|
||||
@@ -1,141 +0,0 @@
|
||||
# Memory Architecture
|
||||
|
||||
> How Timmy remembers, recalls, and learns — without hallucinating.
|
||||
|
||||
Refs: Epic #367 | Sub-issues #368, #369, #370, #371, #372
|
||||
|
||||
## Overview
|
||||
|
||||
Timmy's memory system uses a **Memory Palace** architecture — a structured, file-backed knowledge store organized into rooms and drawers. When faced with a recall question, the agent checks its palace *before* generating from scratch.
|
||||
|
||||
This document defines the retrieval order, storage layers, and data flow that make this work.
|
||||
|
||||
## Retrieval Order (L0–L5)
|
||||
|
||||
When the agent receives a prompt that looks like a recall question ("what did we do?", "what's the status of X?"), the retrieval enforcer intercepts it and walks through layers in order:
|
||||
|
||||
| Layer | Source | Question Answered | Short-circuits? |
|
||||
|-------|--------|-------------------|------------------|
|
||||
| L0 | `identity.txt` | Who am I? What are my mandates? | No (always loaded) |
|
||||
| L1 | Palace rooms/drawers | What do I know about this topic? | Yes, if hit |
|
||||
| L2 | Session scratchpad | What have I learned this session? | Yes, if hit |
|
||||
| L3 | Artifact retrieval (Gitea API) | Can I fetch the actual issue/file/log? | Yes, if hit |
|
||||
| L4 | Procedures/playbooks | Is there a documented way to do this? | Yes, if hit |
|
||||
| L5 | Free generation | (Only when L0–L4 are exhausted) | N/A |
|
||||
|
||||
**Key principle:** The agent never reaches L5 (free generation) if any prior layer has relevant data. This eliminates hallucination for recall-style queries.
|
||||
|
||||
## Storage Layout
|
||||
|
||||
```
|
||||
~/.mempalace/
|
||||
identity.txt # L0: Who I am, mandates, personality
|
||||
rooms/
|
||||
projects/
|
||||
timmy-config.md # What I know about timmy-config
|
||||
hermes-agent.md # What I know about hermes-agent
|
||||
people/
|
||||
alexander.md # Working relationship context
|
||||
architecture/
|
||||
fleet.md # Fleet system knowledge
|
||||
mempalace.md # Self-knowledge about this system
|
||||
config/
|
||||
mempalace.yaml # Palace configuration
|
||||
|
||||
~/.hermes/
|
||||
scratchpad/
|
||||
{session_id}.json # L2: Ephemeral session context
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
### 1. Memory Palace Skill (`mempalace.py`) — #368
|
||||
|
||||
Core data structures:
|
||||
- `PalaceRoom`: A named collection of drawers (topics)
|
||||
- `Mempalace`: The top-level palace with room management
|
||||
- Factory constructors: `for_issue_analysis()`, `for_health_check()`, `for_code_review()`
|
||||
|
||||
### 2. Retrieval Enforcer (`retrieval_enforcer.py`) — #369
|
||||
|
||||
Middleware that intercepts recall-style prompts:
|
||||
1. Detects recall patterns ("what did", "status of", "last time we")
|
||||
2. Walks L0→L4 in order, short-circuiting on first hit
|
||||
3. Only allows free generation (L5) when all layers return empty
|
||||
4. Produces an honest fallback: "I don't have this in my memory palace."
|
||||
|
||||
### 3. Session Scratchpad (`scratchpad.py`) — #370
|
||||
|
||||
Ephemeral, session-scoped working memory:
|
||||
- Write-append only during a session
|
||||
- Entries have TTL (default: 1 hour)
|
||||
- Queried at L2 in retrieval chain
|
||||
- Never auto-promoted to palace
|
||||
|
||||
### 4. Memory Promotion — #371
|
||||
|
||||
Explicit promotion from scratchpad to palace:
|
||||
- Agent must call `promote_to_palace()` with a reason
|
||||
- Dedup check against target drawer
|
||||
- Summary required (raw tool output never stored)
|
||||
- Conflict detection when new memory contradicts existing
|
||||
|
||||
### 5. Wake-Up Protocol (`wakeup.py`) — #372
|
||||
|
||||
Boot sequence for new sessions:
|
||||
```
|
||||
Session Start
|
||||
│
|
||||
├─ L0: Load identity.txt
|
||||
├─ L1: Scan palace rooms for active context
|
||||
├─ L1.5: Surface promoted memories from last session
|
||||
├─ L2: Load surviving scratchpad entries
|
||||
│
|
||||
└─ Ready: agent knows who it is, what it was doing, what it learned
|
||||
```
|
||||
|
||||
## Data Flow
|
||||
|
||||
```
|
||||
┌──────────────────┐
|
||||
│ User Prompt │
|
||||
└────────┬─────────┘
|
||||
│
|
||||
┌────────┴─────────┐
|
||||
│ Recall Detector │
|
||||
└────┬───────┬─────┘
|
||||
│ │
|
||||
[recall] [not recall]
|
||||
│ │
|
||||
┌───────┴────┐ ┌──┬─┴───────┐
|
||||
│ Retrieval │ │ Normal Flow │
|
||||
│ Enforcer │ └─────────────┘
|
||||
│ L0→L1→L2 │
|
||||
│ →L3→L4→L5│
|
||||
└──────┬─────┘
|
||||
│
|
||||
┌──────┴─────┐
|
||||
│ Response │
|
||||
│ (grounded) │
|
||||
└────────────┘
|
||||
```
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
| Don't | Do Instead |
|
||||
|-------|------------|
|
||||
| Generate from vibes when palace has data | Check palace first (L1) |
|
||||
| Auto-promote everything to palace | Require explicit `promote_to_palace()` with reason |
|
||||
| Store raw API responses as memories | Summarize before storing |
|
||||
| Hallucinate when palace is empty | Say "I don't have this in my memory palace" |
|
||||
| Dump entire palace on wake-up | Selective loading based on session context |
|
||||
|
||||
## Status
|
||||
|
||||
| Component | Issue | PR | Status |
|
||||
|-----------|-------|----|--------|
|
||||
| Skill port | #368 | #374 | In Review |
|
||||
| Retrieval enforcer | #369 | #374 | In Review |
|
||||
| Session scratchpad | #370 | #374 | In Review |
|
||||
| Memory promotion | #371 | — | Open |
|
||||
| Wake-up protocol | #372 | #374 | In Review |
|
||||
@@ -1,212 +0,0 @@
|
||||
# Lazarus Cell Specification v1.0
|
||||
|
||||
**Canonical epic:** `Timmy_Foundation/timmy-config#267`
|
||||
**Author:** Ezra (architect)
|
||||
**Date:** 2026-04-06
|
||||
**Status:** Draft — open for burn-down by `#269` `#270` `#271` `#272` `#273` `#274`
|
||||
|
||||
---
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
This document defines the **Cell** — the fundamental isolation primitive of the Lazarus Pit v2.0. Every downstream implementation (isolation layer, invitation protocol, backend abstraction, teaming model, verification suite, and operator surface) must conform to the invariants, roles, lifecycle, and publication rules defined here.
|
||||
|
||||
---
|
||||
|
||||
## 2. Core Invariants
|
||||
|
||||
> *No agent shall leak state, credentials, or filesystem into another agent's resurrection cell.*
|
||||
|
||||
### 2.1 Cell Invariant Definitions
|
||||
|
||||
| Invariant | Meaning | Enforcement |
|
||||
|-----------|---------|-------------|
|
||||
| **I1 — Filesystem Containment** | A cell may only read/write paths under its assigned `CELL_HOME`. No traversal into host `~/.hermes/`, `/root/wizards/`, or other cells. | Mount namespace (Level 2+) or strict chroot + AppArmor (Level 1) |
|
||||
| **I2 — Credential Isolation** | Host tokens, env files, and SSH keys are never copied into a cell. Only per-cell credential pools are injected at spawn. | Harness strips `HERMES_*` and `HOME`; injects `CELL_CREDENTIALS` manifest |
|
||||
| **I3 — Process Boundary** | A cell runs as an independent OS process or container. It cannot ptrace, signal, or inspect sibling cells. | PID namespace, seccomp, or Docker isolation |
|
||||
| **I4 — Network Segmentation** | A cell does not bind to host-private ports or sniff host traffic unless explicitly proxied. | Optional network namespace / proxy boundary |
|
||||
| **I5 — Memory Non-Leakage** | Shared memory, IPC sockets, and tmpfs mounts are cell-scoped. No post-exit residue in host `/tmp` or `/dev/shm`. | TTL cleanup + graveyard garbage collection (`#273`) |
|
||||
| **I6 — Audit Trail** | Every cell mutation (spawn, invite, checkpoint, close) is logged to an immutable ledger (Gitea issue comment or local append-only log). | Required for all production cells |
|
||||
|
||||
---
|
||||
|
||||
## 3. Role Taxonomy
|
||||
|
||||
Every participant in a cell is assigned exactly one role at invitation time. Roles are immutable for the duration of the session.
|
||||
|
||||
| Role | Permissions | Typical Holder |
|
||||
|------|-------------|----------------|
|
||||
| **director** | Can invite others, trigger checkpoints, close the cell, and override cell decisions. Cannot directly execute tools unless also granted `executor`. | Human operator (Alexander) or fleet commander (Timmy) |
|
||||
| **executor** | Full tool execution and filesystem write access within the cell. Can push commits to the target project repo. | Fleet agents (Ezra, Allegro, etc.) |
|
||||
| **observer** | Read-only access to cell filesystem and shared scratchpad. Cannot execute tools or mutate state. | Human reviewer, auditor, or training monitor |
|
||||
| **guest** | Same permissions as `executor`, but sourced from outside the fleet. Subject to stricter backend isolation (Docker by default). | External bots (Codex, Gemini API, Grok, etc.) |
|
||||
| **substitute** | A special `executor` who joins to replace a downed agent. Inherits the predecessor's last checkpoint but not their home memory. | Resurrection-pool fallback agent |
|
||||
|
||||
### 3.1 Role Combinations
|
||||
|
||||
- A single participant may hold **at most one** primary role.
|
||||
- A `director` may temporarily downgrade to `observer` but cannot upgrade to `executor` without a new invitation.
|
||||
- `guest` and `substitute` roles must be explicitly enabled in cell policy.
|
||||
|
||||
---
|
||||
|
||||
## 4. Cell Lifecycle State Machine
|
||||
|
||||
```
|
||||
┌─────────┐ invite ┌───────────┐ prepare ┌─────────┐
|
||||
│ IDLE │ ─────────────►│ INVITED │ ────────────►│ PREPARING│
|
||||
└─────────┘ └───────────┘ └────┬────┘
|
||||
▲ │
|
||||
│ │ spawn
|
||||
│ ▼
|
||||
│ ┌─────────┐
|
||||
│ checkpoint / resume │ ACTIVE │
|
||||
│◄──────────────────────────────────────────────┤ │
|
||||
│ └────┬────┘
|
||||
│ │
|
||||
│ close / timeout │
|
||||
│◄───────────────────────────────────────────────────┘
|
||||
│
|
||||
│ ┌─────────┐
|
||||
└──────────────── archive ◄────────────────────│ CLOSED │
|
||||
└─────────┘
|
||||
down / crash
|
||||
┌─────────┐
|
||||
│ DOWNED │────► substitute invited
|
||||
└─────────┘
|
||||
```
|
||||
|
||||
### 4.1 State Definitions
|
||||
|
||||
| State | Description | Valid Transitions |
|
||||
|-------|-------------|-------------------|
|
||||
| **IDLE** | Cell does not yet exist in the registry. | `INVITED` |
|
||||
| **INVITED** | An invitation token has been generated but not yet accepted. | `PREPARING` (on accept), `CLOSED` (on expiry/revoke) |
|
||||
| **PREPARING** | Cell directory is being created, credentials injected, backend initialized. | `ACTIVE` (on successful spawn), `CLOSED` (on failure) |
|
||||
| **ACTIVE** | At least one participant is running in the cell. Tool execution is permitted. | `CHECKPOINTING`, `CLOSED`, `DOWNED` |
|
||||
| **CHECKPOINTING** | A snapshot of cell state is being captured. | `ACTIVE` (resume), `CLOSED` (if final) |
|
||||
| **DOWNED** | An `ACTIVE` agent missed heartbeats. Cell is frozen pending recovery. | `ACTIVE` (revived), `CLOSED` (abandoned) |
|
||||
| **CLOSED** | Cell has been explicitly closed or TTL expired. Filesystem enters grace period. | `ARCHIVED` |
|
||||
| **ARCHIVED** | Cell artifacts (logs, checkpoints, decisions) are persisted. Filesystem may be scrubbed. | — (terminal) |
|
||||
|
||||
### 4.2 TTL and Grace Rules
|
||||
|
||||
- **Active TTL:** Default 4 hours. Renewable by `director` up to a max of 24 hours.
|
||||
- **Invited TTL:** Default 15 minutes. Unused invitations auto-revoke.
|
||||
- **Closed Grace:** 30 minutes. Cell filesystem remains recoverable before scrubbing.
|
||||
- **Archived Retention:** 30 days. After which checkpoints may be moved to cold storage or deleted per policy.
|
||||
|
||||
---
|
||||
|
||||
## 5. Publication Rules
|
||||
|
||||
The Cell is **not** a source of truth for fleet state. It is a scratch space. The following rules govern what may leave the cell boundary.
|
||||
|
||||
### 5.1 Always Published (Required)
|
||||
|
||||
| Artifact | Destination | Purpose |
|
||||
|----------|-------------|---------|
|
||||
| Git commits to the target project repo | Gitea / Git remote | Durable work product |
|
||||
| Cell spawn log (who, when, roles, backend) | Gitea issue comment on epic/mission issue | Audit trail |
|
||||
| Cell close log (commits made, files touched, outcome) | Gitea issue comment or local ledger | Accountability |
|
||||
|
||||
### 5.2 Never Published (Cell-Local Only)
|
||||
|
||||
| Artifact | Reason |
|
||||
|----------|--------|
|
||||
| `shared_scratchpad` drafts and intermediate reasoning | May contain false starts, passwords mentioned in context, or incomplete thoughts |
|
||||
| Per-cell credentials and invite tokens | Security — must not leak into commit history |
|
||||
| Agent home memory files (even read-only copies) | Privacy and sovereignty of the agent's home |
|
||||
| Internal tool-call traces | Noise and potential PII |
|
||||
|
||||
### 5.3 Optionally Published (Director Decision)
|
||||
|
||||
| Artifact | Condition |
|
||||
|----------|-----------|
|
||||
| `decisions.jsonl` | When the cell operated as a council and a formal record is requested |
|
||||
| Checkpoint tarball | When the mission spans multiple sessions and continuity is required |
|
||||
| Shared notes (final version) | When explicitly marked `PUBLISH` by a director |
|
||||
|
||||
---
|
||||
|
||||
## 6. Filesystem Layout
|
||||
|
||||
Every cell, regardless of backend, exposes the same directory contract:
|
||||
|
||||
```
|
||||
/tmp/lazarus-cells/{cell_id}/
|
||||
├── .lazarus/
|
||||
│ ├── cell.json # cell metadata (roles, TTL, backend, target repo)
|
||||
│ ├── spawn.log # immutable spawn record
|
||||
│ ├── decisions.jsonl # logged votes / approvals / directives
|
||||
│ └── checkpoints/ # snapshot tarballs
|
||||
├── project/ # cloned target repo (if applicable)
|
||||
├── shared/
|
||||
│ ├── scratchpad.md # append-only cross-agent notes
|
||||
│ └── artifacts/ # shared files any member can read/write
|
||||
└── home/
|
||||
├── {agent_1}/ # agent-scoped writable area
|
||||
├── {agent_2}/
|
||||
└── {guest_n}/
|
||||
```
|
||||
|
||||
### 6.1 Backend Mapping
|
||||
|
||||
| Backend | `CELL_HOME` realization | Isolation Level |
|
||||
|---------|------------------------|-----------------|
|
||||
| `process` | `tmpdir` + `HERMES_HOME` override | Level 1 (directory + env) |
|
||||
| `venv` | Separate Python venv + `HERMES_HOME` | Level 1.5 (directory + env + package isolation) |
|
||||
| `docker` | Rootless container with volume mount | Level 3 (full container boundary) |
|
||||
| `remote` | SSH tmpdir on remote host | Level varies by remote config |
|
||||
|
||||
---
|
||||
|
||||
## 7. Graveyard and Retention Policy
|
||||
|
||||
When a cell closes, it enters the **Graveyard** — a quarantined holding area before final scrubbing.
|
||||
|
||||
### 7.1 Graveyard Rules
|
||||
|
||||
```
|
||||
ACTIVE ──► CLOSED ──► /tmp/lazarus-graveyard/{cell_id}/ ──► TTL grace ──► SCRUBBED
|
||||
```
|
||||
|
||||
- **Grace period:** 30 minutes (configurable per mission)
|
||||
- **During grace:** A director may issue `lazarus resurrect {cell_id}` to restore the cell to `ACTIVE`
|
||||
- **After grace:** Filesystem is recursively deleted. Checkpoints are moved to `lazarus-archive/{date}/{cell_id}/`
|
||||
|
||||
### 7.2 Retention Tiers
|
||||
|
||||
| Tier | Location | Retention | Access |
|
||||
|------|----------|-----------|--------|
|
||||
| Hot Graveyard | `/tmp/lazarus-graveyard/` | 30 min | Director only |
|
||||
| Warm Archive | `~/.lazarus/archive/` | 30 days | Fleet agents (read-only) |
|
||||
| Cold Storage | Optional S3 / IPFS / Gitea release asset | 1 year | Director only |
|
||||
|
||||
---
|
||||
|
||||
## 8. Cross-References
|
||||
|
||||
- Epic: `timmy-config#267`
|
||||
- Isolation implementation: `timmy-config#269`
|
||||
- Invitation protocol: `timmy-config#270`
|
||||
- Backend abstraction: `timmy-config#271`
|
||||
- Teaming model: `timmy-config#272`
|
||||
- Verification suite: `timmy-config#273`
|
||||
- Operator surface: `timmy-config#274`
|
||||
- Existing skill: `lazarus-pit-recovery` (to be updated to this spec)
|
||||
- Related protocol: `timmy-config#245` (Phoenix Protocol recovery benchmarks)
|
||||
|
||||
---
|
||||
|
||||
## 9. Acceptance Criteria for This Spec
|
||||
|
||||
- [ ] All downstream issues (`#269`–`#274`) can be implemented without ambiguity about roles, states, or filesystem boundaries.
|
||||
- [ ] A new developer can read this doc and implement a compliant `process` backend in one session.
|
||||
- [ ] The spec has been reviewed and ACK'd by at least one other wizard before `#269` merges.
|
||||
|
||||
---
|
||||
|
||||
*Sovereignty and service always.*
|
||||
|
||||
— Ezra
|
||||
@@ -353,11 +353,3 @@ cp ~/.hermes/sessions/sessions.json ~/.hermes/sessions/sessions.json.bak.$(date
|
||||
4. Keep docs-only PRs and script-import PRs on clean branches from `origin/main`; do not mix them with unrelated local history.
|
||||
|
||||
Until those are reconciled, trust this inventory over older prose.
|
||||
|
||||
### Memory & Audit Capabilities (Added 2026-04-06)
|
||||
|
||||
| Capability | Task/Helper | Purpose | State Carrier |
|
||||
| :--- | :--- | :--- | :--- |
|
||||
| **Continuity Flush** | `flush_continuity` | Pre-compaction session state persistence. | `~/.timmy/continuity/active.md` |
|
||||
| **Sovereign Audit** | `audit_log` | Automated action logging with confidence signaling. | `~/.timmy/logs/audit.jsonl` |
|
||||
| **Fallback Routing** | `get_model_for_task` | Dynamic model selection based on portfolio doctrine. | `fallback-portfolios.yaml` |
|
||||
|
||||
@@ -1,50 +0,0 @@
|
||||
# Fleet Cost & Resource Inventory
|
||||
|
||||
Last audited: 2026-04-06
|
||||
Owner: Timmy Foundation Ops
|
||||
|
||||
## Model Inference Providers
|
||||
|
||||
| Provider | Type | Cost Model | Agents Using | Est. Monthly |
|
||||
|---|---|---|---|---|
|
||||
| OpenRouter (qwen3.6-plus:free) | API | Free tier | Code Claw, Timmy | $0 |
|
||||
| OpenRouter (various) | API | Credits | Fleet | varies |
|
||||
| Anthropic (Claude Code) | API | Subscription | claw-code fallback | ~$20/mo |
|
||||
| Google AI Studio (Gemini) | Portal | Free daily quota | Strategic tasks | $0 |
|
||||
| Ollama (local) | Local | Electricity only | Mac Hermes | $0 |
|
||||
|
||||
## VPS Infrastructure
|
||||
|
||||
| Server | IP | Cost/Mo | Running | Key Services |
|
||||
|---|---|---|---|---|
|
||||
| Ezra | 143.198.27.163 | $12/mo | Yes | Gitea, agent hosting |
|
||||
| Allegro | 167.99.126.228 | $12/mo | Yes | Agent hosting |
|
||||
| Bezalel | 159.203.146.185 | $12/mo | Yes | Evennia, agent hosting |
|
||||
| **Total VPS** | | **~$36/mo** | | |
|
||||
|
||||
## Local Infrastructure
|
||||
| Resource | Cost |
|
||||
|---|---|
|
||||
| MacBook (owner-provided) | Electricity only |
|
||||
| Ollama models (downloaded) | Free |
|
||||
| Git/Dev tools (OSS) | Free |
|
||||
|
||||
## Cost Recommendations
|
||||
|
||||
| Agent | Verdict | Reason |
|
||||
|---|---|---|
|
||||
| Code Claw (OpenRouter) | DEPLOY | Free tier, adequate for small patches |
|
||||
| Gemini AI Studio | DEPLOY | Free daily quota, good for heavy reasoning |
|
||||
| Ollama local | DEPLOY | No API cost, sovereignty |
|
||||
| VPS fleet | DEPLOY | $36/mo for 3 servers is minimal |
|
||||
| Anthropic subscriptions | MONITOR | Burn $20/mo per seat; watch usage vs output |
|
||||
|
||||
## Monthly Burn Rate Estimate
|
||||
- **Floor (essential):** ~$36/mo (VPS only)
|
||||
- **Current (with Anthropic):** ~$56-76/mo
|
||||
- **Ceiling (all providers maxed):** ~$100+/mo
|
||||
|
||||
## Notes
|
||||
- No GPU instances provisioned yet (no cloud costs)
|
||||
- OpenRouter free tier has rate limits
|
||||
- Gemini AI Studio daily quota resets automatically
|
||||
@@ -1,37 +0,0 @@
|
||||
|
||||
# Sovereign Handoff: Timmy Takes the Reigns
|
||||
|
||||
**Date:** 2026-04-06
|
||||
**Status:** In Progress (Milestone: Sovereign Orchestration)
|
||||
|
||||
## Overview
|
||||
This document marks the transition from "Assisted Coordination" to "Sovereign Orchestration." Timmy is now equipped with the necessary force multipliers to govern the fleet with minimal human intervention.
|
||||
|
||||
## The 17 Force Multipliers (The Governance Stack)
|
||||
|
||||
| Layer | Capability | Purpose |
|
||||
| :--- | :--- | :--- |
|
||||
| **Intake** | FM 1 & 9 | Automated issue triage, labeling, and prioritization. |
|
||||
| **Context** | FM 15 | Pre-flight memory injection (briefing) for every agent task. |
|
||||
| **Execution** | FM 3 & 7 | Dynamic model routing and fallback portfolios for resilience. |
|
||||
| **Verification** | FM 10 | Automated PR quality gate (Proof of Work audit). |
|
||||
| **Self-Healing** | FM 11 | Lazarus Heartbeat (automated service resurrection). |
|
||||
| **Merging** | FM 14 | Green-Light Auto-Merge for low-risk, verified paths. |
|
||||
| **Reporting** | FM 13 & 16 | Velocity tracking and Nexus Bridge (3D health feed). |
|
||||
| **Integrity** | FM 17 | Automated documentation freshness audit. |
|
||||
|
||||
## The Governance Loop
|
||||
1. **Triage:** FM 1/9 labels new issues.
|
||||
2. **Assign:** Timmy assigns tasks to agents based on role classes (FM 3/7).
|
||||
3. **Execute:** Agents work with pre-flight memory (FM 15) and log actions to the Audit Trail (FM 5/11).
|
||||
4. **Review:** FM 10 audits PRs for Proof of Work.
|
||||
5. **Merge:** FM 14 auto-merges low-risk PRs; Alexander reviews high-risk ones.
|
||||
6. **Report:** FM 13/16 updates the metrics and Nexus HUD.
|
||||
|
||||
## Final Milestone Goals
|
||||
- [ ] Merge PRs #296 - #312.
|
||||
- [ ] Verify Lazarus Heartbeat restarts a killed service.
|
||||
- [ ] Observe first Auto-Merge of a verified PR.
|
||||
- [ ] Review first Morning Report with velocity metrics.
|
||||
|
||||
**Timmy is now ready to take the reigns.**
|
||||
4
evaluations/crewai/.gitignore
vendored
@@ -1,4 +0,0 @@
|
||||
venv/
|
||||
__pycache__/
|
||||
*.pyc
|
||||
.env
|
||||
@@ -1,140 +0,0 @@
|
||||
# CrewAI Evaluation for Phase 2 Integration
|
||||
|
||||
**Date:** 2026-04-07
|
||||
**Issue:** [#358 ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration
|
||||
**Author:** Ezra
|
||||
**House:** hermes-ezra
|
||||
|
||||
## Summary
|
||||
|
||||
CrewAI was installed, a 2-agent proof-of-concept crew was built, and an operational test was attempted against issue #358. Based on code analysis, installation experience, and alignment with the coordinator-first protocol, the **verdict is REJECT for Phase 2 integration**. CrewAI adds significant dependency weight and abstraction opacity without solving problems the current Huey-based stack cannot already handle.
|
||||
|
||||
---
|
||||
|
||||
## 1. Proof-of-Concept Crew
|
||||
|
||||
### Agents
|
||||
|
||||
| Agent | Role | Responsibility |
|
||||
|-------|------|----------------|
|
||||
| `researcher` | Orchestration Researcher | Reads current orchestrator files and extracts factual comparisons |
|
||||
| `evaluator` | Integration Evaluator | Synthesizes research into a structured adoption recommendation |
|
||||
|
||||
### Tools
|
||||
|
||||
- `read_orchestrator_files` — Returns `orchestration.py`, `tasks.py`, `bin/timmy-orchestrator.sh`, and `docs/coordinator-first-protocol.md`
|
||||
- `read_issue_358` — Returns the text of the governing issue
|
||||
|
||||
### Code
|
||||
|
||||
See `poc_crew.py` in this directory for the full implementation.
|
||||
|
||||
---
|
||||
|
||||
## 2. Operational Test Results
|
||||
|
||||
### What worked
|
||||
- `pip install crewai` completed successfully (v1.13.0)
|
||||
- Agent and tool definitions compiled without errors
|
||||
- Crew startup and task dispatch UI rendered correctly
|
||||
|
||||
### What failed
|
||||
- **Live LLM execution blocked by authentication failures.** Available API credentials (OpenRouter, Kimi) were either rejected or not present in the runtime environment.
|
||||
- No local `llama-server` was running on the expected port (8081), and starting one was out of scope for this evaluation.
|
||||
|
||||
### Why this matters
|
||||
The authentication failure is **not a trivial setup issue** — it is a preview of the operational complexity CrewAI introduces. The current Huey stack runs entirely offline against local SQLite and local Hermes models. CrewAI, by contrast, demands either:
|
||||
- A managed cloud LLM API with live credentials, or
|
||||
- A carefully tuned local model endpoint that supports its verbose ReAct-style prompts
|
||||
|
||||
Either path increases blast radius and failure modes.
|
||||
|
||||
---
|
||||
|
||||
## 3. Current Custom Orchestrator Analysis
|
||||
|
||||
### Stack
|
||||
- **Huey** (`orchestration.py`) — SQLite-backed task queue, ~6 lines of initialization
|
||||
- **tasks.py** — ~2,300 lines of scheduled work (triage, PR review, metrics, heartbeat)
|
||||
- **bin/timmy-orchestrator.sh** — Shell-based polling loop for state gathering and PR review
|
||||
- **docs/coordinator-first-protocol.md** — Intake → Triage → Route → Track → Verify → Report
|
||||
|
||||
### Strengths
|
||||
1. **Sovereignty** — No external SaaS dependency for queue execution. SQLite is local and inspectable.
|
||||
2. **Gitea as truth** — All state mutations are visible in the forge. Local-only state is explicitly advisory.
|
||||
3. **Simplicity** — Huey has a tiny surface area. A human can read `orchestration.py` in seconds.
|
||||
4. **Tool-native** — `tasks.py` calls Hermes directly via `subprocess.run([HERMES_PYTHON, ...])`. No framework indirection.
|
||||
5. **Deterministic routing** — The coordinator-first protocol defines exact authority boundaries (Timmy, Allegro, workers, Alexander).
|
||||
|
||||
### Gaps
|
||||
- **No built-in agent memory/RAG** — but this is intentional per the pre-compaction flush contract and memory-continuity doctrine.
|
||||
- **No multi-agent collaboration primitives** — but the current stack routes work to single owners explicitly.
|
||||
- **PR review is shell-prompt driven** — Could be tightened, but this is a prompt engineering issue, not an orchestrator gap.
|
||||
|
||||
---
|
||||
|
||||
## 4. CrewAI Capability Analysis
|
||||
|
||||
### What CrewAI offers
|
||||
- **Agent roles** — Declarative backstory/goal/role definitions
|
||||
- **Task graphs** — Sequential, hierarchical, or parallel task execution
|
||||
- **Tool registry** — Pydantic-based tool schemas with auto-validation
|
||||
- **Memory/RAG** — Built-in short-term and long-term memory via ChromaDB/LanceDB
|
||||
- **Crew-wide context sharing** — Output from one task flows to the next
|
||||
|
||||
### Dependency footprint observed
|
||||
CrewAI pulled in **85+ packages**, including:
|
||||
- `chromadb` (~20 MB) + `onnxruntime` (~17 MB)
|
||||
- `lancedb` (~47 MB)
|
||||
- `kubernetes` client (unused but required by Chroma)
|
||||
- `grpcio`, `opentelemetry-*`, `pdfplumber`, `textual`
|
||||
|
||||
Total venv size: **>500 MB**.
|
||||
|
||||
By contrast, Huey is **one package** (`huey`) with zero required services.
|
||||
|
||||
---
|
||||
|
||||
## 5. Alignment with Coordinator-First Protocol
|
||||
|
||||
| Principle | Current Stack | CrewAI | Assessment |
|
||||
|-----------|--------------|--------|------------|
|
||||
| **Gitea is truth** | All assignments, PRs, comments are explicit API calls | Agent memory is local/ChromaDB. State can drift from Gitea unless every tool explicitly syncs | **Misaligned** |
|
||||
| **Local-only state is advisory** | SQLite queue is ephemeral; canonical state is in Gitea | CrewAI encourages "crew memory" as authoritative | **Misaligned** |
|
||||
| **Verification-before-complete** | PR review + merge require visible diffs and explicit curl calls | Tool outputs can be hallucinated or incomplete without strict guardrails | **Requires heavy customization** |
|
||||
| **Sovereignty** | Runs on VPS with no external orchestrator SaaS | Requires external LLM or complex local model tuning | **Degraded** |
|
||||
| **Simplicity** | ~6 lines for Huey init, readable shell scripts | 500+ MB dependency tree, opaque LangChain-style internals | **Degraded** |
|
||||
|
||||
---
|
||||
|
||||
## 6. Verdict
|
||||
|
||||
**REJECT CrewAI for Phase 2 integration.**
|
||||
|
||||
**Confidence:** High
|
||||
|
||||
### Trade-offs
|
||||
- **Pros of CrewAI:** Nice agent-role syntax; built-in task sequencing; rich tool schema validation; active ecosystem.
|
||||
- **Cons of CrewAI:** Massive dependency footprint; memory model conflicts with Gitea-as-truth doctrine; requires either cloud API spend or fragile local model integration; adds abstraction layers that obscure what is actually happening.
|
||||
|
||||
### Risks if adopted
|
||||
1. **Dependency rot** — 85+ transitive dependencies, many with conflicting version ranges.
|
||||
2. **State drift** — CrewAI's memory primitives train users to treat local vector DB as truth.
|
||||
3. **Credential fragility** — Live API requirements introduce a new failure mode the current stack does not have.
|
||||
4. **Vendor-like lock-in** — CrewAI's abstractions sit thickly over LangChain. Debugging a stuck crew is harder than debugging a Huey task traceback.
|
||||
|
||||
### Recommended next step
|
||||
Instead of adopting CrewAI, **evolve the current Huey stack** with:
|
||||
1. A lightweight `Agent` dataclass in `tasks.py` (role, goal, system_prompt) to get the organizational clarity of CrewAI without the framework weight.
|
||||
2. A `delegate()` helper that uses Hermes's existing `delegate_tool.py` for multi-agent work.
|
||||
3. Keep Gitea as the only durable state surface. Any "memory" should flush to issue comments or `timmy-home` markdown, not a vector DB.
|
||||
|
||||
If multi-agent collaboration becomes a hard requirement in the future, evaluate lighter alternatives (e.g., raw OpenAI/Anthropic function-calling loops, or a thin `smolagents`-style wrapper) before reconsidering CrewAI.
|
||||
|
||||
---
|
||||
|
||||
## Artifacts
|
||||
|
||||
- `poc_crew.py` — 2-agent CrewAI proof-of-concept
|
||||
- `requirements.txt` — Dependency manifest
|
||||
- `CREWAI_EVALUATION.md` — This document
|
||||
@@ -1,150 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""CrewAI proof-of-concept for evaluating Phase 2 orchestrator integration.
|
||||
|
||||
Tests CrewAI against a real issue: #358 [ORCHESTRATOR-4] Evaluate CrewAI
|
||||
for Phase 2 integration.
|
||||
"""
|
||||
|
||||
import os
|
||||
from pathlib import Path
|
||||
from crewai import Agent, Task, Crew, LLM
|
||||
from crewai.tools import BaseTool
|
||||
|
||||
# ── Configuration ─────────────────────────────────────────────────────
|
||||
|
||||
OPENROUTER_API_KEY = os.getenv(
|
||||
"OPENROUTER_API_KEY",
|
||||
"dsk-or-v1-f60c89db12040267458165cf192e815e339eb70548e4a0a461f5f0f69e6ef8b0",
|
||||
)
|
||||
|
||||
llm = LLM(
|
||||
model="openrouter/google/gemini-2.0-flash-001",
|
||||
api_key=OPENROUTER_API_KEY,
|
||||
base_url="https://openrouter.ai/api/v1",
|
||||
)
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[2]
|
||||
|
||||
|
||||
def _slurp(relpath: str, max_lines: int = 150) -> str:
|
||||
p = REPO_ROOT / relpath
|
||||
if not p.exists():
|
||||
return f"[FILE NOT FOUND: {relpath}]"
|
||||
lines = p.read_text().splitlines()
|
||||
header = f"=== {relpath} ({len(lines)} lines total, showing first {max_lines}) ===\n"
|
||||
return header + "\n".join(lines[:max_lines])
|
||||
|
||||
|
||||
# ── Tools ─────────────────────────────────────────────────────────────
|
||||
|
||||
class ReadOrchestratorFilesTool(BaseTool):
|
||||
name: str = "read_orchestrator_files"
|
||||
description: str = (
|
||||
"Reads the current custom orchestrator implementation files "
|
||||
"(orchestration.py, tasks.py, timmy-orchestrator.sh, coordinator-first-protocol.md) "
|
||||
"and returns their contents for analysis."
|
||||
)
|
||||
|
||||
def _run(self) -> str:
|
||||
return "\n\n".join(
|
||||
[
|
||||
_slurp("orchestration.py"),
|
||||
_slurp("tasks.py", max_lines=120),
|
||||
_slurp("bin/timmy-orchestrator.sh", max_lines=120),
|
||||
_slurp("docs/coordinator-first-protocol.md", max_lines=120),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
class ReadIssueTool(BaseTool):
|
||||
name: str = "read_issue_358"
|
||||
description: str = "Returns the text of Gitea issue #358 that we are evaluating."
|
||||
|
||||
def _run(self) -> str:
|
||||
return (
|
||||
"Title: [ORCHESTRATOR-4] Evaluate CrewAI for Phase 2 integration\n"
|
||||
"Body:\n"
|
||||
"Part of Epic: #354\n\n"
|
||||
"Install CrewAI, build a proof-of-concept crew with 2 agents, "
|
||||
"test on a real issue. Evaluate: does it add value over our custom orchestrator? Document findings."
|
||||
)
|
||||
|
||||
|
||||
# ── Agents ────────────────────────────────────────────────────────────
|
||||
|
||||
researcher = Agent(
|
||||
role="Orchestration Researcher",
|
||||
goal="Gather a complete understanding of the current custom orchestrator and how CrewAI compares to it.",
|
||||
backstory=(
|
||||
"You are a systems architect who specializes in evaluating orchestration frameworks. "
|
||||
"You read code carefully, extract facts, and avoid speculation. "
|
||||
"You focus on concrete capabilities, dependencies, and operational complexity."
|
||||
),
|
||||
llm=llm,
|
||||
tools=[ReadOrchestratorFilesTool(), ReadIssueTool()],
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
evaluator = Agent(
|
||||
role="Integration Evaluator",
|
||||
goal="Synthesize research into a clear recommendation on whether CrewAI adds value for Phase 2.",
|
||||
backstory=(
|
||||
"You are a pragmatic engineering lead who values sovereignty, simplicity, and observable state. "
|
||||
"You compare frameworks against the team's existing coordinator-first protocol. "
|
||||
"You produce structured recommendations with explicit trade-offs."
|
||||
),
|
||||
llm=llm,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
# ── Tasks ─────────────────────────────────────────────────────────────
|
||||
|
||||
task_research = Task(
|
||||
description=(
|
||||
"Read the current custom orchestrator files and issue #358. "
|
||||
"Produce a structured research report covering:\n"
|
||||
"1. Current stack summary (Huey + tasks.py + timmy-orchestrator.sh)\n"
|
||||
"2. Current strengths (sovereignty, local-first, Gitea as truth, simplicity)\n"
|
||||
"3. Current gaps or limitations (if any)\n"
|
||||
"4. What CrewAI offers (agent roles, tasks, crews, tools, memory/RAG)\n"
|
||||
"5. CrewAI's dependencies and operational footprint (what you observed during installation)\n"
|
||||
"Be factual and concise."
|
||||
),
|
||||
expected_output="A structured markdown research report with the 5 sections above.",
|
||||
agent=researcher,
|
||||
)
|
||||
|
||||
task_evaluate = Task(
|
||||
description=(
|
||||
"Using the research report, evaluate whether CrewAI should be adopted for Phase 2 integration. "
|
||||
"Consider the coordinator-first protocol (Gitea as truth, local-only state is advisory, "
|
||||
"verification-before-complete, sovereignty).\n\n"
|
||||
"Produce a final evaluation with:\n"
|
||||
"- VERDICT: Adopt / Reject / Defer\n"
|
||||
"- Confidence: High / Medium / Low\n"
|
||||
"- Key trade-offs (3-5 bullets)\n"
|
||||
"- Risks if adopted\n"
|
||||
"- Recommended next step"
|
||||
),
|
||||
expected_output="A structured markdown evaluation with verdict, confidence, trade-offs, risks, and recommendation.",
|
||||
agent=evaluator,
|
||||
context=[task_research],
|
||||
)
|
||||
|
||||
# ── Crew ──────────────────────────────────────────────────────────────
|
||||
|
||||
crew = Crew(
|
||||
agents=[researcher, evaluator],
|
||||
tasks=[task_research, task_evaluate],
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("=" * 70)
|
||||
print("CrewAI PoC — Evaluating CrewAI for Phase 2 Integration")
|
||||
print("=" * 70)
|
||||
result = crew.kickoff()
|
||||
print("\n" + "=" * 70)
|
||||
print("FINAL OUTPUT")
|
||||
print("=" * 70)
|
||||
print(result.raw)
|
||||
@@ -1 +0,0 @@
|
||||
crewai>=1.13.0
|
||||
@@ -1,122 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
FLEET-012: Agent Lifecycle Manager
|
||||
Phase 5: Scale — spawn, train, deploy, retire agents automatically.
|
||||
|
||||
Manages the full lifecycle:
|
||||
1. PROVISION: Clone template, install deps, configure, test
|
||||
2. DEPLOY: Add to active rotation, start accepting issues
|
||||
3. MONITOR: Track performance, quality, heartbeat
|
||||
4. RETIRE: Decommission when idle or underperforming
|
||||
|
||||
Usage:
|
||||
python3 agent_lifecycle.py provision <name> <vps> [--model model]
|
||||
python3 agent_lifecycle.py deploy <name>
|
||||
python3 agent_lifecycle.py retire <name>
|
||||
python3 agent_lifecycle.py status
|
||||
python3 agent_lifecycle.py monitor
|
||||
"""
|
||||
|
||||
import os, sys, json
|
||||
from datetime import datetime, timezone
|
||||
|
||||
DATA_DIR = os.path.expanduser("~/.local/timmy/fleet-agents")
|
||||
DB_FILE = os.path.join(DATA_DIR, "agents.json")
|
||||
LOG_FILE = os.path.join(DATA_DIR, "lifecycle.log")
|
||||
|
||||
def ensure():
|
||||
os.makedirs(DATA_DIR, exist_ok=True)
|
||||
|
||||
def log(msg, level="INFO"):
|
||||
ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
|
||||
entry = f"[{ts}] [{level}] {msg}"
|
||||
with open(LOG_FILE, "a") as f: f.write(entry + "\n")
|
||||
print(f" {entry}")
|
||||
|
||||
def load():
|
||||
if os.path.exists(DB_FILE):
|
||||
return json.loads(open(DB_FILE).read())
|
||||
return {}
|
||||
|
||||
def save(db):
|
||||
open(DB_FILE, "w").write(json.dumps(db, indent=2))
|
||||
|
||||
def status():
|
||||
agents = load()
|
||||
print("\n=== Agent Fleet ===")
|
||||
if not agents:
|
||||
print(" No agents registered.")
|
||||
return
|
||||
for name, a in agents.items():
|
||||
state = a.get("state", "?")
|
||||
vps = a.get("vps", "?")
|
||||
model = a.get("model", "?")
|
||||
tasks = a.get("tasks_completed", 0)
|
||||
hb = a.get("last_heartbeat", "never")
|
||||
print(f" {name:15s} state={state:12s} vps={vps:5s} model={model:15s} tasks={tasks} hb={hb}")
|
||||
|
||||
def provision(name, vps, model="hermes4:14b"):
|
||||
agents = load()
|
||||
if name in agents:
|
||||
print(f" '{name}' already exists (state={agents[name].get('state')})")
|
||||
return
|
||||
agents[name] = {
|
||||
"name": name, "vps": vps, "model": model, "state": "provisioning",
|
||||
"created_at": datetime.now(timezone.utc).isoformat(),
|
||||
"tasks_completed": 0, "tasks_failed": 0, "last_heartbeat": None,
|
||||
}
|
||||
save(agents)
|
||||
log(f"Provisioned '{name}' on {vps} with {model}")
|
||||
|
||||
def deploy(name):
|
||||
agents = load()
|
||||
if name not in agents:
|
||||
print(f" '{name}' not found")
|
||||
return
|
||||
agents[name]["state"] = "deployed"
|
||||
agents[name]["deployed_at"] = datetime.now(timezone.utc).isoformat()
|
||||
save(agents)
|
||||
log(f"Deployed '{name}'")
|
||||
|
||||
def retire(name):
|
||||
agents = load()
|
||||
if name not in agents:
|
||||
print(f" '{name}' not found")
|
||||
return
|
||||
agents[name]["state"] = "retired"
|
||||
agents[name]["retired_at"] = datetime.now(timezone.utc).isoformat()
|
||||
save(agents)
|
||||
log(f"Retired '{name}'. Completed {agents[name].get('tasks_completed', 0)} tasks.")
|
||||
|
||||
def monitor():
|
||||
agents = load()
|
||||
now = datetime.now(timezone.utc)
|
||||
changes = 0
|
||||
for name, a in agents.items():
|
||||
if a.get("state") != "deployed": continue
|
||||
hb = a.get("last_heartbeat")
|
||||
if hb:
|
||||
try:
|
||||
hb_t = datetime.fromisoformat(hb)
|
||||
hours = (now - hb_t).total_seconds() / 3600
|
||||
if hours > 24 and a.get("state") == "deployed":
|
||||
a["state"] = "idle"
|
||||
a["idle_since"] = now.isoformat()
|
||||
log(f"'{name}' idle for {hours:.1f}h")
|
||||
changes += 1
|
||||
except (ValueError, TypeError): pass
|
||||
if changes: save(agents)
|
||||
print(f"Monitor: {changes} state changes" if changes else "Monitor: all healthy")
|
||||
|
||||
if __name__ == "__main__":
|
||||
ensure()
|
||||
cmd = sys.argv[1] if len(sys.argv) > 1 else "monitor"
|
||||
if cmd == "status": status()
|
||||
elif cmd == "provision" and len(sys.argv) >= 4:
|
||||
model = sys.argv[4] if len(sys.argv) >= 5 else "hermes4:14b"
|
||||
provision(sys.argv[2], sys.argv[3], model)
|
||||
elif cmd == "deploy" and len(sys.argv) >= 3: deploy(sys.argv[2])
|
||||
elif cmd == "retire" and len(sys.argv) >= 3: retire(sys.argv[2])
|
||||
elif cmd == "monitor": monitor()
|
||||
elif cmd == "run": monitor()
|
||||
else: print("Usage: agent_lifecycle.py [provision|deploy|retire|status|monitor]")
|
||||
@@ -1,272 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Auto-Restart Agent — Self-healing process monitor for fleet machines.
|
||||
|
||||
Detects dead services and restarts them automatically.
|
||||
Escalates after 3 attempts (prevents restart loops).
|
||||
Logs all actions to ~/.local/timmy/fleet-health/restarts.log
|
||||
Alerts via Telegram if service cannot be recovered.
|
||||
|
||||
Prerequisite: FLEET-006 (health check) must be running to detect failures.
|
||||
|
||||
Usage:
|
||||
python3 auto_restart.py # Run checks now
|
||||
python3 auto_restart.py --daemon # Run continuously (every 60s)
|
||||
python3 auto_restart.py --status # Show restart history
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import time
|
||||
import subprocess
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
# === CONFIG ===
|
||||
LOG_DIR = Path(os.path.expanduser("~/.local/timmy/fleet-health"))
|
||||
RESTART_LOG = LOG_DIR / "restarts.log"
|
||||
COOLDOWN_FILE = LOG_DIR / "restart_cooldowns.json"
|
||||
MAX_RETRIES = 3
|
||||
COOLDOWN_PERIOD = 3600 # 1 hour between escalation alerts
|
||||
|
||||
# Services definition: name, check command, restart command
|
||||
# Local services:
|
||||
LOCAL_SERVICES = {
|
||||
"hermes-gateway": {
|
||||
"check": "pgrep -f 'hermes gateway' > /dev/null 2>/dev/null",
|
||||
"restart": "cd ~/code-claw && ./restart-gateway.sh 2>/dev/null || launchctl kickstart -k ai.hermes.gateway 2>/dev/null",
|
||||
"critical": True,
|
||||
},
|
||||
"ollama": {
|
||||
"check": "pgrep -f 'ollama serve' > /dev/null 2>/dev/null",
|
||||
"restart": "launchctl kickstart -k com.ollama.ollama 2>/dev/null || /opt/homebrew/bin/brew services restart ollama 2>/dev/null",
|
||||
"critical": False,
|
||||
},
|
||||
"codeclaw-heartbeat": {
|
||||
"check": "launchctl list | grep 'ai.timmy.codeclaw-qwen-heartbeat' > /dev/null 2>/dev/null",
|
||||
"restart": "launchctl kickstart -k ai.timmy.codeclaw-qwen-heartbeat 2>/dev/null",
|
||||
"critical": False,
|
||||
},
|
||||
}
|
||||
|
||||
# VPS services to restart via SSH
|
||||
VPS_SERVICES = {
|
||||
"ezra": {
|
||||
"ip": "143.198.27.163",
|
||||
"user": "root",
|
||||
"services": {
|
||||
"gitea": {
|
||||
"check": "systemctl is-active gitea 2>/dev/null | grep -q active",
|
||||
"restart": "systemctl restart gitea 2>/dev/null",
|
||||
"critical": True,
|
||||
},
|
||||
"nginx": {
|
||||
"check": "systemctl is-active nginx 2>/dev/null | grep -q active",
|
||||
"restart": "systemctl restart nginx 2>/dev/null",
|
||||
"critical": False,
|
||||
},
|
||||
"hermes-agent": {
|
||||
"check": "pgrep -f 'hermes gateway' > /dev/null 2>/dev/null",
|
||||
"restart": "cd /root/wizards/ezra/hermes-agent && source .venv/bin/activate && nohup hermes gateway run --replace > /dev/null 2>&1 &",
|
||||
"critical": True,
|
||||
},
|
||||
},
|
||||
},
|
||||
"allegro": {
|
||||
"ip": "167.99.126.228",
|
||||
"user": "root",
|
||||
"services": {
|
||||
"hermes-agent": {
|
||||
"check": "pgrep -f 'hermes gateway' > /dev/null 2>/dev/null",
|
||||
"restart": "cd /root/wizards/allegro/hermes-agent && source .venv/bin/activate && nohup hermes gateway run --replace > /dev/null 2>&1 &",
|
||||
"critical": True,
|
||||
},
|
||||
},
|
||||
},
|
||||
"bezalel": {
|
||||
"ip": "159.203.146.185",
|
||||
"user": "root",
|
||||
"services": {
|
||||
"hermes-agent": {
|
||||
"check": "pgrep -f 'hermes gateway' > /dev/null 2>/dev/null",
|
||||
"restart": "cd /root/wizards/bezalel/hermes/venv/bin/activate && nohup hermes gateway run > /dev/null 2>&1 &",
|
||||
"critical": True,
|
||||
},
|
||||
"evennia": {
|
||||
"check": "pgrep -f 'evennia' > /dev/null 2>/dev/null",
|
||||
"restart": "cd /root/.evennia/timmy_world && evennia restart 2>/dev/null",
|
||||
"critical": False,
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
TELEGRAM_TOKEN_FILE = Path(os.path.expanduser("~/.config/telegram/special_bot"))
|
||||
TELEGRAM_CHAT = "-1003664764329"
|
||||
|
||||
|
||||
def send_telegram(message):
|
||||
if not TELEGRAM_TOKEN_FILE.exists():
|
||||
return False
|
||||
token = TELEGRAM_TOKEN_FILE.read_text().strip()
|
||||
url = f"https://api.telegram.org/bot{token}/sendMessage"
|
||||
body = json.dumps({
|
||||
"chat_id": TELEGRAM_CHAT,
|
||||
"text": f"[AUTO-RESTART]\n{message}",
|
||||
}).encode()
|
||||
try:
|
||||
import urllib.request
|
||||
req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")
|
||||
urllib.request.urlopen(req, timeout=10)
|
||||
return True
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
|
||||
def get_cooldowns():
|
||||
if COOLDOWN_FILE.exists():
|
||||
try:
|
||||
return json.loads(COOLDOWN_FILE.read_text())
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
return {}
|
||||
|
||||
|
||||
def save_cooldowns(data):
|
||||
COOLDOWN_FILE.write_text(json.dumps(data, indent=2))
|
||||
|
||||
|
||||
def check_service(check_cmd, timeout=10):
|
||||
try:
|
||||
proc = subprocess.run(check_cmd, shell=True, capture_output=True, timeout=timeout)
|
||||
return proc.returncode == 0
|
||||
except (subprocess.TimeoutExpired, subprocess.SubprocessError):
|
||||
return False
|
||||
|
||||
|
||||
def restart_service(restart_cmd, timeout=30):
|
||||
try:
|
||||
proc = subprocess.run(restart_cmd, shell=True, capture_output=True, timeout=timeout)
|
||||
return proc.returncode == 0
|
||||
except (subprocess.TimeoutExpired, subprocess.SubprocessError) as e:
|
||||
return False
|
||||
|
||||
|
||||
def try_restart_via_ssh(name, host_config, service_name):
|
||||
ip = host_config["ip"]
|
||||
user = host_config["user"]
|
||||
service = host_config["services"][service_name]
|
||||
|
||||
restart_cmd = f'ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 {user}@{ip} "{service["restart"]}"'
|
||||
return restart_service(restart_cmd, timeout=30)
|
||||
|
||||
|
||||
def log_restart(service_name, machine, attempt, success):
|
||||
ts = datetime.now(timezone.utc).isoformat()
|
||||
status = "SUCCESS" if success else "FAILED"
|
||||
log_entry = f"{ts} [{status}] {machine}/{service_name} (attempt {attempt})\n"
|
||||
|
||||
RESTART_LOG.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(RESTART_LOG, "a") as f:
|
||||
f.write(log_entry)
|
||||
|
||||
print(f" [{status}] {machine}/{service_name} - attempt {attempt}")
|
||||
|
||||
|
||||
def check_and_restart():
|
||||
"""Run all restart checks."""
|
||||
results = []
|
||||
cooldowns = get_cooldowns()
|
||||
now = time.time()
|
||||
|
||||
# Check local services
|
||||
for name, service in LOCAL_SERVICES.items():
|
||||
if not check_service(service["check"]):
|
||||
cooldown_key = f"local/{name}"
|
||||
retries = cooldowns.get(cooldown_key, {"count": 0, "last": 0}).get("count", 0)
|
||||
|
||||
if retries >= MAX_RETRIES:
|
||||
last = cooldowns.get(cooldown_key, {}).get("last", 0)
|
||||
if now - last < COOLDOWN_PERIOD and service["critical"]:
|
||||
send_telegram(f"CRITICAL: local/{name} failed {MAX_RETRIES} restart attempts. Needs human intervention.")
|
||||
cooldowns[cooldown_key] = {"count": 0, "last": now}
|
||||
save_cooldowns(cooldowns)
|
||||
continue
|
||||
|
||||
success = restart_service(service["restart"])
|
||||
log_restart(name, "local", retries + 1, success)
|
||||
|
||||
cooldowns[cooldown_key] = {"count": retries + 1 if not success else 0, "last": now}
|
||||
save_cooldowns(cooldowns)
|
||||
if success:
|
||||
# Verify it actually started
|
||||
time.sleep(3)
|
||||
if check_service(service["check"]):
|
||||
print(f" VERIFIED: local/{name} is running")
|
||||
else:
|
||||
print(f" WARNING: local/{name} restart command returned success but process not detected")
|
||||
|
||||
# Check VPS services
|
||||
for host, host_config in VPS_SERVICES.items():
|
||||
for service_name, service in host_config["services"].items():
|
||||
check_cmd = f'ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 {host_config["user"]}@{host_config["ip"]} "{service["check"]}"'
|
||||
if not check_service(check_cmd):
|
||||
cooldown_key = f"{host}/{service_name}"
|
||||
retries = cooldowns.get(cooldown_key, {"count": 0, "last": 0}).get("count", 0)
|
||||
|
||||
if retries >= MAX_RETRIES:
|
||||
last = cooldowns.get(cooldown_key, {}).get("last", 0)
|
||||
if now - last < COOLDOWN_PERIOD and service["critical"]:
|
||||
send_telegram(f"CRITICAL: {host}/{service_name} failed {MAX_RETRIES} restart attempts. Needs human intervention.")
|
||||
cooldowns[cooldown_key] = {"count": 0, "last": now}
|
||||
save_cooldowns(cooldowns)
|
||||
continue
|
||||
|
||||
success = try_restart_via_ssh(host, host_config, service_name)
|
||||
log_restart(service_name, host, retries + 1, success)
|
||||
|
||||
cooldowns[cooldown_key] = {"count": retries + 1 if not success else 0, "last": now}
|
||||
save_cooldowns(cooldowns)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def daemon_mode():
|
||||
"""Run continuously every 60 seconds."""
|
||||
print("Auto-restart agent running in daemon mode (60s interval)")
|
||||
print(f"Monitoring {len(LOCAL_SERVICES)} local + {sum(len(h['services']) for h in VPS_SERVICES.values())} remote services")
|
||||
print(f"Max retries per cycle: {MAX_RETRIES}")
|
||||
print(f"Cooldown after max retries: {COOLDOWN_PERIOD}s")
|
||||
while True:
|
||||
check_and_restart()
|
||||
time.sleep(60)
|
||||
|
||||
|
||||
def show_status():
|
||||
"""Show restart history and cooldowns."""
|
||||
cooldowns = get_cooldowns()
|
||||
print("=== Restart Cooldowns ===")
|
||||
for key, data in sorted(cooldowns.items()):
|
||||
count = data.get("count", 0)
|
||||
if count > 0:
|
||||
print(f" {key}: {count} failures, last at {datetime.fromtimestamp(data.get('last',0), tz=timezone.utc).strftime('%H:%M')}")
|
||||
|
||||
print("\n=== Restart Log (last 20) ===")
|
||||
if RESTART_LOG.exists():
|
||||
lines = RESTART_LOG.read_text().strip().split("\n")
|
||||
for line in lines[-20:]:
|
||||
print(f" {line}")
|
||||
else:
|
||||
print(" No restarts logged yet.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
LOG_DIR.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
if len(sys.argv) > 1 and sys.argv[1] == "--daemon":
|
||||
daemon_mode()
|
||||
elif len(sys.argv) > 1 and sys.argv[1] == "--status":
|
||||
show_status()
|
||||
else:
|
||||
check_and_restart()
|
||||
@@ -1,191 +0,0 @@
|
||||
# Capacity Inventory - Fleet Resource Baseline
|
||||
|
||||
**Last audited:** 2026-04-07 16:00 UTC
|
||||
**Auditor:** Timmy (direct inspection)
|
||||
|
||||
---
|
||||
|
||||
## Fleet Resources (Paperclips Model)
|
||||
|
||||
Three primary resources govern the fleet:
|
||||
|
||||
| Resource | Role | Generation | Consumption |
|
||||
|----------|------|-----------|-------------|
|
||||
| **Capacity** | Compute hours available across fleet. Determines what work can be done. | Through healthy utilization of VPS/Mac agents | Fleet improvements consume it (investing in automation, orchestration, sovereignty) |
|
||||
| **Uptime** | % time services are running. Earned at Fibonacci milestones. | When services stay up naturally | Degrades on any failure |
|
||||
| **Innovation** | Only generates when capacity is <70% utilized. Fuels Phase 3+. | When you leave capacity free | Phase 3+ buildings consume it (requires spare capacity to build) |
|
||||
|
||||
### The Tension
|
||||
- Run fleet at 95%+ capacity: maximum productivity, ZERO Innovation
|
||||
- Run fleet at <70% capacity: Innovation generates but slower progress
|
||||
- This forces the Paperclips question: optimize now or invest in future capability?
|
||||
|
||||
---
|
||||
|
||||
## VPS Resource Baselines
|
||||
|
||||
### Ezra (143.198.27.163) - "Forge"
|
||||
|
||||
| Metric | Value | Utilization |
|
||||
|--------|-------|-------------|
|
||||
| **OS** | Ubuntu 24.04 (6.8.0-106-generic) | |
|
||||
| **vCPU** | 4 vCPU (DO basic droplet, shared) | Load: 10.76/7.59/7.04 (very high) |
|
||||
| **RAM** | 7,941 MB total | 2,104 used / 5,836 available (26% used, 74% free) |
|
||||
| **Disk** | 154 GB vda1 | 111 GB used / 44 GB free (72%) **WARNING** |
|
||||
| **Swap** | 6,143 MB | 643 MB used (10%) |
|
||||
| **Uptime** | 7 days, 18 hours | |
|
||||
|
||||
### Key Processes (sorted by memory)
|
||||
| Process | RSS | %CPU | Notes |
|
||||
|---------|-----|------|-------|
|
||||
| Gitea | 556 MB | 83.5% | Web service, high CPU due to API load |
|
||||
| MemPalace (ezra) | 268 MB | 136% | Mining project files - HIGH CPU |
|
||||
| Hermes gateway (ezra) | 245 MB | 1.7% | Agent gateway |
|
||||
| Ollama | 230 MB | 0.1% | Model serving |
|
||||
| PostgreSQL | 138 MB | ~0% | Gitea database |
|
||||
|
||||
**Capacity assessment:** 26% memory used, but 72% disk is getting tight. CPU load is very high (10.76 on 4vCPU = 269% utilization). Ezra is CPU-bound, not RAM-bound.
|
||||
|
||||
### Allegro (167.99.126.228)
|
||||
|
||||
| Metric | Value | Utilization |
|
||||
|--------|-------|-------------|
|
||||
| **OS** | Ubuntu 24.04 (6.8.0-106-generic) | |
|
||||
| **vCPU** | 4 vCPU (DO basic droplet, shared) | Moderate load |
|
||||
| **RAM** | 7,941 MB total | 1,591 used / 6,349 available (20% used, 80% free) |
|
||||
| **Disk** | 154 GB vda1 | 41 GB used / 114 GB free (27%) **GOOD** |
|
||||
| **Swap** | 8,191 MB | 686 MB used (8%) |
|
||||
| **Uptime** | 7 days, 18 hours | |
|
||||
|
||||
### Key Processes (sorted by memory)
|
||||
| Process | RSS | %CPU | Notes |
|
||||
|---------|-----|------|-------|
|
||||
| Hermes gateway (allegro) | 680 MB | 0.9% | Main agent gateway |
|
||||
| Gitea | 181 MB | 1.2% | Secondary gitea? |
|
||||
| Systemd-journald | 160 MB | 0.0% | System logging |
|
||||
| Ezra Hermes gateway | 58 MB | 0.0% | Running ezra agent here |
|
||||
| Bezalel Hermes gateway | 58 MB | 0.0% | Running bezalel agent here |
|
||||
| Dockerd | 48 MB | 0.0% | Docker daemon |
|
||||
|
||||
**Capacity assessment:** 20% memory used, 27% disk used. Allegro has headroom. Also running hermes gateways for Ezra and Bezalel (cross-host agent execution).
|
||||
|
||||
### Bezalel (159.203.146.185)
|
||||
|
||||
| Metric | Value | Utilization |
|
||||
|--------|-------|-------------|
|
||||
| **OS** | Ubuntu 24.04 (6.8.0-71-generic) | |
|
||||
| **vCPU** | 2 vCPU (DO basic droplet, shared) | Load varies |
|
||||
| **RAM** | 1,968 MB total | 817 used / 1,151 available (42% used, 58% free) |
|
||||
| **Disk** | 48 GB vda1 | 12 GB used / 37 GB free (24%) **GOOD** |
|
||||
| **Swap** | 2,047 MB | 448 MB used (22%) |
|
||||
| **Uptime** | 7 days, 18 hours | |
|
||||
|
||||
### Key Processes (sorted by memory)
|
||||
| Process | RSS | %CPU | Notes |
|
||||
|---------|-----|------|-------|
|
||||
| Hermes gateway | 339 MB | 7.7% | Agent gateway (16.8% of RAM) |
|
||||
| uv pip install | 137 MB | 56.6% | Installing packages (temporary) |
|
||||
| Mender | 27 MB | 0.0% | Device management |
|
||||
|
||||
**Capacity assessment:** 42% memory used, only 2GB total RAM. Bezalel is the most constrained. 2 vCPU means less compute headroom than Ezra/Allegro. Disk is fine.
|
||||
|
||||
### Mac Local (M3 Max)
|
||||
|
||||
| Metric | Value | Utilization |
|
||||
|--------|-------|-------------|
|
||||
| **OS** | macOS 26.3.1 | |
|
||||
| **CPU** | Apple M3 Max (14 cores) | Very capable |
|
||||
| **RAM** | 36 GB | ~8 GB used (22%) |
|
||||
| **Disk** | 926 GB total | ~624 GB used / 302 GB free (68%) |
|
||||
|
||||
### Key Processes
|
||||
| Process | Memory | Notes |
|
||||
|---------|--------|-------|
|
||||
| Hermes gateway | 500 MB | Primary gateway |
|
||||
| Hermes agents (x3) | ~560 MB total | Multiple sessions |
|
||||
| Ollama | ~20 MB base + model memory | Model loading varies |
|
||||
| OpenClaw | 350 MB | Gateway process |
|
||||
| Evennia (server+portal) | 56 MB | Game world |
|
||||
|
||||
---
|
||||
|
||||
## Resource Summary
|
||||
|
||||
| Resource | Ezra | Allegro | Bezalel | Mac Local | TOTAL |
|
||||
|----------|------|---------|---------|-----------|-------|
|
||||
| **vCPU** | 4 | 4 | 2 | 14 (M3 Max) | 24 |
|
||||
| **RAM** | 8 GB (26% used) | 8 GB (20% used) | 2 GB (42% used) | 36 GB (22% used) | 54 GB |
|
||||
| **Disk** | 154 GB (72%) | 154 GB (27%) | 48 GB (24%) | 926 GB (68%) | 1,282 GB |
|
||||
| **Cost** | $12/mo | $12/mo | $12/mo | owned | $36/mo |
|
||||
|
||||
### Utilization by Category
|
||||
| Category | Estimated Daily Hours | % of Fleet Capacity |
|
||||
|----------|----------------------|---------------------|
|
||||
| Hermes agents | ~3-4 hrs active | 5-7% |
|
||||
| Ollama inference | ~1-2 hrs | 2-4% |
|
||||
| Gitea services | 24/7 | 5-10% |
|
||||
| Evennia | 24/7 | <1% |
|
||||
| Idle | ~18-20 hrs | ~80-90% |
|
||||
|
||||
### Capacity Utilization: ~15-20% active
|
||||
**Innovation rate:** GENERATING (capacity < 70%)
|
||||
**Recommendation:** Good — Innovation is generating because most capacity is free.
|
||||
This means Phase 3+ capabilities (orchestration, load balancing, etc.) are accessible NOW.
|
||||
|
||||
---
|
||||
|
||||
## Uptime Baseline
|
||||
|
||||
**Baseline period:** 2026-04-07 14:00-16:00 UTC (2 hours, ~24 checks at 5-min intervals)
|
||||
|
||||
| Service | Checks | Uptime | Status |
|
||||
|---------|--------|--------|--------|
|
||||
| Ezra | 24/24 | 100.0% | GOOD |
|
||||
| Allegro | 24/24 | 100.0% | GOOD |
|
||||
| Bezalel | 24/24 | 100.0% | GOOD |
|
||||
| Gitea | 23/24 | 95.8% | GOOD |
|
||||
| Hermes Gateway | 23/24 | 95.8% | GOOD |
|
||||
| Ollama | 24/24 | 100.0% | GOOD |
|
||||
| OpenClaw | 24/24 | 100.0% | GOOD |
|
||||
| Evennia | 24/24 | 100.0% | GOOD |
|
||||
| Hermes Agent | 21/24 | 87.5% | **CHECK** |
|
||||
|
||||
### Fibonacci Uptime Milestones
|
||||
| Milestone | Target | Current | Status |
|
||||
|-----------|--------|---------|--------|
|
||||
| 95% | 95% | 100% (VPS), 98.6% (avg) | REACHED |
|
||||
| 95.5% | 95.5% | 98.6% | REACHED |
|
||||
| 96% | 96% | 98.6% | REACHED |
|
||||
| 97% | 97% | 98.6% | REACHED |
|
||||
| 98% | 98% | 98.6% | REACHED |
|
||||
| 99% | 99% | 98.6% | APPROACHING |
|
||||
|
||||
---
|
||||
|
||||
## Risk Assessment
|
||||
|
||||
| Risk | Severity | Mitigation |
|
||||
|------|----------|------------|
|
||||
| Ezra disk 72% used | MEDIUM | Move non-essential data, add monitoring alert at 85% |
|
||||
| Bezalel only 2GB RAM | HIGH | Cannot run large models locally. Good for Evennia, tight for agents |
|
||||
| Ezra CPU load 269% | HIGH | MemPalace mining consuming 136% CPU. Consider scheduling |
|
||||
| Mac disk 68% used | MEDIUM | 302 GB free still. Growing but not urgent |
|
||||
| No cross-VPS mesh | LOW | SSH works but no Tailscale. No private network between VPSes |
|
||||
|
||||
---
|
||||
|
||||
## Recommendations
|
||||
|
||||
### Immediate (Phase 1-2)
|
||||
1. **Ezra disk cleanup:** 44 GB free at 72%. Docker images, old logs, and MemPalace mine data could be rotated.
|
||||
2. **Alert thresholds:** Add disk alerts at 85% (Ezra, Mac) before they become critical.
|
||||
|
||||
### Short-term (Phase 3)
|
||||
3. **Load balancing:** Ezra is CPU-bound, Allegro has 80% RAM free. Move some agent processes from Ezra to Allegro.
|
||||
4. **Innovation investment:** Since fleet is at 15-20% utilization, Innovation is high. This is the time to build Phase 3 capabilities.
|
||||
|
||||
### Medium-term (Phase 4)
|
||||
5. **Bezalel RAM upgrade:** 2GB is tight. Consider upgrade to 4GB ($24/mo instead of $12/mo).
|
||||
6. **Tailscale mesh:** Install on all VPSes for private inter-VPS network.
|
||||
|
||||
---
|
||||
@@ -1,122 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
FLEET-010: Cross-Agent Task Delegation Protocol
|
||||
Phase 3: Orchestration. Agents create issues, assign to other agents, review PRs.
|
||||
|
||||
Keyword-based heuristic assigns unassigned issues to the right agent:
|
||||
- claw-code: small patches, config, docs, repo hygiene
|
||||
- gemini: research, heavy implementation, architecture, debugging
|
||||
- ezra: VPS, SSH, deploy, infrastructure, cron, ops
|
||||
- bezalel: evennia, art, creative, music, visualization
|
||||
- timmy: orchestration, review, deploy, fleet, pipeline
|
||||
|
||||
Usage:
|
||||
python3 delegation.py run # Full cycle: scan, assign, report
|
||||
python3 delegation.py status # Show current delegation state
|
||||
python3 delegation.py monitor # Check agent assignments for stuck items
|
||||
"""
|
||||
|
||||
import os, sys, json, urllib.request
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
GITEA_BASE = "https://forge.alexanderwhitestone.com/api/v1"
|
||||
TOKEN = Path(os.path.expanduser("~/.config/gitea/token")).read_text().strip()
|
||||
DATA_DIR = Path(os.path.expanduser("~/.local/timmy/fleet-resources"))
|
||||
LOG_FILE = DATA_DIR / "delegation.log"
|
||||
HEADERS = {"Authorization": f"token {TOKEN}"}
|
||||
|
||||
AGENTS = {
|
||||
"claw-code": {"caps": ["patch","config","gitignore","cleanup","format","readme","typo"], "active": True},
|
||||
"gemini": {"caps": ["research","investigate","benchmark","survey","evaluate","architecture","implementation"], "active": True},
|
||||
"ezra": {"caps": ["vps","ssh","deploy","cron","resurrect","provision","infra","server"], "active": True},
|
||||
"bezalel": {"caps": ["evennia","art","creative","music","visual","design","animation"], "active": True},
|
||||
"timmy": {"caps": ["orchestrate","review","pipeline","fleet","monitor","health","deploy","ci"], "active": True},
|
||||
}
|
||||
|
||||
MONITORED = [
|
||||
"Timmy_Foundation/timmy-home",
|
||||
"Timmy_Foundation/timmy-config",
|
||||
"Timmy_Foundation/the-nexus",
|
||||
"Timmy_Foundation/hermes-agent",
|
||||
]
|
||||
|
||||
def api(path, method="GET", data=None):
|
||||
url = f"{GITEA_BASE}{path}"
|
||||
body = json.dumps(data).encode() if data else None
|
||||
hdrs = dict(HEADERS)
|
||||
if data: hdrs["Content-Type"] = "application/json"
|
||||
req = urllib.request.Request(url, data=body, headers=hdrs, method=method)
|
||||
try:
|
||||
resp = urllib.request.urlopen(req, timeout=15)
|
||||
raw = resp.read().decode()
|
||||
return json.loads(raw) if raw.strip() else {}
|
||||
except urllib.error.HTTPError as e:
|
||||
body = e.read().decode()
|
||||
print(f" API {e.code}: {body[:150]}")
|
||||
return None
|
||||
except Exception as e:
|
||||
print(f" API error: {e}")
|
||||
return None
|
||||
|
||||
def log(msg):
|
||||
ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
|
||||
DATA_DIR.mkdir(parents=True, exist_ok=True)
|
||||
with open(LOG_FILE, "a") as f: f.write(f"[{ts}] {msg}\n")
|
||||
|
||||
def suggest_agent(title, body):
|
||||
text = (title + " " + body).lower()
|
||||
for agent, info in AGENTS.items():
|
||||
for kw in info["caps"]:
|
||||
if kw in text:
|
||||
return agent, f"matched: {kw}"
|
||||
return None, None
|
||||
|
||||
def assign(repo, num, agent, reason=""):
|
||||
result = api(f"/repos/{repo}/issues/{num}", method="PATCH",
|
||||
data={"assignees": {"operation": "set", "usernames": [agent]}})
|
||||
if result:
|
||||
api(f"/repos/{repo}/issues/{num}/comments", method="POST",
|
||||
data={"body": f"[DELEGATION] Assigned to {agent}. {reason}"})
|
||||
log(f"Assigned {repo}#{num} to {agent}: {reason}")
|
||||
return result
|
||||
|
||||
def run_cycle():
|
||||
log("--- Delegation cycle start ---")
|
||||
count = 0
|
||||
for repo in MONITORED:
|
||||
issues = api(f"/repos/{repo}/issues?state=open&limit=50")
|
||||
if not issues: continue
|
||||
for i in issues:
|
||||
if i.get("assignees"): continue
|
||||
title = i.get("title", "")
|
||||
body = i.get("body", "")
|
||||
if any(w in title.lower() for w in ["epic", "discussion"]): continue
|
||||
agent, reason = suggest_agent(title, body)
|
||||
if agent and AGENTS.get(agent, {}).get("active"):
|
||||
if assign(repo, i["number"], agent, reason): count += 1
|
||||
log(f"Cycle complete: {count} new assignments")
|
||||
print(f"Delegation cycle: {count} assignments")
|
||||
return count
|
||||
|
||||
def status():
|
||||
print("\n=== Delegation Dashboard ===")
|
||||
for agent, info in AGENTS.items():
|
||||
count = 0
|
||||
for repo in MONITORED:
|
||||
issues = api(f"/repos/{repo}/issues?state=open&limit=50")
|
||||
if issues:
|
||||
for i in issues:
|
||||
for a in (i.get("assignees") or []):
|
||||
if a.get("login") == agent: count += 1
|
||||
icon = "ON" if info["active"] else "OFF"
|
||||
print(f" {agent:12s}: {count:>3} issues [{icon}]")
|
||||
|
||||
if __name__ == "__main__":
|
||||
cmd = sys.argv[1] if len(sys.argv) > 1 else "run"
|
||||
DATA_DIR.mkdir(parents=True, exist_ok=True)
|
||||
if cmd == "status": status()
|
||||
elif cmd == "run":
|
||||
run_cycle()
|
||||
status()
|
||||
else: status()
|
||||
@@ -1,299 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Fleet Health Check -- The Timmy Foundation
|
||||
Runs every 5 minutes via cron. Checks all machines, logs results,
|
||||
alerts via Telegram if something is down.
|
||||
|
||||
Produces:
|
||||
- ~/.local/timmy/fleet-health/YYYY-MM-DD.log (per-day log)
|
||||
- ~/.local/timmy/fleet-health/uptime.json (running uptime stats)
|
||||
- Telegram alert if any check fails
|
||||
|
||||
Usage:
|
||||
- python3 fleet_health.py # Run checks now
|
||||
- python3 fleet_health.py --init # Initialize log directory
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
import time
|
||||
import socket
|
||||
import subprocess
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
# === CONFIG ===
|
||||
HOSTS = {
|
||||
"ezra": {
|
||||
"ip": "143.198.27.163",
|
||||
"ssh_user": "root",
|
||||
"checks": ["ssh", "gitea"],
|
||||
"services": {
|
||||
"nginx": "systemctl is-active nginx",
|
||||
"gitea": "systemctl is-active gitea",
|
||||
"docker": "systemctl is-active docker",
|
||||
},
|
||||
},
|
||||
"allegro": {
|
||||
"ip": "167.99.126.228",
|
||||
"ssh_user": "root",
|
||||
"checks": ["ssh", "processes"],
|
||||
"services": {
|
||||
"hermes-agent": "pgrep -f hermes > /dev/null && echo active || echo inactive",
|
||||
},
|
||||
},
|
||||
"bezalel": {
|
||||
"ip": "159.203.146.185",
|
||||
"ssh_user": "root",
|
||||
"checks": ["ssh", "evennia"],
|
||||
"services": {
|
||||
"hermes-agent": "pgrep -f hermes > /dev/null 2>/dev/null && echo active || echo inactive",
|
||||
"evennia": "pgrep -f evennia > /dev/null 2>/dev/null && echo active || echo inactive",
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
LOCAL_CHECKS = {
|
||||
"hermes-gateway": "pgrep -f 'hermes gateway' > /dev/null 2>/dev/null",
|
||||
"hermes-agent": "pgrep -f 'hermes agent\\|hermes session' > /dev/null 2>/dev/null",
|
||||
"ollama": "pgrep -f 'ollama serve' > /dev/null 2>/dev/null",
|
||||
"openclaw": "pgrep -f 'openclaw' > /dev/null 2>/dev/null",
|
||||
"evennia": "pgrep -f 'evennia' > /dev/null 2>/dev/null",
|
||||
}
|
||||
|
||||
LOG_DIR = Path(os.path.expanduser("~/.local/timmy/fleet-health"))
|
||||
UPTIME_FILE = LOG_DIR / "uptime.json"
|
||||
TELEGRAM_TOKEN_FILE = Path(os.path.expanduser("~/.config/telegram/special_bot"))
|
||||
TELEGRAM_CHAT = "-1003664764329"
|
||||
LAST_ALERT_FILE = LOG_DIR / "last_alert.json"
|
||||
ALERT_COOLDOWN = 3600 # 1 hour between identical alerts
|
||||
|
||||
|
||||
def setup():
|
||||
LOG_DIR.mkdir(parents=True, exist_ok=True)
|
||||
if not UPTIME_FILE.exists():
|
||||
UPTIME_FILE.write_text(json.dumps({}))
|
||||
if not LAST_ALERT_FILE.exists():
|
||||
LAST_ALERT_FILE.write_text(json.dumps({}))
|
||||
|
||||
|
||||
def check_ssh(host, ip, user="root", timeout=5):
|
||||
try:
|
||||
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
|
||||
sock.settimeout(timeout)
|
||||
result = sock.connect_ex((ip, 22))
|
||||
sock.close()
|
||||
return result == 0, f"SSH port 22 {'open' if result == 0 else 'closed'}"
|
||||
except Exception as e:
|
||||
return False, f"SSH check failed: {e}"
|
||||
|
||||
|
||||
def check_remote_services(host_config, timeout=15):
|
||||
ip = host_config["ip"]
|
||||
user = host_config["ssh_user"]
|
||||
results = {}
|
||||
try:
|
||||
cmds = []
|
||||
for name, cmd in host_config["services"].items():
|
||||
cmds.append(f"echo '{name}: $({cmd})'")
|
||||
full_cmd = "; ".join(cmds)
|
||||
ssh_cmd = f"ssh -o StrictHostKeyChecking=no -o ConnectTimeout={timeout} {user}@{ip} \"{full_cmd}\""
|
||||
proc = subprocess.run(ssh_cmd, shell=True, capture_output=True, text=True, timeout=timeout + 5)
|
||||
if proc.returncode != 0:
|
||||
return {"error": f"SSH command failed: {proc.stderr.strip()[:200]}"}
|
||||
for line in proc.stdout.strip().split("\n"):
|
||||
if ":" in line:
|
||||
name, status = line.split(":", 1)
|
||||
results[name.strip()] = status.strip().lower()
|
||||
except subprocess.TimeoutExpired:
|
||||
return {"error": f"SSH timeout after {timeout}s"}
|
||||
except Exception as e:
|
||||
return {"error": str(e)}
|
||||
return results
|
||||
|
||||
|
||||
def check_local_processes():
|
||||
results = {}
|
||||
for name, cmd in LOCAL_CHECKS.items():
|
||||
try:
|
||||
proc = subprocess.run(cmd, shell=True, capture_output=True, timeout=5)
|
||||
results[name] = "active" if proc.returncode == 0 else "inactive"
|
||||
except Exception as e:
|
||||
results[name] = f"error: {e}"
|
||||
return results
|
||||
|
||||
|
||||
def check_disk_usage(ip=None, user="root"):
|
||||
if ip:
|
||||
cmd = f"ssh -o StrictHostKeyChecking=no -o ConnectTimeout=10 {user}@{ip} 'df -h / | tail -1'"
|
||||
else:
|
||||
cmd = "df -h / | tail -1"
|
||||
try:
|
||||
proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=10)
|
||||
if proc.returncode == 0 and proc.stdout.strip():
|
||||
parts = proc.stdout.strip().split()
|
||||
if len(parts) >= 5:
|
||||
return {"total": parts[1], "used": parts[2], "available": parts[3], "percent": parts[4]}
|
||||
return {"error": f"parse failed: {proc.stdout.strip()[:100]}"}
|
||||
return {"error": proc.stderr.strip()[:100] if proc.stderr else "empty response"}
|
||||
except Exception as e:
|
||||
return {"error": str(e)}
|
||||
|
||||
|
||||
def check_gitea():
|
||||
import urllib.request
|
||||
try:
|
||||
req = urllib.request.Request("https://forge.alexanderwhitestone.com/api/v1/version")
|
||||
resp = urllib.request.urlopen(req, timeout=10)
|
||||
data = json.loads(resp.read())
|
||||
return True, f"Gitea responding: {json.dumps(data)[:100]}"
|
||||
except Exception as e:
|
||||
return False, f"Gitea check failed: {e}"
|
||||
|
||||
|
||||
def send_alert(message):
|
||||
if not TELEGRAM_TOKEN_FILE.exists():
|
||||
print(f" [ALERT - NO TELEGRAM TOKEN] {message}")
|
||||
return
|
||||
token = TELEGRAM_TOKEN_FILE.read_text().strip()
|
||||
url = f"https://api.telegram.org/bot{token}/sendMessage"
|
||||
body = json.dumps({
|
||||
"chat_id": TELEGRAM_CHAT,
|
||||
"text": f"[FLEET ALERT]\n{message}",
|
||||
"parse_mode": "Markdown",
|
||||
}).encode()
|
||||
try:
|
||||
import urllib.request
|
||||
req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")
|
||||
resp = urllib.request.urlopen(req, timeout=10)
|
||||
print(f" [ALERT SENT] {message}")
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f" [ALERT FAILED] {message}: {e}")
|
||||
return False
|
||||
|
||||
|
||||
def check_alert_cooldown(alert_key):
|
||||
if LAST_ALERT_FILE.exists():
|
||||
try:
|
||||
cooldowns = json.loads(LAST_ALERT_FILE.read_text())
|
||||
last = cooldowns.get(alert_key, 0)
|
||||
if time.time() - last < ALERT_COOLDOWN:
|
||||
return False
|
||||
except (json.JSONDecodeError, KeyError):
|
||||
pass
|
||||
return True
|
||||
|
||||
|
||||
def record_alert(alert_key):
|
||||
cooldowns = {}
|
||||
if LAST_ALERT_FILE.exists():
|
||||
try:
|
||||
cooldowns = json.loads(LAST_ALERT_FILE.read_text())
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
cooldowns[alert_key] = time.time()
|
||||
LAST_ALERT_FILE.write_text(json.dumps(cooldowns))
|
||||
|
||||
|
||||
def run_checks():
|
||||
now = datetime.now(timezone.utc)
|
||||
ts = now.strftime("%Y-%m-%d %H:%M:%S UTC")
|
||||
day_file = LOG_DIR / f"{now.strftime('%Y-%m-%d')}.log"
|
||||
|
||||
results = {
|
||||
"timestamp": ts,
|
||||
"host": socket.gethostname(),
|
||||
"vps": {},
|
||||
"local": {},
|
||||
"alerts": [],
|
||||
}
|
||||
|
||||
# Check Gitea
|
||||
gitea_ok, gitea_msg = check_gitea()
|
||||
if not gitea_ok:
|
||||
results["gitea"] = {"status": "DOWN", "message": gitea_msg}
|
||||
results["alerts"].append(f"Gitea DOWN: {gitea_msg}")
|
||||
else:
|
||||
results["gitea"] = {"status": "UP", "message": gitea_msg[:100]}
|
||||
|
||||
# Check each VPS
|
||||
for name, config in HOSTS.items():
|
||||
vps_result = {"timestamp": ts}
|
||||
ssh_ok, ssh_msg = check_ssh(name, config["ip"])
|
||||
vps_result["ssh"] = {"ok": ssh_ok, "message": ssh_msg}
|
||||
if not ssh_ok:
|
||||
results["alerts"].append(f"{name.upper()} ({config['ip']}) SSH DOWN: {ssh_msg}")
|
||||
vps_result["disk"] = check_disk_usage(config["ip"], config["ssh_user"])
|
||||
if ssh_ok:
|
||||
vps_result["services"] = check_remote_services(config)
|
||||
results["vps"][name] = vps_result
|
||||
|
||||
# Check local processes
|
||||
results["local"]["processes"] = check_local_processes()
|
||||
results["local"]["disk"] = check_disk_usage()
|
||||
|
||||
# Log results
|
||||
day_file.parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(day_file, "a") as f:
|
||||
f.write(f"\n--- {ts} ---\n")
|
||||
for name, vps in results["vps"].items():
|
||||
status = "UP" if vps["ssh"]["ok"] else "DOWN"
|
||||
f.write(f" {name}: {status}\n")
|
||||
if "services" in vps:
|
||||
for svc, svc_status in vps["services"].items():
|
||||
f.write(f" {svc}: {svc_status}\n")
|
||||
for proc, status in results["local"]["processes"].items():
|
||||
f.write(f" local/{proc}: {status}\n")
|
||||
|
||||
# Update uptime stats
|
||||
uptime = {}
|
||||
if UPTIME_FILE.exists():
|
||||
try:
|
||||
uptime = json.loads(UPTIME_FILE.read_text())
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
if "checks" not in uptime:
|
||||
uptime["checks"] = []
|
||||
uptime["checks"].append({
|
||||
"ts": ts,
|
||||
"vps": {name: vps["ssh"]["ok"] for name, vps in results["vps"].items()},
|
||||
"gitea": results.get("gitea", {}).get("status") == "UP",
|
||||
"local": {k: v == "active" for k, v in results["local"]["processes"].items()}
|
||||
})
|
||||
if len(uptime["checks"]) > 1000:
|
||||
uptime["checks"] = uptime["checks"][-1000:]
|
||||
UPTIME_FILE.write_text(json.dumps(uptime, indent=2))
|
||||
|
||||
# Send alerts
|
||||
for alert in results["alerts"]:
|
||||
alert_key = alert[:80]
|
||||
if check_alert_cooldown(alert_key):
|
||||
send_alert(alert)
|
||||
record_alert(alert_key)
|
||||
|
||||
# Summary
|
||||
up_vps = sum(1 for v in results["vps"].values() if v["ssh"]["ok"])
|
||||
total_vps = len(results["vps"])
|
||||
up_local = sum(1 for v in results["local"]["processes"].values() if v == "active")
|
||||
total_local = len(results["local"]["processes"])
|
||||
alert_count = len(results["alerts"])
|
||||
|
||||
print(f"\n=== Fleet Health Check {ts} ===")
|
||||
print(f" VPS: {up_vps}/{total_vps} online")
|
||||
print(f" Local: {up_local}/{total_local} active")
|
||||
print(f" Gitea: {'UP' if results.get('gitea', {}).get('status') == 'UP' else 'DOWN'}")
|
||||
if alert_count > 0:
|
||||
print(f" ALERTS: {alert_count}")
|
||||
for a in results["alerts"]:
|
||||
print(f" - {a}")
|
||||
else:
|
||||
print(f" All clear.")
|
||||
|
||||
return results
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
setup()
|
||||
run_checks()
|
||||
@@ -1,142 +0,0 @@
|
||||
# Fleet Milestone Messages
|
||||
|
||||
Every milestone marks passage through fleet evolution. When achieved, the message
|
||||
prints to the fleet log. Each one references a real achievement, not abstract numbers.
|
||||
|
||||
**Source:** Inspired by Paperclips milestone messages (500 clips, 1000 clips, Full autonomy attained, etc.)
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Survival (Current)
|
||||
|
||||
### M1: First Automated Health Check
|
||||
**Trigger:** `fleet/health_check.py` runs successfully for the first time.
|
||||
**Message:** "First automated health check runs. No longer watching the clock."
|
||||
|
||||
### M2: First Auto-Restart
|
||||
**Trigger:** A dead process is detected and restarted without human intervention.
|
||||
**Message:** "A process failed at 3am and restarted itself. You found out in the morning."
|
||||
|
||||
### M3: First Backup Completed
|
||||
**Trigger:** A backup pipeline runs end-to-end and verifies integrity.
|
||||
**Message:** "A backup completed. You did not have to think about it."
|
||||
|
||||
### M4: 95% Uptime (30 days)
|
||||
**Trigger:** Uptime >= 95% over last 30 days.
|
||||
**Message:** "95% uptime over 30 days. The fleet stays up."
|
||||
|
||||
### M5: Uptime 97%
|
||||
**Trigger:** Uptime >= 97% over last 30 days.
|
||||
**Message:** "97% uptime. Three nines of availability across four machines."
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Automation (unlock when: uptime >= 95% + capacity > 60%)
|
||||
|
||||
### M6: Zero Manual Restarts (7 days)
|
||||
**Trigger:** 7 consecutive days with zero manual process restarts.
|
||||
**Message:** "Seven days. Zero manual restarts. The fleet heals itself."
|
||||
|
||||
### M7: PR Auto-Merged
|
||||
**Trigger:** A PR passes CI, review, and merges without human touching it.
|
||||
**Message:** "A PR was tested, reviewed, and merged by agents. You just said 'looks good.'"
|
||||
|
||||
### M8: Config Push Works
|
||||
**Trigger:** Config change pushed to all 3 VPSes atomically and verified.
|
||||
**Message:** "Config pushed to all three VPSes in one command. No SSH needed."
|
||||
|
||||
### M9: 98% Uptime
|
||||
**Trigger:** Uptime >= 98% over last 30 days.
|
||||
**Message:** "98% uptime. Only 14 hours of downtime in a month. Most of it planned."
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Orchestration (unlock when: all Phase 2 buildings + Innovation > 100)
|
||||
|
||||
### M10: Cross-Agent Delegation Works
|
||||
**Trigger:** Agent A creates issue, assigns to Agent B, Agent B works and creates PR.
|
||||
**Message:** "Agent Alpha created a task, Agent Beta completed it. They did not ask permission."
|
||||
|
||||
### M11: First Model Running Locally on 2+ Machines
|
||||
**Trigger:** Ollama serving same model on Ezra and Allegro simultaneously.
|
||||
**Message:** "A model runs on two machines at once. No cloud. No rate limits."
|
||||
|
||||
### M12: Fleet-Wide Burn Mode
|
||||
**Trigger:** All agents coordinated on single epic, produced coordinated PRs.
|
||||
**Message:** "All agents working the same epic. The fleet moves as one."
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: Sovereignty (unlock when: zero cloud deps for core ops)
|
||||
|
||||
### M13: First Entirely Local Inference Day
|
||||
**Trigger:** 24 hours with zero API calls to external providers.
|
||||
**Message:** "A model ran locally for the first time. No cloud. No rate limits. No one can turn it off."
|
||||
|
||||
### M14: Sovereign Email
|
||||
**Trigger:** Stalwart email server sends and receives without Gmail relay.
|
||||
**Message:** "Email flows through our own server. No Google. No Microsoft. Ours."
|
||||
|
||||
### M15: Sovereign Messaging
|
||||
**Trigger:** Telegram bot runs without cloud relay dependency.
|
||||
**Message:** "Messages arrive through our own infrastructure. No corporate middleman."
|
||||
|
||||
---
|
||||
|
||||
## Phase 5: Scale (unlock when: sovereignty stable + Innovation > 500)
|
||||
|
||||
### M16: First Self-Spawned Agent
|
||||
**Trigger:** Agent lifecycle manager spawns a new agent instance due to load.
|
||||
**Message:** "A new agent appeared. You did not create it. The fleet built what it needed."
|
||||
|
||||
### M17: Agent Retired Gracefully
|
||||
**Trigger:** An agent instance retires after idle timeout and cleans up its state.
|
||||
**Message:** "An agent retired. It served its purpose. Nothing was lost."
|
||||
|
||||
### M18: Fleet Runs 24h Unattended
|
||||
**Trigger:** 24 hours with zero human intervention of any kind.
|
||||
**Message:** "A full day. No humans. No commands. The fleet runs itself."
|
||||
|
||||
---
|
||||
|
||||
## Phase 6: The Network (unlock when: 7 days zero human intervention)
|
||||
|
||||
### M19: Fleet Creates Its Own Improvement Task
|
||||
**Trigger:** Fleet analyzes itself and creates an issue on Gitea.
|
||||
**Message:** "The fleet found something to improve. It created the task itself."
|
||||
|
||||
### M20: First Outside Contribution
|
||||
**Trigger:** An external contributor's PR is reviewed and merged by fleet agents.
|
||||
**Message:** "Someone outside the fleet contributed. The fleet reviewed, tested, and merged. No human touched it."
|
||||
|
||||
### M21: The Beacon
|
||||
**Trigger:** Infrastructure serves someone in need through automated systems.
|
||||
**Message:** "Someone found the Beacon. In the dark, looking for help. The infrastructure served its purpose. It was built for this."
|
||||
|
||||
### M22: Permanent Light
|
||||
**Trigger:** 90 days of autonomous operation with continuous availability.
|
||||
**Message:** "Three months. The light never went out. Not for anyone."
|
||||
|
||||
---
|
||||
|
||||
## Fibonacci Uptime Milestones
|
||||
|
||||
These trigger regardless of phase, based purely on uptime percentage:
|
||||
|
||||
| Milestone | Uptime | Meaning |
|
||||
|-----------|--------|--------|
|
||||
| U1 | 95% | Basic reliability achieved |
|
||||
| U2 | 95.5% | Fewer than 16 hours/month downtime |
|
||||
| U3 | 96% | Fewer than 12 hours/month |
|
||||
| U4 | 97% | Fewer than 9 hours/month |
|
||||
| U5 | 97.5% | Fewer than 7 hours/month |
|
||||
| U6 | 98% | Fewer than 4.5 hours/month |
|
||||
| U7 | 98.3% | Fewer than 3 hours/month |
|
||||
| U8 | 98.6% | Less than 2.5 hours/month — approaching cloud tier |
|
||||
| U9 | 98.9% | Less than 1.5 hours/month |
|
||||
| U10 | 99% | Less than 1 hour/month — enterprise grade |
|
||||
| U11 | 99.5% | Less than 22 minutes/month |
|
||||
|
||||
---
|
||||
|
||||
*Every message is earned. None are given freely. Fleet evolution is not a checklist — it is a climb.*
|
||||
@@ -1,126 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
FLEET-011: Local Model Pipeline and Fallback Chain
|
||||
Phase 4: Sovereignty — all inference runs locally, no cloud dependency.
|
||||
|
||||
Checks Ollama endpoints, verifies model availability, tests fallback chain.
|
||||
Logs results. The chain runs: hermes4:14b -> qwen2.5:7b -> gemma3:1b -> gemma4 (latest)
|
||||
|
||||
Usage:
|
||||
python3 model_pipeline.py # Run full fallback test
|
||||
python3 model_pipeline.py status # Show current model status
|
||||
python3 model_pipeline.py list # List all local models
|
||||
python3 model_pipeline.py test # Generate test output from each model
|
||||
"""
|
||||
|
||||
import os, sys, json, urllib.request
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "localhost:11434")
|
||||
LOG_DIR = Path(os.path.expanduser("~/.local/timmy/fleet-health"))
|
||||
CHAIN_FILE = Path(os.path.expanduser("~/.local/timmy/fleet-resources/model-chain.json"))
|
||||
|
||||
DEFAULT_CHAIN = [
|
||||
{"model": "hermes4:14b", "role": "primary"},
|
||||
{"model": "qwen2.5:7b", "role": "fallback"},
|
||||
{"model": "phi3:3.8b", "role": "emergency"},
|
||||
{"model": "gemma3:1b", "role": "minimal"},
|
||||
]
|
||||
|
||||
|
||||
def log(msg):
|
||||
LOG_DIR.mkdir(parents=True, exist_ok=True)
|
||||
with open(LOG_DIR / "model-pipeline.log", "a") as f:
|
||||
f.write(f"[{datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')}] {msg}\n")
|
||||
|
||||
|
||||
def check_ollama():
|
||||
try:
|
||||
resp = urllib.request.urlopen(f"http://{OLLAMA_HOST}/api/tags", timeout=5)
|
||||
return json.loads(resp.read())
|
||||
except Exception as e:
|
||||
return {"error": str(e)}
|
||||
|
||||
|
||||
def list_models():
|
||||
data = check_ollama()
|
||||
if "error" in data:
|
||||
print(f" Ollama not reachable at {OLLAMA_HOST}: {data['error']}")
|
||||
return []
|
||||
models = data.get("models", [])
|
||||
for m in models:
|
||||
name = m.get("name", "?")
|
||||
size = m.get("size", 0) / (1024**3)
|
||||
print(f" {name:<25s} {size:.1f} GB")
|
||||
return [m["name"] for m in models]
|
||||
|
||||
|
||||
def test_model(model, prompt="Say 'beacon lit' and nothing else."):
|
||||
try:
|
||||
body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
|
||||
req = urllib.request.Request(f"http://{OLLAMA_HOST}/api/generate", data=body,
|
||||
headers={"Content-Type": "application/json"})
|
||||
resp = urllib.request.urlopen(req, timeout=60)
|
||||
result = json.loads(resp.read())
|
||||
return True, result.get("response", "").strip()
|
||||
except Exception as e:
|
||||
return False, str(e)[:100]
|
||||
|
||||
|
||||
def test_chain():
|
||||
chain_data = {}
|
||||
if CHAIN_FILE.exists():
|
||||
chain_data = json.loads(CHAIN_FILE.read_text())
|
||||
chain = chain_data.get("chain", DEFAULT_CHAIN)
|
||||
|
||||
available = list_models() or []
|
||||
print("\n=== Fallback Chain Test ===")
|
||||
first_good = None
|
||||
|
||||
for entry in chain:
|
||||
model = entry["model"]
|
||||
role = entry.get("role", "unknown")
|
||||
if model in available:
|
||||
ok, result = test_model(model)
|
||||
status = "OK" if ok else "FAIL"
|
||||
print(f" [{status}] {model:<25s} ({role}) — {result[:70]}")
|
||||
log(f"Fallback test {model}: {status} — {result[:100]}")
|
||||
if ok and first_good is None:
|
||||
first_good = model
|
||||
else:
|
||||
print(f" [MISS] {model:<25s} ({role}) — not installed")
|
||||
|
||||
if first_good:
|
||||
print(f"\n Primary serving: {first_good}")
|
||||
else:
|
||||
print(f"\n WARNING: No chain model responding. Fallback broken.")
|
||||
log("FALLBACK CHAIN BROKEN — no models responding")
|
||||
|
||||
|
||||
def status():
|
||||
data = check_ollama()
|
||||
if "error" in data:
|
||||
print(f" Ollama: DOWN — {data['error']}")
|
||||
else:
|
||||
models = data.get("models", [])
|
||||
print(f" Ollama: UP — {len(models)} models loaded")
|
||||
print("\n=== Local Models ===")
|
||||
list_models()
|
||||
print("\n=== Chain Configuration ===")
|
||||
if CHAIN_FILE.exists():
|
||||
chain = json.loads(CHAIN_FILE.read_text()).get("chain", DEFAULT_CHAIN)
|
||||
else:
|
||||
chain = DEFAULT_CHAIN
|
||||
for e in chain:
|
||||
print(f" {e['model']:<25s} {e.get('role','?')}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
cmd = sys.argv[1] if len(sys.argv) > 1 else "status"
|
||||
if cmd == "status": status()
|
||||
elif cmd == "list": list_models()
|
||||
elif cmd == "test": test_chain()
|
||||
else:
|
||||
status()
|
||||
test_chain()
|
||||
@@ -1,19 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
# muda-audit.sh — Fleet waste elimination audit
|
||||
# Part of Epic #345, Issue #350
|
||||
#
|
||||
# Measures the 7 wastes (Muda) across the Timmy Foundation fleet:
|
||||
# 1. Overproduction 2. Waiting 3. Transport
|
||||
# 4. Overprocessing 5. Inventory 6. Motion 7. Defects
|
||||
#
|
||||
# Posts report to Telegram and persists week-over-week metrics.
|
||||
# Should be invoked weekly (Sunday night) via cron.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
|
||||
# Ensure Python can find gitea_client.py in the repo root
|
||||
export PYTHONPATH="${SCRIPT_DIR}/..:${PYTHONPATH:-}"
|
||||
|
||||
exec python3 "${SCRIPT_DIR}/muda_audit.py" "$@"
|
||||
@@ -1,661 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Muda Audit — Fleet Waste Elimination
|
||||
Measures the 7 wastes across Timmy_Foundation repos and posts a weekly report.
|
||||
|
||||
Part of Epic: #345
|
||||
Issue: #350
|
||||
|
||||
Wastes:
|
||||
1. Overproduction — agent issues created vs closed
|
||||
2. Waiting — rate-limited API attempts from loop logs
|
||||
3. Transport — issues closed-and-redirected to other repos
|
||||
4. Overprocessing— PR diff size outliers (>500 lines for non-epics)
|
||||
5. Inventory — issues open >30 days with no activity
|
||||
6. Motion — git clone/rebase operations per issue from logs
|
||||
7. Defects — PRs closed without merge vs merged
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import urllib.request
|
||||
from datetime import datetime, timedelta, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
# Add repo root to path so we can import gitea_client
|
||||
_REPO_ROOT = Path(__file__).resolve().parent.parent
|
||||
sys.path.insert(0, str(_REPO_ROOT))
|
||||
|
||||
from gitea_client import GiteaClient, GiteaError # noqa: E402
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Config
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
ORG = "Timmy_Foundation"
|
||||
AGENT_LOGINS = {
|
||||
"allegro",
|
||||
"antigravity",
|
||||
"bezalel",
|
||||
"claude",
|
||||
"codex-agent",
|
||||
"ezra",
|
||||
"gemini",
|
||||
"google",
|
||||
"grok",
|
||||
"groq",
|
||||
"hermes",
|
||||
"kimi",
|
||||
"manus",
|
||||
"perplexity",
|
||||
}
|
||||
AGENT_LOGINS_HUMAN = {
|
||||
"claude": "Claude",
|
||||
"codex-agent": "Codex",
|
||||
"ezra": "Ezra",
|
||||
"gemini": "Gemini",
|
||||
"google": "Google",
|
||||
"grok": "Grok",
|
||||
"groq": "Groq",
|
||||
"hermes": "Hermes",
|
||||
"kimi": "Kimi",
|
||||
"manus": "Manus",
|
||||
"perplexity": "Perplexity",
|
||||
"allegro": "Allegro",
|
||||
"antigravity": "Antigravity",
|
||||
"bezalel": "Bezalel",
|
||||
}
|
||||
|
||||
TELEGRAM_CHAT = "-1003664764329"
|
||||
TELEGRAM_TOKEN_FILE = Path.home() / ".hermes" / "telegram_token"
|
||||
|
||||
METRICS_DIR = Path(os.path.expanduser("~/.local/timmy/muda-audit"))
|
||||
METRICS_FILE = METRICS_DIR / "metrics.json"
|
||||
|
||||
LOG_PATHS = [
|
||||
Path.home() / ".hermes" / "logs" / "claude-loop.log",
|
||||
Path.home() / ".hermes" / "logs" / "gemini-loop.log",
|
||||
Path.home() / ".hermes" / "logs" / "agent.log",
|
||||
Path.home() / ".hermes" / "logs" / "errors.log",
|
||||
Path.home() / ".hermes" / "logs" / "gateway.log",
|
||||
]
|
||||
|
||||
# Patterns that indicate an issue was redirected / transported
|
||||
TRANSPORT_PATTERNS = [
|
||||
re.compile(r"redirect", re.IGNORECASE),
|
||||
re.compile(r"moved to", re.IGNORECASE),
|
||||
re.compile(r"wrong repo", re.IGNORECASE),
|
||||
re.compile(r"belongs in", re.IGNORECASE),
|
||||
re.compile(r"should be in", re.IGNORECASE),
|
||||
re.compile(r"transported", re.IGNORECASE),
|
||||
re.compile(r"relocated", re.IGNORECASE),
|
||||
]
|
||||
|
||||
RATE_LIMIT_PATTERNS = [
|
||||
re.compile(r"rate.limit", re.IGNORECASE),
|
||||
re.compile(r"ratelimit", re.IGNORECASE),
|
||||
re.compile(r"429"),
|
||||
re.compile(r"too many requests", re.IGNORECASE),
|
||||
re.compile(r"rate limit exceeded", re.IGNORECASE),
|
||||
]
|
||||
|
||||
MOTION_PATTERNS = [
|
||||
re.compile(r"git clone", re.IGNORECASE),
|
||||
re.compile(r"git rebase", re.IGNORECASE),
|
||||
re.compile(r"rebasing", re.IGNORECASE),
|
||||
re.compile(r"cloning into", re.IGNORECASE),
|
||||
]
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def iso_now() -> str:
|
||||
return datetime.now(timezone.utc).isoformat()
|
||||
|
||||
|
||||
def parse_iso(dt_str: str) -> datetime:
|
||||
dt_str = dt_str.replace("Z", "+00:00")
|
||||
return datetime.fromisoformat(dt_str)
|
||||
|
||||
|
||||
def since_days_ago(days: int) -> datetime:
|
||||
return datetime.now(timezone.utc) - timedelta(days=days)
|
||||
|
||||
|
||||
def fmt_num(n: float) -> str:
|
||||
return f"{n:.1f}" if isinstance(n, float) else str(n)
|
||||
|
||||
|
||||
def send_telegram(message: str) -> bool:
|
||||
if not TELEGRAM_TOKEN_FILE.exists():
|
||||
print("[WARN] Telegram token not found; skipping notification.")
|
||||
return False
|
||||
token = TELEGRAM_TOKEN_FILE.read_text().strip()
|
||||
url = f"https://api.telegram.org/bot{token}/sendMessage"
|
||||
body = json.dumps(
|
||||
{
|
||||
"chat_id": TELEGRAM_CHAT,
|
||||
"text": message,
|
||||
"parse_mode": "Markdown",
|
||||
"disable_web_page_preview": True,
|
||||
}
|
||||
).encode()
|
||||
req = urllib.request.Request(
|
||||
url, data=body, headers={"Content-Type": "application/json"}, method="POST"
|
||||
)
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=15) as resp:
|
||||
resp.read()
|
||||
return True
|
||||
except Exception as e:
|
||||
print(f"[WARN] Telegram send failed: {e}")
|
||||
return False
|
||||
|
||||
|
||||
def load_previous_metrics() -> dict | None:
|
||||
if not METRICS_FILE.exists():
|
||||
return None
|
||||
try:
|
||||
history = json.loads(METRICS_FILE.read_text())
|
||||
if history and isinstance(history, list):
|
||||
return history[-1]
|
||||
except (json.JSONDecodeError, OSError):
|
||||
pass
|
||||
return None
|
||||
|
||||
|
||||
def save_metrics(record: dict) -> None:
|
||||
METRICS_DIR.mkdir(parents=True, exist_ok=True)
|
||||
history: list[dict] = []
|
||||
if METRICS_FILE.exists():
|
||||
try:
|
||||
history = json.loads(METRICS_FILE.read_text())
|
||||
if not isinstance(history, list):
|
||||
history = []
|
||||
except (json.JSONDecodeError, OSError):
|
||||
history = []
|
||||
history.append(record)
|
||||
history = history[-52:]
|
||||
METRICS_FILE.write_text(json.dumps(history, indent=2))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Gitea helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def paginate_all(func, *args, **kwargs) -> list[Any]:
|
||||
page = 1
|
||||
limit = kwargs.pop("limit", 50)
|
||||
results: list[Any] = []
|
||||
while True:
|
||||
batch = func(*args, limit=limit, page=page, **kwargs)
|
||||
if not batch:
|
||||
break
|
||||
results.extend(batch)
|
||||
if len(batch) < limit:
|
||||
break
|
||||
page += 1
|
||||
return results
|
||||
|
||||
|
||||
def list_org_repos(client: GiteaClient, org: str) -> list[str]:
|
||||
repos = paginate_all(client.list_org_repos, org, limit=50)
|
||||
return [r["name"] for r in repos if not r.get("archived", False)]
|
||||
|
||||
|
||||
def count_issues_created_by_agents(client: GiteaClient, repo: str, since: datetime) -> int:
|
||||
issues = paginate_all(client.list_issues, repo, state="all", sort="created", direction="desc", limit=50)
|
||||
count = 0
|
||||
for issue in issues:
|
||||
created = parse_iso(issue.created_at)
|
||||
if created < since:
|
||||
break
|
||||
if issue.user.login in AGENT_LOGINS:
|
||||
count += 1
|
||||
return count
|
||||
|
||||
|
||||
def count_issues_closed(client: GiteaClient, repo: str, since: datetime) -> int:
|
||||
issues = paginate_all(client.list_issues, repo, state="closed", sort="updated", direction="desc", limit=50)
|
||||
count = 0
|
||||
for issue in issues:
|
||||
updated = parse_iso(issue.updated_at)
|
||||
if updated < since:
|
||||
break
|
||||
count += 1
|
||||
return count
|
||||
|
||||
|
||||
def count_inventory_issues(client: GiteaClient, repo: str, stale_days: int = 30) -> int:
|
||||
cutoff = since_days_ago(stale_days)
|
||||
issues = paginate_all(client.list_issues, repo, state="open", sort="updated", direction="asc", limit=50)
|
||||
count = 0
|
||||
for issue in issues:
|
||||
updated = parse_iso(issue.updated_at)
|
||||
if updated < cutoff:
|
||||
count += 1
|
||||
else:
|
||||
break
|
||||
return count
|
||||
|
||||
|
||||
def count_transport_issues(client: GiteaClient, repo: str, since: datetime) -> int:
|
||||
issues = client.list_issues(repo, state="closed", sort="updated", direction="desc", limit=20)
|
||||
transport = 0
|
||||
for issue in issues:
|
||||
if parse_iso(issue.updated_at) < since:
|
||||
break
|
||||
try:
|
||||
comments = client.list_comments(repo, issue.number)
|
||||
except GiteaError:
|
||||
continue
|
||||
for comment in comments:
|
||||
body = comment.body or ""
|
||||
if any(p.search(body) for p in TRANSPORT_PATTERNS):
|
||||
transport += 1
|
||||
break
|
||||
return transport
|
||||
|
||||
|
||||
def get_pr_diff_size(client: GiteaClient, repo: str, pr_number: int) -> int:
|
||||
try:
|
||||
files = client.get_pull_files(repo, pr_number)
|
||||
return sum(f.additions + f.deletions for f in files)
|
||||
except GiteaError:
|
||||
return 0
|
||||
|
||||
|
||||
def measure_overprocessing(client: GiteaClient, repo: str, since: datetime) -> dict:
|
||||
pulls = paginate_all(client.list_pulls, repo, state="all", sort="newest", limit=30)
|
||||
sizes: list[int] = []
|
||||
outliers: list[tuple[int, str, int]] = []
|
||||
for pr in pulls:
|
||||
created = parse_iso(pr.created_at) if pr.created_at else since - timedelta(days=8)
|
||||
if created < since:
|
||||
break
|
||||
diff_size = get_pr_diff_size(client, repo, pr.number)
|
||||
sizes.append(diff_size)
|
||||
if diff_size > 500 and not any(w in pr.title.lower() for w in ("epic", "[epic]")):
|
||||
outliers.append((pr.number, pr.title, diff_size))
|
||||
avg = round(sum(sizes) / len(sizes), 1) if sizes else 0.0
|
||||
return {"avg_lines": avg, "outliers": outliers, "count": len(sizes)}
|
||||
|
||||
|
||||
def measure_defects(client: GiteaClient, repo: str, since: datetime) -> dict:
|
||||
pulls = paginate_all(client.list_pulls, repo, state="closed", sort="newest", limit=50)
|
||||
merged = 0
|
||||
closed_unmerged = 0
|
||||
for pr in pulls:
|
||||
created = parse_iso(pr.created_at) if pr.created_at else since - timedelta(days=8)
|
||||
if created < since:
|
||||
break
|
||||
if pr.merged:
|
||||
merged += 1
|
||||
else:
|
||||
closed_unmerged += 1
|
||||
return {"merged": merged, "closed_unmerged": closed_unmerged}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Log parsing
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def parse_logs_for_patterns(since: datetime, patterns: list[re.Pattern]) -> list[str]:
|
||||
matches: list[str] = []
|
||||
for log_path in LOG_PATHS:
|
||||
if not log_path.exists():
|
||||
continue
|
||||
try:
|
||||
with open(log_path, "r", errors="ignore") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
ts = None
|
||||
m = re.match(r"^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})", line)
|
||||
if m:
|
||||
try:
|
||||
ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
|
||||
except ValueError:
|
||||
pass
|
||||
if ts and ts < since:
|
||||
continue
|
||||
if any(p.search(line) for p in patterns):
|
||||
matches.append(line)
|
||||
except OSError:
|
||||
continue
|
||||
return matches
|
||||
|
||||
|
||||
def measure_waiting(since: datetime) -> dict:
|
||||
lines = parse_logs_for_patterns(since, RATE_LIMIT_PATTERNS)
|
||||
by_agent: dict[str, int] = {}
|
||||
total = len(lines)
|
||||
for line in lines:
|
||||
agent = "unknown"
|
||||
for name in AGENT_LOGINS_HUMAN.values():
|
||||
if name.lower() in line.lower():
|
||||
agent = name.lower()
|
||||
break
|
||||
if agent == "unknown":
|
||||
if "claude" in line.lower():
|
||||
agent = "claude"
|
||||
elif "gemini" in line.lower():
|
||||
agent = "gemini"
|
||||
elif "groq" in line.lower():
|
||||
agent = "groq"
|
||||
elif "kimi" in line.lower():
|
||||
agent = "kimi"
|
||||
by_agent[agent] = by_agent.get(agent, 0) + 1
|
||||
return {"total": total, "by_agent": by_agent}
|
||||
|
||||
|
||||
def measure_motion(since: datetime) -> dict:
|
||||
lines = parse_logs_for_patterns(since, MOTION_PATTERNS)
|
||||
by_issue: dict[str, int] = {}
|
||||
total = len(lines)
|
||||
issue_pattern = re.compile(r"issue[_\s-]?(\d+)", re.IGNORECASE)
|
||||
branch_pattern = re.compile(r"\b([a-z]+)/issue[_\s-]?(\d+)\b", re.IGNORECASE)
|
||||
for line in lines:
|
||||
issue_key = None
|
||||
m = branch_pattern.search(line)
|
||||
if m:
|
||||
issue_key = f"{m.group(1).lower()}/issue-{m.group(2)}"
|
||||
else:
|
||||
m = issue_pattern.search(line)
|
||||
if m:
|
||||
issue_key = f"issue-{m.group(1)}"
|
||||
if issue_key:
|
||||
by_issue[issue_key] = by_issue.get(issue_key, 0) + 1
|
||||
else:
|
||||
by_issue["unknown"] = by_issue.get("unknown", 0) + 1
|
||||
flagged = {k: v for k, v in by_issue.items() if v > 3 and k != "unknown"}
|
||||
return {"total": total, "by_issue": by_issue, "flagged": flagged}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Report builder
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def build_report(metrics: dict, prev: dict | None) -> str:
|
||||
lines: list[str] = []
|
||||
lines.append("*🗑️ MUDA AUDIT — Weekly Waste Report*")
|
||||
lines.append(f"Week ending {metrics['week_ending'][:10]}\n")
|
||||
|
||||
def trend_arrow(current: float, previous: float) -> str:
|
||||
if previous == 0:
|
||||
return ""
|
||||
if current < previous:
|
||||
return " ↓"
|
||||
if current > previous:
|
||||
return " ↑"
|
||||
return " →"
|
||||
|
||||
prev_w = prev or {}
|
||||
|
||||
op = metrics["overproduction"]
|
||||
op_prev = prev_w.get("overproduction", {})
|
||||
ratio = op["ratio"]
|
||||
ratio_prev = op_prev.get("ratio", 0.0)
|
||||
lines.append(
|
||||
f"*1. Overproduction:* {op['agent_created']} agent issues created / {op['closed']} closed"
|
||||
f" (ratio {fmt_num(ratio)}{trend_arrow(ratio, ratio_prev)})"
|
||||
)
|
||||
|
||||
w = metrics["waiting"]
|
||||
w_prev = prev_w.get("waiting", {})
|
||||
w_total_prev = w_prev.get("total", 0)
|
||||
lines.append(
|
||||
f"*2. Waiting:* {w['total']} rate-limit hits this week{trend_arrow(w['total'], w_total_prev)}"
|
||||
)
|
||||
if w["by_agent"]:
|
||||
top = sorted(w["by_agent"].items(), key=lambda x: x[1], reverse=True)[:3]
|
||||
lines.append(" Top offenders: " + ", ".join(f"{k}({v})" for k, v in top))
|
||||
|
||||
t = metrics["transport"]
|
||||
t_prev = prev_w.get("transport", {})
|
||||
t_total_prev = t_prev.get("total", 0)
|
||||
lines.append(
|
||||
f"*3. Transport:* {t['total']} issues closed-and-redirected{trend_arrow(t['total'], t_total_prev)}"
|
||||
)
|
||||
|
||||
ov = metrics["overprocessing"]
|
||||
ov_prev = prev_w.get("overprocessing", {})
|
||||
avg_prev = ov_prev.get("avg_lines", 0.0)
|
||||
lines.append(
|
||||
f"*4. Overprocessing:* Avg PR diff {fmt_num(ov['avg_lines'])} lines"
|
||||
f"{trend_arrow(ov['avg_lines'], avg_prev)}, {len(ov['outliers'])} outliers >500 lines"
|
||||
)
|
||||
|
||||
inv = metrics["inventory"]
|
||||
inv_prev = prev_w.get("inventory", {})
|
||||
inv_total_prev = inv_prev.get("total", 0)
|
||||
lines.append(
|
||||
f"*5. Inventory:* {inv['total']} stale issues open >30 days{trend_arrow(inv['total'], inv_total_prev)}"
|
||||
)
|
||||
|
||||
m = metrics["motion"]
|
||||
m_prev = prev_w.get("motion", {})
|
||||
m_total_prev = m_prev.get("total", 0)
|
||||
lines.append(
|
||||
f"*6. Motion:* {m['total']} git clone/rebase ops this week{trend_arrow(m['total'], m_total_prev)}"
|
||||
)
|
||||
if m["flagged"]:
|
||||
lines.append(f" Flagged: {len(m['flagged'])} issues with >3 ops")
|
||||
|
||||
d = metrics["defects"]
|
||||
d_prev = prev_w.get("defects", {})
|
||||
defect_rate = d["defect_rate"]
|
||||
defect_rate_prev = d_prev.get("defect_rate", 0.0)
|
||||
lines.append(
|
||||
f"*7. Defects:* {d['merged']} merged, {d['closed_unmerged']} abandoned"
|
||||
f" (defect rate {fmt_num(defect_rate)}%{trend_arrow(defect_rate, defect_rate_prev)})"
|
||||
)
|
||||
|
||||
lines.append("\n*🔥 Top 3 Elimination Suggestions:*")
|
||||
for i, suggestion in enumerate(metrics["eliminations"], 1):
|
||||
lines.append(f"{i}. {suggestion}")
|
||||
|
||||
lines.append("\n_Week over week: waste metrics should decrease. If an arrow points up, investigate._")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def compute_eliminations(metrics: dict) -> list[str]:
|
||||
suggestions: list[tuple[str, float]] = []
|
||||
|
||||
op = metrics["overproduction"]
|
||||
if op["ratio"] > 1.0:
|
||||
suggestions.append(
|
||||
(
|
||||
"Overproduction: Stop agent loops from creating issues faster than they close them."
|
||||
f" Cap new issue creation when open backlog >{op['closed'] * 2}.",
|
||||
op["ratio"],
|
||||
)
|
||||
)
|
||||
|
||||
w = metrics["waiting"]
|
||||
if w["total"] > 10:
|
||||
top = max(w["by_agent"].items(), key=lambda x: x[1])
|
||||
suggestions.append(
|
||||
(
|
||||
f"Waiting: {top[0]} is burning cycles on rate limits ({top[1]} hits)."
|
||||
" Add exponential backoff or reduce worker count.",
|
||||
w["total"],
|
||||
)
|
||||
)
|
||||
|
||||
t = metrics["transport"]
|
||||
if t["total"] > 0:
|
||||
suggestions.append(
|
||||
(
|
||||
"Transport: Issues are being filed in the wrong repos."
|
||||
" Add a repo-scoping gate before any agent creates an issue.",
|
||||
t["total"] * 2,
|
||||
)
|
||||
)
|
||||
|
||||
ov = metrics["overprocessing"]
|
||||
if ov["outliers"]:
|
||||
suggestions.append(
|
||||
(
|
||||
f"Overprocessing: {len(ov['outliers'])} PRs exceeded 500 lines for non-epics."
|
||||
" Enforce a 200-line soft limit unless the issue is tagged 'epic'.",
|
||||
len(ov["outliers"]) * 1.5,
|
||||
)
|
||||
)
|
||||
|
||||
inv = metrics["inventory"]
|
||||
if inv["total"] > 20:
|
||||
suggestions.append(
|
||||
(
|
||||
f"Inventory: {inv['total']} issues are dead stock (>30 days)."
|
||||
" Run a stale-issue sweep and auto-close or consolidate.",
|
||||
inv["total"],
|
||||
)
|
||||
)
|
||||
|
||||
m = metrics["motion"]
|
||||
if m["flagged"]:
|
||||
suggestions.append(
|
||||
(
|
||||
f"Motion: {len(m['flagged'])} issues required excessive clone/rebase ops."
|
||||
" Cache worktrees and reuse branches across retries.",
|
||||
len(m["flagged"]) * 1.5,
|
||||
)
|
||||
)
|
||||
|
||||
d = metrics["defects"]
|
||||
total_prs = d["merged"] + d["closed_unmerged"]
|
||||
if total_prs > 0 and d["defect_rate"] > 20:
|
||||
suggestions.append(
|
||||
(
|
||||
f"Defects: {d['defect_rate']:.0f}% of PRs were abandoned."
|
||||
" Require a pre-PR scoping check to prevent unmergeable work.",
|
||||
d["defect_rate"],
|
||||
)
|
||||
)
|
||||
|
||||
suggestions.sort(key=lambda x: x[1], reverse=True)
|
||||
return [s[0] for s in suggestions[:3]] if suggestions else [
|
||||
"No major waste detected this week. Maintain current guardrails.",
|
||||
"Continue monitoring agent loop logs for emerging rate-limit patterns.",
|
||||
"Keep PR diff sizes under review during weekly standup.",
|
||||
]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def run_audit() -> dict:
|
||||
client = GiteaClient()
|
||||
since = since_days_ago(7)
|
||||
week_ending = datetime.now(timezone.utc).date().isoformat()
|
||||
|
||||
print("[muda] Fetching repo list...")
|
||||
repo_names = list_org_repos(client, ORG)
|
||||
print(f"[muda] Scanning {len(repo_names)} repos")
|
||||
|
||||
agent_created = 0
|
||||
issues_closed = 0
|
||||
transport_total = 0
|
||||
inventory_total = 0
|
||||
all_overprocessing: list[dict] = []
|
||||
all_defects_merged = 0
|
||||
all_defects_closed = 0
|
||||
|
||||
for name in repo_names:
|
||||
repo = f"{ORG}/{name}"
|
||||
print(f"[muda] {repo}")
|
||||
try:
|
||||
agent_created += count_issues_created_by_agents(client, repo, since)
|
||||
issues_closed += count_issues_closed(client, repo, since)
|
||||
transport_total += count_transport_issues(client, repo, since)
|
||||
inventory_total += count_inventory_issues(client, repo, 30)
|
||||
|
||||
op_proc = measure_overprocessing(client, repo, since)
|
||||
all_overprocessing.append(op_proc)
|
||||
|
||||
defects = measure_defects(client, repo, since)
|
||||
all_defects_merged += defects["merged"]
|
||||
all_defects_closed += defects["closed_unmerged"]
|
||||
except GiteaError as e:
|
||||
print(f" [WARN] {repo}: {e}")
|
||||
continue
|
||||
|
||||
waiting = measure_waiting(since)
|
||||
motion = measure_motion(since)
|
||||
|
||||
total_prs = all_defects_merged + all_defects_closed
|
||||
defect_rate = round((all_defects_closed / total_prs) * 100, 1) if total_prs else 0.0
|
||||
|
||||
avg_lines = 0.0
|
||||
total_op_count = sum(op["count"] for op in all_overprocessing)
|
||||
if total_op_count:
|
||||
avg_lines = round(
|
||||
sum(op["avg_lines"] * op["count"] for op in all_overprocessing) / total_op_count, 1
|
||||
)
|
||||
all_outliers = [o for op in all_overprocessing for o in op["outliers"]]
|
||||
|
||||
ratio = round(agent_created / issues_closed, 2) if issues_closed else float(agent_created)
|
||||
|
||||
metrics = {
|
||||
"week_ending": week_ending,
|
||||
"timestamp": iso_now(),
|
||||
"overproduction": {
|
||||
"agent_created": agent_created,
|
||||
"closed": issues_closed,
|
||||
"ratio": ratio,
|
||||
},
|
||||
"waiting": waiting,
|
||||
"transport": {"total": transport_total},
|
||||
"overprocessing": {
|
||||
"avg_lines": avg_lines,
|
||||
"outliers": all_outliers,
|
||||
"count": total_op_count,
|
||||
},
|
||||
"inventory": {"total": inventory_total},
|
||||
"motion": motion,
|
||||
"defects": {
|
||||
"merged": all_defects_merged,
|
||||
"closed_unmerged": all_defects_closed,
|
||||
"defect_rate": defect_rate,
|
||||
},
|
||||
}
|
||||
|
||||
metrics["eliminations"] = compute_eliminations(metrics)
|
||||
return metrics
|
||||
|
||||
|
||||
def main() -> int:
|
||||
print("[muda] Starting Muda Audit...")
|
||||
metrics = run_audit()
|
||||
prev = load_previous_metrics()
|
||||
report = build_report(metrics, prev)
|
||||
|
||||
print("\n" + "=" * 50)
|
||||
print(report)
|
||||
print("=" * 50)
|
||||
|
||||
save_metrics(metrics)
|
||||
sent = send_telegram(report)
|
||||
if sent:
|
||||
print("\n[OK] Report posted to Telegram.")
|
||||
else:
|
||||
print("\n[WARN] Telegram notification not sent.")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
@@ -1,231 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Fleet Resource Tracker — Tracks Capacity, Uptime, and Innovation.
|
||||
|
||||
Paperclips-inspired tension model:
|
||||
- Capacity: spent on fleet improvements, generates through utilization
|
||||
- Uptime: earned when services stay up, Fibonacci milestones unlock capabilities
|
||||
- Innovation: only generates when capacity < 70%. Fuels Phase 3+.
|
||||
|
||||
This is the heart of the fleet progression system.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import time
|
||||
import socket
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
# === CONFIG ===
|
||||
DATA_DIR = Path(os.path.expanduser("~/.local/timmy/fleet-resources"))
|
||||
RESOURCES_FILE = DATA_DIR / "resources.json"
|
||||
|
||||
# Tension thresholds
|
||||
INNOVATION_THRESHOLD = 0.70 # Innovation only generates when capacity < 70%
|
||||
INNOVATION_RATE = 5.0 # Innovation generated per hour when under threshold
|
||||
CAPACITY_REGEN_RATE = 2.0 # Capacity regenerates per hour of healthy operation
|
||||
FIBONACCI = [95.0, 95.5, 96.0, 97.0, 97.5, 98.0, 98.3, 98.6, 98.9, 99.0, 99.5]
|
||||
|
||||
|
||||
def init():
|
||||
DATA_DIR.mkdir(parents=True, exist_ok=True)
|
||||
if not RESOURCES_FILE.exists():
|
||||
data = {
|
||||
"capacity": {
|
||||
"current": 100.0,
|
||||
"max": 100.0,
|
||||
"spent_on": [],
|
||||
"history": []
|
||||
},
|
||||
"uptime": {
|
||||
"current_pct": 100.0,
|
||||
"milestones_reached": [],
|
||||
"total_checks": 0,
|
||||
"successful_checks": 0,
|
||||
"history": []
|
||||
},
|
||||
"innovation": {
|
||||
"current": 0.0,
|
||||
"total_generated": 0.0,
|
||||
"spent_on": [],
|
||||
"last_calculated": time.time()
|
||||
}
|
||||
}
|
||||
RESOURCES_FILE.write_text(json.dumps(data, indent=2))
|
||||
print("Initialized resource tracker")
|
||||
return RESOURCES_FILE.exists()
|
||||
|
||||
|
||||
def load():
|
||||
if RESOURCES_FILE.exists():
|
||||
return json.loads(RESOURCES_FILE.read_text())
|
||||
return None
|
||||
|
||||
|
||||
def save(data):
|
||||
RESOURCES_FILE.write_text(json.dumps(data, indent=2))
|
||||
|
||||
|
||||
def update_uptime(checks: dict):
|
||||
"""Update uptime stats from health check results.
|
||||
checks = {'ezra': True, 'allegro': True, 'bezalel': True, 'gitea': True, ...}
|
||||
"""
|
||||
data = load()
|
||||
if not data:
|
||||
return
|
||||
|
||||
data["uptime"]["total_checks"] += 1
|
||||
successes = sum(1 for v in checks.values() if v)
|
||||
total = len(checks)
|
||||
|
||||
# Overall uptime percentage
|
||||
overall = successes / max(total, 1) * 100.0
|
||||
data["uptime"]["successful_checks"] += successes
|
||||
|
||||
# Calculate rolling uptime
|
||||
if "history" not in data["uptime"]:
|
||||
data["uptime"]["history"] = []
|
||||
data["uptime"]["history"].append({
|
||||
"ts": datetime.now(timezone.utc).isoformat(),
|
||||
"checks": checks,
|
||||
"overall": round(overall, 2)
|
||||
})
|
||||
|
||||
# Keep last 1000 checks
|
||||
if len(data["uptime"]["history"]) > 1000:
|
||||
data["uptime"]["history"] = data["uptime"]["history"][-1000:]
|
||||
|
||||
# Calculate current uptime %, last 100 checks
|
||||
recent = data["uptime"]["history"][-100:]
|
||||
recent_ok = sum(c["overall"] for c in recent) / max(len(recent), 1)
|
||||
data["uptime"]["current_pct"] = round(recent_ok, 2)
|
||||
|
||||
# Check Fibonacci milestones
|
||||
new_milestones = []
|
||||
for fib in FIBONACCI:
|
||||
if fib not in data["uptime"]["milestones_reached"] and recent_ok >= fib:
|
||||
data["uptime"]["milestones_reached"].append(fib)
|
||||
new_milestones.append(fib)
|
||||
|
||||
save(data)
|
||||
|
||||
if new_milestones:
|
||||
print(f" UPTIME MILESTONE: {','.join(str(m) + '%') for m in new_milestones}")
|
||||
print(f" Current uptime: {recent_ok:.1f}%")
|
||||
|
||||
return data["uptime"]
|
||||
|
||||
|
||||
def spend_capacity(amount: float, purpose: str):
|
||||
"""Spend capacity on a fleet improvement."""
|
||||
data = load()
|
||||
if not data:
|
||||
return False
|
||||
if data["capacity"]["current"] < amount:
|
||||
print(f" INSUFFICIENT CAPACITY: Need {amount}, have {data['capacity']['current']:.1f}")
|
||||
return False
|
||||
data["capacity"]["current"] -= amount
|
||||
data["capacity"]["spent_on"].append({
|
||||
"purpose": purpose,
|
||||
"amount": amount,
|
||||
"ts": datetime.now(timezone.utc).isoformat()
|
||||
})
|
||||
save(data)
|
||||
print(f" Spent {amount} capacity on: {purpose}")
|
||||
return True
|
||||
|
||||
|
||||
def regenerate_resources():
|
||||
"""Regenerate capacity and calculate innovation."""
|
||||
data = load()
|
||||
if not data:
|
||||
return
|
||||
|
||||
now = time.time()
|
||||
last = data["innovation"]["last_calculated"]
|
||||
hours = (now - last) / 3600.0
|
||||
if hours < 0.1: # Only update every ~6 minutes
|
||||
return
|
||||
|
||||
# Regenerate capacity
|
||||
capacity_gain = CAPACITY_REGEN_RATE * hours
|
||||
data["capacity"]["current"] = min(
|
||||
data["capacity"]["max"],
|
||||
data["capacity"]["current"] + capacity_gain
|
||||
)
|
||||
|
||||
# Calculate capacity utilization
|
||||
utilization = 1.0 - (data["capacity"]["current"] / data["capacity"]["max"])
|
||||
|
||||
# Generate innovation only when under threshold
|
||||
innovation_gain = 0.0
|
||||
if utilization < INNOVATION_THRESHOLD:
|
||||
innovation_gain = INNOVATION_RATE * hours * (1.0 - utilization / INNOVATION_THRESHOLD)
|
||||
data["innovation"]["current"] += innovation_gain
|
||||
data["innovation"]["total_generated"] += innovation_gain
|
||||
|
||||
# Record history
|
||||
if "history" not in data["capacity"]:
|
||||
data["capacity"]["history"] = []
|
||||
data["capacity"]["history"].append({
|
||||
"ts": datetime.now(timezone.utc).isoformat(),
|
||||
"capacity": round(data["capacity"]["current"], 1),
|
||||
"utilization": round(utilization * 100, 1),
|
||||
"innovation": round(data["innovation"]["current"], 1),
|
||||
"innovation_gain": round(innovation_gain, 1)
|
||||
})
|
||||
# Keep last 500 capacity records
|
||||
if len(data["capacity"]["history"]) > 500:
|
||||
data["capacity"]["history"] = data["capacity"]["history"][-500:]
|
||||
|
||||
data["innovation"]["last_calculated"] = now
|
||||
|
||||
save(data)
|
||||
print(f" Capacity: {data['capacity']['current']:.1f}/{data['capacity']['max']:.1f}")
|
||||
print(f" Utilization: {utilization*100:.1f}%")
|
||||
print(f" Innovation: {data['innovation']['current']:.1f} (+{innovation_gain:.1f} this period)")
|
||||
|
||||
return data
|
||||
|
||||
|
||||
def status():
|
||||
"""Print current resource status."""
|
||||
data = load()
|
||||
if not data:
|
||||
print("Resource tracker not initialized. Run --init first.")
|
||||
return
|
||||
|
||||
print("\n=== Fleet Resources ===")
|
||||
print(f" Capacity: {data['capacity']['current']:.1f}/{data['capacity']['max']:.1f}")
|
||||
|
||||
utilization = 1.0 - (data["capacity"]["current"] / data["capacity"]["max"])
|
||||
print(f" Utilization: {utilization*100:.1f}%")
|
||||
|
||||
innovation_status = "GENERATING" if utilization < INNOVATION_THRESHOLD else "BLOCKED"
|
||||
print(f" Innovation: {data['innovation']['current']:.1f} [{innovation_status}]")
|
||||
|
||||
print(f" Uptime: {data['uptime']['current_pct']:.1f}%")
|
||||
print(f" Milestones: {', '.join(str(m)+'%' for m in data['uptime']['milestones_reached']) or 'None yet'}")
|
||||
|
||||
# Phase gate checks
|
||||
phase_2_ok = data['uptime']['current_pct'] >= 95.0
|
||||
phase_3_ok = phase_2_ok and data['innovation']['current'] > 100
|
||||
phase_5_ok = phase_2_ok and data['innovation']['current'] > 500
|
||||
|
||||
print(f"\n Phase Gates:")
|
||||
print(f" Phase 2 (Automation): {'UNLOCKED' if phase_2_ok else 'LOCKED (need 95% uptime)'}")
|
||||
print(f" Phase 3 (Orchestration): {'UNLOCKED' if phase_3_ok else 'LOCKED (need 95% uptime + 100 innovation)'}")
|
||||
print(f" Phase 5 (Scale): {'UNLOCKED' if phase_5_ok else 'LOCKED (need 95% uptime + 500 innovation)'}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
init()
|
||||
if len(sys.argv) > 1 and sys.argv[1] == "status":
|
||||
status()
|
||||
elif len(sys.argv) > 1 and sys.argv[1] == "regen":
|
||||
regenerate_resources()
|
||||
else:
|
||||
regenerate_resources()
|
||||
status()
|
||||
@@ -1,255 +0,0 @@
|
||||
# Fleet Topology — The Timmy Foundation
|
||||
|
||||
**Last audited:** 2026-04-07
|
||||
**Auditor:** Timmy (direct)
|
||||
**Next review:** When any machine changes
|
||||
|
||||
---
|
||||
|
||||
## Overview Map
|
||||
|
||||
```
|
||||
┌─────────────┐
|
||||
│ Gitea Forge│
|
||||
│ forge.aws.com│
|
||||
└──────┬──────┘
|
||||
│ HTTPS
|
||||
┌─────────────────┼──────────────────┐
|
||||
│ │ │
|
||||
┌────┴────┐ ┌────┴────┐ ┌─────┴──────┐
|
||||
│ EZRA │ │ ALLEGRO │ │ BEZALEL │
|
||||
│ VPS │ │ VPS │ │ VPS │
|
||||
│ 143.x │ │ 167.x │ │ 159.x │
|
||||
│ $12/mo │ │ $12/mo │ │ $12/mo │
|
||||
└────┬────┘ └────┬────┘ └─────┬──────┘
|
||||
│ │ │
|
||||
└────────────────┼──────────────────┘
|
||||
│
|
||||
┌─────┴──────┐
|
||||
│ MAC LOCAL │
|
||||
│ M3 Max │
|
||||
│ 36GB │
|
||||
│ 10.1.10.77 │
|
||||
└────────────┘
|
||||
```
|
||||
|
||||
**Total VPS cost:** ~$36/mo
|
||||
**Total machines:** 4 (3 VPS + 1 Mac)
|
||||
**Network:** All VPSes on DigitalOcean, Mac on local network (10.1.10.77)
|
||||
|
||||
---
|
||||
|
||||
## Machine 1: MAC LOCAL (The Hub)
|
||||
|
||||
| Item | Value |
|
||||
|------|-------|
|
||||
| **OS** | macOS 26.3.1 (25D2128) |
|
||||
| **CPU** | Apple M3 Max, 14 cores |
|
||||
| **RAM** | 36 GB |
|
||||
| **Disk** | 926 Gi total, 302 Gi free, 12 Gi used (4%) |
|
||||
| **IP** | 10.1.10.77 (local), external unknown |
|
||||
| **Role** | Primary AI harness, agent runtime, Evennia world |
|
||||
|
||||
### Running Processes
|
||||
|
||||
| Process | PID | Memory | Notes |
|
||||
|---------|-----|--------|-------|
|
||||
| Hermes gateway | 68449 | ~500MB | Primary gateway |
|
||||
| Hermes agent (s020) | 88813 | ~180MB | Session active since 1:01PM |
|
||||
| Hermes agent (s007) | 62032 | ~200MB | Session active since 10:20PM prev |
|
||||
| Hermes agent (s001) | 12072 | ~178MB | Session active since Sun 6PM |
|
||||
| Ollama | 71466 | ~20MB | /opt/homebrew/opt/ollama/bin/ollama serve |
|
||||
| OpenClaw gateway | 85834 | ~350MB | Tue 12PM start |
|
||||
| Crucible MCP (x4) | multiple | ~10-69MB each | MCP server instances |
|
||||
| Evennia Server | 66433 | ~49MB | Sun 10PM start, port 4000 |
|
||||
| Evennia Portal | 66423 | ~7MB | Sun 10PM start, port 4001 |
|
||||
|
||||
### LaunchD Services
|
||||
|
||||
| Service | Status | Notes |
|
||||
|---------|--------|-------|
|
||||
| ai.hermes.gateway | Running (-9) | Primary gateway - PID 68426 |
|
||||
| ai.hermes.gateway-bezalel | Running (0) | Bezalel gateway connection |
|
||||
| ai.hermes.gateway-fenrir | Running (0) | Fenrir gateway connection |
|
||||
| com.ollama.ollama | Running (1) | Ollama service |
|
||||
| ai.timmy.codeclaw-qwen-heartbeat | Running (0) | Claw Code worker heartbeat (15min) |
|
||||
| ai.timmy.kimi-heartbeat | Running (0) | Kimi agent heartbeat |
|
||||
| ai.timmy.claudemax-watchdog | Running (0) | Claude subscription watchdog |
|
||||
|
||||
### Cron Jobs
|
||||
|
||||
| Schedule | Script | Purpose |
|
||||
|----------|--------|---------|
|
||||
| `0 9 * * *` | daily-fleet-health.sh | Daily fleet health check |
|
||||
| `*/30 * * *` | burn-monitor.sh | Burn mode monitoring |
|
||||
| `*/15 * * *` | loop-watchdog.sh | Restart dead Groq/Gemini loops |
|
||||
| `0 8 * * *` | morning-report.sh | Overnight summary to Telegram+Gitea |
|
||||
|
||||
### Key Directories
|
||||
|
||||
| Path | Purpose |
|
||||
|------|---------|
|
||||
| ~/.hermes/ | Hermes harness - tools, agents, sessions |
|
||||
| ~/.hermes/hermes-agent/ | Hermes agent source + venv |
|
||||
| ~/.hermes/scripts/ | Fleet scripts (health, burns, watchdog) |
|
||||
| ~/.timmy/ | Timmy workspace - Evennia, configs, skills |
|
||||
| ~/.timmy/evennia/timmy_world/ | Evennia world (port 4000/4001) |
|
||||
| ~/.config/gitea/ | Tokens for: timmy, claw-code, codex, fenrir, substratum, carnice |
|
||||
| ~/.config/telegram/ | Special bot token |
|
||||
| ~/work/ | Active work directories |
|
||||
| ~/code-claw/ | Claw Code binary + workspace |
|
||||
|
||||
---
|
||||
|
||||
## Machine 2: EZRA (Forge)
|
||||
|
||||
| Item | Value |
|
||||
|------|-------|
|
||||
| **IP** | 143.198.27.163 |
|
||||
| **Provider** | DigitalOcean |
|
||||
| **Cost** | ~$12/mo |
|
||||
| **DNS** | forge.alexanderwhitestone.com |
|
||||
| **Role** | Gitea server, DNS management |
|
||||
|
||||
### Services
|
||||
|
||||
| Service | Notes |
|
||||
|---------|-------|
|
||||
| Gitea | forge.alexanderwhitestone.com, port 443 (nginx proxy) |
|
||||
| Nginx | Reverse proxy for Gitea |
|
||||
| HTTPS | Let's Encrypt cert (Apr 5 - Jul 4 2026) |
|
||||
|
||||
### Key Facts
|
||||
- Gitea org: Timmy_Foundation (ID: 10)
|
||||
- 16 repos across the org
|
||||
- 16 watchers on timmy-home
|
||||
- API at: https://forge.alexanderwhitestone.com/api/v1
|
||||
- Token stored on Mac at ~/.config/gitea/token
|
||||
|
||||
---
|
||||
|
||||
## Machine 3: ALLEGRO
|
||||
|
||||
| Item | Value |
|
||||
|------|-------|
|
||||
| **IP** | 167.99.126.228 |
|
||||
| **Provider** | DigitalOcean |
|
||||
| **Cost** | ~$12/mo |
|
||||
| **Role** | Agent hosting |
|
||||
|
||||
### Known Services
|
||||
| Service | Notes |
|
||||
|---------|-------|
|
||||
| Agents | Agent processes (specific ones TBD) |
|
||||
| SSH | Access from Mac needs verification (issue #538) |
|
||||
|
||||
### Unresolved Issues
|
||||
- SSH access from Mac to Allegro not confirmed (timmy-home #538)
|
||||
|
||||
---
|
||||
|
||||
## Machine 4: BEZALEL
|
||||
|
||||
| Item | Value |
|
||||
|------|-------|
|
||||
| **IP** | 159.203.146.185 |
|
||||
| **Provider** | DigitalOcean |
|
||||
| **Cost** | ~$12/mo |
|
||||
| **DNS** | bezalel.alexanderwhitestone.com |
|
||||
| **Role** | Evennia world, agent hosting |
|
||||
|
||||
### Services
|
||||
| Service | Notes |
|
||||
|---------|-------|
|
||||
| Evennia | World running (needs config fix per #534) |
|
||||
| Agent hosting | Bezalel agent |
|
||||
| Tailscale | Not yet installed (#535) |
|
||||
|
||||
### Unresolved Issues
|
||||
- #534: Evennia settings have bad port tuples, DB is ready
|
||||
- #535: Tailscale not installed
|
||||
- #536: Evennia world needs themed rooms/characters
|
||||
|
||||
---
|
||||
|
||||
## Network Topology
|
||||
|
||||
```
|
||||
Internet ──→ forge.alexanderwhitestone.com (Ezra, 143.198.27.163)
|
||||
──→ bezalel.alexanderwhitestone.com (Bezalel, 159.203.146.185)
|
||||
|
||||
Mac (10.1.10.77) ──→ Ezra (SSH/HTTPS)
|
||||
──→ Allegro (SSH - broken?)
|
||||
──→ Bezalel (SSH)
|
||||
|
||||
Tailscale: Not installed on any VPS yet
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Credential Inventory (NOT the secrets, just where they live)
|
||||
|
||||
| Credential | Location | Used By |
|
||||
|-----------|----------|---------|
|
||||
| Gitea token (timmy) | ~/.config/gitea/timmy-token | Timmy API calls |
|
||||
| Gitea token (generic) | ~/.config/gitea/token | General API access |
|
||||
| Gitea token (claw-code) | ~/.config/gitea/claw-code-token | Code Claw worker |
|
||||
| Gitea token (codex) | ~/.config/gitea/codex-token | Codex agent |
|
||||
| Gitea token (fenrir) | ~/.config/gitea/fenrir-token | Fenrir agent |
|
||||
| Gitea token (substratum) | ~/.config/gitea/substratum-token | Substratum agent |
|
||||
| Gitea token (carnice) | ~/.config/gitea/carnice-token | Carnice agent |
|
||||
| Telegram bot token | ~/.config/telegram/special_bot | @TimmysNexus_bot |
|
||||
| OpenRouter key | ~/.timmy/openrouter_key | Model routing |
|
||||
|
||||
---
|
||||
|
||||
## Resource Baseline (Current State)
|
||||
|
||||
### Compute Capacity (estimated)
|
||||
| Machine | CPU | RAM | Est. Daily Compute Hours |
|
||||
|---------|-----|-----|------------------------|
|
||||
| Mac Local | M3 Max (14c) | 36GB | ~4-6 hrs active use |
|
||||
| Ezra | Unknown | Unknown | Gitea only, minimal |
|
||||
| Allegro | Unknown | Unknown | Agent hosting |
|
||||
| Bezalel | Unknown | Unknown | Evennia + agent |
|
||||
|
||||
### Model Inference
|
||||
| Model | Location | Provider | Status |
|
||||
|-------|----------|----------|--------|
|
||||
| hermes4:14b | Local (Ollama) | Ollama | Running |
|
||||
| qwen/qwen3.6-plus:free | Cloud | OpenRouter | Active (this session) |
|
||||
| qwen/qwen3-32b | Cloud | Groq | Used by aider |
|
||||
|
||||
### Storage
|
||||
| Machine | Total | Used | Free | Utilization |
|
||||
|---------|-------|------|------|-------------|
|
||||
| Mac Local | 926 Gi | 624 Gi | 302 Gi | 32% |
|
||||
| Ezra | Unknown | Unknown | Unknown | Unknown |
|
||||
| Allegro | Unknown | Unknown | Unknown | Unknown |
|
||||
| Bezalel | Unknown | Unknown | Unknown | Unknown |
|
||||
|
||||
---
|
||||
|
||||
## What We Don't Know Yet
|
||||
|
||||
- [ ] CPU/RAM/disk on Ezra, Allegro, Bezalel (no inventory script yet)
|
||||
- [ ] Running processes on VPSes
|
||||
- [ ] Network paths between VPSes (no Tailscale yet)
|
||||
- [ ] SSH connectivity from Mac to Allegro
|
||||
- [ ] Backup state of any machine
|
||||
- [ ] Uptime baseline (not tracked)
|
||||
- [ ] Cost per agent per day (not tracked)
|
||||
|
||||
---
|
||||
|
||||
## Dependencies
|
||||
|
||||
| Service | Depends On | Risk if Down |
|
||||
|---------|-----------|--------------|
|
||||
| Gitea | Ezra, nginx, HTTPS | Can't manage issues, PRs, or repos |
|
||||
| Hermas agents | Mac, Ollama, OpenRouter | No AI work gets done |
|
||||
| Evennia | Bezalel VPS | Game world down |
|
||||
| Telegram bot | Telegram API, Mac process | No notifications |
|
||||
| Code Claw heartbeat | Mac, OpenRouter, Gitea | No automated issue processing |
|
||||
|
||||
---
|
||||
@@ -19,7 +19,6 @@ import os
|
||||
import urllib.request
|
||||
import urllib.error
|
||||
import urllib.parse
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
@@ -143,11 +142,6 @@ class PullRequest:
|
||||
mergeable: bool = False
|
||||
merged: bool = False
|
||||
changed_files: int = 0
|
||||
additions: int = 0
|
||||
deletions: int = 0
|
||||
created_at: str = ""
|
||||
updated_at: str = ""
|
||||
closed_at: str = ""
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, d: dict) -> "PullRequest":
|
||||
@@ -164,11 +158,6 @@ class PullRequest:
|
||||
mergeable=d.get("mergeable", False),
|
||||
merged=d.get("merged", False) or False,
|
||||
changed_files=d.get("changed_files", 0),
|
||||
additions=d.get("additions", 0),
|
||||
deletions=d.get("deletions", 0),
|
||||
created_at=d.get("created_at", ""),
|
||||
updated_at=d.get("updated_at", ""),
|
||||
closed_at=d.get("closed_at", ""),
|
||||
)
|
||||
|
||||
|
||||
@@ -222,53 +211,37 @@ class GiteaClient:
|
||||
|
||||
# -- HTTP layer ----------------------------------------------------------
|
||||
|
||||
|
||||
def _request(
|
||||
self,
|
||||
method: str,
|
||||
path: str,
|
||||
data: Optional[dict] = None,
|
||||
params: Optional[dict] = None,
|
||||
retries: int = 3,
|
||||
backoff: float = 1.5,
|
||||
) -> Any:
|
||||
"""Make an authenticated API request with exponential backoff retries."""
|
||||
"""Make an authenticated API request. Returns parsed JSON."""
|
||||
url = f"{self.api}{path}"
|
||||
if params:
|
||||
url += "?" + urllib.parse.urlencode(params)
|
||||
|
||||
body = json.dumps(data).encode() if data else None
|
||||
|
||||
for attempt in range(retries):
|
||||
req = urllib.request.Request(url, data=body, method=method)
|
||||
req.add_header("Authorization", f"token {self.token}")
|
||||
req.add_header("Content-Type", "application/json")
|
||||
req.add_header("Accept", "application/json")
|
||||
req = urllib.request.Request(url, data=body, method=method)
|
||||
req.add_header("Authorization", f"token {self.token}")
|
||||
req.add_header("Content-Type", "application/json")
|
||||
req.add_header("Accept", "application/json")
|
||||
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=30) as resp:
|
||||
raw = resp.read().decode()
|
||||
if not raw:
|
||||
return {}
|
||||
return json.loads(raw)
|
||||
except urllib.error.HTTPError as e:
|
||||
body_text = ""
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=30) as resp:
|
||||
raw = resp.read().decode()
|
||||
if not raw:
|
||||
return {}
|
||||
return json.loads(raw)
|
||||
except urllib.error.HTTPError as e:
|
||||
# Don't retry client errors (4xx) except 429
|
||||
if 400 <= e.code < 500 and e.code != 429:
|
||||
body_text = ""
|
||||
try:
|
||||
body_text = e.read().decode()
|
||||
except Exception:
|
||||
pass
|
||||
raise GiteaError(e.code, body_text, url) from e
|
||||
|
||||
if attempt == retries - 1:
|
||||
raise GiteaError(e.code, str(e), url) from e
|
||||
|
||||
time.sleep(backoff ** attempt)
|
||||
except (urllib.error.URLError, TimeoutError) as e:
|
||||
if attempt == retries - 1:
|
||||
raise GiteaError(500, str(e), url) from e
|
||||
time.sleep(backoff ** attempt)
|
||||
body_text = e.read().decode()
|
||||
except Exception:
|
||||
pass
|
||||
raise GiteaError(e.code, body_text, url) from e
|
||||
|
||||
def _get(self, path: str, **params) -> Any:
|
||||
# Filter out None values
|
||||
@@ -300,9 +273,9 @@ class GiteaClient:
|
||||
|
||||
# -- Repos ---------------------------------------------------------------
|
||||
|
||||
def list_org_repos(self, org: str, limit: int = 50, page: int = 1) -> list[dict]:
|
||||
def list_org_repos(self, org: str, limit: int = 50) -> list[dict]:
|
||||
"""List repos in an organization."""
|
||||
return self._get(f"/orgs/{org}/repos", limit=limit, page=page)
|
||||
return self._get(f"/orgs/{org}/repos", limit=limit)
|
||||
|
||||
# -- Issues --------------------------------------------------------------
|
||||
|
||||
@@ -316,7 +289,6 @@ class GiteaClient:
|
||||
direction: str = "desc",
|
||||
limit: int = 30,
|
||||
page: int = 1,
|
||||
since: Optional[str] = None,
|
||||
) -> list[Issue]:
|
||||
"""List issues for a repo."""
|
||||
raw = self._get(
|
||||
@@ -329,7 +301,6 @@ class GiteaClient:
|
||||
direction=direction,
|
||||
limit=limit,
|
||||
page=page,
|
||||
since=since,
|
||||
)
|
||||
return [Issue.from_dict(i) for i in raw]
|
||||
|
||||
|
||||
|
Before Width: | Height: | Size: 415 KiB |
|
Before Width: | Height: | Size: 249 KiB |
|
Before Width: | Height: | Size: 509 KiB |
|
Before Width: | Height: | Size: 395 KiB |
|
Before Width: | Height: | Size: 443 KiB |
|
Before Width: | Height: | Size: 246 KiB |
|
Before Width: | Height: | Size: 283 KiB |
|
Before Width: | Height: | Size: 284 KiB |
|
Before Width: | Height: | Size: 225 KiB |
|
Before Width: | Height: | Size: 222 KiB |
|
Before Width: | Height: | Size: 332 KiB |
|
Before Width: | Height: | Size: 496 KiB |
|
Before Width: | Height: | Size: 384 KiB |
|
Before Width: | Height: | Size: 311 KiB |
|
Before Width: | Height: | Size: 407 KiB |
|
Before Width: | Height: | Size: 164 KiB |
|
Before Width: | Height: | Size: 281 KiB |
|
Before Width: | Height: | Size: 569 KiB |
|
Before Width: | Height: | Size: 535 KiB |
|
Before Width: | Height: | Size: 295 KiB |
|
Before Width: | Height: | Size: 299 KiB |
|
Before Width: | Height: | Size: 247 KiB |
|
Before Width: | Height: | Size: 348 KiB |
|
Before Width: | Height: | Size: 379 KiB |