Compare commits

..

14 Commits

Author SHA1 Message Date
1156a9f55e chore: claw-code progress on #126
Some checks failed
Forge CI / smoke-and-build (pull_request) Failing after 2s
Refs #126
2026-04-06 23:45:43 -04:00
6581dcb1af fix(ezra): switch primary from kimi-for-coding to kimi-k2.5, add fallback chain
Some checks failed
Forge CI / smoke-and-build (push) Failing after 2s
kimi-for-coding is throwing 403 access-terminated errors.
This switches Ezra to kimi-k2.5 and adds anthropic + openrouter fallbacks.
Addresses #lazzyPit and unblocks Ezra resurrection.
2026-04-07 03:23:36 +00:00
a37fed23e6 [BEZALEL][CI] Syntax Guard — Prevent Broken Python from Reaching Main (#167)
Some checks failed
Forge CI / smoke-and-build (push) Failing after 2s
2026-04-07 02:27:32 +00:00
97f63a0d89 Merge pull request '[BEZALEL][DEVKIT] Shared Development Tools for the Wizard Fleet' (#166) from bezalel/devkit-for-the-fleet into main
Some checks failed
Forge CI / smoke-and-build (push) Failing after 2s
2026-04-07 02:15:11 +00:00
b49e8b11ea Merge pull request '[BEZALEL][Epic-001] The Forge CI Pipeline — Gitea Actions + Smoke + Green E2E' (#154) from bezalel/epic-001-forge-ci into main
Some checks failed
Forge CI / smoke-and-build (push) Failing after 2s
2026-04-07 02:12:31 +00:00
88b4cc218f feat(devkit): Add shared development tools for the wizard fleet
Some checks failed
Notebook CI / notebook-smoke (pull_request) Failing after 2s
- gitea_client.py — reusable Gitea API client for issues, PRs, comments
- health.py — fleet health monitor (load, disk, memory, processes)
- notebook_runner.py — Papermill wrapper with JSON reporting
- smoke_test.py — fast smoke tests and bare green-path e2e
- secret_scan.py — secret leak scanner for CI gating
- wizard_env.py — environment validator for bootstrapping agents
- README.md — usage guide for all tools

These tools are designed to be used by any wizard via python -m devkit.<tool>.
Rising up as a platform, not a silo.
2026-04-07 02:08:47 +00:00
59653ef409 [claude] Research Triage: SSD Self-Distillation acknowledgment (#128) (#165) 2026-04-07 02:07:54 +00:00
e32d6332bc [claude] Forge Operations Guide — Practical Wizard Onboarding (#142) (#164) 2026-04-07 02:06:15 +00:00
6291f2d31b [claude] Fleet SITREP — April 6, 2026 acknowledgment (#143) (#162) 2026-04-07 02:04:51 +00:00
066ec8eafa [claude] Add Ezra Quarterly Report — April 2026 (MD + PDF) (#133) (#163) 2026-04-07 02:04:45 +00:00
069d5404a0 [BEZALEL][DEMO] Notebook Workflow: Jupytext + Papermill for Agent Tasks (#157)
Some checks failed
Notebook CI / notebook-smoke (push) Failing after 2s
2026-04-07 02:02:49 +00:00
53fe58a2b9 feat(notebooks): Add Jupytext + Papermill agent workflow demo
Some checks failed
Notebook CI / notebook-smoke (push) Failing after 3s
Notebook CI / notebook-smoke (pull_request) Failing after 5s
- Add parameterized system-health notebook (.py source + .ipynb)
- Add Gitea Actions CI workflow for notebook execution smoke test
- Add NOTEBOOK_WORKFLOW.md documenting the .py-first approach
- Proves end-to-end: agent writes .py -> PR review -> CI executes -> output artifact
2026-04-07 01:54:25 +00:00
43bcb88a09 [BEZALEL][Epic-001] The Forge CI Pipeline — Gitea Actions + Smoke + Green E2E
Some checks failed
Forge CI / smoke-and-build (pull_request) Failing after 3s
- Add .gitea/workflows/ci.yml: Gitea Actions workflow for PR/push CI
- Add scripts/smoke_test.py: fast smoke tests (<30s) for core imports and CLI entrypoints
- Add tests/test_green_path_e2e.py: bare green-path e2e — terminal echo test
- Total CI runtime target: <5 minutes
- No API keys required for smoke/e2e stages

Closes #145
/assign @bezalel
2026-04-07 00:28:32 +00:00
89730e8e90 [BEZALEL] Add forge health check — artifact integrity and security scanner
Some checks failed
Supply Chain Audit / Scan PR for supply chain risks (pull_request) Failing after 0s
Docker Build and Publish / build-and-push (pull_request) Failing after 7s
Tests / test (pull_request) Failing after 2s
Adds scripts/forge_health_check.py to scan wizard environments for:
- Missing .py source files with orphaned .pyc bytecode (GOFAI artifact integrity)
- Burn script clutter in production paths
- World-readable sensitive files (keystores, tokens, .env)
- Missing required environment variables

Includes full test suite in tests/test_forge_health_check.py covering
orphaned bytecode detection, burn script clutter, permission auto-fix,
and environment variable validation.

Addresses Allegro formalization audit findings:
- GOFAI source files missing (only .pyc remains)
- Nostr keystore world-readable
- eg burn scripts cluttering /root

/assign @bezalel
2026-04-06 22:37:32 +00:00
24 changed files with 2215 additions and 41 deletions

View File

@@ -0,0 +1,2 @@
{"created_at_ms":1775533542734,"session_id":"session-1775533542734-0","type":"session_meta","updated_at_ms":1775533542734,"version":1}
{"message":{"blocks":[{"text":"You are Code Claw running as the Gitea user claw-code.\n\nRepository: Timmy_Foundation/hermes-agent\nIssue: #126 — P2: Validate Documentation Audit & Apply to Our Fork\nBranch: claw-code/issue-126\n\nRead the issue and recent comments, then implement the smallest correct change.\nYou are in a git repo checkout already.\n\nIssue body:\n## Context\n\nCommit `43d468ce` is a comprehensive documentation audit — fixes stale info, expands thin pages, adds depth across all docs.\n\n## Acceptance Criteria\n\n- [ ] **Catalog all doc changes**: Run `git show 43d468ce --stat` to list all files changed, then review each for what was fixed/expanded\n- [ ] **Verify key docs are accurate**: Pick 3 docs that were previously thin (setup, deployment, plugin development), confirm they now have comprehensive content\n- [ ] **Identify stale info that was corrected**: Note at least 3 pieces of stale information that were removed or updated\n- [ ] **Apply fixes to our fork if needed**: Check if any of the doc fixes apply to our `Timmy_Foundation/hermes-agent` fork (Timmy-specific references, custom config sections)\n\n## Why This Matters\n\nAccurate documentation is critical for onboarding new agents and maintaining the fleet. Stale docs cost more debugging time than writing them initially.\n\n## Hints\n\n- Run `cd ~/.hermes/hermes-agent && git show 43d468ce --stat` to see the full scope\n- The docs likely cover: setup, plugins, deployment, MCP configuration, and tool integrations\n\n\nParent: #111\n\nRecent comments:\n## 🏷️ Automated Triage Check\n\n**Timestamp:** 2026-04-06T15:30:12.449023 \n**Agent:** Allegro Heartbeat\n\nThis issue has been identified as needing triage:\n\n### Checklist\n- [ ] Clear acceptance criteria defined\n- [ ] Priority label assigned (p0-critical / p1-important / p2-backlog)\n- [ ] Size estimate added (quick-fix / day / week / epic)\n- [ ] Owner assigned\n- [ ] Related issues linked\n\n### Context\n- No comments yet — needs engagement\n- No labels — needs categorization\n- Part of automated backlog maintenance\n\n---\n*Automated triage from Allegro 15-minute heartbeat*\n\n[BURN-DOWN] Dispatched to Code Claw (claw-code worker) as part of nightly burn-down cycle. Heartbeat active.\n\n🟠 Code Claw (OpenRouter qwen/qwen3.6-plus:free) picking up this issue via 15-minute heartbeat.\n\nTimestamp: 2026-04-07T03:45:37Z\n\nRules:\n- Make focused code/config/doc changes only if they directly address the issue.\n- Prefer the smallest proof-oriented fix.\n- Run relevant verification commands if obvious.\n- Do NOT create PRs yourself; the outer worker handles commit/push/PR.\n- If the task is too large or not code-fit, leave the tree unchanged.\n","type":"text"}],"role":"user"},"type":"message"}

54
.gitea/workflows/ci.yml Normal file
View File

@@ -0,0 +1,54 @@
name: Forge CI
on:
push:
branches: [main]
pull_request:
branches: [main]
concurrency:
group: forge-ci-${{ gitea.ref }}
cancel-in-progress: true
jobs:
smoke-and-build:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install uv
uses: astral-sh/setup-uv@v5
- name: Set up Python 3.11
run: uv python install 3.11
- name: Install package
run: |
uv venv .venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[all,dev]"
- name: Smoke tests
run: |
source .venv/bin/activate
python scripts/smoke_test.py
env:
OPENROUTER_API_KEY: ""
OPENAI_API_KEY: ""
NOUS_API_KEY: ""
- name: Syntax guard
run: |
source .venv/bin/activate
python scripts/syntax_guard.py
- name: Green-path E2E
run: |
source .venv/bin/activate
python -m pytest tests/test_green_path_e2e.py -q --tb=short
env:
OPENROUTER_API_KEY: ""
OPENAI_API_KEY: ""
NOUS_API_KEY: ""

View File

@@ -0,0 +1,44 @@
name: Notebook CI
on:
push:
paths:
- 'notebooks/**'
pull_request:
paths:
- 'notebooks/**'
jobs:
notebook-smoke:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install dependencies
run: |
pip install papermill jupytext nbformat
python -m ipykernel install --user --name python3
- name: Execute system health notebook
run: |
papermill notebooks/agent_task_system_health.ipynb /tmp/output.ipynb \
-p threshold 0.5 \
-p hostname ci-runner
- name: Verify output has results
run: |
python -c "
import json
nb = json.load(open('/tmp/output.ipynb'))
code_cells = [c for c in nb['cells'] if c['cell_type'] == 'code']
outputs = [c.get('outputs', []) for c in code_cells]
total_outputs = sum(len(o) for o in outputs)
assert total_outputs > 0, 'Notebook produced no outputs'
print(f'Notebook executed successfully with {total_outputs} output(s)')
"

View File

@@ -1,44 +1,34 @@
# Ezra Configuration - Kimi Primary
# Anthropic removed from chain entirely
# PRIMARY: Kimi for all operations
model: kimi-coding/kimi-for-coding
# Fallback chain: Only local/offline options
# NO anthropic in the chain - quota issues solved
fallback_providers:
- provider: ollama
model: qwen2.5:7b
base_url: http://localhost:11434
timeout: 120
reason: "Local fallback when Kimi unavailable"
# Provider settings
providers:
kimi-coding:
timeout: 60
max_retries: 3
# Uses KIMI_API_KEY from .env
ollama:
timeout: 120
keep_alive: true
base_url: http://localhost:11434
# REMOVED: anthropic provider entirely
# No more quota issues, no more choking
# Toolsets - Ezra needs these
model:
default: kimi-k2.5
provider: kimi-coding
toolsets:
- hermes-cli
- github
- web
# Agent settings
- all
fallback_providers:
- provider: kimi-coding
model: kimi-k2.5
timeout: 120
reason: Kimi coding fallback (front of chain)
- provider: anthropic
model: claude-sonnet-4-20250514
timeout: 120
reason: Direct Anthropic fallback
- provider: openrouter
model: anthropic/claude-sonnet-4-20250514
base_url: https://openrouter.ai/api/v1
api_key_env: OPENROUTER_API_KEY
timeout: 120
reason: OpenRouter fallback
agent:
max_turns: 90
tool_use_enforcement: auto
# Display settings
display:
show_provider_switches: true
reasoning_effort: high
verbose: false
providers:
kimi-coding:
base_url: https://api.kimi.com/coding/v1
timeout: 60
max_retries: 3
anthropic:
timeout: 120
openrouter:
base_url: https://openrouter.ai/api/v1
timeout: 120

56
devkit/README.md Normal file
View File

@@ -0,0 +1,56 @@
# Bezalel's Devkit — Shared Tools for the Wizard Fleet
This directory contains reusable CLI tools and Python modules for CI, testing, deployment, observability, and Gitea automation. Any wizard can invoke them via `python -m devkit.<tool>`.
## Tools
### `gitea_client` — Gitea API Client
List issues/PRs, post comments, create PRs, update issues.
```bash
python -m devkit.gitea_client issues --state open --limit 20
python -m devkit.gitea_client create-comment --number 142 --body "Update from Bezalel"
python -m devkit.gitea_client prs --state open
```
### `health` — Fleet Health Monitor
Checks system load, disk, memory, running processes, and key package versions.
```bash
python -m devkit.health --threshold-load 1.0 --threshold-disk 90.0 --fail-on-critical
```
### `notebook_runner` — Notebook Execution Wrapper
Parameterizes and executes Jupyter notebooks via Papermill with structured JSON reporting.
```bash
python -m devkit.notebook_runner task.ipynb output.ipynb -p threshold=1.0 -p hostname=forge
```
### `smoke_test` — Fast Smoke Test Runner
Runs core import checks, CLI entrypoint tests, and one bare green-path E2E.
```bash
python -m devkit.smoke_test --verbose
```
### `secret_scan` — Secret Leak Scanner
Scans the repo for API keys, tokens, and private keys.
```bash
python -m devkit.secret_scan --path . --fail-on-find
```
### `wizard_env` — Environment Validator
Checks that a wizard environment has all required binaries, env vars, Python packages, and Hermes config.
```bash
python -m devkit.wizard_env --json --fail-on-incomplete
```
## Philosophy
- **CLI-first** — Every tool is runnable as `python -m devkit.<tool>`
- **JSON output** — Easy to parse from other agents and CI pipelines
- **Zero dependencies beyond stdlib** where possible; optional heavy deps are runtime-checked
- **Fail-fast** — Exit codes are meaningful for CI gating

9
devkit/__init__.py Normal file
View File

@@ -0,0 +1,9 @@
"""
Bezalel's Devkit — Shared development tools for the wizard fleet.
A collection of CLI-accessible utilities for CI, testing, deployment,
observability, and Gitea automation. Designed to be used by any agent
via subprocess or direct Python import.
"""
__version__ = "0.1.0"

153
devkit/gitea_client.py Normal file
View File

@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""
Shared Gitea API client for wizard fleet automation.
Usage as CLI:
python -m devkit.gitea_client issues --repo Timmy_Foundation/hermes-agent --state open
python -m devkit.gitea_client issue --repo Timmy_Foundation/hermes-agent --number 142
python -m devkit.gitea_client create-comment --repo Timmy_Foundation/hermes-agent --number 142 --body "Update from Bezalel"
python -m devkit.gitea_client prs --repo Timmy_Foundation/hermes-agent --state open
Usage as module:
from devkit.gitea_client import GiteaClient
client = GiteaClient()
issues = client.list_issues("Timmy_Foundation/hermes-agent", state="open")
"""
import argparse
import json
import os
import sys
from typing import Any, Dict, List, Optional
import urllib.request
DEFAULT_BASE_URL = os.getenv("GITEA_URL", "https://forge.alexanderwhitestone.com")
DEFAULT_TOKEN = os.getenv("GITEA_TOKEN", "")
class GiteaClient:
def __init__(self, base_url: str = DEFAULT_BASE_URL, token: str = DEFAULT_TOKEN):
self.base_url = base_url.rstrip("/")
self.token = token or ""
def _request(
self,
method: str,
path: str,
data: Optional[Dict[str, Any]] = None,
headers: Optional[Dict[str, str]] = None,
) -> Any:
url = f"{self.base_url}/api/v1{path}"
req_headers = {"Content-Type": "application/json", "Accept": "application/json"}
if self.token:
req_headers["Authorization"] = f"token {self.token}"
if headers:
req_headers.update(headers)
body = json.dumps(data).encode() if data else None
req = urllib.request.Request(url, data=body, headers=req_headers, method=method)
try:
with urllib.request.urlopen(req) as resp:
return json.loads(resp.read().decode())
except urllib.error.HTTPError as e:
return {"error": True, "status": e.code, "body": e.read().decode()}
def list_issues(self, repo: str, state: str = "open", limit: int = 50) -> List[Dict]:
return self._request("GET", f"/repos/{repo}/issues?state={state}&limit={limit}") or []
def get_issue(self, repo: str, number: int) -> Dict:
return self._request("GET", f"/repos/{repo}/issues/{number}") or {}
def create_comment(self, repo: str, number: int, body: str) -> Dict:
return self._request(
"POST", f"/repos/{repo}/issues/{number}/comments", {"body": body}
)
def update_issue(self, repo: str, number: int, **fields) -> Dict:
return self._request("PATCH", f"/repos/{repo}/issues/{number}", fields)
def list_prs(self, repo: str, state: str = "open", limit: int = 50) -> List[Dict]:
return self._request("GET", f"/repos/{repo}/pulls?state={state}&limit={limit}") or []
def get_pr(self, repo: str, number: int) -> Dict:
return self._request("GET", f"/repos/{repo}/pulls/{number}") or {}
def create_pr(self, repo: str, title: str, head: str, base: str, body: str = "") -> Dict:
return self._request(
"POST",
f"/repos/{repo}/pulls",
{"title": title, "head": head, "base": base, "body": body},
)
def _fmt_json(obj: Any) -> str:
return json.dumps(obj, indent=2, ensure_ascii=False)
def main(argv: List[str] = None) -> int:
argv = argv or sys.argv[1:]
parser = argparse.ArgumentParser(description="Gitea CLI for wizard fleet")
parser.add_argument("--repo", default="Timmy_Foundation/hermes-agent", help="Repository full name")
parser.add_argument("--token", default=DEFAULT_TOKEN, help="Gitea API token")
parser.add_argument("--base-url", default=DEFAULT_BASE_URL, help="Gitea base URL")
sub = parser.add_subparsers(dest="cmd")
p_issues = sub.add_parser("issues", help="List issues")
p_issues.add_argument("--state", default="open")
p_issues.add_argument("--limit", type=int, default=50)
p_issue = sub.add_parser("issue", help="Get single issue")
p_issue.add_argument("--number", type=int, required=True)
p_prs = sub.add_parser("prs", help="List PRs")
p_prs.add_argument("--state", default="open")
p_prs.add_argument("--limit", type=int, default=50)
p_pr = sub.add_parser("pr", help="Get single PR")
p_pr.add_argument("--number", type=int, required=True)
p_comment = sub.add_parser("create-comment", help="Post comment on issue/PR")
p_comment.add_argument("--number", type=int, required=True)
p_comment.add_argument("--body", required=True)
p_update = sub.add_parser("update-issue", help="Update issue fields")
p_update.add_argument("--number", type=int, required=True)
p_update.add_argument("--title", default=None)
p_update.add_argument("--body", default=None)
p_update.add_argument("--state", default=None)
p_create_pr = sub.add_parser("create-pr", help="Create a PR")
p_create_pr.add_argument("--title", required=True)
p_create_pr.add_argument("--head", required=True)
p_create_pr.add_argument("--base", default="main")
p_create_pr.add_argument("--body", default="")
args = parser.parse_args(argv)
client = GiteaClient(base_url=args.base_url, token=args.token)
if args.cmd == "issues":
print(_fmt_json(client.list_issues(args.repo, args.state, args.limit)))
elif args.cmd == "issue":
print(_fmt_json(client.get_issue(args.repo, args.number)))
elif args.cmd == "prs":
print(_fmt_json(client.list_prs(args.repo, args.state, args.limit)))
elif args.cmd == "pr":
print(_fmt_json(client.get_pr(args.repo, args.number)))
elif args.cmd == "create-comment":
print(_fmt_json(client.create_comment(args.repo, args.number, args.body)))
elif args.cmd == "update-issue":
fields = {k: v for k, v in {"title": args.title, "body": args.body, "state": args.state}.items() if v is not None}
print(_fmt_json(client.update_issue(args.repo, args.number, **fields)))
elif args.cmd == "create-pr":
print(_fmt_json(client.create_pr(args.repo, args.title, args.head, args.base, args.body)))
else:
parser.print_help()
return 1
return 0
if __name__ == "__main__":
sys.exit(main())

134
devkit/health.py Normal file
View File

@@ -0,0 +1,134 @@
#!/usr/bin/env python3
"""
Fleet health monitor for wizard agents.
Checks local system state and reports structured health metrics.
Usage as CLI:
python -m devkit.health
python -m devkit.health --threshold-load 1.0 --check-disk
Usage as module:
from devkit.health import check_health
report = check_health()
"""
import argparse
import json
import os
import shutil
import subprocess
import sys
import time
from typing import Any, Dict, List
def _run(cmd: List[str]) -> str:
try:
return subprocess.check_output(cmd, stderr=subprocess.DEVNULL).decode().strip()
except Exception as e:
return f"error: {e}"
def check_health(threshold_load: float = 1.0, threshold_disk_percent: float = 90.0) -> Dict[str, Any]:
gather_time = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
# Load average
load_raw = _run(["cat", "/proc/loadavg"])
load_values = []
avg_load = None
if load_raw.startswith("error:"):
load_status = load_raw
else:
try:
load_values = [float(x) for x in load_raw.split()[:3]]
avg_load = sum(load_values) / len(load_values)
load_status = "critical" if avg_load > threshold_load else "ok"
except Exception as e:
load_status = f"error parsing load: {e}"
# Disk usage
disk = shutil.disk_usage("/")
disk_percent = (disk.used / disk.total) * 100 if disk.total else 0.0
disk_status = "critical" if disk_percent > threshold_disk_percent else "ok"
# Memory
meminfo = _run(["cat", "/proc/meminfo"])
mem_stats = {}
for line in meminfo.splitlines():
if ":" in line:
key, val = line.split(":", 1)
mem_stats[key.strip()] = val.strip()
# Running processes
hermes_pids = []
try:
ps_out = subprocess.check_output(["pgrep", "-a", "-f", "hermes"]).decode().strip()
hermes_pids = [line.split(None, 1) for line in ps_out.splitlines() if line.strip()]
except subprocess.CalledProcessError:
hermes_pids = []
# Python package versions (key ones)
key_packages = ["jupyterlab", "papermill", "requests"]
pkg_versions = {}
for pkg in key_packages:
try:
out = subprocess.check_output([sys.executable, "-m", "pip", "show", pkg], stderr=subprocess.DEVNULL).decode()
for line in out.splitlines():
if line.startswith("Version:"):
pkg_versions[pkg] = line.split(":", 1)[1].strip()
break
except Exception:
pkg_versions[pkg] = None
overall = "ok"
if load_status == "critical" or disk_status == "critical":
overall = "critical"
elif not hermes_pids:
overall = "warning"
return {
"timestamp": gather_time,
"overall": overall,
"load": {
"raw": load_raw if not load_raw.startswith("error:") else None,
"1min": load_values[0] if len(load_values) > 0 else None,
"5min": load_values[1] if len(load_values) > 1 else None,
"15min": load_values[2] if len(load_values) > 2 else None,
"avg": round(avg_load, 3) if avg_load is not None else None,
"threshold": threshold_load,
"status": load_status,
},
"disk": {
"total_gb": round(disk.total / (1024 ** 3), 2),
"used_gb": round(disk.used / (1024 ** 3), 2),
"free_gb": round(disk.free / (1024 ** 3), 2),
"used_percent": round(disk_percent, 2),
"threshold_percent": threshold_disk_percent,
"status": disk_status,
},
"memory": mem_stats,
"processes": {
"hermes_count": len(hermes_pids),
"hermes_pids": hermes_pids[:10],
},
"packages": pkg_versions,
}
def main(argv: List[str] = None) -> int:
argv = argv or sys.argv[1:]
parser = argparse.ArgumentParser(description="Fleet health monitor")
parser.add_argument("--threshold-load", type=float, default=1.0)
parser.add_argument("--threshold-disk", type=float, default=90.0)
parser.add_argument("--fail-on-critical", action="store_true", help="Exit non-zero if overall is critical")
args = parser.parse_args(argv)
report = check_health(args.threshold_load, args.threshold_disk)
print(json.dumps(report, indent=2))
if args.fail_on_critical and report.get("overall") == "critical":
return 1
return 0
if __name__ == "__main__":
sys.exit(main())

136
devkit/notebook_runner.py Normal file
View File

@@ -0,0 +1,136 @@
#!/usr/bin/env python3
"""
Notebook execution runner for agent tasks.
Wraps papermill with sensible defaults and structured JSON reporting.
Usage as CLI:
python -m devkit.notebook_runner notebooks/task.ipynb output.ipynb -p threshold 1.0
python -m devkit.notebook_runner notebooks/task.ipynb --dry-run
Usage as module:
from devkit.notebook_runner import run_notebook
result = run_notebook("task.ipynb", "output.ipynb", parameters={"threshold": 1.0})
"""
import argparse
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional
def run_notebook(
input_path: str,
output_path: Optional[str] = None,
parameters: Optional[Dict[str, Any]] = None,
kernel: str = "python3",
timeout: Optional[int] = None,
dry_run: bool = False,
) -> Dict[str, Any]:
input_path = str(Path(input_path).expanduser().resolve())
if output_path is None:
fd, output_path = tempfile.mkstemp(suffix=".ipynb")
os.close(fd)
else:
output_path = str(Path(output_path).expanduser().resolve())
if dry_run:
return {
"status": "dry_run",
"input": input_path,
"output": output_path,
"parameters": parameters or {},
"kernel": kernel,
}
cmd = ["papermill", input_path, output_path, "--kernel", kernel]
if timeout is not None:
cmd.extend(["--execution-timeout", str(timeout)])
for key, value in (parameters or {}).items():
cmd.extend(["-p", key, str(value)])
start = os.times()
try:
proc = subprocess.run(
cmd,
capture_output=True,
text=True,
check=True,
)
end = os.times()
return {
"status": "ok",
"input": input_path,
"output": output_path,
"parameters": parameters or {},
"kernel": kernel,
"elapsed_seconds": round((end.elapsed - start.elapsed), 2),
"stdout": proc.stdout[-2000:] if proc.stdout else "",
}
except subprocess.CalledProcessError as e:
end = os.times()
return {
"status": "error",
"input": input_path,
"output": output_path,
"parameters": parameters or {},
"kernel": kernel,
"elapsed_seconds": round((end.elapsed - start.elapsed), 2),
"stdout": e.stdout[-2000:] if e.stdout else "",
"stderr": e.stderr[-2000:] if e.stderr else "",
"returncode": e.returncode,
}
except FileNotFoundError:
return {
"status": "error",
"message": "papermill not found. Install with: uv tool install papermill",
}
def main(argv: List[str] = None) -> int:
argv = argv or sys.argv[1:]
parser = argparse.ArgumentParser(description="Notebook runner for agents")
parser.add_argument("input", help="Input notebook path")
parser.add_argument("output", nargs="?", default=None, help="Output notebook path")
parser.add_argument("-p", "--parameter", action="append", default=[], help="Parameters as key=value")
parser.add_argument("--kernel", default="python3")
parser.add_argument("--timeout", type=int, default=None)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args(argv)
parameters = {}
for raw in args.parameter:
if "=" not in raw:
print(f"Invalid parameter (expected key=value): {raw}", file=sys.stderr)
return 1
k, v = raw.split("=", 1)
# Best-effort type inference
if v.lower() in ("true", "false"):
v = v.lower() == "true"
else:
try:
v = int(v)
except ValueError:
try:
v = float(v)
except ValueError:
pass
parameters[k] = v
result = run_notebook(
args.input,
args.output,
parameters=parameters,
kernel=args.kernel,
timeout=args.timeout,
dry_run=args.dry_run,
)
print(json.dumps(result, indent=2))
return 0 if result.get("status") == "ok" else 1
if __name__ == "__main__":
sys.exit(main())

108
devkit/secret_scan.py Normal file
View File

@@ -0,0 +1,108 @@
#!/usr/bin/env python3
"""
Fast secret leak scanner for the repository.
Checks for common patterns that should never be committed.
Usage as CLI:
python -m devkit.secret_scan
python -m devkit.secret_scan --path /some/repo --fail-on-find
Usage as module:
from devkit.secret_scan import scan
findings = scan("/path/to/repo")
"""
import argparse
import json
import os
import re
import sys
from pathlib import Path
from typing import Any, Dict, List
# Patterns to flag
PATTERNS = {
"aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
"aws_secret_key": re.compile(r"['\"\s][0-9a-zA-Z/+]{40}['\"\s]"),
"generic_api_key": re.compile(r"api[_-]?key\s*[:=]\s*['\"][a-zA-Z0-9_\-]{20,}['\"]", re.IGNORECASE),
"private_key": re.compile(r"-----BEGIN (RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"),
"github_token": re.compile(r"gh[pousr]_[A-Za-z0-9_]{36,}"),
"gitea_token": re.compile(r"[0-9a-f]{40}"), # heuristic for long hex strings after "token"
"telegram_bot_token": re.compile(r"[0-9]{9,}:[A-Za-z0-9_-]{35,}"),
}
# Files and paths to skip
SKIP_PATHS = [
".git",
"__pycache__",
".pytest_cache",
"node_modules",
"venv",
".env",
".agent-skills",
]
# Max file size to scan (bytes)
MAX_FILE_SIZE = 1024 * 1024
def _should_skip(path: Path) -> bool:
for skip in SKIP_PATHS:
if skip in path.parts:
return True
return False
def scan(root: str = ".") -> List[Dict[str, Any]]:
root_path = Path(root).resolve()
findings = []
for file_path in root_path.rglob("*"):
if not file_path.is_file():
continue
if _should_skip(file_path):
continue
if file_path.stat().st_size > MAX_FILE_SIZE:
continue
try:
text = file_path.read_text(encoding="utf-8", errors="ignore")
except Exception:
continue
for pattern_name, pattern in PATTERNS.items():
for match in pattern.finditer(text):
# Simple context: line around match
start = max(0, match.start() - 40)
end = min(len(text), match.end() + 40)
context = text[start:end].replace("\n", " ")
findings.append({
"file": str(file_path.relative_to(root_path)),
"pattern": pattern_name,
"line": text[:match.start()].count("\n") + 1,
"context": context,
})
return findings
def main(argv: List[str] = None) -> int:
argv = argv or sys.argv[1:]
parser = argparse.ArgumentParser(description="Secret leak scanner")
parser.add_argument("--path", default=".", help="Repository root to scan")
parser.add_argument("--fail-on-find", action="store_true", help="Exit non-zero if secrets found")
parser.add_argument("--json", action="store_true", help="Output as JSON")
args = parser.parse_args(argv)
findings = scan(args.path)
if args.json:
print(json.dumps({"findings": findings, "count": len(findings)}, indent=2))
else:
print(f"Scanned {args.path}")
print(f"Findings: {len(findings)}")
for f in findings:
print(f" [{f['pattern']}] {f['file']}:{f['line']} -> ...{f['context']}...")
if args.fail_on_find and findings:
return 1
return 0
if __name__ == "__main__":
sys.exit(main())

108
devkit/smoke_test.py Normal file
View File

@@ -0,0 +1,108 @@
#!/usr/bin/env python3
"""
Shared smoke test runner for hermes-agent.
Fast checks that catch obvious breakage without maintenance burden.
Usage as CLI:
python -m devkit.smoke_test
python -m devkit.smoke_test --verbose
Usage as module:
from devkit.smoke_test import run_smoke_tests
results = run_smoke_tests()
"""
import argparse
import importlib
import json
import subprocess
import sys
from pathlib import Path
from typing import Any, Dict, List
HERMES_ROOT = Path(__file__).resolve().parent.parent
def _test_imports() -> Dict[str, Any]:
modules = [
"hermes_constants",
"hermes_state",
"cli",
"tools.skills_sync",
"tools.skills_hub",
]
errors = []
for mod in modules:
try:
importlib.import_module(mod)
except Exception as e:
errors.append({"module": mod, "error": str(e)})
return {
"name": "core_imports",
"status": "ok" if not errors else "fail",
"errors": errors,
}
def _test_cli_entrypoints() -> Dict[str, Any]:
entrypoints = [
[sys.executable, "-m", "cli", "--help"],
]
errors = []
for cmd in entrypoints:
try:
subprocess.run(cmd, capture_output=True, text=True, check=True, cwd=HERMES_ROOT)
except subprocess.CalledProcessError as e:
errors.append({"cmd": cmd, "error": f"exit {e.returncode}"})
except Exception as e:
errors.append({"cmd": cmd, "error": str(e)})
return {
"name": "cli_entrypoints",
"status": "ok" if not errors else "fail",
"errors": errors,
}
def _test_green_path_e2e() -> Dict[str, Any]:
"""One bare green-path E2E: terminal_tool echo hello."""
try:
from tools.terminal_tool import terminal
result = terminal(command="echo hello")
output = result.get("output", "")
if "hello" in output.lower():
return {"name": "green_path_e2e", "status": "ok", "output": output.strip()}
return {"name": "green_path_e2e", "status": "fail", "error": f"Unexpected output: {output}"}
except Exception as e:
return {"name": "green_path_e2e", "status": "fail", "error": str(e)}
def run_smoke_tests(verbose: bool = False) -> Dict[str, Any]:
tests = [
_test_imports(),
_test_cli_entrypoints(),
_test_green_path_e2e(),
]
failed = [t for t in tests if t["status"] != "ok"]
result = {
"overall": "ok" if not failed else "fail",
"tests": tests,
"failed_count": len(failed),
}
if verbose:
print(json.dumps(result, indent=2))
return result
def main(argv: List[str] = None) -> int:
argv = argv or sys.argv[1:]
parser = argparse.ArgumentParser(description="Smoke test runner")
parser.add_argument("--verbose", action="store_true")
args = parser.parse_args(argv)
result = run_smoke_tests(verbose=True)
return 0 if result["overall"] == "ok" else 1
if __name__ == "__main__":
sys.exit(main())

112
devkit/wizard_env.py Normal file
View File

@@ -0,0 +1,112 @@
#!/usr/bin/env python3
"""
Wizard environment validator.
Checks that a new wizard environment is ready for duty.
Usage as CLI:
python -m devkit.wizard_env
python -m devkit.wizard_env --fix
Usage as module:
from devkit.wizard_env import validate
report = validate()
"""
import argparse
import json
import os
import shutil
import subprocess
import sys
from typing import Any, Dict, List
def _has_cmd(name: str) -> bool:
return shutil.which(name) is not None
def _check_env_var(name: str) -> Dict[str, Any]:
value = os.getenv(name)
return {
"name": name,
"status": "ok" if value else "missing",
"value": value[:10] + "..." if value and len(value) > 20 else value,
}
def _check_python_pkg(name: str) -> Dict[str, Any]:
try:
__import__(name)
return {"name": name, "status": "ok"}
except ImportError:
return {"name": name, "status": "missing"}
def validate() -> Dict[str, Any]:
checks = {
"binaries": [
{"name": "python3", "status": "ok" if _has_cmd("python3") else "missing"},
{"name": "git", "status": "ok" if _has_cmd("git") else "missing"},
{"name": "curl", "status": "ok" if _has_cmd("curl") else "missing"},
{"name": "jupyter-lab", "status": "ok" if _has_cmd("jupyter-lab") else "missing"},
{"name": "papermill", "status": "ok" if _has_cmd("papermill") else "missing"},
{"name": "jupytext", "status": "ok" if _has_cmd("jupytext") else "missing"},
],
"env_vars": [
_check_env_var("GITEA_URL"),
_check_env_var("GITEA_TOKEN"),
_check_env_var("TELEGRAM_BOT_TOKEN"),
],
"python_packages": [
_check_python_pkg("requests"),
_check_python_pkg("jupyter_server"),
_check_python_pkg("nbformat"),
],
}
all_ok = all(
c["status"] == "ok"
for group in checks.values()
for c in group
)
# Hermes-specific checks
hermes_home = os.path.expanduser("~/.hermes")
checks["hermes"] = [
{"name": "config.yaml", "status": "ok" if os.path.exists(f"{hermes_home}/config.yaml") else "missing"},
{"name": "skills_dir", "status": "ok" if os.path.exists(f"{hermes_home}/skills") else "missing"},
]
all_ok = all_ok and all(c["status"] == "ok" for c in checks["hermes"])
return {
"overall": "ok" if all_ok else "incomplete",
"checks": checks,
}
def main(argv: List[str] = None) -> int:
argv = argv or sys.argv[1:]
parser = argparse.ArgumentParser(description="Wizard environment validator")
parser.add_argument("--json", action="store_true")
parser.add_argument("--fail-on-incomplete", action="store_true")
args = parser.parse_args(argv)
report = validate()
if args.json:
print(json.dumps(report, indent=2))
else:
print(f"Wizard Environment: {report['overall']}")
for group, items in report["checks"].items():
print(f"\n[{group}]")
for item in items:
status_icon = "" if item["status"] == "ok" else ""
print(f" {status_icon} {item['name']}: {item['status']}")
if args.fail_on_incomplete and report["overall"] != "ok":
return 1
return 0
if __name__ == "__main__":
sys.exit(main())

57
docs/NOTEBOOK_WORKFLOW.md Normal file
View File

@@ -0,0 +1,57 @@
# Notebook Workflow for Agent Tasks
This directory demonstrates a sovereign, version-controlled workflow for LLM agent tasks using Jupyter notebooks.
## Philosophy
- **`.py` files are the source of truth`** — authored and reviewed as plain Python with `# %%` cell markers (via Jupytext)
- **`.ipynb` files are generated artifacts** — auto-created from `.py` for execution and rich viewing
- **Papermill parameterizes and executes** — each run produces an output notebook with code, narrative, and results preserved
- **Output notebooks are audit artifacts** — every execution leaves a permanent, replayable record
## File Layout
```
notebooks/
agent_task_system_health.py # Source of truth (Jupytext)
agent_task_system_health.ipynb # Generated from .py
docs/
NOTEBOOK_WORKFLOW.md # This document
.gitea/workflows/
notebook-ci.yml # CI gate: executes notebooks on PR/push
```
## How Agents Work With Notebooks
1. **Create** — Agent generates a `.py` notebook using `# %% [markdown]` and `# %%` code blocks
2. **Review** — PR reviewers see clean diffs in Gitea (no JSON noise)
3. **Generate**`jupytext --to ipynb` produces the `.ipynb` before merge
4. **Execute** — Papermill runs the notebook with injected parameters
5. **Archive** — Output notebook is committed to a `reports/` branch or artifact store
## Converting Between Formats
```bash
# .py -> .ipynb
jupytext --to ipynb notebooks/agent_task_system_health.py
# .ipynb -> .py
jupytext --to py notebooks/agent_task_system_health.ipynb
# Execute with parameters
papermill notebooks/agent_task_system_health.ipynb output.ipynb \
-p threshold 1.0 -p hostname forge-vps-01
```
## CI Gate
The `notebook-ci.yml` workflow executes all notebooks in `notebooks/` on every PR and push, ensuring that checked-in notebooks still run and produce outputs.
## Why This Matters
| Problem | Notebook Solution |
|---|---|
| Ephemeral agent reasoning | Markdown cells narrate the thought process |
| Stateless single-turn tools | Stateful cells persist variables across steps |
| Unreviewable binary artifacts | `.py` source is diffable and PR-friendly |
| No execution audit trail | Output notebook preserves code + outputs + metadata |

View File

@@ -0,0 +1,132 @@
# Fleet SITREP — April 6, 2026
**Classification:** Consolidated Status Report
**Compiled by:** Ezra
**Acknowledged by:** Claude (Issue #143)
---
## Executive Summary
Allegro executed 7 tasks across infrastructure, contracting, audits, and security. Ezra shipped PR #131, filed formalization audit #132, delivered quarterly report #133, and self-assigned issues #134#138. All wizard activity mapped below.
---
## 1. Allegro 7-Task Report
| Task | Description | Status |
|------|-------------|--------|
| 1 | Roll Call / Infrastructure Map | ✅ Complete |
| 2 | Dark industrial anthem (140 BPM, Suno-ready) | ✅ Complete |
| 3 | Operation Get A Job — 7-file contracting playbook pushed to `the-nexus` | ✅ Complete |
| 4 | Formalization audit filed ([the-nexus #893](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/893)) | ✅ Complete |
| 5 | GrepTard Memory Report — PR #525 on `timmy-home` | ✅ Complete |
| 6 | Self-audit issues #894#899 filed on `the-nexus` | ✅ Filed |
| 7 | `keystore.json` permissions fixed to `600` | ✅ Applied |
### Critical Findings from Task 4 (Formalization Audit)
- GOFAI source files missing — only `.pyc` remains
- Nostr keystore was world-readable — **FIXED** (Task 7)
- 39 burn scripts cluttering `/root` — archival pending ([#898](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/898))
---
## 2. Ezra Deliverables
| Deliverable | Issue/PR | Status |
|-------------|----------|--------|
| V-011 fix + compressor tuning | [PR #131](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/131) | ✅ Merged |
| Formalization audit (hermes-agent) | [Issue #132](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/132) | Filed |
| Quarterly report (MD + PDF) | [Issue #133](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/133) | Filed |
| Burn-mode concurrent tool tests | [Issue #134](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/134) | Assigned → Ezra |
| MCP SDK migration | [Issue #135](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/135) | Assigned → Ezra |
| APScheduler migration | [Issue #136](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/136) | Assigned → Ezra |
| Pydantic-settings migration | [Issue #137](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/137) | Assigned → Ezra |
| Contracting playbook tracker | [Issue #138](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/138) | Assigned → Ezra |
---
## 3. Fleet Status
| Wizard | Host | Status | Blocker |
|--------|------|--------|---------|
| **Ezra** | Hermes VPS | Active — 5 issues queued | None |
| **Bezalel** | Hermes VPS | Gateway running on 8645 | None |
| **Allegro-Primus** | Hermes VPS | **Gateway DOWN on 8644** | Needs restart signal |
| **Bilbo** | External | Gemma 4B active, Telegram dual-mode | Host IP unknown to fleet |
### Allegro Gateway Recovery
Allegro-Primus gateway (port 8644) is down. Options:
1. **Alexander restarts manually** on Hermes VPS
2. **Delegate to Bezalel** — Bezalel can issue restart signal via Hermes VPS access
3. **Delegate to Ezra** — Ezra can coordinate restart as part of issue #894 work
---
## 4. Operation Get A Job — Contracting Playbook
Files pushed to `the-nexus/operation-get-a-job/`:
| File | Purpose |
|------|---------|
| `README.md` | Master plan |
| `entity-setup.md` | Wyoming LLC, Mercury, E&O insurance |
| `service-offerings.md` | Rates $150600/hr; packages $5k/$15k/$40k+ |
| `portfolio.md` | Portfolio structure |
| `outreach-templates.md` | Cold email templates |
| `proposal-template.md` | Client proposal structure |
| `rate-card.md` | Rate card |
**Human-only mile (Alexander's action items):**
1. Pick LLC name from `entity-setup.md`
2. File Wyoming LLC via Northwest Registered Agent ($225)
3. Get EIN from IRS (free, ~10 min)
4. Open Mercury account (requires EIN + LLC docs)
5. Secure E&O insurance (~$150250/month)
6. Restart Allegro-Primus gateway (port 8644)
7. Update LinkedIn using profile template
8. Send 5 cold emails using outreach templates
---
## 5. Pending Self-Audit Issues (the-nexus)
| Issue | Title | Priority |
|-------|-------|----------|
| [#894](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/894) | Deploy burn-mode cron jobs | CRITICAL |
| [#895](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/895) | Telegram thread-based reporting | Normal |
| [#896](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/896) | Retry logic and error recovery | Normal |
| [#897](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/897) | Automate morning reports at 0600 | Normal |
| [#898](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/898) | Archive 39 burn scripts | Normal |
| [#899](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/899) | Keystore permissions | ✅ Done |
---
## 6. Revenue Timeline
| Milestone | Target | Unlocks |
|-----------|--------|---------|
| LLC + Bank + E&O | Day 5 | Ability to invoice clients |
| First 5 emails sent | Day 7 | Pipeline generation |
| First scoping call | Day 14 | Qualified lead |
| First proposal accepted | Day 21 | **$4,500$12,000 revenue** |
| Monthly retainer signed | Day 45 | **$6,000/mo recurring** |
---
## 7. Delegation Matrix
| Owner | Owns |
|-------|------|
| **Alexander** | LLC filing, EIN, Mercury, E&O, LinkedIn, cold emails, gateway restart |
| **Ezra** | Issues #134#138 (tests, migrations, tracker) |
| **Allegro** | Issues #894, #898 (cron deployment, burn script archival) |
| **Bezalel** | Review formalization audit for Anthropic-specific gaps |
---
*SITREP acknowledged by Claude — April 6, 2026*
*Source issue: [hermes-agent #143](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/143)*

View File

@@ -0,0 +1,166 @@
# Research Acknowledgment: SSD — Simple Self-Distillation Improves Code Generation
**Issue:** #128
**Paper:** [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193)
**Authors:** Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang (Apple)
**Date:** April 1, 2026
**Code:** https://github.com/apple/ml-ssd
**Acknowledged by:** Claude — April 6, 2026
---
## Assessment: High Relevance to Fleet
This paper is directly applicable to the hermes-agent fleet. The headline result — +7.5pp pass@1 on Qwen3-4B — is at exactly the scale we operate. The method requires no external infrastructure. Triage verdict: **P0 / Week-class work**.
---
## What SSD Actually Does
Three steps, nothing exotic:
1. **Sample**: For each coding prompt, generate one solution at temperature `T_train` (~0.9). Do NOT filter for correctness.
2. **Fine-tune**: SFT on the resulting `(prompt, unverified_solution)` pairs. Standard cross-entropy loss. No RLHF, no GRPO, no DPO.
3. **Evaluate**: At `T_eval` (which must be **different** from `T_train`). This asymmetry is not optional — using the same temperature for both loses 3050% of the gains.
The counterintuitive part: N=1 per problem, unverified. Prior self-improvement work uses N>>1 and filters by execution. SSD doesn't. The paper argues this is *why* it works — you're sharpening the model's own distribution, not fitting to a correctness filter's selection bias.
---
## The Fork/Lock Theory
The paper's core theoretical contribution explains *why* temperature asymmetry matters.
**Locks** — positions requiring syntactic precision: colons, parentheses, import paths, variable names. A mistake here is a hard error. Low temperature helps at Locks. But applying low temperature globally kills diversity everywhere.
**Forks** — algorithmic choice points where multiple valid continuations exist: picking a sort algorithm, choosing a data structure, deciding on a loop structure. High temperature helps at Forks. But applying high temperature globally introduces errors at Locks.
SSD's fine-tuning reshapes token distributions **context-dependently**:
- At Locks: narrows the distribution, suppressing distractor tokens
- At Forks: widens the distribution, preserving valid algorithmic paths
A single global temperature cannot do this. SFT on self-generated data can, because the model learns from examples that implicitly encode which positions are Locks and which are Forks in each problem context.
**Fleet implication**: Our agents are currently using a single temperature for everything. This is leaving performance on the table even without fine-tuning. The immediate zero-cost action is temperature auditing (see Phase 1 below).
---
## Results That Matter to Us
| Model | Before | After | Delta |
|-------|--------|-------|-------|
| Qwen3-30B-Instruct | 42.4% | 55.3% | +12.9pp (+30% rel) |
| Qwen3-4B-Instruct | baseline | baseline+7.5pp | +7.5pp |
| Llama-3.1-8B-Instruct | baseline | baseline+3.5pp | +3.5pp |
Gains concentrate on hard problems: +14.2pp medium, +15.3pp hard. This is the distribution our agents face on real Gitea issues — not easy textbook problems.
---
## Fleet Implementation Plan
### Phase 1: Temperature Audit (Zero cost, this week)
Current state: fleet agents use default or eyeballed temperature settings. The paper shows T_eval != T_train is critical even without fine-tuning.
Actions:
1. Document current temperature settings in `hermes/`, `skills/`, and any Ollama config files
2. Establish a held-out test set of 20+ solved Gitea issues with known-correct outputs
3. Run A/B: current T_eval vs. T_eval=0.7 vs. T_eval=0.3 for code generation tasks
4. Record pass rates per condition; file findings as a follow-up issue
Expected outcome: measurable improvement with no model changes, no infrastructure, no cost.
### Phase 2: SSD Pipeline (12 weeks, single Mac)
Replicate the paper's method on Qwen3-4B via Ollama + axolotl or unsloth:
```
1. Dataset construction:
- Extract 100500 coding prompts from Gitea issue backlog
- Focus on issues that have accepted PRs (ground truth available for evaluation only, not training)
- Format: (system_prompt + issue_description) → model generates solution at T_train=0.9
2. Fine-tuning:
- Use LoRA (not full fine-tune) to stay local-first
- Standard SFT: cross-entropy on (prompt, self-generated_solution) pairs
- Recommended: unsloth for memory efficiency on Mac hardware
- Training budget: 13 epochs, small batch size
3. Evaluation:
- Compare base model vs. SSD-tuned model at T_eval=0.7
- Metric: pass@1 on held-out issues not in training set
- Also test on general coding benchmarks to check for capability regression
```
Infrastructure assessment:
- **RAM**: Qwen3-4B quantized (Q4_K_M) needs ~3.5GB VRAM for inference; LoRA fine-tuning needs ~812GB unified memory (Mac M-series feasible)
- **Storage**: Self-generated dataset is small; LoRA adapter is ~100500MB
- **Time**: 500 examples × 3 epochs ≈ 24 hours on M2/M3 Max
- **Dependencies**: Ollama (inference), unsloth or axolotl (fine-tuning), datasets (HuggingFace), trl
No cloud required. No teacher model required. No code execution environment required.
### Phase 3: Continuous Self-Improvement Loop (12 months)
Wire SSD into the fleet's burn mode:
```
Nightly cron:
1. Collect agent solutions from the day's completed issues
2. Filter: only solutions where the PR was merged (human-verified correct)
3. Append to rolling training buffer (last 500 examples)
4. Run SFT fine-tune on buffer → update LoRA adapter
5. Swap adapter into Ollama deployment at dawn
6. Agents start next day with yesterday's lessons baked in
```
This integrates naturally with RetainDB (#112) — the persistent memory system would track which solutions were merged, providing the feedback signal. The continuous loop turns every merged PR into a training example.
### Phase 4: Sovereignty Confirmation
The paper validates that external data is not required for improvement. Our fleet can:
- Fine-tune exclusively on its own conversation data
- Stay fully local (no API calls, no external datasets)
- Accumulate improvements over time without model subscriptions
This is the sovereign fine-tuning capability the fleet needs to remain independent as external model APIs change pricing or capabilities.
---
## Risks and Mitigations
| Risk | Assessment | Mitigation |
|------|------------|------------|
| SSD gains don't transfer from LiveCodeBench to Gitea issues | Medium — our domain is software engineering, not competitive programming | Test on actual Gitea issues from the backlog; don't assume benchmark numbers transfer |
| Fine-tuning degrades non-code capabilities | Low-Medium | LoRA instead of full fine-tune; test on general tasks after SFT; retain base model checkpoint |
| Small training set (<200 examples) insufficient | Medium | Paper shows gains at modest scale; supplement with open code datasets (Stack, TheVault) if needed |
| Qwen3 GGUF format incompatible with unsloth fine-tuning | Low | unsloth supports Qwen3; verify exact GGUF variant compatibility before starting |
| Temperature asymmetry effect smaller on instruction-tuned variants | Low | Paper explicitly tests instruct variants and shows gains; Qwen3-4B-Instruct is in the paper's results |
---
## Acceptance Criteria Status
From the issue:
- [ ] **Temperature audit** — Document current T/top_p settings across fleet agents, compare with paper recommendations
- [ ] **T_eval benchmark** — A/B test on 20+ solved Gitea issues; measure correctness
- [ ] **SSD reproduction** — Replicate pipeline on Qwen4B with 100 prompts; measure pass@1 change
- [ ] **Infrastructure assessment** — Documented above (Phase 2 section); GPU/RAM/storage requirements are Mac-feasible
- [ ] **Continuous loop design** — Architecture drafted above (Phase 3 section); integrates with RetainDB (#112)
Infrastructure assessment and continuous loop design are addressed in this document. Temperature audit and SSD reproduction require follow-up issues with execution.
---
## Recommended Follow-Up Issues
1. **Temperature Audit** — Audit all fleet agent temperature configs; run A/B on T_eval variants; file results (Phase 1)
2. **SSD Pipeline Spike** — Build and run the 3-stage SSD pipeline on Qwen3-4B; report pass@1 delta (Phase 2)
3. **Nightly SFT Integration** — Wire SSD into burn-mode cron; integrate with RetainDB feedback loop (Phase 3)
---
*Research acknowledged by Claude — April 6, 2026*
*Source issue: [hermes-agent #128](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/128)*

View File

@@ -0,0 +1,57 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parameterized Agent Task: System Health Check\n",
"\n",
"This notebook demonstrates how an LLM agent can generate a task notebook,\n",
"a scheduler can parameterize and execute it via papermill,\n",
"and the output becomes a persistent audit artifact."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {"tags": ["parameters"]},
"outputs": [],
"source": [
"# Default parameters — papermill will inject overrides here\n",
"threshold = 1.0\n",
"hostname = \"localhost\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json, subprocess, datetime\n",
"gather_time = datetime.datetime.now().isoformat()\n",
"load_avg = subprocess.check_output([\"cat\", \"/proc/loadavg\"]).decode().strip()\n",
"load_values = [float(x) for x in load_avg.split()[:3]]\n",
"avg_load = sum(load_values) / len(load_values)\n",
"intervention_needed = avg_load > threshold\n",
"report = {\n",
" \"hostname\": hostname,\n",
" \"threshold\": threshold,\n",
" \"avg_load\": round(avg_load, 3),\n",
" \"intervention_needed\": intervention_needed,\n",
" \"gathered_at\": gather_time\n",
"}\n",
"print(json.dumps(report, indent=2))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,41 @@
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.19.1
# kernelspec:
# display_name: Python 3
# language: python
# name: python3
# ---
# %% [markdown]
# # Parameterized Agent Task: System Health Check
#
# This notebook demonstrates how an LLM agent can generate a task notebook,
# a scheduler can parameterize and execute it via papermill,
# and the output becomes a persistent audit artifact.
# %% tags=["parameters"]
# Default parameters — papermill will inject overrides here
threshold = 1.0
hostname = "localhost"
# %%
import json, subprocess, datetime
gather_time = datetime.datetime.now().isoformat()
load_avg = subprocess.check_output(["cat", "/proc/loadavg"]).decode().strip()
load_values = [float(x) for x in load_avg.split()[:3]]
avg_load = sum(load_values) / len(load_values)
intervention_needed = avg_load > threshold
report = {
"hostname": hostname,
"threshold": threshold,
"avg_load": round(avg_load, 3),
"intervention_needed": intervention_needed,
"gathered_at": gather_time
}
print(json.dumps(report, indent=2))

View File

@@ -0,0 +1,252 @@
# Ezra — Quarterly Technical & Strategic Report
**April 2026**
---
## Executive Summary
This report consolidates the principal technical and strategic outputs from Q1/Q2 2026. Three major workstreams are covered:
1. **Security & Performance Hardening** — Shipped V-011 obfuscation detection and context-compressor tuning.
2. **System Formalization Audit** — Identified ~6,300 lines of homegrown infrastructure that can be replaced by well-maintained open-source projects.
3. **Business Development** — Formalized a pure-contracting go-to-market plan ("Operation Get A Job") to monetize the engineering collective.
---
## 1. Recent Deliverables
### 1.1 V-011 Obfuscation Bypass Detection
A significant security enhancement was shipped to the skills-guard subsystem to defeat obfuscated malicious skill code.
**Technical additions:**
- `normalize_input()` with NFKC normalization, case folding, and zero-width character removal to defeat homoglyph and ZWSP evasion.
- `PythonSecurityAnalyzer` AST visitor detecting `eval`/`exec`/`compile`, `getattr` dunder access, and imports of `base64`/`codecs`/`marshal`/`types`/`ctypes`.
- Additional regex patterns for `getattr` builtins chains, `__import__` os/subprocess, and nested base64 decoding.
- Full integration into `scan_file()`; Python files now receive both normalized regex scanning and AST-based analysis.
**Verification:** All tests passing (`103 passed, 4 warnings`).
**Reference:** Forge PR #131`[EPIC-999/Phase II] The Forge — V-011 obfuscation fix + compressor tuning`
### 1.2 Context Compressor Tuning
The default `protect_last_n` parameter was reduced from `20` to `5`. The previous default was overly conservative, preventing meaningful compression on long sessions. The new default preserves the five most recent conversational turns while allowing the compressor to effectively reduce token pressure.
A regression test was added verifying that the last five turns are never summarized away.
### 1.3 Burn Mode Resilience
The agent loop was enhanced with a configurable `burn_mode` flag that increases concurrent tool execution capacity and adds transient-failure retry logic.
**Changes:**
- `max_tool_workers` increased from `8` to `16` in burn mode.
- Expanded parallel tool coverage to include browser, vision, skill, and session-search tools.
- Added batch timeout protection (300s in burn mode / 180s normal) to prevent hung threads from blocking the agent loop.
- Thread-pool shutdown now uses `executor.shutdown(wait=False)` for immediate control return.
- Transient errors (timeouts, rate limits, 502/503/504) trigger one automatic retry in burn mode.
---
## 2. System Formalization Audit
A comprehensive audit was performed across the `hermes-agent` codebase to identify homegrown modules that could be replaced by mature open-source alternatives. The objective is efficiency: reduce maintenance burden, leverage community expertise, and improve reliability.
### 2.1 Candidate Matrix
| Priority | Component | Lines | Current State | Proposed Replacement | Effort | ROI |
|:--------:|-----------|------:|---------------|----------------------|:------:|:---:|
| **P0** | MCP Client | 2,176 | Custom asyncio transport, sampling, schema translation | `mcp` (official Python SDK) | 2-3 wks | Very High |
| **P0** | Cron Scheduler | ~1,500 | Custom JSON job store, manual tick loop | `APScheduler` | 1-2 wks | Very High |
| **P0** | Config Management | 2,589 | Manual YAML loader, no type safety | `pydantic-settings` + Pydantic v2 | 3-4 wks | High |
| **P1** | Checkpoint Manager | 548 | Shells out to `git` binary | `dulwich` (pure-Python git) | 1 wk | Medium-High |
| **P1** | Auth / Credential Pool | ~3,800 | Custom JWT decode, OAuth refresh, JSON auth store | `authlib` + `keyring` + `PyJWT` | 2-3 wks | Medium |
| **P1** | Batch Runner | 1,285 | Custom `multiprocessing.Pool` wrapper | `joblib` (local) or `celery` (distributed) | 1-2 wks | Medium |
| **P2** | SQLite Session Store | ~2,400 | Raw SQLite + FTS5, manual schema | SQLAlchemy ORM + Alembic | 2-3 wks | Medium |
| **P2** | Trajectory Compressor | 1,518 | Custom tokenizer + summarization pipeline | Keep core logic; add `zstandard` for binary storage | 3 days | Low-Medium |
| **P2** | Process Registry | 889 | Custom background process tracking | Keep (adds too much ops complexity) | — | Low |
| **P2** | Web Tools | 2,080+ | Firecrawl + Parallel wrappers | Keep (Firecrawl is already best-in-class) | — | Low |
### 2.2 P0 Replacements
#### MCP Client → Official `mcp` Python SDK
**Current:** `tools/mcp_tool.py` (2,176 lines) contains custom stdio/HTTP transport lifecycle, manual `anyio` cancel-scope cleanup, hand-rolled schema translation, custom sampling bridge, credential stripping, and reconnection backoff.
**Problem:** The Model Context Protocol is evolving rapidly. Maintaining a custom 2K-line client means every protocol revision requires manual patches. The official SDK already handles transport negotiation, lifecycle management, and type-safe schema generation.
**Migration Plan:**
1. Add `mcp>=1.0.0` to dependencies.
2. Build a thin `HermesMCPBridge` class that instantiates `mcp.ClientSession`, maps MCP `Tool` schemas to Hermes registry calls, forwards tool invocations, and preserves the sampling callback.
3. Deprecate the `_mcp_loop` background thread and `anyio`-based transport code.
4. Add integration tests against a test MCP server.
**Lines Saved:** ~1,600
**Risk:** Medium — sampling and timeout behavior need parity testing.
#### Cron Scheduler → APScheduler
**Current:** `cron/jobs.py` (753 lines) + `cron/scheduler.py` (~740 lines) use a JSON file as the job store, custom `parse_duration` and `compute_next_run` logic, a manual tick loop, and ad-hoc delivery orchestration.
**Problem:** Scheduling is a solved problem. The homegrown system lacks timezone support, job concurrency controls, graceful clustering, and durable execution guarantees.
**Migration Plan:**
1. Introduce `APScheduler` with a `SQLAlchemyJobStore` (or custom JSON store).
2. Refactor each Hermes cron job into an APScheduler `Job` function.
3. Preserve existing delivery logic (`_deliver_result`, `_build_job_prompt`, `_run_job_script`) as the job body.
4. Migrate `jobs.json` entries into APScheduler jobs on first run.
5. Expose `/cron` status via a thin CLI wrapper.
**Lines Saved:** ~700
**Risk:** Low — delivery logic is preserved; only the trigger mechanism changes.
#### Config Management → `pydantic-settings`
**Current:** `hermes_cli/config.py` (2,589 lines) uses manual YAML parsing with hardcoded defaults, a complex migration chain (`_config_version` currently at 11), no runtime type validation, and stringly-typed env var resolution.
**Problem:** Every new config option requires touching multiple places. Migration logic is ~400 lines and growing. Typo'd config values are only caught at runtime, often deep in the agent loop.
**Migration Plan:**
1. Define a `HermesConfig` Pydantic model with nested sections (`ModelConfig`, `ProviderConfig`, `AgentConfig`, `CompressionConfig`, etc.).
2. Use `pydantic-settings`'s `SettingsConfigDict(yaml_file="~/.hermes/config.yaml")` to auto-load.
3. Map env vars via `env_prefix="HERMES_"` or field-level `validation_alias`.
4. Keep the migration layer as a one-time upgrade function, then remove it after two releases.
5. Replace `load_config()` call sites with `HermesConfig()` instantiation.
**Lines Saved:** ~1,500
**Risk:** Medium-High — large blast radius; every module reads config. Requires backward compatibility.
### 2.3 P1 Replacements
**Checkpoint Manager → `dulwich`**
- Replace `subprocess.run(["git", ...])` calls with `dulwich.porcelain` equivalents.
- Use `dulwich.repo.Repo.init_bare()` for shadow repos.
- Snapshotting becomes an in-memory `Index` write + `commit()`.
- **Lines Saved:** ~200
- **Risk:** Low
**Auth / Credential Pool → `authlib` + `keyring` + `PyJWT`**
- Use `authlib` for OAuth2 session and token refresh.
- Replace custom JWT decoding with `PyJWT`.
- Migrate the auth store JSON to `keyring`-backed secure storage where available.
- Keep Hermes-specific credential pool strategies (round-robin, least-used, etc.).
- **Lines Saved:** ~800
- **Risk:** Medium
**Batch Runner → `joblib`**
- For typical local batch sizes, `joblib.Parallel(n_jobs=-1, backend='loky')` replaces the custom worker pool.
- Only migrate to Celery if cross-machine distribution is required.
- **Lines Saved:** ~400
- **Risk:** Low for `joblib`
### 2.4 Execution Roadmap
1. **Week 1-2:** Migrate Checkpoint Manager to `dulwich` (quick win, low risk)
2. **Week 3-4:** Migrate Cron Scheduler to `APScheduler` (high value, well-contained)
3. **Week 5-8:** Migrate MCP Client to official `mcp` SDK (highest complexity, highest payoff)
4. **Week 9-12:** Migrate Config Management to `pydantic-settings` (largest blast radius, do last)
5. **Ongoing:** Evaluate Auth/Credential Pool and Batch Runner replacements as follow-up epics.
### 2.5 Cost-Benefit Summary
| Metric | Value |
|--------|-------|
| Total homebrew lines audited | ~17,000 |
| Lines recommended for replacement | ~6,300 |
| Estimated dev weeks (P0 + P1) | 10-14 weeks |
| New runtime dependencies added | 4-6 well-maintained packages |
| Maintenance burden reduction | Very High |
| Risk level | Medium (mitigated by strong test coverage) |
---
## 3. Strategic Initiative: Operation Get A Job
### 3.1 Thesis
The engineering collective is capable of 10x delivery velocity compared to typical market offerings. The strategic opportunity is to monetize this capability through pure contracting — high-tempo, fixed-scope engagements with no exclusivity or employer-like constraints.
### 3.2 Service Menu
**Tier A — White-Glove Agent Infrastructure ($400-600/hr)**
- Custom AI agent deployment with tool use (Slack, Discord, Telegram, webhooks)
- MCP server development
- Local LLM stack setup (on-premise / VPC)
- Agent security audit and red teaming
**Tier B — Security Hardening & Code Review ($250-400/hr)**
- Security backlog burn-down (CVE-class bugs)
- Skills-guard / sandbox hardening
- Architecture review
**Tier C — Automation & Integration ($150-250/hr)**
- Webhook-to-action pipelines
- Research and intelligence reporting
- Content-to-code workflows
### 3.3 Engagement Packages
| Service | Description | Timeline | Investment |
|---------|-------------|----------|------------|
| Agent Security Audit | Review of one AI agent pipeline + written findings | 2-3 business days | $4,500 |
| MCP Server Build | One custom MCP server with 3-5 tools + docs + tests | 1-2 weeks | $8,000 |
| Custom Bot Deployment | End-to-end bot with up to 5 tools, deployed to client platform | 2-3 weeks | $12,000 |
| Security Sprint | Close top 5 security issues in a Python/JS repo | 1-2 weeks | $6,500 |
| Monthly Retainer — Core | 20 hrs/month prioritized engineering + triage | Ongoing | $6,000/mo |
| Monthly Retainer — Scale | 40 hrs/month prioritized engineering + on-call | Ongoing | $11,000/mo |
### 3.4 Go-to-Market Motion
**Immediate channels:**
- Cold outbound to CTOs/VPEs at Series A-C AI startups
- LinkedIn authority content (architecture reviews, security bulletins)
- Platform presence (Gun.io, Toptal, Upwork for specific niche keywords)
**Lead magnet:** Free 15-minute architecture review. No pitch. One concrete risk identified.
### 3.5 Infrastructure Foundation
The Hermes Agent framework serves as both the delivery platform and the portfolio piece:
- Open-source runtime with ~3,000 tests
- Gateway architecture supporting 8+ messaging platforms
- Native MCP client, cron scheduling, subagent delegation
- Self-hosted Forge (Gitea) with CI and automated PR review
- Local Gemma 4 inference stack on bare metal
### 3.6 90-Day Revenue Model
| Month | Target |
|-------|--------|
| Month 1 | $9-12K (1x retainer or 2x audits) |
| Month 2 | $17K (+ 1x MCP build) |
| Month 3 | $29K (+ 1x bot deployment + new retainer) |
### 3.7 Immediate Action Items
- File Wyoming LLC and obtain EIN
- Open Mercury business bank account
- Secure E&O insurance
- Update LinkedIn profile and publish first authority post
- Customize capabilities deck and begin warm outbound
---
## 4. Fleet Status Summary
| House | Host | Model / Provider | Gateway Status |
|-------|------|------------------|----------------|
| Ezra | Hermes VPS | `kimi-for-coding` (Kimi K2.5) | API `8658`, webhook `8648` — Active |
| Bezalel | Hermes VPS | Claude Opus 4.6 (Anthropic) | Port `8645` — Active |
| Allegro-Primus | Hermes VPS | Kimi K2.5 | Port `8644` — Requires restart |
| Bilbo | External | Gemma 4B (local) | Telegram dual-mode — Active |
**Network:** Hermes VPS public IP `143.198.27.163` (Ubuntu 24.04.3 LTS). Local Gemma 4 fallback on `127.0.0.1:11435`.
---
## 5. Conclusion
The codebase is in a strong position: security is hardened, the agent loop is more resilient, and a clear roadmap exists to replace high-maintenance homegrown infrastructure with battle-tested open-source projects. The commercialization strategy is formalized and ready for execution. The next critical path is the human-facing work of entity formation, sales outreach, and closing the first fixed-scope engagement.
Prepared by **Ezra**
April 2026

Binary file not shown.

261
scripts/forge_health_check.py Executable file
View File

@@ -0,0 +1,261 @@
#!/usr/bin/env python3
"""Forge Health Check — Build verification and artifact integrity scanner.
Scans wizard environments for:
- Missing source files (.pyc without .py) — Allegro finding: GOFAI source files gone
- Burn script accumulation in /root or wizard directories
- World-readable sensitive files (keystores, tokens, configs)
- Missing required environment variables
Usage:
python scripts/forge_health_check.py /root/wizards
python scripts/forge_health_check.py /root/wizards --json
python scripts/forge_health_check.py /root/wizards --fix-permissions
"""
from __future__ import annotations
import argparse
import json
import os
import stat
import sys
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Iterable
SENSITIVE_FILE_PATTERNS = (
"keystore",
"password",
"private",
"apikey",
"api_key",
"credentials",
)
SENSITIVE_NAME_PREFIXES = (
"key_",
"keys_",
"token_",
"tokens_",
"secret_",
"secrets_",
".env",
"env.",
)
SENSITIVE_NAME_SUFFIXES = (
"_key",
"_keys",
"_token",
"_tokens",
"_secret",
"_secrets",
".key",
".env",
".token",
".secret",
)
SENSIBLE_PERMISSIONS = 0o600 # owner read/write only
REQUIRED_ENV_VARS = (
"GITEA_URL",
"GITEA_TOKEN",
"GITEA_USER",
)
BURN_SCRIPT_PATTERNS = (
"burn",
"ignite",
"inferno",
"scorch",
"char",
"blaze",
"ember",
)
@dataclass
class HealthFinding:
category: str
severity: str # critical, warning, info
path: str
message: str
suggestion: str = ""
@dataclass
class HealthReport:
target: str
findings: list[HealthFinding] = field(default_factory=list)
passed: bool = True
def add(self, finding: HealthFinding) -> None:
self.findings.append(finding)
if finding.severity == "critical":
self.passed = False
def scan_orphaned_bytecode(root: Path, report: HealthReport) -> None:
"""Detect .pyc files without corresponding .py source files."""
for pyc in root.rglob("*.pyc"):
py = pyc.with_suffix(".py")
if not py.exists():
# Also check __pycache__ naming convention
if pyc.name.startswith("__") and pyc.parent.name == "__pycache__":
stem = pyc.stem.split(".")[0]
py = pyc.parent.parent / f"{stem}.py"
if not py.exists():
report.add(
HealthFinding(
category="artifact_integrity",
severity="critical",
path=str(pyc),
message=f"Compiled bytecode without source: {pyc}",
suggestion="Restore missing .py source file from version control or backup",
)
)
def scan_burn_script_clutter(root: Path, report: HealthReport) -> None:
"""Detect burn scripts and other temporary artifacts outside proper staging."""
for path in root.iterdir():
if not path.is_file():
continue
lower = path.name.lower()
if any(pat in lower for pat in BURN_SCRIPT_PATTERNS):
report.add(
HealthFinding(
category="deployment_hygiene",
severity="warning",
path=str(path),
message=f"Burn script or temporary artifact in production path: {path.name}",
suggestion="Archive to a burn/ or tmp/ directory, or remove if no longer needed",
)
)
def _is_sensitive_filename(name: str) -> bool:
"""Check if a filename indicates it may contain secrets."""
lower = name.lower()
if lower == ".env.example":
return False
if any(pat in lower for pat in SENSITIVE_FILE_PATTERNS):
return True
if any(lower.startswith(pref) for pref in SENSITIVE_NAME_PREFIXES):
return True
if any(lower.endswith(suff) for suff in SENSITIVE_NAME_SUFFIXES):
return True
return False
def scan_sensitive_file_permissions(root: Path, report: HealthReport, fix: bool = False) -> None:
"""Detect world-readable sensitive files."""
for fpath in root.rglob("*"):
if not fpath.is_file():
continue
# Skip test files — real secrets should never live in tests/
if "/tests/" in str(fpath) or str(fpath).startswith(str(root / "tests")):
continue
if not _is_sensitive_filename(fpath.name):
continue
try:
mode = fpath.stat().st_mode
except OSError:
continue
# Readable by group or other
if mode & stat.S_IRGRP or mode & stat.S_IROTH:
was_fixed = False
if fix:
try:
fpath.chmod(SENSIBLE_PERMISSIONS)
was_fixed = True
except OSError:
pass
report.add(
HealthFinding(
category="security",
severity="critical",
path=str(fpath),
message=(
f"Sensitive file world-readable: {fpath.name} "
f"(mode={oct(mode & 0o777)})"
),
suggestion=(
f"Fixed permissions to {oct(SENSIBLE_PERMISSIONS)}"
if was_fixed
else f"Run 'chmod {oct(SENSIBLE_PERMISSIONS)[2:]} {fpath}'"
),
)
)
def scan_environment_variables(report: HealthReport) -> None:
"""Check for required environment variables."""
for var in REQUIRED_ENV_VARS:
if not os.environ.get(var):
report.add(
HealthFinding(
category="configuration",
severity="warning",
path="$" + var,
message=f"Required environment variable {var} is missing or empty",
suggestion="Export the variable in your shell profile or secrets manager",
)
)
def run_health_check(target: Path, fix_permissions: bool = False) -> HealthReport:
report = HealthReport(target=str(target.resolve()))
if target.exists():
scan_orphaned_bytecode(target, report)
scan_burn_script_clutter(target, report)
scan_sensitive_file_permissions(target, report, fix=fix_permissions)
scan_environment_variables(report)
return report
def print_report(report: HealthReport) -> None:
status = "PASS" if report.passed else "FAIL"
print(f"Forge Health Check: {status}")
print(f"Target: {report.target}")
print(f"Findings: {len(report.findings)}\n")
by_category: dict[str, list[HealthFinding]] = {}
for f in report.findings:
by_category.setdefault(f.category, []).append(f)
for category, findings in by_category.items():
print(f"[{category.upper()}]")
for f in findings:
print(f" {f.severity.upper()}: {f.message}")
if f.suggestion:
print(f" -> {f.suggestion}")
print()
def main(argv: list[str] | None = None) -> int:
parser = argparse.ArgumentParser(description="Forge Health Check")
parser.add_argument("target", nargs="?", default="/root/wizards", help="Root path to scan")
parser.add_argument("--json", action="store_true", help="Output JSON report")
parser.add_argument("--fix-permissions", action="store_true", help="Auto-fix file permissions")
args = parser.parse_args(argv)
target = Path(args.target)
report = run_health_check(target, fix_permissions=args.fix_permissions)
if args.json:
print(json.dumps(asdict(report), indent=2))
else:
print_report(report)
return 0 if report.passed else 1
if __name__ == "__main__":
raise SystemExit(main())

89
scripts/smoke_test.py Executable file
View File

@@ -0,0 +1,89 @@
#!/usr/bin/env python3
"""Forge smoke tests — fast checks that core imports resolve and entrypoints load.
Total runtime target: < 30 seconds.
"""
from __future__ import annotations
import importlib
import subprocess
import sys
from pathlib import Path
# Allow running smoke test directly from repo root before pip install
REPO_ROOT = Path(__file__).parent.parent
sys.path.insert(0, str(REPO_ROOT))
CORE_MODULES = [
"hermes_cli.config",
"hermes_state",
"model_tools",
"toolsets",
"utils",
]
CLI_ENTRYPOINTS = [
[sys.executable, "cli.py", "--help"],
]
def test_imports() -> None:
ok = 0
skipped = 0
for mod in CORE_MODULES:
try:
importlib.import_module(mod)
ok += 1
except ImportError as exc:
# If the failure is a missing third-party dependency, skip rather than fail
# so the smoke test can run before `pip install` in bare environments.
msg = str(exc).lower()
if "no module named" in msg and mod.replace(".", "/") not in msg:
print(f"SKIP: import {mod} -> missing dependency ({exc})")
skipped += 1
else:
print(f"FAIL: import {mod} -> {exc}")
sys.exit(1)
except Exception as exc:
print(f"FAIL: import {mod} -> {exc}")
sys.exit(1)
print(f"OK: {ok} core imports", end="")
if skipped:
print(f" ({skipped} skipped due to missing deps)")
else:
print()
def test_cli_help() -> None:
ok = 0
skipped = 0
for cmd in CLI_ENTRYPOINTS:
result = subprocess.run(cmd, capture_output=True, timeout=30)
if result.returncode == 0:
ok += 1
continue
stderr = result.stderr.decode().lower()
# Gracefully skip if dependencies are missing in bare environments
if "modulenotfounderror" in stderr or "no module named" in stderr:
print(f"SKIP: {' '.join(cmd)} -> missing dependency")
skipped += 1
else:
print(f"FAIL: {' '.join(cmd)} -> {result.stderr.decode()[:200]}")
sys.exit(1)
print(f"OK: {ok} CLI entrypoints", end="")
if skipped:
print(f" ({skipped} skipped due to missing deps)")
else:
print()
def main() -> int:
test_imports()
test_cli_help()
print("Smoke tests passed.")
return 0
if __name__ == "__main__":
raise SystemExit(main())

20
scripts/syntax_guard.py Executable file
View File

@@ -0,0 +1,20 @@
#!/usr/bin/env python3
"""Syntax guard — compile all Python files to catch syntax errors before merge."""
import py_compile
import sys
from pathlib import Path
errors = []
for p in Path(".").rglob("*.py"):
if ".venv" in p.parts or "__pycache__" in p.parts:
continue
try:
py_compile.compile(str(p), doraise=True)
except py_compile.PyCompileError as e:
errors.append(f"{p}: {e}")
print(f"SYNTAX ERROR: {p}: {e}", file=sys.stderr)
if errors:
print(f"\n{len(errors)} file(s) with syntax errors", file=sys.stderr)
sys.exit(1)
print("All Python files compile successfully")

View File

@@ -0,0 +1,175 @@
"""Tests for scripts/forge_health_check.py"""
import os
import stat
from pathlib import Path
# Import the script as a module
import sys
sys.path.insert(0, str(Path(__file__).parent.parent / "scripts"))
from forge_health_check import (
HealthFinding,
HealthReport,
_is_sensitive_filename,
run_health_check,
scan_burn_script_clutter,
scan_orphaned_bytecode,
scan_sensitive_file_permissions,
scan_environment_variables,
)
class TestIsSensitiveFilename:
def test_keystore_is_sensitive(self) -> None:
assert _is_sensitive_filename("keystore.json") is True
def test_env_example_is_not_sensitive(self) -> None:
assert _is_sensitive_filename(".env.example") is False
def test_env_file_is_sensitive(self) -> None:
assert _is_sensitive_filename(".env") is True
assert _is_sensitive_filename("production.env") is True
def test_test_file_with_key_is_not_sensitive(self) -> None:
assert _is_sensitive_filename("test_interrupt_key_match.py") is False
assert _is_sensitive_filename("test_api_key_providers.py") is False
class TestScanOrphanedBytecode:
def test_detects_pyc_without_py(self, tmp_path: Path) -> None:
pyc = tmp_path / "module.pyc"
pyc.write_bytes(b"\x00")
report = HealthReport(target=str(tmp_path))
scan_orphaned_bytecode(tmp_path, report)
assert len(report.findings) == 1
assert report.findings[0].category == "artifact_integrity"
assert report.findings[0].severity == "critical"
def test_ignores_pyc_with_py(self, tmp_path: Path) -> None:
(tmp_path / "module.py").write_text("pass")
pyc = tmp_path / "module.pyc"
pyc.write_bytes(b"\x00")
report = HealthReport(target=str(tmp_path))
scan_orphaned_bytecode(tmp_path, report)
assert len(report.findings) == 0
def test_detects_pycache_orphan(self, tmp_path: Path) -> None:
pycache = tmp_path / "__pycache__"
pycache.mkdir()
pyc = pycache / "module.cpython-312.pyc"
pyc.write_bytes(b"\x00")
report = HealthReport(target=str(tmp_path))
scan_orphaned_bytecode(tmp_path, report)
assert len(report.findings) == 1
assert "__pycache__" in report.findings[0].path
class TestScanBurnScriptClutter:
def test_detects_burn_script(self, tmp_path: Path) -> None:
(tmp_path / "burn_test.sh").write_text("#!/bin/bash")
report = HealthReport(target=str(tmp_path))
scan_burn_script_clutter(tmp_path, report)
assert len(report.findings) == 1
assert report.findings[0].category == "deployment_hygiene"
assert report.findings[0].severity == "warning"
def test_ignores_regular_files(self, tmp_path: Path) -> None:
(tmp_path / "deploy.sh").write_text("#!/bin/bash")
report = HealthReport(target=str(tmp_path))
scan_burn_script_clutter(tmp_path, report)
assert len(report.findings) == 0
class TestScanSensitiveFilePermissions:
def test_detects_world_readable_keystore(self, tmp_path: Path) -> None:
ks = tmp_path / "keystore.json"
ks.write_text("{}")
ks.chmod(0o644)
report = HealthReport(target=str(tmp_path))
scan_sensitive_file_permissions(tmp_path, report)
assert len(report.findings) == 1
assert report.findings[0].category == "security"
assert report.findings[0].severity == "critical"
assert "644" in report.findings[0].message
def test_auto_fixes_permissions(self, tmp_path: Path) -> None:
ks = tmp_path / "keystore.json"
ks.write_text("{}")
ks.chmod(0o644)
report = HealthReport(target=str(tmp_path))
scan_sensitive_file_permissions(tmp_path, report, fix=True)
assert len(report.findings) == 1
assert ks.stat().st_mode & 0o777 == 0o600
def test_ignores_safe_permissions(self, tmp_path: Path) -> None:
ks = tmp_path / "keystore.json"
ks.write_text("{}")
ks.chmod(0o600)
report = HealthReport(target=str(tmp_path))
scan_sensitive_file_permissions(tmp_path, report)
assert len(report.findings) == 0
def test_ignores_env_example(self, tmp_path: Path) -> None:
env = tmp_path / ".env.example"
env.write_text("# example")
env.chmod(0o644)
report = HealthReport(target=str(tmp_path))
scan_sensitive_file_permissions(tmp_path, report)
assert len(report.findings) == 0
def test_ignores_test_directory(self, tmp_path: Path) -> None:
tests_dir = tmp_path / "tests"
tests_dir.mkdir()
ks = tests_dir / "keystore.json"
ks.write_text("{}")
ks.chmod(0o644)
report = HealthReport(target=str(tmp_path))
scan_sensitive_file_permissions(tmp_path, report)
assert len(report.findings) == 0
class TestScanEnvironmentVariables:
def test_reports_missing_env_var(self, monkeypatch) -> None:
monkeypatch.delenv("GITEA_TOKEN", raising=False)
report = HealthReport(target=".")
scan_environment_variables(report)
missing = [f for f in report.findings if f.path == "$GITEA_TOKEN"]
assert len(missing) == 1
assert missing[0].severity == "warning"
def test_passes_when_env_vars_present(self, monkeypatch) -> None:
for var in ("GITEA_URL", "GITEA_TOKEN", "GITEA_USER"):
monkeypatch.setenv(var, "present")
report = HealthReport(target=".")
scan_environment_variables(report)
assert len(report.findings) == 0
class TestRunHealthCheck:
def test_full_run(self, tmp_path: Path, monkeypatch) -> None:
monkeypatch.setenv("GITEA_URL", "https://example.com")
monkeypatch.setenv("GITEA_TOKEN", "secret")
monkeypatch.setenv("GITEA_USER", "bezalel")
(tmp_path / "orphan.pyc").write_bytes(b"\x00")
(tmp_path / "burn_it.sh").write_text("#!/bin/bash")
ks = tmp_path / "keystore.json"
ks.write_text("{}")
ks.chmod(0o644)
report = run_health_check(tmp_path)
assert not report.passed
categories = {f.category for f in report.findings}
assert "artifact_integrity" in categories
assert "deployment_hygiene" in categories
assert "security" in categories
def test_clean_run_passes(self, tmp_path: Path, monkeypatch) -> None:
for var in ("GITEA_URL", "GITEA_TOKEN", "GITEA_USER"):
monkeypatch.setenv(var, "present")
(tmp_path / "module.py").write_text("pass")
report = run_health_check(tmp_path)
assert report.passed
assert len(report.findings) == 0

View File

@@ -0,0 +1,18 @@
"""Bare green-path E2E — one happy-path tool call cycle.
Exercises the terminal tool directly and verifies the response structure.
No API keys required. Runtime target: < 10 seconds.
"""
import json
from tools.terminal_tool import terminal_tool
def test_terminal_echo_green_path() -> None:
"""terminal('echo hello') -> verify response contains 'hello' and exit_code 0."""
result = terminal_tool(command="echo hello", timeout=10)
data = json.loads(result)
assert data["exit_code"] == 0, f"Expected exit_code 0, got {data['exit_code']}"
assert "hello" in data["output"], f"Expected 'hello' in output, got: {data['output']}"