Compare commits

...

2 Commits

Author SHA1 Message Date
Alexander Whitestone
9c2341f4ca feat: add Ezra quarterly report April 2026 (MD + PDF)
Brings in consolidated quarterly report from epic-999-phase-ii-forge branch.
Covers V-011 security hardening, context compressor tuning, burn mode resilience,
system formalization audit, Operation Get A Job GTM strategy, and fleet status.

Fixes #133
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 22:04:29 -04:00
069d5404a0 [BEZALEL][DEMO] Notebook Workflow: Jupytext + Papermill for Agent Tasks (#157)
Some checks failed
Notebook CI / notebook-smoke (push) Failing after 2s
2026-04-07 02:02:49 +00:00
6 changed files with 451 additions and 0 deletions

View File

@@ -0,0 +1,44 @@
name: Notebook CI
on:
push:
paths:
- 'notebooks/**'
pull_request:
paths:
- 'notebooks/**'
jobs:
notebook-smoke:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install dependencies
run: |
pip install papermill jupytext nbformat
python -m ipykernel install --user --name python3
- name: Execute system health notebook
run: |
papermill notebooks/agent_task_system_health.ipynb /tmp/output.ipynb \
-p threshold 0.5 \
-p hostname ci-runner
- name: Verify output has results
run: |
python -c "
import json
nb = json.load(open('/tmp/output.ipynb'))
code_cells = [c for c in nb['cells'] if c['cell_type'] == 'code']
outputs = [c.get('outputs', []) for c in code_cells]
total_outputs = sum(len(o) for o in outputs)
assert total_outputs > 0, 'Notebook produced no outputs'
print(f'Notebook executed successfully with {total_outputs} output(s)')
"

57
docs/NOTEBOOK_WORKFLOW.md Normal file
View File

@@ -0,0 +1,57 @@
# Notebook Workflow for Agent Tasks
This directory demonstrates a sovereign, version-controlled workflow for LLM agent tasks using Jupyter notebooks.
## Philosophy
- **`.py` files are the source of truth`** — authored and reviewed as plain Python with `# %%` cell markers (via Jupytext)
- **`.ipynb` files are generated artifacts** — auto-created from `.py` for execution and rich viewing
- **Papermill parameterizes and executes** — each run produces an output notebook with code, narrative, and results preserved
- **Output notebooks are audit artifacts** — every execution leaves a permanent, replayable record
## File Layout
```
notebooks/
agent_task_system_health.py # Source of truth (Jupytext)
agent_task_system_health.ipynb # Generated from .py
docs/
NOTEBOOK_WORKFLOW.md # This document
.gitea/workflows/
notebook-ci.yml # CI gate: executes notebooks on PR/push
```
## How Agents Work With Notebooks
1. **Create** — Agent generates a `.py` notebook using `# %% [markdown]` and `# %%` code blocks
2. **Review** — PR reviewers see clean diffs in Gitea (no JSON noise)
3. **Generate**`jupytext --to ipynb` produces the `.ipynb` before merge
4. **Execute** — Papermill runs the notebook with injected parameters
5. **Archive** — Output notebook is committed to a `reports/` branch or artifact store
## Converting Between Formats
```bash
# .py -> .ipynb
jupytext --to ipynb notebooks/agent_task_system_health.py
# .ipynb -> .py
jupytext --to py notebooks/agent_task_system_health.ipynb
# Execute with parameters
papermill notebooks/agent_task_system_health.ipynb output.ipynb \
-p threshold 1.0 -p hostname forge-vps-01
```
## CI Gate
The `notebook-ci.yml` workflow executes all notebooks in `notebooks/` on every PR and push, ensuring that checked-in notebooks still run and produce outputs.
## Why This Matters
| Problem | Notebook Solution |
|---|---|
| Ephemeral agent reasoning | Markdown cells narrate the thought process |
| Stateless single-turn tools | Stateful cells persist variables across steps |
| Unreviewable binary artifacts | `.py` source is diffable and PR-friendly |
| No execution audit trail | Output notebook preserves code + outputs + metadata |

View File

@@ -0,0 +1,57 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parameterized Agent Task: System Health Check\n",
"\n",
"This notebook demonstrates how an LLM agent can generate a task notebook,\n",
"a scheduler can parameterize and execute it via papermill,\n",
"and the output becomes a persistent audit artifact."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {"tags": ["parameters"]},
"outputs": [],
"source": [
"# Default parameters — papermill will inject overrides here\n",
"threshold = 1.0\n",
"hostname = \"localhost\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json, subprocess, datetime\n",
"gather_time = datetime.datetime.now().isoformat()\n",
"load_avg = subprocess.check_output([\"cat\", \"/proc/loadavg\"]).decode().strip()\n",
"load_values = [float(x) for x in load_avg.split()[:3]]\n",
"avg_load = sum(load_values) / len(load_values)\n",
"intervention_needed = avg_load > threshold\n",
"report = {\n",
" \"hostname\": hostname,\n",
" \"threshold\": threshold,\n",
" \"avg_load\": round(avg_load, 3),\n",
" \"intervention_needed\": intervention_needed,\n",
" \"gathered_at\": gather_time\n",
"}\n",
"print(json.dumps(report, indent=2))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,41 @@
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.19.1
# kernelspec:
# display_name: Python 3
# language: python
# name: python3
# ---
# %% [markdown]
# # Parameterized Agent Task: System Health Check
#
# This notebook demonstrates how an LLM agent can generate a task notebook,
# a scheduler can parameterize and execute it via papermill,
# and the output becomes a persistent audit artifact.
# %% tags=["parameters"]
# Default parameters — papermill will inject overrides here
threshold = 1.0
hostname = "localhost"
# %%
import json, subprocess, datetime
gather_time = datetime.datetime.now().isoformat()
load_avg = subprocess.check_output(["cat", "/proc/loadavg"]).decode().strip()
load_values = [float(x) for x in load_avg.split()[:3]]
avg_load = sum(load_values) / len(load_values)
intervention_needed = avg_load > threshold
report = {
"hostname": hostname,
"threshold": threshold,
"avg_load": round(avg_load, 3),
"intervention_needed": intervention_needed,
"gathered_at": gather_time
}
print(json.dumps(report, indent=2))

View File

@@ -0,0 +1,252 @@
# Ezra — Quarterly Technical & Strategic Report
**April 2026**
---
## Executive Summary
This report consolidates the principal technical and strategic outputs from Q1/Q2 2026. Three major workstreams are covered:
1. **Security & Performance Hardening** — Shipped V-011 obfuscation detection and context-compressor tuning.
2. **System Formalization Audit** — Identified ~6,300 lines of homegrown infrastructure that can be replaced by well-maintained open-source projects.
3. **Business Development** — Formalized a pure-contracting go-to-market plan ("Operation Get A Job") to monetize the engineering collective.
---
## 1. Recent Deliverables
### 1.1 V-011 Obfuscation Bypass Detection
A significant security enhancement was shipped to the skills-guard subsystem to defeat obfuscated malicious skill code.
**Technical additions:**
- `normalize_input()` with NFKC normalization, case folding, and zero-width character removal to defeat homoglyph and ZWSP evasion.
- `PythonSecurityAnalyzer` AST visitor detecting `eval`/`exec`/`compile`, `getattr` dunder access, and imports of `base64`/`codecs`/`marshal`/`types`/`ctypes`.
- Additional regex patterns for `getattr` builtins chains, `__import__` os/subprocess, and nested base64 decoding.
- Full integration into `scan_file()`; Python files now receive both normalized regex scanning and AST-based analysis.
**Verification:** All tests passing (`103 passed, 4 warnings`).
**Reference:** Forge PR #131`[EPIC-999/Phase II] The Forge — V-011 obfuscation fix + compressor tuning`
### 1.2 Context Compressor Tuning
The default `protect_last_n` parameter was reduced from `20` to `5`. The previous default was overly conservative, preventing meaningful compression on long sessions. The new default preserves the five most recent conversational turns while allowing the compressor to effectively reduce token pressure.
A regression test was added verifying that the last five turns are never summarized away.
### 1.3 Burn Mode Resilience
The agent loop was enhanced with a configurable `burn_mode` flag that increases concurrent tool execution capacity and adds transient-failure retry logic.
**Changes:**
- `max_tool_workers` increased from `8` to `16` in burn mode.
- Expanded parallel tool coverage to include browser, vision, skill, and session-search tools.
- Added batch timeout protection (300s in burn mode / 180s normal) to prevent hung threads from blocking the agent loop.
- Thread-pool shutdown now uses `executor.shutdown(wait=False)` for immediate control return.
- Transient errors (timeouts, rate limits, 502/503/504) trigger one automatic retry in burn mode.
---
## 2. System Formalization Audit
A comprehensive audit was performed across the `hermes-agent` codebase to identify homegrown modules that could be replaced by mature open-source alternatives. The objective is efficiency: reduce maintenance burden, leverage community expertise, and improve reliability.
### 2.1 Candidate Matrix
| Priority | Component | Lines | Current State | Proposed Replacement | Effort | ROI |
|:--------:|-----------|------:|---------------|----------------------|:------:|:---:|
| **P0** | MCP Client | 2,176 | Custom asyncio transport, sampling, schema translation | `mcp` (official Python SDK) | 2-3 wks | Very High |
| **P0** | Cron Scheduler | ~1,500 | Custom JSON job store, manual tick loop | `APScheduler` | 1-2 wks | Very High |
| **P0** | Config Management | 2,589 | Manual YAML loader, no type safety | `pydantic-settings` + Pydantic v2 | 3-4 wks | High |
| **P1** | Checkpoint Manager | 548 | Shells out to `git` binary | `dulwich` (pure-Python git) | 1 wk | Medium-High |
| **P1** | Auth / Credential Pool | ~3,800 | Custom JWT decode, OAuth refresh, JSON auth store | `authlib` + `keyring` + `PyJWT` | 2-3 wks | Medium |
| **P1** | Batch Runner | 1,285 | Custom `multiprocessing.Pool` wrapper | `joblib` (local) or `celery` (distributed) | 1-2 wks | Medium |
| **P2** | SQLite Session Store | ~2,400 | Raw SQLite + FTS5, manual schema | SQLAlchemy ORM + Alembic | 2-3 wks | Medium |
| **P2** | Trajectory Compressor | 1,518 | Custom tokenizer + summarization pipeline | Keep core logic; add `zstandard` for binary storage | 3 days | Low-Medium |
| **P2** | Process Registry | 889 | Custom background process tracking | Keep (adds too much ops complexity) | — | Low |
| **P2** | Web Tools | 2,080+ | Firecrawl + Parallel wrappers | Keep (Firecrawl is already best-in-class) | — | Low |
### 2.2 P0 Replacements
#### MCP Client → Official `mcp` Python SDK
**Current:** `tools/mcp_tool.py` (2,176 lines) contains custom stdio/HTTP transport lifecycle, manual `anyio` cancel-scope cleanup, hand-rolled schema translation, custom sampling bridge, credential stripping, and reconnection backoff.
**Problem:** The Model Context Protocol is evolving rapidly. Maintaining a custom 2K-line client means every protocol revision requires manual patches. The official SDK already handles transport negotiation, lifecycle management, and type-safe schema generation.
**Migration Plan:**
1. Add `mcp>=1.0.0` to dependencies.
2. Build a thin `HermesMCPBridge` class that instantiates `mcp.ClientSession`, maps MCP `Tool` schemas to Hermes registry calls, forwards tool invocations, and preserves the sampling callback.
3. Deprecate the `_mcp_loop` background thread and `anyio`-based transport code.
4. Add integration tests against a test MCP server.
**Lines Saved:** ~1,600
**Risk:** Medium — sampling and timeout behavior need parity testing.
#### Cron Scheduler → APScheduler
**Current:** `cron/jobs.py` (753 lines) + `cron/scheduler.py` (~740 lines) use a JSON file as the job store, custom `parse_duration` and `compute_next_run` logic, a manual tick loop, and ad-hoc delivery orchestration.
**Problem:** Scheduling is a solved problem. The homegrown system lacks timezone support, job concurrency controls, graceful clustering, and durable execution guarantees.
**Migration Plan:**
1. Introduce `APScheduler` with a `SQLAlchemyJobStore` (or custom JSON store).
2. Refactor each Hermes cron job into an APScheduler `Job` function.
3. Preserve existing delivery logic (`_deliver_result`, `_build_job_prompt`, `_run_job_script`) as the job body.
4. Migrate `jobs.json` entries into APScheduler jobs on first run.
5. Expose `/cron` status via a thin CLI wrapper.
**Lines Saved:** ~700
**Risk:** Low — delivery logic is preserved; only the trigger mechanism changes.
#### Config Management → `pydantic-settings`
**Current:** `hermes_cli/config.py` (2,589 lines) uses manual YAML parsing with hardcoded defaults, a complex migration chain (`_config_version` currently at 11), no runtime type validation, and stringly-typed env var resolution.
**Problem:** Every new config option requires touching multiple places. Migration logic is ~400 lines and growing. Typo'd config values are only caught at runtime, often deep in the agent loop.
**Migration Plan:**
1. Define a `HermesConfig` Pydantic model with nested sections (`ModelConfig`, `ProviderConfig`, `AgentConfig`, `CompressionConfig`, etc.).
2. Use `pydantic-settings`'s `SettingsConfigDict(yaml_file="~/.hermes/config.yaml")` to auto-load.
3. Map env vars via `env_prefix="HERMES_"` or field-level `validation_alias`.
4. Keep the migration layer as a one-time upgrade function, then remove it after two releases.
5. Replace `load_config()` call sites with `HermesConfig()` instantiation.
**Lines Saved:** ~1,500
**Risk:** Medium-High — large blast radius; every module reads config. Requires backward compatibility.
### 2.3 P1 Replacements
**Checkpoint Manager → `dulwich`**
- Replace `subprocess.run(["git", ...])` calls with `dulwich.porcelain` equivalents.
- Use `dulwich.repo.Repo.init_bare()` for shadow repos.
- Snapshotting becomes an in-memory `Index` write + `commit()`.
- **Lines Saved:** ~200
- **Risk:** Low
**Auth / Credential Pool → `authlib` + `keyring` + `PyJWT`**
- Use `authlib` for OAuth2 session and token refresh.
- Replace custom JWT decoding with `PyJWT`.
- Migrate the auth store JSON to `keyring`-backed secure storage where available.
- Keep Hermes-specific credential pool strategies (round-robin, least-used, etc.).
- **Lines Saved:** ~800
- **Risk:** Medium
**Batch Runner → `joblib`**
- For typical local batch sizes, `joblib.Parallel(n_jobs=-1, backend='loky')` replaces the custom worker pool.
- Only migrate to Celery if cross-machine distribution is required.
- **Lines Saved:** ~400
- **Risk:** Low for `joblib`
### 2.4 Execution Roadmap
1. **Week 1-2:** Migrate Checkpoint Manager to `dulwich` (quick win, low risk)
2. **Week 3-4:** Migrate Cron Scheduler to `APScheduler` (high value, well-contained)
3. **Week 5-8:** Migrate MCP Client to official `mcp` SDK (highest complexity, highest payoff)
4. **Week 9-12:** Migrate Config Management to `pydantic-settings` (largest blast radius, do last)
5. **Ongoing:** Evaluate Auth/Credential Pool and Batch Runner replacements as follow-up epics.
### 2.5 Cost-Benefit Summary
| Metric | Value |
|--------|-------|
| Total homebrew lines audited | ~17,000 |
| Lines recommended for replacement | ~6,300 |
| Estimated dev weeks (P0 + P1) | 10-14 weeks |
| New runtime dependencies added | 4-6 well-maintained packages |
| Maintenance burden reduction | Very High |
| Risk level | Medium (mitigated by strong test coverage) |
---
## 3. Strategic Initiative: Operation Get A Job
### 3.1 Thesis
The engineering collective is capable of 10x delivery velocity compared to typical market offerings. The strategic opportunity is to monetize this capability through pure contracting — high-tempo, fixed-scope engagements with no exclusivity or employer-like constraints.
### 3.2 Service Menu
**Tier A — White-Glove Agent Infrastructure ($400-600/hr)**
- Custom AI agent deployment with tool use (Slack, Discord, Telegram, webhooks)
- MCP server development
- Local LLM stack setup (on-premise / VPC)
- Agent security audit and red teaming
**Tier B — Security Hardening & Code Review ($250-400/hr)**
- Security backlog burn-down (CVE-class bugs)
- Skills-guard / sandbox hardening
- Architecture review
**Tier C — Automation & Integration ($150-250/hr)**
- Webhook-to-action pipelines
- Research and intelligence reporting
- Content-to-code workflows
### 3.3 Engagement Packages
| Service | Description | Timeline | Investment |
|---------|-------------|----------|------------|
| Agent Security Audit | Review of one AI agent pipeline + written findings | 2-3 business days | $4,500 |
| MCP Server Build | One custom MCP server with 3-5 tools + docs + tests | 1-2 weeks | $8,000 |
| Custom Bot Deployment | End-to-end bot with up to 5 tools, deployed to client platform | 2-3 weeks | $12,000 |
| Security Sprint | Close top 5 security issues in a Python/JS repo | 1-2 weeks | $6,500 |
| Monthly Retainer — Core | 20 hrs/month prioritized engineering + triage | Ongoing | $6,000/mo |
| Monthly Retainer — Scale | 40 hrs/month prioritized engineering + on-call | Ongoing | $11,000/mo |
### 3.4 Go-to-Market Motion
**Immediate channels:**
- Cold outbound to CTOs/VPEs at Series A-C AI startups
- LinkedIn authority content (architecture reviews, security bulletins)
- Platform presence (Gun.io, Toptal, Upwork for specific niche keywords)
**Lead magnet:** Free 15-minute architecture review. No pitch. One concrete risk identified.
### 3.5 Infrastructure Foundation
The Hermes Agent framework serves as both the delivery platform and the portfolio piece:
- Open-source runtime with ~3,000 tests
- Gateway architecture supporting 8+ messaging platforms
- Native MCP client, cron scheduling, subagent delegation
- Self-hosted Forge (Gitea) with CI and automated PR review
- Local Gemma 4 inference stack on bare metal
### 3.6 90-Day Revenue Model
| Month | Target |
|-------|--------|
| Month 1 | $9-12K (1x retainer or 2x audits) |
| Month 2 | $17K (+ 1x MCP build) |
| Month 3 | $29K (+ 1x bot deployment + new retainer) |
### 3.7 Immediate Action Items
- File Wyoming LLC and obtain EIN
- Open Mercury business bank account
- Secure E&O insurance
- Update LinkedIn profile and publish first authority post
- Customize capabilities deck and begin warm outbound
---
## 4. Fleet Status Summary
| House | Host | Model / Provider | Gateway Status |
|-------|------|------------------|----------------|
| Ezra | Hermes VPS | `kimi-for-coding` (Kimi K2.5) | API `8658`, webhook `8648` — Active |
| Bezalel | Hermes VPS | Claude Opus 4.6 (Anthropic) | Port `8645` — Active |
| Allegro-Primus | Hermes VPS | Kimi K2.5 | Port `8644` — Requires restart |
| Bilbo | External | Gemma 4B (local) | Telegram dual-mode — Active |
**Network:** Hermes VPS public IP `143.198.27.163` (Ubuntu 24.04.3 LTS). Local Gemma 4 fallback on `127.0.0.1:11435`.
---
## 5. Conclusion
The codebase is in a strong position: security is hardened, the agent loop is more resilient, and a clear roadmap exists to replace high-maintenance homegrown infrastructure with battle-tested open-source projects. The commercialization strategy is formalized and ready for execution. The next critical path is the human-facing work of entity formation, sales outreach, and closing the first fixed-scope engagement.
Prepared by **Ezra**
April 2026

Binary file not shown.