# Jupyter Notebooks as Core LLM Execution Layer — Deep Research Report

**Issue:** #155

**Date:** 2026-04-06

**Status:** Research / Spike

**Prior Art:** Timmy's initial spike (llm_execution_spike.ipynb, hamelnb bridge, JupyterLab on forge VPS)

---
## Executive Summary

This report deepens the research from issue #155 into three areas requested by Rockachopa:

1. The **full Jupyter product suite** — JupyterHub vs JupyterLab vs Notebook
2. **Papermill** — the production-grade notebook execution engine already used in real data pipelines
3. The **"PR model for notebooks"** — how agents can propose, diff, review, and merge changes to `.ipynb` files just as they do with code PRs

The conclusion: an elegant, production-grade agent→notebook pipeline already exists as open-source tooling. We don't need to invent much — we need to compose what's there.

---
## 1. The Jupyter Product Suite

The Jupyter ecosystem has three distinct layers that are often conflated. Understanding the distinction is critical for architectural decisions.

### 1.1 Jupyter Notebook (Classic)

The original single-user interface: one browser tab per `.ipynb` file. Version 6 is in maintenance-only mode; version 7 was rebuilt on JupyterLab components and is functionally equivalent. For headless agent use, the UI is irrelevant — what matters is the `.ipynb` file format and the kernel execution model underneath.

### 1.2 JupyterLab

The current canonical Jupyter interface for human users: a full IDE with multiple panes, a terminal, an extension manager, a built-in diff viewer, and `jupyterlab-git` for Git workflows from the UI. JupyterLab is the recommended target for agent-collaborative workflows because:

- It exposes the same REST API as classic Jupyter (kernel sessions, execute, contents)
- Extensions like `jupyterlab-git` let a human co-reviewer inspect changes alongside the agent
- The `hamelnb` bridge Timmy already validated works against a JupyterLab server

**For agents:** JupyterLab is the platform to run on. The agent doesn't interact with the UI — it uses the Jupyter REST API or Papermill on top of it.
### 1.3 JupyterHub — The Multi-User Orchestration Layer

JupyterHub is not a UI. It is a **multi-user server** that spawns, manages, and proxies individual single-user Jupyter servers. This is the production infrastructure layer.

```
[Agent / Browser / API Client]
              |
          [Proxy]  (configurable-http-proxy)
           /    \
      [Hub]     [Single-User Jupyter Server per user/agent]
      (Auth,    (standard JupyterLab/Notebook server)
       Spawner,
       REST API)
```
**Key components:**

- **Hub:** Manages auth, the user database, spawner lifecycle, and the REST API
- **Proxy:** Routes `/hub/*` to the Hub and `/user/<name>/*` to that user's server
- **Spawner:** Controls how single-user servers are started. The default is a local process; production options include `KubeSpawner` (one Kubernetes pod per user) and `DockerSpawner` (one container per user)
- **Authenticator:** PAM, OAuth, or DummyAuthenticator (for isolated agent environments)

**JupyterHub REST API** (relevant for agent orchestration):

```bash
# Spawn a named server for an agent service account
POST /hub/api/users/<username>/servers/<name>

# Stop it when done
DELETE /hub/api/users/<username>/servers/<name>

# Create a scoped API token for the agent
POST /hub/api/users/<username>/tokens

# Check server status
GET /hub/api/users/<username>
```

**Why this matters for Hermes:** JupyterHub gives us isolated kernel environments per agent task, programmable lifecycle management, and a clean auth model. Instead of running one shared JupyterLab instance on the forge VPS, we could spawn ephemeral single-user servers per notebook execution run — each with its own kernel, clean state, and resource limits.
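As a sketch of that lifecycle (the Hub URL, service-account name, and token below are placeholders, not real deployment values), an agent would issue the named-server calls shown above with a token-authenticated HTTP client:

```python
def server_request(method, username, server_name,
                   hub_api="http://localhost:8000/hub/api",  # placeholder Hub URL
                   token="hub-api-token"):                   # placeholder API token
    """Build the pieces of a JupyterHub named-server API call:
    the HTTP method, the target URL, and the auth header."""
    url = f"{hub_api}/users/{username}/servers/{server_name}"
    headers = {"Authorization": f"token {token}"}
    return method, url, headers

# Spawn an ephemeral server for one notebook run, then tear it down when done.
spawn = server_request("POST", "hermes-agent", "run-001")
stop = server_request("DELETE", "hermes-agent", "run-001")
print(spawn[1])  # http://localhost:8000/hub/api/users/hermes-agent/servers/run-001
```

Against a live Hub, the returned triple would be handed to an HTTP client such as `requests.request(method, url, headers=headers)`.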
### 1.4 Jupyter Kernel Gateway — Minimal Headless Execution

If JupyterHub is too heavy, `jupyter-kernel-gateway` exposes just the kernel protocol over REST + WebSocket:

```bash
pip install jupyter-kernel-gateway
jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.jupyter_websocket

# Start kernel
POST /api/kernels
# Execute via WebSocket using the Jupyter messaging protocol
WS /api/kernels/<kernel_id>/channels
# Stop kernel
DELETE /api/kernels/<kernel_id>
```

This is the lowest-level option: no notebook management, just raw kernel access. Suitable if we want to build our own execution layer from scratch.

---
## 2. Papermill — Production Notebook Execution

Papermill is the missing link between "notebook as experiment" and "notebook as repeatable pipeline task." It is already used at scale in industry data pipelines (it originated at Netflix and is maintained under the nteract organization).

### 2.1 Core Concept: Parameterization

Papermill's key innovation is **parameter injection**. Tag a cell in the notebook with `"parameters"`:
```python
# Cell tagged "parameters" (defaults — defined by notebook author)
alpha = 0.5
batch_size = 32
model_name = "baseline"
```

At runtime, Papermill inserts a new cell immediately after, tagged `"injected-parameters"`, that overrides the defaults:

```python
# Cell tagged "injected-parameters" (injected by Papermill at runtime)
alpha = 0.01
batch_size = 128
model_name = "experiment_007"
```

Because cells execute top-to-bottom, the injected cell shadows the defaults. The original notebook is never mutated — Papermill reads the input file and writes to a new output file.
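The shadowing mechanism can be illustrated without Papermill at all, by executing the two cell sources in order within one namespace, as a kernel would:

```python
# Illustration only: simulate the defaults cell followed by the injected cell.
defaults_cell = "alpha = 0.5\nbatch_size = 32"
injected_cell = "alpha = 0.01\nbatch_size = 128"

namespace = {}
for cell_source in (defaults_cell, injected_cell):
    exec(cell_source, namespace)  # a kernel executes cells top-to-bottom

print(namespace["alpha"], namespace["batch_size"])  # 0.01 128
```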
### 2.2 Python API

```python
import papermill as pm

nb = pm.execute_notebook(
    input_path="analysis.ipynb",          # source (can be s3://, az://, gs://)
    output_path="output/run_001.ipynb",   # destination (persists outputs)
    parameters={
        "alpha": 0.01,
        "n_samples": 1000,
        "run_id": "fleet-check-2026-04-06",
    },
    kernel_name="python3",
    execution_timeout=300,                # per-cell timeout in seconds
    log_output=True,                      # stream cell output to the logger
    cwd="/path/to/notebook/",             # working directory for execution
)
# Returns: NotebookNode (the fully executed notebook with all outputs)
```

On cell failure, Papermill raises `PapermillExecutionError` with:

- `cell_index` — which cell failed
- `source` — the failing cell's code
- `ename` / `evalue` — exception type and message
- `traceback` — full traceback

Even on failure, the output notebook is written with whatever cells completed — enabling partial-run inspection.
### 2.3 CLI

```bash
# Basic execution
papermill analysis.ipynb output/run_001.ipynb \
    -p alpha 0.01 \
    -p n_samples 1000

# From a YAML parameter file
papermill analysis.ipynb output/run_001.ipynb -f params.yaml

# CI-friendly: log outputs, no progress bar
papermill analysis.ipynb output/run_001.ipynb \
    --log-output \
    --no-progress-bar \
    --execution-timeout 300 \
    -p run_id "fleet-check-2026-04-06"

# Prepare only (inject params, skip execution — for preview/inspection)
papermill analysis.ipynb preview.ipynb --prepare-only -p alpha 0.01

# Inspect the parameter schema
papermill --help-notebook analysis.ipynb
```

**Remote storage** is built in — `pip install papermill[s3]` enables `s3://` paths for both input and output. Azure and GCS are also supported. For Hermes, this means notebook runs can be stored in object storage and retrieved later for audit.
### 2.4 Scrapbook — Structured Output Collection

`scrapbook` is Papermill's companion for extracting structured data from executed notebooks. Inside a notebook cell:

```python
import scrapbook as sb

# Write typed outputs (stored as special display_data in cell outputs)
sb.glue("accuracy", 0.9342)
sb.glue("metrics", {"precision": 0.91, "recall": 0.93, "f1": 0.92})
sb.glue("results_df", df, "pandas")  # DataFrames too
```

After execution, from the agent:

```python
import scrapbook as sb

nb = sb.read_notebook("output/fleet-check-2026-04-06.ipynb")
metrics = nb.scraps["metrics"].data    # -> {"precision": 0.91, ...}
accuracy = nb.scraps["accuracy"].data  # -> 0.9342

# Or aggregate across many runs
book = sb.read_notebooks("output/")
book.scrap_dataframe  # -> pd.DataFrame with all scraps + filenames
```

This is the clean interface between notebook execution and agent decision-making: the notebook outputs its findings as named, typed scraps; the agent reads them programmatically and acts.
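For example, the "agent acts" step over those scraps might look like this (the scrap names and thresholds are illustrative placeholders, not part of any existing Hermes notebook):

```python
def decide(scraps, alert_limit=20, min_health=0.9):
    """Map scrap values to an agent action. Scrap names and
    thresholds here are illustrative placeholders."""
    if scraps["alert_count"] > alert_limit:
        return "escalate"
    if scraps["health_score"] < min_health:
        return "schedule-recheck"
    return "ok"

print(decide({"alert_count": 12, "health_score": 0.94}))  # ok
```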
### 2.5 How Papermill Compares to hamelnb

| Capability | hamelnb | Papermill |
|---|---|---|
| Stateful kernel session | Yes | No (fresh kernel per run) |
| Parameter injection | No | Yes |
| Persistent output notebook | No | Yes |
| Remote storage (S3/Azure) | No | Yes |
| Per-cell timing/metadata | No | Yes (in output notebook metadata) |
| Error isolation (partial runs) | No | Yes |
| Production pipeline use | Experimental | Industry-standard |
| Structured output collection | No | Yes (via scrapbook) |

**Verdict:** `hamelnb` is great for interactive REPL-style exploration, where state accumulates. Papermill is better for task execution, where we want reproducible, parameterized, auditable runs. They serve different use cases; Hermes needs both.

---
## 3. The `.ipynb` File Format — What the Agent Is Actually Working With

Understanding the format is essential for the "PR model." A `.ipynb` file is JSON with this structure:
```json
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
    "language_info": {"name": "python", "version": "3.10.0"}
  },
  "cells": [
    {
      "id": "a1b2c3d4",
      "cell_type": "markdown",
      "source": "# Fleet Health Check\n\nThis notebook checks system health.",
      "metadata": {}
    },
    {
      "id": "e5f6g7h8",
      "cell_type": "code",
      "source": "alpha = 0.5\nthreshold = 0.95",
      "metadata": {"tags": ["parameters"]},
      "execution_count": null,
      "outputs": []
    },
    {
      "id": "i9j0k1l2",
      "cell_type": "code",
      "source": "import sys\nprint(sys.version)",
      "metadata": {},
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": "3.10.0 (default, ...)\n"
        }
      ]
    }
  ]
}
```
The `nbformat` Python library provides a clean API for working with this:

```python
import nbformat

# Read
with open("notebook.ipynb") as f:
    nb = nbformat.read(f, as_version=4)

# Navigate
for cell in nb.cells:
    if cell.cell_type == "code":
        print(cell.source)

# Modify
nb.cells[2].source = "import sys\nprint('updated')"

# Add cells
new_md = nbformat.v4.new_markdown_cell("## Agent Analysis\nInserted by Hermes.")
nb.cells.insert(3, new_md)

# Write
with open("modified.ipynb", "w") as f:
    nbformat.write(nb, f)

# Validate
nbformat.validate(nb)  # raises nbformat.ValidationError on invalid format
```

---
## 4. The PR Model for Notebooks

This is the elegant architecture Rockachopa described: agents making PRs to notebooks the same way they make PRs to code. Here's how the full stack enables it.

### 4.1 The Problem: Raw `.ipynb` Diffs Are Unusable

Without tooling, a `git diff` on a notebook that was merely re-run (no source changes) produces thousands of lines of JSON changes — execution counts, timestamps, base64-encoded plot images. Code review on raw `.ipynb` diffs is impractical.
### 4.2 nbstripout — Clean Git History

`nbstripout` installs a git **clean filter** that strips outputs before files enter the git index. The working copy is untouched; only what gets committed is cleaned.

```bash
pip install nbstripout
nbstripout --install           # per-repo
# or
nbstripout --install --global  # all repos
```

This writes to `.git/config`:

```ini
[filter "nbstripout"]
    clean = nbstripout
    smudge = cat
    required = true

[diff "ipynb"]
    textconv = nbstripout -t
```

And to `.gitattributes`:

```
*.ipynb filter=nbstripout
*.ipynb diff=ipynb
```

Now `git diff` shows only source changes — the same as reviewing a `.py` file.

**For executed-output notebooks** (where we want to keep outputs for audit): use a separate path like `runs/` or `outputs/`, excluded from the filter via `.gitattributes` (a `-` prefix unsets an attribute):

```
*.ipynb filter=nbstripout
runs/*.ipynb -filter
runs/*.ipynb -diff
```
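Since a notebook is plain JSON, what the clean filter effectively does can be sketched in a few lines (a simplified illustration, not nbstripout's actual implementation; the real tool also handles metadata and many configuration options):

```python
import json

def strip_outputs(ipynb_json: str) -> str:
    """Simplified nbstripout: clear outputs and execution counts
    from every code cell of a notebook's JSON text."""
    nb = json.loads(ipynb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)

raw = json.dumps({"cells": [{"cell_type": "code", "source": "1+1",
                             "outputs": [{"output_type": "execute_result"}],
                             "execution_count": 3}]})
cleaned = json.loads(strip_outputs(raw))
print(cleaned["cells"][0]["outputs"])  # []
```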
### 4.3 nbdime — Semantic Diff and Merge

nbdime understands notebook structure. Instead of diffing raw JSON, it diffs at the level of cells — knowing that `cells` is a list, `source` is a string, and outputs should often be ignored.

```bash
pip install nbdime

# Enable semantic git diff/merge for all .ipynb files
nbdime config-git --enable

# Now standard git commands are notebook-aware:
git diff HEAD notebook.ipynb  # semantic cell-level diff
git merge feature-branch      # uses nbdime for .ipynb conflict resolution
git log -p notebook.ipynb     # readable patch per commit
```

**Python API for agent reasoning:**

```python
import nbdime
import nbformat

with open("original.ipynb") as f:
    nb_base = nbformat.read(f, as_version=4)
with open("proposed.ipynb") as f:
    nb_pr = nbformat.read(f, as_version=4)

diff = nbdime.diff_notebooks(nb_base, nb_pr)

# diff is a list of structured ops the agent can reason about:
# [{"op": "patch", "key": "cells", "diff": [
#     {"op": "patch", "key": 3, "diff": [
#         {"op": "patch", "key": "source", "diff": [...string ops...]}
#     ]}
# ]}]

# Apply a diff (patch)
from nbdime.patching import patch
nb_result = patch(nb_base, diff)
```
### 4.4 The Full Agent PR Workflow

Here is the complete workflow — analogous to how Hermes makes PRs to code repos via Gitea.

**1. Agent reads the task notebook**

```python
import nbformat

with open("fleet_health_check.ipynb") as f:
    nb = nbformat.read(f, as_version=4)
```
**2. Agent locates and modifies relevant cells**

```python
# Find the parameter cell
params_cell = next(
    c for c in nb.cells
    if "parameters" in c.get("metadata", {}).get("tags", [])
)
# Update the threshold
params_cell.source = params_cell.source.replace("threshold = 0.95", "threshold = 0.90")

# Add explanatory markdown
nb.cells.insert(
    nb.cells.index(params_cell) + 1,
    nbformat.v4.new_markdown_cell(
        "**Note (Hermes 2026-04-06):** Threshold lowered from 0.95 to 0.90 "
        "based on false-positive analysis from the last 7 days of runs."
    ),
)
```
**3. Agent writes the notebook and commits to a branch**

```python
with open("fleet_health_check.ipynb", "w") as f:
    nbformat.write(nb, f)
```

```bash
git checkout -b agent/fleet-health-threshold-update
git add fleet_health_check.ipynb
git commit -m "feat(notebooks): lower fleet health threshold to 0.90 (#155)"
```
**4. Agent executes the proposed notebook to validate**

```python
import papermill as pm

pm.execute_notebook(
    "fleet_health_check.ipynb",
    "output/validation_run.ipynb",
    parameters={"run_id": "agent-validation-2026-04-06"},
    log_output=True,
)
```
**5. Agent collects results and compares**

```python
import scrapbook as sb

result = sb.read_notebook("output/validation_run.ipynb")
health_score = result.scraps["health_score"].data
alert_count = result.scraps["alert_count"].data
```
**6. Agent opens a PR with a results summary**

```bash
curl -X POST "$GITEA_API/pulls" \
  -H "Authorization: token $TOKEN" \
  -d '{
    "title": "feat(notebooks): lower fleet health threshold to 0.90",
    "body": "## Agent Analysis\n\n- Health score: 0.94 (was 0.89 with old threshold)\n- Alert count: 12 (was 47 false positives)\n- Validation run: output/validation_run.ipynb\n\nRefs #155",
    "head": "agent/fleet-health-threshold-update",
    "base": "main"
  }'
```

**7. Human reviews the PR using an nbdime diff**

The PR diff in Gitea shows the clean cell-level source changes (thanks to nbstripout). The human can also run `nbdiff-web original.ipynb proposed.ipynb` locally for a rich rendered diff with output comparison.
### 4.5 nbval — Regression Testing Notebooks

`nbval` treats each notebook cell as a pytest test case, re-executing it and comparing outputs to the stored values:

```bash
pip install nbval

# Strict: every cell output must match stored outputs
pytest --nbval fleet_health_check.ipynb

# Lax: only check cells marked with # NBVAL_CHECK_OUTPUT
pytest --nbval-lax fleet_health_check.ipynb
```

Cell-level markers (comments in cell source):

```python
# NBVAL_CHECK_OUTPUT — in lax mode, validate this cell's output
# NBVAL_SKIP — skip this cell entirely
# NBVAL_RAISES_EXCEPTION — expect an exception (the test passes if one is raised)
```

This becomes the CI gate: before a notebook PR is merged, run `pytest --nbval-lax` to verify no cells produce errors and critical output cells still produce expected values.
## 5. Gaps and Recommendations

### 5.1 Gap Assessment (Refining Timmy's Original Findings)

| Gap | Severity | Solution |
|---|---|---|
| No Hermes tool access in kernel | High | Inject `hermes_runtime` module (see §5.3) |
| No structured output protocol | High | Use the scrapbook `sb.glue()` pattern |
| No parameterization | Medium | Add a Papermill `"parameters"` cell to notebooks |
| XSRF/auth friction | Medium | Disable for local use; use JupyterHub token scopes for multi-user |
| No notebook CI/testing | Medium | Add nbval to the test suite |
| Raw `.ipynb` diffs in PRs | Medium | Install nbstripout + nbdime |
| No scheduling | Low | Papermill + the existing Hermes cron layer |
### 5.2 Short-Term Recommendations (This Month)

**1. `NotebookExecutor` tool**

A thin Hermes tool wrapping the ecosystem:

```python
class NotebookExecutor:
    def execute(self, input_path, output_path, parameters, timeout=300):
        """Wraps pm.execute_notebook(). Returns a structured result dict."""

    def collect_outputs(self, notebook_path):
        """Wraps sb.read_notebook(). Returns a dict of named scraps."""

    def inspect_parameters(self, notebook_path):
        """Wraps pm.inspect_notebook(). Returns the parameter schema."""

    def read_notebook(self, path):
        """Returns an nbformat NotebookNode for cell inspection/modification."""

    def write_notebook(self, nb, path):
        """Writes a modified NotebookNode back to disk."""

    def diff_notebooks(self, path_a, path_b):
        """Returns a structured nbdime diff for agent reasoning."""

    def validate(self, notebook_path):
        """Runs nbformat.validate() + optional pytest --nbval-lax."""
```

Execution result structure for the agent:

```python
{
    "status": "success" | "error",
    "duration_seconds": 12.34,
    "cells_executed": 15,
    "failed_cell": {                 # None on success
        "index": 7,
        "source": "model.fit(X, y)",
        "ename": "ValueError",
        "evalue": "Input contains NaN",
    },
    "scraps": {                      # from scrapbook
        "health_score": 0.94,
        "alert_count": 12,
    },
}
```
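A possible body for `execute()` under those assumptions (Papermill's API is real; the result-dict shape and timing logic are our design, and the `_runner` parameter is a hypothetical injection point added here so the normalization can be exercised without running a kernel):

```python
import time

def execute(input_path, output_path, parameters=None, timeout=300, _runner=None):
    """Run a notebook via Papermill and normalize the outcome into
    the result-dict shape sketched above. `_runner` is injectable
    for testing; it defaults to pm.execute_notebook."""
    if _runner is None:
        import papermill as pm  # real dependency in production
        _runner = pm.execute_notebook
    start = time.monotonic()
    result = {"status": "success", "failed_cell": None}
    try:
        _runner(input_path, output_path,
                parameters=parameters or {},
                execution_timeout=timeout)
    except Exception as err:  # PapermillExecutionError in practice
        result["status"] = "error"
        result["failed_cell"] = {
            "ename": type(err).__name__,
            "evalue": str(err),
        }
    result["duration_seconds"] = round(time.monotonic() - start, 2)
    return result
```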
**2. Fleet Health Check as a Notebook**

Convert the fleet health check epic into a parameterized notebook with:

- A `"parameters"` cell for run configuration (date range, thresholds, agent ID)
- Markdown cells narrating each step
- `sb.glue()` calls for structured outputs
- `# NBVAL_CHECK_OUTPUT` markers on critical cells

**3. Git hygiene for notebooks**

Install nbstripout + nbdime in the hermes-agent repo:

```bash
pip install nbstripout nbdime
nbstripout --install
nbdime config-git --enable
```

Add to `.gitattributes` (the `-filter` line unsets the filter so executed runs keep their outputs):

```
*.ipynb filter=nbstripout
*.ipynb diff=ipynb
runs/*.ipynb -filter
```
### 5.3 Medium-Term Recommendations (Next Quarter)

**4. `hermes_runtime` Python module**

Inject Hermes tool access into the kernel via a module that notebooks import:

```python
# In a kernel cell: from hermes_runtime import terminal, read_file, web_search
import hermes_runtime as hermes

results = hermes.web_search("fleet health metrics best practices")
hermes.terminal("systemctl status agent-fleet")
content = hermes.read_file("/var/log/hermes/agent.log")
```

This closes the most significant gap: notebooks gain the same tool access as skills, while retaining state persistence and narrative structure.

**5. Notebook-triggered cron**

Extend the Hermes cron layer to accept `.ipynb` paths as targets:

```yaml
# cron entry
schedule: "0 6 * * *"
type: notebook
path: notebooks/fleet_health_check.ipynb
parameters:
  run_id: "{{date}}"
  alert_threshold: 0.90
output_path: runs/fleet_health_{{date}}.ipynb
```

The cron runner calls `pm.execute_notebook()` and commits the output notebook to the repo.
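A minimal sketch of how such a runner might render the `{{date}}` placeholders before handing off to Papermill (the placeholder syntax and field names come from the entry above; the rendering helper itself is hypothetical):

```python
from datetime import date

def render(value, today=None):
    """Substitute {{date}} placeholders in one cron-entry string."""
    today = today or date.today().isoformat()
    return value.replace("{{date}}", today)

entry = {
    "path": "notebooks/fleet_health_check.ipynb",
    "parameters": {"run_id": "{{date}}", "alert_threshold": 0.90},
    "output_path": "runs/fleet_health_{{date}}.ipynb",
}
# Render string parameters, leave numbers untouched, then pass the
# results to pm.execute_notebook(entry["path"], output, parameters=params).
params = {k: render(v, today="2026-04-06") if isinstance(v, str) else v
          for k, v in entry["parameters"].items()}
output = render(entry["output_path"], today="2026-04-06")
print(output)  # runs/fleet_health_2026-04-06.ipynb
```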
**6. JupyterHub for multi-agent isolation**

If multiple agents need concurrent notebook execution, deploy JupyterHub with `DockerSpawner` or `KubeSpawner`. Each agent job gets an isolated container with its own kernel and no state bleed between runs.
---

## 6. Architecture Vision

```
┌──────────────────────────────────────────────────────────────────┐
│                           Hermes Agent                           │
│                                                                  │
│  Skills (one-shot)       Notebooks (multi-step)                  │
│  ┌─────────────────┐     ┌─────────────────────────────────┐     │
│  │ terminal()      │     │ .ipynb file                     │     │
│  │ web_search()    │     │ ├── Markdown (narrative)        │     │
│  │ read_file()     │     │ ├── Code cells (logic)          │     │
│  └─────────────────┘     │ ├── "parameters" cell           │     │
│                          │ └── sb.glue() outputs           │     │
│                          └───────────────┬─────────────────┘     │
│                                          │                       │
│                          ┌───────────────▼─────────────────┐     │
│                          │      NotebookExecutor tool      │     │
│                          │    (papermill + scrapbook +     │     │
│                          │    nbformat + nbdime + nbval)   │     │
│                          └───────────────┬─────────────────┘     │
│                                          │                       │
└──────────────────────────────────────────┼───────────────────────┘
                                           │
                       ┌───────────────────▼──────────────────┐
                       │          JupyterLab / Hub            │
                       │   (kernel execution environment)     │
                       └───────────────────┬──────────────────┘
                                           │
                       ┌───────────────────▼──────────────────┐
                       │             Git + Gitea              │
                       │  (nbstripout clean diffs,            │
                       │   nbdime semantic review,            │
                       │   PR workflow for notebook changes)  │
                       └──────────────────────────────────────┘
```

**Notebooks become the primary artifact of complex tasks:** the agent generates or edits cells, Papermill executes them reproducibly, scrapbook extracts structured outputs for agent decision-making, and the resulting `.ipynb` is both proof-of-work and a human-readable report. Skills remain for one-shot actions; notebooks own multi-step workflows.
---

## 7. Package Summary

| Package | Purpose | Install |
|---|---|---|
| `nbformat` | Read/write/validate `.ipynb` files | `pip install nbformat` |
| `nbconvert` | Execute and export notebooks | `pip install nbconvert` |
| `papermill` | Parameterize + execute in pipelines | `pip install papermill` |
| `scrapbook` | Structured output collection | `pip install scrapbook` |
| `nbdime` | Semantic diff/merge for git | `pip install nbdime` |
| `nbstripout` | Git filter for clean diffs | `pip install nbstripout` |
| `nbval` | pytest-based output regression | `pip install nbval` |
| `jupyter-kernel-gateway` | Headless REST kernel access | `pip install jupyter-kernel-gateway` |

---
## 8. References

- [Papermill GitHub (nteract/papermill)](https://github.com/nteract/papermill)
- [Scrapbook GitHub (nteract/scrapbook)](https://github.com/nteract/scrapbook)
- [nbformat format specification](https://nbformat.readthedocs.io/en/latest/format_description.html)
- [nbdime documentation](https://nbdime.readthedocs.io/)
- [nbdime diff format spec (JEP #8)](https://github.com/jupyter/enhancement-proposals/blob/master/08-notebook-diff/notebook-diff.md)
- [nbconvert execute API](https://nbconvert.readthedocs.io/en/latest/execute_api.html)
- [nbstripout README](https://github.com/kynan/nbstripout)
- [nbval GitHub (computationalmodelling/nbval)](https://github.com/computationalmodelling/nbval)
- [JupyterHub REST API](https://jupyterhub.readthedocs.io/en/stable/howto/rest.html)
- [JupyterHub Technical Overview](https://jupyterhub.readthedocs.io/en/latest/reference/technical-overview.html)
- [Jupyter Kernel Gateway](https://github.com/jupyter-server/kernel_gateway)