
Jupyter Notebooks as Core LLM Execution Layer — Deep Research Report

Issue: #155
Date: 2026-04-06
Status: Research / Spike
Prior Art: Timmy's initial spike (llm_execution_spike.ipynb, hamelnb bridge, JupyterLab on forge VPS)


Executive Summary

This report deepens the research from issue #155 into three areas requested by Rockachopa:

  1. The full Jupyter product suite — JupyterHub vs JupyterLab vs Notebook
  2. Papermill — the production-grade notebook execution engine already used in real data pipelines
  3. The "PR model for notebooks" — how agents can propose, diff, review, and merge changes to .ipynb files similarly to code PRs

The conclusion: an elegant, production-grade agent→notebook pipeline already exists as open-source tooling. We don't need to invent much — we need to compose what's there.


1. The Jupyter Product Suite

The Jupyter ecosystem has three distinct layers that are often conflated. Understanding the distinction is critical for architectural decisions.

1.1 Jupyter Notebook (Classic)

The original single-user interface. One browser tab = one .ipynb file. Version 6 is in maintenance-only mode. Version 7 was rebuilt on JupyterLab components and is functionally equivalent. For headless agent use, the UI is irrelevant — what matters is the .ipynb file format and the kernel execution model underneath.

1.2 JupyterLab

The current canonical Jupyter interface for human users: full IDE, multi-pane, terminal, extension manager, built-in diff viewer, and jupyterlab-git for Git workflows from the UI. JupyterLab is the recommended target for agent-collaborative workflows because:

  • It exposes the same REST API as classic Jupyter (kernel sessions, execute, contents)
  • Extensions like jupyterlab-git let a human co-reviewer inspect changes alongside the agent
  • The hamelnb bridge that Timmy already validated runs against a JupyterLab server

For agents: JupyterLab is the platform to run on. The agent doesn't interact with the UI — it uses the Jupyter REST API or Papermill on top of it.
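That REST surface is straightforward to drive from Python. A minimal sketch of building authenticated calls (the base URL and token below are placeholders for a local JupyterLab server, and `api_request` is an illustrative helper, not part of any Jupyter library):

```python
import json
import urllib.request

BASE = "http://localhost:8888"        # placeholder: local JupyterLab server
TOKEN = "replace-with-server-token"   # placeholder: server API token

def api_request(path, method="GET", body=None):
    """Build an authenticated call against the Jupyter Server REST API."""
    return urllib.request.Request(
        f"{BASE}/api{path}",
        method=method,
        data=json.dumps(body).encode() if body is not None else None,
        headers={
            "Authorization": f"token {TOKEN}",
            "Content-Type": "application/json",
        },
    )

# GET  /api/kernels            -> list running kernels
# POST /api/kernels            -> start a kernel
# GET  /api/contents/<path>    -> read a notebook via the contents API
start_kernel = api_request("/kernels", method="POST", body={"name": "python3"})
# urllib.request.urlopen(start_kernel) would send it to a live server
```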

1.3 JupyterHub — The Multi-User Orchestration Layer

JupyterHub is not a UI. It is a multi-user server that spawns, manages, and proxies individual single-user Jupyter servers. This is the production infrastructure layer.

[Agent / Browser / API Client]
         |
      [Proxy]  (configurable-http-proxy)
      /      \
   [Hub]    [Single-User Jupyter Server per user/agent]
 (Auth,      (standard JupyterLab/Notebook server)
  Spawner,
  REST API)

Key components:

  • Hub: Manages auth, user database, spawner lifecycle, REST API
  • Proxy: Routes /hub/* to Hub, /user/<name>/* to that user's server
  • Spawner: How single-user servers are started. Default = local process. Production options include KubeSpawner (Kubernetes pod per user) and DockerSpawner (container per user)
  • Authenticator: PAM, OAuth, DummyAuthenticator (for isolated agent environments)

JupyterHub REST API (relevant for agent orchestration):

# Spawn a named server for an agent service account
POST /hub/api/users/<username>/servers/<name>

# Stop it when done
DELETE /hub/api/users/<username>/servers/<name>

# Create a scoped API token for the agent
POST /hub/api/users/<username>/tokens

# Check server status
GET /hub/api/users/<username>

Why this matters for Hermes: JupyterHub gives us isolated kernel environments per agent task, programmable lifecycle management, and a clean auth model. Instead of running one shared JupyterLab instance on the forge VPS, we could spawn ephemeral single-user servers per notebook execution run — each with its own kernel, clean state, and resource limits.
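As a sketch of how an orchestrator might drive that lifecycle, the endpoints above can be mapped to (method, path) pairs; the service-account and server names are illustrative, and each real call would carry an `Authorization: token <api-token>` header:

```python
def hub_api_call(op, username, server_name=""):
    """Map an agent lifecycle operation onto the JupyterHub REST
    endpoints listed above. Returns (HTTP method, path)."""
    routes = {
        "spawn":  ("POST",   f"/hub/api/users/{username}/servers/{server_name}"),
        "stop":   ("DELETE", f"/hub/api/users/{username}/servers/{server_name}"),
        "token":  ("POST",   f"/hub/api/users/{username}/tokens"),
        "status": ("GET",    f"/hub/api/users/{username}"),
    }
    return routes[op]

method, path = hub_api_call("spawn", "hermes-agent", "run-001")
# -> ("POST", "/hub/api/users/hermes-agent/servers/run-001")
```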

1.4 Jupyter Kernel Gateway — Minimal Headless Execution

If JupyterHub is too heavy, jupyter-kernel-gateway exposes just the kernel protocol over REST + WebSocket:

pip install jupyter-kernel-gateway
jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.jupyter_websocket

# Start kernel
POST /api/kernels
# Execute via WebSocket on Jupyter messaging protocol
WS /api/kernels/<kernel_id>/channels
# Stop kernel
DELETE /api/kernels/<kernel_id>

This is the lowest-level option: no notebook management, just raw kernel access. Suitable if we want to build our own execution layer from scratch.
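For illustration, the frame an agent sends over that WebSocket follows the Jupyter messaging protocol. A minimal execute_request sketch (field layout per protocol version 5.3; the `channel` key is how the multiplexed `/channels` endpoint routes messages, and the username is arbitrary):

```python
import json
import uuid

def execute_request(code, session):
    """Build an execute_request frame for /api/kernels/<id>/channels."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "msg_type": "execute_request",
            "session": session,
            "username": "hermes",
            "version": "5.3",
        },
        "parent_header": {},
        "metadata": {},
        "content": {"code": code, "silent": False},
        "channel": "shell",  # the shell channel carries execution requests
    }

msg = execute_request("print('hello')", session=uuid.uuid4().hex)
frame = json.dumps(msg)  # send as a text frame; results arrive as iopub messages
```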


2. Papermill — Production Notebook Execution

Papermill is the missing link between "notebook as experiment" and "notebook as repeatable pipeline task." It is already used at scale in industry data pipelines (Netflix, Airbnb, etc.).

2.1 Core Concept: Parameterization

Papermill's key innovation is parameter injection. Tag a cell in the notebook with "parameters":

# Cell tagged "parameters" (defaults — defined by notebook author)
alpha = 0.5
batch_size = 32
model_name = "baseline"

At runtime, Papermill inserts a new cell immediately after, tagged "injected-parameters", that overrides the defaults:

# Cell tagged "injected-parameters" (injected by Papermill at runtime)
alpha = 0.01
batch_size = 128
model_name = "experiment_007"

Because Python executes top-to-bottom, the injected cell shadows the defaults. The original notebook is never mutated — Papermill reads input, writes to a new output file.

2.2 Python API

import papermill as pm

nb = pm.execute_notebook(
    input_path="analysis.ipynb",     # source (can be s3://, az://, gs://)
    output_path="output/run_001.ipynb",  # destination (persists outputs)
    parameters={
        "alpha": 0.01,
        "n_samples": 1000,
        "run_id": "fleet-check-2026-04-06",
    },
    kernel_name="python3",
    execution_timeout=300,           # per-cell timeout in seconds
    log_output=True,                 # stream cell output to logger
    cwd="/path/to/notebook/",        # working directory
)
# Returns: NotebookNode (the fully executed notebook with all outputs)

On cell failure, Papermill raises PapermillExecutionError with:

  • cell_index — which cell failed
  • source — the failing cell's code
  • ename / evalue — exception type and message
  • traceback — full traceback

Even on failure, the output notebook is written with whatever cells completed — enabling partial-run inspection.
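A sketch of catching that error and condensing it for agent consumption — the `failure_report` helper is an assumption, not part of Papermill; the attribute names are the ones listed above:

```python
def failure_report(err):
    """Condense a PapermillExecutionError into an actionable dict."""
    return {
        "failed_cell": err.cell_index,
        "source": err.source,
        "error": f"{err.ename}: {err.evalue}",
    }

# Usage (requires papermill installed):
# import papermill as pm
# try:
#     pm.execute_notebook("analysis.ipynb", "output/run_001.ipynb")
# except pm.PapermillExecutionError as err:
#     report = failure_report(err)   # hand to the agent for triage
```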

2.3 CLI

# Basic execution
papermill analysis.ipynb output/run_001.ipynb \
  -p alpha 0.01 \
  -p n_samples 1000

# From YAML parameter file
papermill analysis.ipynb output/run_001.ipynb -f params.yaml

# CI-friendly: log outputs, no progress bar
papermill analysis.ipynb output/run_001.ipynb \
  --log-output \
  --no-progress-bar \
  --execution-timeout 300 \
  -p run_id "fleet-check-2026-04-06"

# Prepare only (inject params, skip execution — for preview/inspection)
papermill analysis.ipynb preview.ipynb --prepare-only -p alpha 0.01

# Inspect parameter schema
papermill --help-notebook analysis.ipynb

Remote storage is built in — pip install papermill[s3] enables s3:// paths for both input and output. Azure and GCS are also supported. For Hermes, this means notebook runs can be stored in object storage and retrieved later for audit.

2.4 Scrapbook — Structured Output Collection

scrapbook is Papermill's companion for extracting structured data from executed notebooks. Inside a notebook cell:

import scrapbook as sb

# Write typed outputs (stored as special display_data in cell outputs)
sb.glue("accuracy", 0.9342)
sb.glue("metrics", {"precision": 0.91, "recall": 0.93, "f1": 0.92})
sb.glue("results_df", df, "pandas")  # DataFrames too

After execution, from the agent:

import scrapbook as sb

nb = sb.read_notebook("output/fleet-check-2026-04-06.ipynb")
metrics = nb.scraps["metrics"].data   # -> {"precision": 0.91, ...}
accuracy = nb.scraps["accuracy"].data # -> 0.9342

# Or aggregate across many runs
book = sb.read_notebooks("output/")
book.scrap_dataframe  # -> pd.DataFrame with all scraps + filenames

This is the clean interface between notebook execution and agent decision-making: the notebook outputs its findings as named, typed scraps; the agent reads them programmatically and acts.

2.5 How Papermill Compares to hamelnb

| Capability | hamelnb | Papermill |
| --- | --- | --- |
| Stateful kernel session | Yes | No (fresh kernel per run) |
| Parameter injection | No | Yes |
| Persistent output notebook | No | Yes |
| Remote storage (S3/Azure) | No | Yes |
| Per-cell timing/metadata | No | Yes (in output nb metadata) |
| Error isolation (partial runs) | No | Yes |
| Production pipeline use | Experimental | Industry-standard |
| Structured output collection | No | Yes (via scrapbook) |

Verdict: hamelnb is great for interactive REPL-style exploration (where state accumulates). Papermill is better for task execution (where we want reproducible, parameterized, auditable runs). They serve different use cases. Hermes needs both.


3. The .ipynb File Format — What the Agent Is Actually Working With

Understanding the format is essential for the "PR model." A .ipynb file is JSON with this structure:

{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
    "language_info": {"name": "python", "version": "3.10.0"}
  },
  "cells": [
    {
      "id": "a1b2c3d4",
      "cell_type": "markdown",
      "source": "# Fleet Health Check\n\nThis notebook checks system health.",
      "metadata": {}
    },
    {
      "id": "e5f6g7h8",
      "cell_type": "code",
      "source": "alpha = 0.5\nthreshold = 0.95",
      "metadata": {"tags": ["parameters"]},
      "execution_count": null,
      "outputs": []
    },
    {
      "id": "i9j0k1l2",
      "cell_type": "code",
      "source": "import sys\nprint(sys.version)",
      "metadata": {},
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": "3.10.0 (default, ...)\n"
        }
      ]
    }
  ]
}

The nbformat Python library provides a clean API for working with this:

import nbformat

# Read
with open("notebook.ipynb") as f:
    nb = nbformat.read(f, as_version=4)

# Navigate
for cell in nb.cells:
    if cell.cell_type == "code":
        print(cell.source)

# Modify
nb.cells[2].source = "import sys\nprint('updated')"

# Add cells
new_md = nbformat.v4.new_markdown_cell("## Agent Analysis\nInserted by Hermes.")
nb.cells.insert(3, new_md)

# Write
with open("modified.ipynb", "w") as f:
    nbformat.write(nb, f)

# Validate
nbformat.validate(nb)  # raises nbformat.ValidationError on invalid format

4. The PR Model for Notebooks

This is the elegant architecture Rockachopa described: agents making PRs to notebooks the same way they make PRs to code. Here's how the full stack enables it.

4.1 The Problem: Raw .ipynb Diffs Are Unusable

Without tooling, a git diff on a notebook that was merely re-run (no source changes) produces thousands of lines of JSON changes — execution counts, timestamps, base64-encoded plot images. Code review on raw .ipynb diffs is impractical.

4.2 nbstripout — Clean Git History

nbstripout installs a git clean filter that strips outputs before files enter the git index. The working copy is untouched; only what gets committed is clean.

pip install nbstripout
nbstripout --install   # per-repo
# or
nbstripout --install --global  # all repos

This writes to .git/config:

[filter "nbstripout"]
    clean = nbstripout
    smudge = cat
    required = true

[diff "ipynb"]
    textconv = nbstripout -t

And to .gitattributes:

*.ipynb filter=nbstripout
*.ipynb diff=ipynb

Now git diff shows only source changes — same as reviewing a .py file.

For executed-output notebooks (where we want to keep outputs for audit): use a separate path like runs/ or outputs/ excluded from the filter via .gitattributes:

*.ipynb filter=nbstripout
runs/*.ipynb !filter
runs/*.ipynb !diff

4.3 nbdime — Semantic Diff and Merge

nbdime understands notebook structure. Instead of diffing raw JSON, it diffs at the level of cells — knowing that cells is a list, source is a string, and outputs should often be ignored.

pip install nbdime

# Enable semantic git diff/merge for all .ipynb files
nbdime config-git --enable

# Now standard git commands are notebook-aware:
git diff HEAD notebook.ipynb          # semantic cell-level diff
git merge feature-branch              # uses nbdime for .ipynb conflict resolution
git log -p notebook.ipynb            # readable patch per commit

Python API for agent reasoning:

import nbdime
import nbformat

nb_base = nbformat.read("original.ipynb", as_version=4)
nb_pr   = nbformat.read("proposed.ipynb", as_version=4)

diff = nbdime.diff_notebooks(nb_base, nb_pr)

# diff is a list of structured ops the agent can reason about:
# [{"op": "patch", "key": "cells", "diff": [
#     {"op": "patch", "key": 3, "diff": [
#         {"op": "patch", "key": "source", "diff": [...string ops...]}
#     ]}
# ]}]

# Apply a diff (patch)
from nbdime.patching import patch
nb_result = patch(nb_base, diff)

4.4 The Full Agent PR Workflow

Here is the complete workflow — analogous to how Hermes makes PRs to code repos via Gitea:

1. Agent reads the task notebook

nb = nbformat.read(open("fleet_health_check.ipynb"), as_version=4)

2. Agent locates and modifies relevant cells

# Find parameter cell
params_cell = next(
    c for c in nb.cells
    if "parameters" in c.get("metadata", {}).get("tags", [])
)
# Update threshold
params_cell.source = params_cell.source.replace("threshold = 0.95", "threshold = 0.90")

# Add explanatory markdown
nb.cells.insert(
    nb.cells.index(params_cell) + 1,
    nbformat.v4.new_markdown_cell(
        "**Note (Hermes 2026-04-06):** Threshold lowered from 0.95 to 0.90 "
        "based on false-positive analysis from last 7 days of runs."
    )
)

3. Agent writes and commits to a branch

git checkout -b agent/fleet-health-threshold-update
# (in Python) nbformat.write(nb, "fleet_health_check.ipynb")
git add fleet_health_check.ipynb
git commit -m "feat(notebooks): lower fleet health threshold to 0.90 (#155)"

4. Agent executes the proposed notebook to validate

import papermill as pm

pm.execute_notebook(
    "fleet_health_check.ipynb",
    "output/validation_run.ipynb",
    parameters={"run_id": "agent-validation-2026-04-06"},
    log_output=True,
)

5. Agent collects results and compares

import scrapbook as sb

result = sb.read_notebook("output/validation_run.ipynb")
health_score = result.scraps["health_score"].data
alert_count = result.scraps["alert_count"].data

6. Agent opens PR with results summary

curl -X POST "$GITEA_API/pulls" \
  -H "Authorization: token $TOKEN" \
  -d '{
    "title": "feat(notebooks): lower fleet health threshold to 0.90",
    "body": "## Agent Analysis\n\n- Health score: 0.94 (was 0.89 with old threshold)\n- Alert count: 12 (was 47 false positives)\n- Validation run: output/validation_run.ipynb\n\nRefs #155",
    "head": "agent/fleet-health-threshold-update",
    "base": "main"
  }'

7. Human reviews the PR using nbdime diff

The PR diff in Gitea shows the clean cell-level source changes (thanks to nbstripout). The human can also run nbdiff-web original.ipynb proposed.ipynb locally for rich rendered diff with output comparison.

4.5 nbval — Regression Testing Notebooks

nbval treats each notebook cell as a pytest test case, re-executing and comparing outputs to stored values:

pip install nbval

# Strict: every cell output must match stored outputs
pytest --nbval fleet_health_check.ipynb

# Lax: only check cells marked with # NBVAL_CHECK_OUTPUT
pytest --nbval-lax fleet_health_check.ipynb

Cell-level markers (comments in cell source):

# NBVAL_CHECK_OUTPUT   — in lax mode, validate this cell's output
# NBVAL_SKIP           — skip this cell entirely
# NBVAL_RAISES_EXCEPTION  — expect an exception (test passes if raised)

This becomes the CI gate: before a notebook PR is merged, run pytest --nbval-lax to verify no cells produce errors and critical output cells still produce expected values.
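A critical output cell in the health-check notebook might then look like this (the `health_score` value is illustrative, not taken from the real notebook):

```python
# NBVAL_CHECK_OUTPUT
# Gate cell: nbval re-runs this and fails the PR if the printed
# value no longer matches the stored output.
health_score = 0.94
print(f"health_score={health_score:.2f}")  # prints: health_score=0.94
```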


5. Gaps and Recommendations

5.1 Gap Assessment (Refining Timmy's Original Findings)

| Gap | Severity | Solution |
| --- | --- | --- |
| No Hermes tool access in kernel | High | Inject hermes_runtime module (see §5.2) |
| No structured output protocol | High | Use scrapbook sb.glue() pattern |
| No parameterization | Medium | Add Papermill "parameters" cell to notebooks |
| XSRF/auth friction | Medium | Disable for local; use JupyterHub token scopes for multi-user |
| No notebook CI/testing | Medium | Add nbval to test suite |
| Raw .ipynb diffs in PRs | Medium | Install nbstripout + nbdime |
| No scheduling | Low | Papermill + existing Hermes cron layer |

5.2 Short-Term Recommendations (This Month)

1. NotebookExecutor tool

A thin Hermes tool wrapping the ecosystem:

class NotebookExecutor:
    def execute(self, input_path, output_path, parameters, timeout=300):
        """Wraps pm.execute_notebook(). Returns structured result dict."""

    def collect_outputs(self, notebook_path):
        """Wraps sb.read_notebook(). Returns dict of named scraps."""

    def inspect_parameters(self, notebook_path):
        """Wraps pm.inspect_notebook(). Returns parameter schema."""

    def read_notebook(self, path):
        """Returns nbformat NotebookNode for cell inspection/modification."""

    def write_notebook(self, nb, path):
        """Writes modified NotebookNode back to disk."""

    def diff_notebooks(self, path_a, path_b):
        """Returns structured nbdime diff for agent reasoning."""

    def validate(self, notebook_path):
        """Runs nbformat.validate() + optional pytest --nbval-lax."""

Execution result structure for the agent:

{
    "status": "success" | "error",
    "duration_seconds": 12.34,
    "cells_executed": 15,
    "failed_cell": {       # None on success
        "index": 7,
        "source": "model.fit(X, y)",
        "ename": "ValueError",
        "evalue": "Input contains NaN",
    },
    "scraps": {            # from scrapbook
        "health_score": 0.94,
        "alert_count": 12,
    },
}

2. Fleet Health Check as a Notebook

Convert the fleet health check epic into a parameterized notebook with:

  • "parameters" cell for run configuration (date range, thresholds, agent ID)
  • Markdown cells narrating each step
  • sb.glue() calls for structured outputs
  • # NBVAL_CHECK_OUTPUT markers on critical cells

3. Git hygiene for notebooks

Install nbstripout + nbdime in the hermes-agent repo:

pip install nbstripout nbdime
nbstripout --install
nbdime config-git --enable

Add to .gitattributes:

*.ipynb filter=nbstripout
*.ipynb diff=ipynb
runs/*.ipynb !filter

5.3 Medium-Term Recommendations (Next Quarter)

4. hermes_runtime Python module

Inject Hermes tool access into the kernel via a module that notebooks import:

# In kernel cell: from hermes_runtime import terminal, read_file, web_search
import hermes_runtime as hermes

results = hermes.web_search("fleet health metrics best practices")
hermes.terminal("systemctl status agent-fleet")
content = hermes.read_file("/var/log/hermes/agent.log")

This closes the most significant gap: notebooks gain the same tool access as skills, while retaining state persistence and narrative structure.

5. Notebook-triggered cron

Extend the Hermes cron layer to accept .ipynb paths as targets:

# cron entry
schedule: "0 6 * * *"
type: notebook
path: notebooks/fleet_health_check.ipynb
parameters:
  run_id: "{{date}}"
  alert_threshold: 0.90
output_path: runs/fleet_health_{{date}}.ipynb

The cron runner calls pm.execute_notebook() and commits the output to the repo.
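One piece the runner needs is template expansion. A sketch (the {{date}} convention mirrors the entry above; the `render_entry` helper name is an assumption):

```python
def render_entry(entry, today):
    """Expand {{date}} placeholders in a notebook cron entry."""
    def render(v):
        return v.replace("{{date}}", today) if isinstance(v, str) else v
    return {
        "input_path": entry["path"],
        "output_path": render(entry["output_path"]),
        "parameters": {k: render(v) for k, v in entry["parameters"].items()},
    }

job = render_entry(
    {
        "path": "notebooks/fleet_health_check.ipynb",
        "output_path": "runs/fleet_health_{{date}}.ipynb",
        "parameters": {"run_id": "{{date}}", "alert_threshold": 0.90},
    },
    today="2026-04-06",
)
# job maps onto pm.execute_notebook(input_path=..., output_path=..., parameters=...)
```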

6. JupyterHub for multi-agent isolation

If multiple agents need concurrent notebook execution, deploy JupyterHub with DockerSpawner or KubeSpawner. Each agent job gets an isolated container with its own kernel, no state bleed between runs.
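A minimal jupyterhub_config.py sketch for that deployment — the image name and resource limits are illustrative, `"docker"` is DockerSpawner's registered entry-point name, and `c` is the config object JupyterHub provides when loading the file:

```python
# jupyterhub_config.py — per-agent container isolation (illustrative values)
c.JupyterHub.spawner_class = "docker"
c.DockerSpawner.image = "jupyter/scipy-notebook"
c.DockerSpawner.mem_limit = "2G"   # cap memory per agent job
c.DockerSpawner.cpu_limit = 1.0    # cap CPU per agent job
c.DockerSpawner.remove = True      # discard container state after each run
```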


6. Architecture Vision

┌─────────────────────────────────────────────────────────────────┐
│                        Hermes Agent                             │
│                                                                  │
│  Skills (one-shot)          Notebooks (multi-step)              │
│  ┌─────────────────┐       ┌─────────────────────────────────┐  │
│  │ terminal()      │       │ .ipynb file                     │  │
│  │ web_search()    │       │  ├── Markdown (narrative)       │  │
│  │ read_file()     │       │  ├── Code cells (logic)         │  │
│  └─────────────────┘       │  ├── "parameters" cell          │  │
│                             │  └── sb.glue() outputs          │  │
│                             └──────────────┬────────────────┘  │
│                                            │                    │
│                             ┌──────────────▼────────────────┐  │
│                             │   NotebookExecutor tool        │  │
│                             │  (papermill + scrapbook +      │  │
│                             │   nbformat + nbdime + nbval)   │  │
│                             └──────────────┬────────────────┘  │
│                                            │                    │
└────────────────────────────────────────────┼────────────────────┘
                                             │
                         ┌───────────────────▼──────────────────┐
                         │          JupyterLab / Hub             │
                         │  (kernel execution environment)       │
                         └───────────────────┬──────────────────┘
                                             │
                         ┌───────────────────▼──────────────────┐
                         │           Git + Gitea                 │
                         │  (nbstripout clean diffs,            │
                         │   nbdime semantic review,            │
                         │   PR workflow for notebook changes)   │
                         └──────────────────────────────────────┘

Notebooks become the primary artifact of complex tasks: the agent generates or edits cells, Papermill executes them reproducibly, scrapbook extracts structured outputs for agent decision-making, and the resulting .ipynb is both proof-of-work and human-readable report. Skills remain for one-shot actions. Notebooks own multi-step workflows.


7. Package Summary

| Package | Purpose | Install |
| --- | --- | --- |
| nbformat | Read/write/validate .ipynb files | pip install nbformat |
| nbconvert | Execute and export notebooks | pip install nbconvert |
| papermill | Parameterize + execute in pipelines | pip install papermill |
| scrapbook | Structured output collection | pip install scrapbook |
| nbdime | Semantic diff/merge for git | pip install nbdime |
| nbstripout | Git filter for clean diffs | pip install nbstripout |
| nbval | pytest-based output regression | pip install nbval |
| jupyter-kernel-gateway | Headless REST kernel access | pip install jupyter-kernel-gateway |

8. References