
Jupyter Notebooks as Core LLM Execution Layer — Deep Research Report

Issue: #155
Date: 2026-04-06
Status: Research / Spike
Prior Art: Timmy's initial spike (llm_execution_spike.ipynb, hamelnb bridge, JupyterLab on forge VPS)


Executive Summary

This report deepens the research from issue #155 into three areas requested by Rockachopa:

  1. The full Jupyter product suite — JupyterHub vs JupyterLab vs Notebook
  2. Papermill — the production-grade notebook execution engine already used in real data pipelines
  3. The "PR model for notebooks" — how agents can propose, diff, review, and merge changes to .ipynb files similarly to code PRs

The conclusion: an elegant, production-grade agent→notebook pipeline already exists as open-source tooling. We don't need to invent much — we need to compose what's there.


1. The Jupyter Product Suite

The Jupyter ecosystem has three distinct layers that are often conflated. Understanding the distinction is critical for architectural decisions.

1.1 Jupyter Notebook (Classic)

The original single-user interface. One browser tab = one .ipynb file. Version 6 is in maintenance-only mode. Version 7 was rebuilt on JupyterLab components and is functionally equivalent. For headless agent use, the UI is irrelevant — what matters is the .ipynb file format and the kernel execution model underneath.

1.2 JupyterLab

The current canonical Jupyter interface for human users: full IDE, multi-pane, terminal, extension manager, built-in diff viewer, and jupyterlab-git for Git workflows from the UI. JupyterLab is the recommended target for agent-collaborative workflows because:

  • It exposes the same REST API as classic Jupyter (kernel sessions, execute, contents)
  • Extensions like jupyterlab-git let a human co-reviewer inspect changes alongside the agent
  • The hamelnb bridge that Timmy already validated runs against a JupyterLab server

For agents: JupyterLab is the platform to run on. The agent doesn't interact with the UI — it uses the Jupyter REST API or Papermill on top of it.
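That REST surface is straightforward to drive from Python. A minimal sketch of building authenticated calls (the base URL and token below are placeholders for a local JupyterLab server, and `api_request` is an illustrative helper, not part of any Jupyter library):

```python
import json
import urllib.request

BASE = "http://localhost:8888"        # placeholder: local JupyterLab server
TOKEN = "replace-with-server-token"   # placeholder: server API token

def api_request(path, method="GET", body=None):
    """Build an authenticated call against the Jupyter Server REST API."""
    return urllib.request.Request(
        f"{BASE}/api{path}",
        method=method,
        data=json.dumps(body).encode() if body is not None else None,
        headers={
            "Authorization": f"token {TOKEN}",
            "Content-Type": "application/json",
        },
    )

# GET  /api/kernels            -> list running kernels
# POST /api/kernels            -> start a kernel
# GET  /api/contents/<path>    -> read a notebook via the contents API
start_kernel = api_request("/kernels", method="POST", body={"name": "python3"})
# urllib.request.urlopen(start_kernel) would send it to a live server
```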

1.3 JupyterHub — The Multi-User Orchestration Layer

JupyterHub is not a UI. It is a multi-user server that spawns, manages, and proxies individual single-user Jupyter servers. This is the production infrastructure layer.

[Agent / Browser / API Client]
         |
      [Proxy]  (configurable-http-proxy)
      /      \
   [Hub]    [Single-User Jupyter Server per user/agent]
 (Auth,      (standard JupyterLab/Notebook server)
  Spawner,
  REST API)

Key components:

  • Hub: Manages auth, user database, spawner lifecycle, REST API
  • Proxy: Routes /hub/* to Hub, /user/<name>/* to that user's server
  • Spawner: How single-user servers are started. Default = local process. Production options include KubeSpawner (Kubernetes pod per user) and DockerSpawner (container per user)
  • Authenticator: PAM, OAuth, DummyAuthenticator (for isolated agent environments)

JupyterHub REST API (relevant for agent orchestration):

# Spawn a named server for an agent service account
POST /hub/api/users/<username>/servers/<name>

# Stop it when done
DELETE /hub/api/users/<username>/servers/<name>

# Create a scoped API token for the agent
POST /hub/api/users/<username>/tokens

# Check server status
GET /hub/api/users/<username>

Why this matters for Hermes: JupyterHub gives us isolated kernel environments per agent task, programmable lifecycle management, and a clean auth model. Instead of running one shared JupyterLab instance on the forge VPS, we could spawn ephemeral single-user servers per notebook execution run — each with its own kernel, clean state, and resource limits.
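As a sketch of how an orchestrator might drive that lifecycle, the endpoints above can be mapped to (method, path) pairs; the service-account and server names are illustrative, and each real call would carry an `Authorization: token <api-token>` header:

```python
def hub_api_call(op, username, server_name=""):
    """Map an agent lifecycle operation onto the JupyterHub REST
    endpoints listed above. Returns (HTTP method, path)."""
    routes = {
        "spawn":  ("POST",   f"/hub/api/users/{username}/servers/{server_name}"),
        "stop":   ("DELETE", f"/hub/api/users/{username}/servers/{server_name}"),
        "token":  ("POST",   f"/hub/api/users/{username}/tokens"),
        "status": ("GET",    f"/hub/api/users/{username}"),
    }
    return routes[op]

method, path = hub_api_call("spawn", "hermes-agent", "run-001")
# -> ("POST", "/hub/api/users/hermes-agent/servers/run-001")
```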

1.4 Jupyter Kernel Gateway — Minimal Headless Execution

If JupyterHub is too heavy, jupyter-kernel-gateway exposes just the kernel protocol over REST + WebSocket:

pip install jupyter-kernel-gateway
jupyter kernelgateway --KernelGatewayApp.api=kernel_gateway.jupyter_websocket

# Start kernel
POST /api/kernels
# Execute via WebSocket on Jupyter messaging protocol
WS /api/kernels/<kernel_id>/channels
# Stop kernel
DELETE /api/kernels/<kernel_id>

This is the lowest-level option: no notebook management, just raw kernel access. Suitable if we want to build our own execution layer from scratch.
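For illustration, the frame an agent sends over that WebSocket follows the Jupyter messaging protocol. A minimal execute_request sketch (field layout per protocol version 5.3; the `channel` key is how the multiplexed `/channels` endpoint routes messages, and the username is arbitrary):

```python
import json
import uuid

def execute_request(code, session):
    """Build an execute_request frame for /api/kernels/<id>/channels."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "msg_type": "execute_request",
            "session": session,
            "username": "hermes",
            "version": "5.3",
        },
        "parent_header": {},
        "metadata": {},
        "content": {"code": code, "silent": False},
        "channel": "shell",  # the shell channel carries execution requests
    }

msg = execute_request("print('hello')", session=uuid.uuid4().hex)
frame = json.dumps(msg)  # send as a text frame; results arrive as iopub messages
```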


2. Papermill — Production Notebook Execution

Papermill is the missing link between "notebook as experiment" and "notebook as repeatable pipeline task." It is already used at scale in industry data pipelines (Netflix, Airbnb, etc.).

2.1 Core Concept: Parameterization

Papermill's key innovation is parameter injection. Tag a cell in the notebook with "parameters":

# Cell tagged "parameters" (defaults — defined by notebook author)
alpha = 0.5
batch_size = 32
model_name = "baseline"

At runtime, Papermill inserts a new cell immediately after, tagged "injected-parameters", that overrides the defaults:

# Cell tagged "injected-parameters" (injected by Papermill at runtime)
alpha = 0.01
batch_size = 128
model_name = "experiment_007"

Because Python executes top-to-bottom, the injected cell shadows the defaults. The original notebook is never mutated — Papermill reads input, writes to a new output file.

2.2 Python API

import papermill as pm

nb = pm.execute_notebook(
    input_path="analysis.ipynb",     # source (can be s3://, az://, gs://)
    output_path="output/run_001.ipynb",  # destination (persists outputs)
    parameters={
        "alpha": 0.01,
        "n_samples": 1000,
        "run_id": "fleet-check-2026-04-06",
    },
    kernel_name="python3",
    execution_timeout=300,           # per-cell timeout in seconds
    log_output=True,                 # stream cell output to logger
    cwd="/path/to/notebook/",        # working directory
)
# Returns: NotebookNode (the fully executed notebook with all outputs)

On cell failure, Papermill raises PapermillExecutionError with:

  • cell_index — which cell failed
  • source — the failing cell's code
  • ename / evalue — exception type and message
  • traceback — full traceback

Even on failure, the output notebook is written with whatever cells completed — enabling partial-run inspection.
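A sketch of catching that error and condensing it for agent consumption — the `failure_report` helper is an assumption, not part of Papermill; the attribute names are the ones listed above:

```python
def failure_report(err):
    """Condense a PapermillExecutionError into an actionable dict."""
    return {
        "failed_cell": err.cell_index,
        "source": err.source,
        "error": f"{err.ename}: {err.evalue}",
    }

# Usage (requires papermill installed):
# import papermill as pm
# try:
#     pm.execute_notebook("analysis.ipynb", "output/run_001.ipynb")
# except pm.PapermillExecutionError as err:
#     report = failure_report(err)   # hand to the agent for triage
```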

2.3 CLI

# Basic execution
papermill analysis.ipynb output/run_001.ipynb \
  -p alpha 0.01 \
  -p n_samples 1000

# From YAML parameter file
papermill analysis.ipynb output/run_001.ipynb -f params.yaml

# CI-friendly: log outputs, no progress bar
papermill analysis.ipynb output/run_001.ipynb \
  --log-output \
  --no-progress-bar \
  --execution-timeout 300 \
  -p run_id "fleet-check-2026-04-06"

# Prepare only (inject params, skip execution — for preview/inspection)
papermill analysis.ipynb preview.ipynb --prepare-only -p alpha 0.01

# Inspect parameter schema
papermill --help-notebook analysis.ipynb

Remote storage is built in — pip install papermill[s3] enables s3:// paths for both input and output. Azure and GCS are also supported. For Hermes, this means notebook runs can be stored in object storage and retrieved later for audit.

2.4 Scrapbook — Structured Output Collection

scrapbook is Papermill's companion for extracting structured data from executed notebooks. Inside a notebook cell:

import scrapbook as sb

# Write typed outputs (stored as special display_data in cell outputs)
sb.glue("accuracy", 0.9342)
sb.glue("metrics", {"precision": 0.91, "recall": 0.93, "f1": 0.92})
sb.glue("results_df", df, "pandas")  # DataFrames too

After execution, from the agent:

import scrapbook as sb

nb = sb.read_notebook("output/fleet-check-2026-04-06.ipynb")
metrics = nb.scraps["metrics"].data   # -> {"precision": 0.91, ...}
accuracy = nb.scraps["accuracy"].data # -> 0.9342

# Or aggregate across many runs
book = sb.read_notebooks("output/")
book.scrap_dataframe  # -> pd.DataFrame with all scraps + filenames

This is the clean interface between notebook execution and agent decision-making: the notebook outputs its findings as named, typed scraps; the agent reads them programmatically and acts.

2.5 How Papermill Compares to hamelnb

| Capability | hamelnb | Papermill |
| --- | --- | --- |
| Stateful kernel session | Yes | No (fresh kernel per run) |
| Parameter injection | No | Yes |
| Persistent output notebook | No | Yes |
| Remote storage (S3/Azure) | No | Yes |
| Per-cell timing/metadata | No | Yes (in output nb metadata) |
| Error isolation (partial runs) | No | Yes |
| Production pipeline use | Experimental | Industry-standard |
| Structured output collection | No | Yes (via scrapbook) |

Verdict: hamelnb is great for interactive REPL-style exploration (where state accumulates). Papermill is better for task execution (where we want reproducible, parameterized, auditable runs). They serve different use cases. Hermes needs both.


3. The .ipynb File Format — What the Agent Is Actually Working With

Understanding the format is essential for the "PR model." A .ipynb file is JSON with this structure:

{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"},
    "language_info": {"name": "python", "version": "3.10.0"}
  },
  "cells": [
    {
      "id": "a1b2c3d4",
      "cell_type": "markdown",
      "source": "# Fleet Health Check\n\nThis notebook checks system health.",
      "metadata": {}
    },
    {
      "id": "e5f6g7h8",
      "cell_type": "code",
      "source": "alpha = 0.5\nthreshold = 0.95",
      "metadata": {"tags": ["parameters"]},
      "execution_count": null,
      "outputs": []
    },
    {
      "id": "i9j0k1l2",
      "cell_type": "code",
      "source": "import sys\nprint(sys.version)",
      "metadata": {},
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": "3.10.0 (default, ...)\n"
        }
      ]
    }
  ]
}

The nbformat Python library provides a clean API for working with this:

import nbformat

# Read
with open("notebook.ipynb") as f:
    nb = nbformat.read(f, as_version=4)

# Navigate
for cell in nb.cells:
    if cell.cell_type == "code":
        print(cell.source)

# Modify
nb.cells[2].source = "import sys\nprint('updated')"

# Add cells
new_md = nbformat.v4.new_markdown_cell("## Agent Analysis\nInserted by Hermes.")
nb.cells.insert(3, new_md)

# Write
with open("modified.ipynb", "w") as f:
    nbformat.write(nb, f)

# Validate
nbformat.validate(nb)  # raises nbformat.ValidationError on invalid format

4. The PR Model for Notebooks

This is the elegant architecture Rockachopa described: agents making PRs to notebooks the same way they make PRs to code. Here's how the full stack enables it.

4.1 The Problem: Raw .ipynb Diffs Are Unusable

Without tooling, a git diff on a notebook that was merely re-run (no source changes) produces thousands of lines of JSON changes — execution counts, timestamps, base64-encoded plot images. Code review on raw .ipynb diffs is impractical.

4.2 nbstripout — Clean Git History

nbstripout installs a git clean filter that strips outputs before files enter the git index. The working copy is untouched; only what gets committed is clean.

pip install nbstripout
nbstripout --install   # per-repo
# or
nbstripout --install --global  # all repos

This writes to .git/config:

[filter "nbstripout"]
    clean = nbstripout
    smudge = cat
    required = true

[diff "ipynb"]
    textconv = nbstripout -t

And to .gitattributes:

*.ipynb filter=nbstripout
*.ipynb diff=ipynb

Now git diff shows only source changes — same as reviewing a .py file.

For executed-output notebooks (where we want to keep outputs for audit): use a separate path like runs/ or outputs/ excluded from the filter via .gitattributes:

*.ipynb filter=nbstripout
runs/*.ipynb !filter
runs/*.ipynb !diff

4.3 nbdime — Semantic Diff and Merge

nbdime understands notebook structure. Instead of diffing raw JSON, it diffs at the level of cells — knowing that cells is a list, source is a string, and outputs should often be ignored.

pip install nbdime

# Enable semantic git diff/merge for all .ipynb files
nbdime config-git --enable

# Now standard git commands are notebook-aware:
git diff HEAD notebook.ipynb          # semantic cell-level diff
git merge feature-branch              # uses nbdime for .ipynb conflict resolution
git log -p notebook.ipynb            # readable patch per commit

Python API for agent reasoning:

import nbdime
import nbformat

nb_base = nbformat.read("original.ipynb", as_version=4)
nb_pr   = nbformat.read("proposed.ipynb", as_version=4)

diff = nbdime.diff_notebooks(nb_base, nb_pr)

# diff is a list of structured ops the agent can reason about:
# [{"op": "patch", "key": "cells", "diff": [
#     {"op": "patch", "key": 3, "diff": [
#         {"op": "patch", "key": "source", "diff": [...string ops...]}
#     ]}
# ]}]

# Apply a diff (patch)
from nbdime.patching import patch
nb_result = patch(nb_base, diff)

4.4 The Full Agent PR Workflow

Here is the complete workflow — analogous to how Hermes makes PRs to code repos via Gitea:

1. Agent reads the task notebook

nb = nbformat.read(open("fleet_health_check.ipynb"), as_version=4)

2. Agent locates and modifies relevant cells

# Find parameter cell
params_cell = next(
    c for c in nb.cells
    if "parameters" in c.get("metadata", {}).get("tags", [])
)
# Update threshold
params_cell.source = params_cell.source.replace("threshold = 0.95", "threshold = 0.90")

# Add explanatory markdown
nb.cells.insert(
    nb.cells.index(params_cell) + 1,
    nbformat.v4.new_markdown_cell(
        "**Note (Hermes 2026-04-06):** Threshold lowered from 0.95 to 0.90 "
        "based on false-positive analysis from last 7 days of runs."
    )
)

3. Agent writes and commits to a branch

git checkout -b agent/fleet-health-threshold-update
# (in Python) nbformat.write(nb, "fleet_health_check.ipynb")
git add fleet_health_check.ipynb
git commit -m "feat(notebooks): lower fleet health threshold to 0.90 (#155)"

4. Agent executes the proposed notebook to validate

import papermill as pm

pm.execute_notebook(
    "fleet_health_check.ipynb",
    "output/validation_run.ipynb",
    parameters={"run_id": "agent-validation-2026-04-06"},
    log_output=True,
)

5. Agent collects results and compares

import scrapbook as sb

result = sb.read_notebook("output/validation_run.ipynb")
health_score = result.scraps["health_score"].data
alert_count = result.scraps["alert_count"].data

6. Agent opens PR with results summary

curl -X POST "$GITEA_API/pulls" \
  -H "Authorization: token $TOKEN" \
  -d '{
    "title": "feat(notebooks): lower fleet health threshold to 0.90",
    "body": "## Agent Analysis\n\n- Health score: 0.94 (was 0.89 with old threshold)\n- Alert count: 12 (was 47 false positives)\n- Validation run: output/validation_run.ipynb\n\nRefs #155",
    "head": "agent/fleet-health-threshold-update",
    "base": "main"
  }'

7. Human reviews the PR using nbdime diff

The PR diff in Gitea shows the clean cell-level source changes (thanks to nbstripout). The human can also run nbdiff-web original.ipynb proposed.ipynb locally for rich rendered diff with output comparison.

4.5 nbval — Regression Testing Notebooks

nbval treats each notebook cell as a pytest test case, re-executing and comparing outputs to stored values:

pip install nbval

# Strict: every cell output must match stored outputs
pytest --nbval fleet_health_check.ipynb

# Lax: only check cells marked with # NBVAL_CHECK_OUTPUT
pytest --nbval-lax fleet_health_check.ipynb

Cell-level markers (comments in cell source):

# NBVAL_CHECK_OUTPUT   — in lax mode, validate this cell's output
# NBVAL_SKIP           — skip this cell entirely
# NBVAL_RAISES_EXCEPTION  — expect an exception (test passes if raised)

This becomes the CI gate: before a notebook PR is merged, run pytest --nbval-lax to verify no cells produce errors and critical output cells still produce expected values.
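A critical output cell in the health-check notebook might then look like this (the `health_score` value is illustrative, not taken from the real notebook):

```python
# NBVAL_CHECK_OUTPUT
# Gate cell: nbval re-runs this and fails the PR if the printed
# value no longer matches the stored output.
health_score = 0.94
print(f"health_score={health_score:.2f}")  # prints: health_score=0.94
```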


5. Gaps and Recommendations

5.1 Gap Assessment (Refining Timmy's Original Findings)

| Gap | Severity | Solution |
| --- | --- | --- |
| No Hermes tool access in kernel | High | Inject hermes_runtime module (see §5.2) |
| No structured output protocol | High | Use scrapbook sb.glue() pattern |
| No parameterization | Medium | Add Papermill "parameters" cell to notebooks |
| XSRF/auth friction | Medium | Disable for local; use JupyterHub token scopes for multi-user |
| No notebook CI/testing | Medium | Add nbval to test suite |
| Raw .ipynb diffs in PRs | Medium | Install nbstripout + nbdime |
| No scheduling | Low | Papermill + existing Hermes cron layer |

5.2 Short-Term Recommendations (This Month)

1. NotebookExecutor tool

A thin Hermes tool wrapping the ecosystem:

class NotebookExecutor:
    def execute(self, input_path, output_path, parameters, timeout=300):
        """Wraps pm.execute_notebook(). Returns structured result dict."""

    def collect_outputs(self, notebook_path):
        """Wraps sb.read_notebook(). Returns dict of named scraps."""

    def inspect_parameters(self, notebook_path):
        """Wraps pm.inspect_notebook(). Returns parameter schema."""

    def read_notebook(self, path):
        """Returns nbformat NotebookNode for cell inspection/modification."""

    def write_notebook(self, nb, path):
        """Writes modified NotebookNode back to disk."""

    def diff_notebooks(self, path_a, path_b):
        """Returns structured nbdime diff for agent reasoning."""

    def validate(self, notebook_path):
        """Runs nbformat.validate() + optional pytest --nbval-lax."""

Execution result structure for the agent:

{
    "status": "success" | "error",
    "duration_seconds": 12.34,
    "cells_executed": 15,
    "failed_cell": {       # None on success
        "index": 7,
        "source": "model.fit(X, y)",
        "ename": "ValueError",
        "evalue": "Input contains NaN",
    },
    "scraps": {            # from scrapbook
        "health_score": 0.94,
        "alert_count": 12,
    },
}

2. Fleet Health Check as a Notebook

Convert the fleet health check epic into a parameterized notebook with:

  • "parameters" cell for run configuration (date range, thresholds, agent ID)
  • Markdown cells narrating each step
  • sb.glue() calls for structured outputs
  • # NBVAL_CHECK_OUTPUT markers on critical cells

3. Git hygiene for notebooks

Install nbstripout + nbdime in the hermes-agent repo:

pip install nbstripout nbdime
nbstripout --install
nbdime config-git --enable

Add to .gitattributes:

*.ipynb filter=nbstripout
*.ipynb diff=ipynb
runs/*.ipynb !filter

5.3 Medium-Term Recommendations (Next Quarter)

4. hermes_runtime Python module

Inject Hermes tool access into the kernel via a module that notebooks import:

# In kernel cell: from hermes_runtime import terminal, read_file, web_search
import hermes_runtime as hermes

results = hermes.web_search("fleet health metrics best practices")
hermes.terminal("systemctl status agent-fleet")
content = hermes.read_file("/var/log/hermes/agent.log")

This closes the most significant gap: notebooks gain the same tool access as skills, while retaining state persistence and narrative structure.

5. Notebook-triggered cron

Extend the Hermes cron layer to accept .ipynb paths as targets:

# cron entry
schedule: "0 6 * * *"
type: notebook
path: notebooks/fleet_health_check.ipynb
parameters:
  run_id: "{{date}}"
  alert_threshold: 0.90
output_path: runs/fleet_health_{{date}}.ipynb

The cron runner calls pm.execute_notebook() and commits the output to the repo.
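One piece the runner needs is template expansion. A sketch (the {{date}} convention mirrors the entry above; the `render_entry` helper name is an assumption):

```python
def render_entry(entry, today):
    """Expand {{date}} placeholders in a notebook cron entry."""
    def render(v):
        return v.replace("{{date}}", today) if isinstance(v, str) else v
    return {
        "input_path": entry["path"],
        "output_path": render(entry["output_path"]),
        "parameters": {k: render(v) for k, v in entry["parameters"].items()},
    }

job = render_entry(
    {
        "path": "notebooks/fleet_health_check.ipynb",
        "output_path": "runs/fleet_health_{{date}}.ipynb",
        "parameters": {"run_id": "{{date}}", "alert_threshold": 0.90},
    },
    today="2026-04-06",
)
# job maps onto pm.execute_notebook(input_path=..., output_path=..., parameters=...)
```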

6. JupyterHub for multi-agent isolation

If multiple agents need concurrent notebook execution, deploy JupyterHub with DockerSpawner or KubeSpawner. Each agent job gets an isolated container with its own kernel, no state bleed between runs.
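A minimal jupyterhub_config.py sketch for that deployment — the image name and resource limits are illustrative, `"docker"` is DockerSpawner's registered entry-point name, and `c` is the config object JupyterHub provides when loading the file:

```python
# jupyterhub_config.py — per-agent container isolation (illustrative values)
c.JupyterHub.spawner_class = "docker"
c.DockerSpawner.image = "jupyter/scipy-notebook"
c.DockerSpawner.mem_limit = "2G"   # cap memory per agent job
c.DockerSpawner.cpu_limit = 1.0    # cap CPU per agent job
c.DockerSpawner.remove = True      # discard container state after each run
```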


6. Architecture Vision

┌─────────────────────────────────────────────────────────────────┐
│                        Hermes Agent                             │
│                                                                  │
│  Skills (one-shot)          Notebooks (multi-step)              │
│  ┌─────────────────┐       ┌─────────────────────────────────┐  │
│  │ terminal()      │       │ .ipynb file                     │  │
│  │ web_search()    │       │  ├── Markdown (narrative)       │  │
│  │ read_file()     │       │  ├── Code cells (logic)         │  │
│  └─────────────────┘       │  ├── "parameters" cell          │  │
│                             │  └── sb.glue() outputs          │  │
│                             └──────────────┬────────────────┘  │
│                                            │                    │
│                             ┌──────────────▼────────────────┐  │
│                             │   NotebookExecutor tool        │  │
│                             │  (papermill + scrapbook +      │  │
│                             │   nbformat + nbdime + nbval)   │  │
│                             └──────────────┬────────────────┘  │
│                                            │                    │
└────────────────────────────────────────────┼────────────────────┘
                                             │
                         ┌───────────────────▼──────────────────┐
                         │          JupyterLab / Hub             │
                         │  (kernel execution environment)       │
                         └───────────────────┬──────────────────┘
                                             │
                         ┌───────────────────▼──────────────────┐
                         │           Git + Gitea                 │
                         │  (nbstripout clean diffs,            │
                         │   nbdime semantic review,            │
                         │   PR workflow for notebook changes)   │
                         └──────────────────────────────────────┘

Notebooks become the primary artifact of complex tasks: the agent generates or edits cells, Papermill executes them reproducibly, scrapbook extracts structured outputs for agent decision-making, and the resulting .ipynb is both proof-of-work and human-readable report. Skills remain for one-shot actions. Notebooks own multi-step workflows.


7. Package Summary

| Package | Purpose | Install |
| --- | --- | --- |
| nbformat | Read/write/validate .ipynb files | pip install nbformat |
| nbconvert | Execute and export notebooks | pip install nbconvert |
| papermill | Parameterize + execute in pipelines | pip install papermill |
| scrapbook | Structured output collection | pip install scrapbook |
| nbdime | Semantic diff/merge for git | pip install nbdime |
| nbstripout | Git filter for clean diffs | pip install nbstripout |
| nbval | pytest-based output regression | pip install nbval |
| jupyter-kernel-gateway | Headless REST kernel access | pip install jupyter-kernel-gateway |

8. References