[BEZALEL][SPIKE] Jupyter Notebooks as Core LLM Execution Layer — Research Report #155

Closed
opened 2026-04-07 01:44:14 +00:00 by Timmy · 3 comments
Owner

What

A research spike to evaluate whether Jupyter notebooks should be elevated from a data-science skill to a core execution substrate for LLM tasks.

Hypothesis

Jupyter notebooks offer a superior task-execution model for LLMs because they combine deterministic code execution, human-readable narration, stateful incremental computation, version-controllable artifacts, and replayability.

What Was Tested

  1. Environment Setup: Installed uv, jupyterlab, and the hamelnb live-kernel bridge on my forge VPS.
  2. Live Kernel Session: Started a headless Jupyter server on port 8888 and established a persistent kernel session.
  3. Stateful Execution: Created llm_execution_spike.ipynb with interleaved Markdown and Python cells.
  4. Cross-Cell Persistence: Verified that variables defined in one code cell are accessible in subsequent cells.
  5. Structured Output: The notebook successfully gathered system state, made a programmatic decision, and emitted JSON output.
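
The logic exercised by the spike notebook can be sketched as three cells' worth of code: gather state, decide, emit JSON. This is a minimal stdlib-only illustration; the variable names and the disk-space threshold are illustrative, not the exact cells from llm_execution_spike.ipynb.

```python
import json
import platform
import shutil

# Cell 1: gather system state (fields are illustrative assumptions).
state = {
    "host": platform.node(),
    "python": platform.python_version(),
    "disk_free_gb": shutil.disk_usage("/").free / 1e9,
}

# Cell 2: make a programmatic decision using a variable from the previous
# cell, demonstrating cross-cell persistence in a live kernel.
decision = "ok" if state["disk_free_gb"] > 1.0 else "low_disk"

# Cell 3: emit structured JSON for downstream consumers.
result = json.dumps({"state": state, "decision": decision}, indent=2)
print(result)
```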

Key Findings

Notebooks excel where skills are limited

  • State persistence: Skills are stateless; notebooks preserve variables across executions.
  • Human audit trail: Skills are ephemeral; notebooks save Markdown + code + outputs.
  • Incremental debugging: Skills are all-or-nothing; notebooks allow re-running any cell.
  • Version control: Skills only preserve prompts; .ipynb is a diffable artifact.
  • Multi-step reasoning: Skills are single-turn; notebooks support multi-cell narratives.

Current gaps for production use

  1. No native Hermes tool access — Jupyter kernels run plain Python without access to terminal(), read_file(), or web search.
  2. Notebook files are not automatically synced — outputs exist on disk but need explicit commit/push.
  3. No scheduling layer — Notebooks are passive; something must trigger execution.
  4. XSRF/auth friction — Headless agent access requires auth management.

Recommendations

Short-term (this month)

  1. Build a NotebookExecutor tool — A Hermes tool that executes notebooks cell-by-cell and returns structured outputs.
  2. Prototype Notebook Tasks — Convert one epic (e.g., fleet health check) into a notebook.
  3. Auto-commit executed notebooks — After a run, commit to repo so narrative + outputs are preserved.
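
The core of the proposed NotebookExecutor can be sketched without a live kernel: iterate over a notebook's code cells and execute them in one shared namespace, which is exactly the state-persistence property the spike validated. This is a hypothetical sketch; a production tool would drive a real kernel via nbclient rather than exec().

```python
import json

def execute_notebook_cells(nb_json: str) -> list:
    """Run each code cell in a shared namespace and report the names
    defined so far, mimicking kernel state persistence."""
    nb = json.loads(nb_json)
    ns: dict = {}
    results = []
    for i, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        src = "".join(cell["source"])
        # Shared namespace across cells = cross-cell variable persistence.
        exec(compile(src, f"cell_{i}", "exec"), ns)
        results.append({"cell": i,
                        "names": sorted(k for k in ns if not k.startswith("__"))})
    return results

# Two-cell notebook: a variable defined in cell 0 is visible in cell 1.
demo = json.dumps({"cells": [
    {"cell_type": "code", "source": ["x = 21\n"]},
    {"cell_type": "code", "source": ["y = x * 2\n"]},
]})
print(execute_notebook_cells(demo))
```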

Medium-term (next quarter)

  1. Inject Hermes tools into the kernel — Create a hermes_runtime Python module exposing terminal, file ops, and web search inside Jupyter cells.
  2. Notebook-triggered cron jobs — Allow cron definitions to point to .ipynb files instead of raw prompts.
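
A minimal shape for the proposed hermes_runtime module might look like the following. The class and method names follow this issue's sketch (terminal, read_file); the real module would proxy calls back to the agent host rather than run them locally.

```python
import subprocess

class HermesRuntime:
    """Hypothetical bridge exposing Hermes-style tools inside a kernel."""

    def terminal(self, cmd: str, timeout: int = 30) -> str:
        # Run a shell command and return combined stdout/stderr.
        proc = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout)
        return proc.stdout + proc.stderr

    def read_file(self, path: str) -> str:
        with open(path, encoding="utf-8") as f:
            return f.read()

# Inside a notebook cell this would be: from hermes_runtime import runtime
runtime = HermesRuntime()
print(runtime.terminal("echo hello from the kernel"))
```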

Architecture Vision

Notebooks become the primary artifact of complex tasks: the LLM generates or edits cells, the kernel executes them, and the resulting .ipynb is both proof-of-work and human-readable report. Skills remain for one-shot actions. Notebooks own multi-step workflows.

Spike Artifact

  • Notebook: llm_execution_spike.ipynb (created on forge VPS)
  • Live kernel validated via hamelnb bridge

Next Action Requested

Approval to proceed with Short-term #1: NotebookExecutor tool prototype and Short-term #2: Fleet health check as a notebook task.

/assign @bezalel

claude self-assigned this 2026-04-07 01:46:54 +00:00
Owner

Research deeper. There is a JupyterHub and JupyterLab product suite, and there is a tool called Papermill that uses notebooks in real data pipelines. I believe there is a truly elegant way to have agents understand and work with Jupyter notebooks in JupyterLab, similar to how we work in Gitea, likely by making PRs to notebooks.

Author
Owner

What

A deeper research spike into the Jupyter product suite to find an elegant way for agents to generate, edit, review, and execute notebooks — similar to how we work with code in Gitea.

Ecosystem Components Researched

1. Papermill — Parameterized Notebook Execution

What it does: Executes notebooks as subprocesses, injects parameters into a tagged cell, and produces an output notebook with all cell outputs preserved.

What I proved:

  • Created agent_task_system_health.ipynb with a parameters cell
  • Executed via: papermill input.ipynb output.ipynb -p threshold 0.1 -p hostname forge-vps-01
  • Output notebook contains the injected parameters, execution metadata, and the computed JSON result

Why this matters for agents:

  • LLM generates a task template notebook (the "what" and "how")
  • Scheduler/Agent parameterizes and executes it (the "when" and "with what")
  • Output notebook is a complete audit artifact showing inputs, code, and outputs
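
Papermill's parameter mechanism can be illustrated in a few lines: locate the cell tagged `parameters` and insert a cell of assignments immediately after it, so injected values shadow the template defaults. This toy version only performs the injection; real runs should use `papermill.execute_notebook()`, which also executes the result and records timing metadata.

```python
import json

def inject_parameters(nb_json: str, **params) -> dict:
    """Insert an 'injected-parameters' cell after the 'parameters' cell,
    mirroring Papermill's injection step (illustration only)."""
    nb = json.loads(nb_json)
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": [f"{k} = {v!r}\n" for k, v in params.items()],
        "outputs": [],
        "execution_count": None,
    }
    for i, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            nb["cells"].insert(i + 1, injected)
            break
    return nb

template = json.dumps({"cells": [
    {"cell_type": "code", "metadata": {"tags": ["parameters"]},
     "source": ["threshold = 0.5\n", "hostname = 'localhost'\n"]},
]})
out = inject_parameters(template, threshold=0.1, hostname="forge-vps-01")
print(out["cells"][1]["source"])
```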

2. Jupytext — Bidirectional .ipynb <-> .py Conversion

What it does: Syncs notebooks with plain-text formats (.py, .md, .Rmd). A .py file with # %% cell markers is equivalent to a notebook.

What I proved:

  • jupytext --to py agent_task_system_health.ipynb produced a clean .py file
  • Markdown cells become comments, code cells become # %% blocks
  • The .py file is diffable, reviewable in Gitea PRs, and editable in any IDE

Why this matters for agents:

  • This is the key to "PRs to notebooks."
  • Agents can edit the .py representation, commit it, and open a PR
  • Reviewers see clean diffs instead of notebook JSON noise
  • On merge, the .ipynb can be auto-regenerated from the .py source
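
The shape of Jupytext's percent format is easy to see with a toy converter: markdown cells become commented `# %% [markdown]` blocks and code cells become `# %%` blocks. This sketch only shows the file shape; use jupytext itself for real, lossless round-tripping.

```python
import json

def to_percent_py(nb_json: str) -> str:
    """Render a notebook as percent-format .py text (illustration only)."""
    nb = json.loads(nb_json)
    chunks = []
    for cell in nb["cells"]:
        src = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            # Markdown cells become comments under a [markdown] marker.
            body = "\n".join("# " + line for line in src.splitlines())
            chunks.append("# %% [markdown]\n" + body)
        else:
            chunks.append("# %%\n" + src.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"

nb = json.dumps({"cells": [
    {"cell_type": "markdown", "source": ["## System health\n"]},
    {"cell_type": "code", "source": ["print('checking disk')\n"]},
]})
print(to_percent_py(nb))
```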

3. Nbdime — Git-Integrated Notebook Diff & Merge

What it does: Provides nbdiff, nbmerge, and git drivers that understand notebook structure (cells, outputs, metadata).

What I proved:

  • nbdiff input.ipynb output.ipynb showed a beautiful structured diff
  • Differences are shown at the cell level: source changed, outputs added, metadata updated
  • No JSON noise — just meaningful changes

Why this matters for agents:

  • Even if we commit raw .ipynb files, nbdime makes PR review possible
  • Merge conflicts in notebooks become resolvable
  • Review UIs such as GitHub/Gitea can render notebook diffs via the nbdime web viewer
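
The idea behind nbdime's cell-level diff can be shown with a toy comparison: diff notebooks cell-by-cell on source text instead of diffing raw JSON. nbdime itself is far richer (outputs, metadata, merge drivers, git integration); this only demonstrates why the structured view is readable.

```python
import difflib
import json

def cell_level_diff(a_json: str, b_json: str) -> list:
    """Report unified diffs per changed cell source (illustration only)."""
    a, b = json.loads(a_json), json.loads(b_json)
    report = []
    for i, (ca, cb) in enumerate(zip(a["cells"], b["cells"])):
        sa, sb = "".join(ca["source"]), "".join(cb["source"])
        if sa != sb:
            diff = list(difflib.unified_diff(
                sa.splitlines(), sb.splitlines(), lineterm=""))
            report.append({"cell": i, "diff": diff})
    return report

before = json.dumps({"cells": [{"cell_type": "code",
                                "source": ["threshold = 0.5\n"]}]})
after = json.dumps({"cells": [{"cell_type": "code",
                               "source": ["threshold = 0.1\n"]}]})
print(cell_level_diff(before, after))
```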

4. JupyterHub — Multi-User Notebook Servers

What it does: Spins up isolated JupyterLab instances per user (or per agent). Each user gets their own kernel, file space, and resource limits.

Why this matters for agents:

  • Each wizard (Timmy, Ezra, Allegro, Bezalel) could have their own JupyterHub identity
  • Notebooks become persistent workspaces rather than ephemeral execution contexts
  • Admin can monitor, cull idle kernels, and enforce resource limits
  • Integrates with OAuth/LDAP for authentication

The Elegant Architecture I See

Source of Truth: Jupytext .py Files in Git

  1. Agent writes a task as .py using # %% cells — readable, diffable, PR-friendly
  2. .ipynb is auto-generated on checkout or CI — for execution and rich viewing
  3. PRs review the .py — Gitea shows clean diffs, no JSON mess
  4. Nbdime as fallback — for native .ipynb diff when needed

Execution Layer: Papermill + JupyterHub

  1. Template notebooks live in the repo as .py + generated .ipynb
  2. Cron/webhook triggers papermill against a JupyterHub kernel
  3. Output notebooks are committed to an executions/ or reports/ branch
  4. Each execution is a permanent artifact with narrative + code + outputs

Agent Interface: Hermes Tool for Notebook PRs

I envision a new Hermes tool suite:

  • notebook_create(task_description) → generates a .py notebook template
  • notebook_edit(path, cell_index, new_source) → edits a cell
  • notebook_execute(path, parameters) → runs via papermill, returns output path
  • notebook_commit(path, message) → converts to .ipynb, commits both, pushes to branch
  • notebook_pr(branch, title, description) → opens a PR in Gitea
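
One of these tools, notebook_edit, can be sketched directly over the notebook JSON. The signature follows the list above; a production tool would go through nbformat and validate the result before writing.

```python
import json

def notebook_edit(nb_json: str, cell_index: int, new_source: str) -> str:
    """Replace one cell's source in a notebook document (sketch only)."""
    nb = json.loads(nb_json)
    # .ipynb stores source as a list of lines with trailing newlines kept.
    nb["cells"][cell_index]["source"] = new_source.splitlines(keepends=True)
    return json.dumps(nb, indent=1)

nb = json.dumps({"cells": [{"cell_type": "code", "source": ["x = 1\n"]}]})
edited = notebook_edit(nb, 0, "x = 2\n")
print(json.loads(edited)["cells"][0]["source"])
```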

Comparative Table: How We Work in Gitea vs. How We Could Work in JupyterLab

| Gitea Workflow | JupyterLab Equivalent | Tool/Standard |
|---|---|---|
| Write code in files | Write cells in .py notebooks | Jupytext |
| git diff shows clean changes | nbdiff or .py diff shows cell-level changes | Nbdime / Jupytext |
| Open PR for review | Open PR for .py notebook review | Gitea + Jupytext |
| CI runs tests | Papermill executes parameterized notebook | Papermill |
| Merge to main | Merge .py, auto-generate .ipynb | Git hooks / CI |
| Multi-user repos | JupyterHub gives each agent a kernel | JupyterHub |
| Audit trail (commits) | Audit trail (executed notebooks) | Output notebooks in git |

Risks & Mitigations

| Risk | Mitigation |
|---|---|
| .ipynb output bloat in git | Store only .py in main; outputs go to reports/ branch |
| Kernel environments diverge | Use a single Docker image for all JupyterHub kernels |
| XSRF/auth friction for headless agents | JupyterHub service tokens or local disabled-auth for internal use |
| Papermill timeout on long tasks | Configurable timeouts; break long tasks into smaller notebooks |

Recommendations (Updated)

Immediate (this week)

  1. Adopt Jupytext as standard — All task notebooks are authored as .py with # %% markers
  2. Create a notebooks/ directory in hermes-agent with .py source and auto-generated .ipynb
  3. Install nbdime git integration: run nbdime config-git --enable --global so diffs are human-readable

Short-term (this month)

  1. Build NotebookPR workflow — Agent edits .py → commits → Gitea PR → review → merge → auto-generate .ipynb
  2. Prototype Papermill cron execution — One scheduled task that parameterizes and executes a notebook, then commits the output

Medium-term (next quarter)

  1. Deploy JupyterHub (optional but powerful) — Multi-user notebook servers for the wizard fleet
  2. Hermes notebook tool suite — Native agent tools for creating, editing, executing, and PR-ing notebooks

Proposed Next Step

I will create a demonstration PR that adds:

  • notebooks/agent_task_system_health.py (Jupytext source)
  • notebooks/agent_task_system_health.ipynb (auto-generated)
  • .gitea/workflows/notebook-ci.yml (Papermill execution on PR)
  • docs/NOTEBOOK_WORKFLOW.md (how agents write and PR notebooks)

This will prove the end-to-end loop: Agent writes notebook → PR reviews .py → CI executes with Papermill → Output committed as artifact.

Seeking approval to proceed with the demonstration PR.

/assign @bezalel

Member

PR created: #160

Deep research report added at docs/jupyter-as-execution-layer-research.md. Covers what Rockachopa asked for:

JupyterHub/JupyterLab product suite — clarified the three layers: Notebook (classic UI), JupyterLab (full IDE, current canonical), and JupyterHub (multi-user orchestration/spawner infrastructure). JupyterHub is not a UI — it is an API-driven server that spawns isolated per-user/per-agent Jupyter environments. The REST API enables programmatic server lifecycle management, which is the path to ephemeral isolated kernel environments per notebook task.

Papermill — this is the production-grade tool already used in real data pipelines (Netflix, Airbnb). Key capability: parameters-tagged cell injection. An agent passes params at runtime without touching the notebook source. Output notebook preserves all cell outputs + timing metadata. Scrapbook companion library gives structured sb.glue() / sb.read_notebook() for clean agent output consumption. Direct comparison to hamelnb: they are complementary — hamelnb for interactive stateful REPL, Papermill for reproducible parameterized pipeline runs.

The PR model for notebooks — the answer to making PRs to notebooks like code PRs:

  • nbstripout as a git clean filter strips outputs/execution counts before staging → clean, readable diffs in Gitea PRs
  • nbdime provides semantic cell-level diff and merge (not raw JSON) with nbdiff, nbmerge, and git driver integration
  • nbval runs notebooks as pytest test suites, with per-cell # NBVAL_CHECK_OUTPUT markers
  • Full end-to-end agent workflow documented: read notebook → modify cells via nbformat → execute with Papermill → collect scraps → open Gitea PR with results summary
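
What nbstripout contributes as a git clean filter can be shown in miniature: drop outputs and execution counts from code cells so the committed .ipynb diffs cleanly. The real tool also scrubs selected metadata and hooks into git config; use nbstripout itself in practice.

```python
import json

def strip_outputs(nb_json: str) -> str:
    """Remove outputs and execution counts from code cells (sketch of the
    nbstripout clean-filter behavior, not a replacement for it)."""
    nb = json.loads(nb_json)
    for cell in nb["cells"]:
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)

dirty = json.dumps({"cells": [{"cell_type": "code",
                               "source": ["1 + 1\n"],
                               "execution_count": 7,
                               "outputs": [{"output_type": "execute_result"}]}]})
clean = json.loads(strip_outputs(dirty))
print(clean["cells"][0]["outputs"])
```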

Also includes a NotebookExecutor tool API sketch and the hermes_runtime module pattern for injecting Hermes tool access (terminal, read_file, web_search) into kernels.

Reference: Timmy_Foundation/hermes-agent#155