[BEZALEL][SPIKE] Jupyter Notebooks as Core LLM Execution Layer — Research Report #155
What
A research spike to evaluate whether Jupyter notebooks should be elevated from a data-science skill to a core execution substrate for LLM tasks.
Hypothesis
Jupyter notebooks offer a superior task-execution model for LLMs because they combine deterministic code execution, human-readable narration, stateful incremental computation, version-controllable artifacts, and replayability.
What Was Tested
Key Findings
Notebooks excel where skills are limited
Current gaps for production use
Recommendations
Short-term (this month)
Medium-term (next quarter)
Architecture Vision
Notebooks become the primary artifact of complex tasks: the LLM generates or edits cells, the kernel executes them, and the resulting .ipynb is both proof-of-work and human-readable report. Skills remain for one-shot actions. Notebooks own multi-step workflows.
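Since `.ipynb` is plain JSON (the nbformat v4 schema), the proof-of-work artifact is easy to generate and inspect programmatically. A minimal sketch of the structure an agent would emit and a kernel would fill in (cell contents are illustrative, not from the spike):

```python
import json

# Minimal nbformat-v4 notebook: the artifact an agent generates and the
# kernel fills in with outputs. Markdown cells carry the narration,
# code cells carry the executed work plus its captured outputs.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## System health check\n", "Narration the reviewer reads."],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["threshold = 0.1\n", "print('disk usage below', threshold)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout",
                 "text": ["disk usage below 0.1\n"]},
            ],
        },
    ],
}

# The whole artifact round-trips through JSON, which is what makes it
# both machine-checkable and human-reviewable.
serialized = json.dumps(notebook, indent=1)
```

Because the artifact is just a dict, an agent can assemble, validate, and diff it with ordinary tooling before any kernel is involved.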
Spike Artifact
Next Action Requested
Approval to proceed with Short-term #1: NotebookExecutor tool prototype and Short-term #2: Fleet health check as a notebook task.
/assign @bezalel
Research deeper. There is a JupyterHub and JupyterLab product suite, and there is a tool called Papermill that uses notebooks in real data pipelines. I believe there is a truly elegant way to have agents understand and work with Jupyter notebooks in JupyterLab, similar to how we work in Gitea. Likely by making PRs to notebooks.
What
A deeper research spike into the Jupyter product suite to find an elegant way for agents to generate, edit, review, and execute notebooks — similar to how we work with code in Gitea.
Ecosystem Components Researched
1. Papermill — Parameterized Notebook Execution
What it does: Executes notebooks as subprocesses, injects parameters into a tagged cell, and produces an output notebook with all cell outputs preserved.
What I proved:
- Created `agent_task_system_health.ipynb` with a `parameters` cell
- Ran `papermill input.ipynb output.ipynb -p threshold 0.1 -p hostname forge-vps-01`
Why this matters for agents:
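The key property is that the notebook source never changes: Papermill inserts an `injected-parameters` cell after the `parameters`-tagged cell, so an agent can rerun the same notebook with different inputs indefinitely. A pure-Python sketch of that injection step (simplified illustration, not Papermill's actual internals):

```python
import copy

def inject_parameters(nb: dict, params: dict) -> dict:
    """Return a copy of notebook `nb` with an 'injected-parameters' code cell
    inserted after the cell tagged 'parameters' (roughly what `papermill -p` does)."""
    out = copy.deepcopy(nb)
    injected = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "source": [f"{k} = {v!r}\n" for k, v in params.items()],
        "outputs": [],
    }
    for i, cell in enumerate(out["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            out["cells"].insert(i + 1, injected)
            break
    else:
        # No tagged cell: prepend, so the injected values still take effect
        out["cells"].insert(0, injected)
    return out

# Hypothetical minimal notebook with default parameters
nb = {"cells": [
    {"cell_type": "code", "metadata": {"tags": ["parameters"]},
     "source": ["threshold = 0.5\n", "hostname = 'localhost'\n"], "outputs": []},
]}
result = inject_parameters(nb, {"threshold": 0.1, "hostname": "forge-vps-01"})
```

The defaults in the tagged cell still run first, then the injected cell overrides them, which is why the source notebook stays valid on its own.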
2. Jupytext — Bidirectional .ipynb <-> .py Conversion
What it does: Syncs notebooks with plain-text formats (`.py`, `.md`, `.Rmd`). A `.py` file with `# %%` cell markers is equivalent to a notebook.
What I proved:
- `jupytext --to py agent_task_system_health.ipynb` produced a clean `.py` file with `# %%` blocks
- The `.py` file is diffable, reviewable in Gitea PRs, and editable in any IDE
Why this matters for agents:
- Agents can edit the `.py` representation, commit it, and open a PR
- The `.ipynb` can be auto-regenerated from the `.py` source
3. Nbdime — Git-Integrated Notebook Diff & Merge
What it does: Provides `nbdiff`, `nbmerge`, and git drivers that understand notebook structure (cells, outputs, metadata).
What I proved:
- `nbdiff input.ipynb output.ipynb` showed a beautiful structured diff
Why this matters for agents:
- For `.ipynb` files, nbdime makes PR review possible
4. JupyterHub — Multi-User Notebook Servers
What it does: Spins up isolated JupyterLab instances per user (or per agent). Each user gets their own kernel, file space, and resource limits.
Why this matters for agents:
The Elegant Architecture I See
Source of Truth: Jupytext `.py` Files in Git
- Notebooks live in git as `.py` using `# %%` cells — readable, diffable, PR-friendly
- `.ipynb` is auto-generated on checkout or CI — for execution and rich viewing
- PRs review the `.py` — Gitea shows clean diffs, no JSON mess
- nbdime covers `.ipynb` diff when needed
Execution Layer: Papermill + JupyterHub
- Papermill executes the `.py` + generated `.ipynb`
- Output notebooks land in an `executions/` or `reports/` branch
Agent Interface: Hermes Tool for Notebook PRs
I envision a new Hermes tool suite:
- `notebook_create(task_description)` → generates a `.py` notebook template
- `notebook_edit(path, cell_index, new_source)` → edits a cell
- `notebook_execute(path, parameters)` → runs via Papermill, returns output path
- `notebook_commit(path, message)` → converts to `.ipynb`, commits both, pushes to branch
- `notebook_pr(branch, title, description)` → opens a PR in Gitea
Comparative Table: How We Work in Gitea vs. How We Could Work in JupyterLab
| Gitea (code) | JupyterLab (notebooks) |
| --- | --- |
| Edit `.py` files | Edit `.py` notebooks |
| `git diff` shows clean changes | `nbdiff` or `.py` diff shows cell-level changes |
| PR code review | `.py` notebook review |
| Commit `.py` | Commit `.py`, auto-generate `.ipynb` |
Risks & Mitigations
- Risk: `.ipynb` output bloat in git → Mitigation: only `.py` in main; outputs go to a `reports/` branch
Recommendations (Updated)
Immediate (this week)
- Author notebooks as `.py` with `# %%` markers
- Create a `notebooks/` directory in `hermes-agent` with `.py` source and auto-generated `.ipynb`
- Run `nbdime config-git --enable --global` so diffs are human-readable
Short-term (this month)
- `NotebookPR` workflow — agent edits `.py` → commits → Gitea PR → review → merge → auto-generate `.ipynb`
Medium-term (next quarter)
Proposed Next Step
I will create a demonstration PR that adds:
- `notebooks/agent_task_system_health.py` (Jupytext source)
- `notebooks/agent_task_system_health.ipynb` (auto-generated)
- `.gitea/workflows/notebook-ci.yml` (Papermill execution on PR)
- `docs/NOTEBOOK_WORKFLOW.md` (how agents write and PR notebooks)
This will prove the end-to-end loop: agent writes notebook → PR reviews the `.py` → CI executes with Papermill → output committed as artifact.
Seeking approval to proceed with the demonstration PR.
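For illustration, the CI piece could look roughly like this. A sketch of `.gitea/workflows/notebook-ci.yml`, assuming Gitea Actions with GitHub-compatible workflow syntax; the runner label, action version, and output path are assumptions, and the filenames come from the list above:

```yaml
name: notebook-ci
on:
  pull_request:
    paths:
      - "notebooks/**"

jobs:
  execute:
    runs-on: ubuntu-latest   # assumes a runner registered with this label
    steps:
      - uses: actions/checkout@v4
      - run: pip install jupytext papermill ipykernel
      # Regenerate the .ipynb from the Jupytext .py source of truth
      - run: jupytext --to ipynb notebooks/agent_task_system_health.py
      # Execute with parameters; the output notebook is the artifact
      - run: |
          papermill notebooks/agent_task_system_health.ipynb \
                    reports/agent_task_system_health.out.ipynb \
                    -p threshold 0.1
```

Keeping execution in CI (rather than on the reviewer's machine) is what makes the output notebook a reproducible artifact of the merged `.py`.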
/assign @bezalel
PR created: #160
Deep research report added at `docs/jupyter-as-execution-layer-research.md`. Covers what Rockachopa asked for:
JupyterHub/JupyterLab product suite — clarified the three layers: Notebook (classic UI), JupyterLab (full IDE, current canonical), and JupyterHub (multi-user orchestration/spawner infrastructure). JupyterHub is not a UI — it is an API-driven server that spawns isolated per-user/per-agent Jupyter environments. The REST API enables programmatic server lifecycle management, which is the path to ephemeral isolated kernel environments per notebook task.
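To make the lifecycle point concrete, here is a minimal stdlib-only sketch of the spawn call (`POST /hub/api/users/{name}/server` per the JupyterHub REST API); the hub URL and token are placeholders:

```python
import json
import urllib.request

HUB_URL = "http://hub.internal:8000"   # placeholder hub address
API_TOKEN = "REPLACE_ME"               # placeholder JupyterHub API token

def spawn_server_request(username: str) -> urllib.request.Request:
    """Build the REST call that starts a user's (or agent's) notebook server.
    JupyterHub endpoint: POST /hub/api/users/{name}/server"""
    return urllib.request.Request(
        url=f"{HUB_URL}/hub/api/users/{username}/server",
        method="POST",
        headers={"Authorization": f"token {API_TOKEN}"},
        data=json.dumps({}).encode(),  # spawner options would go here
    )

# An agent task would issue this, wait for the server to become ready,
# then talk to the kernel inside it; the request is only built here.
req = spawn_server_request("agent-bezalel")
```

The matching `DELETE` on the same endpoint tears the server down, which is what makes per-task ephemeral environments feasible.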
Papermill — this is the production-grade tool already used in real data pipelines (Netflix, Airbnb). Key capability: `parameters`-tagged cell injection. An agent passes params at runtime without touching the notebook source. The output notebook preserves all cell outputs + timing metadata. The Scrapbook companion library gives structured `sb.glue()`/`sb.read_notebook()` for clean agent output consumption. Direct comparison to hamelnb: they are complementary — hamelnb for interactive stateful REPL, Papermill for reproducible parameterized pipeline runs.
The PR model for notebooks — the answer to making PRs to notebooks like code PRs:
- `nbstripout` as a git clean filter strips outputs/execution counts before staging → clean, readable diffs in Gitea PRs
- `nbdime` provides semantic cell-level diff and merge (not raw JSON) with `nbdiff`, `nbmerge`, and git driver integration
- `nbval` runs notebooks as pytest test suites, with per-cell `# NBVAL_CHECK_OUTPUT` markers
Also includes a `NotebookExecutor` tool API sketch and the `hermes_runtime` module pattern for injecting Hermes tool access (terminal, read_file, web_search) into kernels.
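The report's sketch is not reproduced in this thread, so for discussion here is one possible shape for such a tool surface; method names beyond the `notebook_*` tools proposed above, and the result type, are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecutionResult:
    """Hypothetical result of a Papermill-style run."""
    output_path: str            # path to the executed output notebook
    succeeded: bool
    error: Optional[str] = None

class NotebookExecutor:
    """Hypothetical tool surface mirroring the notebook_* tools proposed above.
    Bodies are stubs; a real implementation would shell out to jupytext,
    papermill, git, and the Gitea API."""

    def create(self, task_description: str) -> str:
        """Generate a .py (Jupytext percent-format) notebook template; return its path."""
        raise NotImplementedError

    def edit(self, path: str, cell_index: int, new_source: str) -> None:
        """Replace the source of one cell."""
        raise NotImplementedError

    def execute(self, path: str, parameters: dict) -> ExecutionResult:
        """Run via papermill with injected parameters."""
        raise NotImplementedError

    def commit_and_pr(self, path: str, message: str, branch: str) -> str:
        """Convert to .ipynb, commit both, push, open a Gitea PR; return the PR URL."""
        raise NotImplementedError
```

A typed result object like `ExecutionResult` keeps kernel failures distinguishable from tool failures when the agent inspects the run.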