fix(tui): harden against Node V8 OOM + GatewayClient memory leaks

Long TUI sessions were crashing Node with a V8 fatal OOM once transcripts +
reasoning blobs crossed the default heap cap (roughly 1.5–4 GB depending on
Node version and platform). This adds defense in depth: a bigger heap,
leak-proofing on the RPC hot path, bounded diagnostic buffers, automatic
heap dumps at high-water marks, and graceful signal / uncaught-exception
handlers.

## Changes

### Heap budget
- hermes_cli/main.py: `_launch_tui` now injects `NODE_OPTIONS=
  --max-old-space-size=8192 --expose-gc` (appended — does not clobber
  user-supplied NODE_OPTIONS). Covers both `node dist/entry.js` and
  `tsx src/entry.tsx` launch paths.
- ui-tui/src/entry.tsx: shebang rewritten to
  `#!/usr/bin/env -S node --max-old-space-size=8192 --expose-gc` as a
  fallback when the binary is invoked directly.
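The append-not-clobber merge can be sketched as follows. (`mergeNodeOptions` is a hypothetical helper name for illustration; the real logic lives in `_launch_tui` in hermes_cli/main.py, shown here in TypeScript for consistency with the rest of the diff.)

```typescript
// Sketch: merge our flags into any user-supplied NODE_OPTIONS rather
// than replacing it. Hypothetical helper; the real code is Python.
function mergeNodeOptions(
  env: Record<string, string | undefined>,
  extra: string,
): Record<string, string | undefined> {
  const existing = env.NODE_OPTIONS?.trim()
  return { ...env, NODE_OPTIONS: existing ? `${existing} ${extra}` : extra }
}
```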

### GatewayClient (ui-tui/src/gatewayClient.ts)
- `setMaxListeners(0)` — lifts the listener cap so the many React hook
  subscribers no longer trigger spurious MaxListenersExceededWarning noise.
- `logs` and `bufferedEvents` replaced with fixed-capacity
  CircularBuffer — O(1) push, no splice(0, …) copies under load.
- RPC timeout refactor: `setTimeout(this.onTimeout.bind(this), …, id)`
  replaces the inline arrow closure that captured `method`/`params`/
  `resolve`/`reject` for the full 120 s request timeout. Each Pending
  record now stores its own timeout handle, `.unref()`'d so stuck
  timers never keep the event loop alive, and `rejectPending()` clears
  them (previously leaked the timer itself).
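A minimal sketch of the refactored pending-RPC bookkeeping, under simplified names (the real class is GatewayClient and carries more per-request state):

```typescript
// Sketch of the leak-proofed pending-RPC table.
interface Pending {
  resolve: (value: unknown) => void
  reject: (err: Error) => void
  timer: ReturnType<typeof setTimeout>
}

class PendingRpcs {
  private pending = new Map<number, Pending>()

  call(id: number, timeoutMs: number): Promise<unknown> {
    return new Promise((resolve, reject) => {
      // Bound method + id argument instead of an inline arrow, so the
      // timer does not capture method/params/resolve/reject for the
      // full timeout window.
      const timer = setTimeout(this.onTimeout.bind(this), timeoutMs, id)
      timer.unref?.() // a stuck timer must not keep the event loop alive
      this.pending.set(id, { resolve, reject, timer })
    })
  }

  private onTimeout(id: number): void {
    const entry = this.pending.get(id)
    if (!entry) return
    this.pending.delete(id)
    entry.reject(new Error(`RPC ${id} timed out`))
  }

  rejectPending(err: Error): void {
    for (const entry of this.pending.values()) {
      clearTimeout(entry.timer) // previously the timer itself leaked
      entry.reject(err)
    }
    this.pending.clear()
  }
}
```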

### Memory diagnostics (new)
- ui-tui/src/lib/memory.ts: `performHeapDump()` +
  `captureMemoryDiagnostics()`. Writes heap snapshot + JSON diag
  sidecar to `~/.hermes/heapdumps/` (override via
  `HERMES_HEAPDUMP_DIR`). Diagnostics are written first so we still get
  useful data if the snapshot crashes on very large heaps.
  Captures: detached V8 contexts (closure-leak signal), active
  handles/requests (`process._getActiveHandles/_getActiveRequests`),
  Linux `/proc/self/fd` count + `/proc/self/smaps_rollup`, heap growth
  rate (MB/hr), and auto-classifies likely leak sources.
- ui-tui/src/lib/memoryMonitor.ts: polls `heapUsed` on a 10 s interval. At
  1.5 GB it writes an auto heap dump (trigger=`auto-high`); at 2.5 GB it
  writes a final dump and exits with code 137 before V8 fatal-OOMs, so
  the user can restart cleanly. The interval handle is `.unref()`'d so it
  never holds the process open.

### Graceful exit (new)
- ui-tui/src/lib/gracefulExit.ts: SIGINT/SIGTERM/SIGHUP run registered
  cleanups through a 4 s failsafe `setTimeout` that hard-exits if
  cleanup hangs.
  `uncaughtException` / `unhandledRejection` are logged to stderr
  instead of crashing — a transient TUI render error should not kill
  an in-flight agent turn.
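The failsafe shape is roughly this (`installGracefulExit` is a hypothetical name for the sketch; the real module registers cleanups incrementally rather than taking them up front):

```typescript
type Cleanup = () => void | Promise<void>

// Sketch: run cleanups on signal, but never hang the exit path.
export function installGracefulExit(cleanups: Cleanup[], failsafeMs = 4_000): void {
  const shutdown = (code: number) => {
    // Failsafe: hard-exit if any cleanup hangs past the window.
    const failsafe = setTimeout(() => process.exit(code), failsafeMs)
    failsafe.unref?.()
    Promise.allSettled(cleanups.map((fn) => Promise.resolve().then(fn))).then(() =>
      process.exit(code),
    )
  }
  for (const sig of ['SIGINT', 'SIGTERM', 'SIGHUP'] as const) {
    process.on(sig, () => shutdown(0))
  }
  // Log-and-continue: a transient render error should not kill an
  // in-flight agent turn.
  process.on('uncaughtException', (err) => console.error('[uncaught]', err))
  process.on('unhandledRejection', (reason) => console.error('[unhandled]', reason))
}
```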

### Slash commands (new)
- ui-tui/src/app/slash/commands/debug.ts:
  - `/heapdump` — manual snapshot + diagnostics.
  - `/mem` — live heap / rss / external / array-buffer / uptime panel.
- Registered in `ui-tui/src/app/slash/registry.ts`.

### Utility (new)
- ui-tui/src/lib/circularBuffer.ts: small fixed-capacity ring buffer
  with `push` / `tail(n)` / `drain()` / `clear()`. Replaces the ad-hoc
  `array.splice(0, len - MAX)` pattern.
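A self-contained sketch of such a ring buffer, matching the interface named above (the shipped class may differ in details):

```typescript
// Fixed-capacity ring buffer: O(1) push, oldest entries overwritten.
export class CircularBuffer<T> {
  private buf: T[]
  private head = 0 // index of the oldest element
  private count = 0

  constructor(private capacity: number) {
    this.buf = new Array<T>(capacity)
  }

  push(item: T): void {
    this.buf[(this.head + this.count) % this.capacity] = item
    if (this.count < this.capacity) this.count++
    else this.head = (this.head + 1) % this.capacity // overwrite oldest
  }

  /** Last n items, oldest first. */
  tail(n: number): T[] {
    const take = Math.min(n, this.count)
    const out: T[] = []
    for (let i = this.count - take; i < this.count; i++) {
      out.push(this.buf[(this.head + i) % this.capacity])
    }
    return out
  }

  /** All items, oldest first; empties the buffer. */
  drain(): T[] {
    const out = this.tail(this.count)
    this.clear()
    return out
  }

  clear(): void {
    this.head = 0
    this.count = 0
    this.buf = new Array<T>(this.capacity)
  }
}
```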

## Validation

- `tsc --noEmit` clean
- `vitest run`: 15 files, 102 tests passing
- eslint clean on all touched/new files
- build produces executable `dist/entry.js` with preserved shebang
- smoke-tested: `HERMES_HEAPDUMP_DIR=… performHeapDump('manual')`
  writes both a valid `.heapsnapshot` and a `.diagnostics.json`
  containing detached-contexts, active-handles, smaps_rollup.

## Env knobs
- `HERMES_HEAPDUMP_DIR` — override snapshot output dir
- `HERMES_HEAPDUMP_ON_START=1` — dump once at boot
- existing `NODE_OPTIONS` is respected and appended, not replaced
Author: Brooklyn Nicholson
Date:   2026-04-20 18:35:18 -05:00
Parent: 36e8435d3e
Commit: 0785aec444
9 changed files with 569 additions and 40 deletions

ui-tui/src/lib/memoryMonitor.ts (new file)
@@ -0,0 +1,75 @@
import { type HeapDumpResult, performHeapDump } from './memory.js'

export type MemoryLevel = 'critical' | 'high' | 'normal'

export interface MemorySnapshot {
  heapUsed: number
  level: MemoryLevel
  rss: number
}

export interface MemoryMonitorOptions {
  criticalBytes?: number
  highBytes?: number
  intervalMs?: number
  onCritical?: (snap: MemorySnapshot, dump: HeapDumpResult | null) => void
  onHigh?: (snap: MemorySnapshot, dump: HeapDumpResult | null) => void
  onSnapshot?: (snap: MemorySnapshot) => void
}

const GB = 1024 ** 3

const DEFAULTS = {
  criticalBytes: 2.5 * GB,
  highBytes: 1.5 * GB,
  intervalMs: 10_000
}

export function startMemoryMonitor({
  criticalBytes = DEFAULTS.criticalBytes,
  highBytes = DEFAULTS.highBytes,
  intervalMs = DEFAULTS.intervalMs,
  onCritical,
  onHigh,
  onSnapshot
}: MemoryMonitorOptions = {}): () => void {
  // Latch per threshold so each high-water mark dumps at most once
  // until usage drops back to normal.
  let dumpedHigh = false
  let dumpedCritical = false

  const tick = async () => {
    const { heapUsed, rss } = process.memoryUsage()
    const level: MemoryLevel =
      heapUsed >= criticalBytes ? 'critical' : heapUsed >= highBytes ? 'high' : 'normal'
    const snap: MemorySnapshot = { heapUsed, level, rss }
    onSnapshot?.(snap)
    if (level === 'normal') {
      dumpedHigh = false
      dumpedCritical = false
      return
    }
    if (level === 'high' && !dumpedHigh) {
      dumpedHigh = true
      const dump = await performHeapDump('auto-high').catch(() => null)
      onHigh?.(snap, dump)
      return
    }
    if (level === 'critical' && !dumpedCritical) {
      dumpedCritical = true
      const dump = await performHeapDump('auto-critical').catch(() => null)
      onCritical?.(snap, dump)
    }
  }

  const handle = setInterval(() => void tick(), intervalMs)
  // unref() so the monitor never keeps the process alive on its own.
  handle.unref?.()
  return () => clearInterval(handle)
}