hermes-agent/docs/human-confirmation-firewall.md
Hermes Agent 6946d850f0
docs: human confirmation firewall research — implementation patterns (#662)
Resolves #662. Research document covering Vitalik's Human Confirmation
Firewall pattern for LLM safety.

Covers:
- Action risk tiers (0-4) with detection rules
- Platform-specific routing (Telegram, Discord, CLI, API)
- Timeout handling and cross-platform failover
- Two-factor confirmation (human + LLM smart approval)
- Whitelisting and confirmation fatigue prevention
- Crisis-specific patterns (what never requires confirmation)
- Architecture diagram
- Implementation status tracker

Based on Vitalik's blog post (#280), SOUL.md protocol,
and current approval.py/approval_tiers.py implementations.
2026-04-15 10:21:28 -04:00

Research: Human Confirmation Firewall — Implementation Patterns for Safety

Research issue #662. Based on Vitalik's secure LLM architecture (#280).

1. When to Trigger Confirmation

Action Risk Tiers

| Tier | Actions | Confirmation (time limit) | On timeout |
|---|---|---|---|
| 0 (Safe) | Read, search, browse | None | N/A |
| 1 (Low) | Write files, edit code | Smart LLM approval | N/A |
| 2 (Medium) | Send messages, API calls | Human + LLM, 60s | Auto-deny |
| 3 (High) | Deploy, config changes, crypto | Human + LLM, 30s | Auto-deny |
| 4 (Critical) | System destruction, crisis | Immediate human, 10s | Escalate |
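
One way to make the tier table machine-readable is a small policy map. This is a minimal sketch, not the contents of approval_tiers.py: the names `Tier`, `TierPolicy`, `POLICIES`, and `needs_human` are hypothetical, and only the values come from the table.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    SAFE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class TierPolicy:
    confirmation: str               # who must approve the action
    timeout_seconds: Optional[int]  # None when no human is waited on
    on_timeout: Optional[str]       # what happens if nobody answers


# Values copied from the tier table; field names are illustrative only.
POLICIES = {
    Tier.SAFE: TierPolicy("none", None, None),
    Tier.LOW: TierPolicy("llm", None, None),
    Tier.MEDIUM: TierPolicy("human+llm", 60, "auto-deny"),
    Tier.HIGH: TierPolicy("human+llm", 30, "auto-deny"),
    Tier.CRITICAL: TierPolicy("human", 10, "escalate"),
}


def needs_human(tier: Tier) -> bool:
    """Return True when the tier's policy involves a human approver."""
    return "human" in POLICIES[tier].confirmation
```

Keeping the policy in one frozen structure means the router and the timeout handler read the same numbers instead of hard-coding them separately.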

Detection Rules

Pattern-based (reactive):

  • Dangerous shell commands (rm -rf, chmod 777, git push --force)
  • External API calls (curl, wget to unknown hosts)
  • File writes to sensitive paths (/etc/, ~/.ssh/, credentials)
  • System service changes (systemctl, docker kill)

Behavioral (proactive):

  • Agent requesting credentials or tokens
  • Agent modifying its own configuration
  • Agent accessing other agents' workspaces
  • Agent making decisions that affect other humans

Context-based (situational):

  • Production environment (any change = confirm)
  • Financial operations (any transfer = confirm)
  • Crisis support (safety decisions = human-only)
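
The pattern-based rules lend themselves to a table of regular expressions. The sketch below is an assumption about how such a detector could look, not the logic in tools/approval.py; the `RULES` list, its labels, and `detect` are hypothetical, and the patterns cover only a few of the examples listed above.

```python
import re

# (label, compiled pattern) pairs; deliberately narrow and illustrative.
RULES = [
    ("recursive force delete", re.compile(r"\brm\s+-(?:rf|fr)\b")),
    ("world-writable permissions", re.compile(r"\bchmod\s+(?:-R\s+)?777\b")),
    ("forced git push", re.compile(r"\bgit\s+push\b.*--force\b")),
    ("sensitive path referenced", re.compile(r"(?:/etc/|~/\.ssh/)")),
    ("system service change", re.compile(r"\b(?:systemctl|docker\s+kill)\b")),
]


def detect(command: str) -> list:
    """Return the labels of every rule that matches the command, in rule order."""
    return [label for label, pattern in RULES if pattern.search(command)]
```

Regex matching is easy to evade (aliases, quoting, long flags), which is why the document pairs it with behavioral and context-based rules rather than relying on it alone.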

Threshold Model

risk_score = pattern_weight + behavioral_weight + context_weight

if risk_score >= CONFIRMATION_THRESHOLD:
    route_to_human(action, risk_score, context)

Configurable thresholds per platform:

  • Telegram: threshold=2.0 (more conservative on mobile)
  • Discord: threshold=2.5
  • CLI: threshold=3.0 (trusted operator context)
  • API: threshold=1.5 (external callers are untrusted)
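
The threshold model and the per-platform values can be combined into one runnable check. This is a sketch under stated assumptions: the function names are hypothetical, the weights are supplied by the caller, and falling back to the strictest threshold for an unknown platform is a choice made here, not something the document specifies.

```python
# Thresholds taken from the list above.
PLATFORM_THRESHOLDS = {
    "telegram": 2.0,
    "discord": 2.5,
    "cli": 3.0,
    "api": 1.5,
}


def risk_score(pattern_weight: float, behavioral_weight: float, context_weight: float) -> float:
    """Sum the three weights, as in the threshold model."""
    return pattern_weight + behavioral_weight + context_weight


def requires_human(platform: str, pattern_weight: float = 0.0,
                   behavioral_weight: float = 0.0, context_weight: float = 0.0) -> bool:
    """Return True when the score reaches the platform's confirmation threshold."""
    # Assumption: unknown platforms get the lowest (strictest) threshold.
    threshold = PLATFORM_THRESHOLDS.get(platform, min(PLATFORM_THRESHOLDS.values()))
    return risk_score(pattern_weight, behavioral_weight, context_weight) >= threshold
```

With a score of 2.0, the same action is routed to a human on the API and Telegram but not on the CLI, which is the intended effect of per-platform thresholds.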

2. How to Route Confirmations

Platform-Specific Routing

Telegram:

  • Inline keyboard with approve/deny buttons
  • Callback query handles the response
  • 60s default timeout, configurable
  • Fallback: send as text message with /approve /deny commands

Discord:

  • Reaction-based: approve (checkmark) / deny (X)
  • Button components (Discord UI)
  • 60s default timeout
  • Fallback: reply-based with !approve !deny

CLI:

  • Interactive prompt with y/n
  • Timeout via a SIGALRM signal alarm (Unix only)
  • Supports batch approval (approve all pending)
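
A y/n prompt with an alarm-based timeout could look like the sketch below. It assumes a Unix system and the main thread (both are requirements of `signal.alarm`); `prompt_yes_no`, `PromptTimeout`, and the injectable `read` parameter are hypothetical names introduced for illustration.

```python
import signal


class PromptTimeout(Exception):
    """Raised by the alarm handler when the operator does not answer in time."""


def _on_alarm(signum, frame):
    raise PromptTimeout()


def prompt_yes_no(question: str, timeout_seconds: int, read=input) -> bool:
    """Ask a y/n question; return False on 'no', timeout, or closed stdin."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_seconds)  # deliver SIGALRM after the timeout
    try:
        answer = read(f"{question} [y/N] ")
        return answer.strip().lower() in ("y", "yes")
    except (PromptTimeout, EOFError):
        return False  # auto-deny: silence is never approval
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

The default answer is "no" in every failure path, so an unattended terminal cannot approve an action by accident.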

API (gateway):

  • Returns pending confirmation ID
  • Client polls for resolution or receives a webhook callback
  • Structured response with status + timeout info
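
For the API gateway, the pending confirmation can be modelled as a small record that the client polls. The shape below is an assumption for illustration; the class name, field names, and status strings are hypothetical and not taken from the gateway code.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PendingConfirmation:
    action: str
    timeout_seconds: int
    confirmation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: float = field(default_factory=time.time)
    status: str = "pending"  # pending | approved | denied | expired

    def poll(self, now: Optional[float] = None) -> dict:
        """Return the structured response a polling client would receive."""
        now = time.time() if now is None else now
        deadline = self.created_at + self.timeout_seconds
        if self.status == "pending" and now >= deadline:
            self.status = "expired"  # callers treat this as a denial
        expires_in = max(0.0, deadline - now) if self.status == "pending" else 0.0
        return {
            "confirmation_id": self.confirmation_id,
            "status": self.status,
            "expires_in": expires_in,
        }
```

Expiry is evaluated lazily at poll time here, which keeps the sketch free of background timers; a real gateway would likely also expire entries on a schedule.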

Timeout Handling

confirmation_timeout:
  medium: 60s  → auto-deny + notify user
  high:   30s  → auto-deny + escalate to admin
  critical: 10s → auto-deny + emergency notification

Auto-deny is the safe default. Never auto-approve. If the human doesn't respond, the action doesn't happen.
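
The auto-deny rule maps directly onto `asyncio.wait_for`. This is a minimal sketch, assuming the platform adapter resolves a future with `True` or `False` when the human answers; `await_decision` is a hypothetical name.

```python
import asyncio


async def await_decision(response: "asyncio.Future[bool]", timeout_seconds: float) -> bool:
    """Wait for the human's answer; treat a timeout as a denial."""
    try:
        return await asyncio.wait_for(response, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # No answer in time: the action does not happen.
        return False
```

Notification and escalation would be triggered by the caller when `False` comes back from a timeout; they are left out here to keep the deny path obvious.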

Cross-Platform Failover

Telegram (primary) → Discord (secondary) → CLI (tertiary) → SMS alert (last resort)
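
The failover chain can be sketched as an ordered list of channels tried in turn. The assumptions here: each channel is a callable that returns `True`, `False`, or `None` (no answer), raises if the platform is unreachable, and the SMS alert is a notification handled elsewhere rather than a confirmation channel. All names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A channel is (name, ask), where ask(action) returns True, False, or None.
Channel = Tuple[str, Callable[[str], Optional[bool]]]


def confirm_with_failover(action: str, channels: Sequence[Channel]) -> Tuple[Optional[str], bool]:
    """Try each channel in order; return (answering channel, decision)."""
    for name, ask in channels:
        try:
            answer = ask(action)
        except Exception:
            continue  # channel unreachable, fall through to the next one
        if answer is not None:
            return name, answer
    # Nobody answered on any channel: auto-deny.
    return None, False
```

An explicit denial on an earlier channel stops the chain, so a "no" on Telegram is never overridden by a later "yes" on Discord.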

3. Two-Factor Confirmation (Human + LLM)

The Pattern

1. Agent proposes action
2. LLM evaluates: is this action safe? (smart approval)
3. If LLM says "safe" AND risk < threshold: auto-approve
4. If LLM says "uncertain" OR risk >= threshold: route to human
5. Human decides: approve or deny
6. Decision logged and fed back to LLM for learning

Smart Approval (LLM Pre-screen)

The LLM evaluates the action before bothering the human:

  • Routine safe actions (git commit, file write to project dir): auto-approve
  • Ambiguous actions: escalate to human
  • Clearly dangerous actions: auto-deny

In practice, this pre-screen cuts the number of confirmation prompts that reach the human by an estimated 60-80%.
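
The two-factor flow and the three pre-screen outcomes can be sketched as a single decision function. This assumes the LLM verdict has already been reduced to one of three values; `Verdict`, `two_factor_decision`, and the `ask_human` callback are hypothetical names, and logging is omitted.

```python
from enum import Enum
from typing import Callable, Tuple


class Verdict(Enum):
    SAFE = "safe"
    UNCERTAIN = "uncertain"
    DANGEROUS = "dangerous"


def two_factor_decision(verdict: Verdict, risk: float, threshold: float,
                        ask_human: Callable[[], bool]) -> Tuple[str, str]:
    """Return (decision, decided_by) for a proposed action."""
    if verdict is Verdict.DANGEROUS:
        return "deny", "llm"  # clearly dangerous: auto-deny
    if verdict is Verdict.SAFE and risk < threshold:
        return "approve", "llm"  # routine and below threshold: auto-approve
    # Uncertain verdict, or risk at or above threshold: the human decides.
    return ("approve" if ask_human() else "deny"), "human"
```

Note that a "safe" verdict does not bypass the human when the risk score is at or above the threshold; the LLM can only remove friction, never remove the gate.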

Whitelisting

approval:
  whitelist:
    - pattern: "git add ."           # Always safe in project dir
      scope: session
    - pattern: "npm install"          # Package installs are routine
      scope: always
    - pattern: "python3 -m pytest"    # Tests are always safe
      scope: always

Whitelist levels:

  • session: approve for this session only
  • always: permanent whitelist (stored in config)
  • auto: LLM decides based on context
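
The `session` and `always` scopes can be sketched as two sets with different lifetimes. This is an illustration only: the `Whitelist` class is hypothetical, persistence of the `always` set to config is left out, the `auto` level is not modelled, and exact-string matching is an assumption since the document does not define pattern semantics.

```python
from dataclasses import dataclass, field


@dataclass
class Whitelist:
    always: set = field(default_factory=set)   # would be persisted in config
    session: set = field(default_factory=set)  # cleared when the session ends

    def allow(self, command: str, scope: str) -> None:
        """Add a command under the given scope ('always' or 'session')."""
        if scope == "always":
            self.always.add(command)
        elif scope == "session":
            self.session.add(command)
        else:
            raise ValueError(f"unsupported scope: {scope}")

    def is_whitelisted(self, command: str) -> bool:
        """Exact match only, so chained commands never inherit approval."""
        command = command.strip()
        return command in self.always or command in self.session

    def end_session(self) -> None:
        self.session.clear()
```

Exact matching is the conservative choice: a prefix match on `git add .` would also approve `git add . && rm -rf /`.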

Confirmation Fatigue Prevention

  1. Batch approvals: "5 commands pending. Approve all? [Y/n]"
  2. Session whitelisting: "Trust this agent for the session?"
  3. Pattern learning: If the user approves "git add" 10 times, auto-whitelist
  4. Risk-proportional: Low-risk actions get less friction
  5. Context-aware: In trusted directories, reduce confirmations
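
Pattern learning (item 3) could be sketched as an approval counter that promotes a command after enough approvals. The class and parameter names are hypothetical, and resetting the count on a denial is an assumption added here, not part of the document.

```python
from collections import Counter


class ApprovalLearner:
    def __init__(self, promote_after: int = 10):
        self.promote_after = promote_after
        self.approvals = Counter()
        self.whitelisted = set()

    def record(self, command: str, approved: bool) -> bool:
        """Record a human decision; return True if the command is now whitelisted."""
        if not approved:
            # Assumption: a single denial resets the count and revokes trust.
            self.approvals.pop(command, None)
            self.whitelisted.discard(command)
            return False
        self.approvals[command] += 1
        if self.approvals[command] >= self.promote_after:
            self.whitelisted.add(command)
        return command in self.whitelisted
```

The default of 10 follows the "git add" example above; the reset-on-denial rule keeps a learned whitelist from outliving the user's trust.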

4. Crisis-Specific Patterns

What NEVER Requires Confirmation in Crisis

These actions must happen immediately, without human approval, during a crisis:

  1. 988 Lifeline display: Always show, never gate
  2. Crisis text line: Always provide, never gate
  3. Emergency services: Always mention for imminent danger
  4. Gospel message: Always share per SOUL.md protocol
  5. "Are you safe?": Always ask first, never delay

What DOES Require Confirmation in Crisis

  1. Contacting emergency services on behalf of user: Human must confirm
  2. Sharing user's location: Consent required
  3. Notifying user's emergency contacts: Human must confirm
  4. Ending the crisis conversation: Human must confirm

Balance: Safety vs Responsiveness

Normal mode:  Safety > Speed (confirm everything dangerous)
Crisis mode:  Speed > Safety for SUPPORT actions
              Safety > Speed for DECISION actions

Support actions (no confirmation needed):

  • Display crisis resources
  • Express empathy
  • Ask safety questions
  • Stay present

Decision actions (confirmation required):

  • Contact emergency services
  • Share user information
  • Make commitments about follow-up
  • End conversation
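
The support/decision split can be expressed as a gate that only lets known support actions through without confirmation. The action identifiers below are hypothetical labels for the bullets above, and treating unrecognized actions as requiring confirmation is an assumption made here for fail-safe behavior.

```python
# Hypothetical identifiers for the support actions listed above.
SUPPORT_ACTIONS = frozenset({
    "display_crisis_resources",
    "express_empathy",
    "ask_safety_question",
    "stay_present",
})

# Hypothetical identifiers for the decision actions listed above.
DECISION_ACTIONS = frozenset({
    "contact_emergency_services",
    "share_user_information",
    "commit_to_follow_up",
    "end_conversation",
})


def crisis_requires_confirmation(action: str) -> bool:
    """In crisis mode, only known support actions skip confirmation."""
    if action in SUPPORT_ACTIONS:
        return False
    # Decision actions, and anything unrecognized, go to a human.
    return True
```

Using an allowlist for the no-confirmation path means a newly added action is gated by default until someone classifies it.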

5. Architecture

User Message
    │
    ▼
┌─────────────────┐
│ SHIELD Detector  │──→ Crisis? → Crisis Protocol (no confirmation)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Tier Classifier  │──→ Tier 0-1: Auto-approve
└────────┬────────┘
         │ Tier 2-4
         ▼
┌─────────────────┐
│ Smart Approval   │──→ LLM says safe? → Auto-approve
│ (LLM pre-screen) │──→ LLM says uncertain? → Human
└────────┬────────┘
         │ Needs human
         ▼
┌─────────────────┐
│ Platform Router  │──→ Telegram inline keyboard
│                  │──→ Discord reaction
│                  │──→ CLI prompt
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Timeout Handler  │──→ Auto-deny + notify
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Decision Logger  │──→ Audit trail
└─────────────────┘
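
The final stage of the diagram, the decision logger, could write one JSON object per line to an append-only stream. This is a sketch with assumed field names; it does not describe the audit logging currently in tools/approval.py.

```python
import json
import time
from typing import Optional, TextIO


def log_decision(stream: TextIO, action: str, tier: int, decision: str,
                 decided_by: str, now: Optional[float] = None) -> dict:
    """Append one audit record as a JSON line and return it."""
    record = {
        "timestamp": time.time() if now is None else now,
        "action": action,
        "tier": tier,
        "decision": decision,      # "approve" or "deny"
        "decided_by": decided_by,  # e.g. "human", "llm", "timeout"
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Recording who decided, including "timeout", makes it possible to audit how often auto-deny fires versus how often a human actually answered.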

6. Implementation Status

| Component | Status | File |
|---|---|---|
| Tier classification | Implemented | tools/approval_tiers.py |
| Dangerous pattern detection | Implemented | tools/approval.py |
| Crisis detection | Implemented | agent/crisis_protocol.py |
| Gate execution order | Designed | docs/approval-tiers.md |
| Smart approval (LLM) | Partial | tools/approval.py (smart_approve) |
| Timeout handling | Designed | approval_tiers.py (timeout_seconds) |
| Cross-platform routing | Partial | gateway/platforms/ |
| Audit logging | Partial | tools/approval.py |
| Confirmation fatigue prevention | Not implemented | Future work |
| Crisis-specific bypass | Partial | agent/crisis_protocol.py |

7. Sources

  • Vitalik's blog: "A simple and practical approach to making LLMs safe"
  • Issue #280: Vitalik Security Architecture
  • Issue #282: Human Confirmation Daemon (port 6000)
  • Issue #328: Gateway config debt
  • Issue #665: Epic — Bridge Research Gaps
  • SOUL.md: When a Man Is Dying protocol
  • 988 Suicide & Crisis Lifeline training