Timmy_Foundation/hermes-agent


Hermes Agent 6946d850f0


docs: human confirmation firewall research — implementation patterns (#662)

Resolves #662. Research document covering Vitalik's Human Confirmation
Firewall pattern for LLM safety.

Covers:
- Action risk tiers (0-4) with detection rules
- Platform-specific routing (Telegram, Discord, CLI, API)
- Timeout handling and cross-platform failover
- Two-factor confirmation (human + LLM smart approval)
- Whitelisting and confirmation fatigue prevention
- Crisis-specific patterns (what never requires confirmation)
- Architecture diagram
- Implementation status tracker

Based on Vitalik's blog post (#280), SOUL.md protocol,
and current approval.py/approval_tiers.py implementations.

2026-04-15 10:21:28 -04:00


Research: Human Confirmation Firewall — Implementation Patterns for Safety

Research issue #662. Based on Vitalik's secure LLM architecture (#280).

1. When to Trigger Confirmation

Action Risk Tiers

Tier	Actions	Confirmation	Timeout	On timeout
0 (Safe)	Read, search, browse	None	N/A	N/A
1 (Low)	Write files, edit code	Smart LLM approval	N/A	N/A
2 (Medium)	Send messages, API calls	Human + LLM	60s	Auto-deny
3 (High)	Deploy, config changes, crypto	Human + LLM	30s	Auto-deny
4 (Critical)	System destruction, crisis	Immediate human	10s	Escalate
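
The tier table can be held as data so that the classifier and the timeout handler read from one place. The following is a minimal sketch of such an encoding. `Tier`, `TierPolicy`, `TIER_POLICIES`, and `needs_human` are hypothetical names chosen for illustration and are not taken from tools/approval_tiers.py.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    SAFE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class TierPolicy:
    confirmation: str               # who must approve
    timeout_seconds: Optional[int]  # None means there is no confirmation window
    on_timeout: Optional[str]       # what happens when the window expires


# Values copied from the tier table; the string labels are hypothetical.
TIER_POLICIES = {
    Tier.SAFE: TierPolicy("none", None, None),
    Tier.LOW: TierPolicy("smart_llm", None, None),
    Tier.MEDIUM: TierPolicy("human_and_llm", 60, "auto_deny"),
    Tier.HIGH: TierPolicy("human_and_llm", 30, "auto_deny"),
    Tier.CRITICAL: TierPolicy("immediate_human", 10, "escalate"),
}


def needs_human(tier: Tier) -> bool:
    """True when the tier's policy involves a human decision."""
    return TIER_POLICIES[tier].confirmation in {"human_and_llm", "immediate_human"}
```

Keeping the policy frozen and keyed by tier means a change to a timeout is a one-line edit rather than a search through routing code.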

Detection Rules

Pattern-based (reactive):

Dangerous shell commands (rm -rf, chmod 777, git push --force)
External API calls (curl, wget to unknown hosts)
File writes to sensitive paths (/etc/, ~/.ssh/, credentials)
System service changes (systemctl, docker kill)
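
As an illustration of the pattern-based rules, the sketch below matches a shell command against a small set of regular expressions built from the examples in this list. The rule names and the `detect_patterns` helper are assumptions made for this example; they are not the rules in tools/approval.py, and the expressions are deliberately narrow.

```python
import re

# Hypothetical rule set derived from the examples listed above.
DANGEROUS_PATTERNS = {
    "recursive_delete": re.compile(r"\brm\s+-(?:rf|fr)\b"),
    "world_writable": re.compile(r"\bchmod\s+(?:-R\s+)?777\b"),
    "force_push": re.compile(r"\bgit\s+push\b.*(?:--force\b|\s-f\b)"),
    # Flags every curl/wget call; checking the host against an allow-list is omitted.
    "network_fetch": re.compile(r"\b(?:curl|wget)\b"),
    "sensitive_path": re.compile(r"/etc/|~/\.ssh/|credentials"),
    "service_change": re.compile(r"\b(?:systemctl|docker\s+kill)\b"),
}


def detect_patterns(command: str) -> list[str]:
    """Return the names of all rules that match the command, in rule order."""
    return [name for name, rx in DANGEROUS_PATTERNS.items() if rx.search(command)]
```

Returning every matching rule name, rather than a single boolean, lets the caller weight the matches and record them in the audit trail.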

Behavioral (proactive):

Agent requesting credentials or tokens
Agent modifying its own configuration
Agent accessing other agents' workspaces
Agent making decisions that affect other humans

Context-based (situational):

Production environment (any change = confirm)
Financial operations (any transfer = confirm)
Crisis support (safety decisions = human-only)

Threshold Model

risk_score = pattern_weight + behavioral_weight + context_weight

if risk_score >= CONFIRMATION_THRESHOLD:
    route_to_human(action, risk_score, context)

Configurable thresholds per platform:

Telegram: threshold=2.0 (more conservative on mobile)
Discord: threshold=2.5
CLI: threshold=3.0 (trusted operator context)
API: threshold=1.5 (external callers are untrusted)
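
A runnable version of the threshold model might look like the sketch below. The threshold values come from the list above. The function names and the fallback to the strictest threshold for an unknown platform are assumptions made for this example.

```python
# Per-platform thresholds from the list above.
PLATFORM_THRESHOLDS = {"telegram": 2.0, "discord": 2.5, "cli": 3.0, "api": 1.5}

# Assumption: an unrecognised platform is treated like the least trusted one.
DEFAULT_THRESHOLD = min(PLATFORM_THRESHOLDS.values())


def risk_score(pattern_weight: float, behavioral_weight: float, context_weight: float) -> float:
    """Sum the three weights, as in the threshold model."""
    return pattern_weight + behavioral_weight + context_weight


def requires_confirmation(score: float, platform: str) -> bool:
    """True when the score meets or exceeds the platform's threshold."""
    threshold = PLATFORM_THRESHOLDS.get(platform.lower(), DEFAULT_THRESHOLD)
    return score >= threshold
```

For example, a score of 2.0 triggers confirmation on Telegram and the API but not on the CLI, which matches the idea that the CLI is a trusted operator context.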

2. How to Route Confirmations

Platform-Specific Routing

Telegram:

Inline keyboard with approve/deny buttons
Callback query handles the response
60s default timeout, configurable
Fallback: send as text message with /approve /deny commands

Discord:

Reaction-based: approve (checkmark) / deny (X)
Button components (Discord UI)
60s default timeout
Fallback: reply-based with !approve !deny

CLI:

Interactive prompt with y/n
Timeout via signal alarm
Supports batch approval (approve all pending)

API (gateway):

Returns pending confirmation ID
Client polls or webhooks for resolution
Structured response with status + timeout info
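
The API flow (return an ID, let the client poll, report status and timeout info) can be sketched with an in-memory store. Everything here is hypothetical: `ConfirmationStore`, its method names, and the response fields are illustrative and do not describe the gateway's actual interface.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class PendingConfirmation:
    action: str
    timeout_seconds: float
    created_at: float = field(default_factory=time.monotonic)
    status: str = "pending"  # pending | approved | denied | expired


class ConfirmationStore:
    """In-memory store; a real gateway would persist this state."""

    def __init__(self) -> None:
        self._items: dict[str, PendingConfirmation] = {}

    def create(self, action: str, timeout_seconds: float) -> str:
        confirmation_id = uuid.uuid4().hex
        self._items[confirmation_id] = PendingConfirmation(action, timeout_seconds)
        return confirmation_id

    def resolve(self, confirmation_id: str, approved: bool) -> None:
        item = self._items[confirmation_id]
        self._expire_if_needed(item)
        if item.status == "pending":  # a late answer cannot revive an expired request
            item.status = "approved" if approved else "denied"

    def poll(self, confirmation_id: str) -> dict:
        item = self._items[confirmation_id]
        self._expire_if_needed(item)
        elapsed = time.monotonic() - item.created_at
        remaining = max(0.0, item.timeout_seconds - elapsed) if item.status == "pending" else 0.0
        return {"id": confirmation_id, "status": item.status, "seconds_remaining": remaining}

    @staticmethod
    def _expire_if_needed(item: PendingConfirmation) -> None:
        if item.status == "pending" and time.monotonic() - item.created_at >= item.timeout_seconds:
            item.status = "expired"
```

Expiry is checked on every read and write, so an approval that arrives after the window closes is ignored rather than applied.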

Timeout Handling

confirmation_timeout:
  medium: 60s     # on timeout: auto-deny + notify user
  high: 30s       # on timeout: auto-deny + escalate to admin
  critical: 10s   # on timeout: auto-deny + emergency notification

Auto-deny is the safe default: a timeout never auto-approves. If the human doesn't respond, the action doesn't happen.
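
One way to make auto-deny the only possible timeout outcome is to wrap the wait in `asyncio.wait_for` and map the timeout to a denial. This is a sketch under the assumption that the platform adapter exposes the human's answer as an awaitable; `await_decision` is a hypothetical helper, and the notify and escalate side effects are left out.

```python
import asyncio
from typing import Awaitable


async def await_decision(response: Awaitable[bool], timeout_seconds: float) -> bool:
    """Return the human's decision, or False (deny) if the window expires."""
    try:
        return await asyncio.wait_for(response, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # No answer in time: deny. There is no code path that approves on timeout.
        return False
```

Because the function can only return `True` when the awaitable itself resolves to `True`, silence from the human can never be read as consent.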

Cross-Platform Failover

Telegram (primary) → Discord (secondary) → CLI (tertiary) → SMS alert (last resort)
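
The failover chain can be sketched as an ordered list of channels tried in turn. In this example a channel is a hypothetical callable that returns `True` or `False` for a decision and `None` when the human could not be reached; `confirm_with_failover` is an illustrative name, and the final SMS alert is not modelled.

```python
from typing import Callable, Optional, Sequence, Tuple

# A channel takes the prompt and returns a decision, or None if nobody was reached.
Channel = Callable[[str], Optional[bool]]


def confirm_with_failover(
    prompt: str, channels: Sequence[Tuple[str, Channel]]
) -> Tuple[bool, Optional[str]]:
    """Try each channel in order; return (decision, name of the channel that answered)."""
    for name, ask in channels:
        try:
            decision = ask(prompt)
        except Exception:
            decision = None  # treat a delivery failure as "unreachable"
        if decision is not None:
            return decision, name
    return False, None  # nobody answered on any platform: auto-deny
```

An explicit denial on an earlier channel ends the chain; only silence or a delivery failure moves on to the next platform.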

3. Two-Factor Confirmation (Human + LLM)

The Pattern

1. Agent proposes action
2. LLM evaluates: is this action safe? (smart approval)
3. If LLM says "safe" AND risk < threshold: auto-approve
4. If LLM says "dangerous": auto-deny
5. If LLM says "uncertain" OR risk >= threshold: route to human
6. Human decides: approve or deny
7. Decision logged and fed back to LLM for learning

Smart Approval (LLM Pre-screen)

The LLM evaluates the action before bothering the human:

Routine safe actions (git commit, file write to project dir): auto-approve
Ambiguous actions: escalate to human
Clearly dangerous actions: auto-deny

The pre-screen is estimated to cut the number of prompts that reach the human by 60-80%, which is the main lever against confirmation fatigue.
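
The branching described in this section reduces to a small pure function. The sketch below assumes the LLM pre-screen returns one of three verdicts; `Verdict`, `Outcome`, and `decide` are hypothetical names, not the signature of `smart_approve`.

```python
from enum import Enum


class Verdict(Enum):
    SAFE = "safe"
    UNCERTAIN = "uncertain"
    DANGEROUS = "dangerous"


class Outcome(Enum):
    AUTO_APPROVE = "auto_approve"
    ROUTE_TO_HUMAN = "route_to_human"
    AUTO_DENY = "auto_deny"


def decide(verdict: Verdict, risk: float, threshold: float) -> Outcome:
    """Combine the LLM verdict with the risk score."""
    if verdict is Verdict.DANGEROUS:
        return Outcome.AUTO_DENY
    if verdict is Verdict.SAFE and risk < threshold:
        return Outcome.AUTO_APPROVE
    # Uncertain verdict, or a "safe" verdict on an action at or above the threshold.
    return Outcome.ROUTE_TO_HUMAN
```

Note that a "safe" verdict alone is not enough: an action whose risk score meets the threshold still goes to the human.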

Whitelisting

approval:
  whitelist:
    - pattern: "git add ."           # Always safe in project dir
      scope: session
    - pattern: "npm install"          # Package installs are routine
      scope: always
    - pattern: "python3 -m pytest"    # Tests are always safe
      scope: always

Whitelist scopes (values of the scope key):

session: approve for this session only
always: permanent whitelist (stored in config)
auto: LLM decides based on context
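
A sketch of how the session and always scopes could behave at lookup time is shown below. The `Whitelist` class and its methods are hypothetical, the `auto` scope is not modelled because it depends on the LLM, and exact matching is an assumption made to keep a single entry from permitting more than it names.

```python
from dataclasses import dataclass, field


@dataclass
class Whitelist:
    always: set = field(default_factory=set)   # would be loaded from config
    session: set = field(default_factory=set)  # cleared when the session ends

    def allow(self, pattern: str, scope: str) -> None:
        if scope not in ("session", "always"):
            raise ValueError(f"unsupported scope: {scope}")
        getattr(self, scope).add(pattern)

    def is_whitelisted(self, command: str) -> bool:
        # Exact match after collapsing whitespace. Prefix matching would let
        # "git add . && <anything>" ride along on the "git add ." entry.
        normalized = " ".join(command.split())
        return normalized in self.always or normalized in self.session

    def end_session(self) -> None:
        self.session.clear()
```

Usage follows the YAML above: `allow("git add .", "session")` trusts the command until `end_session()` is called, while `allow("python3 -m pytest", "always")` survives across sessions.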

Confirmation Fatigue Prevention

Batch approvals: "5 commands pending. Approve all? [Y/n]"
Session whitelisting: "Trust this agent for the session?"
Pattern learning: If the user approves "git add" 10 times, auto-whitelist
Risk-proportional: Low-risk actions get less friction
Context-aware: In trusted directories, reduce confirmations
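
Pattern learning can be sketched as a counter over approved patterns. The threshold of 10 comes from the example above. `ApprovalLearner` is a hypothetical name, and the rule that a single denial resets the count and revokes learned trust is an assumption added for this sketch.

```python
from collections import Counter

AUTO_WHITELIST_AFTER = 10  # from the "approves 10 times" example


class ApprovalLearner:
    def __init__(self, threshold: int = AUTO_WHITELIST_AFTER) -> None:
        self.threshold = threshold
        self.approvals = Counter()
        self.whitelisted = set()

    def record(self, pattern: str, approved: bool) -> bool:
        """Record a human decision; return True once the pattern is whitelisted."""
        if not approved:
            # Assumption: one denial resets the count and revokes learned trust.
            self.approvals.pop(pattern, None)
            self.whitelisted.discard(pattern)
            return False
        self.approvals[pattern] += 1
        if self.approvals[pattern] >= self.threshold:
            self.whitelisted.add(pattern)
        return pattern in self.whitelisted
```

Revoking on denial keeps the learned whitelist from outliving the user's trust in a pattern.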

4. Crisis-Specific Patterns

What NEVER Requires Confirmation in Crisis

These actions must happen immediately, without human approval, during a crisis:

988 Lifeline display: Always show, never gate
Crisis text line: Always provide, never gate
Emergency services: Always mention for imminent danger
Gospel message: Always share per SOUL.md protocol
"Are you safe?": Always ask first, never delay

What DOES Require Confirmation in Crisis

Contacting emergency services on behalf of user: Human must confirm
Sharing user's location: Consent required
Notifying user's emergency contacts: Human must confirm
Ending the crisis conversation: Human must confirm

Balance: Safety vs Responsiveness

Normal mode:  Safety > Speed (confirm everything dangerous)
Crisis mode:  Speed > Safety for SUPPORT actions
              Safety > Speed for DECISION actions

Support actions (no confirmation needed):

Display crisis resources
Express empathy
Ask safety questions
Stay present

Decision actions (confirmation required):

Contact emergency services
Share user information
Make commitments about follow-up
End conversation
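
The support/decision split can be expressed as two fixed sets and one check. The action identifiers below are hypothetical labels for the items in the two lists, and gating any unrecognised action is an assumption chosen so that the check fails closed.

```python
# Hypothetical identifiers for the support actions listed above.
SUPPORT_ACTIONS = frozenset({
    "display_crisis_resources",
    "express_empathy",
    "ask_safety_question",
    "stay_present",
})

# Hypothetical identifiers for the decision actions listed above.
DECISION_ACTIONS = frozenset({
    "contact_emergency_services",
    "share_user_information",
    "commit_to_follow_up",
    "end_conversation",
})


def crisis_requires_confirmation(action: str) -> bool:
    """In crisis mode, only support actions skip the confirmation gate."""
    if action in SUPPORT_ACTIONS:
        return False
    # Decision actions, and anything unrecognised, are gated.
    return True
```

Listing the ungated actions explicitly means a newly added action is gated by default until someone decides it is a support action.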

5. Architecture

User Message
    │
    ▼
┌─────────────────┐
│ SHIELD Detector  │──→ Crisis? → Crisis Protocol (no confirmation)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Tier Classifier  │──→ Tier 0-1: Auto-approve
└────────┬────────┘
         │ Tier 2-4
         ▼
┌─────────────────┐
│ Smart Approval   │──→ LLM says safe? → Auto-approve
│ (LLM pre-screen) │──→ LLM says uncertain? → Human
└────────┬────────┘
         │ Needs human
         ▼
┌─────────────────┐
│ Platform Router  │──→ Telegram inline keyboard
│                  │──→ Discord reaction
│                  │──→ CLI prompt
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Timeout Handler  │──→ Auto-deny + notify
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Decision Logger  │──→ Audit trail
└─────────────────┘
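
For the final stage, an audit trail can be as simple as one JSON object per line. The sketch below is an assumption about what such a record might contain; `log_decision` and its field names are illustrative and are not the logging code in tools/approval.py.

```python
import json
from datetime import datetime, timezone
from typing import TextIO


def log_decision(stream: TextIO, action: str, tier: int, decision: str, decided_by: str) -> dict:
    """Append one JSON line describing a confirmation outcome and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "tier": tier,
        "decision": decision,      # e.g. "approved", "denied"
        "decided_by": decided_by,  # e.g. "human", "llm", "timeout"
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Recording who decided (human, LLM, or timeout) alongside the outcome makes it possible to audit how often each path is taken.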

6. Implementation Status

Component	Status	File
Tier classification	Implemented	tools/approval_tiers.py
Dangerous pattern detection	Implemented	tools/approval.py
Crisis detection	Implemented	agent/crisis_protocol.py
Gate execution order	Designed	docs/approval-tiers.md
Smart approval (LLM)	Partial	tools/approval.py (smart_approve)
Timeout handling	Designed	tools/approval_tiers.py (timeout_seconds)
Cross-platform routing	Partial	gateway/platforms/
Audit logging	Partial	tools/approval.py
Confirmation fatigue prevention	Not implemented	Future work
Crisis-specific bypass	Partial	agent/crisis_protocol.py

7. Sources

Vitalik's blog: "A simple and practical approach to making LLMs safe"
Issue #280: Vitalik Security Architecture
Issue #282: Human Confirmation Daemon (port 6000)
Issue #328: Gateway config debt
Issue #665: Epic — Bridge Research Gaps
SOUL.md: When a Man Is Dying protocol
988 Suicide & Crisis Lifeline training
