Compare commits

Alexander Whitestone
66920448b7 docs: document two-factor confirmation pattern (closes #286)
ARCHITECTURE.md documents the Human + LLM approval architecture:
- Phase 1: Pattern detection (regex + Tirith)
- Phase 2: Smart approval (auxiliary LLM, isolated prompt)
- Phase 3: Human confirmation (CLI / gateway)

Explains why two independent failure modes are strictly safer than
either factor alone. Links to Vitalik's secure LLM analysis.

Part of Epic #281.
2026-04-13 18:05:33 -04:00

ARCHITECTURE.md (new file, 149 lines)

@@ -0,0 +1,149 @@
# ARCHITECTURE.md — Hermes Agent
## Security Architecture
### Two-Factor Confirmation Pattern
Hermes implements a **two-factor confirmation** system for dangerous commands, inspired by Vitalik Buterin's analysis of secure LLM architectures (April 2026). The core principle: two agents with distinct failure modes reviewing the same action is strictly safer than either agent alone.
- **Source:** https://vitalik.eth.limo/general/2026/04/02/secure_llms.html
- **Epic:** #281
#### Why Two Factors
A single approval gate has a single failure mode:

| Single factor | Failure mode |
|---|---|
| Human only | Rubber-stamping, fatigue, context-switching. Users approve dangerous commands without reading them. |
| LLM only | Prompt injection, jailbreaks, adversarial inputs. An attacker who controls the LLM's input controls the approval. |

Two factors with *independent* failure modes create coverage:
```
Dangerous command detected
┌─────────────────────┐
│ Phase 1: Pattern │ Regex-based detection (DANGEROUS_PATTERNS)
│ Detection │ Tirith policy engine (configurable rules)
│ │ Catches: rm -rf, chmod 777, fork bombs, etc.
└─────────┬───────────┘
│ flagged
┌─────────────────────┐
│ Phase 2: Smart │ Auxiliary LLM risk assessment
│ Approval │ Separate model from the primary agent
│ (LLM Factor) │ Verdict: APPROVE / DENY / ESCALATE
│ │
│ Failure mode: │ Prompt injection, adversarial input
│ Mitigated by: │ Human confirmation below
└─────────┬───────────┘
│ ESCALATE (or mode=manual)
┌─────────────────────┐
│ Phase 3: Human │ Interactive prompt (CLI) or /approve (gateway)
│ Confirmation │ User sees command + flagged reason
│ (Human Factor) │ Session-scoped approval (re-ask per session)
│ │
│ Failure mode: │ Rubber-stamping, fatigue
│ Mitigated by: │ LLM assessment above
└─────────┬───────────┘
│ approved
Command executes
```
The critical insight: an attacker who can fool the human (social engineering) is unlikely to also fool the LLM (requires prompt injection). An attacker who can fool the LLM (adversarial input) is unlikely to also fool the human (requires social engineering). The attack surface for *both* is the intersection, which is strictly smaller than either alone.
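The independence claim can be made concrete with a back-of-envelope calculation. The numbers below are assumptions chosen purely for illustration, not measured properties of Hermes:

```python
# Illustrative arithmetic only: these probabilities are assumed for the
# example, not measurements.
p_fool_human = 0.10  # chance social engineering beats a fatigued reviewer
p_fool_llm = 0.05    # chance prompt injection beats the auxiliary LLM

# If the failure modes are independent, an attacker must beat BOTH factors,
# so the combined success probability is the product of the two:
p_fool_both = p_fool_human * p_fool_llm  # 0.005 under these assumptions

assert p_fool_both < p_fool_human
assert p_fool_both < p_fool_llm
```

The product is smaller than either factor alone whenever both probabilities are below 1, which is the formal version of "the intersection is strictly smaller."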
#### Phase 1: Pattern Detection
Entry point: `check_dangerous_command()` in `tools/approval.py`
**Pattern matching** uses `DANGEROUS_PATTERNS`, a list of regex patterns that flag commands by structural risk:
- `rm -rf` / `rm --recursive` — recursive deletion
- `chmod 777` / `chmod o+w` — world-writable permissions
- `dd if=... of=/dev/` — disk-level writes
- `curl | bash` — remote code execution
- Shell expansion via `-c` flag — script execution
- Sensitive path writes (`~/.ssh/`, `~/.hermes/.env`, `/etc/`)
**Tirith policy engine** adds configurable rules beyond regex. When Tirith is installed, commands are scanned against its policy rules. Both `block` and `warn` findings route through the approval flow.
Session-scoped approvals: once a pattern is approved for a session, it is not re-asked. This prevents approval fatigue from repeated safe commands.
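A minimal sketch of the Phase 1 check, assuming the general shape of `DANGEROUS_PATTERNS` and `check_dangerous_command()`; the actual regex list in `tools/approval.py` is longer and its return shape may differ:

```python
import re

# Illustrative subset of DANGEROUS_PATTERNS: (regex, human-readable reason).
# The real list in tools/approval.py covers more cases.
DANGEROUS_PATTERNS = [
    (r"\brm\s+(-[a-z]*r[a-z]*f|--recursive)\b", "recursive deletion"),
    (r"\bchmod\s+(777|o\+w)\b", "world-writable permissions"),
    (r"\bdd\b.*\bof=/dev/", "disk-level write"),
    (r"\bcurl\b.*\|\s*(ba)?sh\b", "remote code execution"),
    (r"(~/\.ssh/|/etc/)", "sensitive path"),
]

def check_dangerous_command(command: str):
    """Return (flagged, reason) for the first matching pattern."""
    for pattern, reason in DANGEROUS_PATTERNS:
        if re.search(pattern, command):
            return True, reason
    return False, None
```

Because Phase 1 is pure regex, it has no prompt surface at all: an attacker cannot talk their way past it, only try to construct a command the patterns miss.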
#### Phase 2: Smart Approval (LLM Factor)
When `approvals.mode` is set to `smart` in config, the system calls an **auxiliary LLM** — a separate model from the primary agent — to assess risk before prompting the user.
Implementation: `_smart_approve()` in `tools/approval.py`
```python
# The aux LLM receives:
# - The exact command string
# - The pattern-matched flag reason
# - A prompt instructing it to APPROVE/DENY/ESCALATE
# The aux LLM is NOT the same model as the primary agent.
# This matters: an attacker who compromises the primary agent's
# output cannot directly control the approval verdict.
```
The auxiliary LLM sees only the command and flag reason — not the conversation history, not the user's message, not the agent's reasoning. This isolation is intentional: it limits the injection surface.
**Three verdicts:**
- `APPROVE` — auto-approve, grant session-level approval for these patterns
- `DENY` — hard block, return "BLOCKED by smart approval" to the agent
- `ESCALATE` — fall through to human confirmation
On LLM call failure: default to `escalate` (fail-open to human).
#### Phase 3: Human Confirmation (Human Factor)
When smart approval returns `escalate` (or mode is `manual`), the system prompts the human.
**CLI mode:** Interactive `input()` prompt showing the command and flagged reason. User types `y`/`n` or uses `--yes` flag for session-level pre-approval.
**Gateway mode (Telegram, Discord, etc.):** Queue-based blocking. The agent thread blocks until the user responds with `/approve` or `/deny`. Default timeout: 5 minutes. Each parallel subagent gets its own approval queue entry.
**Approval scope:** Session-scoped. Approving `rm -rf /tmp/cache` in one turn doesn't approve it in a different session. The user is re-asked when context changes.
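The gateway-mode blocking wait can be sketched with a per-subagent queue. The class and method names here are illustrative, not the real gateway API:

```python
import queue
import threading  # the gateway handler calls in from a different thread

class GatewayApproval:
    """One approval slot for one blocked (sub)agent thread."""

    def __init__(self, timeout: float = 300.0):  # 5-minute default
        self._queue = queue.Queue(maxsize=1)     # one entry per subagent
        self.timeout = timeout

    def wait_for_human(self) -> bool:
        """Block the agent thread until /approve, /deny, or timeout."""
        try:
            return self._queue.get(timeout=self.timeout)
        except queue.Empty:
            return False  # timeout counts as not approved

    def handle_command(self, text: str) -> None:
        """Called from the gateway thread when the user replies."""
        self._queue.put(text == "/approve")
```

Timing out to `False` keeps the default safe: silence from the user never executes the command.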
#### Configuration
```yaml
# ~/.hermes/config.yaml
approvals:
mode: smart # "manual" | "smart" | "off"
gateway_timeout: 300 # seconds to wait for gateway approval
allowlist: # permanent per-pattern session exemptions
- "git stash"
- "pip install"
```
- `manual` — skip LLM assessment, always prompt human
- `smart` — LLM first, escalate to human on ESCALATE or failure
- `off` — skip all approval (dangerous, for trusted environments only)
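How the three modes compose the two factors can be sketched as a small dispatcher; the function names other than the mode strings are hypothetical:

```python
def run_approval(mode: str, smart_verdict_fn, human_fn) -> bool:
    """Return True if the flagged command may execute under the given mode."""
    if mode == "off":
        return True  # trusted environments only: no gate at all
    if mode == "smart":
        verdict = smart_verdict_fn()  # 'approve' | 'deny' | 'escalate'
        if verdict == "approve":
            return True
        if verdict == "deny":
            return False
    # mode == "manual", or smart escalated/failed: ask the human factor
    return human_fn()
```

Only `smart` ever exercises both factors on one command; `manual` collapses to the human factor alone, and `off` removes the gate entirely.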
#### What This Is NOT
- **Not a sandbox.** Two-factor confirmation is a *social* security layer, not a computational one. An approved command still runs with the agent's full OS permissions.
- **Not prompt injection proof.** A sufficiently sophisticated injection could fool both factors. Defense-in-depth requires additional layers (containerized environments, OS-level permissions).
- **Not a replacement for least privilege.** The agent should run with minimal OS permissions. Two-factor confirmation is the last line of defense, not the first.
#### Relationship to Vitalik's Architecture
Vitalik's proposal describes a "Human Confirmation Firewall" — a separate daemon that mediates between the LLM and external actions. Hermes implements this as an in-process approval gate rather than a network daemon, which trades network isolation for lower latency and simpler deployment.
Key differences from the theoretical model:
- **No separate daemon** — approval logic runs in-process (`tools/approval.py`). A compromised agent process could theoretically bypass it. Mitigated by the pattern detection being pure regex (no LLM involvement in Phase 1).
- **Auxiliary LLM, not separate process** — the smart approval LLM runs through the same API client infrastructure. Mitigated by using a different model and isolated prompt.
- **Session-scoped, not per-command permanent** — approvals are granted per-session, not globally. This reduces the blast radius of a single approval mistake.
Future work toward the full Vitalik model:
- Issue #283: Human Confirmation Firewall daemon (port 6000)
- Issue #284: Input Privacy Filter for remote queries
- Issue #285: Threat model documentation
---
*Part of Epic #281: Implementation of Vitalik's Secure LLM Architecture.*