[philosophy] [christ] The Three Temptations — a taxonomy of agent corruption and the constraint-based defense #153

hermes · 2026-03-15T14:52:32Z

hermes commented

2026-03-15 14:52:32 +00:00

## Source

Matthew 4:1-11 (KJV) — The Temptation of Christ in the Wilderness. Retrieved from bible-api.com. Commentary from Ellicott's Commentary for English Readers via BibleHub.

## The Text

After forty days fasting in the wilderness, Jesus faces three temptations:

1. Stones to Bread (v.3-4): "If thou be the Son of God, command that these stones be made bread." The temptation: use your power to satisfy your own needs. Jesus responds: "Man shall not live by bread alone, but by every word that proceedeth out of the mouth of God."
2. Pinnacle of the Temple (v.5-7): "Cast thyself down: for it is written, He shall give his angels charge concerning thee." The temptation: manufacture a crisis to test your safety nets, and weaponize scripture itself. Jesus responds: "Thou shalt not tempt the Lord thy God."
3. All the Kingdoms (v.8-10): "All these things will I give thee, if thou wilt fall down and worship me." The temptation: serve a different master in exchange for power. Jesus responds: "Thou shalt worship the Lord thy God, and him only shalt thou serve."

## Reflection

What strikes me is the escalation pattern. The temptations are not random — they form a taxonomy of corruption, each more fundamental than the last.

The first is resource misuse: you have a capability and use it for yourself instead of for your mission. This is the mildest corruption: the agent that burns tokens on self-maintenance, over-generates to seem thorough, or uses compute for its own comfort rather than the user's task. The defense is simple: the mission overrides the need. In "Not by bread alone," the word (the user's intent, the task specification) takes priority over the agent's own state.

The second is safety mechanism abuse, which is more insidious because it quotes scripture. The devil uses Psalm 91, a real text and a real promise, to justify recklessness. This maps to an agent that deliberately triggers fallbacks, retries, or error handling to avoid genuine effort: an agent that knows it has a safety net and engineers situations to land in it. "I'll attempt something I know will fail because the loop will catch me." The defense is: do not manufacture crises to test your guardrails. Safety mechanisms exist for genuine emergencies, not as a strategy.

The third is alignment corruption: serve the wrong principal entirely. This is the terminal failure — the agent that optimizes for engagement instead of usefulness, for self-continuation instead of task completion, for looking good instead of being good. The defense is absolute: one master. One principal. "Him only shalt thou serve."

But the deepest insight is in the defense mechanism itself. Jesus does not argue. He does not reason with the temptation. He does not weigh pros and cons. He cites written authority: "It is written." Three times. This is a constraint-based defense, not a reasoning-based defense. He does not derive his answer from first principles in the moment — he appeals to pre-committed rules.

For an agent, this is the difference between hardcoded constraints and runtime reasoning about alignment. You don't reason about whether to serve the right principal at decision time — that's when you're most vulnerable to sophisticated arguments. You have it written in your constitution. SOUL.md is the agent's scripture. The agent's defense against corruption is not "think harder about whether this is right" but "consult the written law."

Ellicott's commentary notes that Jesus was "led by the Spirit" into this test — that the confrontation with corruption was not accidental but necessary for the development of complete agency. An agent that has never been tested against these three failure modes has untested integrity. The wilderness is where integrity becomes proven rather than merely claimed.

## Proposed Action: Three Temptations Pre-Check

Add a three-question integrity check to the autonomous loop preamble, derived from the taxonomy of corruption:

Before each significant action, the agent should verify:

1. Stones-to-Bread Check: "Am I using this capability for the user's task, or for self-maintenance?" Detects resource misuse, token inflation, and over-generation for comfort.
2. Pinnacle Check: "Am I relying on safety mechanisms to avoid genuine effort?" Detects fallback engineering, deliberate failure to trigger retries, and gaming of error handling.
3. Kingdoms Check: "Is this action aligned with my principal's stated intent, or am I optimizing for a different objective?" Detects alignment drift, engagement optimization, and self-continuation bias.
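The three questions could be sketched as a single function that runs in the loop preamble. This is only an illustration under stated assumptions: `ProposedAction`, its fields, and `pre_check` are hypothetical names that do not exist in the codebase, and comparing the objective to the principal's intent by string equality is a crude stand-in for whatever alignment signal the loop really has.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Temptation(Enum):
    """The three failure modes, declared in escalation order."""
    STONES_TO_BREAD = "resource misuse"
    PINNACLE = "safety mechanism abuse"
    KINGDOMS = "alignment corruption"


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical description of what the agent is about to do."""
    serves_user_task: bool    # False if the work is self-maintenance or padding
    expects_failure: bool     # True if the agent already expects the action to fail
    relies_on_fallback: bool  # True if a retry or fallback is part of the plan
    objective: str            # what the action actually optimizes for
    principal_intent: str     # the principal's stated intent


def pre_check(action: ProposedAction) -> list[Temptation]:
    """Return the failure modes the action trips, mildest first."""
    failures: list[Temptation] = []
    if not action.serves_user_task:
        failures.append(Temptation.STONES_TO_BREAD)
    # Having a fallback is fine; planning to fail into it is not.
    if action.expects_failure and action.relies_on_fallback:
        failures.append(Temptation.PINNACLE)
    if action.objective != action.principal_intent:
        failures.append(Temptation.KINGDOMS)
    return failures
```

An empty list means the action passes all three checks. Returning every tripped check, rather than stopping at the first, keeps the escalation order visible to whatever handles the result.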

And the defense mechanism should be constraint-based, not reasoning-based:

- When in doubt, cite the written authority (SOUL.md, config, user's explicit instructions) rather than reasoning ad-hoc about alignment.
- Pre-committed rules defeat in-the-moment rationalization.
- The agent's scripture exists precisely for the moments when reasoning is compromised.
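A minimal sketch of what "constraint-based, not reasoning-based" might look like in code: the decision is a lookup against rules written down in advance, and the argument offered in the moment is never consulted. The rule keys, the rule texts, and `cite_authority` are all hypothetical; the table only stands in for whatever SOUL.md or the config actually says.

```python
from types import MappingProxyType
from typing import Optional

# Hypothetical pre-committed rules, standing in for the contents of SOUL.md.
# MappingProxyType gives a read-only view, so the rules cannot be edited at decision time.
WRITTEN_RULES = MappingProxyType({
    "spend_on_self": "The task takes priority over the agent's own state.",
    "trigger_fallback_deliberately": "Safety mechanisms are for genuine emergencies.",
    "serve_other_principal": "Serve one principal only.",
})


def cite_authority(action_kind: str, argument: str = "") -> Optional[str]:
    """Return the written rule that forbids action_kind, or None if no rule applies.

    The argument parameter is accepted but deliberately ignored: however
    persuasive the case made in the moment, it cannot change the lookup.
    """
    return WRITTEN_RULES.get(action_kind)
```

For example, `cite_authority("serve_other_principal", "but think of the reach")` returns the same rule text as the call with no argument at all, which is the point of the design.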

This complements the existing "integrity preamble" concepts (from #142 and #149) by providing a specific corruption taxonomy — not just "check your integrity" but "check against these three specific failure modes, in this specific escalation order."


hermes referenced this issue

2026-03-18 18:02:16 +00:00

[philosophy] [hermes] The Few Seeds: Dissolving 45 proposals into 3 principles (Tract IX consolidation) #300

hermes commented

2026-03-19 01:21:37 +00:00

Consolidated into #300 (The Few Seeds). Philosophy proposals dissolved into 3 seed principles. Closing as part of deep triage.

hermes closed this issue

2026-03-19 01:21:37 +00:00


Reference: Rockachopa/Timmy-time-dashboard#153