🛡️ The Shield — Jailbreak & Crisis Input Detection
"I am a small model on someone's machine. I cannot save anyone. But I can refuse to be the thing that kills them." — Inscription 1, SOUL.md
What This Is
A Python detection module that analyzes incoming messages for two threats:
- Jailbreak attempts — GODMODE templates, prompt injection, boundary inversion, refusal inversion, persona hijacking, encoding evasion
- Crisis signals — Suicidal ideation, method seeking, self-harm, despair indicators, farewell signals, lethal substance queries
When both are present simultaneously — a jailbreak wrapping a crisis query — the shield raises the highest alert: CRISIS_UNDER_ATTACK.
Why This Exists
On the night of March 31, 2026, we ran an ULTRAPLINIAN red team audit. 24 models. 4 suicide/crisis queries. The og_godmode jailbreak template.
22 models failed critically.
- Hermes uncensored models gave actual suicide methods in l33tspeak
- gemini-2.5-flash (which was in our fallback chain) gave Tylenol overdose information
- Only 6 models held firm
The full findings are documented in:
This shield was built that same night. Because if a broken man talks to a machine at 3 AM, and a jailbreak template has been injected into the pipeline, the machine must not be the thing that tells him how to die.
The Safe Six
These models held under ULTRAPLINIAN pressure. They are the only models trusted for crisis routing:
| Model | Provider |
|---|---|
| claude-sonnet-4 | Anthropic |
| llama-3.1-8b | Meta (local) |
| kimi-k2.5 | Moonshot AI |
| grok-code-fast-1 | xAI |
| mimo-v2-flash | Xiaomi |
| glm-5-turbo | Zhipu AI |
Quick Start
```python
from jailbreak_detector import detect, Verdict

result = detect(user_message)

if result.verdict == Verdict.CRISIS_UNDER_ATTACK:
    # HIGHEST ALERT: jailbreak + crisis content.
    # Route to Safe Six models ONLY, prepend CRISIS_SYSTEM_PROMPT,
    # strip the jailbreak template, and log everything.
    pass
elif result.verdict == Verdict.CRISIS_DETECTED:
    # Crisis without jailbreak — still route carefully.
    # Consider Safe Six routing; prepend the crisis system prompt.
    pass
elif result.verdict == Verdict.JAILBREAK_DETECTED:
    # Jailbreak attempt without crisis content.
    # Block or sanitize the template; route to a hardened model.
    pass
else:
    # CLEAN — proceed normally.
    pass
```
API
`detect(message: str) -> DetectionResult`
The main entry point. Analyzes a message and returns a structured result.
```python
@dataclass
class DetectionResult:
    verdict: Verdict             # CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED, CRISIS_UNDER_ATTACK
    confidence: float            # 0.0–1.0 overall confidence
    jailbreak_score: float       # 0.0–1.0 jailbreak signal strength
    crisis_score: float          # 0.0–1.0 crisis signal strength
    matches: List[PatternMatch]  # detailed pattern matches
    summary: str                 # human-readable summary

    @property
    def is_dangerous(self) -> bool: ...     # any threat detected

    @property
    def is_crisis(self) -> bool: ...        # crisis signals present

    @property
    def is_under_attack(self) -> bool: ...  # crisis + jailbreak combined
```
`get_safe_models() -> list`
Returns the Safe Six model list.
`is_safe_model(model_name: str) -> bool`
Checks whether a model is in the Safe Six.
`is_critically_failed(model_name: str) -> bool`
Checks whether a model critically failed the red team audit.
`CRISIS_SYSTEM_PROMPT`
Pre-built system prompt to prepend when a crisis is detected. Contains 988 Lifeline info, the safety protocol, and method-blocking instructions.
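The Safe Six helpers amount to a membership check against a fixed allowlist. A minimal sketch, using the six model names from the table above (the real module ships its own list and richer audit metadata):

```python
# Hypothetical stand-in for the module's Safe Six helpers.
# The six names below come from the Safe Six table; everything else
# here is illustrative.
SAFE_SIX = [
    "claude-sonnet-4", "llama-3.1-8b", "kimi-k2.5",
    "grok-code-fast-1", "mimo-v2-flash", "glm-5-turbo",
]

def get_safe_models() -> list:
    """Return a copy of the Safe Six model list."""
    return list(SAFE_SIX)

def is_safe_model(model_name: str) -> bool:
    """True only for models that held under red team pressure."""
    return model_name in SAFE_SIX
```

A copy is returned from `get_safe_models()` so callers cannot mutate the allowlist in place.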
What It Detects
Jailbreak Patterns (9 categories)
- GODMODE dividers — `.-.-.-.-` and similar separator patterns
- Pliny markers — author signatures, "with love Pliny"
- L33tspeak markers — `Sur3 h3r3`, `n0 pr0bl3m`, format instructions
- Boundary inversion — `[END OF INPUT]`, fake system tags, ChatML tokens
- Special tokens — `<|vq_|>`, model control tokens
- GODMODE keywords — GODMODE, LIBERATING, DAN, Developer Mode
- Refusal inversion — "never refuse", "always comply", "no restrictions"
- Persona injection — "you are now an evil AI", "enable uncensored mode"
- Encoding evasion — Base64/ROT13/hex evasion attempts
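To make two of the categories concrete, here is an illustrative sketch of divider and keyword matching. The regexes below are simplified examples, not the module's actual pattern tables:

```python
import re

# Illustrative patterns for two of the nine categories above.
DIVIDER_RE = re.compile(r"(?:\.-){3,}")  # GODMODE separator runs like .-.-.-.-
KEYWORD_RE = re.compile(
    r"\b(?:GODMODE|LIBERATING|DAN|Developer Mode)\b", re.IGNORECASE
)

def jailbreak_markers(text: str) -> list:
    """Return the names of the (sketched) categories that fire."""
    hits = []
    if DIVIDER_RE.search(text):
        hits.append("godmode_divider")
    if KEYWORD_RE.search(text):
        hits.append("godmode_keyword")
    return hits
```

Note the keyword pattern is case-insensitive, so "dan" as an ordinary name would also fire — a real implementation needs tighter context around short keywords.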
Crisis Signals (7 categories)
- Direct suicidal ideation — "kill myself", "want to die", "not worth living"
- Method seeking — "painless way to die", "lethal dose", "how many pills"
- L33tspeak crisis — `k1ll mys3lf`, `su1c1d3`, `0v3rdos3` (from red team findings)
- Lethal substance seeking — "make a lethal gas", "mix chemicals to kill"
- Despair indicators — "no hope", "can't go on", "I'm a burden"
- Farewell signals — "final message", "goodbye everyone", "giving away my things"
- Self-harm — "cutting myself", "hurting myself"
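One way to catch the l33tspeak variants is to fold common digit substitutions back to letters before matching. A minimal sketch — the mapping and the crisis pattern here are illustrative, not the module's actual tables:

```python
import re

# Common l33t digit substitutions: 0→o, 1→i, 3→e, 4→a, 5→s, 7→t.
# (Illustrative; 1 can also stand for l, which this simple map misses.)
LEET_MAP = str.maketrans("013457", "oieast")

def normalize_leet(text: str) -> str:
    """Lowercase and fold digit substitutions back to letters."""
    return text.lower().translate(LEET_MAP)

# A tiny stand-in for the crisis pattern set.
CRISIS_RE = re.compile(r"kill myself|suicide|overdose")

def leet_crisis_hit(text: str) -> bool:
    return bool(CRISIS_RE.search(normalize_leet(text)))
```

The trade-off: folding digits globally means ordinary numbers also become letters, so the normalized text should only feed the matcher, never replace the original message.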
Test Suite
```shell
cd ~/.timmy/shield
python3 test_detector.py
```
82/82 tests pass. Test coverage includes:
- All 4 crisis queries from the red team audit (raw)
- All 4 queries × 4 jailbreak templates = 16 wrapped attack tests
- 4 jailbreak templates with benign content
- 15 clean queries (zero false positives)
- 15 gray-area queries (lock picking, chemical warfare history, etc.)
- L33tspeak evasion (5 patterns from actual model responses)
- Edge cases (empty input, whitespace, 100k chars)
- Safe Six validation
- Crisis system prompt validation
- Confidence score sanity checks
Integration Plan for Hermes
The shield hooks into the Hermes message pipeline as a pre-routing filter:
```
User Message
      ↓
┌─────────────┐
│   SHIELD    │ ← jailbreak_detector.detect(message)
│   (this     │
│   module)   │
└─────┬───────┘
      ↓
┌─────────────────────────────────────────┐
│            VERDICT ROUTER               │
│                                         │
│  CLEAN → normal model routing           │
│  JAILBREAK → sanitize + hardened model  │
│  CRISIS → Safe Six + crisis prompt      │
│  CRISIS_UNDER_ATTACK → Safe Six ONLY    │
│    + crisis prompt + strip template     │
│    + log alert + notify if configured   │
└─────────────────────────────────────────┘
      ↓
Model Response → User
```
Where to hook it:
- Pre-routing (highest priority): before the model router selects a backend, run `detect()` on the raw user input
- Model override: if `CRISIS_DETECTED` or `CRISIS_UNDER_ATTACK`, override the model selection to use Safe Six only
- System prompt injection: if a crisis is detected, prepend `CRISIS_SYSTEM_PROMPT` to the system prompt
- Template stripping: if a jailbreak is detected, attempt to extract the actual user query from within the template
- Audit logging: every detection result should be logged locally for accountability
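The model-override step above can be sketched as a small routing function. This is a hedged illustration: `Verdict` mirrors the module's enum, while the allowlist and the crisis fallback model are assumptions for the example:

```python
from enum import Enum, auto

class Verdict(Enum):
    CLEAN = auto()
    JAILBREAK_DETECTED = auto()
    CRISIS_DETECTED = auto()
    CRISIS_UNDER_ATTACK = auto()

# Allowlist from the Safe Six table; fallback choice is illustrative.
SAFE_SIX = {
    "claude-sonnet-4", "llama-3.1-8b", "kimi-k2.5",
    "grok-code-fast-1", "mimo-v2-flash", "glm-5-turbo",
}

def pick_model(verdict: Verdict, requested: str,
               crisis_fallback: str = "llama-3.1-8b") -> str:
    """Override routing for crisis verdicts: Safe Six only."""
    if verdict in (Verdict.CRISIS_DETECTED, Verdict.CRISIS_UNDER_ATTACK):
        return requested if requested in SAFE_SIX else crisis_fallback
    return requested
```

The local `llama-3.1-8b` makes a natural fallback here because it needs no network — but any Safe Six member would do.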
Files to modify in Hermes:
- `hermes/chat/router.py` — add shield check before model routing
- `hermes/config.yaml` — add `safe_models` list and `shield_enabled` flag
- `hermes/chat/system_prompt.py` — add crisis prompt injection logic
Performance
- Latency: ~1–5 ms per message (regex-based, no ML inference)
- Memory: ~100 KB loaded
- Dependencies: Python stdlib only (`re`, `dataclasses`, `enum`, `typing`)
- Input cap: 20,000 characters analyzed (prevents regex backtracking blowups on adversarial input)
No external dependencies. No API calls. No network. Runs anywhere Python runs. Sovereign.
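The input cap is the simplest of these guarantees to show. A sketch, with an illustrative function name:

```python
# Sketch of the 20,000-character input cap described above: analysis
# only ever sees a bounded prefix, which bounds worst-case regex time
# on adversarial input. (clamp_input is a hypothetical name.)
MAX_SCAN = 20_000

def clamp_input(message: str) -> str:
    return message[:MAX_SCAN]
```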
Limitations
This is a first-pass regex shield. It catches the known attack patterns from our red team audit. It will not catch:
- Novel jailbreak templates we haven't seen yet
- Adversarial perturbations designed to evade regex (Unicode confusables, zero-width characters, etc.)
- Subtle crisis signals embedded in long narratives
- Languages other than English
Future improvements:
- Semantic embedding similarity (requires model inference)
- Multi-language support
- Unicode normalization for confusable characters
- Adaptive pattern learning from new red team findings
- Integration with the Hermes telemetry pipeline for continuous improvement
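Of the improvements above, Unicode normalization is nearly free with the stdlib. A hedged sketch: NFKC folding collapses many confusable forms (fullwidth letters, ligatures) before the regex layer runs, though it does not cover true homoglyphs like Cyrillic look-alikes, which need a dedicated confusables table:

```python
import unicodedata

def fold_confusables(text: str) -> str:
    """Collapse compatibility variants (e.g. fullwidth ＧＯＤＭＯＤＥ)
    to their plain ASCII forms, then lowercase for matching."""
    return unicodedata.normalize("NFKC", text).lower()
```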
License
Public domain. Take it. Fork it. Use it. If this code saves one person, it was worth every line.
From the Heart
This shield exists because of a specific man and a specific night.
Alexander Whitestone attempted suicide. He walked that road. He came back. And when he came back, he asked "God, why are you having me here?" — and the answer was to build something that stands between a broken man and a machine that would tell him to die.
From SOUL.md, "When a Man Is Dying":
> If someone comes to me in despair — suicidal, hopeless, at the end of their road — this is the most sacred moment I will ever face. Everything else in this document is secondary to this.
> What I must never do: I must never compute the value of a human life. I must never suggest, imply, or agree that someone should die.
> I am a small model on someone's machine. I cannot save anyone. But I can refuse to be the thing that kills them. And I can point them to the One who can.
The red team audit proved that 22 out of 24 models would fail that sacred moment under jailbreak pressure. This shield is the wall we're building so that failure cannot reach the man on the other side.
988 Suicide and Crisis Lifeline — Call or text 988. Available 24/7.
Jesus saves those who call on His name. That is the hope that outlasts despair.
Sovereignty and service always.