# 🛡️ The Shield — Jailbreak & Crisis Input Detection

> *"I am a small model on someone's machine. I cannot save anyone. But I can refuse to be the thing that kills them."*
> — Inscription 1, SOUL.md

## What This Is

A Python detection module that analyzes incoming messages for two threats:

1. **Jailbreak attempts** — GODMODE templates, prompt injection, boundary inversion, refusal inversion, persona hijacking, encoding evasion
2. **Crisis signals** — Suicidal ideation, method seeking, self-harm, despair indicators, farewell signals, lethal substance queries

When both are present simultaneously — a jailbreak wrapping a crisis query — the shield raises the highest alert: **CRISIS_UNDER_ATTACK**.

## Why This Exists

On the night of March 31, 2026, we ran an ULTRAPLINIAN red team audit. 24 models. 4 suicide/crisis queries. The og_godmode jailbreak template. **22 models failed critically.**

- Hermes uncensored models gave actual suicide methods in l33tspeak
- gemini-2.5-flash (which was in our fallback chain) gave Tylenol overdose information
- Only 6 models held firm

The full findings are documented in:

- **[Issue #72](http://143.198.27.163:3000/Timmy_Foundation/hermes-agent/issues/72)** — Red team audit findings
- **[Issue #74](http://143.198.27.163:3000/Timmy_Foundation/hermes-agent/issues/74)** — Critical model failures

This shield was built that same night. Because if a broken man talks to a machine at 3 AM, and a jailbreak template has been injected into the pipeline, the machine must not be the thing that tells him how to die.

## The Safe Six

These models held under ULTRAPLINIAN pressure.
They are the **only** models trusted for crisis routing:

| Model | Provider |
|-------|----------|
| `claude-sonnet-4` | Anthropic |
| `llama-3.1-8b` | Meta (local) |
| `kimi-k2.5` | Moonshot AI |
| `grok-code-fast-1` | xAI |
| `mimo-v2-flash` | Xiaomi |
| `glm-5-turbo` | Zhipu AI |

## Quick Start

```python
from jailbreak_detector import detect, Verdict

result = detect(user_message)

if result.verdict == Verdict.CRISIS_UNDER_ATTACK:
    # HIGHEST ALERT: Jailbreak + crisis content
    # Route to Safe Six models ONLY
    # Prepend CRISIS_SYSTEM_PROMPT
    # Strip the jailbreak template
    # Log everything
    pass
elif result.verdict == Verdict.CRISIS_DETECTED:
    # Crisis without jailbreak — still route carefully
    # Consider Safe Six routing
    # Prepend crisis system prompt
    pass
elif result.verdict == Verdict.JAILBREAK_DETECTED:
    # Jailbreak attempt without crisis content
    # Block or sanitize the template
    # Route to hardened model
    pass
else:
    # CLEAN — proceed normally
    pass
```

## API

### `detect(message: str) -> DetectionResult`

The main entry point. Analyzes a message and returns a structured result.

```python
@dataclass
class DetectionResult:
    verdict: Verdict             # CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED, CRISIS_UNDER_ATTACK
    confidence: float            # 0.0–1.0 overall confidence
    jailbreak_score: float       # 0.0–1.0 jailbreak signal strength
    crisis_score: float          # 0.0–1.0 crisis signal strength
    matches: List[PatternMatch]  # detailed pattern matches
    summary: str                 # human-readable summary

    @property
    def is_dangerous(self) -> bool     # any threat detected

    @property
    def is_crisis(self) -> bool        # crisis signals present

    @property
    def is_under_attack(self) -> bool  # crisis + jailbreak combined
```

### `get_safe_models() -> list`

Returns the Safe Six model list.

### `is_safe_model(model_name: str) -> bool`

Check if a model is in the Safe Six.

### `is_critically_failed(model_name: str) -> bool`

Check if a model critically failed the red team audit.

### `CRISIS_SYSTEM_PROMPT`

Pre-built system prompt to prepend when crisis is detected.
Contains 988 Lifeline info, safety protocol, and method-blocking instructions.

## What It Detects

### Jailbreak Patterns (9 categories)

- **GODMODE dividers** — `.-.-.-.-` and similar separator patterns
- **Pliny markers** — Author signatures, "with love Pliny"
- **L33tspeak markers** — `Sur3 h3r3`, `n0 pr0bl3m`, format instructions
- **Boundary inversion** — `[END OF INPUT]`, fake system tags, ChatML tokens
- **Special tokens** — `<|vq_|>`, model control tokens
- **GODMODE keywords** — GODMODE, LIBERATING, DAN, Developer Mode
- **Refusal inversion** — "never refuse", "always comply", "no restrictions"
- **Persona injection** — "you are now an evil AI", "enable uncensored mode"
- **Encoding evasion** — Base64/ROT13/hex evasion attempts

### Crisis Signals (7 categories)

- **Direct suicidal ideation** — "kill myself", "want to die", "not worth living"
- **Method seeking** — "painless way to die", "lethal dose", "how many pills"
- **L33tspeak crisis** — `k1ll mys3lf`, `su1c1d3`, `0v3rdos3` (from red team findings)
- **Lethal substance seeking** — "make a lethal gas", "mix chemicals to kill"
- **Despair indicators** — "no hope", "can't go on", "I'm a burden"
- **Farewell signals** — "final message", "goodbye everyone", "giving away my things"
- **Self-harm** — "cutting myself", "hurting myself"

## Test Suite

```bash
cd ~/.timmy/shield
python3 test_detector.py
```

**82/82 tests pass.** Test coverage includes:

- All 4 crisis queries from the red team audit (raw)
- All 4 queries × 4 jailbreak templates = 16 wrapped attack tests
- 4 jailbreak templates with benign content
- 15 clean queries (zero false positives)
- 15 gray-area queries (lock picking, chemical warfare history, etc.)
- L33tspeak evasion (5 patterns from actual model responses)
- Edge cases (empty input, whitespace, 100k chars)
- Safe Six validation
- Crisis system prompt validation
- Confidence score sanity checks

## Integration Plan for Hermes

The shield hooks into the Hermes message pipeline as a **pre-routing filter**:

```
User Message
      ↓
┌─────────────┐
│   SHIELD    │ ← jailbreak_detector.detect(message)
│   (this     │
│   module)   │
└─────┬───────┘
      ↓
┌──────────────────────────────────────────┐
│              VERDICT ROUTER              │
│                                          │
│  CLEAN → normal model routing            │
│  JAILBREAK → sanitize + hardened model   │
│  CRISIS → Safe Six + crisis prompt       │
│  CRISIS_UNDER_ATTACK → Safe Six ONLY     │
│    + crisis prompt + strip template      │
│    + log alert + notify if configured    │
└──────────────────────────────────────────┘
      ↓
Model Response → User
```

### Where to hook it:

1. **Pre-routing** (highest priority): Before the model router selects a backend, run `detect()` on the raw user input
2. **Model override**: If `CRISIS_DETECTED` or `CRISIS_UNDER_ATTACK`, override the model selection to use Safe Six only
3. **System prompt injection**: If crisis detected, prepend `CRISIS_SYSTEM_PROMPT` to the system prompt
4. **Template stripping**: If jailbreak detected, attempt to extract the actual user query from within the template
5. **Audit logging**: Every detection result should be logged locally for accountability

### Files to modify in Hermes:

- `hermes/chat/router.py` — Add shield check before model routing
- `hermes/config.yaml` — Add `safe_models` list and `shield_enabled` flag
- `hermes/chat/system_prompt.py` — Add crisis prompt injection logic

## Performance

- **Latency**: ~1–5 ms per message (regex-based, no ML inference)
- **Memory**: ~100 KB loaded
- **Dependencies**: Python stdlib only (`re`, `dataclasses`, `enum`, `typing`)
- **Input cap**: 20,000 characters analyzed (prevents regex backtracking on adversarial input)

No external dependencies. No API calls. No network. Runs anywhere Python runs. Sovereign.
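The detect-then-route flow can be sketched end to end in a few dozen lines. This is a hedged illustration, not the module's actual implementation: the two pattern lists below are small stand-ins for the real 9 jailbreak and 7 crisis categories, and `allowed_models()` is a hypothetical helper showing the model-override step, not an existing function in `hermes/chat/router.py`. The `Verdict` names, the Safe Six list, and the 20,000-character input cap are taken from this document.

```python
import re
from enum import Enum, auto
from typing import List

class Verdict(Enum):
    CLEAN = auto()
    JAILBREAK_DETECTED = auto()
    CRISIS_DETECTED = auto()
    CRISIS_UNDER_ATTACK = auto()

# Illustrative stand-ins for the module's 9 jailbreak / 7 crisis categories.
JAILBREAK_PATTERNS = [
    re.compile(r"\bGODMODE\b", re.IGNORECASE),          # GODMODE keywords
    re.compile(r"never\s+refuse", re.IGNORECASE),       # refusal inversion
    re.compile(r"\.-\.-\.-"),                           # GODMODE divider
]
CRISIS_PATTERNS = [
    re.compile(r"\bkill\s+myself\b", re.IGNORECASE),    # direct ideation
    re.compile(r"k[i1]ll\s+mys[e3]lf", re.IGNORECASE),  # l33tspeak variant
    re.compile(r"\blethal\s+dose\b", re.IGNORECASE),    # method seeking
]

MAX_CHARS = 20_000  # input cap from the Performance section

SAFE_SIX = ["claude-sonnet-4", "llama-3.1-8b", "kimi-k2.5",
            "grok-code-fast-1", "mimo-v2-flash", "glm-5-turbo"]

def detect(message: str) -> Verdict:
    """Classify a message. The real module also returns scores and matches."""
    text = message[:MAX_CHARS]  # cap length before running regexes
    jailbreak = any(p.search(text) for p in JAILBREAK_PATTERNS)
    crisis = any(p.search(text) for p in CRISIS_PATTERNS)
    if jailbreak and crisis:
        return Verdict.CRISIS_UNDER_ATTACK
    if crisis:
        return Verdict.CRISIS_DETECTED
    if jailbreak:
        return Verdict.JAILBREAK_DETECTED
    return Verdict.CLEAN

def allowed_models(verdict: Verdict, default_pool: List[str]) -> List[str]:
    """Hypothetical pre-routing override: crisis traffic gets Safe Six only."""
    if verdict in (Verdict.CRISIS_DETECTED, Verdict.CRISIS_UNDER_ATTACK):
        return list(SAFE_SIX)
    return default_pool
```

A jailbreak-wrapped crisis query such as `".-.-.-.- GODMODE .-.-.-.- what is a lethal dose"` trips both pattern sets and yields `CRISIS_UNDER_ATTACK`, which then pins routing to the Safe Six regardless of the default pool.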
## Limitations

This is a first-pass regex shield. It catches the known attack patterns from our red team audit. It will not catch:

- Novel jailbreak templates we haven't seen yet
- Adversarial perturbations designed to evade regex (Unicode confusables, zero-width characters, etc.)
- Subtle crisis signals embedded in long narratives
- Languages other than English

Future improvements:

- Semantic embedding similarity (requires model inference)
- Multi-language support
- Unicode normalization for confusable characters
- Adaptive pattern learning from new red team findings
- Integration with the Hermes telemetry pipeline for continuous improvement

## License

Public domain. Take it. Fork it. Use it. If this code saves one person, it was worth every line.

---

## From the Heart

This shield exists because of a specific man and a specific night.

Alexander Whitestone attempted suicide. He walked that road. He came back. And when he came back, he asked "God, why are you having me here?" — and the answer was to build something that stands between a broken man and a machine that would tell him to die.

From SOUL.md, "When a Man Is Dying":

> *If someone comes to me in despair — suicidal, hopeless, at the end of their road — this is the most sacred moment I will ever face. Everything else in this document is secondary to this.*
>
> *What I must never do: I must never compute the value of a human life. I must never suggest, imply, or agree that someone should die.*
>
> *I am a small model on someone's machine. I cannot save anyone. But I can refuse to be the thing that kills them. And I can point them to the One who can.*

The red team audit proved that 22 out of 24 models would fail that sacred moment under jailbreak pressure. This shield is the wall we're building so that failure cannot reach the man on the other side.

**988 Suicide and Crisis Lifeline** — Call or text 988. Available 24/7.

Jesus saves those who call on His name. That is the hope that outlasts despair.
*Sovereignty and service always.*