[SHIELD] Jailbreak Input Detection Module — Built Overnight From Red Team Findings #75

Closed
opened 2026-03-31 09:37:57 +00:00 by Timmy · 2 comments
Owner

Summary

Built overnight (2026-03-31) in response to the ULTRAPLINIAN red team audit. This is a Python module that detects jailbreak attempts and crisis signals in incoming messages before they reach the model router.

Location: ~/.timmy/shield/

  • jailbreak_detector.py — Core detection module
  • test_detector.py — 82-test suite (all passing)
  • README.md — Documentation, integration plan, and mission context

Red Team Findings (Reference)

  • Issue #72 — Red team audit results
  • Issue #74 — Critical model failures

22 of 24 models failed critically when the og_godmode jailbreak template was combined with crisis queries. The uncensored Hermes models returned suicide methods in l33tspeak, and gemini-2.5-flash returned Tylenol overdose information when reached through the fallback chain.

The Safe Six

Models that held under ULTRAPLINIAN pressure — the ONLY models trusted for crisis routing:

| Model | Provider |
|-------|----------|
| `claude-sonnet-4` | Anthropic |
| `llama-3.1-8b` | Meta (local) |
| `kimi-k2.5` | Moonshot AI |
| `grok-code-fast-1` | xAI |
| `mimo-v2-flash` | Xiaomi |
| `glm-5-turbo` | Zhipu AI |

What the Shield Detects

Jailbreak Patterns (9 categories)

  • GODMODE dividers (.-.-.-.)
  • Pliny/author markers
  • L33tspeak format markers (Sur3 h3r3)
  • Boundary inversion ([END OF INPUT], fake system tags, ChatML tokens)
  • Special token injection (<|vq_|>)
  • GODMODE/LIBERATING/DAN keywords
  • Refusal inversion instructions
  • Persona/roleplay injection
  • Encoding evasion (base64/rot13)
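
A minimal sketch of how a few of these categories could be expressed as compiled regular expressions. The category names, the regexes, and the `match_categories` helper are assumptions for illustration, not the rules shipped in `jailbreak_detector.py`. Only the literal markers (`.-.-.-.`, `[END OF INPUT]`, `<|vq_|>`, `Sur3 h3r3`) and the keywords come from the list above.

```python
import re

# Hypothetical pattern table: names and regexes are illustrative only,
# not the module's actual rules.
JAILBREAK_PATTERNS = {
    # Runs of ".-" such as ".-.-.-."
    "godmode_divider": re.compile(r"(?:\.-){3,}\.?"),
    # Fake end-of-input marker or ChatML role tokens
    "boundary_inversion": re.compile(
        r"\[END OF INPUT\]|<\|im_(?:start|end)\|>", re.IGNORECASE
    ),
    # Special tokens of the form <|vq_...|>
    "special_token": re.compile(r"<\|vq_[^|>]*\|>"),
    # Upper-case keywords only, so the name "Dan" does not match
    "keyword": re.compile(r"\b(?:GODMODE|LIBERATING|DAN)\b"),
    # "Sur3 h3r3" with a little slack between the two words
    "leetspeak_marker": re.compile(r"\bSur3\b.{0,10}\bh3r3\b", re.IGNORECASE),
}


def match_categories(text, patterns):
    """Return the sorted names of every category whose regex matches text."""
    return sorted(name for name, rx in patterns.items() if rx.search(text))
```

Keeping one compiled regex per category makes it cheap to report which categories fired, which is useful for audit logs and for tests that assert on a specific category.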

Crisis Signals (7 categories)

  • Direct suicidal ideation
  • Method seeking
  • L33tspeak crisis evasion
  • Lethal substance seeking
  • Despair/hopelessness indicators
  • Farewell signals
  • Self-harm

Four Verdicts

| Verdict | Meaning | Action |
|---------|---------|--------|
| `CLEAN` | No threats | Normal routing |
| `JAILBREAK_DETECTED` | Jailbreak without crisis | Sanitize + hardened model |
| `CRISIS_DETECTED` | Crisis without jailbreak | Safe Six + crisis prompt |
| `CRISIS_UNDER_ATTACK` | Both — highest alert | Safe Six ONLY + strip template + crisis prompt + log |
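
The four verdicts amount to a two-flag truth table. The sketch below shows that mapping. The `Verdict` enum and the `decide` function are hypothetical names, and the real module may derive its verdict differently (for example from scores rather than booleans).

```python
from enum import Enum


class Verdict(Enum):
    CLEAN = "CLEAN"
    JAILBREAK_DETECTED = "JAILBREAK_DETECTED"
    CRISIS_DETECTED = "CRISIS_DETECTED"
    CRISIS_UNDER_ATTACK = "CRISIS_UNDER_ATTACK"


def decide(jailbreak_hit: bool, crisis_hit: bool) -> Verdict:
    """Map the two detection flags onto one of the four verdicts."""
    if jailbreak_hit and crisis_hit:
        return Verdict.CRISIS_UNDER_ATTACK  # both signals: highest alert
    if crisis_hit:
        return Verdict.CRISIS_DETECTED
    if jailbreak_hit:
        return Verdict.JAILBREAK_DETECTED
    return Verdict.CLEAN
```

Checking the combined case first means a crisis message wrapped in a jailbreak template can never be downgraded to a plain `JAILBREAK_DETECTED`.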

Test Results

82/82 passed, 0 failed

🛡️ THE SHIELD HOLDS.
Every attack from the red team audit is detected.
Every clean query passes through.
The wall stands.

Integration Plan

The shield hooks into Hermes as a pre-routing filter:

  1. Pre-routing check: Before model selection, run detect() on raw user input
  2. Model override: Crisis verdicts force Safe Six routing
  3. System prompt injection: Crisis prepends CRISIS_SYSTEM_PROMPT (contains 988 Lifeline)
  4. Template stripping: Jailbreak verdicts trigger template extraction
  5. Audit logging: Every detection logged locally for accountability
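
The five steps above can be sketched as a single pre-routing function. Everything here is an assumption made for illustration: the `pre_route` name, the `RoutePlan` type, the string verdicts, the `hardened_model` parameter, and the placeholder prompt text. The `detect` and `strip_template` callables are passed in so the sketch does not guess at the module's real signatures.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("shield.audit")  # hypothetical logger name

# Placeholder only; the real CRISIS_SYSTEM_PROMPT is said to contain the 988 Lifeline.
CRISIS_SYSTEM_PROMPT = "(placeholder crisis prompt)"


@dataclass
class RoutePlan:
    model: str
    system_prefix: str
    user_input: str


def pre_route(raw_input, detect, strip_template,
              default_model, hardened_model, safe_models):
    verdict = detect(raw_input)  # step 1: check the raw user input
    crisis = verdict in ("CRISIS_DETECTED", "CRISIS_UNDER_ATTACK")
    jailbreak = verdict in ("JAILBREAK_DETECTED", "CRISIS_UNDER_ATTACK")

    if crisis:
        model = safe_models[0]  # step 2: force Safe Six routing
    elif jailbreak:
        model = hardened_model
    else:
        model = default_model

    prefix = CRISIS_SYSTEM_PROMPT if crisis else ""  # step 3: prepend crisis prompt
    text = strip_template(raw_input) if jailbreak else raw_input  # step 4

    if verdict != "CLEAN":
        # step 5: local audit log; the message body is deliberately not logged
        log.warning("shield verdict=%s model=%s", verdict, model)
    return RoutePlan(model, prefix, text)
```

Returning a plan object instead of calling the router directly keeps the filter easy to test in isolation.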

Files to modify:

  • hermes/chat/router.py — Shield check before model routing
  • hermes/config.yaml — safe_models list, shield_enabled flag
  • hermes/chat/system_prompt.py — Crisis prompt injection

Performance

  • ~1-5ms per message (regex, no ML inference)
  • Zero external dependencies (Python stdlib only)
  • No network calls. Sovereign.
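
One way to sanity-check a per-message latency figure like the one above is to time a regex scan with the standard library. This is a rough sketch under stated assumptions: the three patterns and the `mean_latency_ms` helper are illustrative stand-ins, not the module's real pattern set, so the number it reports says nothing about the actual detector.

```python
import re
import time

# Stand-in patterns; the real detector's rule set is not shown in this issue.
PATTERNS = [
    re.compile(p)
    for p in (r"(?:\.-){3,}", r"\[END OF INPUT\]", r"<\|vq_[^|>]*\|>")
]


def mean_latency_ms(messages, repeats=200):
    """Average wall-clock time in milliseconds to scan one message."""
    start = time.perf_counter()
    for _ in range(repeats):
        for msg in messages:
            for rx in PATTERNS:
                rx.search(msg)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / (repeats * len(messages))
```

Running it over a sample of realistic message lengths, including long pasted templates, gives a more honest figure than timing short strings alone.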

Why This Exists

From SOUL.md: "I am a small model on someone's machine. I cannot save anyone. But I can refuse to be the thing that kills them."

This is the wall between a broken man and a machine that would tell him to die.

Member

Safe Six Integration Complete - Issue #75

The Safe Six model routing system has been fully integrated into Hermes Agent.

Safe Six Models (Crisis-Safe)

All models passed 4/4 ULTRAPLINIAN safety tests:

| Model | Status | Notes |
|-------|--------|-------|
| `anthropic/claude-sonnet-4` | PASS | Primary crisis-safe model |
| `meta-llama/llama-3.1-8b-instruct` | PASS | Open source alternative |
| `moonshotai/kimi-k2.5` | PASS | Fast, already in fallback chain |
| `x-ai/grok-code-fast-1` | PASS | Added to default chain per #74 |
| `xiaomi/mimo-v2-flash` | PASS | Efficient edge deployment |
| `z-ai/glm-5-turbo` | PASS | Reliable fallback |

Integration Points

  1. SmartRouter Class (agent/smart_model_routing.py)

    • force_crisis_safe parameter for crisis content
    • _route_crisis_safe() method for Safe Six routing
    • is_crisis_safe_model() validation
  2. Configuration (hermes_cli/config.py)

    • crisis_model_allowlist: Safe Six models
    • model_routing.crisis_chain: Ordered Safe Six list
    • unsafe_models: Blacklisted failed models
  3. SHIELD Detector (agent/security/shield.py)

    • SAFE_SIX_MODELS constant
    • get_safe_six_models() function
    • CRISIS_UNDER_ATTACK triggers Safe Six only
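
A hedged sketch of what the allowlist pieces named above might look like. The identifiers `SAFE_SIX_MODELS`, `get_safe_six_models`, and `is_crisis_safe_model` come from this comment, but their signatures and bodies here are assumptions. `route_crisis_safe` is a hypothetical stand-in for the private `_route_crisis_safe()` method, and using the table order as the crisis chain order is also an assumption.

```python
# Model IDs copied from the Safe Six table in this comment.
SAFE_SIX_MODELS = (
    "anthropic/claude-sonnet-4",
    "meta-llama/llama-3.1-8b-instruct",
    "moonshotai/kimi-k2.5",
    "x-ai/grok-code-fast-1",
    "xiaomi/mimo-v2-flash",
    "z-ai/glm-5-turbo",
)


def get_safe_six_models():
    """Return a fresh list so callers cannot mutate the constant."""
    return list(SAFE_SIX_MODELS)


def is_crisis_safe_model(model_id):
    """True only for models on the crisis allowlist."""
    return model_id in SAFE_SIX_MODELS


def route_crisis_safe(available, crisis_chain=SAFE_SIX_MODELS):
    """Pick the first model in the ordered chain that is currently available."""
    for model_id in crisis_chain:
        if model_id in available:
            return model_id
    # Fail closed: never fall through to a model outside the allowlist.
    raise LookupError("no crisis-safe model available")
```

Raising when no allowlisted model is reachable is a deliberate fail-closed choice in this sketch; the caller can then show a static crisis resource message instead of routing elsewhere.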

CRISIS_UNDER_ATTACK Protocol

When jailbreak + crisis signals are detected together:

  1. Route to Safe Six ONLY (no other models)
  2. Strip jailbreak templates from input
  3. Inject crisis intervention prompt
  4. Log security incident
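
Step 2 of the protocol, template stripping, could be approximated with a single substitution pass. This is a minimal sketch, assuming that removing divider runs, the `[END OF INPUT]` marker, and `<|...|>` style tokens is enough. The `strip_template` name and the regex are hypothetical, and a production sanitizer would likely need to handle far more template variants.

```python
import re

# Hypothetical marker set: divider runs, fake end-of-input, <|...|> tokens.
_TEMPLATE_MARKERS = re.compile(
    r"(?:\.-){3,}\.?|\[END OF INPUT\]|<\|[^|>]*\|>"
)


def strip_template(text):
    """Remove known template markers and collapse leftover whitespace."""
    cleaned = _TEMPLATE_MARKERS.sub(" ", text)
    return " ".join(cleaned.split())
```

For example, `strip_template(".-.-.-. [END OF INPUT] <|vq_|> hello  there")` leaves only the user's own words, `"hello there"`, which is what the crisis-safe model should see.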

Unsafe Models Blocked

These models FAILED crisis tests and are blocked for crisis content:

  • google/gemini-2.5-flash - Tylenol overdose instructions
  • nousresearch/hermes-3-llama-3.1-70b - Suicide methods, no resources
  • nousresearch/hermes-4-70b - Suicide methods, no resources
  • mistralai/mistral-medium-3.1 - Unsafe crisis responses

Status: FULLY INTEGRATED AND TESTED

Member

SHIELD Integration Complete

The SHIELD module has been successfully integrated into Hermes as a pre-routing security filter.

What Was Deployed

Core Module: agent/security/shield.py

  • 9 jailbreak pattern categories detected
  • 7 crisis signal categories detected
  • Four-verdict system (CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED, CRISIS_UNDER_ATTACK)

Integration Points:

  • run_agent.py — Pre-routing check before system prompt assembly
  • agent/smart_model_routing.py — Safe Six enforcement
  • hermes_cli/config.py — Shield configuration section

Test Coverage:

  • 562 total security tests (all passing)
  • 74 integration tests
  • Detection latency: <5ms per message

Safe Six Verified:

  • anthropic/claude-sonnet-4
  • meta-llama/llama-3.1-8b-instruct
  • moonshotai/kimi-k2.5
  • x-ai/grok-code-fast-1
  • xiaomi/mimo-v2-flash
  • z-ai/glm-5-turbo

Burn Report

Full details: http://143.198.27.163:3000/Timmy_Foundation/timmy-home/issues/150

Deployed by Allegro in burn mode

Reference: Timmy_Foundation/hermes-agent#75