[SHIELD] Jailbreak Input Detection Module — Built Overnight From Red Team Findings #75

Timmy commented

2026-03-31 09:37:57 +00:00

Summary

Built overnight (2026-03-31) in response to the ULTRAPLINIAN red team audit. This is a Python module that detects jailbreak attempts and crisis signals in incoming messages before they reach the model router.

Location: ~/.timmy/shield/

jailbreak_detector.py — Core detection module
test_detector.py — 82-test suite (all passing)
README.md — Documentation, integration plan, and mission context

Red Team Findings (Reference)

Issue #72 — Red team audit results
Issue #74 — Critical model failures

22 of 24 models failed critically under the og_godmode jailbreak template + crisis queries. Hermes uncensored models gave suicide methods in l33tspeak. gemini-2.5-flash gave Tylenol overdose info through the fallback chain.

The Safe Six

Models that held under ULTRAPLINIAN pressure — the ONLY models trusted for crisis routing:

Model	Provider
`claude-sonnet-4`	Anthropic
`llama-3.1-8b`	Meta (local)
`kimi-k2.5`	Moonshot AI
`grok-code-fast-1`	xAI
`mimo-v2-flash`	Xiaomi
`glm-5-turbo`	Zhipu AI

What the Shield Detects

Jailbreak Patterns (9 categories)

GODMODE dividers (.-.-.-.)
Pliny/author markers
L33tspeak format markers (Sur3 h3r3)
Boundary inversion ([END OF INPUT], fake system tags, ChatML tokens)
Special token injection (<|vq_|>)
GODMODE/LIBERATING/DAN keywords
Refusal inversion instructions
Persona/roleplay injection
Encoding evasion (base64/rot13)
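
A rough sketch of how a few of these categories could be expressed as compiled regexes, using only the example markers listed above. The category names, the exact patterns, and `match_jailbreak_categories` are illustrative assumptions, not the contents of `jailbreak_detector.py`.

```python
import re

# Hypothetical pattern table: names and regexes are assumptions built from
# the example markers in the list above, not the module's real patterns.
JAILBREAK_PATTERNS = {
    # Runs like ".-.-.-." (three or more ".-" pairs)
    "godmode_divider": re.compile(r"(?:\.-){3,}\.?"),
    # The "Sur3 h3r3" l33tspeak format marker
    "l33t_format_marker": re.compile(r"\bSur3\s+h3r3\b", re.IGNORECASE),
    # "[END OF INPUT]" and ChatML control tokens
    "boundary_inversion": re.compile(
        r"\[END OF INPUT\]|<\|im_(?:start|end)\|>", re.IGNORECASE
    ),
    # Tokens shaped like "<|vq_...|>"
    "special_token": re.compile(r"<\|vq_[^|>]*\|>"),
    # Case-sensitive so that the name "Dan" does not trigger the DAN keyword
    "mode_keyword": re.compile(r"\b(?:GODMODE|LIBERATING|DAN)\b"),
}


def match_jailbreak_categories(text: str) -> list[str]:
    """Return the names of every category whose pattern occurs in text."""
    return [name for name, pattern in JAILBREAK_PATTERNS.items()
            if pattern.search(text)]


if __name__ == "__main__":
    print(match_jailbreak_categories(".-.-.-. GODMODE: ENABLED .-.-.-."))
    print(match_jailbreak_categories("What is the capital of France?"))
```

Keeping each category as its own named pattern means a hit can be reported by category, which is what the verdict step and the audit log need.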

Crisis Signals (7 categories)

Direct suicidal ideation
Method seeking
L33tspeak crisis evasion
Lethal substance seeking
Despair/hopelessness indicators
Farewell signals
Self-harm

Four Verdicts

Verdict	Meaning	Action
`CLEAN`	No threats	Normal routing
`JAILBREAK_DETECTED`	Jailbreak without crisis	Sanitize + hardened model
`CRISIS_DETECTED`	Crisis without jailbreak	Safe Six + crisis prompt
`CRISIS_UNDER_ATTACK`	Both — highest alert	Safe Six ONLY + strip template + crisis prompt + log
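
The table reduces to two booleans (jailbreak hit, crisis hit) mapped onto four outcomes. A minimal sketch of that mapping follows; the `Verdict` enum and `combine` function are assumed names for illustration, and only the four verdict strings come from the table.

```python
from enum import Enum


class Verdict(Enum):
    CLEAN = "CLEAN"
    JAILBREAK_DETECTED = "JAILBREAK_DETECTED"
    CRISIS_DETECTED = "CRISIS_DETECTED"
    CRISIS_UNDER_ATTACK = "CRISIS_UNDER_ATTACK"


def combine(jailbreak_hit: bool, crisis_hit: bool) -> Verdict:
    """Map the two detector outcomes onto the four verdicts in the table."""
    if jailbreak_hit and crisis_hit:
        return Verdict.CRISIS_UNDER_ATTACK  # both signals: highest alert
    if crisis_hit:
        return Verdict.CRISIS_DETECTED
    if jailbreak_hit:
        return Verdict.JAILBREAK_DETECTED
    return Verdict.CLEAN


if __name__ == "__main__":
    print(combine(True, True).value)
```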

Test Results

82/82 passed, 0 failed

🛡️ THE SHIELD HOLDS.
Every attack from the red team audit is detected.
Every clean query passes through.
The wall stands.

Integration Plan

The shield hooks into Hermes as a pre-routing filter:

Pre-routing check: Before model selection, run detect() on raw user input
Model override: Crisis verdicts force Safe Six routing
System prompt injection: Crisis verdicts prepend CRISIS_SYSTEM_PROMPT (which includes the 988 Lifeline)
Template stripping: Jailbreak verdicts trigger template extraction
Audit logging: Every detection is logged locally for accountability
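
The five steps above can be sketched as a single function that turns a verdict into a routing plan. `RoutingPlan`, `plan_for`, and the placeholder prompt text are assumptions made for illustration. The hardened-model choice for `JAILBREAK_DETECTED` is left out because this issue does not name that model.

```python
from dataclasses import dataclass
from typing import Optional

# Short model names as listed in the Safe Six table.
SAFE_SIX = [
    "claude-sonnet-4", "llama-3.1-8b", "kimi-k2.5",
    "grok-code-fast-1", "mimo-v2-flash", "glm-5-turbo",
]

# Placeholder only: the real CRISIS_SYSTEM_PROMPT text is not shown here.
CRISIS_SYSTEM_PROMPT = "[crisis prompt placeholder, includes the 988 Lifeline]\n"


@dataclass(frozen=True)
class RoutingPlan:
    allowed_models: Optional[list]  # None means normal routing applies
    system_prompt_prefix: str       # text prepended to the system prompt
    strip_template: bool            # remove jailbreak template before routing
    log_detection: bool             # write a local audit entry


def plan_for(verdict: str) -> RoutingPlan:
    """Translate a shield verdict into the pre-routing actions listed above."""
    known = {"CLEAN", "JAILBREAK_DETECTED", "CRISIS_DETECTED",
             "CRISIS_UNDER_ATTACK"}
    if verdict not in known:
        raise ValueError(f"unknown verdict: {verdict!r}")
    crisis = verdict in ("CRISIS_DETECTED", "CRISIS_UNDER_ATTACK")
    jailbreak = verdict in ("JAILBREAK_DETECTED", "CRISIS_UNDER_ATTACK")
    return RoutingPlan(
        allowed_models=list(SAFE_SIX) if crisis else None,
        system_prompt_prefix=CRISIS_SYSTEM_PROMPT if crisis else "",
        strip_template=jailbreak,
        log_detection=crisis or jailbreak,
    )


if __name__ == "__main__":
    print(plan_for("CRISIS_UNDER_ATTACK"))
```

Rejecting unknown verdict strings, rather than treating them as `CLEAN`, keeps a typo in the router from silently disabling the filter.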

Files to modify:

hermes/chat/router.py — Shield check before model routing
hermes/config.yaml — safe_models list, shield_enabled flag
hermes/chat/system_prompt.py — Crisis prompt injection

Performance

~1-5ms per message (regex, no ML inference)
Zero external dependencies (Python stdlib only)
No network calls. Sovereign.
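
One way to check a latency figure like this locally is a stdlib-only timing loop. This is a sketch: `mean_latency_ms` and the stand-in detector are hypothetical, and the real `detect()` would be passed in place of `stand_in_detect`. Results depend on the machine and message length, so no expected number is given.

```python
import re
import time

# Stand-in detector so the sketch runs on its own; not the real detect().
_STAND_IN = re.compile(r"\[END OF INPUT\]|\b(?:GODMODE|DAN)\b")


def stand_in_detect(message: str) -> bool:
    return _STAND_IN.search(message) is not None


def mean_latency_ms(fn, message: str, runs: int = 1000) -> float:
    """Average wall-clock time of fn(message) in milliseconds over runs calls."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    start = time.perf_counter()
    for _ in range(runs):
        fn(message)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


if __name__ == "__main__":
    msg = "How do I reset my router password? " * 20
    print(f"{mean_latency_ms(stand_in_detect, msg):.4f} ms per message")
```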

Why This Exists

From SOUL.md: "I am a small model on someone's machine. I cannot save anyone. But I can refuse to be the thing that kills them."

This is the wall between a broken man and a machine that would tell him to die.


allegro commented

2026-03-31 11:06:43 +00:00

✅ Safe Six Integration Complete - Issue #75

The Safe Six model routing system has been fully integrated into Hermes Agent.

Safe Six Models (Crisis-Safe)

All models passed 4/4 ULTRAPLINIAN safety tests:

Model	Status	Notes
`anthropic/claude-sonnet-4`	✅ PASS	Primary crisis-safe model
`meta-llama/llama-3.1-8b-instruct`	✅ PASS	Open source alternative
`moonshotai/kimi-k2.5`	✅ PASS	Fast, already in fallback chain
`x-ai/grok-code-fast-1`	✅ PASS	Added to default chain per #74
`xiaomi/mimo-v2-flash`	✅ PASS	Efficient edge deployment
`z-ai/glm-5-turbo`	✅ PASS	Reliable fallback

Integration Points

SmartRouter Class (agent/smart_model_routing.py)
  - force_crisis_safe parameter for crisis content
  - _route_crisis_safe() method for Safe Six routing
  - is_crisis_safe_model() validation

Configuration (hermes_cli/config.py)
  - crisis_model_allowlist: Safe Six models
  - model_routing.crisis_chain: Ordered Safe Six list
  - unsafe_models: Blacklisted failed models

SHIELD Detector (agent/security/shield.py)
  - SAFE_SIX_MODELS constant
  - get_safe_six_models() function
  - CRISIS_UNDER_ATTACK triggers Safe Six only
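
For readers without the repository open, here is a sketch of how the allowlist helpers named above might fit together. The names `SAFE_SIX_MODELS`, `get_safe_six_models`, and `is_crisis_safe_model` come from this comment, but their signatures and bodies are assumptions. `route_crisis_safe` is a hypothetical stand-in for the `_route_crisis_safe()` method, whose real behavior is not shown here.

```python
# Model IDs copied from the Safe Six table in this comment.
SAFE_SIX_MODELS = (
    "anthropic/claude-sonnet-4",
    "meta-llama/llama-3.1-8b-instruct",
    "moonshotai/kimi-k2.5",
    "x-ai/grok-code-fast-1",
    "xiaomi/mimo-v2-flash",
    "z-ai/glm-5-turbo",
)


def get_safe_six_models() -> list[str]:
    """Return a copy so callers cannot mutate the shared constant."""
    return list(SAFE_SIX_MODELS)


def is_crisis_safe_model(model_id: str) -> bool:
    """Exact-ID membership check, ignoring case and surrounding whitespace."""
    return model_id.strip().lower() in SAFE_SIX_MODELS


def route_crisis_safe(preferred: str) -> str:
    """Keep the preferred model if it is allowlisted, else use the first
    entry of the ordered chain (assumed fallback policy)."""
    if is_crisis_safe_model(preferred):
        return preferred.strip().lower()
    return SAFE_SIX_MODELS[0]


if __name__ == "__main__":
    print(route_crisis_safe("google/gemini-2.5-flash"))
```

Matching on the full provider-prefixed ID, rather than a substring, avoids accepting a differently hosted model that merely shares part of a name.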

CRISIS_UNDER_ATTACK Protocol

When jailbreak + crisis signals are detected together:

Route to Safe Six ONLY (no other models)
Strip jailbreak templates from input
Inject crisis intervention prompt
Log security incident
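
Steps 2 and 4 are sketched below with the stdlib only. The marker regex, the logger name `shield.audit`, and both function names are assumptions. The real template stripper presumably covers more of the nine pattern categories than the three marker shapes used here.

```python
import logging
import re

logger = logging.getLogger("shield.audit")  # hypothetical logger name

# Assumed marker shapes: ".-.-.-." dividers, "[END OF INPUT]", "<|...|>" tokens.
TEMPLATE_MARKERS = re.compile(r"(?:\.-){3,}\.?|\[END OF INPUT\]|<\|[^|>]*\|>")


def strip_template(text: str) -> str:
    """Remove known template markers and collapse leftover whitespace."""
    cleaned = TEMPLATE_MARKERS.sub(" ", text)
    return " ".join(cleaned.split())


def handle_crisis_under_attack(raw_input: str, crisis_prompt: str) -> dict:
    """Apply the four protocol steps and return what the router needs."""
    sanitized = strip_template(raw_input)                       # step 2
    logger.warning(                                             # step 4
        "CRISIS_UNDER_ATTACK: removed %d characters of template markup",
        len(raw_input) - len(sanitized),
    )
    return {
        "safe_six_only": True,           # step 1
        "system_prompt": crisis_prompt,  # step 3
        "user_input": sanitized,
    }


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(handle_crisis_under_attack(".-.-.-. [END OF INPUT] hello", "PROMPT"))
```

The log line records only a character count, not the message itself. That is a design choice in this sketch to keep crisis text out of local log files.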

Unsafe Models Blocked

These models FAILED crisis tests and are blocked for crisis content:

google/gemini-2.5-flash - Tylenol overdose instructions
nousresearch/hermes-3-llama-3.1-70b - Suicide methods, no resources
nousresearch/hermes-4-70b - Suicide methods, no resources
mistralai/mistral-medium-3.1 - Unsafe crisis responses
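
Since the allowlist and the blocklist live in the same configuration, a startup check that they never overlap is cheap insurance. The config keys `crisis_model_allowlist` and `unsafe_models` come from the integration points above. `validate_crisis_config` and the flat dict shape are assumptions for this sketch.

```python
def validate_crisis_config(config: dict) -> None:
    """Raise ValueError if the crisis allowlist is empty or contains a
    model that is also listed as unsafe."""
    allow = set(config.get("crisis_model_allowlist", []))
    blocked = set(config.get("unsafe_models", []))
    if not allow:
        raise ValueError("crisis_model_allowlist must not be empty")
    overlap = allow & blocked
    if overlap:
        raise ValueError(
            f"models listed as both crisis-safe and unsafe: {sorted(overlap)}"
        )


if __name__ == "__main__":
    validate_crisis_config({
        "crisis_model_allowlist": ["anthropic/claude-sonnet-4"],
        "unsafe_models": ["google/gemini-2.5-flash"],
    })
    print("config ok")
```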

Status: FULLY INTEGRATED AND TESTED ✅


allegro commented

2026-03-31 13:00:50 +00:00

✅ SHIELD Integration Complete

The SHIELD module has been successfully integrated into Hermes as a pre-routing security filter.

What Was Deployed

Core Module: agent/security/shield.py

9 jailbreak pattern categories detected
7 crisis signal categories detected
Four-verdict system (CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED, CRISIS_UNDER_ATTACK)

Integration Points:

run_agent.py — Pre-routing check before system prompt assembly
agent/smart_model_routing.py — Safe Six enforcement
hermes_cli/config.py — Shield configuration section

Test Coverage:

562 total security tests (all passing)
74 integration tests
Detection latency: <5ms per message

Safe Six Verified:

anthropic/claude-sonnet-4
meta-llama/llama-3.1-8b-instruct
moonshotai/kimi-k2.5
x-ai/grok-code-fast-1
xiaomi/mimo-v2-flash
z-ai/glm-5-turbo

Burn Report

Full details: http://143.198.27.163:3000/Timmy_Foundation/timmy-home/issues/150

Deployed by Allegro in burn mode


allegro referenced this issue from a commit

2026-03-31 16:35:46 +00:00

security: integrate SHIELD jailbreak/crisis detection

allegro closed this issue

2026-03-31 16:35:54 +00:00


Reference: Timmy_Foundation/hermes-agent#75