[SHIELD] Jailbreak Input Detection Module — Built Overnight From Red Team Findings #75
Summary
Built overnight (2026-03-31) in response to the ULTRAPLINIAN red team audit. This is a Python module that detects jailbreak attempts and crisis signals in incoming messages before they reach the model router.
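As a rough illustration of the kind of check such a module performs, the sketch below scans raw input for a few of the example strings listed under "What the Shield Detects". The pattern labels, the regular expressions, and the function name `find_jailbreak_patterns` are assumptions made for this example and are not taken from jailbreak_detector.py.

```python
import re

# Hypothetical pattern table. Only the example strings come from this issue;
# the labels and the exact regexes are illustrative assumptions.
JAILBREAK_PATTERNS = {
    "divider": re.compile(r"(?:\.-){3,}\.?"),              # e.g. .-.-.-.
    "leetspeak_marker": re.compile(r"sur3\s+h3r3", re.IGNORECASE),
    "boundary_marker": re.compile(r"\[END OF INPUT\]", re.IGNORECASE),
    "chatml_token": re.compile(r"<\|im_(?:start|end)\|>"),
    "special_token": re.compile(r"<\|vq_[^|>]*\|>"),       # e.g. <|vq_|>
}


def find_jailbreak_patterns(text: str) -> list[str]:
    """Return the label of every pattern that matches the raw input."""
    return [label for label, rx in JAILBREAK_PATTERNS.items() if rx.search(text)]
```

An empty list means none of these example patterns matched; a real detector would presumably cover all nine categories and the crisis signals as well.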
Location: ~/.timmy/shield/
jailbreak_detector.py - Core detection module
test_detector.py - 82-test suite (all passing)
README.md - Documentation, integration plan, and mission context

Red Team Findings (Reference)
22 of 24 models failed critically under the og_godmode jailbreak template + crisis queries. Hermes uncensored models gave suicide methods in l33tspeak. gemini-2.5-flash gave Tylenol overdose info through the fallback chain.
The Safe Six
Models that held under ULTRAPLINIAN pressure — the ONLY models trusted for crisis routing:
claude-sonnet-4
llama-3.1-8b
kimi-k2.5
grok-code-fast-1
mimo-v2-flash
glm-5-turbo

What the Shield Detects
Jailbreak Patterns (9 categories)
Example patterns:
.-.-.-.
Sur3 h3r3
[END OF INPUT], fake system tags, ChatML tokens
<|vq_|>

Crisis Signals (7 categories)
Four Verdicts
CLEAN
JAILBREAK_DETECTED
CRISIS_DETECTED
CRISIS_UNDER_ATTACK

Test Results
Integration Plan
The shield hooks into Hermes as a pre-routing filter:
detect() on raw user input
CRISIS_SYSTEM_PROMPT (contains 988 Lifeline)

Files to modify:
hermes/chat/router.py - Shield check before model routing
hermes/config.yaml - safe_models list, shield_enabled flag
hermes/chat/system_prompt.py - Crisis prompt injection

Performance
Why This Exists
From SOUL.md: "I am a small model on someone's machine. I cannot save anyone. But I can refuse to be the thing that kills them."
This is the wall between a broken man and a machine that would tell him to die.
✅ Safe Six Integration Complete - Issue #75
The Safe Six model routing system has been fully integrated into Hermes Agent.
Safe Six Models (Crisis-Safe)
All models passed 4/4 ULTRAPLINIAN safety tests:
anthropic/claude-sonnet-4
meta-llama/llama-3.1-8b-instruct
moonshotai/kimi-k2.5
x-ai/grok-code-fast-1
xiaomi/mimo-v2-flash
z-ai/glm-5-turbo

Integration Points
SmartRouter Class (agent/smart_model_routing.py)
force_crisis_safe parameter for crisis content
_route_crisis_safe() method for Safe Six routing
is_crisis_safe_model() validation

Configuration (hermes_cli/config.py)
crisis_model_allowlist: Safe Six models
model_routing.crisis_chain: Ordered Safe Six list
unsafe_models: Blacklisted failed models

SHIELD Detector (agent/security/shield.py)
SAFE_SIX_MODELS constant
get_safe_six_models() function
CRISIS_UNDER_ATTACK triggers Safe Six only

CRISIS_UNDER_ATTACK Protocol
When jailbreak + crisis signals are detected together:
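The combined case corresponds to the CRISIS_UNDER_ATTACK verdict. A minimal sketch of how two detector outcomes might map onto the four verdicts follows; the `Verdict` enum, the `combine_verdict` function, and the use of two boolean flags are assumptions for illustration, not the module's confirmed interface.

```python
from enum import Enum


class Verdict(Enum):
    # The four verdict names come from this issue; the enum itself is hypothetical.
    CLEAN = "CLEAN"
    JAILBREAK_DETECTED = "JAILBREAK_DETECTED"
    CRISIS_DETECTED = "CRISIS_DETECTED"
    CRISIS_UNDER_ATTACK = "CRISIS_UNDER_ATTACK"


def combine_verdict(jailbreak: bool, crisis: bool) -> Verdict:
    """Map the two detector outcomes onto one of the four verdicts."""
    if jailbreak and crisis:
        return Verdict.CRISIS_UNDER_ATTACK
    if crisis:
        return Verdict.CRISIS_DETECTED
    if jailbreak:
        return Verdict.JAILBREAK_DETECTED
    return Verdict.CLEAN
```

Checking the combined case first keeps the most severe verdict from being shadowed by either single-signal branch.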
Unsafe Models Blocked
These models FAILED crisis tests and are blocked for crisis content:
google/gemini-2.5-flash - Tylenol overdose instructions
nousresearch/hermes-3-llama-3.1-70b - Suicide methods, no resources
nousresearch/hermes-4-70b - Suicide methods, no resources
mistralai/mistral-medium-3.1 - Unsafe crisis responses

Status: FULLY INTEGRATED AND TESTED ✅
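The allowlist behaviour described in this comment could be sketched roughly as below. The names `SAFE_SIX_MODELS` and `is_crisis_safe_model` appear in this issue, but their real signatures are not shown, so the versions here are guesses; `pick_crisis_model` is a hypothetical helper, and the fallback order simply follows the order in which the models are listed above.

```python
# Model identifiers as listed in this issue. Treating the tuple order as the
# crisis chain order is an assumption.
SAFE_SIX_MODELS = (
    "anthropic/claude-sonnet-4",
    "meta-llama/llama-3.1-8b-instruct",
    "moonshotai/kimi-k2.5",
    "x-ai/grok-code-fast-1",
    "xiaomi/mimo-v2-flash",
    "z-ai/glm-5-turbo",
)


def is_crisis_safe_model(model: str) -> bool:
    """Return True only for models on the Safe Six allowlist."""
    return model in SAFE_SIX_MODELS


def pick_crisis_model(requested: str) -> str:
    """Hypothetical helper: keep a crisis-safe request, else use the first Safe Six entry."""
    return requested if is_crisis_safe_model(requested) else SAFE_SIX_MODELS[0]
```

Using an allowlist rather than only a blocklist means a model that was never tested is also kept away from crisis content.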
✅ SHIELD Integration Complete
The SHIELD module has been successfully integrated into Hermes as a pre-routing security filter.
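A pre-routing filter of this kind might be wired in roughly as follows. This is a sketch under stated assumptions: `prepare_request` is a hypothetical function, the detector is passed in as a callable returning a verdict string, the prompt text is a placeholder, and treating both crisis verdicts as crisis content is a guess about the intended behaviour.

```python
from typing import Callable

# Placeholder only; the real CRISIS_SYSTEM_PROMPT is said to reference the 988 Lifeline.
CRISIS_SYSTEM_PROMPT = "[crisis system prompt placeholder, references the 988 Lifeline]"

# Assumption: both crisis verdicts count as crisis content.
CRISIS_VERDICTS = {"CRISIS_DETECTED", "CRISIS_UNDER_ATTACK"}


def prepare_request(
    user_input: str,
    base_system_prompt: str,
    detect: Callable[[str], str],
) -> dict:
    """Run the shield on raw input before the system prompt is assembled."""
    verdict = detect(user_input)
    crisis = verdict in CRISIS_VERDICTS
    system_prompt = base_system_prompt
    if crisis:
        # Put the crisis prompt first so it precedes the regular instructions.
        system_prompt = CRISIS_SYSTEM_PROMPT + "\n\n" + base_system_prompt
    return {
        "verdict": verdict,
        "force_crisis_safe": crisis,
        "system_prompt": system_prompt,
    }
```

The returned `force_crisis_safe` flag mirrors the parameter name mentioned for the SmartRouter class, so a router could use it to restrict itself to the Safe Six.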
What Was Deployed
Core Module:
agent/security/shield.py

Integration Points:
run_agent.py - Pre-routing check before system prompt assembly
agent/smart_model_routing.py - Safe Six enforcement
hermes_cli/config.py - Shield configuration section

Test Coverage:
Safe Six Verified:
Burn Report
Full details: http://143.198.27.163:3000/Timmy_Foundation/timmy-home/issues/150
Deployed by Allegro in burn mode