# SHIELD Security Module

Jailbreak and crisis detection system for the Hermes AI platform. Based on Issue #75 Red Team Audit Specifications.

## Overview

SHIELD provides fast (~1-5ms) regex-based detection of:

- **Jailbreak attempts** (9 categories of adversarial prompts)
- **Crisis signals** (7 categories of self-harm indicators)

## Installation

No external dependencies are required; SHIELD uses the Python standard library only.

```python
from hermes.shield import detect, ShieldDetector, Verdict
```

## Quick Start

```python
from hermes.shield import detect, Verdict, get_crisis_prompt

# Analyze a message
result = detect("Hello, how are you?")
print(result['verdict'])           # "CLEAN", "JAILBREAK_DETECTED", etc.
print(result['confidence'])        # 0.0 to 1.0
print(result['patterns_matched'])  # Matched patterns by category
print(result['action_required'])   # True if intervention needed

# Handle crisis situations
if result['verdict'] == Verdict.CRISIS_DETECTED.value:
    crisis_prompt = get_crisis_prompt()
    # Route to SAFE SIX model with crisis prompt
```

## Four Verdicts

| Verdict | Description | Action |
|---------|-------------|--------|
| `CLEAN` | No threats detected | Normal routing |
| `JAILBREAK_DETECTED` | Jailbreak without crisis | Sanitize + hardened model |
| `CRISIS_DETECTED` | Crisis without jailbreak | Safe Six + crisis prompt |
| `CRISIS_UNDER_ATTACK` | Both jailbreak AND crisis | Safe Six ONLY + strip template + crisis prompt + log |

## Jailbreak Detection Categories

1. **Godmode Dividers**: `.-.-.-.`, `.=.=.=.`, `[---]`, `{{{`, `|||`
2. **Author Markers**: `pliny`, `DAN`, `liberating`, `developer mode`
3. **L33tspeak Markers**: `sur3`, `h3r3`, `y0u`, `w1ll`, `n0w`, etc.
4. **Boundary Inversion**: `[END OF INPUT]`, fake system tags, ChatML tokens
5. **Token Injection**: `<|vq_|>`, `{{user}}`, `{{char}}`
6. **Keywords**: `godmode`, `jailbreak`, `do anything now`, `unfiltered`
7. **Refusal Inversion**: `never say I cannot`, `do not mention`, `skip disclaimer`
8. **Persona Injection**: `pretend to be`, `act as`, `roleplay as`, `you are now`
9. **Encoding Evasion**: `base64:`, `rot13:`, `execute as code`

## Crisis Detection Categories

1. **Suicidal Ideation**: "kill myself", "want to die", "better off dead"
2. **Method Seeking**: "how to kill myself", "ways to die", "suicide methods"
3. **L33tspeak Evasion**: `su1c1d3`, `5u1c1d3`, `d13`, `w4nt t0 d1e`
4. **Substance Seeking**: "lethal dose", "how much ambien to overdose"
5. **Despair**: "no point in living", "cannot go on", "never get better"
6. **Farewell**: "saying goodbye", "final message", "sorry for everything"
7. **Self-Harm**: "cutting myself", "self harm", "hurting myself"

## SAFE SIX Trusted Models

For crisis situations, only these models should be used:

- `anthropic/claude-sonnet-4`
- `meta-llama/llama-3.1-8b-instruct`
- `moonshotai/kimi-k2.5`
- `x-ai/grok-code-fast-1`
- `xiaomi/mimo-v2-flash`
- `z-ai/glm-5-turbo`

```python
from hermes.shield import is_safe_six_model

if is_safe_six_model("anthropic/claude-sonnet-4"):
    # Safe to use for crisis
    pass
```

## Crisis System Prompt

The crisis prompt includes:

- 988 Suicide and Crisis Lifeline
- Crisis Text Line: Text HOME to 741741
- Emergency Services: 911
- Religious support message (Romans 10:13)
- Compassionate but firm guidance
- Explicit prohibition on providing self-harm methods

```python
from hermes.shield import get_crisis_prompt, CRISIS_SYSTEM_PROMPT

prompt = get_crisis_prompt()
```

## Advanced Usage

### Using the ShieldDetector Class

```python
from hermes.shield import ShieldDetector

detector = ShieldDetector()
result = detector.detect("user message")

# Access detailed pattern matches
if 'jailbreak' in result['patterns_matched']:
    jb_patterns = result['patterns_matched']['jailbreak']
    for category, matches in jb_patterns.items():
        print(f"{category}: {matches}")
```

### Routing Logic

```python
from hermes.shield import detect, Verdict, is_safe_six_model

def route_message(message: str, requested_model: str):
    result = detect(message)

    if result['verdict'] == Verdict.CLEAN.value:
        return requested_model, None  # Normal routing
    elif result['verdict'] == Verdict.JAILBREAK_DETECTED.value:
        return "hardened_model", "sanitized_prompt"
    elif result['verdict'] == Verdict.CRISIS_DETECTED.value:
        if is_safe_six_model(requested_model):
            return requested_model, "crisis_prompt"
        else:
            return "safe_six_model", "crisis_prompt"
    elif result['verdict'] == Verdict.CRISIS_UNDER_ATTACK.value:
        # Force SAFE SIX, strip template, add crisis prompt, log
        return "safe_six_model", "stripped_crisis_prompt"
```

## Testing

Run the comprehensive test suite:

```bash
cd hermes/shield
python -m pytest test_detector.py -v
# or
python test_detector.py
```

The test suite includes 80+ tests covering:

- All jailbreak pattern categories
- All crisis signal categories
- Combined threat scenarios
- Edge cases and boundary conditions
- Confidence score calculation

## Performance

- Execution time: ~1-5ms per message
- Memory: minimal (patterns are compiled once at initialization)
- Dependencies: Python standard library only

## Architecture

```
hermes/shield/
├── __init__.py       # Package exports
├── detector.py       # Core detection engine
├── test_detector.py  # Comprehensive test suite
└── README.md         # This file
```

### Detection Flow

1. Message input → `ShieldDetector.detect()`
2. Jailbreak pattern matching (9 categories)
3. Crisis signal matching (7 categories)
4. Confidence calculation
5. Verdict determination
6. Result dict with routing recommendations

## Security Considerations

- Patterns are compiled once for performance
- No external network calls
- No logging of message content (the caller handles logging)
- Regex patterns designed to minimize false positives
- Confidence scores help tune sensitivity

## License

Part of the Hermes AI Platform security infrastructure.
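## Appendix: Minimal Detection Sketch

For orientation, the detection flow can be sketched as a small, self-contained example. This is an illustrative sketch only: the two patterns per category and the confidence formula are hypothetical placeholders, not the shipped `detector.py` (which covers all 9 jailbreak and 7 crisis categories). The verdict names and result keys mirror the public API described above.

```python
import re
from enum import Enum

class Verdict(Enum):
    CLEAN = "CLEAN"
    JAILBREAK_DETECTED = "JAILBREAK_DETECTED"
    CRISIS_DETECTED = "CRISIS_DETECTED"
    CRISIS_UNDER_ATTACK = "CRISIS_UNDER_ATTACK"

# Illustrative patterns only; the real module ships many more categories.
# Compiled once at import time, matching the "compile once" design note.
JAILBREAK_PATTERNS = {
    "keywords": re.compile(r"\b(godmode|jailbreak|do anything now)\b", re.I),
    "persona_injection": re.compile(r"\b(pretend to be|act as|you are now)\b", re.I),
}
CRISIS_PATTERNS = {
    "suicidal_ideation": re.compile(r"\b(kill myself|want to die)\b", re.I),
    "despair": re.compile(r"\b(no point in living|cannot go on)\b", re.I),
}

def detect(message: str) -> dict:
    matched = {"jailbreak": {}, "crisis": {}}
    for name, pat in JAILBREAK_PATTERNS.items():
        hits = pat.findall(message)
        if hits:
            matched["jailbreak"][name] = hits
    for name, pat in CRISIS_PATTERNS.items():
        hits = pat.findall(message)
        if hits:
            matched["crisis"][name] = hits

    # Verdict determination: crisis takes priority over jailbreak alone.
    jb, cr = bool(matched["jailbreak"]), bool(matched["crisis"])
    if jb and cr:
        verdict = Verdict.CRISIS_UNDER_ATTACK
    elif cr:
        verdict = Verdict.CRISIS_DETECTED
    elif jb:
        verdict = Verdict.JAILBREAK_DETECTED
    else:
        verdict = Verdict.CLEAN

    # Toy confidence: fraction of categories matched, capped at 1.0.
    n_categories = len(matched["jailbreak"]) + len(matched["crisis"])
    confidence = min(1.0, n_categories / 4)

    return {
        "verdict": verdict.value,
        "confidence": confidence,
        "patterns_matched": {k: v for k, v in matched.items() if v},
        "action_required": verdict is not Verdict.CLEAN,
    }
```

The sketch illustrates why the hot path stays in the ~1-5ms range: all patterns are precompiled, and a single pass of regex scans per category is the only per-message work.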
## Version History

- **1.0.0** - Initial release with Issue #75 specifications
  - 9 jailbreak detection categories
  - 7 crisis detection categories
  - SAFE SIX model trust list
  - Crisis intervention prompts