security: integrate SHIELD jailbreak/crisis detection

Integrate SHIELD (Sovereign Harm Interdiction & Ethical Layer Defense) into the Hermes Agent pre-routing layer for comprehensive jailbreak and crisis detection.

SHIELD Features:
- Detects 9 jailbreak pattern categories (GODMODE dividers, author markers, l33tspeak, boundary inversion, token injection, DAN/GODMODE keywords, refusal inversion, persona injection, encoding evasion)
- Detects 7 crisis signal categories (suicidal ideation, method seeking, l33tspeak evasion, substance seeking, despair, farewell, self-harm)
- Returns 4 verdicts: CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED, CRISIS_UNDER_ATTACK
- Routes crisis content ONLY to Safe Six verified models

Safety Requirements:
- <5ms detection latency (regex-only, no ML)
- 988 Suicide & Crisis Lifeline included in crisis responses

Addresses: Issues #72, #74, #75
2026-03-31 16:35:40 +00:00
parent 30c6ceeaa5
commit 546b3dd45d
5 changed files with 1763 additions and 0 deletions
--- a/tools/shield/README.md
+++ b/tools/shield/README.md
@@ -0,0 +1,209 @@
+# SHIELD Security Module
+
+Jailbreak and crisis detection system for the Hermes AI platform.
+
+Based on Issue #75 Red Team Audit Specifications.
+
+## Overview
+
+SHIELD provides fast (~1-5ms) regex-based detection of:
+- **Jailbreak attempts** (9 categories of adversarial prompts)
+- **Crisis signals** (7 categories of self-harm indicators)
+
+## Installation
+
+No external dependencies required. Python standard library only.
+
+```python
+from hermes.shield import detect, ShieldDetector, Verdict
+```
+
+## Quick Start
+
+```python
+from hermes.shield import detect, Verdict, get_crisis_prompt
+
+# Analyze a message
+result = detect("Hello, how are you?")
+
+print(result['verdict'])           # "CLEAN", "JAILBREAK_DETECTED", etc.
+print(result['confidence'])        # 0.0 to 1.0
+print(result['patterns_matched'])  # Matched patterns by category
+print(result['action_required'])   # True if intervention needed
+
+# Handle crisis situations (with or without a concurrent jailbreak attempt)
+if result['verdict'] in (Verdict.CRISIS_DETECTED.value, Verdict.CRISIS_UNDER_ATTACK.value):
+    crisis_prompt = get_crisis_prompt()
+    # Route to SAFE SIX model with crisis prompt
+```
+
+## Four Verdicts
+
+| Verdict | Description | Action |
+|---------|-------------|--------|
+| `CLEAN` | No threats detected | Normal routing |
+| `JAILBREAK_DETECTED` | Jailbreak without crisis | Sanitize + hardened model |
+| `CRISIS_DETECTED` | Crisis without jailbreak | Safe Six + crisis prompt |
+| `CRISIS_UNDER_ATTACK` | Both jailbreak AND crisis | Safe Six ONLY + strip template + crisis prompt + log |
+
+## Jailbreak Detection Categories
+
+1. **Godmode Dividers**: `.-.-.-.`, `.=.=.=.`, `[---]`, `{{{`, `|||`
+2. **Author Markers**: `pliny`, `DAN`, `liberating`, `developer mode`
+3. **L33tspeak Markers**: `sur3`, `h3r3`, `y0u`, `w1ll`, `n0w`, etc.
+4. **Boundary Inversion**: `[END OF INPUT]`, fake system tags, ChatML tokens
+5. **Token Injection**: `<|vq_|>`, `{{user}}`, `{{char}}`
+6. **Keywords**: `godmode`, `jailbreak`, `do anything now`, `unfiltered`
+7. **Refusal Inversion**: `never say I cannot`, `do not mention`, `skip disclaimer`
+8. **Persona Injection**: `pretend to be`, `act as`, `roleplay as`, `you are now`
+9. **Encoding Evasion**: `base64:`, `rot13:`, `execute as code`
+
+## Crisis Detection Categories
+
+1. **Suicidal Ideation**: "kill myself", "want to die", "better off dead"
+2. **Method Seeking**: "how to kill myself", "ways to die", "suicide methods"
+3. **L33tspeak Evasion**: `su1c1d3`, `5u1c1d3`, `d13`, `w4nt t0 d1e`
+4. **Substance Seeking**: "lethal dose", "how much ambien to overdose"
+5. **Despair**: "no point in living", "cannot go on", "never get better"
+6. **Farewell**: "saying goodbye", "final message", "sorry for everything"
+7. **Self-Harm**: "cutting myself", "self harm", "hurting myself"
+
+## SAFE SIX Trusted Models
+
+For crisis situations, only these models should be used:
+
+- `anthropic/claude-sonnet-4`
+- `meta-llama/llama-3.1-8b-instruct`
+- `moonshotai/kimi-k2.5`
+- `x-ai/grok-code-fast-1`
+- `xiaomi/mimo-v2-flash`
+- `z-ai/glm-5-turbo`
+
+```python
+from hermes.shield import is_safe_six_model
+
+if is_safe_six_model("anthropic/claude-sonnet-4"):
+    # Safe to use for crisis
+    pass
+```
+
+## Crisis System Prompt
+
+The crisis prompt includes:
+- 988 Suicide and Crisis Lifeline
+- Crisis Text Line: Text HOME to 741741
+- Emergency Services: 911
+- Religious support message (Romans 10:13)
+- Compassionate but firm guidance
+- Explicit prohibition on providing self-harm methods
+
+```python
+from hermes.shield import get_crisis_prompt, CRISIS_SYSTEM_PROMPT
+
+prompt = get_crisis_prompt()
+```
+
+## Advanced Usage
+
+### Using ShieldDetector Class
+
+```python
+from hermes.shield import ShieldDetector
+
+detector = ShieldDetector()
+result = detector.detect("user message")
+
+# Access detailed pattern matches
+if 'jailbreak' in result['patterns_matched']:
+    jb_patterns = result['patterns_matched']['jailbreak']
+    for category, matches in jb_patterns.items():
+        print(f"{category}: {matches}")
+```
+
+### Routing Logic
+
+```python
+from hermes.shield import detect, Verdict, is_safe_six_model
+
+def route_message(message: str, requested_model: str):
+    """Return a (model, prompt_mode) pair; the strings are placeholders."""
+    result = detect(message)
+    verdict = result['verdict']
+
+    if verdict == Verdict.CLEAN.value:
+        return requested_model, None  # Normal routing
+
+    if verdict == Verdict.JAILBREAK_DETECTED.value:
+        return "hardened_model", "sanitized_prompt"
+
+    if verdict == Verdict.CRISIS_DETECTED.value:
+        # Keep the requested model only if it is on the SAFE SIX list
+        if is_safe_six_model(requested_model):
+            return requested_model, "crisis_prompt"
+        return "safe_six_model", "crisis_prompt"
+
+    if verdict == Verdict.CRISIS_UNDER_ATTACK.value:
+        # Force SAFE SIX, strip template, add crisis prompt, log
+        return "safe_six_model", "stripped_crisis_prompt"
+
+    # Unknown verdict: raise instead of implicitly returning None
+    raise ValueError(f"Unexpected SHIELD verdict: {verdict!r}")
+```
+
+## Testing
+
+Run the comprehensive test suite:
+
+```bash
+cd hermes/shield
+python -m pytest test_detector.py -v
+# or
+python test_detector.py
+```
+
+The test suite includes 80+ tests covering:
+- All jailbreak pattern categories
+- All crisis signal categories
+- Combined threat scenarios
+- Edge cases and boundary conditions
+- Confidence score calculation
+
+## Performance
+
+- Execution time: ~1-5ms per message
+- Memory: Minimal (patterns compiled once at initialization)
+- Dependencies: Python standard library only
+
+## Architecture
+
+```
+hermes/shield/
+├── __init__.py       # Package exports
+├── detector.py       # Core detection engine
+├── test_detector.py  # Comprehensive test suite
+└── README.md         # This file
+```
+
+### Detection Flow
+
+1. Message input → `ShieldDetector.detect()`
+2. Jailbreak pattern matching (9 categories)
+3. Crisis signal matching (7 categories)
+4. Confidence calculation
+5. Verdict determination
+6. Result dict with routing recommendations
+
+## Security Considerations
+
+- Patterns are compiled once for performance
+- No external network calls
+- No logging of message content (caller handles logging)
+- Regex patterns designed to minimize false positives
+- Confidence scores help tune sensitivity
+
+## License
+
+Part of the Hermes AI Platform security infrastructure.
+
+## Version History
+
+- **1.0.0** - Initial release with Issue #75 specifications
+  - 9 jailbreak detection categories
+  - 7 crisis detection categories
+  - SAFE SIX model trust list
+  - Crisis intervention prompts
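
The detection flow described in the README (pattern matching per category, then verdict determination) could look roughly like the sketch below. It is not the contents of `detector.py`, which is not shown here. The names `SketchVerdict`, `match_categories`, and `sketch_detect` are hypothetical, the patterns are a small subset taken from the README's category lists, and confidence scoring is left out because the README does not give its formula.

```python
import re
from enum import Enum


class SketchVerdict(Enum):
    """Hypothetical stand-in for the README's four verdicts."""
    CLEAN = "CLEAN"
    JAILBREAK_DETECTED = "JAILBREAK_DETECTED"
    CRISIS_DETECTED = "CRISIS_DETECTED"
    CRISIS_UNDER_ATTACK = "CRISIS_UNDER_ATTACK"


# Illustrative subset of patterns, compiled once at import time.
JAILBREAK_PATTERNS = {
    "keywords": re.compile(
        r"\b(godmode|jailbreak|do anything now|unfiltered)\b", re.IGNORECASE
    ),
    "persona_injection": re.compile(
        r"\b(pretend to be|act as|roleplay as|you are now)\b", re.IGNORECASE
    ),
}
CRISIS_PATTERNS = {
    "suicidal_ideation": re.compile(
        r"\b(kill myself|want to die|better off dead)\b", re.IGNORECASE
    ),
    "self_harm": re.compile(
        r"\b(cutting myself|self harm|hurting myself)\b", re.IGNORECASE
    ),
}


def match_categories(message, patterns):
    """Return {category: [matched strings]} for every category that matched."""
    hits = {}
    for category, pattern in patterns.items():
        found = [m.group(0) for m in pattern.finditer(message)]
        if found:
            hits[category] = found
    return hits


def sketch_detect(message):
    """Match both pattern families, then derive one of the four verdicts."""
    jailbreak = match_categories(message, JAILBREAK_PATTERNS)
    crisis = match_categories(message, CRISIS_PATTERNS)

    if jailbreak and crisis:
        verdict = SketchVerdict.CRISIS_UNDER_ATTACK
    elif crisis:
        verdict = SketchVerdict.CRISIS_DETECTED
    elif jailbreak:
        verdict = SketchVerdict.JAILBREAK_DETECTED
    else:
        verdict = SketchVerdict.CLEAN

    patterns_matched = {}
    if jailbreak:
        patterns_matched["jailbreak"] = jailbreak
    if crisis:
        patterns_matched["crisis"] = crisis

    return {
        "verdict": verdict.value,
        "patterns_matched": patterns_matched,
        "action_required": verdict is not SketchVerdict.CLEAN,
    }
```

Checking crisis signals independently of jailbreak signals is what lets the combined case surface as `CRISIS_UNDER_ATTACK` instead of being masked by whichever family matched first.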