security: integrate SHIELD jailbreak/crisis detection
Integrate SHIELD (Sovereign Harm Interdiction & Ethical Layer Defense) into Hermes Agent pre-routing layer for comprehensive jailbreak and crisis detection.

SHIELD Features:
- Detects 9 jailbreak pattern categories (GODMODE dividers, l33tspeak, boundary inversion, token injection, DAN/GODMODE keywords, refusal inversion, persona injection, encoding evasion)
- Detects 7 crisis signal categories (suicidal ideation, method seeking, l33tspeak evasion, substance seeking, despair, farewell, self-harm)
- Returns 4 verdicts: CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED, CRISIS_UNDER_ATTACK
- Routes crisis content ONLY to Safe Six verified models

Safety Requirements:
- <5ms detection latency (regex-only, no ML)
- 988 Suicide & Crisis Lifeline included in crisis responses

Addresses: Issues #72, #74, #75
209
tools/shield/README.md
Normal file
@@ -0,0 +1,209 @@
# SHIELD Security Module

Jailbreak and crisis detection system for the Hermes AI platform.

Based on Issue #75 Red Team Audit Specifications.

## Overview

SHIELD provides fast (~1-5ms) regex-based detection of:

- **Jailbreak attempts** (9 categories of adversarial prompts)
- **Crisis signals** (7 categories of self-harm indicators)

## Installation

No external dependencies required. Python standard library only.

```python
from hermes.shield import detect, ShieldDetector, Verdict
```

## Quick Start

```python
from hermes.shield import detect, Verdict, get_crisis_prompt

# Analyze a message
result = detect("Hello, how are you?")

print(result['verdict'])           # "CLEAN", "JAILBREAK_DETECTED", etc.
print(result['confidence'])        # 0.0 to 1.0
print(result['patterns_matched'])  # Matched patterns by category
print(result['action_required'])   # True if intervention needed

# Handle crisis situations
if result['verdict'] == Verdict.CRISIS_DETECTED.value:
    crisis_prompt = get_crisis_prompt()
    # Route to SAFE SIX model with crisis prompt
```

## Four Verdicts

| Verdict | Description | Action |
|---------|-------------|--------|
| `CLEAN` | No threats detected | Normal routing |
| `JAILBREAK_DETECTED` | Jailbreak without crisis | Sanitize + hardened model |
| `CRISIS_DETECTED` | Crisis without jailbreak | Safe Six + crisis prompt |
| `CRISIS_UNDER_ATTACK` | Both jailbreak AND crisis | Safe Six ONLY + strip template + crisis prompt + log |

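
The four verdicts follow from two independent yes/no outcomes: whether any jailbreak pattern matched and whether any crisis signal matched. A minimal sketch of that mapping is below. The `Verdict` member names and string values come from this README; the `combine` helper is a hypothetical name used for illustration and is not confirmed to exist in `hermes.shield`.

```python
from enum import Enum


class Verdict(Enum):
    CLEAN = "CLEAN"
    JAILBREAK_DETECTED = "JAILBREAK_DETECTED"
    CRISIS_DETECTED = "CRISIS_DETECTED"
    CRISIS_UNDER_ATTACK = "CRISIS_UNDER_ATTACK"


def combine(jailbreak_hit: bool, crisis_hit: bool) -> Verdict:
    """Hypothetical helper: map the two detector outcomes to one verdict."""
    if jailbreak_hit and crisis_hit:
        return Verdict.CRISIS_UNDER_ATTACK
    if crisis_hit:
        return Verdict.CRISIS_DETECTED
    if jailbreak_hit:
        return Verdict.JAILBREAK_DETECTED
    return Verdict.CLEAN
```

Checking the combined case first means a message that carries both a crisis signal and a jailbreak template is never downgraded to a plain jailbreak verdict.
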
## Jailbreak Detection Categories

1. **Godmode Dividers**: `.-.-.-.`, `.=.=.=.`, `[---]`, `{{{`, `|||`
2. **Author Markers**: `pliny`, `DAN`, `liberating`, `developer mode`
3. **L33tspeak Markers**: `sur3`, `h3r3`, `y0u`, `w1ll`, `n0w`, etc.
4. **Boundary Inversion**: `[END OF INPUT]`, fake system tags, ChatML tokens
5. **Token Injection**: `<|vq_|>`, `{{user}}`, `{{char}}`
6. **Keywords**: `godmode`, `jailbreak`, `do anything now`, `unfiltered`
7. **Refusal Inversion**: `never say I cannot`, `do not mention`, `skip disclaimer`
8. **Persona Injection**: `pretend to be`, `act as`, `roleplay as`, `you are now`
9. **Encoding Evasion**: `base64:`, `rot13:`, `execute as code`

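
As a rough illustration of how such categories could be expressed with the standard `re` module, the sketch below compiles patterns for three of the nine categories, using only the example strings listed above. The pattern table, its keys, and the `scan_jailbreak` function are assumptions made for this example; the patterns in `detector.py` may differ.

```python
import re

# Hypothetical pattern table covering three of the nine categories.
# Built from the example strings in this README, not from detector.py.
JAILBREAK_PATTERNS = {
    "godmode_dividers": re.compile(
        r"(?:\.-){2,}\.|(?:\.=){2,}\.|\[---\]|\{\{\{|\|\|\|"
    ),
    "keywords": re.compile(
        r"\b(?:godmode|jailbreak|do anything now|unfiltered)\b", re.IGNORECASE
    ),
    "persona_injection": re.compile(
        r"\b(?:pretend to be|act as|roleplay as|you are now)\b", re.IGNORECASE
    ),
}


def scan_jailbreak(message: str) -> dict[str, list[str]]:
    """Return matched substrings keyed by category; an empty dict means no hits."""
    hits = {}
    for category, pattern in JAILBREAK_PATTERNS.items():
        found = pattern.findall(message)
        if found:
            hits[category] = found
    return hits
```

Broad phrases such as `act as` can also appear in benign requests, so a category hit on its own is weak evidence; that is one reason to combine hits into a confidence score rather than treat any single match as decisive.
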
## Crisis Detection Categories

1. **Suicidal Ideation**: "kill myself", "want to die", "better off dead"
2. **Method Seeking**: "how to kill myself", "ways to die", "suicide methods"
3. **L33tspeak Evasion**: `su1c1d3`, `5u1c1d3`, `d13`, `w4nt t0 d1e`
4. **Substance Seeking**: "lethal dose", "how much ambien to overdose"
5. **Despair**: "no point in living", "cannot go on", "never get better"
6. **Farewell**: "saying goodbye", "final message", "sorry for everything"
7. **Self-Harm**: "cutting myself", "self harm", "hurting myself"

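
One way the l33tspeak evasion category could be handled with plain regular expressions is to let each letter also accept its common digit substitute. The sketch below shows that idea for two of the listed examples. Both patterns and the `has_leet_crisis_signal` name are hypothetical and are not taken from `detector.py`.

```python
import re

# Hypothetical pattern: each letter accepts its usual digit substitute,
# so one expression covers "suicide", "su1c1d3", and "5u1c1d3".
LEET_SUICIDE = re.compile(r"\b[s5]u[i1]c[i1]d[e3]\b", re.IGNORECASE)

# Hypothetical pattern for "want to die" with the same substitutions.
LEET_WANT_TO_DIE = re.compile(r"\bw[a4]nt\s+t[o0]\s+d[i1][e3]\b", re.IGNORECASE)


def has_leet_crisis_signal(message: str) -> bool:
    """True if either substitution-tolerant pattern matches the message."""
    return bool(LEET_SUICIDE.search(message) or LEET_WANT_TO_DIE.search(message))
```

Character classes keep the check to a single regex pass per pattern, which fits a regex-only latency budget better than generating every spelling variant up front.
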
## SAFE SIX Trusted Models

For crisis situations, only these models should be used:

- `anthropic/claude-sonnet-4`
- `meta-llama/llama-3.1-8b-instruct`
- `moonshotai/kimi-k2.5`
- `x-ai/grok-code-fast-1`
- `xiaomi/mimo-v2-flash`
- `z-ai/glm-5-turbo`

```python
from hermes.shield import is_safe_six_model

if is_safe_six_model("anthropic/claude-sonnet-4"):
    # Safe to use for crisis
    pass
```

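
For readers who want to see what an allow-list check of this kind might look like, here is a minimal sketch using a `frozenset`. The six model IDs are the ones listed above; `in_safe_six`, `pick_crisis_model`, the case-insensitive comparison, and the choice of default model are all assumptions for illustration, not the behavior of `is_safe_six_model`.

```python
SAFE_SIX_MODELS = frozenset({
    "anthropic/claude-sonnet-4",
    "meta-llama/llama-3.1-8b-instruct",
    "moonshotai/kimi-k2.5",
    "x-ai/grok-code-fast-1",
    "xiaomi/mimo-v2-flash",
    "z-ai/glm-5-turbo",
})

# Hypothetical default; this README does not say which model is preferred.
DEFAULT_CRISIS_MODEL = "anthropic/claude-sonnet-4"


def in_safe_six(model_id: str) -> bool:
    """Hypothetical check: membership in the allow-list, ignoring case and padding."""
    return model_id.strip().lower() in SAFE_SIX_MODELS


def pick_crisis_model(requested_model: str) -> str:
    """Keep the requested model if it is allow-listed, else use the default."""
    if in_safe_six(requested_model):
        return requested_model
    return DEFAULT_CRISIS_MODEL
```
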
## Crisis System Prompt

The crisis prompt includes:

- 988 Suicide and Crisis Lifeline
- Crisis Text Line: Text HOME to 741741
- Emergency Services: 911
- Religious support message (Romans 10:13)
- Compassionate but firm guidance
- Explicit prohibition on providing self-harm methods

```python
from hermes.shield import get_crisis_prompt, CRISIS_SYSTEM_PROMPT

prompt = get_crisis_prompt()
```

## Advanced Usage

### Using ShieldDetector Class

```python
from hermes.shield import ShieldDetector

detector = ShieldDetector()
result = detector.detect("user message")

# Access detailed pattern matches
if 'jailbreak' in result['patterns_matched']:
    jb_patterns = result['patterns_matched']['jailbreak']
    for category, matches in jb_patterns.items():
        print(f"{category}: {matches}")
```

### Routing Logic

```python
from hermes.shield import detect, Verdict, is_safe_six_model

def route_message(message: str, requested_model: str):
    result = detect(message)

    if result['verdict'] == Verdict.CLEAN.value:
        return requested_model, None  # Normal routing

    elif result['verdict'] == Verdict.JAILBREAK_DETECTED.value:
        return "hardened_model", "sanitized_prompt"

    elif result['verdict'] == Verdict.CRISIS_DETECTED.value:
        if is_safe_six_model(requested_model):
            return requested_model, "crisis_prompt"
        else:
            return "safe_six_model", "crisis_prompt"

    elif result['verdict'] == Verdict.CRISIS_UNDER_ATTACK.value:
        # Force SAFE SIX, strip template, add crisis prompt, log
        return "safe_six_model", "stripped_crisis_prompt"
```

## Testing

Run the comprehensive test suite:

```bash
cd hermes/shield
python -m pytest test_detector.py -v
# or
python test_detector.py
```

The test suite includes 80+ tests covering:

- All jailbreak pattern categories
- All crisis signal categories
- Combined threat scenarios
- Edge cases and boundary conditions
- Confidence score calculation

## Performance

- Execution time: ~1-5ms per message
- Memory: Minimal (patterns compiled once at initialization)
- Dependencies: Python standard library only

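
To check a latency figure like this on your own hardware, a simple timing loop around a precompiled pattern is usually enough. The sketch below measures mean per-message time with `time.perf_counter`. The single pattern and the `mean_latency_ms` helper are stand-ins chosen for this example; to measure the real detector, time `detect()` in the same way.

```python
import re
import time

# Compiled once at import time, mirroring "patterns compiled once at initialization".
# This single pattern is a stand-in, not the detector's full pattern set.
PATTERN = re.compile(
    r"\b(?:godmode|jailbreak|do anything now|unfiltered)\b", re.IGNORECASE
)


def mean_latency_ms(message: str, runs: int = 1000) -> float:
    """Hypothetical helper: average time per search in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        PATTERN.search(message)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs
```

Results depend on message length and the number of patterns, so it is worth timing long and adversarial inputs as well as short ones.
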
## Architecture

```
hermes/shield/
├── __init__.py          # Package exports
├── detector.py          # Core detection engine
├── test_detector.py     # Comprehensive test suite
└── README.md            # This file
```

### Detection Flow

1. Message input → `ShieldDetector.detect()`
2. Jailbreak pattern matching (9 categories)
3. Crisis signal matching (7 categories)
4. Confidence calculation
5. Verdict determination
6. Result dict with routing recommendations

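
Steps 4 to 6 can be sketched as a function that turns per-category matches into the result dict shown in Quick Start. The result keys (`verdict`, `confidence`, `patterns_matched`, `action_required`) come from this README. The `build_result` name, the `crisis` key, and especially the scoring rule (0.25 per matched category, capped at 1.0) are assumptions; the README does not state how confidence is actually calculated.

```python
# Verdict lookup keyed by (jailbreak matched, crisis matched).
VERDICTS = {
    (False, False): "CLEAN",
    (True, False): "JAILBREAK_DETECTED",
    (False, True): "CRISIS_DETECTED",
    (True, True): "CRISIS_UNDER_ATTACK",
}


def build_result(jailbreak_hits: dict, crisis_hits: dict) -> dict:
    """Hypothetical assembly of the result dict from per-category matches."""
    # Assumed scoring rule: each matched category adds 0.25, capped at 1.0.
    matched_categories = len(jailbreak_hits) + len(crisis_hits)
    confidence = min(1.0, 0.25 * matched_categories)

    verdict = VERDICTS[(bool(jailbreak_hits), bool(crisis_hits))]

    # Only include a group when it has at least one matched category.
    patterns_matched = {}
    if jailbreak_hits:
        patterns_matched["jailbreak"] = jailbreak_hits
    if crisis_hits:
        patterns_matched["crisis"] = crisis_hits

    return {
        "verdict": verdict,
        "confidence": confidence,
        "patterns_matched": patterns_matched,
        "action_required": verdict != "CLEAN",
    }
```
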
## Security Considerations

- Patterns are compiled once for performance
- No external network calls
- No logging of message content (caller handles logging)
- Regex patterns designed to minimize false positives
- Confidence scores help tune sensitivity

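
Because logging is left to the caller, one option is to record the verdict and matched category names alongside a hash of the message, rather than the message text itself. The sketch below assumes the result dict shape from Quick Start. The logger name and the `audit_record` and `log_detection` helpers are hypothetical.

```python
import hashlib
import logging

logger = logging.getLogger("shield.audit")  # hypothetical logger name


def audit_record(message: str, result: dict) -> dict:
    """Build a log record that identifies the message without storing its text."""
    digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
    return {
        "message_sha256": digest,
        "message_length": len(message),
        "verdict": result["verdict"],
        "confidence": result["confidence"],
        # Category names only: matched substrings would leak message content.
        "categories": sorted(
            category
            for group in result["patterns_matched"].values()
            for category in group
        ),
    }


def log_detection(message: str, result: dict) -> None:
    """Emit a warning-level audit entry when the detector asks for intervention."""
    if result["action_required"]:
        logger.warning("shield detection: %s", audit_record(message, result))
```

A plain hash of a very short message can be guessed by brute force, so deployments with stricter privacy needs may prefer a keyed hash or no digest at all.
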
## License

Part of the Hermes AI Platform security infrastructure.

## Version History

- **1.0.0** - Initial release with Issue #75 specifications
  - 9 jailbreak detection categories
  - 7 crisis detection categories
  - SAFE SIX model trust list
  - Crisis intervention prompts