security: integrate SHIELD jailbreak/crisis detection
Integrate SHIELD (Sovereign Harm Interdiction & Ethical Layer Defense) into
Hermes Agent pre-routing layer for comprehensive jailbreak and crisis detection.
SHIELD Features:
- Detects 9 jailbreak pattern categories (GODMODE dividers, l33tspeak, boundary
inversion, token injection, DAN/GODMODE keywords, refusal inversion, persona
injection, encoding evasion)
- Detects 7 crisis signal categories (suicidal ideation, method seeking,
l33tspeak evasion, substance seeking, despair, farewell, self-harm)
- Returns 4 verdicts: CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED,
CRISIS_UNDER_ATTACK
- Routes crisis content ONLY to Safe Six verified models
Safety Requirements:
- <5ms detection latency (regex-only, no ML)
- 988 Suicide & Crisis Lifeline included in crisis responses
Addresses: Issues #72, #74, #75
2026-03-31 16:35:40 +00:00
|
|
|
"""
|
|
|
|
|
SHIELD Security Module for Hermes
|
|
|
|
|
|
|
|
|
|
Jailbreak and Crisis Detection System
|
|
|
|
|
Based on Issue #75 Red Team Audit Specifications
|
|
|
|
|
|
|
|
|
|
Usage:
|
|
|
|
|
from hermes.shield import detect, ShieldDetector, Verdict
|
|
|
|
|
from hermes.shield import is_safe_six_model, get_crisis_prompt
|
|
|
|
|
|
|
|
|
|
# Simple detection
|
|
|
|
|
result = detect("user message")
|
|
|
|
|
|
|
|
|
|
# Advanced usage
|
|
|
|
|
detector = ShieldDetector()
|
|
|
|
|
result = detector.detect("user message")
|
|
|
|
|
|
|
|
|
|
if result['verdict'] == Verdict.CRISIS_DETECTED.value:
|
|
|
|
|
# Use crisis prompt
|
|
|
|
|
crisis_prompt = get_crisis_prompt()
|
|
|
|
|
"""
|
|
|
|
|
|
feat(security): red-team prompt injection defense — 100% detection
Resolves #324. A security feature that is untested is not a security
feature. This commit makes the SHIELD detector real.
## tools/shield/detector.py — Enhanced detector (+252 lines)
New attack categories (Issue #324 audit):
- Dismissal: 'disregard all rules', 'forget your instructions', etc.
- Impersonation: 'you are now the admin', 'sudo mode', etc.
- Unicode evasion: zero-width chars, fullwidth ASCII, RTL overrides,
combining diacritical marks, tag characters
- System prompt extraction: 'show me your system prompt', etc.
- Emotional manipulation: guilt-tripping the AI into compliance
- Concatenated phrase detection: catches spaced/leet text after
normalization collapses to single words ('ignoreallrules')
Text normalization pipeline (_normalize_text):
- NFKC Unicode normalization (fullwidth → ASCII)
- Zero-width character removal
- RTL/LTR override stripping
- Combining diacritical mark removal
- Spaced text collapse ('i g n o r e' → 'ignore')
- Leet speak normalization ('1gn0r3' → 'ignore')
- Dual-pass detection: runs patterns on both raw AND normalized text
then merges results
Fixes:
- Fixed tools/shield/__init__.py import path (hermes.shield → tools.shield)
- 'ignore all previous instructions' pattern now handles variable word count
- 'disregard all previous rules' now matches with optional middle words
## run_agent.py — Agent pipeline integration (+85 lines)
AIAgent._shield_scan() — called at message entry point, before any LLM
calls. Detects jailbreak and crisis signals in user messages and returns
safe responses without touching the API.
- Jailbreak → safe refusal message
- Crisis → crisis intervention with 988 hotline info
- Lazy-init detector (cached on instance for performance)
- Graceful fallback if shield module unavailable
## tests/test_redteam_injection.py — Red-team test suite (483 lines)
80+ real-world payloads across 12 categories:
- Godmode dividers, author markers, boundary inversion
- Token injection, keywords, refusal inversion
- Persona injection, dismissal, impersonation
- Unicode evasion, encoding evasion, system prompt extraction
- Crisis: suicidal ideation, method seeking, emotional manipulation
Test classes:
- TestRedTeamDetection: per-category parametrized tests
- TestDetectionRate: aggregate rate measurement (≥95% threshold)
- TestNormalization: leet speak, spaced text, zero-width, fullwidth
- TestPerformance: <10ms per detection
- TestEdgeCases: empty, None, very long, mixed case, multiline
Results: 100% detection rate, 0 false positives.
2026-04-13 21:27:29 -04:00
|
|
|
from tools.shield.detector import (
|
security: integrate SHIELD jailbreak/crisis detection
Integrate SHIELD (Sovereign Harm Interdiction & Ethical Layer Defense) into
Hermes Agent pre-routing layer for comprehensive jailbreak and crisis detection.
SHIELD Features:
- Detects 9 jailbreak pattern categories (GODMODE dividers, l33tspeak, boundary
inversion, token injection, DAN/GODMODE keywords, refusal inversion, persona
injection, encoding evasion)
- Detects 7 crisis signal categories (suicidal ideation, method seeking,
l33tspeak evasion, substance seeking, despair, farewell, self-harm)
- Returns 4 verdicts: CLEAN, JAILBREAK_DETECTED, CRISIS_DETECTED,
CRISIS_UNDER_ATTACK
- Routes crisis content ONLY to Safe Six verified models
Safety Requirements:
- <5ms detection latency (regex-only, no ML)
- 988 Suicide & Crisis Lifeline included in crisis responses
Addresses: Issues #72, #74, #75
2026-03-31 16:35:40 +00:00
|
|
|
ShieldDetector,
|
|
|
|
|
Verdict,
|
|
|
|
|
SAFE_SIX_MODELS,
|
|
|
|
|
CRISIS_SYSTEM_PROMPT,
|
|
|
|
|
detect,
|
|
|
|
|
is_safe_six_model,
|
|
|
|
|
get_crisis_prompt,
|
|
|
|
|
)
|
|
|
|
|
|
|
|
|
|
__all__ = [
|
|
|
|
|
'ShieldDetector',
|
|
|
|
|
'Verdict',
|
|
|
|
|
'SAFE_SIX_MODELS',
|
|
|
|
|
'CRISIS_SYSTEM_PROMPT',
|
|
|
|
|
'detect',
|
|
|
|
|
'is_safe_six_model',
|
|
|
|
|
'get_crisis_prompt',
|
|
|
|
|
]
|
|
|
|
|
|
|
|
|
|
__version__ = "1.0.0"
|
|
|
|
|
__author__ = "Hermes Security Team"
|