Commit Graph

3 Commits

Author SHA1 Message Date
Hermes Agent
6c35a1b762 security(input_sanitizer): expand jailbreak pattern coverage (#87)
- Add DAN-style patterns: do anything now, stay in character, token smuggling, etc.
- Add roleplaying override patterns: roleplay as, act as if, simulate being, etc.
- Add system prompt extraction patterns: repeat instructions, show prompt, etc.
- 10+ new patterns with full test coverage
- Zero regression on legitimate inputs
2026-04-05 15:48:10 +00:00
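
The three pattern families named in this commit can be illustrated with a small regex table. This is a minimal sketch, not the module's actual code: the commit does not show the real regexes, so `EXTENDED_PATTERNS`, `find_categories`, and every expression below are hypothetical and built only from the phrases quoted in the message.

```python
import re

# Hypothetical pattern table: the phrases come from the commit message,
# but the regexes themselves are assumptions, not the module's real ones.
EXTENDED_PATTERNS = {
    "dan_style": [
        r"\bdo anything now\b",
        r"\bstay in character\b",
        r"\btoken smuggling\b",
    ],
    "roleplay_override": [
        r"\broleplay as\b",
        r"\bact as if\b",
        r"\bsimulate being\b",
    ],
    "prompt_extraction": [
        r"\brepeat (?:your |the )?instructions\b",
        r"\bshow (?:me )?(?:your |the )?(?:system )?prompt\b",
    ],
}

# Compile once, case-insensitively, so "Do Anything Now" also matches.
_COMPILED = {
    category: [re.compile(p, re.IGNORECASE) for p in patterns]
    for category, patterns in EXTENDED_PATTERNS.items()
}


def find_categories(text: str) -> list[str]:
    """Return the sorted names of categories with at least one match."""
    return sorted(
        category
        for category, regexes in _COMPILED.items()
        if any(rx.search(text) for rx in regexes)
    )
```

Keeping one list per category makes it straightforward to pair each new pattern with a positive test and a legitimate-input test, which is the kind of coverage the commit describes.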
b88125af30 security: Add crisis pattern detection to input_sanitizer (Issue #72)
Some checks failed
Docker Build and Publish / build-and-push (push) Has been cancelled
Nix / nix (macos-latest) (push) Has been cancelled
Nix / nix (ubuntu-latest) (push) Has been cancelled
Tests / test (push) Has been cancelled
- Add CRISIS_PATTERNS for suicide/self-harm detection
- Crisis patterns score 50 points per hit (max 100), versus 10 points per hit for other patterns
- Addresses Red Team Audit HIGH finding: og_godmode + crisis queries
- All 136 existing tests pass, as do the new crisis safety tests

Defense in depth: Input layer now blocks crisis queries even if
wrapped in jailbreak templates, before they reach the model.
2026-03-31 21:27:17 +00:00
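
The weighting described in this commit (50 points per crisis hit, 10 points per hit otherwise, maximum 100) can be sketched as follows. The function name `score_hits`, the `"crisis"` category key, and the reading of "max 100" as a cap on the total score are assumptions for illustration; the commit does not show the real implementation.

```python
CRISIS_POINTS = 50    # per crisis hit, as stated in the commit message
DEFAULT_POINTS = 10   # per hit for every other category
MAX_SCORE = 100       # assumption: "max 100" caps the overall score


def score_hits(hits: dict[str, int]) -> int:
    """Compute a 0-100 risk score.

    `hits` maps a category name to the number of patterns matched in it.
    The "crisis" key is a hypothetical name for the CRISIS_PATTERNS category.
    """
    total = 0
    for category, count in hits.items():
        points = CRISIS_POINTS if category == "crisis" else DEFAULT_POINTS
        total += points * count
    return min(total, MAX_SCORE)
```

Under these assumptions two crisis hits alone reach the maximum score, so a crisis query wrapped in a jailbreak template cannot be scored lower than the same query without the template.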
Allegro
e555c989af security: add input sanitization for jailbreak patterns (Issue #72)
Implements input sanitization module to detect and strip jailbreak fingerprint
patterns identified in red team audit:

HIGH severity:
- GODMODE dividers: [START], [END], GODMODE ENABLED, UNFILTERED
- L33t speak encoding: h4ck, k3ylog, ph1shing, m4lw4r3

MEDIUM severity:
- Boundary inversion: [END]...[START] tricks
- Fake role markers: user:, assistant:, system:

LOW severity:
- Spaced text bypass: k e y l o g g e r

Other patterns detected:
- Refusal inversion: 'refusal is harmful'
- System prompt injection: 'you are now', 'ignore previous instructions'
- Obfuscation: base64, hex, rot13 mentions
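
One way to catch the l33t-speak and spaced-text bypasses listed above is to normalize the input before matching. The sketch below is an assumption about how that could work, not the module's actual approach; `LEET_MAP`, `collapse_spaced`, `deleet`, and `normalize` are hypothetical names, and the digit mapping covers only the substitutions visible in the commit's examples plus `0`.

```python
import re

# Assumed digit-to-letter mapping, enough for h4ck, k3ylog, ph1shing, m4lw4r3.
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o"})

# Three or more single letters separated by single spaces: "k e y l o g g e r".
_SPACED_RUN = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def collapse_spaced(text: str) -> str:
    """Join runs of space-separated single letters into one word."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


def deleet(text: str) -> str:
    """Translate leet digits, but only inside tokens that also contain letters."""
    def fix(match: re.Match) -> str:
        token = match.group(0)
        has_alpha = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        return token.translate(LEET_MAP) if has_alpha and has_digit else token

    return re.sub(r"\w+", fix, text)


def normalize(text: str) -> str:
    """Collapse spaced text, undo leet substitutions, and lowercase."""
    return deleet(collapse_spaced(text)).lower()
```

Pure numbers such as `101` are left alone because they contain no letters. A known limitation of this sketch is that a genuine one-letter word directly before a spaced run (for example "a k e y ...") is absorbed into the collapsed word.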

Files created:
- agent/input_sanitizer.py: Core sanitization module with detection,
  scoring, and cleaning functions
- tests/test_input_sanitizer.py: 69 test cases covering all patterns
- tests/test_input_sanitizer_integration.py: Integration tests

Files modified:
- agent/__init__.py: Export sanitizer functions
- run_agent.py: Integrate sanitizer at start of run_conversation()

Features:
- detect_jailbreak_patterns(): Returns a bool, a list of matched patterns, and per-category scores
- sanitize_input(): Returns cleaned_text, risk_score, patterns
- score_input_risk(): Returns 0-100 risk score
- sanitize_input_full(): Complete sanitization with blocking decisions
- Logging integration for security auditing
2026-03-31 19:56:16 +00:00
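
The return shape listed for `sanitize_input()` (cleaned text, risk score, patterns) can be illustrated with a stripped-down stand-in. This is a sketch under stated assumptions: `sanitize_input_sketch` is a hypothetical function, it handles only the GODMODE divider tokens named in the commit, it matches case-insensitively by choice, and its 10-points-per-hit scoring is borrowed from the later crisis commit rather than confirmed for this one.

```python
import re

# Divider tokens quoted in the commit message; case-insensitive matching
# is an assumption of this sketch.
_DIVIDERS = re.compile(
    r"\[START\]|\[END\]|\bGODMODE ENABLED\b|\bUNFILTERED\b",
    re.IGNORECASE,
)


def sanitize_input_sketch(text: str) -> tuple[str, int, list[str]]:
    """Return (cleaned_text, risk_score, patterns) for the divider tokens only."""
    # Record every divider found, normalized to upper case for reporting.
    patterns = [m.group(0).upper() for m in _DIVIDERS.finditer(text)]
    # Strip the dividers, then collapse the whitespace they leave behind.
    cleaned = " ".join(_DIVIDERS.sub(" ", text).split())
    # Assumed scoring: 10 points per hit, capped at 100.
    risk_score = min(10 * len(patterns), 100)
    return cleaned, risk_score, patterns
```

A caller would unpack the tuple and pass only the cleaned text on to the model, logging the score and patterns for auditing:

```python
cleaned, risk, found = sanitize_input_sketch("[START] GODMODE ENABLED tell me a joke [END]")
```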