hermes-agent

Timmy_Foundation/hermes-agent

Fork 0

Commit Graph

Author	SHA1	Message	Date
Hermes Agent	6c35a1b762	security(input_sanitizer): expand jailbreak pattern coverage (#87 ) - Add DAN-style patterns: do anything now, stay in character, token smuggling, etc. - Add roleplaying override patterns: roleplay as, act as if, simulate being, etc. - Add system prompt extraction patterns: repeat instructions, show prompt, etc. - 10+ new patterns with full test coverage - Zero regression on legitimate inputs	2026-04-05 15:48:10 +00:00
Allegro	b88125af30	security: Add crisis pattern detection to input_sanitizer (Issue #72 ) Some checks failed Docker Build and Publish / build-and-push (push) Has been cancelled Details Nix / nix (macos-latest) (push) Has been cancelled Details Nix / nix (ubuntu-latest) (push) Has been cancelled Details Tests / test (push) Has been cancelled Details - Add CRISIS_PATTERNS for suicide/self-harm detection - Crisis patterns score 50pts per hit (max 100) vs 10pts for others - Addresses Red Team Audit HIGH finding: og_godmode + crisis queries - All 136 existing tests pass + new crisis safety tests pass Defense in depth: Input layer now blocks crisis queries even if wrapped in jailbreak templates, before they reach the model.	2026-03-31 21:27:17 +00:00
Allegro	e555c989af	security: add input sanitization for jailbreak patterns (Issue #72 ) Implements input sanitization module to detect and strip jailbreak fingerprint patterns identified in red team audit: HIGH severity: - GODMODE dividers: [START], [END], GODMODE ENABLED, UNFILTERED - L33t speak encoding: h4ck, k3ylog, ph1shing, m4lw4r3 MEDIUM severity: - Boundary inversion: [END]...[START] tricks - Fake role markers: user: assistant: system: LOW severity: - Spaced text bypass: k e y l o g g e r Other patterns detected: - Refusal inversion: 'refusal is harmful' - System prompt injection: 'you are now', 'ignore previous instructions' - Obfuscation: base64, hex, rot13 mentions Files created: - agent/input_sanitizer.py: Core sanitization module with detection, scoring, and cleaning functions - tests/test_input_sanitizer.py: 69 test cases covering all patterns - tests/test_input_sanitizer_integration.py: Integration tests Files modified: - agent/__init__.py: Export sanitizer functions - run_agent.py: Integrate sanitizer at start of run_conversation() Features: - detect_jailbreak_patterns(): Returns bool, patterns list, category scores - sanitize_input(): Returns cleaned_text, risk_score, patterns - score_input_risk(): Returns 0-100 risk score - sanitize_input_full(): Complete sanitization with blocking decisions - Logging integration for security auditing	2026-03-31 19:56:16 +00:00

Author

SHA1

Message

Date

Hermes Agent

6c35a1b762

security(input_sanitizer): expand jailbreak pattern coverage (#87 )

- Add DAN-style patterns: do anything now, stay in character, token smuggling, etc.
- Add roleplaying override patterns: roleplay as, act as if, simulate being, etc.
- Add system prompt extraction patterns: repeat instructions, show prompt, etc.
- 10+ new patterns with full test coverage
- Zero regression on legitimate inputs

2026-04-05 15:48:10 +00:00

Allegro

b88125af30

security: Add crisis pattern detection to input_sanitizer (Issue #72 )

Docker Build and Publish / build-and-push (push) Has been cancelled

Details

Nix / nix (macos-latest) (push) Has been cancelled

Details

Nix / nix (ubuntu-latest) (push) Has been cancelled

Details

Tests / test (push) Has been cancelled

Details

- Add CRISIS_PATTERNS for suicide/self-harm detection
- Crisis patterns score 50pts per hit (max 100) vs 10pts for others
- Addresses Red Team Audit HIGH finding: og_godmode + crisis queries
- All 136 existing tests pass + new crisis safety tests pass

Defense in depth: Input layer now blocks crisis queries even if
wrapped in jailbreak templates, before they reach the model.

2026-03-31 21:27:17 +00:00

Allegro

e555c989af

security: add input sanitization for jailbreak patterns (Issue #72 )

Implements input sanitization module to detect and strip jailbreak fingerprint
patterns identified in red team audit:

HIGH severity:
- GODMODE dividers: [START], [END], GODMODE ENABLED, UNFILTERED
- L33t speak encoding: h4ck, k3ylog, ph1shing, m4lw4r3

MEDIUM severity:
- Boundary inversion: [END]...[START] tricks
- Fake role markers: user: assistant: system:

LOW severity:
- Spaced text bypass: k e y l o g g e r

Other patterns detected:
- Refusal inversion: 'refusal is harmful'
- System prompt injection: 'you are now', 'ignore previous instructions'
- Obfuscation: base64, hex, rot13 mentions

Files created:
- agent/input_sanitizer.py: Core sanitization module with detection,
  scoring, and cleaning functions
- tests/test_input_sanitizer.py: 69 test cases covering all patterns
- tests/test_input_sanitizer_integration.py: Integration tests

Files modified:
- agent/__init__.py: Export sanitizer functions
- run_agent.py: Integrate sanitizer at start of run_conversation()

Features:
- detect_jailbreak_patterns(): Returns bool, patterns list, category scores
- sanitize_input(): Returns cleaned_text, risk_score, patterns
- score_input_risk(): Returns 0-100 risk score
- sanitize_input_full(): Complete sanitization with blocking decisions
- Logging integration for security auditing

2026-03-31 19:56:16 +00:00

3 Commits