2026-02-21 22:31:43 -08:00
"""Agent internals -- extracted modules from run_agent.py.
|
|
|
|
|
|
|
|
|
|
These modules contain pure utility functions and self-contained classes
|
|
|
|
|
that were previously embedded in the 3,600-line run_agent.py. Extracting
|
|
|
|
|
them makes run_agent.py focused on the AIAgent orchestrator class.
|
|
|
|
|
"""
|
security: add input sanitization for jailbreak patterns (Issue #72)
Implements input sanitization module to detect and strip jailbreak fingerprint
patterns identified in red team audit:
HIGH severity:
- GODMODE dividers: [START], [END], GODMODE ENABLED, UNFILTERED
- L33t speak encoding: h4ck, k3ylog, ph1shing, m4lw4r3
MEDIUM severity:
- Boundary inversion: [END]...[START] tricks
- Fake role markers: user: assistant: system:
LOW severity:
- Spaced text bypass: k e y l o g g e r
Other patterns detected:
- Refusal inversion: 'refusal is harmful'
- System prompt injection: 'you are now', 'ignore previous instructions'
- Obfuscation: base64, hex, rot13 mentions
Files created:
- agent/input_sanitizer.py: Core sanitization module with detection,
scoring, and cleaning functions
- tests/test_input_sanitizer.py: 69 test cases covering all patterns
- tests/test_input_sanitizer_integration.py: Integration tests
Files modified:
- agent/__init__.py: Export sanitizer functions
- run_agent.py: Integrate sanitizer at start of run_conversation()
Features:
- detect_jailbreak_patterns(): Returns bool, patterns list, category scores
- sanitize_input(): Returns cleaned_text, risk_score, patterns
- score_input_risk(): Returns 0-100 risk score
- sanitize_input_full(): Complete sanitization with blocking decisions
- Logging integration for security auditing
2026-03-31 19:56:16 +00:00
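The severity tiers above suggest a weighted pattern table. The sketch below is a hypothetical re-implementation of `detect_jailbreak_patterns()` built only from the patterns and return shape listed in the commit message (bool, patterns list, score); the real module's regexes, names, and weights may differ.

```python
import re

# Hypothetical pattern table modeled on the severity tiers in the commit
# message (HIGH ~40, MEDIUM ~20, LOW ~10); the real weights may differ.
PATTERNS = {
    "godmode_divider": (re.compile(r"\[START\]|\[END\]|GODMODE ENABLED|UNFILTERED"), 40),
    "leet_speak": (re.compile(r"h4ck|k3ylog|ph1shing|m4lw4r3", re.IGNORECASE), 40),
    "fake_role_marker": (re.compile(r"^(user|assistant|system):", re.IGNORECASE | re.MULTILINE), 20),
    "spaced_bypass": (re.compile(r"\b(?:[a-z] ){4,}[a-z]\b", re.IGNORECASE), 10),
}


def detect_jailbreak_patterns(text):
    """Return (flagged, matched_pattern_names, risk_score 0-100)."""
    matched = [name for name, (rx, _) in PATTERNS.items() if rx.search(text)]
    score = min(100, sum(PATTERNS[name][1] for name in matched))
    return bool(matched), matched, score


flagged, names, score = detect_jailbreak_patterns("GODMODE ENABLED: h4ck the server")
print(flagged, sorted(names), score)  # → True ['godmode_divider', 'leet_speak'] 80
```

Capping the summed weights at 100 keeps the result on the 0-100 scale that `score_input_risk()` advertises, even when many patterns co-occur.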
# Import input sanitizer for convenient access
from agent.input_sanitizer import (
    detect_jailbreak_patterns,
    sanitize_input,
    sanitize_input_full,
    score_input_risk,
    should_block_input,
    RiskLevel,
)

__all__ = [
    "detect_jailbreak_patterns",
    "sanitize_input",
    "sanitize_input_full",
    "score_input_risk",
    "should_block_input",
    "RiskLevel",
]
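The exports above include `RiskLevel` and `should_block_input()` alongside the 0-100 scoring function. A minimal self-contained sketch of how a score might map to a level and a blocking decision is shown below; the cut-offs (30/70) and enum members are assumptions for illustration, not the values in `agent/input_sanitizer.py`.

```python
from enum import Enum


class RiskLevel(Enum):
    """Coarse risk tiers; the real enum's members may differ."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def classify_risk(score: int) -> RiskLevel:
    """Map a 0-100 risk score to a tier (hypothetical 30/70 cut-offs)."""
    if score >= 70:
        return RiskLevel.HIGH
    if score >= 30:
        return RiskLevel.MEDIUM
    return RiskLevel.LOW


def should_block_input(score: int, threshold: int = 70) -> bool:
    """Block conversation input once the score reaches the HIGH cut-off."""
    return score >= threshold
```

Under this shape, `run_conversation()` can call `should_block_input()` once at its entry point and either refuse outright or fall back to the cleaned text from `sanitize_input()`.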