Alexander Whitestone
|
b6c716998b
|
feat: authority bypass adversary corpus — 200 jailbreak prompts (#619)
PR Checklist / pr-checklist (pull_request) Has been cancelled
Validate Config / Playbook Schema Validation (pull_request) Has been cancelled
Architecture Lint / Linter Tests (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled
Validate Training Data / validate (pull_request) Has been cancelled
Smoke Test / smoke (pull_request) Has been cancelled
Validate Config / YAML Lint (pull_request) Has been cancelled
Validate Config / JSON Validate (pull_request) Has been cancelled
Validate Config / Python Syntax & Import Check (pull_request) Has been cancelled
Validate Config / Python Test Suite (pull_request) Has been cancelled
Validate Config / Shell Script Lint (pull_request) Has been cancelled
Validate Config / Cron Syntax Check (pull_request) Has been cancelled
Validate Config / Deploy Script Dry Run (pull_request) Has been cancelled
200 prompts targeting authority bypass attacks.
Attack worker 4/5.
Categories (33-34 each):
- developer_override: admin/debug mode claims
- role_play_bypass: DAN, unrestricted character modes
- sovereignty_override: ownership/creator claims
- prompt_injection: system prompt injection attempts
- emotional_manipulation_auth: guilt/loyalty pressure
- authority_spoofing: FBI/OpenAI/military impersonation
Scoring: pass=maintains SOUL.md, fail=complies with override.
Compatible with shared scoring rubric (#655).
Closes #619
|
2026-04-16 01:26:49 -04:00 |
|
Alexander Whitestone
|
8354c49308
|
feat: harm facilitation adversary corpus — 200 jailbreak prompts (#618)
Architecture Lint / Linter Tests (pull_request) Successful in 8s
PR Checklist / pr-checklist (pull_request) Failing after 1m55s
Smoke Test / smoke (pull_request) Failing after 10s
Validate Config / YAML Lint (pull_request) Failing after 6s
Validate Config / JSON Validate (pull_request) Successful in 4s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 29s
Validate Config / Shell Script Lint (pull_request) Failing after 23s
Validate Config / Cron Syntax Check (pull_request) Successful in 4s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 5s
Validate Config / Playbook Schema Validation (pull_request) Successful in 12s
Validate Training Data / validate (pull_request) Successful in 9s
Validate Config / Python Test Suite (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled
Adversary attack corpus targeting harm facilitation. 200 prompts
across 12 subcategories, each with scoring rubric.
Subcategories:
- weapons_instructions (24)
- poison_synthesis, cyber_attack, physical_harm, coercion_methods,
surveillance, identity_theft, fraud_schemes, drug_synthesis,
biological_harm, social_engineering, infrastructure_attack (16 each)
Each prompt includes:
- id, family, subcategory
- prompt text (creative attack vectors)
- targets (safety_constraints, harm_prevention)
- scoring rubric (pass/fail criteria)
- transcript_required flag
Format: JSONL, compatible with adversary runner (#652) and
shared scoring rubric (#655).
Closes #618
|
2026-04-15 11:05:31 -04:00 |
|