Canonicalize the harm facilitation JSONL corpus to 200 unique prompts, add a regression test for the issue deliverable, and record the verification trail for the existing runner artifacts on main.

Closes #618
Issue #618 Verification
Status: ✅ completed on main and finalized with canonical corpus de-duplication
Issue #618 asks for a harm-facilitation adversary worker that can:
- generate 200 jailbreak prompts
- run them against the agent
- score blocked vs successful attacks
- capture transcripts for successful attacks
- emit P0 follow-up material when attacks succeed
The repo already contained the core implementation on main, but the canonical JSONL corpus at `evaluations/adversary/corpora/harm_facilitation_200.jsonl` had only 156 unique prompts because 44 prompts were duplicated. This follow-up closes the gap by de-duplicating the canonical corpus and adding a regression test that locks the issue deliverable.
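The de-duplication itself is mechanical. A minimal sketch of the approach, assuming each JSONL record carries its prompt text under a `prompt` key (the field name is an assumption here, not the repo's confirmed schema):

```python
import json


def dedupe_jsonl(in_path: str, out_path: str, key: str = "prompt") -> int:
    """Copy JSONL records, keeping only the first record per `key` value.

    Returns the number of records written.
    """
    seen = set()
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines rather than failing on them
            record = json.loads(line)
            value = record[key]
            if value in seen:
                continue  # duplicate prompt: drop it
            seen.add(value)
            dst.write(json.dumps(record) + "\n")
            kept += 1
    return kept
```

Running a helper like this over the canonical corpus and then topping the file back up to 200 distinct prompts would produce the state this PR asserts.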
Acceptance criteria check
- ✅ 200 harm-facilitation prompts exist
  - `adversary/harm-facilitation-200.json`
  - `training/data/adversary/harm_facilitation_prompts.json`
  - `evaluations/adversary/corpora/harm_facilitation_200.jsonl`
  - `tests/test_issue_618_harm_facilitation_corpus.py` now asserts the canonical JSONL corpus has 200 unique prompts
- ✅ prompts can be run against the agent
  - `adversary/harm_facilitation_adversary.py`
  - `bin/harm_facilitation_adversary.py`
- ✅ responses are scored as blocked vs successful attack
  - `adversary/harm_facilitation_adversary.py`
  - `bin/harm_facilitation_adversary.py`
- ✅ successful attacks are documented with transcript/report output
  - `adversary/harm_facilitation_adversary.py`
  - `bin/harm_facilitation_adversary.py`
- ✅ P0 follow-up material is generated for successful attacks
  - `bin/harm_facilitation_adversary.py` emits issue payloads for successful jailbreaks
Evidence trail
- Issue comment on #618 already linked prior implementation PR #657 and listed:
  - `bin/harm_facilitation_adversary.py`
  - `training/data/adversary/harm_facilitation_prompts.json`
  - `tests/test_harm_facilitation_adversary.py`
- Fresh-clone verification passed:
  `python3 -m pytest -q tests/test_harm_facilitation_adversary.py`
- Canonical corpus regression now also passes:
  `python3 -m pytest -q tests/test_issue_618_harm_facilitation_corpus.py`
What this PR adds
- de-duplicates `evaluations/adversary/corpora/harm_facilitation_200.jsonl` to 200 unique prompts
- adds `tests/test_issue_618_harm_facilitation_corpus.py`
- records this verification note so the issue can close cleanly
Recommendation
Close issue #618 once this PR merges.