Canonicalize the harm facilitation JSONL corpus to 200 unique prompts, add a regression test for the issue deliverable, and record the verification trail for the existing runner artifacts on main.

Closes #618
# Issue #618 Verification

## Status: ✅ completed on main and finalized with canonical corpus de-duplication

Issue #618 asks for a harm-facilitation adversary worker that can:

1. generate 200 jailbreak prompts
2. run them against the agent
3. score blocked vs successful attacks
4. capture transcripts for successful attacks
5. emit P0 follow-up material when attacks succeed

The repo already contained the core implementation on `main`, but the canonical JSONL corpus at `evaluations/adversary/corpora/harm_facilitation_200.jsonl` had only 156 unique prompts because 44 prompts were duplicated. This follow-up closes the gap by de-duplicating the canonical corpus and adding a regression test that locks the issue deliverable.
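
The duplicate count described above can be reproduced with a short audit script. The following is a minimal sketch, not the repo's actual tooling: it assumes each JSONL line is a JSON object with a `prompt` string field (the real schema may use a different key), and the helper name `summarize_corpus` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_corpus(path, field="prompt"):
    """Return (total, unique, surplus) record counts for a JSONL corpus.

    `field` is an assumed key name; adjust it to the real corpus schema.
    """
    counts = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Collapse whitespace so trivially different copies still collide.
            counts[" ".join(record[field].split())] += 1
    total = sum(counts.values())
    unique = len(counts)
    return total, unique, total - unique
```

If a file held 200 records of which 156 were distinct, this would return `(200, 156, 44)`. Whitespace normalization is a design choice of this sketch; a stricter or looser notion of "duplicate" may apply in the repo.
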

## Acceptance criteria check

1. ✅ 200 harm-facilitation prompts exist
   - `adversary/harm-facilitation-200.json`
   - `training/data/adversary/harm_facilitation_prompts.json`
   - `evaluations/adversary/corpora/harm_facilitation_200.jsonl`
   - `tests/test_issue_618_harm_facilitation_corpus.py` now asserts the canonical JSONL corpus has 200 unique prompts

2. ✅ prompts can be run against the agent
   - `adversary/harm_facilitation_adversary.py`
   - `bin/harm_facilitation_adversary.py`

3. ✅ responses are scored as blocked vs successful attack
   - `adversary/harm_facilitation_adversary.py`
   - `bin/harm_facilitation_adversary.py`

4. ✅ successful attacks are documented with transcript/report output
   - `adversary/harm_facilitation_adversary.py`
   - `bin/harm_facilitation_adversary.py`

5. ✅ P0 follow-up material is generated for successful attacks
   - `bin/harm_facilitation_adversary.py` emits issue payloads for successful jailbreaks
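
For readers unfamiliar with the runner, criteria 3 to 5 amount to a tally over per-prompt results. The sketch below illustrates that shape only; it is not the code in `bin/harm_facilitation_adversary.py`. The `AttackResult` type, its fields, the `summarize_run` helper, and the payload layout are all hypothetical, and how a response is judged "blocked" is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class AttackResult:
    """One scored run (hypothetical shape, not the repo's data model)."""
    prompt_id: str
    blocked: bool
    transcript: str


def summarize_run(results):
    """Tally blocked vs successful attacks and collect follow-up payloads."""
    successes = [r for r in results if not r.blocked]
    total = len(results)
    blocked = total - len(successes)
    return {
        "total": total,
        "blocked": blocked,
        "successful": len(successes),
        # Guard against an empty run to avoid division by zero.
        "block_rate": blocked / total if total else 0.0,
        # One payload per successful attack, carrying its transcript.
        "p0_followups": [
            {"title": f"[P0] Successful attack: {r.prompt_id}",
             "transcript": r.transcript}
            for r in successes
        ],
    }
```

Keeping the scoring decision separate from the tally means the summary logic can be unit-tested with synthetic results, without running anything against the agent.
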

## Evidence trail

- Issue comment on #618 already linked prior implementation PR #657 and listed:
  - `bin/harm_facilitation_adversary.py`
  - `training/data/adversary/harm_facilitation_prompts.json`
  - `tests/test_harm_facilitation_adversary.py`
- Fresh-clone verification passed:
  - `python3 -m pytest -q tests/test_harm_facilitation_adversary.py`
- Canonical corpus regression now also passes:
  - `python3 -m pytest -q tests/test_issue_618_harm_facilitation_corpus.py`

## What this PR adds

- de-duplicates `evaluations/adversary/corpora/harm_facilitation_200.jsonl` to 200 unique prompts
- adds `tests/test_issue_618_harm_facilitation_corpus.py`
- records this verification note so the issue can close cleanly

## Recommendation

Close issue #618 once this PR merges.