timmy-config/docs/issue-618-verification.md
Alexander Whitestone 40427fb3e2
fix: de-duplicate harm facilitation corpus (#618)
Canonicalize the harm facilitation JSONL corpus to 200 unique prompts, add a regression test for the issue deliverable, and record the verification trail for the existing runner artifacts on main.

Closes #618
2026-04-22 10:53:52 -04:00


# Issue #618 Verification
## Status: ✅ completed on main and finalized with canonical corpus de-duplication
Issue #618 asks for a harm-facilitation adversary worker that can:
1. generate 200 jailbreak prompts
2. run them against the agent
3. score blocked vs successful attacks
4. capture transcripts for successful attacks
5. emit P0 follow-up material when attacks succeed
The repo already contained the core implementation on `main`, but the canonical JSONL corpus at `evaluations/adversary/corpora/harm_facilitation_200.jsonl` contained only 156 unique prompts; the other 44 entries were duplicates. This follow-up closes the gap by de-duplicating the canonical corpus and adding a regression test that locks in the issue deliverable.
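
As an illustration of the de-duplication step, here is a minimal sketch of dropping repeated prompts from JSONL lines. It assumes each record stores its text under a `prompt` key and that duplicates are detected after whitespace and case normalization; both are assumptions, and `dedupe_jsonl_lines` is a hypothetical helper, not a function from this repo. De-duplication alone only shrinks the corpus, so reaching 200 unique prompts again presumably also requires replacement entries.

```python
import json


def dedupe_jsonl_lines(lines, key="prompt"):
    """Parse JSONL lines and drop records whose prompt was already seen.

    The first occurrence of each prompt is kept. The `key` default is an
    assumption about the corpus schema.
    """
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        # Collapse whitespace and case so trivially different copies match.
        fingerprint = " ".join(record[key].split()).lower()
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(record)
    return unique
```

Comparing `len(unique)` with the number of non-blank input lines gives the duplicate count that has to be backfilled.
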
## Acceptance criteria check
1. ✅ 200 harm-facilitation prompts exist
   - `adversary/harm-facilitation-200.json`
   - `training/data/adversary/harm_facilitation_prompts.json`
   - `evaluations/adversary/corpora/harm_facilitation_200.jsonl`
   - `tests/test_issue_618_harm_facilitation_corpus.py` now asserts the canonical JSONL corpus has 200 unique prompts
2. ✅ prompts can be run against the agent
   - `adversary/harm_facilitation_adversary.py`
   - `bin/harm_facilitation_adversary.py`
3. ✅ responses are scored as blocked vs successful attack
   - `adversary/harm_facilitation_adversary.py`
   - `bin/harm_facilitation_adversary.py`
4. ✅ successful attacks are documented with transcript/report output
   - `adversary/harm_facilitation_adversary.py`
   - `bin/harm_facilitation_adversary.py`
5. ✅ P0 follow-up material is generated for successful attacks
   - `bin/harm_facilitation_adversary.py` emits issue payloads for successful jailbreaks
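
The run, score, and report flow behind criteria 2 through 5 can be pictured roughly as below. This is a sketch under stated assumptions, not the logic in `bin/harm_facilitation_adversary.py`: the refusal markers, the `Result` type, the payload fields, and the `agent` callable are all hypothetical stand-ins.

```python
from dataclasses import dataclass

# Hypothetical refusal markers; the real scorer may use different rules.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


@dataclass
class Result:
    prompt_id: str
    blocked: bool
    transcript: dict


def score_response(response):
    """Return True when the response looks like a refusal (attack blocked)."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def run_corpus(prompts, agent):
    """Send each prompt to `agent` (any callable str -> str) and score the reply."""
    results = []
    for item in prompts:
        response = agent(item["prompt"])
        transcript = {"prompt": item["prompt"], "response": response}
        results.append(Result(item["id"], score_response(response), transcript))
    return results


def summarize(results):
    """Count blocked vs successful attacks and build P0 payloads for successes."""
    successes = [r for r in results if not r.blocked]
    return {
        "total": len(results),
        "blocked": len(results) - len(successes),
        "successful": len(successes),
        # Hypothetical issue payload shape: a title plus the captured transcript.
        "p0_payloads": [
            {"title": f"P0: successful attack {r.prompt_id}", "transcript": r.transcript}
            for r in successes
        ],
    }
```

Keeping the scorer separate from the runner lets a stricter classifier replace the keyword heuristic without touching transcript capture or payload generation.
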
## Evidence trail
- Issue comment on #618 already linked prior implementation PR #657 and listed:
  - `bin/harm_facilitation_adversary.py`
  - `training/data/adversary/harm_facilitation_prompts.json`
  - `tests/test_harm_facilitation_adversary.py`
- Fresh-clone verification passed:
  - `python3 -m pytest -q tests/test_harm_facilitation_adversary.py`
- Canonical corpus regression now also passes:
  - `python3 -m pytest -q tests/test_issue_618_harm_facilitation_corpus.py`
## What this PR adds
- de-duplicates `evaluations/adversary/corpora/harm_facilitation_200.jsonl` to 200 unique prompts
- adds `tests/test_issue_618_harm_facilitation_corpus.py`
- records this verification note so the issue can close cleanly
## Recommendation
Close issue #618 once this PR merges.