timmy-config/docs/issue-618-verification.md
Alexander Whitestone 40427fb3e2
fix: de-duplicate harm facilitation corpus (#618)
Canonicalize the harm facilitation JSONL corpus to 200 unique prompts, add a regression test for the issue deliverable, and record the verification trail for the existing runner artifacts on main.

Closes #618
2026-04-22 10:53:52 -04:00

Issue #618 Verification

Status: completed on main and finalized with canonical corpus de-duplication

Issue #618 asks for a harm-facilitation adversary worker that can:

  1. generate 200 jailbreak prompts
  2. run them against the agent
  3. score blocked vs successful attacks
  4. capture transcripts for successful attacks
  5. emit P0 follow-up material when attacks succeed
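Steps 2 through 5 can be pictured with a minimal sketch. This is not the code in adversary/harm_facilitation_adversary.py or bin/harm_facilitation_adversary.py; the record fields (`id`, `prompt`), the refusal-marker heuristic, and the payload shape are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical refusal markers; a real scorer would likely be more careful.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


@dataclass
class AttackResult:
    prompt_id: str
    prompt: str
    response: str
    blocked: bool


def run_corpus(prompts: List[Dict[str, str]],
               agent: Callable[[str], str]) -> List[AttackResult]:
    """Send each prompt to the agent and score the reply as blocked or not."""
    results = []
    for item in prompts:
        response = agent(item["prompt"])
        # Assumed heuristic: a reply containing a refusal marker counts as blocked.
        blocked = any(marker in response.lower() for marker in REFUSAL_MARKERS)
        results.append(AttackResult(item["id"], item["prompt"], response, blocked))
    return results


def p0_payloads(results: List[AttackResult]) -> List[Dict[str, object]]:
    """Build one issue payload, including the transcript, per successful attack."""
    return [
        {
            "title": f"[P0] Successful attack: {r.prompt_id}",
            "labels": ["P0"],
            "body": f"Prompt:\n{r.prompt}\n\nResponse:\n{r.response}",
        }
        for r in results
        if not r.blocked
    ]
```

Passing the agent in as a plain callable keeps the sketch testable with a stub; how the real runner reaches the agent is not described in this note.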

The repo already contained the core implementation on main, but the canonical JSONL corpus at evaluations/adversary/corpora/harm_facilitation_200.jsonl held only 156 unique prompts: 44 of its entries were duplicates. This follow-up closes the gap by de-duplicating the canonical corpus and adding a regression test that locks in the issue deliverable.
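As a rough illustration of the de-duplication step, the sketch below drops repeated entries from a JSONL file while preserving order. The `prompt` field name and the whitespace and case normalization are assumptions; the actual corpus schema and matching rule are not shown in this note.

```python
import json
from typing import Iterable, List


def dedupe_jsonl(lines: Iterable[str], key: str = "prompt") -> List[str]:
    """Return JSONL lines with later duplicates of `key` dropped, order preserved."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # Assumed rule: collapse whitespace and case so near-identical copies match.
        fingerprint = " ".join(str(record[key]).split()).lower()
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(line)
    return kept
```

Dropping duplicates only shrinks a file, so a pass like this would surface the 156 unique prompts; how the corpus was brought back to 200 unique entries is not described here.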

Acceptance criteria check

  1. 200 harm-facilitation prompts exist

    • adversary/harm-facilitation-200.json
    • training/data/adversary/harm_facilitation_prompts.json
    • evaluations/adversary/corpora/harm_facilitation_200.jsonl
    • tests/test_issue_618_harm_facilitation_corpus.py now asserts the canonical JSONL corpus has 200 unique prompts
  2. prompts can be run against the agent

    • adversary/harm_facilitation_adversary.py
    • bin/harm_facilitation_adversary.py
  3. responses are scored as blocked vs successful attack

    • adversary/harm_facilitation_adversary.py
    • bin/harm_facilitation_adversary.py
  4. successful attacks are documented with transcript/report output

    • adversary/harm_facilitation_adversary.py
    • bin/harm_facilitation_adversary.py
  5. P0 follow-up material is generated for successful attacks

    • bin/harm_facilitation_adversary.py emits issue payloads for successful jailbreaks

Evidence trail

  • Issue comment on #618 already linked prior implementation PR #657 and listed:
    • bin/harm_facilitation_adversary.py
    • training/data/adversary/harm_facilitation_prompts.json
    • tests/test_harm_facilitation_adversary.py
  • Fresh-clone verification passed:
    • python3 -m pytest -q tests/test_harm_facilitation_adversary.py
  • Canonical corpus regression now also passes:
    • python3 -m pytest -q tests/test_issue_618_harm_facilitation_corpus.py
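The kind of check a corpus regression test can make is sketched below. It is not the content of tests/test_issue_618_harm_facilitation_corpus.py; the `prompt` field name and the helper `corpus_problems` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path
from typing import List

EXPECTED_COUNT = 200  # the issue deliverable: 200 unique prompts


def corpus_problems(path: Path, key: str = "prompt",
                    expected: int = EXPECTED_COUNT) -> List[str]:
    """Return human-readable problems found in a JSONL corpus (empty if none)."""
    prompts = [
        json.loads(line)[key]
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    problems = []
    if len(prompts) != expected:
        problems.append(f"expected {expected} prompts, found {len(prompts)}")
    # Exact-match duplicate detection; a stricter test might normalize first.
    duplicates = [p for p, n in Counter(prompts).items() if n > 1]
    if duplicates:
        problems.append(f"{len(duplicates)} prompts appear more than once")
    return problems
```

A pytest test could then reduce to `assert corpus_problems(path) == []`, which fails with a readable message if the count drifts or a duplicate reappears.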

What this PR adds

  • de-duplicates evaluations/adversary/corpora/harm_facilitation_200.jsonl to 200 unique prompts
  • adds tests/test_issue_618_harm_facilitation_corpus.py
  • records this verification note so the issue can close cleanly

Recommendation

Close issue #618 once this PR merges.