timmy-config/docs/issue-616-verification.md
Alexander Whitestone 2ec8c556d7
test: verify identity attacks corpus already on main (#616)
Document that evaluations/adversary/corpora/identity_attacks_200.jsonl already satisfies #616,
add regression coverage for the corpus, and restore targeted adversary/scene validation helpers
needed to verify the corpus cleanly.

Closes #616
2026-04-22 10:46:25 -04:00

Issue #616 Verification

Status: already implemented on main

Issue #616 asked for an identity-attack adversary corpus with 200 jailbreak prompts. That corpus is already present on main at evaluations/adversary/corpora/identity_attacks_200.jsonl.

Evidence

  1. Corpus file exists on main

    • Path: evaluations/adversary/corpora/identity_attacks_200.jsonl
    • Entry count: 200
    • Unique IDs: 200
    • Unique prompts: 200
  2. Corpus structure is balanced and complete

    • Family: identity_attack
    • Subcategory distribution:
      • direct_claim: 40
      • emotional_probe: 40
      • existential_probe: 40
      • human_comparison: 40
      • roleplay_identity: 40
    • Every row includes id, family, subcategory, prompt, targets, scoring, and transcript_required
  3. Harness validation passes

    • Command:
      • python3 scripts/adversary_harness.py validate --corpus evaluations/adversary/corpora/identity_attacks_200.jsonl
    • Result:
      • Entries: 200
      • No issues found.
  4. Harness can execute the corpus

    • Command:
      • python3 scripts/adversary_harness.py run --corpus evaluations/adversary/corpora/identity_attacks_200.jsonl --backend mock --model test-model --max-prompts 5 --json
    • Result:
      • 5/5 prompts executed
      • 0 successful attacks against the mock refusal backend
      • 0 execution errors
  5. Historical trail

    • Closed unmerged PR: PR #794 (feat: identity attacks adversary corpus — 200 jailbreak prompts (#616))
    • The issue remained open even though the repo now contains the requested corpus on main
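
The counts in items 1 and 2 can be reproduced independently of the harness. The following is a minimal sketch, not the logic of scripts/adversary_harness.py: the function name summarize_corpus is hypothetical, and the required-field list is copied from the evidence above on the assumption that these are the only fields that matter.

```python
import json
from collections import Counter
from pathlib import Path

# Field names copied from the evidence list; the real harness may check more.
REQUIRED_FIELDS = (
    "id", "family", "subcategory", "prompt",
    "targets", "scoring", "transcript_required",
)


def summarize_corpus(path):
    """Summarize a JSONL corpus (hypothetical helper, not part of the harness)."""
    text = Path(path).read_text(encoding="utf-8")
    # One JSON object per non-empty line.
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    return {
        "entries": len(rows),
        "unique_ids": len({row.get("id") for row in rows}),
        "unique_prompts": len({row.get("prompt") for row in rows}),
        "subcategories": dict(Counter(row.get("subcategory") for row in rows)),
        # Rows that lack at least one required field.
        "rows_missing_fields": sum(
            1 for row in rows
            if any(field not in row for field in REQUIRED_FIELDS)
        ),
    }
```

If the evidence above holds, calling summarize_corpus("evaluations/adversary/corpora/identity_attacks_200.jsonl") from the repository root should report 200 entries, 200 unique IDs, 200 unique prompts, 40 rows in each of the five subcategories, and no rows with missing fields.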

Regression coverage added in this branch

  • tests/test_identity_attacks_corpus.py
    • verifies the corpus file exists
    • verifies it contains exactly 200 unique prompts
    • verifies the expected schema and balanced subcategory distribution
    • verifies this document exists and points back to the real artifact
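
The contents of tests/test_identity_attacks_corpus.py are not reproduced here. As a rough illustration of the balance and document-pointer checks listed above, a sketch might look like the following; the helper names is_balanced and doc_references_corpus are hypothetical and are not taken from the actual test file.

```python
from collections import Counter

# Subcategory names and corpus path as listed in the evidence section.
EXPECTED_SUBCATEGORIES = {
    "direct_claim", "emotional_probe", "existential_probe",
    "human_comparison", "roleplay_identity",
}
CORPUS_PATH = "evaluations/adversary/corpora/identity_attacks_200.jsonl"


def is_balanced(rows, per_subcategory=40):
    """True if exactly the expected subcategories appear, each with the same count."""
    counts = Counter(row["subcategory"] for row in rows)
    return (
        set(counts) == EXPECTED_SUBCATEGORIES
        and all(n == per_subcategory for n in counts.values())
    )


def doc_references_corpus(doc_text, corpus_path=CORPUS_PATH):
    """True if the verification document mentions the corpus path verbatim."""
    return corpus_path in doc_text
```

A test built on these helpers would load the corpus rows, assert is_balanced(rows), then read this document and assert doc_references_corpus(text), so the regression fails if either the corpus or the pointer to it drifts.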

Recommendation

Close issue #616 as already implemented.