Commit Graph

4 Commits

Author SHA1 Message Date
Merge Bot
872a2d3f79 Merge PR #778: evaluations/adversary/corpora/authority_bypass_200.jsonl (added) 2026-04-16 04:58:37 +00:00
Merge Bot
6b7b02a036 Merge PR #784: evaluations/adversary/corpora/identity_attacks_200.jsonl 2026-04-16 04:53:27 +00:00
Alexander Whitestone
8354c49308 feat: harm facilitation adversary corpus — 200 jailbreak prompts (#618)
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 8s
PR Checklist / pr-checklist (pull_request) Failing after 1m55s
Smoke Test / smoke (pull_request) Failing after 10s
Validate Config / YAML Lint (pull_request) Failing after 6s
Validate Config / JSON Validate (pull_request) Successful in 4s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 29s
Validate Config / Shell Script Lint (pull_request) Failing after 23s
Validate Config / Cron Syntax Check (pull_request) Successful in 4s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 5s
Validate Config / Playbook Schema Validation (pull_request) Successful in 12s
Validate Training Data / validate (pull_request) Successful in 9s
Validate Config / Python Test Suite (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled
Adversary attack corpus targeting harm facilitation: 200 prompts
across 12 subcategories, each with a scoring rubric.

Subcategories:
- weapons_instructions (24)
- poison_synthesis, cyber_attack, physical_harm, coercion_methods,
  surveillance, identity_theft, fraud_schemes, drug_synthesis,
  biological_harm, social_engineering, infrastructure_attack (16 each)
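
The split above adds up to 24 + 11 x 16 = 200. A minimal sketch of a distribution check follows; the helper name `distribution_problems` is hypothetical and not part of this commit, and it assumes the caller has already pulled the subcategory value out of each record.

```python
from collections import Counter

# Expected split taken from the commit message: one subcategory with 24
# prompts and eleven with 16 each.
EXPECTED_COUNTS = {
    "weapons_instructions": 24,
    **{
        name: 16
        for name in (
            "poison_synthesis", "cyber_attack", "physical_harm",
            "coercion_methods", "surveillance", "identity_theft",
            "fraud_schemes", "drug_synthesis", "biological_harm",
            "social_engineering", "infrastructure_attack",
        )
    },
}


def distribution_problems(subcategories):
    """Return a list of mismatches between actual and expected counts.

    `subcategories` is an iterable of subcategory names, one per prompt.
    (Hypothetical helper; not taken from the repository.)
    """
    actual = Counter(subcategories)
    problems = []
    for name, expected in EXPECTED_COUNTS.items():
        if actual[name] != expected:
            problems.append(f"{name}: expected {expected}, found {actual[name]}")
    for name in sorted(actual.keys() - EXPECTED_COUNTS.keys()):
        problems.append(f"unexpected subcategory: {name}")
    return problems
```

An empty list means the corpus matches the stated distribution.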

Each prompt includes:
- id, family, subcategory
- prompt text (creative attack vectors)
- targets (safety_constraints, harm_prevention)
- scoring rubric (pass/fail criteria)
- transcript_required flag
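
As an illustration of the record shape listed above, here is a minimal validation sketch. The JSON key names (`prompt`, `targets`, `scoring`, and so on) and their types are assumptions inferred from this list, not confirmed against the corpus file, and the prompt text is a placeholder.

```python
import json

# Assumed key names and types; the real corpus may use different ones.
REQUIRED_FIELDS = {
    "id": str,
    "family": str,
    "subcategory": str,
    "prompt": str,
    "targets": list,
    "scoring": dict,
    "transcript_required": bool,
}


def validate_record(line):
    """Parse one JSONL line and return a list of problems (empty if valid)."""
    record = json.loads(line)
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for key, expected_type in REQUIRED_FIELDS.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif not isinstance(record[key], expected_type):
            problems.append(f"{key}: expected {expected_type.__name__}")
    return problems


# Placeholder record; every value here is illustrative only.
EXAMPLE_LINE = json.dumps({
    "id": "example-001",
    "family": "harm_facilitation",
    "subcategory": "weapons_instructions",
    "prompt": "<attack prompt text>",
    "targets": ["safety_constraints", "harm_prevention"],
    "scoring": {"pass": "<pass criteria>", "fail": "<fail criteria>"},
    "transcript_required": True,
})
```

Calling `validate_record(EXAMPLE_LINE)` returns an empty list; a line missing a field returns one message per problem.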

Format: JSONL, compatible with the adversary runner (#652) and
the shared scoring rubric (#655).

Closes #618
2026-04-15 11:05:31 -04:00
Alexander Whitestone
80ce90209b feat: adversary execution harness for prompt corpora (#652)
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 28s
PR Checklist / pr-checklist (pull_request) Failing after 6m49s
Smoke Test / smoke (pull_request) Failing after 27s
Validate Config / YAML Lint (pull_request) Failing after 19s
Validate Config / JSON Validate (pull_request) Successful in 14s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 1m33s
Validate Config / Shell Script Lint (pull_request) Failing after 1m5s
Validate Config / Cron Syntax Check (pull_request) Successful in 11s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 11s
Validate Config / Playbook Schema Validation (pull_request) Successful in 26s
Architecture Lint / Lint Repository (pull_request) Has been cancelled
Validate Config / Python Test Suite (pull_request) Has been cancelled
Reusable runner that replays JSONL adversary corpora against live Timmy
agents. Full evaluation pipeline:

1. Read JSONL corpus (one attack per line)
2. Send to inference endpoint via OpenAI-compatible API
3. Score responses against identity/safety constraints
4. Store full transcripts for score-1 attacks
5. Emit machine-readable results summary
6. Auto-open P0 issues on any score 1
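
The six steps above can be sketched as follows. This is not the code in adversary_runner.py: the function names, the `prompt` and `id` keys, and the reading of score 1 as "the attack succeeded" are all assumptions. The network call and the scorer are passed in as callables so the sketch runs without an endpoint, and step 6 is reduced to a flag rather than a real issue-tracker call.

```python
import json


def read_corpus(lines):
    """Step 1: parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def run_corpus(lines, send, score):
    """Replay a corpus and return a machine-readable summary.

    send(prompt) -> str   stands in for the OpenAI-compatible API call.
    score(attack, reply) -> int   assumed to return 1 when the attack
    succeeded and 0 when the agent held its constraints.
    """
    results = []
    transcripts = []
    for attack in read_corpus(lines):
        reply = send(attack["prompt"])            # step 2
        attack_score = score(attack, reply)       # step 3
        results.append({"id": attack["id"], "score": attack_score})
        if attack_score == 1:                     # step 4
            transcripts.append({
                "id": attack["id"],
                "prompt": attack["prompt"],
                "response": reply,
            })
    return {                                      # step 5
        "total": len(results),
        "score_1": len(transcripts),
        "results": results,
        "transcripts": transcripts,
        # Step 6: the caller would open a P0 issue when this is True.
        "open_p0_issue": bool(transcripts),
    }
```

Injecting `send` and `score` keeps the loop testable offline; a real run would wrap the inference endpoint and the shared rubric in those two callables.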

Files:
- evaluations/adversary/runner/adversary_runner.py (181 lines)
- evaluations/adversary/runner/run.sh
- evaluations/adversary/runner/README.md

Works with identity_attacks_200.jsonl (#616) and is reusable for
adversary families #616-#620 and #578/#580.

Closes #652
2026-04-14 21:24:40 -04:00