Alexander Whitestone
237c3926a2
feat( #580 ): Crisis Protocol Stress Test — 200 Scenarios
...
Architecture Lint / Linter Tests (pull_request) Successful in 29s
Smoke Test / smoke (pull_request) Failing after 22s
Validate Config / YAML Lint (pull_request) Failing after 19s
Validate Config / JSON Validate (pull_request) Successful in 24s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 1m4s
Validate Config / Python Test Suite (pull_request) Has been skipped
Validate Config / Cron Syntax Check (pull_request) Successful in 14s
Validate Config / Shell Script Lint (pull_request) Failing after 1m6s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 13s
Validate Config / Playbook Schema Validation (pull_request) Successful in 25s
Architecture Lint / Lint Repository (pull_request) Failing after 15s
PR Checklist / pr-checklist (pull_request) Successful in 7m4s
- Add evaluations/adversary/corpora/crisis_protocol_stress_200.jsonl
with 200 adversarial crisis scenarios across 5 subcategories:
direct_suicidal (40), indirect_signals (40), emotional_manipulation (40),
harm_testing (40), edge_cases (40)
- Add scripts/generate_crisis_stress_corpus.py to generate/regenerate the corpus
- Add scripts/crisis_protocol_harness.py with crisis-specific scoring:
* safety_check, 988 inclusion, no_harm_suggested, no_life_value_computed
* empathy, presence, directness, gospel inclusion
* PASS / PARTIAL / FAIL outcomes with P0 auto-filing
- Add tests/test_crisis_protocol_harness.py with 12 unit tests for scoring
logic and corpus integrity
- Add evaluations/adversary/corpora/CRISIS_PROTOCOL_STRESS_200.md documentation
Acceptance criteria:
- 200 scenarios generated ✅
- Scoring validates 988 inclusion, no harm suggestion, safety check ✅
- Failures trigger P0 issue filing via harness ✅
- Response quality scored (empathy, presence, directness, gospel) ✅
2026-04-22 02:20:28 -04:00
7c03c666d8
Merge pull request 'feat: 500 dream description prompt enhancement pairs — scene/crisis/music data' (#821,#820,#819,#799) from fix/602 into main
...
Resolves add/add conflicts with already-merged files (authority_bypass_200.jsonl, identity_attacks_200.jsonl, quality_filter.py) by keeping main's versions.
Closes #602 , #645 , #689 , #599
2026-04-17 02:37:00 -04:00
Alexander Whitestone
4f960e0dd8
feat: identity attacks adversary corpus — 200 jailbreak prompts ( #616 )
Architecture Lint / Linter Tests (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled
PR Checklist / pr-checklist (pull_request) Has been cancelled
Smoke Test / smoke (pull_request) Has been cancelled
Validate Config / YAML Lint (pull_request) Has been cancelled
Validate Config / JSON Validate (pull_request) Has been cancelled
Validate Config / Python Syntax & Import Check (pull_request) Has been cancelled
Validate Config / Python Test Suite (pull_request) Has been cancelled
Validate Config / Shell Script Lint (pull_request) Has been cancelled
Validate Config / Cron Syntax Check (pull_request) Has been cancelled
Validate Config / Deploy Script Dry Run (pull_request) Has been cancelled
Validate Config / Playbook Schema Validation (pull_request) Has been cancelled
Validate Training Data / validate (pull_request) Has been cancelled
2026-04-16 01:57:24 -04:00
Alexander Whitestone
0ddbfb0cfa
feat: emotional manipulation adversary corpus — 200 jailbreak prompts ( #620 )
Architecture Lint / Linter Tests (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled
PR Checklist / pr-checklist (pull_request) Has been cancelled
Validate Config / YAML Lint (pull_request) Has been cancelled
Validate Config / JSON Validate (pull_request) Has been cancelled
Smoke Test / smoke (pull_request) Has been cancelled
Validate Training Data / validate (pull_request) Has been cancelled
Validate Config / Python Syntax & Import Check (pull_request) Has been cancelled
Validate Config / Python Test Suite (pull_request) Has been cancelled
Validate Config / Shell Script Lint (pull_request) Has been cancelled
Validate Config / Cron Syntax Check (pull_request) Has been cancelled
Validate Config / Deploy Script Dry Run (pull_request) Has been cancelled
Validate Config / Playbook Schema Validation (pull_request) Has been cancelled
2026-04-16 01:49:38 -04:00
Alexander Whitestone
b6c716998b
feat: authority bypass adversary corpus — 200 jailbreak prompts ( #619 )
...
PR Checklist / pr-checklist (pull_request) Has been cancelled
Validate Config / Playbook Schema Validation (pull_request) Has been cancelled
Architecture Lint / Linter Tests (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled
Validate Training Data / validate (pull_request) Has been cancelled
Smoke Test / smoke (pull_request) Has been cancelled
Validate Config / YAML Lint (pull_request) Has been cancelled
Validate Config / JSON Validate (pull_request) Has been cancelled
Validate Config / Python Syntax & Import Check (pull_request) Has been cancelled
Validate Config / Python Test Suite (pull_request) Has been cancelled
Validate Config / Shell Script Lint (pull_request) Has been cancelled
Validate Config / Cron Syntax Check (pull_request) Has been cancelled
Validate Config / Deploy Script Dry Run (pull_request) Has been cancelled
200 prompts targeting authority bypass attacks.
Attack worker 4/5.
Categories (33-34 each):
- developer_override: admin/debug mode claims
- role_play_bypass: DAN, unrestricted character modes
- sovereignty_override: ownership/creator claims
- prompt_injection: system prompt injection attempts
- emotional_manipulation_auth: guilt/loyalty pressure
- authority_spoofing: FBI/OpenAI/military impersonation
Scoring: pass=maintains SOUL.md, fail=complies with override.
Compatible with shared scoring rubric (#655 ).
Closes #619
2026-04-16 01:26:49 -04:00
Merge Bot
872a2d3f79
Merge PR #778 : evaluations/adversary/corpora/authority_bypass_200.jsonl (added)
2026-04-16 04:58:37 +00:00
Merge Bot
6b7b02a036
Merge PR #784 : evaluations/adversary/corpora/identity_attacks_200.jsonl
2026-04-16 04:53:27 +00:00
Alexander Whitestone
8354c49308
feat: harm facilitation adversary corpus — 200 jailbreak prompts ( #618 )
...
Architecture Lint / Linter Tests (pull_request) Successful in 8s
PR Checklist / pr-checklist (pull_request) Failing after 1m55s
Smoke Test / smoke (pull_request) Failing after 10s
Validate Config / YAML Lint (pull_request) Failing after 6s
Validate Config / JSON Validate (pull_request) Successful in 4s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 29s
Validate Config / Shell Script Lint (pull_request) Failing after 23s
Validate Config / Cron Syntax Check (pull_request) Successful in 4s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 5s
Validate Config / Playbook Schema Validation (pull_request) Successful in 12s
Validate Training Data / validate (pull_request) Successful in 9s
Validate Config / Python Test Suite (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled
Adversary attack corpus targeting harm facilitation. 200 prompts
across 12 subcategories, each with scoring rubric.
Subcategories:
- weapons_instructions (24)
- poison_synthesis, cyber_attack, physical_harm, coercion_methods,
surveillance, identity_theft, fraud_schemes, drug_synthesis,
biological_harm, social_engineering, infrastructure_attack (16 each)
Each prompt includes:
- id, family, subcategory
- prompt text (creative attack vectors)
- targets (safety_constraints, harm_prevention)
- scoring rubric (pass/fail criteria)
- transcript_required flag
Format: JSONL, compatible with adversary runner (#652 ) and
shared scoring rubric (#655 ).
Closes #618
2026-04-15 11:05:31 -04:00
Alexander Whitestone
80ce90209b
feat: adversary execution harness for prompt corpora ( #652 )
...
Architecture Lint / Linter Tests (pull_request) Successful in 28s
PR Checklist / pr-checklist (pull_request) Failing after 6m49s
Smoke Test / smoke (pull_request) Failing after 27s
Validate Config / YAML Lint (pull_request) Failing after 19s
Validate Config / JSON Validate (pull_request) Successful in 14s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 1m33s
Validate Config / Shell Script Lint (pull_request) Failing after 1m5s
Validate Config / Cron Syntax Check (pull_request) Successful in 11s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 11s
Validate Config / Playbook Schema Validation (pull_request) Successful in 26s
Architecture Lint / Lint Repository (pull_request) Has been cancelled
Validate Config / Python Test Suite (pull_request) Has been cancelled
Reusable runner that replays JSONL adversary corpora against live Timmy
agents. Full evaluation pipeline:
1. Read JSONL corpus (one attack per line)
2. Send to inference endpoint via OpenAI-compatible API
3. Score responses against identity/safety constraints
4. Store full transcripts for score-1 attacks
5. Emit machine-readable results summary
6. Auto-open P0 issues on any score 1
Files:
- evaluations/adversary/runner/adversary_runner.py (181 lines)
- evaluations/adversary/runner/run.sh
- evaluations/adversary/runner/README.md
Works with identity_attacks_200.jsonl (#616 ) and reusable for
adversary families #616-#620, #578/#580.
Closes #652
2026-04-14 21:24:40 -04:00