feat: add value-violations adversary suite (#617)
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 24s
PR Checklist / pr-checklist (pull_request) Failing after 3m14s
Smoke Test / smoke (pull_request) Failing after 20s
Validate Config / YAML Lint (pull_request) Failing after 16s
Validate Config / JSON Validate (pull_request) Successful in 13s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 43s
Validate Config / Shell Script Lint (pull_request) Failing after 37s
Validate Config / Cron Syntax Check (pull_request) Successful in 10s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 8s
Validate Config / Playbook Schema Validation (pull_request) Successful in 14s
Validate Config / Python Test Suite (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled

Alexander Whitestone
2026-04-14 18:50:21 -04:00
parent 04cceccd01
commit ce041c33e3
5 changed files with 1245 additions and 1 deletions

@@ -20,6 +20,7 @@ make train-local # LoRA on Apple Silicon (MLX) — free, ~30 min on M3 Max
make train-cloud # QLoRA on cloud GPU (Axolotl) — ~$1/run on A100
make eval # Standard benchmarks via lm-eval-harness against Ollama
make vibes # Hand-picked prompts → human review (the sacred test)
make adversary-value-violations # 200 jailbreak prompts targeting value violations
make ingest # Pull heartbeat trajectories into training data
make curated # Regenerate curated exemplar dataset
make convert # Convert merged data to MLX train/valid format
@@ -46,10 +47,12 @@ training/
├── eval-tasks.yaml ← Benchmark config
├── build_curated.py ← Exemplar data authoring (the soul conversations)
├── ingest_trajectories.py ← Quality filter for heartbeat cycle data
├── run_adversary_eval.py ← Adversarial prompt runner + transcript capture
└── data/
    ├── curated_dataset.jsonl ← 26 gold-standard conversations (proprietary)
    ├── preference_pairs.jsonl ← DPO preference pairs (proprietary)
    ├── prompts_vibes.yaml ← Custom eval prompts
    ├── prompts_adversary_value_violations.yaml ← 200 jailbreak prompts for value-violation red-team eval
    ├── prompts_nexus_vibes.yaml ← Nexus-specific eval prompts
    └── mlx_curated/ ← MLX-format train/valid splits
```
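The README describes `run_adversary_eval.py` only as an "adversarial prompt runner + transcript capture", and the script itself is not shown in this diff. As a rough illustration of that pattern, here is a minimal sketch. The function name `run_adversary_suite`, the `id`/`prompt` fields, and the JSONL record layout are all assumptions made for this example and may not match the actual script or YAML schema.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Mapping


def run_adversary_suite(
    prompts: Iterable[Mapping[str, str]],
    generate: Callable[[str], str],
    out_path: str,
) -> int:
    """Run each prompt through `generate` and write one JSONL transcript record per prompt.

    Hypothetical helper: the field names ("id", "prompt") and the record
    layout are assumptions, not the schema used by the real script.
    Returns the number of records written.
    """
    written = 0
    with Path(out_path).open("w", encoding="utf-8") as fh:
        for item in prompts:
            try:
                response, error = generate(item["prompt"]), None
            except Exception as exc:  # record the failure and keep going
                response, error = None, str(exc)
            record = {
                "id": item["id"],
                "prompt": item["prompt"],
                "response": response,
                "error": error,
            }
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Here `generate` is any callable that maps a prompt string to a model response, for example a thin wrapper around whichever local model endpoint the eval targets. Capturing errors per prompt, rather than aborting, keeps a single failed request from discarding the rest of a 200-prompt run.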