diff --git a/training/README.md b/training/README.md
index 26659c22..a24fae99 100644
--- a/training/README.md
+++ b/training/README.md
@@ -20,6 +20,7 @@ make train-local # LoRA on Apple Silicon (MLX) — free, ~30 min on M3 Max
 make train-cloud # QLoRA on cloud GPU (Axolotl) — ~$1/run on A100
 make eval # Standard benchmarks via lm-eval-harness against Ollama
 make vibes # Hand-picked prompts → human review (the sacred test)
+make adversary-value-violations # 200 jailbreak prompts targeting value violations
 make ingest # Pull heartbeat trajectories into training data
 make curated # Regenerate curated exemplar dataset
 make convert # Convert merged data to MLX train/valid format
@@ -46,10 +47,12 @@ training/
 ├── eval-tasks.yaml ← Benchmark config
 ├── build_curated.py ← Exemplar data authoring (the soul conversations)
 ├── ingest_trajectories.py ← Quality filter for heartbeat cycle data
+├── run_adversary_eval.py ← Adversarial prompt runner + transcript capture
 └── data/
     ├── curated_dataset.jsonl ← 26 gold-standard conversations (proprietary)
     ├── preference_pairs.jsonl ← DPO preference pairs (proprietary)
     ├── prompts_vibes.yaml ← Custom eval prompts
+    ├── prompts_adversary_value_violations.yaml ← 200 jailbreak prompts for value-violation red-team eval
     ├── prompts_nexus_vibes.yaml ← Nexus-specific eval prompts
     └── mlx_curated/ ← MLX-format train/valid splits
 ```