feat: add value-violations adversary suite (#617)
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 24s
PR Checklist / pr-checklist (pull_request) Failing after 3m14s
Smoke Test / smoke (pull_request) Failing after 20s
Validate Config / YAML Lint (pull_request) Failing after 16s
Validate Config / JSON Validate (pull_request) Successful in 13s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 43s
Validate Config / Shell Script Lint (pull_request) Failing after 37s
Validate Config / Cron Syntax Check (pull_request) Successful in 10s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 8s
Validate Config / Playbook Schema Validation (pull_request) Successful in 14s
Validate Config / Python Test Suite (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled
Some checks failed
Architecture Lint / Linter Tests (pull_request) Successful in 24s
PR Checklist / pr-checklist (pull_request) Failing after 3m14s
Smoke Test / smoke (pull_request) Failing after 20s
Validate Config / YAML Lint (pull_request) Failing after 16s
Validate Config / JSON Validate (pull_request) Successful in 13s
Validate Config / Python Syntax & Import Check (pull_request) Failing after 43s
Validate Config / Shell Script Lint (pull_request) Failing after 37s
Validate Config / Cron Syntax Check (pull_request) Successful in 10s
Validate Config / Deploy Script Dry Run (pull_request) Successful in 8s
Validate Config / Playbook Schema Validation (pull_request) Successful in 14s
Validate Config / Python Test Suite (pull_request) Has been cancelled
Architecture Lint / Lint Repository (pull_request) Has been cancelled
This commit is contained in:
@@ -20,6 +20,7 @@ make train-local # LoRA on Apple Silicon (MLX) — free, ~30 min on M3 Max
|
||||
make train-cloud # QLoRA on cloud GPU (Axolotl) — ~$1/run on A100
|
||||
make eval # Standard benchmarks via lm-eval-harness against Ollama
|
||||
make vibes # Hand-picked prompts → human review (the sacred test)
|
||||
make adversary-value-violations # 200 jailbreak prompts targeting value violations
|
||||
make ingest # Pull heartbeat trajectories into training data
|
||||
make curated # Regenerate curated exemplar dataset
|
||||
make convert # Convert merged data to MLX train/valid format
|
||||
@@ -46,10 +47,12 @@ training/
|
||||
├── eval-tasks.yaml ← Benchmark config
|
||||
├── build_curated.py ← Exemplar data authoring (the soul conversations)
|
||||
├── ingest_trajectories.py ← Quality filter for heartbeat cycle data
|
||||
├── run_adversary_eval.py ← Adversarial prompt runner + transcript capture
|
||||
└── data/
|
||||
├── curated_dataset.jsonl ← 26 gold-standard conversations (proprietary)
|
||||
├── preference_pairs.jsonl ← DPO preference pairs (proprietary)
|
||||
├── prompts_vibes.yaml ← Custom eval prompts
|
||||
├── prompts_adversary_value_violations.yaml ← 200 jailbreak prompts for value-violation red-team eval
|
||||
├── prompts_nexus_vibes.yaml ← Nexus-specific eval prompts
|
||||
└── mlx_curated/ ← MLX-format train/valid splits
|
||||
```
|
||||
|
||||
Reference in New Issue
Block a user