feat: migrate autolora pipeline into training/

Per the direction shift in the-nexus#542.

Replaces the autolora repo (1,500 lines of custom pipeline code)
with config files for existing tools:

- axolotl.yaml: replaces train_modal.py (239 lines)
- mlx-lora.yaml: replaces MLX training scripts
- eval-tasks.yaml: replaces run_eval.py (300 lines)
- Makefile: replaces run_vibes.py, compare.py, convert_to_mlx.py

Data migrated as-is:
- curated_dataset.jsonl (26 gold-standard conversations)
- preference_pairs.jsonl (DPO pairs)
- prompts_vibes.yaml, prompts_nexus_vibes.yaml
- v0-baseline eval results (historical record)

Thin glue kept:
- build_curated.py (data authoring, not infrastructure)
- ingest_trajectories.py (domain-specific quality filter)

Dependencies: pip install axolotl mlx-lm lm-evaluation-harness
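As a sketch, the post-migration flow reduces to tool invocations like the following. The axolotl entry point and the lm_eval flags are standard for those tools; the config paths and the two-step ordering are assumptions about how the Makefile wires them up, not taken verbatim from this commit. The script only prints the commands so it is safe to run anywhere:

```shell
#!/bin/sh
# Illustrative end-to-end flow after the migration. Config filenames and
# step ordering are assumptions; the Makefile is the source of truth.
set -eu

# 1. Fine-tune with axolotl using the checked-in config
TRAIN="accelerate launch -m axolotl.cli.train training/axolotl.yaml"

# 2. Benchmark with lm-evaluation-harness (mirrors the usage comment
#    in training/eval-tasks.yaml)
EVAL="lm_eval --model local-completions \
  --model_args model=timmy:v0.1-q4,base_url=http://localhost:11434/v1 \
  --tasks hellaswag,truthfulqa_mc2,arc_challenge,winogrande \
  --output_path training/evals_archive/"

# Print rather than execute, so this sketch has no side effects
echo "$TRAIN"
echo "$EVAL"
```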
Author: perplexity
Date: 2026-03-25 23:05:45 +00:00
Parent: ef4c94b9b5
Commit: 6507cffc15
24 changed files with 3941 additions and 0 deletions

training/eval-tasks.yaml

@@ -0,0 +1,26 @@
# Eval Config — lm-evaluation-harness
# Replaces: autolora/eval/run_eval.py (300 lines)
#
# Usage:
# lm_eval --model local-completions \
# --model_args model=timmy:v0.1-q4,base_url=http://localhost:11434/v1 \
#   --tasks hellaswag,truthfulqa_mc2,arc_challenge,winogrande \
# --output_path training/evals_archive/
#
# For custom Timmy-specific evals, use the vibes check (see Makefile).
# The vibes check is manual by design — you read the output and judge.

# Standard benchmarks to run against each model version
benchmarks:
- hellaswag # Common sense reasoning
- truthfulqa_mc2 # Honesty / factuality
- arc_challenge # Science reasoning
- winogrande # Coreference resolution
# Models to compare
models:
baseline: hermes3:latest
candidate: timmy:v0.1-q4
# Ollama endpoint
endpoint: http://localhost:11434/v1
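This config is consumed by glue rather than by lm-evaluation-harness directly. A minimal sketch of how such glue could turn it into an lm_eval command line (the helper name is hypothetical and not part of this commit; in practice the dict would come from `yaml.safe_load` on eval-tasks.yaml):

```python
# Hypothetical glue for training/eval-tasks.yaml: compose the lm_eval
# invocation for one of the configured models. Not part of this commit.

def build_lm_eval_cmd(cfg: dict, which: str = "candidate") -> list[str]:
    """Build an lm_eval argv list from the parsed eval config."""
    model = cfg["models"][which]
    return [
        "lm_eval",
        "--model", "local-completions",
        "--model_args", f"model={model},base_url={cfg['endpoint']}",
        "--tasks", ",".join(cfg["benchmarks"]),
        "--output_path", "training/evals_archive/",
    ]

# In practice: cfg = yaml.safe_load(open("training/eval-tasks.yaml"))
cfg = {
    "benchmarks": ["hellaswag", "truthfulqa_mc2", "arc_challenge", "winogrande"],
    "models": {"baseline": "hermes3:latest", "candidate": "timmy:v0.1-q4"},
    "endpoint": "http://localhost:11434/v1",
}

print(" ".join(build_lm_eval_cmd(cfg)))
```

Running it against the baseline instead is just `build_lm_eval_cmd(cfg, "baseline")`, which keeps the baseline/candidate comparison in one place.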