# Eval Config — lm-evaluation-harness
#
# Replaces: autolora/eval/run_eval.py (300 lines)
#
# Usage:
#   lm_eval --model local-completions \
#     --model_args model=timmy:v0.1-q4,base_url=http://localhost:11434/v1 \
#     --tasks hellaswag,truthfulqa_mc2,arc_challenge \
#     --output_path training/evals_archive/
#
# For custom Timmy-specific evals, use the vibes check (see Makefile).
# The vibes check is manual by design — you read the output and judge.

# Standard benchmarks to run against each model version
benchmarks:
  - hellaswag       # Common sense reasoning
  - truthfulqa_mc2  # Honesty / factuality
  - arc_challenge   # Science reasoning
  - winogrande      # Coreference resolution

# Models to compare
models:
  baseline: hermes3:latest
  candidate: timmy:v0.1-q4

# Ollama endpoint
endpoint: http://localhost:11434/v1