[PIPELINE] Reasoning-DPO — use Z3 as truth oracle for DPO pair generation #37
Source
"Hermes Agent: The Definitive Neuro-Symbolic Blueprint for Mac M3 Max", Sections 3.3 and 5.2.
What
Extend the existing DPO pipeline to generate "Reasoning-DPO" pairs using Z3 as an objective truth oracle instead of relying solely on human preference or two-pass critique.
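As a rough illustration of the idea, the sketch below labels candidate answers with a solver verdict and emits one chosen/rejected pair. It is not the pipeline's actual implementation: the function names (`z3_entails`, `build_pair`, `to_jsonl_line`), the candidate layout (`text` plus an SMT-LIB `claim`), and the `prompt`/`chosen`/`rejected` record shape are all assumptions made for the example. The Z3 call assumes the `z3-solver` Python package.

```python
import json
from typing import Callable, Optional


def z3_entails(premises_smt2: str, claim_smt2: str) -> Optional[bool]:
    """Ask Z3 whether the premises entail the claim.

    Entailment holds when (premises AND NOT claim) is unsatisfiable.
    Returns True/False, or None when Z3 answers "unknown".
    """
    import z3  # assumption: `pip install z3-solver`; imported lazily

    solver = z3.Solver()
    script = premises_smt2 + "\n(assert (not " + claim_smt2 + "))"
    solver.add(z3.parse_smt2_string(script))
    result = solver.check()
    if result == z3.unsat:
        return True
    if result == z3.sat:
        return False
    return None


def build_pair(
    prompt: str,
    candidates: list,
    is_valid: Callable[[str], Optional[bool]],
) -> Optional[dict]:
    """Build one DPO record from candidates shaped like
    {"text": ..., "claim": ...} (hypothetical layout).

    Candidates with an undecided verdict (None) are skipped rather than
    guessed. Returns None unless both a verified and a refuted answer exist.
    """
    chosen = rejected = None
    for cand in candidates:
        verdict = is_valid(cand["claim"])
        if verdict is True and chosen is None:
            chosen = cand["text"]
        elif verdict is False and rejected is None:
            rejected = cand["text"]
    if chosen is None or rejected is None:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def to_jsonl_line(pair: dict) -> str:
    """Serialize one pair as a single JSON line (assumed output format)."""
    return json.dumps(pair, ensure_ascii=False)
```

With `z3-solver` installed, `functools.partial(z3_entails, premises)` can be passed as `is_valid`, where `premises` holds the SMT-LIB declarations and assertions for the problem. Keeping the oracle as a plain callable means the pair builder can be tested without a solver, and an "unknown" verdict never ends up labeled as a rejected answer.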
How it works
Why this matters
Standard DPO relies on subjective human preference. Reasoning-DPO uses a solver-checked result as the oracle instead. The aim is to train logical consistency into the model's weights, so that the model learns to reason correctly rather than merely to sound correct.
Acceptance
~/.timmy/training-data/dpo-pairs/output format
Dependencies
Closing as duplicate during backlog burn-down. Canonical issue: #36.
Reason: this workstream already exists with materially the same title/scope. Keeping one canonical thread prevents agent churn and review waste.