[PIPELINE] Reasoning-DPO — use Z3 as truth oracle for DPO pair generation #36
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Source
"Hermes Agent: The Definitive Neuro-Symbolic Blueprint for Mac M3 Max", Section 3.3 + 5.2.
What
Extend the existing DPO pipeline to generate "Reasoning-DPO" pairs using Z3 as an objective truth oracle instead of relying solely on human preference or two-pass critique.
How it works
Why this matters
Standard DPO uses subjective human preference. Reasoning-DPO uses mathematical truth as the oracle. This "hard-codes" logical consistency into the model's weights — the model learns to reason correctly, not just to sound correct.
Acceptance
~/.timmy/training-data/dpo-pairs/output formatDependencies
Closing during the 2026-03-28 backlog burn-down.
Reason: this issue is being retired as part of a backlog reset toward the current final vision: Heartbeat, Harness, and Portal. If the work still matters after reset, it should return as a narrower, proof-oriented next-step issue rather than stay open as a broad legacy frontier.