[PIPELINE] Reasoning-DPO — use Z3 as truth oracle for DPO pair generation #36

Closed
opened 2026-03-27 22:48:32 +00:00 by perplexity · 1 comment
Member

Source

"Hermes Agent: The Definitive Neuro-Symbolic Blueprint for Mac M3 Max", Section 3.3 + 5.2.

What

Extend the existing DPO pipeline to generate "Reasoning-DPO" pairs using Z3 as an objective truth oracle instead of relying solely on human preference or two-pass critique.

How it works

  1. For a given reasoning prompt, the LLM generates multiple candidate reasoning paths
  2. Each path is formalized into Z3 SMT-LIB and verified
  3. Paths that Z3 verifies as correct → "Chosen"
  4. Paths that Z3 finds logically invalid (even if superficially plausible) → "Rejected"
  5. These pairs feed the existing DPO training pipeline

Why this matters

Standard DPO uses subjective human preference. Reasoning-DPO uses mathematical truth as the oracle. This "hard-codes" logical consistency into the model's weights — the model learns to reason correctly, not just to sound correct.

Acceptance

  • Script that generates reasoning prompts, collects multiple LLM outputs, verifies via Z3, and emits DPO pairs
  • Integrates with existing ~/.timmy/training-data/dpo-pairs/ output format
  • At least 50 reasoning-DPO pairs generated from a test corpus
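For the output-format requirement, one pair might be emitted as a JSON-lines record like the following. The field names here are a hypothetical shape for illustration; the actual schema used under `~/.timmy/training-data/dpo-pairs/` is assumed, not confirmed by this issue.

```python
import json

# Hypothetical record: one Z3-verified DPO pair, written as one JSON line.
pair = {
    "prompt": "If all A are B, and x is an A, is x a B?",
    "chosen": "Yes: from (forall a, A(a) -> B(a)) and A(x), B(x) follows.",
    "rejected": "Not necessarily; x could be an A without being a B.",
    "oracle": "z3",          # marks the pair as oracle-labeled, not human-labeled
    "verdict": "unsat",      # Z3 result on premises AND NOT(conclusion)
}
line = json.dumps(pair)
print(line)
```

Keeping an `oracle` field (or similar) would let these pairs coexist with human-preference pairs in the existing pipeline without ambiguity about their provenance.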

Dependencies

  • Requires Truth Engine v0 (Z3 tool integration) to be functional first.
  • Extends existing DPO pipeline from issue #3 / #13.
Timmy was assigned by Rockachopa 2026-03-28 03:52:19 +00:00
Owner

Closing during the 2026-03-28 backlog burn-down.

Reason: this issue is being retired as part of a backlog reset toward the current final vision: Heartbeat, Harness, and Portal. If the work still matters after reset, it should return as a narrower, proof-oriented next-step issue rather than stay open as a broad legacy frontier.

Timmy closed this issue 2026-03-28 04:53:02 +00:00
Reference: Timmy_Foundation/timmy-config#36