[PIPELINE] Reasoning-DPO — use Z3 as truth oracle for DPO pair generation #37

Closed
opened 2026-03-27 22:48:32 +00:00 by perplexity · 1 comment
Member

Source

"Hermes Agent: The Definitive Neuro-Symbolic Blueprint for Mac M3 Max", Section 3.3 + 5.2.

What

Extend the existing DPO pipeline to generate "Reasoning-DPO" pairs using Z3 as an objective truth oracle instead of relying solely on human preference or two-pass critique.

How it works

  1. For a given reasoning prompt, the LLM generates multiple candidate reasoning paths
  2. Each path is formalized into Z3 SMT-LIB and verified
  3. Paths that Z3 verifies as correct → "Chosen"
  4. Paths that Z3 finds logically invalid (even if superficially plausible) → "Rejected"
  5. These pairs feed the existing DPO training pipeline

Why this matters

Standard DPO uses subjective human preference. Reasoning-DPO uses mathematical truth as the oracle. This "hard-codes" logical consistency into the model's weights — the model learns to reason correctly, not just to sound correct.

Acceptance

  • Script that generates reasoning prompts, collects multiple LLM outputs, verifies via Z3, and emits DPO pairs
  • Integrates with existing ~/.timmy/training-data/dpo-pairs/ output format
  • At least 50 reasoning-DPO pairs generated from a test corpus
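A hedged sketch of the pair-emission step: the `prompt`/`chosen`/`rejected` field names below follow the convention used by common DPO trainers (e.g. TRL), but the actual on-disk schema under `~/.timmy/training-data/dpo-pairs/` is defined by the existing pipeline and may differ. The `oracle` field and the function name are hypothetical additions for illustration.

```python
# Hypothetical JSONL emitter for verified Reasoning-DPO pairs.
# Field names assume the common prompt/chosen/rejected DPO layout;
# the real ~/.timmy/training-data/dpo-pairs/ format may differ.
import json
import pathlib
import tempfile

def emit_pair(out_file: pathlib.Path, prompt: str, chosen: str, rejected: str) -> None:
    """Append one Z3-verified DPO pair as a JSON line."""
    record = {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "oracle": "z3",  # provenance tag -- an assumption, not the spec
    }
    with out_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Demo against a temp dir so nothing touches the real output path.
out = pathlib.Path(tempfile.mkdtemp()) / "reasoning-dpo.jsonl"
emit_pair(
    out,
    "If it rains, the ground is wet. It rains. What follows?",
    "The ground is wet (modus ponens).",
    "It must not have rained (invalid inference).",
)
first = json.loads(out.read_text(encoding="utf-8").splitlines()[0])
print(first["oracle"])
```

Appending one JSON object per line keeps the generation script restartable and lets the 50-pair acceptance check be a simple line count.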

Dependencies

  • Requires Truth Engine v0 (Z3 tool integration) to be functional first.
  • Extends existing DPO pipeline from issue #3 / #13.
Timmy was assigned by Rockachopa 2026-03-28 03:52:19 +00:00
Owner

Closing as duplicate during backlog burn-down. Canonical issue: #36.

Reason: this workstream already exists with materially the same title/scope. Keeping one canonical thread prevents agent churn and review waste.

Timmy closed this issue 2026-03-28 04:45:19 +00:00
Reference: Timmy_Foundation/timmy-config#37