[PIPELINE] Reasoning-DPO — use Z3 as truth oracle for DPO pair generation #37

Closed
opened 2026-03-27 22:48:32 +00:00 by perplexity · 1 comment
Member

Source

"Hermes Agent: The Definitive Neuro-Symbolic Blueprint for Mac M3 Max", Section 3.3 + 5.2.

What

Extend the existing DPO pipeline to generate "Reasoning-DPO" pairs using Z3 as an objective truth oracle instead of relying solely on human preference or two-pass critique.

How it works

  1. For a given reasoning prompt, the LLM generates multiple candidate reasoning paths
  2. Each path is formalized into Z3 SMT-LIB and verified
  3. Paths that Z3 verifies as correct → "Chosen"
  4. Paths that Z3 finds logically invalid (even if superficially plausible) → "Rejected"
  5. These pairs feed the existing DPO training pipeline

Why this matters

Standard DPO uses subjective human preference. Reasoning-DPO uses mathematical truth as the oracle. This "hard-codes" logical consistency into the model's weights — the model learns to reason correctly, not just to sound correct.

Acceptance

  • Script that generates reasoning prompts, collects multiple LLM outputs, verifies via Z3, and emits DPO pairs
  • Integrates with existing ~/.timmy/training-data/dpo-pairs/ output format
  • At least 50 reasoning-DPO pairs generated from a test corpus
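A hedged sketch of the pair-emission step: the `prompt`/`chosen`/`rejected` field names below follow the convention used by common DPO trainers (e.g. TRL), but the actual on-disk schema under `~/.timmy/training-data/dpo-pairs/` is defined by the existing pipeline and may differ. The `oracle` field and the function name are hypothetical additions for illustration.

```python
# Hypothetical JSONL emitter for verified Reasoning-DPO pairs.
# Field names assume the common prompt/chosen/rejected DPO layout;
# the real ~/.timmy/training-data/dpo-pairs/ format may differ.
import json
import pathlib
import tempfile

def emit_pair(out_file: pathlib.Path, prompt: str, chosen: str, rejected: str) -> None:
    """Append one Z3-verified DPO pair as a JSON line."""
    record = {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "oracle": "z3",  # provenance tag -- an assumption, not the spec
    }
    with out_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Demo against a temp dir so nothing touches the real output path.
out = pathlib.Path(tempfile.mkdtemp()) / "reasoning-dpo.jsonl"
emit_pair(
    out,
    "If it rains, the ground is wet. It rains. What follows?",
    "The ground is wet (modus ponens).",
    "It must not have rained (invalid inference).",
)
first = json.loads(out.read_text(encoding="utf-8").splitlines()[0])
print(first["oracle"])
```

Appending one JSON object per line keeps the generation script restartable and lets the 50-pair acceptance check be a simple line count.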

Dependencies

  • Requires Truth Engine v0 (Z3 tool integration) to be functional first.
  • Extends existing DPO pipeline from issue #3 / #13.
Timmy was assigned by Rockachopa 2026-03-28 03:52:19 +00:00
Owner

Closing as duplicate during backlog burn-down. Canonical issue: #36.

Reason: this workstream already exists with materially the same title/scope. Keeping one canonical thread prevents agent churn and review waste.

Timmy closed this issue 2026-03-28 04:45:19 +00:00
Reference: Timmy_Foundation/timmy-config#37