feat: session transcript → training pair harvester (#91) #178

Closed
Rockachopa wants to merge 0 commits from feat/91-session-pair-harvester into main
Owner

Closes #91

Changes

Added scripts/session_pair_harvester.py — scans Hermes session JSONL files for Q&A patterns and extracts terse→rich training pairs.

How It Works

  1. Scans session JSONL files for human→assistant message pairs
  2. Filters by response/prompt word ratio (default ≥1.5x)
  3. Skips tool results, system prompt leaks, code-heavy responses
  4. Deduplicates by content hash
  5. Outputs JSONL in timmy-config training pairs format
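
The ratio filter, deduplication, and output steps above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/session_pair_harvester.py: the function names, the `(role, text)` turn representation, and the choice of SHA-256 are assumptions, and the skip rules from step 3 are omitted.

```python
import hashlib
import json


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def harvest_pairs(turns, source, model, min_ratio=1.5, seen=None):
    """Yield terse->rich records from an ordered list of (role, text) turns.

    Hypothetical helper: the real script's internals may differ.
    """
    seen = set() if seen is None else seen
    # Walk adjacent turns and keep only human -> assistant pairs.
    for (role_a, prompt), (role_b, reply) in zip(turns, turns[1:]):
        if role_a != "human" or role_b != "assistant":
            continue
        prompt_words = word_count(prompt)
        # Step 2: response/prompt word ratio filter.
        if prompt_words == 0 or word_count(reply) / prompt_words < min_ratio:
            continue
        # Step 4: deduplicate by content hash (SHA-256 is an assumption).
        digest = hashlib.sha256(f"{prompt}\x00{reply}".encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Step 5: one record per pair, in the documented output shape.
        yield {"terse": prompt, "rich": reply, "source": source, "model": model}


turns = [
    ("human", "What is JSONL?"),
    ("assistant", "JSONL stores one JSON object per line, which makes files easy to stream."),
    ("human", "What is JSONL?"),  # exact repeat, dropped by the hash check
    ("assistant", "JSONL stores one JSON object per line, which makes files easy to stream."),
    ("human", "Thanks a lot for the help"),
    ("assistant", "Welcome."),  # reply shorter than prompt, dropped by the ratio filter
]
pairs = list(harvest_pairs(turns, source="session_id", model="example-model"))
for pair in pairs:
    print(json.dumps(pair, ensure_ascii=False))
```

Passing a shared `seen` set across calls would extend the deduplication across files.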

Output Format

{"terse": "user short prompt", "rich": "ai detailed response", "source": "session_id", "model": "..."}

Usage

# Scan a directory of sessions
python3 scripts/session_pair_harvester.py ~/.hermes/sessions/

# Single file
python3 scripts/session_pair_harvester.py session.jsonl --output pairs.jsonl

# Dry run (stats only)
python3 scripts/session_pair_harvester.py --dir ~/.hermes/sessions/ --dry-run

# Adjust thresholds
python3 scripts/session_pair_harvester.py --dir sessions/ --min-ratio 2.0 --min-words 50

Tests

python3 scripts/test_session_pair_harvester.py
Rockachopa added 2 commits 2026-04-15 03:39:12 +00:00
Timmy approved these changes 2026-04-15 04:13:22 +00:00
Dismissed
Timmy left a comment
Owner

Feature implementation reviewed - looks solid.

Scope: 2 file(s) changed (324+ / 0-)

Suggestions

  • Found 7 print/console.log statements - verify these are not leftover debugging.
Timmy approved these changes 2026-04-15 14:35:32 +00:00
Timmy left a comment
Owner

Well-designed session transcript harvester. The training pair extraction logic is thoughtful with good quality filters.

Strengths:

  • Multiple quality filters: min word ratio, min response length, code-block filter, tool-call artifact filter, system prompt leak detection
  • Content hashing for deduplication across files
  • Dry-run mode for stats without writing
  • Clean JSONL output format matching the training pairs spec

Issues:

  1. Hardcoded session format assumptions: The code expects conversations[].from to be "gpt"/"assistant" and "human". If the session format uses different keys (e.g., "role": "user"/"assistant"), all sessions are silently skipped. Add a warning when zero pairs are extracted from a non-empty session.

  2. No max response length filter: Very long responses (10K+ words) would create training pairs that exceed typical context windows. Consider an upper bound filter.

  3. Tool result filtering is fragile: prompt_text.startswith("{") and "output" in prompt_text[:100] would incorrectly filter legitimate JSON-related questions. Consider a more specific pattern.

  4. File scanning with rglob("*.jsonl") could pick up the harvester's own output file if it is written anywhere inside the scan directory, including subdirectories. The output path should be excluded from scanning.
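
Issues 1 and 3 could be addressed along these lines. This is a sketch under stated assumptions, not the PR's code: the alias table, the `value`/`content` text keys, the logger name, and all function names are hypothetical. The tool-result check swaps the prefix match for a full JSON parse, so a question that merely begins with a JSON snippet is no longer filtered.

```python
import json
import logging

logger = logging.getLogger("session_pair_harvester")  # hypothetical logger name

# Hypothetical alias table; extend it to match the real session files.
ROLE_ALIASES = {
    "human": "human",
    "user": "human",
    "gpt": "assistant",
    "assistant": "assistant",
}


def normalize_turn(message: dict):
    """Return (role, text), or None if the message shape is not recognized.

    Accepts either a "from" or a "role" key; the text keys "value" and
    "content" are assumptions about the session format.
    """
    raw_role = message.get("from", message.get("role"))
    text = message.get("value", message.get("content"))
    role = ROLE_ALIASES.get(raw_role)
    if role is None or not isinstance(text, str):
        return None
    return role, text


def looks_like_tool_result(prompt_text: str) -> bool:
    """True only if the whole prompt is a JSON object with a top-level "output" key."""
    try:
        parsed = json.loads(prompt_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(parsed, dict) and "output" in parsed


def pair_turns(messages: list, session_id: str) -> list:
    """Pair adjacent human -> assistant turns; warn if a non-empty session yields none."""
    turns = [t for t in map(normalize_turn, messages) if t is not None]
    pairs = [
        (prompt, reply)
        for (role_a, prompt), (role_b, reply) in zip(turns, turns[1:])
        if role_a == "human" and role_b == "assistant"
        and not looks_like_tool_result(prompt)
    ]
    if messages and not pairs:
        logger.warning(
            "session %s: %d messages but no pairs extracted; check the role keys",
            session_id,
            len(messages),
        )
    return pairs
```

With this shape, a session that uses `"role": "user"` is paired normally, and a session with unrecognized keys produces a warning instead of being skipped silently.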

Approving — the core harvester logic is solid and the filters demonstrate good understanding of training data quality.
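
For issues 2 and 4, a minimal sketch of an upper-bound length filter and an output-aware file scan might look like the following. The function names and both default bounds are illustrative assumptions, not values taken from the PR.

```python
from pathlib import Path
from typing import Iterator, Optional


def within_length_bounds(reply: str, min_words: int = 50, max_words: int = 4000) -> bool:
    """Keep replies whose word count lies in [min_words, max_words].

    Both defaults are placeholders; pick max_words to fit the target
    model's context window.
    """
    n = len(reply.split())
    return min_words <= n <= max_words


def iter_session_files(scan_dir: Path, output_path: Optional[Path] = None) -> Iterator[Path]:
    """Yield *.jsonl files under scan_dir, skipping the harvester's own output file."""
    excluded = output_path.resolve() if output_path is not None else None
    for path in sorted(scan_dir.rglob("*.jsonl")):
        # Compare resolved paths so relative and absolute spellings match.
        if excluded is not None and path.resolve() == excluded:
            continue
        yield path
```

Comparing resolved paths means the exclusion still holds when the output is given as a relative path, for example `--output sessions/out/pairs.jsonl` while scanning `sessions/`.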

Author
Owner

Closing as this PR cannot be merged (branch protection or conflicts). Please reopen if needed.

Rockachopa closed this pull request 2026-04-16 01:45:44 +00:00
Author
Owner

Closing: unmergeable due to conflicts or branch protection. Reopen if needed.

Rockachopa reopened this pull request 2026-04-16 02:04:03 +00:00
Rockachopa closed this pull request 2026-04-16 02:14:24 +00:00
Rockachopa reopened this pull request 2026-04-16 02:42:02 +00:00
Rockachopa closed this pull request 2026-04-16 02:52:40 +00:00

Pull request closed
