feat: session transcript → training pair harvester (#91) #178

Rockachopa · 2026-04-15T03:39:11Z

Rockachopa commented

2026-04-15 03:39:11 +00:00

Closes #91

Changes

Added scripts/session_pair_harvester.py — scans Hermes session JSONL files for Q&A patterns and extracts terse→rich training pairs.

How It Works

1. Scans session JSONL files for human→assistant message pairs
2. Filters by response/prompt word ratio (default ≥1.5x)
3. Skips tool results, system prompt leaks, code-heavy responses
4. Deduplicates by content hash
5. Outputs JSONL in timmy-config training pairs format
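
For reviewers who want the shape of the pipeline without opening the diff, the steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/session_pair_harvester.py`: the `value` key, the SHA-256 digest, and the `harvest_pairs` name are assumptions, and the skip filters from step 3 are omitted.

```python
import hashlib


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def harvest_pairs(messages, min_ratio=1.5, seen=None):
    """Return terse/rich pairs from a list of {"from": ..., "value": ...} dicts.

    ASSUMPTION: message text lives under "value"; the PR only confirms the
    "from" key and its "human" / "gpt" / "assistant" values.
    """
    seen = set() if seen is None else seen
    pairs = []
    # Walk adjacent messages and keep only human -> assistant turns.
    for prev, cur in zip(messages, messages[1:]):
        if prev.get("from") != "human":
            continue
        if cur.get("from") not in ("gpt", "assistant"):
            continue
        prompt = prev.get("value", "").strip()
        response = cur.get("value", "").strip()
        if not prompt or not response:
            continue
        # Ratio filter: the response must be at least min_ratio times longer.
        if word_count(response) < min_ratio * word_count(prompt):
            continue
        # Dedup by content hash (hypothetical choice of SHA-256).
        digest = hashlib.sha256(f"{prompt}\n{response}".encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        pairs.append({"terse": prompt, "rich": response})
    return pairs
```

Passing the same `seen` set across calls would extend deduplication across files, which is what the review describes.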

Output Format

{"terse": "user short prompt", "rich": "ai detailed response", "source": "session_id", "model": "..."}
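
As a rough illustration of how a consumer might write and read records of this shape, here is a small sketch. The four field names come from the format above; the helper names `write_pairs` and `read_pairs` are hypothetical and are not part of the PR.

```python
import io
import json

# Field names taken from the output format shown in this PR.
REQUIRED_KEYS = ("terse", "rich", "source", "model")


def write_pairs(pairs, handle) -> int:
    """Write one JSON object per line; return the number of records written."""
    count = 0
    for pair in pairs:
        missing = [key for key in REQUIRED_KEYS if key not in pair]
        if missing:
            raise ValueError(f"pair is missing keys: {missing}")
        handle.write(json.dumps(pair, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_pairs(handle):
    """Parse a JSONL stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    buffer = io.StringIO()
    write_pairs(
        [{"terse": "user short prompt", "rich": "ai detailed response",
          "source": "session_id", "model": "..."}],
        buffer,
    )
    buffer.seek(0)
    print(read_pairs(buffer))
```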

Usage

# Scan a directory of sessions
python3 scripts/session_pair_harvester.py ~/.hermes/sessions/

# Single file
python3 scripts/session_pair_harvester.py session.jsonl --output pairs.jsonl

# Dry run (stats only)
python3 scripts/session_pair_harvester.py --dir ~/.hermes/sessions/ --dry-run

# Adjust thresholds
python3 scripts/session_pair_harvester.py --dir sessions/ --min-ratio 2.0 --min-words 50
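
The examples above pass the input both positionally and via `--dir`. One way a parser could accept both forms is sketched below with `argparse`. This is an assumption about the CLI layout, not a copy of the script: only the flag names and the 1.5 ratio default are taken from this PR, and the `--min-words` default is not stated here, so the sketch leaves it unset.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Harvest terse/rich training pairs from session JSONL files."
    )
    # Optional positional input, so `harvester.py session.jsonl` works.
    parser.add_argument("path", nargs="?", help="session file or directory")
    # Explicit directory flag, as used in the dry-run example.
    parser.add_argument("--dir", help="directory of session files to scan")
    parser.add_argument("--output", help="where to write the JSONL pairs")
    parser.add_argument("--dry-run", action="store_true", help="print stats only")
    parser.add_argument("--min-ratio", type=float, default=1.5,
                        help="minimum response/prompt word ratio")
    # ASSUMPTION: the real default for --min-words is not given in this PR.
    parser.add_argument("--min-words", type=int, default=None,
                        help="minimum response length in words")
    return parser


def resolve_input(args: argparse.Namespace):
    """Prefer --dir when given, otherwise fall back to the positional path."""
    return args.dir or args.path


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(resolve_input(parsed), parsed.min_ratio, parsed.dry_run)
```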

Tests

python3 scripts/test_session_pair_harvester.py

Rockachopa added 2 commits 2026-04-15 03:39:12 +00:00

feat: session transcript → training pair harvester (#91) b5466dc938

test: add tests for session pair harvester (#91) b36f617d4a

Timmy approved these changes 2026-04-15 04:13:22 +00:00

Dismissed

Timmy left a comment

Feature implementation reviewed - looks solid.

Scope: 2 file(s) changed (324+ / 0-)

Suggestions

Found 7 print/console.log statements - verify these are not leftover debugging.

Timmy approved these changes 2026-04-15 14:35:32 +00:00

Timmy left a comment

Well-designed session transcript harvester. The training pair extraction logic is thoughtful with good quality filters.

Strengths:

- Multiple quality filters: min word ratio, min response length, code-block filter, tool-call artifact filter, system prompt leak detection
- Content hashing for deduplication across files
- Dry-run mode for stats without writing
- Clean JSONL output format matching the training pairs spec

Issues:

1. Hardcoded session format assumptions: The code expects conversations[].from to be "gpt"/"assistant" and "human". If the session format uses different keys (e.g., "role": "user"/"assistant"), all sessions are silently skipped. Add a warning when zero pairs are extracted from a non-empty session.
2. No max response length filter: Very long responses (10K+ words) would create training pairs that exceed typical context windows. Consider an upper bound filter.
3. Tool result filtering is fragile: prompt_text.startswith("{") and "output" in prompt_text[:100] would incorrectly filter legitimate JSON-related questions. Consider a more specific pattern.
4. Output files can be rescanned: rglob("*.jsonl") will pick up the output file if it is written anywhere under the scan directory. The output path should be excluded from scanning.
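
To make the four points above concrete, here is one possible shape for the fixes. It is a sketch only: every function name is hypothetical, the 4000-word cap is a placeholder rather than a recommended value, and treating a prompt as a tool result only when it parses as a JSON object with an "output" key is one option among several.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("session_pair_harvester")


def warn_if_no_pairs(session_id: str, message_count: int, pair_count: int) -> bool:
    """Issue 1: surface sessions that had messages but yielded no pairs."""
    if message_count > 0 and pair_count == 0:
        logger.warning(
            "session %s: %d messages but 0 pairs; unexpected format?",
            session_id, message_count,
        )
        return True
    return False


def within_max_words(response_text: str, max_words: int = 4000) -> bool:
    """Issue 2: upper bound on response length (placeholder default)."""
    return len(response_text.split()) <= max_words


def looks_like_tool_result(prompt_text: str) -> bool:
    """Issue 3: require the whole prompt to be a JSON object with "output".

    A question that merely quotes some JSON fails to parse as a single
    document and is therefore kept.
    """
    try:
        parsed = json.loads(prompt_text)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and "output" in parsed


def is_output_file(candidate, output_path) -> bool:
    """True when candidate and output_path resolve to the same file."""
    if output_path is None:
        return False
    return Path(candidate).resolve() == Path(output_path).resolve()


def iter_session_files(scan_dir, output_path=None):
    """Issue 4: recursive scan that skips the harvester's own output file."""
    for candidate in sorted(Path(scan_dir).rglob("*.jsonl")):
        if not is_output_file(candidate, output_path):
            yield candidate
```

Comparing resolved paths keeps the exclusion correct when the output is given as a relative path while the scan yields paths rooted at the scan directory.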

Approving — the core harvester logic is solid and the filters demonstrate good understanding of training data quality.

Rockachopa commented

2026-04-16 01:45:40 +00:00

Closing as this PR cannot be merged (branch protection or conflicts). Please reopen if needed.

Rockachopa closed this pull request

2026-04-16 01:45:44 +00:00

Rockachopa commented

2026-04-16 01:53:36 +00:00

Closing: unmergeable due to conflicts or branch protection. Reopen if needed.