feat: add harvester.py + session_reader.py — session knowledge extractor (closes #8) #20

Closed
Rockachopa wants to merge 2 commits from burn/8-harvester-py into main
Owner

Build harvester.py that combines session_reader + extraction prompt to extract durable knowledge from session transcripts.

What was built

scripts/session_reader.py — JSONL parser for Hermes session transcripts:

  • read_session() — load full session
  • extract_conversation() — filter to user/assistant turns
  • truncate_for_context() — keep first 50 + last 50 for long sessions
  • messages_to_text() — convert to plain text for LLM
  • Handles malformed JSON gracefully
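
For orientation, the four helpers listed above could look roughly like the sketch below. This is not the code from the PR. It assumes each JSONL line is an object with `role` and `content` fields and that "first 50 + last 50" counts messages; the `head` and `tail` parameters are hypothetical names added for illustration.

```python
import json
from pathlib import Path


def read_session(path):
    """Load every well-formed JSON object from a JSONL file, skipping bad lines."""
    messages = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # malformed line: skip it instead of aborting the read
            if isinstance(record, dict):
                messages.append(record)
    return messages


def extract_conversation(messages):
    """Keep only user and assistant turns (assumed 'role' field)."""
    return [m for m in messages if m.get("role") in ("user", "assistant")]


def truncate_for_context(messages, head=50, tail=50):
    """Keep the first `head` and last `tail` messages of a long session."""
    if len(messages) <= head + tail:
        return list(messages)
    return messages[:head] + messages[len(messages) - tail:]


def messages_to_text(messages):
    """Render messages as 'role: content' lines for an LLM prompt."""
    return "\n".join(f"{m['role']}: {m.get('content', '')}" for m in messages)
```

Skipping a malformed line rather than raising keeps one corrupt record from discarding the rest of the transcript, which is one plausible reading of "handles malformed JSON gracefully".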

scripts/harvester.py — Main knowledge extraction script:

  • Single session: python3 harvester.py --session session.jsonl --output knowledge/
  • Batch mode: python3 harvester.py --batch --since 2026-04-01 --limit 100
  • Dry run: --dry-run to preview without writing
  • Configurable via env vars or CLI flags (API base, key, model, confidence threshold)
  • Deduplication against existing knowledge store (fingerprint + word overlap)
  • Writes to knowledge/index.json + per-repo markdown files
  • Graceful failure handling (logs errors, never crashes)
  • Auto-discovers API keys from standard locations
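
The "fingerprint + word overlap" deduplication mentioned above is not spelled out in this description. One possible reading, sketched under assumptions: the fingerprint is a SHA-256 hash of normalized text, and word overlap is Jaccard similarity over word sets. The function names and the 0.8 threshold are hypothetical and may differ from what harvester.py actually does.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and reduce text to space-separated alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def fingerprint(text):
    """Hash of the normalized text; catches exact and near-exact repeats."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def word_overlap(a, b):
    """Jaccard similarity between the word sets of two strings."""
    words_a, words_b = set(normalize(a).split()), set(normalize(b).split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def is_duplicate(candidate, existing, threshold=0.8):
    """True if the candidate matches any stored item by hash or by overlap.

    The 0.8 threshold is an illustrative assumption, not a value from the PR.
    """
    candidate_fp = fingerprint(candidate)
    for item in existing:
        if fingerprint(item) == candidate_fp:
            return True
        if word_overlap(candidate, item) >= threshold:
            return True
    return False
```

The hash check is cheap and handles differences in case or punctuation; the overlap check catches lightly reworded facts that a hash alone would miss.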

Acceptance Criteria

  • [x] Processes one session (call to LLM API)
  • [x] Deduplicates against existing knowledge
  • [x] Writes to correct knowledge/ subdirectory
  • [x] Handles extraction failures gracefully (logs, doesn't crash)

Depends on

  • Issue #7 (extraction prompt at templates/harvest-prompt.md) — already merged

Closes #8

Rockachopa added 2 commits 2026-04-14 17:28:21 +00:00
Timmy requested changes 2026-04-14 18:12:32 +00:00
Timmy left a comment
Owner

Review: PR #20 — feat: add harvester.py + session_reader.py (closes #8)

Recommendation: Close this PR in favor of PR #26.

This PR contains the exact same harvester.py (447 lines) and session_reader.py (142 lines) as PR #26, but is missing the test file (test_harvester_pipeline.py) that PR #26 includes. The code is character-for-character identical between the two PRs.

PR #26 is strictly superior because:

  • Same harvester.py and session_reader.py code
  • Adds test_harvester_pipeline.py with 5 passing smoke tests
  • Better branch naming (fix/8-harvester vs burn/8-harvester-py)

Please close this PR and consolidate work into PR #26. All code review feedback has been left on that PR.

Author
Owner

Closing — duplicate of PR #26 which has the same harvester.py + session_reader.py plus test_harvester_pipeline.py. Per review on #26: "PR #20 should be closed in favor of this one."

Rockachopa closed this pull request 2026-04-14 18:56:27 +00:00

Pull request closed
