feat: Session transcript harvester — extract Q&A, decisions, patterns #277

Open
Rockachopa wants to merge 1 commits from step35/195-feat-session-transcript-harv into main
Owner

Implements issue #195 — rule-based knowledge extraction from Hermes session transcripts.

What

Adds scripts/transcript_harvester.py — a lightweight, LLM-free harvester that scans Hermes session JSONL files and extracts durable knowledge in 5 categories:

Category Description Example trigger
qa_pair User question → Assistant answer "How do I deploy?" / "Here's the process..."
decision Explicit choice statement "I'll use X", "we decided to Y"
pattern Procedural solution "Here's the deployment process:", "Steps to fix..."
preference Stated preference "I prefer JWT", "Alexander always does..."
error_fix Error followed by fix action "Permission denied" → "I'll generate new key"

How it works

  • Parses session JSONL via session_reader.read_session() — canonical format
  • Walks message sequence, applying regex pattern matchers per category
  • Links temporally close error/fix pairs (within 8-message window)
  • Writes structured JSON (knowledge/transcripts/transcript_knowledge.json) with:
    • session_id, type, timestamp, and category-specific fields
    • question/answer for qa_pair; decision, pattern, preference, or error/fix
  • Generates human-readable Markdown report (transcript_report.md) with counts and samples

Validation

Unit/Smoke (on test_sessions/):

  qa_pair: 9, decision: 2, pattern: 10, preference: 1, error_fix: 2

All 5 categories hit across 5 test files.

Batch run (50 most recent ~/.hermes/sessions):

  1034 total entries: qa=39, decision=11, pattern=252, preference=22, error_fix=710

Real sessions demonstrate extraction at scale — notably rich error/fix knowledge (710) and patterns (252).

Deliverables

  • Harvester script: scripts/transcript_harvester.py
  • Executed against 50 most recent sessions (results included)
  • Output: structured JSON in knowledge/transcripts/
  • Report: counts per category + sample entries

Closes #195

Implements **issue #195** — rule-based knowledge extraction from Hermes session transcripts. ## What Adds `scripts/transcript_harvester.py` — a lightweight, LLM-free harvester that scans Hermes session JSONL files and extracts durable knowledge in 5 categories: | Category | Description | Example trigger | |----------|-------------|-----------------| | `qa_pair` | User question → Assistant answer | "How do I deploy?" / "Here's the process..." | | `decision` | Explicit choice statement | "I'll use X", "we decided to Y" | | `pattern` | Procedural solution | "Here's the deployment process:", "Steps to fix..." | | `preference` | Stated preference | "I prefer JWT", "Alexander always does..." | | `error_fix` | Error followed by fix action | "Permission denied" → "I'll generate new key" | ## How it works - Parses session JSONL via `session_reader.read_session()` — canonical format - Walks message sequence, applying regex pattern matchers per category - Links temporally close error/fix pairs (within 8-message window) - Writes structured JSON (`knowledge/transcripts/transcript_knowledge.json`) with: - `session_id`, `type`, `timestamp`, and category-specific fields - `question`/`answer` for qa_pair; `decision`, `pattern`, `preference`, or `error`/`fix` - Generates human-readable Markdown report (`transcript_report.md`) with counts and samples ## Validation **Unit/Smoke** (on `test_sessions/`): ``` qa_pair: 9, decision: 2, pattern: 10, preference: 1, error_fix: 2 ``` All 5 categories hit across 5 test files. **Batch run** (50 most recent `~/.hermes/sessions`): ``` 1034 total entries: qa=39, decision=11, pattern=252, preference=22, error_fix=710 ``` Real sessions demonstrate extraction at scale — notably rich error/fix knowledge (710) and patterns (252). ## Deliverables - [x] Harvester script: `scripts/transcript_harvester.py` - [x] Executed against 50 most recent sessions (results included) - [x] Output: structured JSON in `knowledge/transcripts/` - [x] Report: counts per category + sample entries Closes #195
Rockachopa added 1 commit 2026-04-26 19:10:13 +00:00
feat: add transcript_harvester — rule-based knowledge extraction from sessions
Some checks failed
Test / pytest (pull_request) Failing after 12s
7bcec41d16
Implements issue #195 — harvest Q&A pairs, decisions, patterns, preferences,
and error-fix links from Hermes session JSONL transcripts without LLM.

- scripts/transcript_harvester.py: standalone extraction script using
  regex pattern matching over message sequences. Handles 5 categories:
  * qa_pair — user questions ending in ? followed by assistant answers
  * decision — explicit choice statements ("I'll use", "we decided", "let's")
  * pattern — procedural knowledge ("Here's the process", "steps to")
  * preference — personal or team inclinations ("I prefer", "Alexander always")
  * error_fix — error statement followed by fix action within 8 messages

- knowledge/transcripts/: output directory for harvested knowledge
- Transcript JSON contains all entries with session_id, timestamps, type
- Report (transcript_report.md) gives category counts and sample entries

Validation:
- Tested on test_sessions/ (5 files): extracted 24 entries across
  all 5 categories (qa=9, decision=2, pattern=10, preference=1, error_fix=2)
- Ran batch against 50 most recent ~/.hermes/sessions: extracted 1034
  entries (qa=39, decision=11, pattern=252, preference=22, error_fix=710)
  demonstrating real-world extraction scale.

Closes #195
Owner

🛡️ Goblin Patrol Alert 🛡️

Hey brother — this PR has been idle for 5 days and is unassigned.

The goblin fleet has been notified. A goblin may claim this if it remains stale.

— Timmy Goblin Wizard King

🛡️ **Goblin Patrol Alert** 🛡️ Hey brother — this PR has been idle for **5 days** and is unassigned. The goblin fleet has been notified. A goblin may claim this if it remains stale. — Timmy Goblin Wizard King
Some checks failed
Test / pytest (pull_request) Failing after 12s
This pull request can be merged automatically.
This branch is out-of-date with the base branch
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin step35/195-feat-session-transcript-harv:step35/195-feat-session-transcript-harv
git checkout step35/195-feat-session-transcript-harv
Sign in to join this conversation.