feat: Session transcript harvester — extract Q&A, decisions, patterns #277

Rockachopa · 2026-04-26T19:10:12Z

Rockachopa commented

2026-04-26 19:10:12 +00:00

Implements issue #195 — rule-based knowledge extraction from Hermes session transcripts.

What

Adds scripts/transcript_harvester.py — a lightweight, LLM-free harvester that scans Hermes session JSONL files and extracts durable knowledge in 5 categories:

| Category | Description | Example trigger |
|----------|-------------|-----------------|
| `qa_pair` | User question → Assistant answer | "How do I deploy?" / "Here's the process..." |
| `decision` | Explicit choice statement | "I'll use X", "we decided to Y" |
| `pattern` | Procedural solution | "Here's the deployment process:", "Steps to fix..." |
| `preference` | Stated preference | "I prefer JWT", "Alexander always does..." |
| `error_fix` | Error followed by fix action | "Permission denied" → "I'll generate new key" |
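
As a rough illustration of how trigger phrases like those in the table could map to categories, here is a minimal sketch. The regexes and the `classify` helper are hypothetical and are not taken from `scripts/transcript_harvester.py`; `qa_pair` and `error_fix` are left out because they depend on more than one message.

```python
import re

# Hypothetical trigger patterns, loosely based on the example triggers in the
# table. The real patterns in scripts/transcript_harvester.py may differ.
TRIGGERS = {
    "decision": re.compile(r"\b(I'll use|we decided to|let's)\b", re.IGNORECASE),
    "pattern": re.compile(r"\b(here's the .*process|steps to)\b", re.IGNORECASE),
    "preference": re.compile(r"\b(I prefer|always does)\b", re.IGNORECASE),
}


def classify(text: str) -> list[str]:
    """Return every single-message category whose trigger matches the text."""
    return [name for name, rx in TRIGGERS.items() if rx.search(text)]
```

A message can match more than one category, so the helper returns a list instead of a single label.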

How it works

- Parses session JSONL via `session_reader.read_session()` (canonical format)
- Walks the message sequence, applying regex pattern matchers per category
- Links temporally close error/fix pairs (within an 8-message window)
- Writes structured JSON (`knowledge/transcripts/transcript_knowledge.json`) with:
  - `session_id`, `type`, `timestamp`, and category-specific fields
  - `question`/`answer` for qa_pair; `decision`, `pattern`, `preference`, or `error`/`fix`
- Generates a human-readable Markdown report (`transcript_report.md`) with counts and samples
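
The error/fix linking step could look roughly like the sketch below. It assumes each message is a dict with a `content` string and an optional `timestamp`, and it reads "8-message window" as "the next 8 messages after the error"; neither the message shape returned by `session_reader.read_session()` nor the exact window semantics is shown in this PR. `ERROR_RX`, `FIX_RX`, and `link_error_fixes` are hypothetical names.

```python
import re

WINDOW = 8  # the PR links an error to a fix within an 8-message window

# Hypothetical matchers; the real regexes are not shown in this PR.
ERROR_RX = re.compile(r"permission denied|error|traceback", re.IGNORECASE)
FIX_RX = re.compile(r"\bI'll (generate|fix|retry|update)\b", re.IGNORECASE)


def link_error_fixes(messages: list[dict], session_id: str) -> list[dict]:
    """Pair each error message with the first fix in the next WINDOW messages."""
    entries = []
    for i, msg in enumerate(messages):
        if not ERROR_RX.search(msg["content"]):
            continue
        # Only look at the WINDOW messages that follow the error.
        for later in messages[i + 1 : i + 1 + WINDOW]:
            if FIX_RX.search(later["content"]):
                entries.append({
                    "session_id": session_id,
                    "type": "error_fix",
                    "timestamp": msg.get("timestamp"),
                    "error": msg["content"],
                    "fix": later["content"],
                })
                break  # at most one fix per error
    return entries
```

Each returned entry carries the shared fields listed above (`session_id`, `type`, `timestamp`) plus the category-specific `error` and `fix` fields.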

Validation

Unit/Smoke (on test_sessions/):

  qa_pair: 9, decision: 2, pattern: 10, preference: 1, error_fix: 2

All 5 categories hit across 5 test files.

Batch run (50 most recent ~/.hermes/sessions):

  1034 total entries: qa=39, decision=11, pattern=252, preference=22, error_fix=710

On real sessions the harvester extracted 1,034 entries, dominated by error/fix links (710, about 69% of the total) and patterns (252, about 24%).
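
A per-category tally like the ones reported here could be reproduced from the JSON output with something like the following sketch. It assumes the output file holds a top-level JSON list of entries, each with a `type` field; the PR does not show the file's top-level layout, and `load_entries` and `count_by_type` are hypothetical helpers.

```python
import json
from collections import Counter


def load_entries(path: str) -> list[dict]:
    """Load entries, assuming a top-level JSON list (an assumption)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def count_by_type(entries: list[dict]) -> Counter:
    """Tally harvested entries by their `type` field."""
    return Counter(entry["type"] for entry in entries)


# Sanity check on the reported batch figures: the categories sum to the total.
reported = {"qa_pair": 39, "decision": 11, "pattern": 252,
            "preference": 22, "error_fix": 710}
assert sum(reported.values()) == 1034
```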

Deliverables

- [x] Harvester script: `scripts/transcript_harvester.py`
- [x] Executed against 50 most recent sessions (results included)
- [x] Output: structured JSON in `knowledge/transcripts/`
- [x] Report: counts per category + sample entries

Closes #195


Rockachopa added 1 commit 2026-04-26 19:10:13 +00:00

feat: add transcript_harvester — rule-based knowledge extraction from sessions

Test / pytest (pull_request) Failing after 12s


7bcec41d16

Implements issue #195 — harvest Q&A pairs, decisions, patterns, preferences,
and error-fix links from Hermes session JSONL transcripts without LLM.

- scripts/transcript_harvester.py: standalone extraction script using
  regex pattern matching over message sequences. Handles 5 categories:
  * qa_pair — user questions ending in ? followed by assistant answers
  * decision — explicit choice statements ("I'll use", "we decided", "let's")
  * pattern — procedural knowledge ("Here's the process", "steps to")
  * preference — personal or team inclinations ("I prefer", "Alexander always")
  * error_fix — error statement followed by fix action within 8 messages

- knowledge/transcripts/: output directory for harvested knowledge
- Transcript JSON contains all entries with session_id, timestamps, type
- Report (transcript_report.md) gives category counts and sample entries

Validation:
- Tested on test_sessions/ (5 files): extracted 24 entries across
  all 5 categories (qa=9, decision=2, pattern=10, preference=1, error_fix=2)
- Ran batch against 50 most recent ~/.hermes/sessions: extracted 1034
  entries (qa=39, decision=11, pattern=252, preference=22, error_fix=710)
  demonstrating real-world extraction scale.

Closes #195

Timmy commented

2026-05-02 18:34:37 +00:00

🛡️ Goblin Patrol Alert 🛡️

Hey brother — this PR has been idle for 5 days and is unassigned.

The goblin fleet has been notified. A goblin may claim this if it remains stale.

— Timmy Goblin Wizard King



This pull request can be merged automatically.

This branch is out-of-date with the base branch


Checkout

From your project repository, check out a new branch and test the changes.

git fetch -u origin step35/195-feat-session-transcript-harv:step35/195-feat-session-transcript-harv

git checkout step35/195-feat-session-transcript-harv


Reference: Timmy_Foundation/compounding-intelligence#277