Alexander Whitestone
|
da073ad7cf
|
feat: add harvester.py — session knowledge extractor (#8)
Main harvester module that chains:
session_reader → extraction prompt → LLM → validate → deduplicate → store
Includes:
- scripts/harvester.py — main module (reader + prompt + storage pipeline)
- scripts/session_reader.py — JSONL transcript parser
- scripts/test_harvester_pipeline.py — smoke tests (all passing)
Pipeline:
1. Read session JSONL via session_reader
2. Truncate long sessions (keep the first 50 and last 50 messages)
3. Send transcript + extraction prompt to LLM (mimo-v2-pro)
4. Parse structured JSON response (facts/pitfalls/patterns/quirks/questions)
5. Validate fields and apply the confidence threshold
6. Deduplicate against knowledge/index.json (fingerprint + word overlap)
7. Write to knowledge store (index.json + per-repo markdown)
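A minimal sketch of steps 2 and 6, for illustration only and not the code in scripts/harvester.py. The function names, the text normalization, the SHA-256 fingerprint, and the 0.8 Jaccard overlap threshold are all assumptions. Only the 50 + 50 message window and the "fingerprint + word overlap" idea come from the description above.

```python
import hashlib
import re

HEAD = 50  # messages kept from the start of a session (step 2)
TAIL = 50  # messages kept from the end of a session (step 2)


def truncate_messages(messages, head=HEAD, tail=TAIL):
    """Keep the first `head` and last `tail` messages of a long session."""
    if len(messages) <= head + tail:
        return list(messages)
    # Slice from an explicit index so tail=0 yields an empty tail.
    return list(messages[:head]) + list(messages[len(messages) - tail:])


def _words(text):
    """Lowercased alphanumeric tokens (hypothetical normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fingerprint(text):
    """Hash of the normalized text; the real fingerprint may differ."""
    return hashlib.sha256(" ".join(_words(text)).encode("utf-8")).hexdigest()


def is_duplicate(candidate, existing, overlap_threshold=0.8):
    """True if `candidate` matches an existing entry exactly (by fingerprint)
    or closely (Jaccard word overlap at or above the assumed threshold)."""
    cand_fp = fingerprint(candidate)
    cand_words = set(_words(candidate))
    for entry in existing:
        if fingerprint(entry) == cand_fp:
            return True
        entry_words = set(_words(entry))
        union = cand_words | entry_words
        if union and len(cand_words & entry_words) / len(union) >= overlap_threshold:
            return True
    return False
```

In this sketch `existing` is a plain list of strings. In the real pipeline the entries would presumably be loaded from knowledge/index.json, whose schema is not shown here.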
CLI:
Single: python3 harvester.py --session <path> --output knowledge/
Batch: python3 harvester.py --batch --since 2026-04-01 --limit 100
Dry-run: python3 harvester.py --session <path> --dry-run
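The flags above could be wired up with argparse roughly as follows. This is a sketch under assumptions: the flag names come from the usage lines, while the help strings, the default output directory, and treating --session and --batch as mutually exclusive are guesses rather than confirmed behavior of harvester.py.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        prog="harvester.py",
        description="Extract knowledge from session transcripts.",
    )
    # Assumption: exactly one of single-session or batch mode is selected.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--session", metavar="PATH",
                      help="path to one session JSONL file")
    mode.add_argument("--batch", action="store_true",
                      help="process many sessions")
    parser.add_argument("--output", default="knowledge/",
                        help="knowledge store directory (assumed default)")
    parser.add_argument("--since", metavar="YYYY-MM-DD",
                        help="batch mode: only sessions on or after this date")
    parser.add_argument("--limit", type=int,
                        help="batch mode: maximum number of sessions")
    parser.add_argument("--dry-run", action="store_true",
                        help="run extraction without writing to the store")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Parsing the batch example from the usage lines, `build_parser().parse_args(["--batch", "--since", "2026-04-01", "--limit", "100"])`, yields `batch=True`, `since="2026-04-01"`, and `limit=100`.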
|
2026-04-14 14:03:30 -04:00 |
|