feat: add harvester.py — session knowledge extractor (#8) #26
Closed
Rockachopa
wants to merge 0 commits from
fix/8-harvester into main
pull from: fix/8-harvester
merge into: Timmy_Foundation:main
Timmy_Foundation:main
Timmy_Foundation:step35/150-8-7-graph-query-engine
Timmy_Foundation:step35/230-atlas-memory-eval-run-a-live
Timmy_Foundation:step35/89-3-10-test-generation-orchest
Timmy_Foundation:step35/87-3-8-regression-test-generato
Timmy_Foundation:step35/231-atlas-wiki-build-the-llm-wik
Timmy_Foundation:step35/108-5-2-vulnerability-scanner
Timmy_Foundation:step35/233-atlas-connectors-sovereign-p
Timmy_Foundation:step35/195-feat-session-transcript-harv
Timmy_Foundation:step35/199-feat-training-data-pipeline
Timmy_Foundation:step35/232-atlas-research-solve-the-swa
Timmy_Foundation:step35/127-6-9-review-quality-scorer
Timmy_Foundation:step35/99-4-4-architecture-doc-generat
Timmy_Foundation:step35/172-10-7-knowledge-gap-identifier
Timmy_Foundation:step35/162-9-8-code-duplication-detecto
Timmy_Foundation:step35/121-6-3-logic-reviewer
Timmy_Foundation:step35/104-4-9-doc-freshness-checker
Timmy_Foundation:step35/157-9-3-type-checker
Timmy_Foundation:step35/171-10-6-performance-bottleneck
Timmy_Foundation:step35/161-9-7-dependency-freshness
Timmy_Foundation:step35/140-7-8-citation-tracker
Timmy_Foundation:step35/132-feat-codebase-genome-diff-de
Timmy_Foundation:step35/135-feat-pr-complexity-scorer-es
Timmy_Foundation:step35/124-6-6-test-coverage-checker
Timmy_Foundation:step35/113-5-7-security-patch-applier
Timmy_Foundation:step35/109-5-3-update-checker
Timmy_Foundation:step35/170-10-5-automation-opportunity
Timmy_Foundation:step35/148-8-5-session-knowledge-extrac
Timmy_Foundation:step35/147-8-4-cross-repo-connector
Timmy_Foundation:step35/126-review-comment-generator
Timmy_Foundation:step35/134-gh-trending
Timmy_Foundation:step35/138-7-6-conference-talk-summariz
Timmy_Foundation:step35/96-4-1-docstring-generator
Timmy_Foundation:step35/98-4-3-api-doc-generator
Timmy_Foundation:step35/205-feat-zero-shot-knowledge-syn
Timmy_Foundation:step35/173-10-8-progress-tracker
Timmy_Foundation:step35/137-7-5-release-note-analyzer
Timmy_Foundation:step35/107-5-1-dependency-inventory
Timmy_Foundation:step35/111-5-5-transitive-dependency-an
Timmy_Foundation:step35/90-feat-gitea-issue-body-parser
Timmy_Foundation:step35/158-9-4-security-linter
Timmy_Foundation:step35/155-9-1-linter-runner
Timmy_Foundation:step35/133-feat-import-graph-visualizat
Timmy_Foundation:step35/93-feat-cross-repo-dependency-g
Timmy_Foundation:step35/112-5-6-dependency-bloat-detecto
Timmy_Foundation:step35/97-4-2-readme-generator
Timmy_Foundation:step35/91-feat-session-transcript-trai
Timmy_Foundation:step35/144-8-1-entity-extractor
Timmy_Foundation:step35/151-8-8-graph-visualizer
Timmy_Foundation:step35/88-3-9-test-documentation-gener
Timmy_Foundation:step35/197-feat-provenance-chain-source
Timmy_Foundation:step35/103-4-8-doc-link-validator
Timmy_Foundation:burn/196-1776306000
Timmy_Foundation:feat/200-knowledge-freshness-cron
Timmy_Foundation:fix/syntax-bottleneck-211
Timmy_Foundation:fix/212-dependency-graph-dot-quoting
Timmy_Foundation:fix/211-syntax-errors
Timmy_Foundation:fix/210-refactoring-opportunity-api
Timmy_Foundation:fix/210-refactoring-opportunity-finder
Timmy_Foundation:burn/210-1776305000
Timmy_Foundation:burn/211-1776305100
Timmy_Foundation:fix/211-syntax-error
Timmy_Foundation:fix/212-dot-quoting
Timmy_Foundation:fix/perf-bottleneck-syntax-211
Timmy_Foundation:fix/211-perf-bottleneck-syntax
Timmy_Foundation:burn/212-fix-dot-quoting
Timmy_Foundation:fix/211
Timmy_Foundation:fix/212-dependency-graph-quoting
Timmy_Foundation:fix/676
Timmy_Foundation:fix/198-quality-gate
Timmy_Foundation:fix/201-pytest-warnings
Timmy_Foundation:burn/210-1776852000
Timmy_Foundation:fix/676-genome-ci
Timmy_Foundation:fix/190
Timmy_Foundation:burn/170-1776263897
Timmy_Foundation:burn/169-1776263898
Timmy_Foundation:burn/174-1776263883
Timmy_Foundation:burn/171-1776263896
Timmy_Foundation:burn/168-1776263899
Timmy_Foundation:burn/172-1776263893
Timmy_Foundation:burn/175-1776263877
Timmy_Foundation:feat/179-staleness-check
Timmy_Foundation:feat/176-diff-analyzer
Timmy_Foundation:feat/177-issue-parser
Timmy_Foundation:feat/94-dead-code-detector
Timmy_Foundation:burn/172-1776218600
Timmy_Foundation:feat/93-dependency-graph
Timmy_Foundation:feat/92-knowledge-staleness-detector
Timmy_Foundation:feat/91-session-pair-harvester
Timmy_Foundation:feat/90-issue-body-parser
Timmy_Foundation:burn/110-license-checker
Timmy_Foundation:burn/118-1776218500
Timmy_Foundation:burn/17-session-sampler
Timmy_Foundation:fix/7-extraction-prompt
Timmy_Foundation:docs/genome-676
Timmy_Foundation:feat/session-metadata
Timmy_Foundation:fix/10-knowledge-format
Timmy_Foundation:fix/14-measurer
Timmy_Foundation:fix/9-auto-harvest-cron
Timmy_Foundation:fix/19-migrate-memory
Timmy_Foundation:fix/11-bootstrapper
Timmy_Foundation:feat/session-reader
Timmy_Foundation:burn/8-harvester-py
Labels
Clear labels
acceptance-criteria
batch-pipeline
bootstrapper
epic
harvester
measurer
milestone:1
milestone:2
milestone:3
milestone:4
pipeline
pipeline
priority:high
priority:medium
retroactive
throughput-10x
token-masterplan
Token masterplan batch pipeline
Pre-session context injection
Epic-level issue
Session knowledge extraction
Compounding metrics
Milestone 1: Foundation
Milestone 2: Integration
Milestone 3: Measurement
Milestone 4: Retroactive
Pipeline/integration work
Processing existing sessions
throughput-10x label
token-masterplan label
No Label
Milestone
No items
No Milestone
Projects
Clear projects
No project
Assignees
Rockachopa
Timmy
allegro
antigravity
bezalel
claude
codex-agent
ezra
gemini
google
grok
hermes
kimi
manus
perplexity
Clear assignees
No Assignees
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: Timmy_Foundation/compounding-intelligence#26
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.
Delete Branch "fix/8-harvester"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #8.
What
Main harvester module that chains reader → prompt → LLM → validate → deduplicate → store.
Pipeline
session_reader.pyknowledge/index.json(fingerprint + word overlap)Files
scripts/harvester.py— main module (447 lines)scripts/session_reader.py— JSONL transcript parser (142 lines)scripts/test_harvester_pipeline.py— smoke tests (5/5 passing)CLI
Review: PR #26 — feat: add harvester.py — session knowledge extractor (#8)
Recommendation: REQUEST_CHANGES (but this is the best of the three overlapping PRs)
Duplicate analysis
PR #26, PR #20, and PR #21 all overlap significantly. PR #26 is the most complete — it contains the same harvester.py and session_reader.py as PR #20, plus adds test_harvester_pipeline.py (5 smoke tests). PR #20 should be closed in favor of this one. PR #21 has an incompatible session_reader.py that cannot work with the harvester (different API surface).
Code quality — generally solid
The pipeline architecture (read -> truncate -> extract -> validate -> dedup -> store) is clean and well-structured. The CLI is well-designed with good defaults and env var overrides.
Issues to address
Security: MD5 for fingerprinting —
fact_fingerprint()useshashlib.md5(). While this is not for cryptographic purposes, it is a code-smell that will flag in security scanners. Usehashlib.sha256()instead — there is zero performance difference at this scale.Bug:
parse_extraction_responseregex is broken — The third fallback regexr'({[^{}]*"knowledge"[^{}]*[[sS]*?][^{}]*})'uses[sS]which matches literal characters 's' and 'S', not "any character including newline" (that would be[\s\S]). This fallback will almost never match real JSON.Race condition in write_knowledge — The function reads index.json, modifies it in memory, then writes it back. If two harvester processes run concurrently (e.g., two batch jobs), they will clobber each other's writes. Consider using file locking (
fcntl.flock) or atomic writes (write to temp file, thenos.rename).Deduplication is O(n*m) —
deduplicate()iterates all new facts against all existing facts with word-set overlap. This is fine for hundreds of facts but will degrade at scale. Not blocking, but worth a TODO comment.Missing:
find_api_keyreads files without error handling — If a key file exists but is not readable (permissions), this will throw an unhandled exception. Wrap theopen()in a try/except.batch_harvest date parsing is fragile —
datetime.fromisoformat()on the--sinceargument does not handle bare dates like2026-04-01consistently across Python versions (3.10 vs 3.11+ behave differently). Consider explicitly parsing withstrptimefor the expectedYYYY-MM-DDformat.Test coverage is smoke-test only — The tests mock the LLM call, which is reasonable, but there are no tests for
parse_extraction_response(the most fragile function), no tests forbatch_harvest, and no edge cases (empty facts, unicode content, very large sessions). Consider adding at least a test for the JSON parsing fallbacks.No
__init__.py— The scripts directory usessys.pathmanipulation for imports. If this grows, consider adding an__init__.pyor restructuring as a proper package.Summary
This is the right PR to merge (over #20 and #21). The architecture is sound, the code is readable, and the test suite covers the happy path. Fix the regex bug (#2) and the MD5 usage (#1), and this is good to go. The other items are non-blocking improvements.
Review: harvester.py — Session Knowledge Extractor
Overall: Good pipeline architecture (reader -> prompt -> LLM -> validate -> deduplicate -> store). The code is well-structured with clear separation between stages.
Strengths:
--sincedate filteringIssues:
urllib.requestinstead ofrequests— while this avoids an external dependency, the error handling is more complex. Theurlopentimeout of 60s may be too short for large transcript extractions. Consider making it configurable.~/.hermes/keymaxxing/active/minimax.keysuggests this might pick up the wrong API key for a Nous Research endpoint. Verify the key paths match the API base.json...wrappers.--batchprocesses many sessions, it could hammer the API. Add a configurable delay between API calls.Recommendation: Address JSON fence stripping and rate limiting before merge. The core pipeline is solid.
This is the core harvester module — the most important script in the compounding intelligence pipeline. The code is well-structured with a clear pipeline: read -> truncate -> extract (LLM) -> validate -> deduplicate -> store.
Strengths:
Issues:
fact_fingerprint()uses MD5 which is fine for dedup butimport hashlibshould note this is not for security purposes. Not a blocker.find_api_key()tries 3 different key file paths including a minimax key. The fallback tominimax.keyfor a Nous Research API seems like a misconfiguration risk — if the wrong key is loaded, API calls would fail silently or charge the wrong account.batch_harvestprocesses sessions sequentially with no delay between LLM API calls. For large batches this could hit rate limits.session_reader.pyis solid: Clean JSONL parsing with proper error handling, multimodal content support, and sensible truncation (head 50 + tail 50 with a marker).Overall this is well-built and ready for integration. The API key path issue is worth a follow-up but not blocking. Approved.
Closing: unmergeable due to conflicts or branch protection. Reopen if needed.
Closing: unmergeable due to conflicts or branch protection. Reopen if needed.
Closing: unmergeable due to conflicts or branch protection. Reopen if needed.
Pull request closed