feat: Add session harvester with auto-harvest cron (#9) #29
Closed
Rockachopa wants to merge 4 commits from fix/9-auto-harvest-cron into main
Reference: Timmy_Foundation/compounding-intelligence#29
Summary
Adds session harvester that extracts durable knowledge from completed sessions.
Changes
scripts/session_reader.py - Parses Hermes session files (JSON/JSONL)
scripts/harvester.py - Extracts knowledge from sessions using pattern matching
knowledge/harvest_state.json - Tracks harvested sessions
Cron Integration
Ready for cron job:
session-harvester every 15 minutes.
Test Results
Next Steps
session-harvester every 15 minutes
Closes #9
Changes requested. The harvester concept is sound but the extracted facts are low quality.
Issues:
The harvested facts in index.json are almost all noise: generic entries like "Error encountered with file: 300.07", "Error encountered with file: k2.5", "Error encountered with file: ncli.py". These are not actionable knowledge — they look like regex false positives extracting any file-path-like string from session logs.
59 "facts" were extracted but the vast majority are garbage entries that would pollute the knowledge store. The extraction heuristics need significant improvement before this is production-ready.
The harvest_state.json correctly tracks processed sessions, but with junk output, the tracking creates a false sense of progress.
The harvester infrastructure (session scanning, state tracking, cron scheduling) is fine. The extraction logic needs work to produce meaningful facts instead of noise.
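For readers unfamiliar with the state-tracking idea mentioned above, here is a minimal sketch of how a harvest state file can prevent re-processing. The PR's actual harvest_state.json schema is not shown in this thread, so the "harvested" key and the function names below are assumptions for illustration only.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


def load_state(state_path: Path) -> Set[str]:
    """Return the set of session IDs already harvested (empty if no state file)."""
    if not state_path.exists():
        return set()
    data = json.loads(state_path.read_text(encoding="utf-8"))
    # "harvested" is a hypothetical key, not taken from the PR.
    return set(data.get("harvested", []))


def save_state(state_path: Path, harvested: Set[str]) -> None:
    """Write the state file with sorted IDs and a trailing newline."""
    payload = json.dumps({"harvested": sorted(harvested)}, indent=2)
    state_path.write_text(payload + "\n", encoding="utf-8")


def pending_sessions(session_ids: Iterable[str], harvested: Set[str]) -> List[str]:
    """Keep only the sessions that have not been harvested yet, in input order."""
    return [sid for sid in session_ids if sid not in harvested]
```

A cron run would call load_state, process pending_sessions, then save_state. Storing session IDs rather than file paths keeps the state file portable across machines.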
Review: Session Harvester with Auto-Harvest Cron (#9)
Architecture is sound — SessionReader + KnowledgeHarvester separation is clean, and the harvest state tracking prevents re-processing.
Critical: Low-quality extracted facts in index.json:
The harvested data committed in knowledge/index.json is mostly noise. Nearly all 59 facts are of the form "Error encountered with file: X" where X is often not a file at all (e.g., 300.07, k2.5, 0.0, job.get, CreateIssueOption.Labels). The regex r"[~/.]?[\w/]+\.\w+" in extract_knowledge_from_session matches anything that looks like word.word, which captures version numbers, floating point values, and Python attribute access. This data pollutes the knowledge store. Recommend: require a leading /, ~, or . and a known file extension.
Critical: Hardcoded absolute paths in committed data:
harvest_state.json and index.json contain absolute paths like /Users/apayne/.hermes/sessions/.... These are machine-specific and should not be in the repo.
Minor: import re inside loop:
re is imported inside the for loop body in extract_knowledge_from_session. Move it to the top of the file.
Minor: session_reader.py:
Clean implementation. The is_session_complete heuristic (5 min idle = done) is reasonable. Timestamp parsing with multiple format fallbacks is good.
Positive:
The committed harvested data quality is the main concern — the extraction regex needs significant tightening before this produces useful knowledge. Request changes for the data quality and committed absolute paths.
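To make the regex concern concrete, the sketch below compares the loose pattern quoted in this review with one possible tightened version that requires a path prefix and a known extension. STRICT and KNOWN_EXTENSIONS are assumptions for illustration, not code from the PR, and the extension list would need tuning for the real session logs.

```python
import re

# Pattern quoted in the review: matches anything shaped like word.word.
LOOSE = re.compile(r"[~/.]?[\w/]+\.\w+")

# Hypothetical allow-list of extensions; adjust to the project's needs.
KNOWN_EXTENSIONS = ("py", "json", "jsonl", "yaml", "yml", "md", "txt", "sh", "toml")

STRICT = re.compile(
    r"(?<![\w.])"                # do not start inside a word or dotted name
    r"(?:~/|\.{1,2}/|/)"         # must begin with ~/, ./, ../ or /
    r"[\w./-]*\w"                # path body, ending on a word character
    r"\.(?:" + "|".join(KNOWN_EXTENSIONS) + r")"
    r"(?!\w)"                    # extension must end here
)

sample = (
    "took 300.07 seconds on k2.5 via job.get; "
    "edited /Users/x/script.py and ./config.yaml"
)

loose_hits = LOOSE.findall(sample)    # includes 300.07, k2.5 and job.get
strict_hits = STRICT.findall(sample)  # only the two real paths
```

Note that the stricter pattern also drops bare file names such as crons.py, since they carry no path prefix. Whether that trade-off is acceptable depends on how the harvester is meant to use file references.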
Review: Session Harvester with Auto-Harvest Cron
Overall: The harvester pipeline design (read -> truncate -> extract -> validate -> deduplicate -> store) is sound. However, the extracted facts in index.json are very low quality and need attention.
Critical issue:
The harvested facts are almost entirely noise. Examples from the diff:
"Error encountered with file: 300.07" - this is not a fact, it is a misparse of a number
"Error encountered with file: config.yaml" - too vague to be actionable
"Error encountered with file: /private/var/folders/..." - raw temp paths are not durable knowledge
Of the 59 "facts" extracted, the majority appear to be false positives from the regex-based extraction. The extraction patterns are matching too aggressively on error-like strings.
Issues:
Recommendation: Fix the extraction quality before merging. The current output would pollute the knowledge store with noise, making it harder for the bootstrapper to assemble useful context. Consider adding a validation step that rejects facts matching common noise patterns (temp paths, bare numbers, single-word "facts").
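A validation step along these lines could be sketched as follows. It assumes facts are plain strings, which may not match the real index.json entries, and the names is_noise, TEMP_PATH, and BARE_NUMBER are hypothetical. The last sample fact in the usage note is invented purely as a contrast case.

```python
import re

# Temp directories such as /tmp/... or /private/var/folders/...
TEMP_PATH = re.compile(r"(?:/private)?/(?:var/folders|tmp)/")
# A value that is nothing but an integer or decimal number.
BARE_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def fact_subject(fact: str) -> str:
    """Return the text after the last ': ', or the whole fact if there is none."""
    return fact.rsplit(": ", 1)[-1].strip()


def is_noise(fact: str) -> bool:
    """Reject single-word facts, temp paths, and facts whose subject is a bare number."""
    text = fact.strip()
    if len(text.split()) < 2:
        return True
    if TEMP_PATH.search(text):
        return True
    if BARE_NUMBER.fullmatch(fact_subject(text)):
        return True
    return False


def filter_facts(facts):
    """Keep only the facts that pass the noise check."""
    return [f for f in facts if not is_noise(f)]
```

With this sketch, "Error encountered with file: 300.07" and any fact containing a /private/var/folders/ path would be rejected, while a multi-word statement such as "Gitea API returns 422 when labels are passed as strings" would pass. It does not catch vague entries like the config.yaml example, which needs a better extractor rather than a filter.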
This PR adds the session harvester with auto-harvest cron integration. Critical issues found in the extracted knowledge data:
Major: Garbage extraction quality. The 59 "facts" in index.json are almost entirely noise. The vast majority are entries like "Error encountered with file: 300.07", "Error encountered with file: crons.py", "Error encountered with file: 0.0". These are not meaningful knowledge items. Numbers like "300.07", "10.88", "k2.5" are clearly regex false positives matching against numeric values, version numbers, or response times in the session transcripts. The pattern-matching harvester is not using LLM extraction; it appears to be doing naive regex matching that captures file paths, numbers, and arbitrary strings.
Major: Duplicate facts not caught. "Error encountered with file: crons.py" appears multiple times in the index. The deduplication logic is either not running or not effective on these templated strings.
Major: Local filesystem paths leaked. Facts contain absolute paths like /private/var/folders/9k/v07xkpp133v03yynn9nx80fr0000gn/T/hermes_sandbox_z8ielhro/script.py and /Users/apayne/.hermes/sessions/.... These are machine-specific temp paths that provide zero knowledge value and leak local system details.
Missing: session_reader.py and harvester.py source code. The diff shows changes to index.json and harvest_state.json but the actual scripts that generated this data should be included for review.
This harvester is not extracting useful knowledge — it needs a fundamentally different extraction approach (likely LLM-based as designed in the harvest prompt template) rather than regex pattern matching. Requesting changes.
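On the duplicate-facts point, one way deduplication could work for templated strings is to normalize case and whitespace before comparing. This is a sketch only: it assumes facts are plain strings, and fact_key and deduplicate are hypothetical names rather than functions from the PR.

```python
import hashlib


def fact_key(fact: str) -> str:
    """Build a stable key from a fact, ignoring case and whitespace differences."""
    normalized = " ".join(fact.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(facts):
    """Return facts with later duplicates removed, preserving first-seen order."""
    seen = set()
    unique = []
    for fact in facts:
        key = fact_key(fact)
        if key not in seen:
            seen.add(key)
            unique.append(fact)
    return unique
```

Keys like this could also be stored alongside the index so that a later cron run can skip facts already present, not just duplicates within one batch.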
Session harvester with auto-harvest cron and knowledge store population. The harvest_state.json tracking is useful for incremental processing.
Critical issues:
Extremely low-quality extracted facts: The index.json shows 59 facts, but the vast majority are noise like "Error encountered with file: 300.07", "Error encountered with file: k2.5", "Error encountered with file: 10.88". These are clearly misparses — numbers and version strings being treated as file paths. The extraction prompt needs significant improvement before this data is useful.
Session paths in harvested data contain absolute local paths (/Users/apayne/.hermes/sessions/session_...). These are machine-specific and will break on any other machine. Store relative paths or session IDs only.
All facts have the same confidence (0.7): This suggests the extraction is not actually assessing confidence; it is hardcoding a default. The confidence score loses meaning if it is always the same.
harvest_state.json has no trailing newline.
The 59 facts in index.json are almost entirely garbage data. The harvester found patterns in error messages rather than actual knowledge. This needs a redesigned extraction prompt with better filtering.
The harvester infrastructure (state tracking, incremental processing, cron integration) is sound, but the extracted knowledge quality is too low to merge. Fix the extraction prompt to filter out error message fragments.
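The suggestion to store relative paths or session IDs could look roughly like this. The example path and both function names are hypothetical, and the sketch assumes the session ID is simply the file name without its extension, which this thread does not confirm.

```python
from pathlib import PurePosixPath


def session_id_from_path(path: str) -> str:
    """Derive a portable session ID from a session file path (file name without extension)."""
    return PurePosixPath(path).stem


def relative_to_sessions_root(path: str, root: str) -> str:
    """Return the path relative to the sessions root; raises ValueError if outside it."""
    return PurePosixPath(path).relative_to(PurePosixPath(root)).as_posix()


# Hypothetical example path, not taken from the PR's data.
example = "/home/alice/.hermes/sessions/session_abc123.jsonl"
example_id = session_id_from_path(example)
example_rel = relative_to_sessions_root(example, "/home/alice/.hermes/sessions")
```

Either form keeps harvest_state.json and index.json free of usernames and machine-specific directories, with the sessions root resolved at runtime from configuration.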
Closing: unmergeable due to conflicts or branch protection. Reopen if needed.
Superseded by merged PR #31. Closing as content is already in main.
Closed — overlaps PR #26 (same harvester core, #26 has tests). Merging #26 instead.
Pull request closed