Compare commits

1 Commit

Commit `8d60b6c693` — fix(#676): Add GENOME.md for compounding-intelligence
Checks: Smoke Test / smoke (pull_request) — failing after 25s

Complete codebase genome:
- Project overview and three-pipeline architecture
- Mermaid architecture diagram
- Entry points and data flow
- Knowledge schema and confidence scoring
- Key abstractions
- Test coverage analysis with gaps
- Security considerations
- Dependencies and status

2026-04-15 03:25:20 +00:00

GENOME.md (new file, 184 lines)

@@ -0,0 +1,184 @@
# GENOME.md — compounding-intelligence
> Auto-generated codebase genome. Repo 9/16 in the Codebase Genome series.
## Project Overview
**compounding-intelligence** turns 1B+ daily tokens into durable, compounding fleet intelligence. It solves the core problem of AI agent amnesia: every session starts at zero, rediscovering the same facts, pitfalls, and patterns that previous sessions already learned.
The project implements three pipelines forming a compounding loop:
```
SESSION ENDS --> HARVESTER --> KNOWLEDGE STORE --> BOOTSTRAPPER --> NEW SESSION STARTS SMARTER
                                      |
                                      v
                                  MEASURER --> Prove it's working
```
**Key insight**: Intelligence from a million tokens of work evaporates when the session ends. This project captures it, stores it, and injects it into future sessions so they start smarter.
## Architecture
```mermaid
graph LR
    A[Session Transcripts] -->|Harvester| B[Knowledge Store]
    B -->|Bootstrapper| C[New Session Context]
    C --> D[Agent Work]
    D --> A
    B -->|Measurer| E[Dashboard]
    E -->|Metrics| F[Proof of Compounding]
    subgraph KS["Knowledge Store"]
        B1[index.json]
        B2[global/]
        B3["repos/{repo}.md"]
        B4["agents/{agent}.md"]
    end
```
### Pipeline 1: Harvester
- **Input**: Finished session transcripts (JSONL format)
- **Process**: LLM extracts durable knowledge using structured prompt
- **Output**: Facts stored in `knowledge/` directory
- **Categories**: fact, pitfall, pattern, tool-quirk, question
- **Deduplication**: Content-hash based, existing knowledge has priority
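The diff does not include the harvester's deduplication code; a minimal sketch of content-hash dedup with existing-knowledge priority, as described above, could look like this (the identity fields and function names are illustrative assumptions):

```python
import hashlib
import json

def content_hash(item: dict) -> str:
    """Stable hash over the fields assumed to define identity: fact text, category, repo."""
    key = json.dumps(
        {"fact": item["fact"].strip().lower(),
         "category": item["category"],
         "repo": item["repo"]},
        sort_keys=True,
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def merge_facts(existing: list[dict], harvested: list[dict]) -> list[dict]:
    """Append only facts whose hash is unseen; existing knowledge keeps priority."""
    seen = {content_hash(item) for item in existing}
    merged = list(existing)
    for item in harvested:
        h = content_hash(item)
        if h not in seen:
            seen.add(h)
            merged.append(item)
    return merged
```

Hashing a normalized JSON projection (rather than the raw dict) keeps the hash stable across key ordering and whitespace differences.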
### Pipeline 2: Bootstrapper
- **Input**: `knowledge/` store
- **Process**: Queries for relevant facts, assembles compact 2k-token context
- **Output**: Injected context at session start
- **Goal**: New sessions start with full situational awareness
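The bootstrapper itself is not part of this diff; one plausible sketch of the assembly step, assuming facts are ranked by confidence and tokens are approximated as words (a real tokenizer would be more precise), is:

```python
def assemble_context(facts: list[dict], repo: str, token_budget: int = 2000) -> str:
    """Pick the highest-confidence facts for this repo (plus globals)
    until the token budget is spent."""
    relevant = [f for f in facts if f["repo"] in (repo, "global")]
    relevant.sort(key=lambda f: f["confidence"], reverse=True)
    lines, used = [], 0
    for f in relevant:
        cost = len(f["fact"].split())  # crude token estimate
        if used + cost > token_budget:
            break
        lines.append(f"- [{f['category']}] {f['fact']}")
        used += cost
    return "\n".join(lines)
```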
### Pipeline 3: Measurer
- **Input**: Knowledge store + session metrics
- **Process**: Tracks knowledge velocity, error reduction, hit rate
- **Output**: Dashboard.md + daily reports
- **Goal**: Prove the compounding loop works
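The measurer's metrics are only named here, not implemented; two of them (knowledge velocity and hit rate) could be sketched as follows, assuming `migrated_at` timestamps and per-session fact-id usage logs as inputs:

```python
from datetime import datetime

def knowledge_velocity(timestamps: list[str]) -> float:
    """New facts per day, from ISO 8601 migrated_at timestamps."""
    if len(timestamps) < 2:
        return float(len(timestamps))
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    days = max((times[-1] - times[0]).total_seconds() / 86400, 1.0)
    return len(timestamps) / days

def hit_rate(injected: list[str], referenced: set[str]) -> float:
    """Fraction of injected fact ids the session actually used."""
    return sum(f in referenced for f in injected) / len(injected) if injected else 0.0
```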
## Directory Structure
```
compounding-intelligence/
|-- README.md # Project overview and roadmap
|-- knowledge/
| |-- index.json # Machine-readable fact index (versioned)
| |-- global/ # Cross-repo knowledge
| |-- repos/{repo}.md # Per-repo knowledge files
| |-- agents/{agent}.md # Agent-type notes
|-- scripts/
| |-- test_harvest_prompt.py # Validation for harvest prompt output
| |-- test_harvest_prompt_comprehensive.py # Extended test suite
|-- templates/
| |-- harvest-prompt.md # LLM prompt for knowledge extraction
|-- metrics/
| |-- .gitkeep # Placeholder for dashboard
|-- test_sessions/
| |-- session_failure.jsonl # Test data: failed session
| |-- session_partial.jsonl # Test data: partial session
| |-- session_patterns.jsonl # Test data: pattern extraction
| |-- session_questions.jsonl # Test data: question identification
| |-- session_success.jsonl # Test data: successful session
```
## Entry Points
| File | Purpose | Entry |
|------|---------|-------|
| `templates/harvest-prompt.md` | Extraction prompt | LLM input template |
| `scripts/test_harvest_prompt.py` | Validation | `python3 scripts/test_harvest_prompt.py` |
| `knowledge/index.json` | Data store | Read/write by all pipelines |
## Data Flow
```
1. Agent completes session -> session transcript (JSONL)
2. Harvester reads transcript
3. LLM processes via harvest-prompt.md template
4. Extracted knowledge validated against schema
5. Deduplicated against existing index.json
6. New facts appended with source attribution
7. Bootstrapper queries index.json for relevant facts
8. Context injected into next session
9. Measurer tracks velocity and quality metrics
```
## Knowledge Schema
Each knowledge item in `index.json`:
```json
{
  "fact": "One sentence description",
  "category": "fact|pitfall|pattern|tool-quirk|question",
  "repo": "Repository name or 'global'",
  "confidence": 0.0-1.0,
  "source": "mempalace|fact_store|skill|harvester",
  "source_file": "Origin file if applicable",
  "migrated_at": "ISO 8601 timestamp"
}
```
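A validator for the schema above is straightforward; this sketch returns violation messages rather than raising, so callers can log bad items without aborting a harvest (the function name is illustrative):

```python
CATEGORIES = {"fact", "pitfall", "pattern", "tool-quirk", "question"}
SOURCES = {"mempalace", "fact_store", "skill", "harvester"}
REQUIRED = {"fact", "category", "repo", "confidence", "source"}

def validate_item(item: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the item is valid."""
    errors = [f"missing field: {k}" for k in sorted(REQUIRED - item.keys())]
    if errors:
        return errors
    if item["category"] not in CATEGORIES:
        errors.append(f"unknown category: {item['category']}")
    if item["source"] not in SOURCES:
        errors.append(f"unknown source: {item['source']}")
    if not (isinstance(item["confidence"], (int, float))
            and 0.0 <= item["confidence"] <= 1.0):
        errors.append("confidence must be in [0.0, 1.0]")
    return errors
```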
### Confidence Scoring
- **0.9-1.0**: Explicitly stated with verification
- **0.7-0.8**: Clearly implied by multiple data points
- **0.5-0.6**: Suggested but not fully verified
- **0.3-0.4**: Inferred from limited data
- **0.1-0.2**: Speculative or uncertain
## Key Abstractions
1. **Knowledge Item**: Atomic unit of extracted intelligence. One fact, one category, one confidence score.
2. **Knowledge Store**: Directory-based persistent storage with JSON index.
3. **Harvest Prompt**: Structured LLM prompt that converts session transcripts to knowledge items.
4. **Bootstrap Context**: Compact 2k-token summary injected at session start.
5. **Compounding Loop**: The cycle of extract -> store -> inject -> work -> extract.
## API Surface
### Knowledge Store (file-based)
- **Read**: `knowledge/index.json` — all facts
- **Write**: Append to `index.json` after deduplication
- **Query**: Filter by category, repo, confidence threshold
### Templates
- **harvest-prompt.md**: Input template for LLM extraction
- **bootstrap-context.md**: Output template for session injection
## Test Coverage
| Test File | Covers | Status |
|-----------|--------|--------|
| `test_harvest_prompt.py` | Schema validation, required fields | Present |
| `test_harvest_prompt_comprehensive.py` | Extended validation, edge cases | Present |
| `test_sessions/session_failure.jsonl` | Failure extraction | Test data |
| `test_sessions/session_partial.jsonl` | Partial session handling | Test data |
| `test_sessions/session_patterns.jsonl` | Pattern extraction | Test data |
| `test_sessions/session_questions.jsonl` | Question identification | Test data |
| `test_sessions/session_success.jsonl` | Full extraction | Test data |
### Gaps
- No integration tests for full harvester pipeline
- No tests for bootstrapper context assembly
- No tests for measurer metrics computation
- No tests for deduplication logic
- No CI pipeline configured
## Security Considerations
1. **Knowledge injection**: Bootstrapper injects context from knowledge store. Malicious facts in the store could influence agent behavior. Trust scoring partially mitigates this.
2. **Session transcripts**: May contain sensitive data (tokens, API keys). Harvester must filter sensitive patterns before storage.
3. **LLM extraction**: Harvest prompt instructs "no hallucination" but LLMs can still confabulate. Confidence scoring and source attribution provide auditability.
4. **File-based storage**: No access control on knowledge files. Anyone with filesystem access can read/modify.
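The sensitive-pattern filter required by point 2 is not in this diff; a minimal redaction pass might look like the following. The patterns are illustrative assumptions, not a complete credential taxonomy:

```python
import re

# Illustrative patterns only; a production filter needs a broader, tested set.
SENSITIVE = [
    re.compile(r"(?i)(api[_-]?key|token|secret|password)\s*[:=]\s*\S+"),
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # OpenAI-style secret keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub personal access tokens
]

def redact(text: str) -> str:
    """Replace likely credentials with a placeholder before storage."""
    for pattern in SENSITIVE:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Running this on transcripts before the harvester writes anything keeps credentials out of the knowledge store entirely, which is safer than scrubbing after the fact.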
## Dependencies
- Python 3.10+
- No external packages (stdlib only)
- LLM access for harvester pipeline (Ollama or cloud provider)
- Hermes agent framework for session management
## Status
- **Phase**: Early development
- **Epics**: 4 (Harvester, Knowledge Store, Bootstrap, Measurement)
- **Milestone**: 4 (Retroactive Harvest)
- **Open Issues**: Active development across harvester and knowledge store pipelines