# Knowledge File Format Specification

**Version:** 1  
**Issue:** #10  
**Status:** Draft

---

## Overview

The knowledge system has two layers:

1. **index.json**: Machine-readable fact index. Fast lookups by ID, category, repo, tags.
2. **Knowledge files** (YAML): Human-readable, editable facts organized by domain.

The harvester writes to both. The bootstrapper reads from index.json. Humans edit the YAML files directly.

---

## index.json Schema

```json
{
  "version": 1,
  "last_updated": "ISO-8601 timestamp",
  "total_facts": 0,
  "facts": []
}
```

### Fact Object

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | yes | Unique identifier: `{domain}:{category}:{sequence}` |
| `fact` | string | yes | One-sentence description of the knowledge |
| `category` | enum | yes | One of: `fact`, `pitfall`, `pattern`, `tool-quirk`, `question` |
| `domain` | string | yes | Where this applies: repo name, `global`, or agent name |
| `confidence` | float | yes | 0.0–1.0. How certain is this knowledge? |
| `tags` | string[] | no | Searchable labels: `["git", "auth", "gitea"]` |
| `source_count` | int | no | How many sessions confirmed this fact |
| `first_seen` | date | no | ISO-8601 date first extracted |
| `last_confirmed` | date | no | ISO-8601 date last seen in a session |
| `expires` | date | no | Optional. After this date, fact is stale |
| `related` | string[] | no | IDs of related facts |

### ID Format

```
{domain}:{category}:{sequence}
```

- `domain`: repo name, `global`, or agent type
- `category`: one of the 5 categories
- `sequence`: zero-padded 3-digit number: `001`, `002`, ...

Examples:

- `the-nexus:pitfall:001`
- `global:tool-quirk:012`
- `hermes-agent:pattern:003`

### Categories

| Category | Definition | Example |
|----------|------------|---------|
| `fact` | Concrete, verifiable information | "Gitea API requires token auth at /api/v1" |
| `pitfall` | Errors, wrong assumptions, time-wasters | "Assumed env var GITEA_TOKEN; actual path is ~/.config/gitea/token" |
| `pattern` | Successful sequences of actions | "To deploy: test → build → push → webhook" |
| `tool-quirk` | Environment-specific behaviors | "URL format requires trailing slash on macOS" |
| `question` | Identified but unanswered | "Need optimal batch size for harvesting" |

### Confidence Scoring

| Range | Meaning |
|-------|---------|
| 0.9–1.0 | Explicitly stated and verified |
| 0.7–0.8 | Clearly implied by multiple data points |
| 0.5–0.6 | Suggested but not fully verified |
| 0.3–0.4 | Inferred from limited data |
| 0.1–0.2 | Speculative or uncertain |

---

## Knowledge Files (YAML)

Human-readable files stored in `knowledge/` subdirectories.

### Directory Structure

```
knowledge/
├── index.json              # Machine-readable fact index
├── SCHEMA.md               # This file
├── global/                 # Cross-repo knowledge
│   ├── pitfalls.yaml       # Pitfalls that span multiple repos
│   ├── patterns.yaml       # Proven workflows
│   └── tool-quirks.yaml    # Environment behaviors
├── repos/                  # Per-repo knowledge
│   ├── the-nexus.yaml
│   ├── hermes-agent.yaml
│   └── ...
└── agents/                 # Agent-type knowledge
    ├── mimo-sprint.yaml
    └── ...
```

### YAML File Format

```yaml
---
domain: global          # or repo name or agent name
category: tool-quirk    # fact, pitfall, pattern, tool-quirk, question
version: 1
last_updated: "2026-04-13"
---

# Tool Quirks (Global)

Cross-environment behaviors that bite you if you don't know them.

## Authentication

- id: global:tool-quirk:001
  fact: "Gitea token stored at ~/.config/gitea/token, not env var"
  confidence: 0.95
  tags: [git, auth, gitea]
  source_count: 23
  first_seen: "2026-03-27"
  last_confirmed: "2026-04-13"
  related: [global:pitfall:003]

- id: global:tool-quirk:002
  fact: "Gitea API uses 'token' header format: Authorization: token TOKEN"
  confidence: 0.9
  tags: [git, api, gitea]
  source_count: 8
  first_seen: "2026-03-28"
```

### Rules

1. **One file per domain per category.** `repos/the-nexus.yaml` holds all the-nexus facts. Don't mix categories across files.
2. **Markdown sections for humans.** The YAML items live under markdown headers. This makes the files readable in Gitea's UI.
3. **ID is the link.** The `id` field connects YAML facts to index.json entries. Same ID = same fact.
4. **Harvester writes, humans edit.** The harvester appends new facts. Humans can correct confidence, add tags, or mark expired.

---

## Sync Rules

1. **Harvester → YAML:** Appends new facts to the appropriate YAML file.
2. **Harvester → index.json:** Adds/updates fact entries.
3. **Human edits YAML:** Changes propagate to index.json on next harvester run.
4. **Confidence decay:** Facts not confirmed in 30+ sessions get `confidence *= 0.9`.
5. **Expiration:** Facts with `expires` date past current date are marked `stale` in index.json.

---

## Validation

Facts must pass these checks before entering the index:

1. `id` matches format `{domain}:{category}:{sequence}`
2. `category` is one of the 5 allowed values
3. `confidence` is between 0.0 and 1.0
4. `fact` is non-empty string, max 280 characters
5. `domain` is non-empty string
6. `tags` are lowercase alphanumeric + hyphens
7. No duplicate IDs in index.json

Validation script: `scripts/validate_knowledge.py`
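
The seven checks above could be sketched roughly as follows. This is an illustrative sketch only, not the contents of `scripts/validate_knowledge.py`: the names `validate_fact` and `validate_index` are hypothetical, and the ID pattern assumes that the domain part contains no colon or whitespace and that the sequence is exactly three digits.

```python
import re
from collections import Counter

# The five categories listed in this spec.
CATEGORIES = ("fact", "pitfall", "pattern", "tool-quirk", "question")

# Assumption: domain has no colon/whitespace; sequence is exactly three digits.
ID_RE = re.compile(
    r"[^:\s]+:(?:%s):\d{3}" % "|".join(re.escape(c) for c in CATEGORIES)
)
TAG_RE = re.compile(r"[a-z0-9-]+")


def validate_fact(entry):
    """Return a list of problems found in one fact object (empty if valid)."""
    errors = []

    fact_id = entry.get("id")
    if not isinstance(fact_id, str) or not ID_RE.fullmatch(fact_id):
        errors.append("id must match {domain}:{category}:{sequence}")

    if entry.get("category") not in CATEGORIES:
        errors.append("category must be one of: " + ", ".join(CATEGORIES))

    confidence = entry.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number between 0.0 and 1.0")

    text = entry.get("fact")
    if not isinstance(text, str) or not text.strip() or len(text) > 280:
        errors.append("fact must be a non-empty string of at most 280 characters")

    domain = entry.get("domain")
    if not isinstance(domain, str) or not domain.strip():
        errors.append("domain must be a non-empty string")

    tags = entry.get("tags", [])  # tags are optional
    if not isinstance(tags, list) or not all(
        isinstance(t, str) and TAG_RE.fullmatch(t) for t in tags
    ):
        errors.append("tags must be lowercase alphanumeric plus hyphens")

    return errors


def validate_index(index):
    """Check every fact in a parsed index.json; return {id: [problems]}."""
    problems = {}
    facts = index.get("facts", [])
    for entry in facts:
        errors = validate_fact(entry)
        if errors:
            problems.setdefault(str(entry.get("id")), []).extend(errors)

    # Check 7: no duplicate IDs across the whole index.
    counts = Counter(e["id"] for e in facts if isinstance(e.get("id"), str))
    for fact_id, n in counts.items():
        if n > 1:
            problems.setdefault(fact_id, []).append("duplicate id (%d entries)" % n)
    return problems


sample = {
    "id": "global:tool-quirk:001",
    "fact": "Gitea token stored at ~/.config/gitea/token, not env var",
    "category": "tool-quirk",
    "domain": "global",
    "confidence": 0.95,
    "tags": ["git", "auth", "gitea"],
}
print(validate_fact(sample))  # → []
```

Returning a list of problems rather than raising on the first failure lets a caller report every issue with a fact in one pass, which may be convenient when humans are correcting YAML by hand.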