compounding-intelligence/knowledge/SCHEMA.md

# Knowledge File Format Specification

**Version:** 1
**Issue:** #10
**Status:** Draft

---

## Overview

The knowledge system has two layers:

1. **index.json** — Machine-readable fact index. Fast lookups by ID, category, repo, tags.
2. **Knowledge files** (YAML) — Human-readable, editable facts organized by domain.

The harvester writes to both. The bootstrapper reads from index.json. Humans edit the YAML files directly.

---

## index.json Schema

```json
{
  "version": 1,
  "last_updated": "ISO-8601 timestamp",
  "total_facts": 0,
  "facts": []
}
```
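
To illustrate the kind of lookups this index is meant to support, the sketch below builds in-memory lookup tables from an index document. The function name `build_lookups` and the sample data are hypothetical; only the field names come from this spec, and the real bootstrapper may work differently.

```python
import json
from collections import defaultdict


def build_lookups(index: dict) -> dict:
    """Group fact objects by id, category, domain, and tag."""
    by_id = {}
    by_category = defaultdict(list)
    by_domain = defaultdict(list)
    by_tag = defaultdict(list)
    for fact in index.get("facts", []):
        by_id[fact["id"]] = fact
        by_category[fact["category"]].append(fact)
        by_domain[fact["domain"]].append(fact)
        # `tags` is optional, so default to an empty list
        for tag in fact.get("tags", []):
            by_tag[tag].append(fact)
    return {
        "id": by_id,
        "category": dict(by_category),
        "domain": dict(by_domain),
        "tag": dict(by_tag),
    }


# Hypothetical index content; in practice this would come from
# json.load() on knowledge/index.json.
SAMPLE_INDEX = json.loads("""
{
  "version": 1,
  "last_updated": "2026-04-13T00:00:00Z",
  "total_facts": 1,
  "facts": [
    {"id": "global:tool-quirk:001",
     "fact": "Gitea token stored at ~/.config/gitea/token, not env var",
     "category": "tool-quirk", "domain": "global", "confidence": 0.95,
     "tags": ["git", "auth", "gitea"]}
  ]
}
""")

lookups = build_lookups(SAMPLE_INDEX)
print(lookups["tag"]["gitea"][0]["id"])
```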

### Fact Object

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | yes | Unique identifier: `{domain}:{category}:{sequence}` |
| `fact` | string | yes | One-sentence description of the knowledge |
| `category` | enum | yes | One of: `fact`, `pitfall`, `pattern`, `tool-quirk`, `question` |
| `domain` | string | yes | Where this applies: repo name, `global`, or agent name |
| `confidence` | float | yes | 0.0-1.0. How certain is this knowledge? |
| `tags` | string[] | no | Searchable labels: `["git", "auth", "gitea"]` |
| `source_count` | int | no | How many sessions confirmed this fact |
| `first_seen` | date | no | ISO-8601 date first extracted |
| `last_confirmed` | date | no | ISO-8601 date last seen in a session |
| `expires` | date | no | ISO-8601 date after which the fact is considered stale |
| `related` | string[] | no | IDs of related facts |
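
The table above could be mirrored in code roughly as follows. This is a minimal sketch, not the harvester's actual data model: the `Fact` class and `from_dict` helper are hypothetical names, and dates are kept as ISO-8601 strings rather than parsed.

```python
from dataclasses import dataclass, field
from typing import List, Optional

REQUIRED_FIELDS = ("id", "fact", "category", "domain", "confidence")


@dataclass
class Fact:
    # Required fields
    id: str
    fact: str
    category: str
    domain: str
    confidence: float
    # Optional fields; dates are kept as ISO-8601 strings
    tags: List[str] = field(default_factory=list)
    source_count: Optional[int] = None
    first_seen: Optional[str] = None
    last_confirmed: Optional[str] = None
    expires: Optional[str] = None
    related: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "Fact":
        """Build a Fact, raising KeyError if a required field is missing."""
        missing = [name for name in REQUIRED_FIELDS if name not in data]
        if missing:
            raise KeyError(f"missing required fields: {missing}")
        # Unknown keys raise TypeError from the dataclass constructor
        return cls(**data)


example = Fact.from_dict({
    "id": "global:tool-quirk:001",
    "fact": "Gitea token stored at ~/.config/gitea/token, not env var",
    "category": "tool-quirk",
    "domain": "global",
    "confidence": 0.95,
    "tags": ["git", "auth", "gitea"],
})
```

Omitted optional fields fall back to `None` or an empty list, so `example.expires` is `None` here.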

### ID Format

```
{domain}:{category}:{sequence}
```

- `domain` — repo name, `global`, or agent type
- `category` — one of the 5 categories
- `sequence` — zero-padded 3-digit number: `001`, `002`, ...

Examples:
- `the-nexus:pitfall:001`
- `global:tool-quirk:012`
- `hermes-agent:pattern:003`
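
A small sketch of parsing and formatting IDs in this shape. The helper names are hypothetical, and the pattern assumes that a domain never contains a colon and that the sequence is exactly three digits, as described above.

```python
import re
from typing import NamedTuple

ID_PATTERN = re.compile(
    r"(?P<domain>[^:]+)"
    r":(?P<category>fact|pitfall|pattern|tool-quirk|question)"
    r":(?P<sequence>\d{3})"
)


class FactId(NamedTuple):
    domain: str
    category: str
    sequence: int


def parse_fact_id(raw: str) -> FactId:
    """Split an ID into its parts, raising ValueError if it is malformed."""
    match = ID_PATTERN.fullmatch(raw)
    if match is None:
        raise ValueError(f"invalid fact id: {raw!r}")
    return FactId(match["domain"], match["category"], int(match["sequence"]))


def format_fact_id(domain: str, category: str, sequence: int) -> str:
    """Build an ID with a zero-padded 3-digit sequence."""
    return f"{domain}:{category}:{sequence:03d}"


print(parse_fact_id("the-nexus:pitfall:001"))
print(format_fact_id("global", "tool-quirk", 12))
```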

### Categories

| Category | Definition | Example |
|----------|------------|---------|
| `fact` | Concrete, verifiable information | "Gitea API requires token auth at /api/v1" |
| `pitfall` | Errors, wrong assumptions, time-wasters | "Assumed env var GITEA_TOKEN; actual path is ~/.config/gitea/token" |
| `pattern` | Successful sequences of actions | "To deploy: test -> build -> push -> webhook" |
| `tool-quirk` | Environment-specific behaviors | "URL format requires trailing slash on macOS" |
| `question` | Identified but unanswered | "Need optimal batch size for harvesting" |

### Confidence Scoring

| Range | Meaning |
|-------|---------|
| 0.9-1.0 | Explicitly stated and verified |
| 0.7-0.8 | Clearly implied by multiple data points |
| 0.5-0.6 | Suggested but not fully verified |
| 0.3-0.4 | Inferred from limited data |
| 0.1-0.2 | Speculative or uncertain |
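
As a rough sketch, a score can be mapped back to its meaning like this. The table leaves small gaps between bands (for example 0.85), so this example assumes each band starts at its lower bound and extends up to the next one, and that anything below 0.3 counts as speculative. Those are assumptions of the sketch, not rules stated by this spec.

```python
# (lower bound, meaning), checked from highest to lowest
CONFIDENCE_BANDS = [
    (0.9, "Explicitly stated and verified"),
    (0.7, "Clearly implied by multiple data points"),
    (0.5, "Suggested but not fully verified"),
    (0.3, "Inferred from limited data"),
    (0.0, "Speculative or uncertain"),
]


def confidence_label(confidence: float) -> str:
    """Return the meaning of the band that a confidence score falls into."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    for lower_bound, label in CONFIDENCE_BANDS:
        if confidence >= lower_bound:
            return label
    raise AssertionError("unreachable: the last band starts at 0.0")


print(confidence_label(0.95))
```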

---

## Knowledge Files (YAML)

Human-readable files stored in `knowledge/` subdirectories.

### Directory Structure

```
knowledge/
├── index.json                  # Machine-readable fact index
├── SCHEMA.md                   # This file
├── global/                     # Cross-repo knowledge
│   ├── pitfalls.yaml           # Pitfalls that span multiple repos
│   ├── patterns.yaml           # Proven workflows
│   └── tool-quirks.yaml        # Environment behaviors
├── repos/                      # Per-repo knowledge
│   ├── the-nexus.yaml
│   ├── hermes-agent.yaml
│   └── ...
└── agents/                     # Agent-type knowledge
    ├── mimo-sprint.yaml
    └── ...
```

### YAML File Format

```yaml
---
domain: global                    # or repo name or agent name
category: tool-quirk              # fact, pitfall, pattern, tool-quirk, question
version: 1
last_updated: "2026-04-13"
---

# Tool Quirks (Global)

# Cross-environment behaviors that bite you if you don't know them.

## Authentication

- id: global:tool-quirk:001
  fact: "Gitea token stored at ~/.config/gitea/token, not env var"
  confidence: 0.95
  tags: [git, auth, gitea]
  source_count: 23
  first_seen: "2026-03-27"
  last_confirmed: "2026-04-13"
  related: [global:pitfall:003]
```

### Rules

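1. **One file per domain, split by category for `global/`.** Global knowledge gets one file per category (e.g. `global/pitfalls.yaml`); each repo or agent gets a single file, so `repos/the-nexus.yaml` holds all the-nexus facts.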
2. **Markdown sections for humans.** YAML items live under markdown headers for Gitea UI readability.
3. **ID is the link.** The `id` field connects YAML facts to index.json entries.
4. **Harvester writes, humans edit.** Harvester appends. Humans correct confidence, add tags, mark expired.
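
Because the `id` is the link between the two layers, merging can be keyed on it. The sketch below assumes the YAML items have already been parsed into a list of dicts, and that YAML values win field by field on conflict; `merge_into_index` is a hypothetical name, not a function of the real harvester.

```python
from datetime import datetime, timezone
from typing import Optional


def merge_into_index(index: dict, yaml_facts: list,
                     now: Optional[datetime] = None) -> dict:
    """Return a new index with yaml_facts upserted by `id`."""
    now = now or datetime.now(timezone.utc)
    # Copy each existing fact so the input index is left untouched
    merged = {fact["id"]: dict(fact) for fact in index.get("facts", [])}
    for fact in yaml_facts:
        # Field-level update: keys absent from the YAML item are preserved
        merged.setdefault(fact["id"], {}).update(fact)
    facts = sorted(merged.values(), key=lambda f: f["id"])
    return {
        "version": index.get("version", 1),
        "last_updated": now.isoformat(),
        "total_facts": len(facts),
        "facts": facts,
    }
```

A field-level update is used here because YAML items omit `domain` and `category` (they live in the file header), so replacing whole entries would drop those fields.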

---

## Sync Rules

1. **Harvester -> YAML:** Appends new facts to the appropriate YAML file.
2. **Harvester -> index.json:** Adds/updates fact entries.
3. **Human edits YAML:** Changes propagate to index.json on next harvester run.
4. **Confidence decay:** A fact that goes 30 or more sessions without being confirmed has its confidence multiplied by 0.9.
5. **Expiration:** Facts whose `expires` date is earlier than the current date are marked `stale`.
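
The decay and expiration rules could be sketched as below. The schema has no field that counts sessions since the last confirmation, so `sessions_since_confirmed` is a hypothetical input, and the spec does not say whether decay repeats; this sketch applies it once per call. How a fact is "marked" stale is also unspecified, so `is_stale` only reports the condition.

```python
from datetime import date

DECAY_AFTER_SESSIONS = 30
DECAY_FACTOR = 0.9


def apply_decay(fact: dict, sessions_since_confirmed: int) -> dict:
    """Return a copy of the fact with confidence decayed if it is overdue."""
    updated = dict(fact)
    if sessions_since_confirmed >= DECAY_AFTER_SESSIONS:
        # Rounded to keep stored values tidy; the rounding is an assumption
        updated["confidence"] = round(fact["confidence"] * DECAY_FACTOR, 4)
    return updated


def is_stale(fact: dict, today: date) -> bool:
    """True if the fact has an `expires` date earlier than `today`."""
    expires = fact.get("expires")
    return expires is not None and date.fromisoformat(expires) < today


fact = {"id": "global:tool-quirk:001", "confidence": 0.95,
        "expires": "2026-01-01"}
print(apply_decay(fact, sessions_since_confirmed=31)["confidence"])
print(is_stale(fact, today=date(2026, 4, 13)))
```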

---

## Validation

Facts must pass these checks:

1. `id` matches format `{domain}:{category}:{sequence}`
2. `category` is one of the 5 allowed values
3. `confidence` is between 0.0 and 1.0
4. `fact` is a non-empty string of at most 280 characters
5. `domain` is a non-empty string
6. Each entry in `tags` contains only lowercase letters, digits, and hyphens
7. No duplicate IDs in index.json

Validation script: `scripts/validate_knowledge.py`
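
The checks above might be implemented along these lines. This is an illustrative sketch and not the contents of `scripts/validate_knowledge.py`; the function names and error messages are hypothetical, and the ID pattern assumes domains contain no colons.

```python
import re

CATEGORIES = {"fact", "pitfall", "pattern", "tool-quirk", "question"}
ID_RE = re.compile(r"[^:]+:(?:fact|pitfall|pattern|tool-quirk|question):\d{3}")
TAG_RE = re.compile(r"[a-z0-9-]+")
MAX_FACT_LENGTH = 280


def validate_fact(fact: dict) -> list:
    """Return a list of error messages for one fact (empty if valid)."""
    errors = []
    fact_id = fact.get("id")
    if not isinstance(fact_id, str) or not ID_RE.fullmatch(fact_id):
        errors.append("id must match {domain}:{category}:{sequence}")
    category = fact.get("category")
    if not isinstance(category, str) or category not in CATEGORIES:
        errors.append("category must be one of the 5 allowed values")
    confidence = fact.get("confidence")
    is_number = (isinstance(confidence, (int, float))
                 and not isinstance(confidence, bool))
    if not is_number or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be between 0.0 and 1.0")
    text = fact.get("fact")
    if not isinstance(text, str) or not text or len(text) > MAX_FACT_LENGTH:
        errors.append("fact must be a non-empty string, max 280 characters")
    domain = fact.get("domain")
    if not isinstance(domain, str) or not domain:
        errors.append("domain must be a non-empty string")
    for tag in fact.get("tags", []):
        if not isinstance(tag, str) or not TAG_RE.fullmatch(tag):
            errors.append(f"invalid tag: {tag!r}")
    return errors


def validate_index(index: dict) -> list:
    """Validate every fact and report duplicate IDs across the index."""
    errors = []
    seen = set()
    for fact in index.get("facts", []):
        fact_id = fact.get("id")
        errors.extend(f"{fact_id}: {msg}" for msg in validate_fact(fact))
        if fact_id in seen:
            errors.append(f"{fact_id}: duplicate id")
        seen.add(fact_id)
    return errors
```

Returning a list of messages instead of raising on the first failure lets a single run report every problem in the index.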