# Knowledge File Format Specification

**Version:** 1
**Issue:** #10
**Status:** Draft

---

## Overview

The knowledge system has two layers:

1. **index.json**: Machine-readable fact index. Fast lookups by ID, category, repo, tags.
2. **Knowledge files** (YAML): Human-readable, editable facts organized by domain.

The harvester writes to both. The bootstrapper reads from index.json. Humans edit the YAML files directly.

---

## index.json Schema

```json
{
  "version": 1,
  "last_updated": "ISO-8601 timestamp",
  "total_facts": 0,
  "facts": []
}
```

### Fact Object

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `id` | string | yes | Unique identifier: `{domain}:{category}:{sequence}` |
| `fact` | string | yes | One-sentence description of the knowledge |
| `category` | enum | yes | One of: `fact`, `pitfall`, `pattern`, `tool-quirk`, `question` |
| `domain` | string | yes | Where this applies: repo name, `global`, or agent name |
| `confidence` | float | yes | 0.0–1.0. How certain is this knowledge? |
| `tags` | string[] | no | Searchable labels: `["git", "auth", "gitea"]` |
| `source_count` | int | no | How many sessions confirmed this fact |
| `first_seen` | date | no | ISO-8601 date first extracted |
| `last_confirmed` | date | no | ISO-8601 date last seen in a session |
| `expires` | date | no | Optional. After this date, fact is stale |
| `related` | string[] | no | IDs of related facts |

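
The lookups the index is meant to serve (by ID, category, domain, and tag) can be sketched in Python as follows. This is a minimal illustration, assuming the index has already been parsed into a dict shaped like the schema above. The function `build_lookups` and the sample facts are hypothetical and are not part of the harvester or bootstrapper.

```python
from collections import defaultdict


def build_lookups(index: dict) -> dict:
    """Group fact objects by id, category, domain, and tag (illustrative only)."""
    by_id = {}
    by_category = defaultdict(list)
    by_domain = defaultdict(list)
    by_tag = defaultdict(list)
    for fact in index["facts"]:
        by_id[fact["id"]] = fact
        by_category[fact["category"]].append(fact)
        by_domain[fact["domain"]].append(fact)
        # `tags` is optional in the fact object, so fall back to an empty list.
        for tag in fact.get("tags", []):
            by_tag[tag].append(fact)
    return {
        "id": by_id,
        "category": dict(by_category),
        "domain": dict(by_domain),
        "tag": dict(by_tag),
    }


# Hypothetical sample data using the fields from the table above.
sample_index = {
    "version": 1,
    "last_updated": "2026-04-13T00:00:00Z",
    "total_facts": 2,
    "facts": [
        {
            "id": "global:tool-quirk:001",
            "fact": "Gitea token stored at ~/.config/gitea/token, not env var",
            "category": "tool-quirk",
            "domain": "global",
            "confidence": 0.95,
            "tags": ["git", "auth", "gitea"],
        },
        {
            "id": "the-nexus:pitfall:001",
            "fact": "Placeholder pitfall used only for this example",
            "category": "pitfall",
            "domain": "the-nexus",
            "confidence": 0.6,
        },
    ],
}

lookups = build_lookups(sample_index)
```

With these lookups, `lookups["tag"]["gitea"]` returns every fact tagged `gitea`, and `lookups["id"]["the-nexus:pitfall:001"]` returns a single fact object.
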
### ID Format

```
{domain}:{category}:{sequence}
```

- `domain`: repo name, `global`, or agent type
- `category`: one of the 5 categories
- `sequence`: zero-padded 3-digit number, e.g. `001`, `002`, ...

Examples:

- `the-nexus:pitfall:001`
- `global:tool-quirk:012`
- `hermes-agent:pattern:003`

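
Parsing and building IDs in this format could look like the sketch below. It assumes that the domain part contains no colons and that the sequence is exactly three digits. The helper names `parse_fact_id` and `format_fact_id` are hypothetical.

```python
import re

CATEGORIES = ("fact", "pitfall", "pattern", "tool-quirk", "question")

# Assumption: domains contain no colons or whitespace, and the sequence
# is exactly three digits.
ID_PATTERN = re.compile(
    r"^(?P<domain>[^:\s]+)"
    r":(?P<category>" + "|".join(re.escape(c) for c in CATEGORIES) + r")"
    r":(?P<sequence>\d{3})$"
)


def parse_fact_id(fact_id: str) -> tuple[str, str, int]:
    """Split an ID into (domain, category, sequence number)."""
    match = ID_PATTERN.match(fact_id)
    if match is None:
        raise ValueError(f"malformed fact id: {fact_id!r}")
    return match["domain"], match["category"], int(match["sequence"])


def format_fact_id(domain: str, category: str, sequence: int) -> str:
    """Build an ID with the sequence zero-padded to three digits."""
    return f"{domain}:{category}:{sequence:03d}"
```

For example, `parse_fact_id("global:tool-quirk:012")` yields the domain `global`, the category `tool-quirk`, and the sequence number 12, while an ID with an unknown category raises `ValueError`.
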
### Categories

| Category | Definition | Example |
|----------|------------|---------|
| `fact` | Concrete, verifiable information | "Gitea API requires token auth at /api/v1" |
| `pitfall` | Errors, wrong assumptions, time-wasters | "Assumed env var GITEA_TOKEN; actual path is ~/.config/gitea/token" |
| `pattern` | Successful sequences of actions | "To deploy: test → build → push → webhook" |
| `tool-quirk` | Environment-specific behaviors | "URL format requires trailing slash on macOS" |
| `question` | Identified but unanswered | "Need optimal batch size for harvesting" |

### Confidence Scoring

| Range | Meaning |
|-------|---------|
| 0.9–1.0 | Explicitly stated and verified |
| 0.7–0.8 | Clearly implied by multiple data points |
| 0.5–0.6 | Suggested but not fully verified |
| 0.3–0.4 | Inferred from limited data |
| 0.1–0.2 | Speculative or uncertain |

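
One way to map a confidence value onto the meanings in the table is sketched below. The table leaves gaps between bands (for example 0.85, or values under 0.1), so this sketch assumes each band extends up to the next band's lower bound and that anything under 0.3 counts as speculative. That reading, and the name `describe_confidence`, are assumptions rather than part of the spec.

```python
# (lower bound, meaning) pairs taken from the table above, highest band first.
# Assumption: each band runs up to the next band's lower bound, and values
# below 0.3 (including those under 0.1) fall into the lowest band.
CONFIDENCE_BANDS = [
    (0.9, "Explicitly stated and verified"),
    (0.7, "Clearly implied by multiple data points"),
    (0.5, "Suggested but not fully verified"),
    (0.3, "Inferred from limited data"),
    (0.0, "Speculative or uncertain"),
]


def describe_confidence(confidence: float) -> str:
    """Return the table's meaning for a confidence score in [0.0, 1.0]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    for lower_bound, meaning in CONFIDENCE_BANDS:
        if confidence >= lower_bound:
            return meaning
    # Not reachable: the last band starts at 0.0.
    return CONFIDENCE_BANDS[-1][1]


print(describe_confidence(0.95))  # Explicitly stated and verified
```
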
---

## Knowledge Files (YAML)

Human-readable files stored in `knowledge/` subdirectories.

### Directory Structure

```
knowledge/
├── index.json              # Machine-readable fact index
├── SCHEMA.md               # This file
├── global/                 # Cross-repo knowledge
│   ├── pitfalls.yaml       # Pitfalls that span multiple repos
│   ├── patterns.yaml       # Proven workflows
│   └── tool-quirks.yaml    # Environment behaviors
├── repos/                  # Per-repo knowledge
│   ├── the-nexus.yaml
│   ├── hermes-agent.yaml
│   └── ...
└── agents/                 # Agent-type knowledge
    ├── mimo-sprint.yaml
    └── ...
```

### YAML File Format

```yaml
---
domain: global          # or repo name or agent name
category: tool-quirk    # fact, pitfall, pattern, tool-quirk, question
version: 1
last_updated: "2026-04-13"
---

# Tool Quirks (Global)

# Cross-environment behaviors that bite you if you don't know them.

## Authentication

- id: global:tool-quirk:001
  fact: "Gitea token stored at ~/.config/gitea/token, not env var"
  confidence: 0.95
  tags: [git, auth, gitea]
  source_count: 23
  first_seen: "2026-03-27"
  last_confirmed: "2026-04-13"
  related: [global:pitfall:003]

- id: global:tool-quirk:002
  fact: "Gitea API uses 'token' header format: Authorization: token TOKEN"
  confidence: 0.9
  tags: [git, api, gitea]
  source_count: 8
  first_seen: "2026-03-28"
```

### Rules

1. **One file per domain per category.** `repos/the-nexus.yaml` holds all the-nexus facts. Don't mix categories across files.
2. **Markdown sections for humans.** The YAML items live under markdown headers. This makes the files readable in Gitea's UI.
3. **ID is the link.** The `id` field connects YAML facts to index.json entries. Same ID = same fact.
4. **Harvester writes, humans edit.** The harvester appends new facts. Humans can correct confidence, add tags, or mark expired.

---

## Sync Rules

1. **Harvester → YAML:** Appends new facts to the appropriate YAML file.
2. **Harvester → index.json:** Adds/updates fact entries.
3. **Human edits YAML:** Changes propagate to index.json on the next harvester run.
4. **Confidence decay:** Facts not confirmed in 30+ sessions get `confidence *= 0.9`.
5. **Expiration:** Facts whose `expires` date is earlier than the current date are marked `stale` in index.json.

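
The decay and expiration rules can be sketched as a single pass over one fact. This is an illustration under several assumptions the spec does not settle: the caller supplies a hypothetical `sessions_since_confirmed` counter, decay is applied once per call, and staleness is recorded as a boolean `stale` field. The name `apply_sync_rules` is also hypothetical.

```python
from datetime import date

DECAY_AFTER_SESSIONS = 30  # "30+ sessions" from rule 4
DECAY_FACTOR = 0.9         # "confidence *= 0.9" from rule 4


def apply_sync_rules(fact: dict, sessions_since_confirmed: int, today: date) -> dict:
    """Return a copy of `fact` with decay and expiration applied."""
    updated = dict(fact)

    # Rule 4: decay confidence once the fact has gone unconfirmed long enough.
    if sessions_since_confirmed >= DECAY_AFTER_SESSIONS:
        updated["confidence"] = updated["confidence"] * DECAY_FACTOR

    # Rule 5: mark the fact stale when its `expires` date is in the past.
    # Assumption: staleness is stored as a boolean `stale` field.
    expires = updated.get("expires")
    if expires is not None and date.fromisoformat(expires) < today:
        updated["stale"] = True

    return updated


# Hypothetical fact that has gone 31 sessions without confirmation
# and whose `expires` date has already passed.
old_fact = {"id": "global:fact:001", "confidence": 0.5, "expires": "2026-04-01"}
result = apply_sync_rules(old_fact, sessions_since_confirmed=31, today=date(2026, 4, 13))
```
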
---

## Validation

Facts must pass these checks before entering the index:

1. `id` matches the format `{domain}:{category}:{sequence}`
2. `category` is one of the 5 allowed values
3. `confidence` is between 0.0 and 1.0
4. `fact` is a non-empty string, max 280 characters
5. `domain` is a non-empty string
6. `tags` contain only lowercase alphanumeric characters and hyphens
7. No duplicate IDs in index.json

Validation script: `scripts/validate_knowledge.py`
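
The checks above could be implemented roughly as follows. This sketch is not the contents of `scripts/validate_knowledge.py`, which is not shown here. The names `validate_fact` and `validate_index` are hypothetical, and the ID pattern assumes a colon-free domain and a three-digit sequence.

```python
import re

CATEGORIES = {"fact", "pitfall", "pattern", "tool-quirk", "question"}
# Assumption: the domain part has no colons; the sequence is exactly 3 digits.
ID_RE = re.compile(r"^[^:\s]+:[a-z-]+:\d{3}$")
TAG_RE = re.compile(r"^[a-z0-9-]+$")
MAX_FACT_LENGTH = 280


def validate_fact(fact: dict) -> list[str]:
    """Return a list of human-readable problems for one fact (checks 1-6)."""
    errors = []
    fact_id = fact.get("id")
    if not isinstance(fact_id, str) or not ID_RE.match(fact_id):
        errors.append("id does not match {domain}:{category}:{sequence}")
    category = fact.get("category")
    if not isinstance(category, str) or category not in CATEGORIES:
        errors.append("category is not one of the 5 allowed values")
    confidence = fact.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        errors.append("confidence is not between 0.0 and 1.0")
    text = fact.get("fact")
    if not isinstance(text, str) or not text or len(text) > MAX_FACT_LENGTH:
        errors.append("fact must be a non-empty string of at most 280 characters")
    domain = fact.get("domain")
    if not isinstance(domain, str) or not domain:
        errors.append("domain must be a non-empty string")
    for tag in fact.get("tags", []):
        if not isinstance(tag, str) or not TAG_RE.match(tag):
            errors.append(f"invalid tag: {tag!r}")
    return errors


def validate_index(facts: list[dict]) -> list[tuple[str, list[str]]]:
    """Validate every fact and flag duplicate IDs (check 7)."""
    problems = []
    seen_ids = set()
    for fact in facts:
        errors = validate_fact(fact)
        fact_id = fact.get("id")
        if isinstance(fact_id, str):
            if fact_id in seen_ids:
                errors.append("duplicate id")
            seen_ids.add(fact_id)
        if errors:
            problems.append((str(fact_id), errors))
    return problems
```

A fact that passes every check yields an empty list from `validate_fact`, so a caller can treat any non-empty result from `validate_index` as a reason to keep the offending facts out of the index.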