[EVAL] Live memory-provider snapshot matrix — training data pipeline #276

Rockachopa · 2026-04-26T17:03:42Z


Implement training data pipeline converting quality-gated knowledge entries into JSONL format.

Changes

Add scripts/knowledge_to_training_pairs.py: main pipeline script
- Reads knowledge/index.json (quality-gated entries)
- Applies configurable filters: --min-confidence, --model-filter, --after, --before
- Converts each entry to terse→rich pair via category-aware fact_to_terse()
- Outputs JSONL with fields: terse, rich, domain, source_confidence, source_model
Add tests/test_knowledge_to_training_pairs.py: smoke tests (9 tests)
- Unit: fact_to_terse() category handling, filter_entries() logic, entry_to_pair() structure
- Integration: end-to-end dry-run validates JSONL stats and output validity
- All tests pass locally
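
For readers without the diff open, the filter and conversion steps listed above could look roughly like the sketch below. This is not the code in this PR: the function names carry a `_sketch` suffix to make that clear, and the input keys (`fact`, `confidence`, `date`, `domain`, `model`) are assumptions about the shape of `knowledge/index.json` entries. Only the five output field names are taken from the description.

```python
import json
from datetime import date


def filter_entries_sketch(entries, min_confidence=0.0, models=None,
                          after=None, before=None):
    """Keep entries that pass every configured filter.

    The input keys used here are assumed, not confirmed by the PR.
    """
    kept = []
    for entry in entries:
        if entry.get("confidence", 0.0) < min_confidence:
            continue
        if models and entry.get("model", "unknown") not in models:
            continue
        if after or before:
            created = entry.get("date")
            if not created:
                continue  # undated entries cannot satisfy a date bound
            day = date.fromisoformat(created[:10])
            if after and day < after:
                continue
            if before and day > before:
                continue
        kept.append(entry)
    return kept


def entry_to_pair_sketch(entry, terse):
    """Build one output record with the five fields named in the description."""
    return {
        "terse": terse,
        "rich": entry["fact"],
        "domain": entry.get("domain", "global"),
        "source_confidence": entry.get("confidence", 0.0),
        "source_model": entry.get("model", "unknown"),
    }


def to_jsonl(pairs):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)
```

Treating undated entries as failing a date filter is a design choice of the sketch; the script may handle them differently.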

Deliverables (per #199)

✅ Pipeline script: scripts/knowledge_to_training_pairs.py
✅ End-to-end: knowledge/index.json → training_pairs.jsonl
✅ Configurable filters: confidence threshold, model whitelist, date range
✅ Tests: 9 smoke tests covering conversion, filtering, and an end-to-end run

Verification

# Dry-run (no output file)
python3 scripts/knowledge_to_training_pairs.py --dry-run

# With filters
python3 scripts/knowledge_to_training_pairs.py --min-confidence 0.7 --after 2026-04-01

# Full pipeline
python3 scripts/knowledge_to_training_pairs.py -o training_pairs.jsonl

Output example (default on current knowledge/):

29 entries → 29 training pairs
Domains: hermes-agent (8), the-nexus (6), global (15)
Avg confidence: 0.9034
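
Reviewers who want to recompute these summary figures from a generated file could use something like the following. It is a minimal sketch, independent of the script's own `--dry-run` reporting, and the helper name `summarize_pairs` is hypothetical. It relies only on the `domain` and `source_confidence` output fields listed under Changes.

```python
import json
from collections import Counter


def summarize_pairs(jsonl_text):
    """Count pairs per domain and average source_confidence over JSONL text."""
    pairs = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    domains = Counter(p["domain"] for p in pairs)
    total = sum(p["source_confidence"] for p in pairs)
    avg = total / len(pairs) if pairs else 0.0
    return {
        "pairs": len(pairs),
        "domains": dict(domains),
        "avg_confidence": round(avg, 4),
    }
```

Typical use would be `summarize_pairs(open("training_pairs.jsonl", encoding="utf-8").read())`, which also fails loudly if any line is not valid JSON.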

Notes

source_model defaults to "unknown" because current index.json entries do not store model provenance; it can be populated if/when provenance tracking is added upstream
fact_to_terse uses simple suffix-based heuristics — sufficient for first-pass data generation; can be refined (e.g., LLM-backed rephrasing) in future iterations
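
To give a feel for what a heuristic of this kind might involve, here is one possible reading: drop trailing punctuation and trailing clauses, keep a short leading phrase, and prefix a category label. It is an illustration only. The actual rules in `fact_to_terse()` are in the diff and may differ, and the category names, labels, separators, and word limit below are all hypothetical.

```python
# Hypothetical category labels; the real category set is not shown in this thread.
LABELS = {"pitfall": "Pitfall:", "pattern": "Pattern:"}
DEFAULT_LABEL = "Fact:"

# Markers after which the remainder of the sentence is treated as a droppable suffix.
CLAUSE_SEPARATORS = (";", ",", " because ", " when ")


def fact_to_terse_sketch(fact, category="fact", max_words=6):
    """Shorten a fact to a labelled leading phrase by trimming its suffix."""
    text = fact.strip().rstrip(".!?")
    for sep in CLAUSE_SEPARATORS:
        idx = text.find(sep)
        if idx > 0:
            text = text[:idx]  # keep only what precedes the separator
    words = text.split()[:max_words]
    label = LABELS.get(category, DEFAULT_LABEL)
    return f"{label} {' '.join(words)}"


print(fact_to_terse_sketch(
    "Retry logic fails silently when the token expires, so log every attempt.",
    category="pitfall",
))
# prints: Pitfall: Retry logic fails silently
```

A rule-based approach like this is cheap and deterministic, which suits first-pass data generation, but it produces stilted phrasing on long or unusual sentences.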

Closes #199


Rockachopa added 1 commit 2026-04-26 17:03:43 +00:00

feat: training data pipeline — knowledge entries → JSONL training pairs

Test / pytest (pull_request) Failing after 7s

Details

86eb1c9a50

Add scripts/knowledge_to_training_pairs.py which reads quality-gated
knowledge entries from knowledge/index.json and emits terse→rich
training pairs in JSONL format.

Features:
- Derives terse queries from facts via category-aware heuristics
- Configurable quality filters: min-confidence, model-filter, date range
- Output includes domain, source_confidence, source_model
- Smoke tests added in tests/test_knowledge_to_training_pairs.py

Deliverables for #199:
1. Pipeline script: scripts/knowledge_to_training_pairs.py
2. End-to-end: knowledge/index.json → training_pairs.jsonl (or custom JSONL)
3. Config: min-confidence, model-filter, after/before date filters
4. Test: 9 smoke tests covering conversion, filtering, and end-to-end run

Closes #199

Timmy commented

2026-05-02 18:34:37 +00:00

🛡️ Goblin Patrol Alert 🛡️

Hey brother — this PR has been idle for 6 days and is unassigned.

The goblin fleet has been notified. A goblin may claim this if it remains stale.

— Timmy Goblin Wizard King



Checkout

From your project repository, check out a new branch and test the changes.

git fetch -u origin step35/199-feat-training-data-pipeline:step35/199-feat-training-data-pipeline

git checkout step35/199-feat-training-data-pipeline


Reference: Timmy_Foundation/compounding-intelligence#276