[EVAL] Live memory-provider snapshot matrix — training data pipeline #276

Open
Rockachopa wants to merge 1 commit from step35/199-feat-training-data-pipeline into main
Owner

Implement training data pipeline converting quality-gated knowledge entries into JSONL format.

Changes

  • Add scripts/knowledge_to_training_pairs.py: main pipeline script

    • Reads knowledge/index.json (quality-gated entries)
    • Applies configurable filters: --min-confidence, --model-filter, --after, --before
    • Converts each entry to a terse→rich pair via category-aware fact_to_terse()
    • Outputs JSONL with fields: terse, rich, domain, source_confidence, source_model
  • Add tests/test_knowledge_to_training_pairs.py: smoke tests (9 tests)

    • Unit: fact_to_terse() category handling, filter_entries() logic, entry_to_pair() structure
    • Integration: end-to-end dry-run validates JSONL stats and output validity
    • All tests pass locally
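
The filter-then-convert flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `keep_entry` and `to_pair`, the entry keys `fact`, `confidence`, `created`, and `model`, the inclusive date bounds, the `"global"` fallback domain, and the use of the full fact as the `rich` side are all assumptions made for the example. Only the five output field names and the `"unknown"` default for `source_model` come from the description.

```python
import json
from datetime import date


def keep_entry(entry, min_confidence=0.0, after=None, before=None):
    """Hypothetical filter: confidence threshold plus an inclusive date window."""
    if entry.get("confidence", 0.0) < min_confidence:
        return False
    created = entry.get("created")  # assumed ISO date string, e.g. "2026-04-10"
    if created is not None:
        day = date.fromisoformat(created)
        if after is not None and day < after:
            return False
        if before is not None and day > before:
            return False
    return True


def to_pair(entry, make_terse):
    """Hypothetical converter producing the five output fields named in the PR."""
    return {
        "terse": make_terse(entry["fact"]),
        "rich": entry["fact"],  # assumption: the rich side is the full fact
        "domain": entry.get("domain", "global"),  # assumed fallback domain
        "source_confidence": entry.get("confidence", 0.0),
        "source_model": entry.get("model", "unknown"),  # default noted in the PR
    }


# Illustrative entries only; the real index.json schema may differ.
entries = [
    {"fact": "The gateway retries failed requests three times.",
     "domain": "hermes-agent", "confidence": 0.95, "created": "2026-04-10"},
    {"fact": "Old note.", "domain": "global",
     "confidence": 0.5, "created": "2026-03-01"},
]

# Stand-in for fact_to_terse(): lowercase and drop the trailing period.
pairs = [
    to_pair(e, lambda fact: fact.rstrip(".").lower())
    for e in entries
    if keep_entry(e, min_confidence=0.7, after=date(2026, 4, 1))
]

# One JSON object per line is what makes the output JSONL.
jsonl = "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)
print(jsonl)
```

Passing the terse-derivation function in as an argument keeps the sketch independent of how fact_to_terse() is actually implemented.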

Deliverables (per #199)

  1. Pipeline script: scripts/knowledge_to_training_pairs.py
  2. End-to-end: knowledge/index.json → training_pairs.jsonl
  3. Configurable filters: confidence threshold, model whitelist, date range
  4. Tests: 9 smoke tests covering all integration points

Verification

# Dry-run (no output file)
python3 scripts/knowledge_to_training_pairs.py --dry-run

# With filters
python3 scripts/knowledge_to_training_pairs.py --min-confidence 0.7 --after 2026-04-01

# Full pipeline
python3 scripts/knowledge_to_training_pairs.py -o training_pairs.jsonl

Output example (default on current knowledge/):

  • 29 entries → 29 training pairs
  • Domains: hermes-agent (8), the-nexus (6), global (15)
  • Avg confidence: 0.9034
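
For reviewers who want to recompute these figures from an output file, a small sketch along these lines could work. It assumes only the `domain` and `source_confidence` fields listed above; the `summarize` helper and the sample records are hypothetical and are not part of the PR.

```python
import json
from collections import Counter


def summarize(lines):
    """Return (pair count, per-domain counts, average confidence) for JSONL lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    domains = Counter(r["domain"] for r in records)
    total = sum(r["source_confidence"] for r in records)
    avg = round(total / len(records), 4) if records else 0.0
    return len(records), domains, avg


# Made-up sample lines; in practice, pass an open file such as training_pairs.jsonl.
sample = [
    '{"terse": "a", "rich": "A.", "domain": "global", '
    '"source_confidence": 0.9, "source_model": "unknown"}',
    '{"terse": "b", "rich": "B.", "domain": "the-nexus", '
    '"source_confidence": 0.8, "source_model": "unknown"}',
]

count, domains, avg = summarize(sample)
print(count, dict(domains), avg)
```

Because an open file iterates line by line, `summarize(open("training_pairs.jsonl"))` should work the same way on real output.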

Notes

  • source_model defaults to "unknown" as model provenance is not stored in current index.json entries; can be populated if/when provenance tracking is added upstream
  • fact_to_terse uses simple suffix-based heuristics — sufficient for first-pass data generation; can be refined (e.g., LLM-backed rephrasing) in future iterations

Closes #199

Rockachopa added 1 commit 2026-04-26 17:03:43 +00:00
feat: training data pipeline — knowledge entries → JSONL training pairs
Some checks failed
Test / pytest (pull_request) Failing after 7s
86eb1c9a50
Add scripts/knowledge_to_training_pairs.py which reads quality-gated
knowledge entries from knowledge/index.json and emits terse→rich
training pairs in JSONL format.

Features:
- Derives terse queries from facts via category-aware heuristics
- Configurable quality filters: min-confidence, model-filter, date range
- Output includes domain, source_confidence, source_model
- Smoke tests added in tests/test_knowledge_to_training_pairs.py

Deliverables for #199:
1. Pipeline script: scripts/knowledge_to_training_pairs.py
2. End-to-end: knowledge/index.json → training_pairs.jsonl (or custom JSONL)
3. Config: min-confidence, model-filter, after/before date filters
4. Test: 9 smoke tests covering conversion, filtering, and end-to-end run

Closes #199
Owner

🛡️ Goblin Patrol Alert 🛡️

Hey brother — this PR has been idle for 6 days and is unassigned.

The goblin fleet has been notified. A goblin may claim this if it remains stale.

— Timmy Goblin Wizard King

This pull request can be merged automatically.
This branch is out-of-date with the base branch