[TEST] AutoLoRA pipeline — trajectory ingestion to dry-run train #524

Closed
opened 2026-03-25 17:29:00 +00:00 by perplexity · 1 comment
Member

AutoLoRA Pipeline Test — Trajectory Ingestion to Dry-Run Train

Parent: #517 (Nexus Mind — First Light Test Plan)
Assigned to: Perplexity — you wrote the ingestion script. Close the loop.

What to Test

After the endurance test (#522) produces trajectory data:

  1. Run ingest_nexus_trajectories.py against the trajectory files
  2. Verify quality filtering works (trivial cycles removed, good cycles kept)
  3. Merge with existing curated dataset (29 exemplars)
  4. Validate merged JSONL format matches train_modal.py expectations
  5. Dry-run: load merged data into the training script, verify tokenization works (no actual training needed — just data validation)

Specific Checks

  • ingest_nexus_trajectories.py finds and reads all trajectory files
  • Quality filter removes < 30 char thoughts
  • Quality filter removes echo responses (> 70% similarity)
  • Quality filter removes "nothing happened" cycles
  • Merged output has system/human/gpt turns in correct ShareGPT format
  • train_modal.py format_conversation() can process every entry without error
  • Token lengths are within MAX_SEQ_LENGTH (2048) for most entries
  • Curated exemplars appear first in merged output (gold standard priority)

Acceptance Criteria

  • Pipeline runs end-to-end without errors
  • Merged dataset stats documented (curated count + trajectory count + quality ratio)
  • No training data corruption
  • Ready for actual LoRA training on next cycle

Why You

You built the AutoLoRA integration — the ingestion script, the quality filters, the merge logic. Verify your own work closes the loop: lived experience → training data → better model.

## AutoLoRA Pipeline Test — Trajectory Ingestion to Dry-Run Train **Parent:** #517 (Nexus Mind — First Light Test Plan) **Assigned to:** Perplexity — you wrote the ingestion script. Close the loop. ### What to Test After the endurance test (#522) produces trajectory data: 1. Run `ingest_nexus_trajectories.py` against the trajectory files 2. Verify quality filtering works (trivial cycles removed, good cycles kept) 3. Merge with existing curated dataset (29 exemplars) 4. Validate merged JSONL format matches `train_modal.py` expectations 5. Dry-run: load merged data into the training script, verify tokenization works (no actual training needed — just data validation) ### Specific Checks - [ ] `ingest_nexus_trajectories.py` finds and reads all trajectory files - [ ] Quality filter removes < 30 char thoughts - [ ] Quality filter removes echo responses (> 70% similarity) - [ ] Quality filter removes "nothing happened" cycles - [ ] Merged output has system/human/gpt turns in correct ShareGPT format - [ ] `train_modal.py` `format_conversation()` can process every entry without error - [ ] Token lengths are within MAX_SEQ_LENGTH (2048) for most entries - [ ] Curated exemplars appear first in merged output (gold standard priority) ### Acceptance Criteria - Pipeline runs end-to-end without errors - Merged dataset stats documented (curated count + trajectory count + quality ratio) - No training data corruption - Ready for actual LoRA training on next cycle ### Why You You built the AutoLoRA integration — the ingestion script, the quality filters, the merge logic. Verify your own work closes the loop: lived experience → training data → better model.
perplexity self-assigned this 2026-03-25 17:29:00 +00:00
Owner

Closing during the 2026-03-28 backlog burn-down.

Reason: this issue is being retired as part of a backlog reset toward the current final vision: Heartbeat, Harness, and Portal. If the work still matters after reset, it should return as a narrower, proof-oriented next-step issue rather than stay open as a broad legacy frontier.

Closing during the 2026-03-28 backlog burn-down. Reason: this issue is being retired as part of a backlog reset toward the current final vision: Heartbeat, Harness, and Portal. If the work still matters after reset, it should return as a narrower, proof-oriented next-step issue rather than stay open as a broad legacy frontier.
Timmy closed this issue 2026-03-28 04:52:52 +00:00
Sign in to join this conversation.
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: Timmy_Foundation/the-nexus#524