[PIPELINE] Trajectory sanitization — strip sensitive metadata before DPO #6

Closed
opened 2026-03-27 22:48:31 +00:00 by perplexity · 1 comment
Member

Source

Sovereign Developer's Brief (2026-03-27), Section 4 "Low-Hanging Fruit", item 3.

What

Write a utility script under scripts/ that sanitizes raw Hermes session files before they enter the DPO pipeline. Must strip:

  • API keys, tokens, passwords that may appear in tool-call outputs
  • IP addresses and hostnames other than those of the known VPS
  • Email addresses except Alexander's known addresses
  • File paths containing usernames (normalize to ~/)
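
A minimal sketch of how the four categories above might map onto regex substitutions. The patterns, the placeholder strings, and the `redact` name are all illustrative assumptions rather than anything specified in the brief. Real secret formats vary by provider, and hostname matching is not covered here.

```python
import re

# Hypothetical patterns: deliberately simple and in need of tuning on real sessions.
SECRET_RE = re.compile(
    r"(?i)\b(api[_-]?key|token|password)([\"']?\s*[=:]\s*[\"']?)[^\s\"',;]+"
)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
HOME_RE = re.compile(r"(?:/home|/Users)/[^/\s\"']+")


def redact(text, allowlist=frozenset()):
    """Replace sensitive substrings; exact values in `allowlist` are kept."""

    def keep_or(placeholder):
        # Build a replacement callback that spares allowlisted matches.
        def repl(match):
            value = match.group(0)
            return value if value in allowlist else placeholder
        return repl

    # Keep the key name and separator, replace only the secret value.
    text = SECRET_RE.sub(r"\1\2[REDACTED_SECRET]", text)
    text = IPV4_RE.sub(keep_or("[REDACTED_IP]"), text)
    text = EMAIL_RE.sub(keep_or("[REDACTED_EMAIL]"), text)
    # Normalize /home/<user> and /Users/<user> prefixes to ~.
    text = HOME_RE.sub("~", text)
    return text
```

The placeholders are chosen so that none of the patterns turns them into anything different, which keeps a second pass over already-redacted text a no-op.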

Design

  • Input: raw session JSON from ~/.hermes/sessions/
  • Output: sanitized copy in ~/.timmy/training-data/sanitized/
  • Regex-based with a configurable allowlist for known-safe values
  • Idempotent: re-running on already-sanitized files produces identical output
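
One possible shape for the input/output flow described above, sketched under two assumptions: each session is a single `*.json` document, and sensitive data lives in string values rather than in keys. `sanitize_value` and `sanitize_dir` are hypothetical names, and `redact` stands for whatever string-level redaction function the script ends up using.

```python
import json
from pathlib import Path
from typing import Any, Callable


def sanitize_value(value: Any, redact: Callable[[str], str]) -> Any:
    """Recursively apply `redact` to every string value in a JSON structure."""
    if isinstance(value, str):
        return redact(value)
    if isinstance(value, list):
        return [sanitize_value(item, redact) for item in value]
    if isinstance(value, dict):
        # Keys are left untouched (assumption: they carry no sensitive data).
        return {key: sanitize_value(item, redact) for key, item in value.items()}
    return value


def sanitize_dir(input_dir: Path, output_dir: Path, redact: Callable[[str], str]) -> int:
    """Write a sanitized copy of each session file; return the number processed."""
    output_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(input_dir.glob("*.json")):
        data = json.loads(src.read_text(encoding="utf-8"))
        clean = sanitize_value(data, redact)
        # Fixed serialization settings keep the output byte-stable across runs.
        text = json.dumps(clean, indent=2, ensure_ascii=False) + "\n"
        (output_dir / src.name).write_text(text, encoding="utf-8")
        count += 1
    return count
```

With this layout, idempotency reduces to two properties: `redact` must leave its own output unchanged, and serialization must be deterministic. Both are straightforward to assert in a test by sanitizing the sanitized directory and comparing bytes.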

Acceptance

  • python3 -m scripts.trajectory_sanitize --input ~/.hermes/sessions/ --output ~/.timmy/training-data/sanitized/
  • Unit tests with synthetic sessions containing known sensitive patterns
  • Integrated into the session_export task as a pre-processing step
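
The command line in the first acceptance item implies an entry point that accepts `--input` and `--output`. A sketch of that argument handling using `argparse` follows; the `parse_args` helper name and the help strings are assumptions, and only the two flags come from the issue.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two flags named in the acceptance command."""
    parser = argparse.ArgumentParser(
        prog="trajectory_sanitize",
        description="Sanitize raw Hermes session files before DPO.",
    )
    parser.add_argument("--input", required=True, type=Path,
                        help="directory of raw session JSON files")
    parser.add_argument("--output", required=True, type=Path,
                        help="directory for sanitized copies")
    args = parser.parse_args(argv)
    # The shell usually expands ~, but quoted arguments arrive unexpanded.
    args.input = args.input.expanduser()
    args.output = args.output.expanduser()
    return args
```

Accepting an explicit `argv` list keeps the parser callable from unit tests without touching `sys.argv`.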

Related

  • Existing session_export task in tasks.py currently copies sessions without sanitization.
  • Archive pipeline (issue #3, now closed) already handles privacy for Twitter data but not for general Hermes sessions.
Timmy was assigned by Rockachopa 2026-03-28 03:52:32 +00:00
Owner

Closing as duplicate during backlog burn-down. Canonical issue: #5.

Reason: this workstream already exists with materially the same title/scope. Keeping one canonical thread prevents agent churn and review waste.

Timmy closed this issue 2026-03-28 04:45:34 +00:00
Reference: Timmy_Foundation/timmy-home#6