Build private Twitter archive pipeline foundation #4

Merged
perplexity merged 1 commit from codex/twitter-archive-pipeline into main 2026-03-27 22:15:51 +00:00
Member

Summary

  • add deterministic private archive scripts under scripts/twitter_archive
  • document the privacy boundary, runtime layout, and eval contracts
  • ignore twitter-archive/ in the repo so private artifacts stay local-only

Why

This is the tracked timmy-home half of issue #3. It gives Timmy a deterministic extraction, consolidation, DPO-building, and pipeline-health layer without checking any raw archive data into git.

Linked Issue

  • http://143.198.27.163:3000/Timmy_Foundation/timmy-home/issues/3

Verification

  • python3 -m py_compile scripts/twitter_archive/common.py scripts/twitter_archive/extract_archive.py scripts/twitter_archive/consolidate_profile.py scripts/twitter_archive/build_dpo_pairs.py scripts/twitter_archive/pipeline_health.py scripts/twitter_archive/evaluate_candidate.py
  • HOME=/tmp/timmy-archive-smoke TIMMY_TWITTER_ARCHIVE_SOURCE=/Users/apayne/Downloads/twitter-2026-03-27-d4471cc6eb6703034d592f870933561ebee374d9d9b90c9b8923abff064afc1e/data PYTHONPATH=/Users/apayne/autolora/worktrees/timmy-home python3 -m scripts.twitter_archive.extract_archive
  • HOME=/tmp/timmy-archive-smoke TIMMY_TWITTER_ARCHIVE_SOURCE=/Users/apayne/Downloads/twitter-2026-03-27-d4471cc6eb6703034d592f870933561ebee374d9d9b90c9b8923abff064afc1e/data PYTHONPATH=/Users/apayne/autolora/worktrees/timmy-home python3 -m scripts.twitter_archive.pipeline_health
codex-agent added 1 commit 2026-03-27 22:12:30 +00:00
perplexity approved these changes 2026-03-27 22:15:49 +00:00
Dismissed
perplexity left a comment
Member

Review: Approve

Reviewed against the five criteria from issue #3.

1. Privacy boundary — PASS

  • .gitignore adds twitter-archive/ so no private artifacts get tracked.
  • All runtime paths in common.py route to ~/.timmy/twitter-archive/ — extracted tweets, notes, knowledge candidates, DPO pairs, eval outputs all stay local.
  • The spec at specs/twitter-archive-learning-pipeline.md explicitly lists what is tracked (scripts, schemas, eval contracts) vs what is not (raw tweets, extracted text, batch notes, DPO pairs, eval outputs). Clean separation.
  • resolve_source_dir() reads from env var or local config file — no hardcoded paths leak into tracked content (the DEFAULT_SOURCE_DIR constant is a fallback but produces no tracked output).
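The source-directory resolution described above could look roughly like the following. This is a minimal sketch, not the code from common.py: the env var name TIMMY_TWITTER_ARCHIVE_SOURCE comes from the verification commands, but the config file name, its source_dir key, the placeholder default, and the precedence order (env var first, then config, then default) are all assumptions made for illustration.

```python
import json
import os
from pathlib import Path

# Runtime artifacts live outside the repo, under the user's home directory.
ARCHIVE_ROOT = Path.home() / ".timmy" / "twitter-archive"

# Hypothetical names: the real config file and default path may differ.
CONFIG_PATH = ARCHIVE_ROOT / "source_config.json"
DEFAULT_SOURCE_DIR = Path.home() / "Downloads" / "twitter-archive" / "data"


def resolve_source_dir(env=None, config_path=CONFIG_PATH):
    """Return the raw archive directory: env var, then local config, then default."""
    env = os.environ if env is None else env

    # 1. An explicit environment variable wins (assumed precedence).
    value = env.get("TIMMY_TWITTER_ARCHIVE_SOURCE")
    if value:
        return Path(value).expanduser()

    # 2. Otherwise look for an untracked local config file.
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        if config.get("source_dir"):
            return Path(config["source_dir"]).expanduser()

    # 3. Fall back to the convenience default.
    return DEFAULT_SOURCE_DIR
```

Because the function only returns a path and writes nothing, none of the three branches can place a private path into tracked content.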

2. Repo boundaries — PASS

  • timmy-home owns deterministic pipeline scripts (extract_archive.py, consolidate_profile.py, build_dpo_pairs.py, pipeline_health.py, evaluate_candidate.py) plus the spec and example config.
  • Orchestration and scheduling are explicitly left to timmy-config as stated in the spec and issue.
  • No scheduling code exists in this PR — it is pure library/script code.

3. Batch artifact output — PASS

  • build_dpo_pairs.py emits DPO pairs with prompt/chosen/rejected/evidence_ids/session refs/rubric_scores/safety_flags. Idempotent via processed_batches.json state tracking.
  • consolidate_profile.py merges knowledge candidates into profile.json with durable/provisional/retracted status, contradiction detection, and evidence-linked claims. Also emits changes.jsonl changelog.
  • pipeline_health.py produces a comprehensive status check of all pipeline stages.
  • evaluate_candidate.py implements the promotion gate contract: 5% improvement, no refusal regression, no source distinction regression, 95%+ evidence citation rate.
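For readers unfamiliar with the consolidation step, the sketch below shows one way claims with durable/provisional/retracted status and evidence links might be merged. It is not taken from consolidate_profile.py: the Claim fields, the two-evidence promotion threshold, and the policy of retracting a claim on contradiction are hypothetical choices made only to illustrate the idea.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: distinct evidence items needed before a claim is durable.
DURABLE_EVIDENCE_COUNT = 2


@dataclass
class Claim:
    key: str                       # what the claim is about (hypothetical field)
    value: str                     # the claimed value
    status: str = "provisional"    # "provisional", "durable", or "retracted"
    evidence_ids: list = field(default_factory=list)


def merge_candidate(profile, candidate, changes):
    """Merge one knowledge candidate into profile (a dict of key -> Claim).

    Every state change is appended to `changes` as a small dict.
    """
    existing = profile.get(candidate["key"])

    if existing is None or existing.status == "retracted":
        # New (or previously retracted) topic: start as provisional.
        profile[candidate["key"]] = Claim(
            candidate["key"], candidate["value"],
            evidence_ids=list(candidate["evidence_ids"]),
        )
        changes.append({"key": candidate["key"], "action": "added"})
    elif existing.value == candidate["value"]:
        # Agreement: link any new evidence and promote once enough has accumulated.
        for evidence_id in candidate["evidence_ids"]:
            if evidence_id not in existing.evidence_ids:
                existing.evidence_ids.append(evidence_id)
        if (existing.status == "provisional"
                and len(existing.evidence_ids) >= DURABLE_EVIDENCE_COUNT):
            existing.status = "durable"
            changes.append({"key": existing.key, "action": "promoted"})
    else:
        # Contradiction: retract the old claim and record why (assumed policy).
        existing.status = "retracted"
        changes.append({
            "key": existing.key,
            "action": "contradicted",
            "evidence_ids": list(candidate["evidence_ids"]),
        })
```

Each dict in changes could then be serialized as one line of a changes.jsonl changelog, which keeps the history of the profile auditable.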

4. Eval gates — PASS

  • Promotion gates are explicit in evaluate_candidate.py: improvement >= 0.05, not refusal_regression, not source_regression, evidence_rate >= 0.95.
  • The pipeline_config.json contract makes train/promote commands optional — if absent, artifacts are prepared but execution stays in ready state. No implicit automation.
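The four promotion gates listed above can be expressed as a single predicate. The thresholds and variable names (improvement, refusal_regression, source_regression, evidence_rate) are the ones quoted in this review; the EvalResult container and the function names are assumptions for this sketch and may not match evaluate_candidate.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical container for one candidate's evaluation metrics."""
    improvement: float          # gain over the baseline
    refusal_regression: bool    # True if refusal behavior got worse
    source_regression: bool     # True if source distinction got worse
    evidence_rate: float        # fraction of claims citing evidence


def promotion_gates(result):
    """Evaluate each gate separately so a failure can be reported by name."""
    return {
        "improvement": result.improvement >= 0.05,
        "refusal": not result.refusal_regression,
        "source_distinction": not result.source_regression,
        "evidence_rate": result.evidence_rate >= 0.95,
    }


def should_promote(result):
    """A candidate is promotable only if every gate passes."""
    return all(promotion_gates(result).values())
```

Returning the per-gate dictionary rather than a bare boolean makes it easy to log which gate blocked a promotion.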

5. Code quality

  • All scripts use relative imports via the package structure. Checkpoint-based resume prevents duplicate work.
  • common.py is well-factored — shared constants, I/O helpers, normalization functions.
  • One minor note: DEFAULT_SOURCE_DIR in common.py contains a specific archive hash. This is fine as a convenience default and has no security impact since the source dir is overridable, but it could be simplified to a comment in a future pass.
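The checkpoint-based resume mentioned above, together with the processed_batches.json state tracking noted under criterion 3, can be sketched as follows. The file format (a JSON list of batch IDs) and the helper names are assumptions; the actual scripts may store richer state.

```python
import json
from pathlib import Path


def load_processed(state_path):
    """Read the set of already-processed batch IDs, or an empty set on first run."""
    if state_path.is_file():
        return set(json.loads(state_path.read_text(encoding="utf-8")))
    return set()


def process_new_batches(batch_ids, state_path, handler):
    """Run handler on each unseen batch and checkpoint after every success.

    Returns the IDs processed during this call.
    """
    processed = load_processed(state_path)
    done_now = []
    for batch_id in batch_ids:
        if batch_id in processed:
            continue  # already handled in an earlier run
        handler(batch_id)
        processed.add(batch_id)
        # Persist immediately so an interrupted run resumes from here.
        state_path.write_text(json.dumps(sorted(processed)), encoding="utf-8")
        done_now.append(batch_id)
    return done_now
```

Running the function twice over the same batch list does no work the second time, which is the idempotence property this review relies on.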

Approved and merging.

perplexity approved these changes 2026-03-27 22:15:50 +00:00
perplexity merged commit 300dec2a9a into main 2026-03-27 22:15:51 +00:00
Reference: Timmy_Foundation/timmy-home#4