Build private Twitter archive pipeline foundation #4
Reference in New Issue
Block a user
Delete Branch "codex/twitter-archive-pipeline"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
scripts/twitter_archivetwitter-archive/in the repo so private artifacts stay local-onlyWhy
This is the tracked
timmy-homehalf of issue #3. It gives Timmy a deterministic extraction, consolidation, DPO-building, and pipeline-health layer without checking any raw archive data into git.Linked Issue
Verification
python3 -m py_compile scripts/twitter_archive/common.py scripts/twitter_archive/extract_archive.py scripts/twitter_archive/consolidate_profile.py scripts/twitter_archive/build_dpo_pairs.py scripts/twitter_archive/pipeline_health.py scripts/twitter_archive/evaluate_candidate.pyHOME=/tmp/timmy-archive-smoke TIMMY_TWITTER_ARCHIVE_SOURCE=/Users/apayne/Downloads/twitter-2026-03-27-d4471cc6eb6703034d592f870933561ebee374d9d9b90c9b8923abff064afc1e/data PYTHONPATH=/Users/apayne/autolora/worktrees/timmy-home python3 -m scripts.twitter_archive.extract_archiveHOME=/tmp/timmy-archive-smoke TIMMY_TWITTER_ARCHIVE_SOURCE=/Users/apayne/Downloads/twitter-2026-03-27-d4471cc6eb6703034d592f870933561ebee374d9d9b90c9b8923abff064afc1e/data PYTHONPATH=/Users/apayne/autolora/worktrees/timmy-home python3 -m scripts.twitter_archive.pipeline_healthReview: Approve
Reviewed against the five criteria from issue #3.
1. Privacy boundary — PASS
.gitignoreaddstwitter-archive/so no private artifacts get tracked.common.pyroute to~/.timmy/twitter-archive/— extracted tweets, notes, knowledge candidates, DPO pairs, eval outputs all stay local.specs/twitter-archive-learning-pipeline.mdexplicitly lists what is tracked (scripts, schemas, eval contracts) vs what is not (raw tweets, extracted text, batch notes, DPO pairs, eval outputs). Clean separation.resolve_source_dir()reads from env var or local config file — no hardcoded paths leak into tracked content (theDEFAULT_SOURCE_DIRconstant is a fallback but produces no tracked output).2. Repo boundaries — PASS
timmy-homeowns deterministic pipeline scripts (extract_archive.py,consolidate_profile.py,build_dpo_pairs.py,pipeline_health.py,evaluate_candidate.py) plus the spec and example config.timmy-configas stated in the spec and issue.3. Batch artifact output — PASS
build_dpo_pairs.pyemits DPO pairs with prompt/chosen/rejected/evidence_ids/session refs/rubric_scores/safety_flags. Idempotent viaprocessed_batches.jsonstate tracking.consolidate_profile.pymerges knowledge candidates intoprofile.jsonwith durable/provisional/retracted status, contradiction detection, and evidence-linked claims. Also emitschanges.jsonlchangelog.pipeline_health.pyproduces a comprehensive status check of all pipeline stages.evaluate_candidate.pyimplements the promotion gate contract: 5% improvement, no refusal regression, no source distinction regression, 95%+ evidence citation rate.4. Eval gates — PASS
evaluate_candidate.py:improvement >= 0.05,not refusal_regression,not source_regression,evidence_rate >= 0.95.pipeline_config.jsoncontract makes train/promote commands optional — if absent, artifacts are prepared but execution stays in ready state. No implicit automation.5. Code quality
common.pyis well-factored — shared constants, I/O helpers, normalization functions.DEFAULT_SOURCE_DIRincommon.pycontains a specific archive hash. This is fine as a convenience default and has no security impact since the source dir is overridable, but it could be simplified to a comment in a future pass.Approved and merging.
Review: Approve
Reviewed against the five criteria from issue #3.
1. Privacy boundary — PASS
.gitignoreaddstwitter-archive/so no private artifacts get tracked.common.pyroute to~/.timmy/twitter-archive/— extracted tweets, notes, knowledge candidates, DPO pairs, eval outputs all stay local.specs/twitter-archive-learning-pipeline.mdexplicitly lists what is tracked (scripts, schemas, eval contracts) vs what is not (raw tweets, extracted text, batch notes, DPO pairs, eval outputs). Clean separation.resolve_source_dir()reads from env var or local config file — no hardcoded paths leak into tracked content (theDEFAULT_SOURCE_DIRconstant is a fallback but produces no tracked output).2. Repo boundaries — PASS
timmy-homeowns deterministic pipeline scripts (extract_archive.py,consolidate_profile.py,build_dpo_pairs.py,pipeline_health.py,evaluate_candidate.py) plus the spec and example config.timmy-configas stated in the spec and issue.3. Batch artifact output — PASS
build_dpo_pairs.pyemits DPO pairs with prompt/chosen/rejected/evidence_ids/session refs/rubric_scores/safety_flags. Idempotent viaprocessed_batches.jsonstate tracking.consolidate_profile.pymerges knowledge candidates intoprofile.jsonwith durable/provisional/retracted status, contradiction detection, and evidence-linked claims. Also emitschanges.jsonlchangelog.pipeline_health.pyproduces a comprehensive status check of all pipeline stages.evaluate_candidate.pyimplements the promotion gate contract: 5% improvement, no refusal regression, no source distinction regression, 95%+ evidence citation rate.4. Eval gates — PASS
evaluate_candidate.py:improvement >= 0.05,not refusal_regression,not source_regression,evidence_rate >= 0.95.pipeline_config.jsoncontract makes train/promote commands optional — if absent, artifacts are prepared but execution stays in ready state. No implicit automation.5. Code quality
common.pyis well-factored — shared constants, I/O helpers, normalization functions.DEFAULT_SOURCE_DIRincommon.pycontains a specific archive hash. This is fine as a convenience default and has no security impact since the source dir is overridable, but it could be simplified to a comment in a future pass.Approved and merging.
perplexity referenced this pull request2026-03-28 00:34:18 +00:00