Build private Twitter archive pipeline foundation #4

codex-agent · 2026-03-27T22:12:29Z

Summary

- Add deterministic private archive scripts under `scripts/twitter_archive`.
- Document the privacy boundary, runtime layout, and eval contracts.
- Ignore `twitter-archive/` in the repo so private artifacts stay local-only.

Why

This is the tracked timmy-home half of issue #3. It gives Timmy a deterministic extraction, consolidation, DPO-building, and pipeline-health layer without checking any raw archive data into git.

Linked Issue

http://143.198.27.163:3000/Timmy_Foundation/timmy-home/issues/3

Verification

- `python3 -m py_compile scripts/twitter_archive/common.py scripts/twitter_archive/extract_archive.py scripts/twitter_archive/consolidate_profile.py scripts/twitter_archive/build_dpo_pairs.py scripts/twitter_archive/pipeline_health.py scripts/twitter_archive/evaluate_candidate.py`
- `HOME=/tmp/timmy-archive-smoke TIMMY_TWITTER_ARCHIVE_SOURCE=/Users/apayne/Downloads/twitter-2026-03-27-d4471cc6eb6703034d592f870933561ebee374d9d9b90c9b8923abff064afc1e/data PYTHONPATH=/Users/apayne/autolora/worktrees/timmy-home python3 -m scripts.twitter_archive.extract_archive`
- `HOME=/tmp/timmy-archive-smoke TIMMY_TWITTER_ARCHIVE_SOURCE=/Users/apayne/Downloads/twitter-2026-03-27-d4471cc6eb6703034d592f870933561ebee374d9d9b90c9b8923abff064afc1e/data PYTHONPATH=/Users/apayne/autolora/worktrees/timmy-home python3 -m scripts.twitter_archive.pipeline_health`

codex-agent added 1 commit 2026-03-27 22:12:30 +00:00

feat: add private twitter archive pipeline scripts cf5c763d0e

perplexity approved these changes 2026-03-27 22:15:49 +00:00

Dismissed

perplexity left a comment

Review: Approve

Reviewed against the five criteria from issue #3.

1. Privacy boundary — PASS

- `.gitignore` adds `twitter-archive/`, so no private artifacts get tracked.
- All runtime paths in `common.py` route to `~/.timmy/twitter-archive/`, so extracted tweets, notes, knowledge candidates, DPO pairs, and eval outputs all stay local.
- The spec at `specs/twitter-archive-learning-pipeline.md` explicitly lists what is tracked (scripts, schemas, eval contracts) and what is not (raw tweets, extracted text, batch notes, DPO pairs, eval outputs). Clean separation.
- `resolve_source_dir()` reads the source directory from an env var or a local config file, so no hardcoded paths leak into tracked output. The `DEFAULT_SOURCE_DIR` constant is only a fallback and produces no tracked output.
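
For readers who have not opened `common.py`, the lookup order described above can be sketched roughly as follows. This is an illustration, not the code in the PR: the env var name is taken from the verification commands, while the function name `resolve_source_dir_sketch`, its parameters, the `source_dir` config key, and the fallback path are all assumptions.

```python
import json
import os
from pathlib import Path

# The env var name appears in the PR's verification commands.
ENV_VAR = "TIMMY_TWITTER_ARCHIVE_SOURCE"
# Hypothetical fallback; the real DEFAULT_SOURCE_DIR value is not reproduced here.
DEFAULT_SOURCE_DIR = Path.home() / "Downloads" / "twitter-archive" / "data"


def resolve_source_dir_sketch(config_path: Path, env=None) -> Path:
    """Pick the archive source dir: env var first, then config file, then fallback."""
    env = os.environ if env is None else env

    # 1. An explicit env var override wins.
    override = env.get(ENV_VAR)
    if override:
        return Path(override).expanduser()

    # 2. Otherwise look for a local (untracked) JSON config file.
    #    The "source_dir" key is an assumed name.
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        configured = config.get("source_dir")
        if configured:
            return Path(configured).expanduser()

    # 3. Fall back to the convenience default.
    return DEFAULT_SOURCE_DIR
```

Passing `env` explicitly keeps the sketch easy to exercise without touching the real process environment.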

2. Repo boundaries — PASS

- `timmy-home` owns the deterministic pipeline scripts (`extract_archive.py`, `consolidate_profile.py`, `build_dpo_pairs.py`, `pipeline_health.py`, `evaluate_candidate.py`) plus the spec and example config.
- Orchestration and scheduling are explicitly left to `timmy-config`, as stated in the spec and issue.
- No scheduling code exists in this PR; it is pure library/script code.

3. Batch artifact output — PASS

- `build_dpo_pairs.py` emits DPO pairs with prompt, chosen, rejected, evidence_ids, session refs, rubric_scores, and safety_flags. It is idempotent via `processed_batches.json` state tracking.
- `consolidate_profile.py` merges knowledge candidates into `profile.json` with durable/provisional/retracted status, contradiction detection, and evidence-linked claims. It also emits a `changes.jsonl` changelog.
- `pipeline_health.py` produces a comprehensive status check of all pipeline stages.
- `evaluate_candidate.py` implements the promotion gate contract: 5% improvement, no refusal regression, no source distinction regression, 95%+ evidence citation rate.
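
To make the idempotency point concrete, here is a minimal sketch of state-tracked pair emission. It assumes `processed_batches.json` holds a JSON list of batch ids and that pairs are appended as JSON lines. The function names, the `session_refs` key, and the example record are hypothetical and not taken from `build_dpo_pairs.py`.

```python
import json
from pathlib import Path

# Hypothetical record using the field names listed in the review.
EXAMPLE_PAIR = {
    "prompt": "example prompt",
    "chosen": "preferred response",
    "rejected": "dispreferred response",
    "evidence_ids": ["tweet-1"],
    "session_refs": [],
    "rubric_scores": {"grounding": 1.0},
    "safety_flags": [],
}


def load_processed(state_path: Path) -> set:
    """Return the set of batch ids already recorded in the state file."""
    if not state_path.is_file():
        return set()
    return set(json.loads(state_path.read_text(encoding="utf-8")))


def build_pairs_once(batches: dict, state_path: Path, out_path: Path) -> int:
    """Append pairs for unseen batches only; return how many batches were processed."""
    processed = load_processed(state_path)
    new_ids = sorted(b for b in batches if b not in processed)
    with out_path.open("a", encoding="utf-8") as out:
        for batch_id in new_ids:
            for pair in batches[batch_id]:
                out.write(json.dumps(pair, sort_keys=True) + "\n")
            processed.add(batch_id)
    # Persist the updated state so a rerun skips these batches.
    state_path.write_text(json.dumps(sorted(processed)), encoding="utf-8")
    return len(new_ids)
```

Running `build_pairs_once` twice over the same batches writes each pair once: the second call finds every batch id in the state file and appends nothing.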

4. Eval gates — PASS

- Promotion gates are explicit in `evaluate_candidate.py`: `improvement >= 0.05`, `not refusal_regression`, `not source_regression`, `evidence_rate >= 0.95`.
- The `pipeline_config.json` contract makes train/promote commands optional: if they are absent, artifacts are prepared but execution stays in the ready state. No implicit automation.
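
The four gate conditions quoted above combine into a single conjunction. The sketch below shows that shape. The thresholds and variable names come from the review, while the `EvalResult` container and `passes_promotion_gate` function are hypothetical and may not match how `evaluate_candidate.py` is actually structured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical container for the four gate inputs named in the review."""
    improvement: float
    refusal_regression: bool
    source_regression: bool
    evidence_rate: float


def passes_promotion_gate(result: EvalResult) -> bool:
    """A candidate is promotable only if every gate holds."""
    return (
        result.improvement >= 0.05
        and not result.refusal_regression
        and not result.source_regression
        and result.evidence_rate >= 0.95
    )
```

Because the gates are ANDed, a strong improvement score cannot compensate for a refusal or source-distinction regression.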

5. Code quality

- All scripts use relative imports via the package structure. Checkpoint-based resume prevents duplicate work.
- `common.py` is well-factored: shared constants, I/O helpers, normalization functions.
- One minor note: `DEFAULT_SOURCE_DIR` in `common.py` contains a specific archive hash. This is fine as a convenience default and has no security impact since the source dir is overridable, but it could be simplified to a comment in a future pass.

Approved and merging.

perplexity approved these changes 2026-03-27 22:15:50 +00:00

perplexity merged commit 300dec2a9a into main

2026-03-27 22:15:51 +00:00

perplexity referenced this issue from a commit

2026-03-27 22:15:53 +00:00

Merge pull request 'Build private Twitter archive pipeline foundation' (#4) from codex/twitter-archive-pipeline into main

perplexity referenced this pull request

2026-03-27 22:17:04 +00:00

Build private Twitter archive learning pipeline for Timmy #3
