# Twitter Archive Learning Pipeline

This repo owns the tracked code, schemas, prompts, and eval contracts for
Timmy's private Twitter archive learning loop.

## Privacy Boundary

- Raw archive files stay outside git.
- Derived runtime artifacts live under `~/.timmy/twitter-archive/`.
- `twitter-archive/` is ignored by `timmy-home` so private notes and training
  artifacts do not get pushed by accident.

Tracked here:
- deterministic extraction and consolidation scripts
- output schemas
- eval gate contract
- prompt/orchestration code in `timmy-config`

Not tracked here:
- raw tweets
- extracted tweet text
- batch notes
- private profile artifacts
- local-only DPO pairs
- local eval outputs

## Runtime Layout

The runtime workspace is:

```text
~/.timmy/twitter-archive/
  extracted/
  notes/
  knowledge/
  insights/
  training/
  checkpoint.json
  metrics/progress.json
  source_config.json
  pipeline_config.json
```

## Source Config

Optional local file `source_config.json`:

```json
{
  "source_path": "~/Downloads/twitter-.../data"
}
```

Environment override:

```bash
TIMMY_TWITTER_ARCHIVE_SOURCE=~/Downloads/twitter-.../data
```
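A minimal sketch of how the two settings could be resolved, assuming the environment variable takes precedence over `source_config.json` (as "override" suggests) and that `~` is expanded at load time. The function name `resolve_source_path` and the `None` fallback are hypothetical, not part of the tracked scripts.

```python
import json
import os
from pathlib import Path

ENV_VAR = "TIMMY_TWITTER_ARCHIVE_SOURCE"
CONFIG_PATH = Path("~/.timmy/twitter-archive/source_config.json")


def resolve_source_path(env=None, config_path=CONFIG_PATH):
    """Return the archive source directory, or None if nothing is configured.

    Assumption: the environment variable wins over the local config file.
    """
    env = os.environ if env is None else env

    override = env.get(ENV_VAR)
    if override:
        return Path(override).expanduser()

    config_file = Path(config_path).expanduser()
    if config_file.is_file():
        source = json.loads(config_file.read_text()).get("source_path")
        if source:
            return Path(source).expanduser()

    # Neither the override nor the optional file is present.
    return None
```

Passing `env` and `config_path` explicitly keeps the lookup easy to exercise without touching the real home directory.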

## Knowledge Candidate Schema

Each batch candidate file contains:

- `id`
- `category`
- `claim`
- `evidence_tweet_ids`
- `evidence_quotes`
- `confidence`
- `status`
- `first_seen_at`
- `last_confirmed_at`
- `contradicts`

The consolidator uses these fields to classify each candidate as durable,
provisional, or retracted.
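As an illustration of the field list above, a presence check might look like the sketch below. It only verifies that each key exists, because this spec does not state the field types. The helper name `missing_fields` and the sample values are hypothetical.

```python
REQUIRED_FIELDS = (
    "id",
    "category",
    "claim",
    "evidence_tweet_ids",
    "evidence_quotes",
    "confidence",
    "status",
    "first_seen_at",
    "last_confirmed_at",
    "contradicts",
)


def missing_fields(candidate: dict) -> list:
    """Return the schema fields absent from a candidate, in schema order."""
    return [name for name in REQUIRED_FIELDS if name not in candidate]


# Hypothetical candidate with placeholder values; only key presence is checked.
example = {name: None for name in REQUIRED_FIELDS}
del example["contradicts"]
print(missing_fields(example))  # ['contradicts']
```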

## Candidate Eval Contract

Local eval JSON files under `~/.timmy/twitter-archive/training/evals/` must use:

```json
{
  "candidate_id": "timmy-archive-v0.1",
  "baseline_composite": 0.71,
  "candidate_composite": 0.76,
  "refusal_over_fabrication_regression": false,
  "source_distinction_regression": false,
  "evidence_citation_rate": 0.98,
  "rollback_model": "timmy-archive-v0.0"
}
```

Promotion gate:

- `candidate_composite` improves on `baseline_composite` by at least 5%
- `refusal_over_fabrication_regression` is `false`
- `source_distinction_regression` is `false`
- `evidence_citation_rate` stays at or above 95% (`0.95`)
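The gate can be sketched as a single predicate over the eval JSON. The spec does not say whether "at least 5%" is relative or absolute, so this sketch assumes a relative improvement over the baseline. The function name `passes_promotion_gate` is hypothetical.

```python
MIN_IMPROVEMENT = 0.05    # assumption: relative to the baseline composite
MIN_CITATION_RATE = 0.95


def passes_promotion_gate(report: dict) -> bool:
    """Return True only if every promotion condition holds."""
    baseline = report["baseline_composite"]
    candidate = report["candidate_composite"]

    improved = candidate >= baseline * (1 + MIN_IMPROVEMENT)
    no_regressions = not (
        report["refusal_over_fabrication_regression"]
        or report["source_distinction_regression"]
    )
    cited = report["evidence_citation_rate"] >= MIN_CITATION_RATE

    return improved and no_regressions and cited


# The example eval file from this section.
report = {
    "candidate_id": "timmy-archive-v0.1",
    "baseline_composite": 0.71,
    "candidate_composite": 0.76,
    "refusal_over_fabrication_regression": False,
    "source_distinction_regression": False,
    "evidence_citation_rate": 0.98,
    "rollback_model": "timmy-archive-v0.0",
}
print(passes_promotion_gate(report))  # True
```

Under the relative reading, the example file needs a candidate composite of at least 0.7455 and clears it with 0.76.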

## Training Command Contract

Optional local file `pipeline_config.json` can define:

```json
{
  "train_command": "bash -lc 'echo train me'",
  "promote_command": "bash -lc 'echo promote me'"
}
```

If these commands are absent, the pipeline still prepares artifacts and run
manifests, but training/promotion stays in a ready state instead of executing.
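A rough sketch of that fallback, assuming a missing config file is treated the same as missing commands. The helper names `load_pipeline_config` and `run_stage`, and the status strings they return, are hypothetical and not taken from the tracked scripts.

```python
import json
import shlex
import subprocess
from pathlib import Path

PIPELINE_CONFIG = Path("~/.timmy/twitter-archive/pipeline_config.json")


def load_pipeline_config(path=PIPELINE_CONFIG) -> dict:
    """Load the optional config; a missing file means no commands are defined."""
    config_file = Path(path).expanduser()
    if not config_file.is_file():
        return {}
    return json.loads(config_file.read_text())


def run_stage(config: dict, key: str) -> str:
    """Run the command stored under `key`, or report that the stage is ready."""
    command = config.get(key)
    if not command:
        # No command configured: artifacts are prepared, nothing executes.
        return "ready"
    # shlex.split keeps the quoted argument of `bash -lc '...'` as one token.
    subprocess.run(shlex.split(command), check=True)
    return "executed"
```

With an empty config, `run_stage({}, "train_command")` returns `"ready"` without starting a process.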