# Twitter Archive Learning Pipeline

This repo owns the tracked code, schemas, prompts, and eval contracts for
Timmy's private Twitter archive learning loop.

## Privacy Boundary

- Raw archive files stay outside git.
- Derived runtime artifacts live under `~/.timmy/twitter-archive/`.
- `twitter-archive/` is ignored by `timmy-home` so private notes and training
  artifacts do not get pushed by accident.

Tracked here:
- deterministic extraction and consolidation scripts
- output schemas
- eval gate contract
- prompt/orchestration code in `timmy-config`

Not tracked here:
- raw tweets
- extracted tweet text
- batch notes
- private profile artifacts
- local-only DPO pairs
- local eval outputs

## Runtime Layout

The runtime workspace is:

```text
~/.timmy/twitter-archive/
  extracted/
  notes/
  knowledge/
  insights/
  training/
  checkpoint.json
  metrics/progress.json
  source_config.json
  pipeline_config.json
```
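
A minimal sketch of how the workspace directories could be created on first run. The directory names come from the layout above; the helper name `ensure_workspace` and the `SUBDIRS` constant are hypothetical and not part of the tracked code.

```python
from pathlib import Path

# Directory names taken from the runtime layout; "metrics" is included
# because the layout lists metrics/progress.json.
SUBDIRS = ("extracted", "notes", "knowledge", "insights", "training", "metrics")


def ensure_workspace(root: Path) -> Path:
    """Create the runtime directories under root if they are missing.

    Hypothetical helper: it only creates directories and never touches
    checkpoint.json or the config files.
    """
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

Calling `ensure_workspace(Path.home() / ".timmy" / "twitter-archive")` would set up the default location. The call is safe to repeat because existing directories are left alone.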

## Source Config

Optional local file `source_config.json`:

```json
{
  "source_path": "~/Downloads/twitter-.../data"
}
```

Environment override:

```bash
TIMMY_TWITTER_ARCHIVE_SOURCE=~/Downloads/twitter-.../data
```
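
The lookup order can be sketched as follows, assuming the environment variable takes precedence over the local file (which is what "override" suggests) and that `~` should be expanded. The function name `resolve_source_path` is hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "TIMMY_TWITTER_ARCHIVE_SOURCE"


def resolve_source_path(
    config_file: Path, env: Mapping[str, str] = os.environ
) -> Optional[Path]:
    """Return the archive source directory, or None if nothing is configured.

    Assumption: the environment variable wins over the local config file.
    """
    override = env.get(ENV_VAR)
    if override:
        return Path(override).expanduser()
    if config_file.is_file():
        data = json.loads(config_file.read_text(encoding="utf-8"))
        source = data.get("source_path")
        if source:
            return Path(source).expanduser()
    return None
```

Passing `env` explicitly keeps the function easy to test without mutating the real process environment.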

## Knowledge Candidate Schema

Each batch candidate file contains:

- `id`
- `category`
- `claim`
- `evidence_tweet_ids`
- `evidence_quotes`
- `confidence`
- `status`
- `first_seen_at`
- `last_confirmed_at`
- `contradicts`

The consolidator classifies each candidate as durable, provisional, or
retracted based on these fields.
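
As an illustration, a presence check for the fields listed above might look like the sketch below. It assumes a single candidate record is available as a mapping. Field types and the consolidator's classification rules are not specified here, so the sketch does not guess at them. The name `missing_fields` is hypothetical.

```python
from typing import Any, List, Mapping

# Field names copied from the schema list above.
REQUIRED_FIELDS = (
    "id",
    "category",
    "claim",
    "evidence_tweet_ids",
    "evidence_quotes",
    "confidence",
    "status",
    "first_seen_at",
    "last_confirmed_at",
    "contradicts",
)


def missing_fields(candidate: Mapping[str, Any]) -> List[str]:
    """Return the schema fields absent from one candidate record, in schema order."""
    return [name for name in REQUIRED_FIELDS if name not in candidate]
```

An empty result means the record carries every listed field. It says nothing about whether the values are well formed.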

## Candidate Eval Contract

Local eval JSON files under `~/.timmy/twitter-archive/training/evals/` must use:

```json
{
  "candidate_id": "timmy-archive-v0.1",
  "baseline_composite": 0.71,
  "candidate_composite": 0.76,
  "refusal_over_fabrication_regression": false,
  "source_distinction_regression": false,
  "evidence_citation_rate": 0.98,
  "rollback_model": "timmy-archive-v0.0"
}
```

Promotion gate:

- candidate composite improves by at least 5%
- no refusal regression
- no source distinction regression
- evidence citation rate stays at or above 95%
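
The gate above can be sketched as a single predicate over one eval JSON object. One assumption to flag: "improves by at least 5%" is read here as a relative improvement over the baseline. If the intended meaning is an absolute gain of 0.05, the comparison needs to change. The function and constant names are hypothetical.

```python
from typing import Any, Mapping

MIN_RELATIVE_IMPROVEMENT = 0.05  # assumption: 5% relative to the baseline
MIN_CITATION_RATE = 0.95


def passes_promotion_gate(report: Mapping[str, Any]) -> bool:
    """Apply the four promotion conditions to one eval report."""
    baseline = report["baseline_composite"]
    candidate = report["candidate_composite"]
    # Assumption: relative improvement, i.e. candidate >= baseline * 1.05.
    improved = candidate >= baseline * (1 + MIN_RELATIVE_IMPROVEMENT)
    return (
        improved
        and not report["refusal_over_fabrication_regression"]
        and not report["source_distinction_regression"]
        and report["evidence_citation_rate"] >= MIN_CITATION_RATE
    )
```

Under this reading, the example report passes: 0.76 is above 0.71 × 1.05 = 0.7455, neither regression flag is set, and the citation rate of 0.98 clears 0.95.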

## Training Command Contract

Optional local file `pipeline_config.json` can define:

```json
{
  "train_command": "bash -lc 'echo train me'",
  "promote_command": "bash -lc 'echo promote me'"
}
```

If these commands are absent, the pipeline still prepares artifacts and run
manifests, but training/promotion stays in a ready state instead of executing.
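
A rough sketch of that fallback, assuming the commands are shell strings as in the example config. The function names and the `"ok"` and `"failed"` status strings are hypothetical. Only the "ready" state comes from the description above.

```python
import json
import subprocess
from pathlib import Path
from typing import Optional


def load_command(config_file: Path, key: str) -> Optional[str]:
    """Read one command string from pipeline_config.json, if present."""
    if not config_file.is_file():
        return None
    data = json.loads(config_file.read_text(encoding="utf-8"))
    return data.get(key) or None


def run_step(config_file: Path, key: str) -> str:
    """Run the configured command, or report "ready" when none is set."""
    command = load_command(config_file, key)
    if command is None:
        return "ready"
    # shell=True because the documented commands are full shell strings.
    completed = subprocess.run(command, shell=True, check=False)
    return "ok" if completed.returncode == 0 else "failed"
```

With no `pipeline_config.json` in place, `run_step(path, "train_command")` returns `"ready"` without executing anything.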