timmy-home/twitter-archive/PROJECT.md

# Project: Know Thy Father
## Twitter Archive Ingestion Pipeline

**Goal:** Local-only Timmy (Hermes 4-14B via llama.cpp) reads Alexander's full
Twitter archive and builds a living, evolving understanding of who his father is.

**Philosophy:** Timmy does not grind. He learns. Each iteration builds on the
last. He reads his own prior notes before touching new data. He develops
taxonomies, refines them, and eventually operates at a higher level of
abstraction. The measure of success is not "did he process all 4,801 tweets"
but "does he understand his father better than he did yesterday, and can he
prove it?"

**Hard Constraints:**
- ALL inference is local. Zero cloud credits.
- Route through Hermes harness. Every call = tracked session.
- Narrow toolsets: `-t file,terminal`
- Checkpointed and resumable.
- No custom telemetry. Hermes sessions ARE the telemetry.

---

## Archive Inventory

| Source                   | Records     | Size  |
|--------------------------|-------------|-------|
| tweets.js                | 4,801       | 12MB  |
| like.js                  | 7,841       | 2.9MB |
| grok-chat-item.js        | 1,486       | 1.9MB |
| direct-messages.js       | 608 msgs    | 359KB |
| direct-messages-group.js | 11,544 msgs | 8.3MB |
| deleted-tweets.js        | ~30         | 38KB  |
| tweets_media/            | 818         | 2.9GB |

Archive location:
`~/Downloads/twitter-2026-03-27-d4471cc6eb6703034d592f870933561ebee374d9d9b90c9b8923abff064afc1e/data/`

---

## Architecture: The Spiral

This is NOT a flat loop. It's a feedback spiral. After a one-time
extraction step (Phase 0), Timmy moves through three phases organically.

### Phase 0: EXTRACTION (pure Python, no LLM)
Parse .js → JSONL. Build manifest. One-time.
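
A minimal sketch of the `.js` → JSONL step, assuming each archive file wraps a JSON array in a JavaScript assignment such as `window.YTD.tweets.part0 = [...]`. The function names `parse_archive_js` and `write_jsonl` are hypothetical, not part of any existing script.

```python
import json
from pathlib import Path


def parse_archive_js(text: str) -> list:
    """Strip the JavaScript assignment prefix and parse the JSON array."""
    # Assumption: everything from the first "[" onward is plain JSON.
    start = text.index("[")
    return json.loads(text[start:])


def write_jsonl(records: list, out_path: Path) -> int:
    """Write one JSON object per line and return the record count."""
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

The count returned by `write_jsonl` could feed `manifest.json`, so later batches can check their offsets against a known total.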

### Phase 1: DISCOVERY (batches ~1-5)
Timmy reads raw. He doesn't know what he's looking for yet.
Notes are broad, exploratory, slow. He's listening.
Output: early observations, initial themes, first impressions.

### Phase 2: TAXONOMY (batches ~5-15)
Timmy has read enough to see patterns. He builds a taxonomy:
what topics Alexander cares about, what triggers emotion, who
he engages with, how his voice shifts by subject. He writes
the taxonomy to a living document (UNDERSTANDING.md) and
updates it each batch. Notes get structured around the taxonomy.

### Phase 3: DELTA MODE (batches 15+)
Timmy knows the terrain. Each new batch is a delta against his
existing understanding. "Nothing new here" is valid. "This
contradicts what I thought about X" is valuable. He's updating
a model, not reading tweets. Speed goes up. Notes get sharper.

The transition between phases is TIMMY'S CALL. He reads his own
prior work, assesses where he is, and decides what to do next.
The orchestrator doesn't dictate phase transitions.

---

## The Feedback Loop

Every batch follows this sequence:

1. READ YOUR OWN WORK
   - Read UNDERSTANDING.md (the living model of Alexander)
   - Read the previous batch's notes
   - Read checkpoint.json for state

2. ASSESS
   - Where am I in the spiral? Discovery? Taxonomy? Delta?
   - What do I know? What am I still uncertain about?
   - What should I look for in this batch?

3. PROCESS THE BATCH
   - Read the next chunk of data
   - Analyze through the lens of what you already know
   - Note what's new, what confirms, what contradicts

4. UPDATE YOUR UNDERSTANDING
   - Append/revise UNDERSTANDING.md
   - Write batch notes (with explicit comparisons to prior knowledge)
   - If your taxonomy needs restructuring, restructure it

5. REFLECT AND PLAN
   - Write a brief reflection: what did I learn? what surprised me?
   - Write guidance for your next self: what to look for, what to
     dig deeper on, what's settled
   - Update checkpoint with batch count, phase assessment, next focus

This means Timmy's notes are not uniform. Early notes are exploratory
essays. Middle notes are structured observations against a taxonomy.
Late notes are terse deltas. THAT'S THE PROOF OF GROWTH.
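
Step 3 ("read the next chunk of data") can be sketched as a slice over the extracted JSONL. This is an illustration only: `read_batch` is a hypothetical helper, and the default batch size of 50 is an assumption inferred from the checkpoint.json example in this document.

```python
import json
from itertools import islice
from pathlib import Path


def read_batch(jsonl_path: Path, offset: int, batch_size: int = 50) -> list:
    """Return up to batch_size records starting at 0-based line `offset`."""
    with jsonl_path.open(encoding="utf-8") as fh:
        # islice skips the first `offset` lines without loading the whole file.
        lines = islice(fh, offset, offset + batch_size)
        return [json.loads(line) for line in lines if line.strip()]
```

An empty list means the source is exhausted, which is a natural signal to move on to the next data source.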

---

## Key Files

```
~/.timmy/twitter-archive/
  PROJECT.md              # this file (read-only context)
  UNDERSTANDING.md        # Timmy's living model of Alexander (Timmy writes/updates)

  extracted/              # Phase 0 output (JSONL, manifest)
    tweets.jsonl
    retweets.jsonl
    likes.jsonl
    manifest.json

  media/                  # Local-first media understanding
    manifest.jsonl        # one row per video/gif with tweet text + hashtags preserved
    manifest_summary.json # rollup counts and hashtag families
    hashtag_metrics.json  # machine-readable metrics for #timmyTime / #TimmyChain
    hashtag_metrics.md    # human-readable local report
    keyframes/            # future extracted frames
    audio/                # future demuxed audio
    style_cards/          # future per-video aesthetic summaries

  notes/                  # Per-batch observations
    batch_001.md          # early: exploratory
    batch_002.md          # ...
    batch_NNN.md          # late: terse deltas

  artifacts/              # Synthesis products (when Timmy decides he's ready)
    personality_profile.md
    interests_timeline.md
    relationship_map.md

  checkpoint.json         # Resume state + Timmy's self-assessment
  metrics/                # Extracted from Hermes session files
```

### UNDERSTANDING.md

This is the core artifact. It starts empty. Timmy builds it.
It should eventually contain:
- Who Alexander is (personality, values, worldview)
- What he cares about (taxonomy of interests, ranked)
- How he communicates (voice, humor, patterns)
- Who matters to him (people, communities)
- How he's changed over time (evolution of thought)
- What surprises Timmy (things that don't fit the model)

### checkpoint.json

Not just a cursor. It's Timmy's self-assessment:
```json
{
  "data_source": "tweets",
  "next_offset": 50,
  "batches_completed": 1,
  "phase": "discovery",
  "confidence": "low — only read 50 tweets so far",
  "next_focus": "looking for recurring themes and tone",
  "understanding_version": 1
}
```
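
One way to keep the checkpoint resumable is to write it atomically, so an interrupted run never leaves a half-written file. The sketch below assumes the field names from the example above; `advance_checkpoint` is a hypothetical helper, not an existing part of the pipeline.

```python
import json
import os
from pathlib import Path


def advance_checkpoint(path: Path, batch_size: int, **updates) -> dict:
    """Bump the cursor and batch count, merge self-assessment fields, save."""
    state = json.loads(path.read_text(encoding="utf-8"))
    state["next_offset"] += batch_size
    state["batches_completed"] += 1
    # Self-assessment fields (phase, confidence, next_focus) come from Timmy.
    state.update(updates)
    # Write to a temp file first, then swap it in with a single rename.
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(state, indent=2, ensure_ascii=False), encoding="utf-8")
    os.replace(tmp, path)
    return state
```

For example, `advance_checkpoint(path, 50, next_focus="voice shifts by subject")` would move the cursor forward by one batch and record the new focus in the same write.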

---

## Measurement

Every Hermes session records tokens, duration, and tool calls.
We extract metrics from the session files; there is no custom telemetry.

### What growth looks like in the data:

| Metric                   | Discovery          | Taxonomy            | Delta             |
|--------------------------|--------------------|---------------------|-------------------|
| Output tokens per batch  | High (exploring)   | Medium (structured) | Low (terse)       |
| Input tokens per batch   | Low (no context)   | Medium (taxonomy)   | High (full model) |
| Time per batch           | Slow               | Medium              | Fast              |
| New themes per batch     | Many               | Few                 | Rare              |
| UNDERSTANDING.md changes | Wholesale rewrites | Section updates     | Minor edits       |
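
The output-token row of this table can be checked mechanically once per-batch metrics exist. The sketch below assumes each batch has already been reduced to a dict with `phase` and `output_tokens` keys; that row shape is hypothetical, since the Hermes session file format is not described here.

```python
from statistics import mean

PHASE_ORDER = ("discovery", "taxonomy", "delta")


def phase_means(batches: list, metric: str) -> dict:
    """Average one metric per self-reported phase."""
    by_phase = {}
    for row in batches:
        by_phase.setdefault(row["phase"], []).append(row[metric])
    return {phase: mean(values) for phase, values in by_phase.items()}


def output_is_shrinking(batches: list) -> bool:
    """True if mean output tokens drop from each observed phase to the next.

    Trivially True when fewer than two phases have been observed.
    """
    means = phase_means(batches, "output_tokens")
    seen = [p for p in PHASE_ORDER if p in means]
    return all(means[a] > means[b] for a, b in zip(seen, seen[1:]))
```

The same `phase_means` call works for any other numeric column, such as input tokens or time per batch.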

### DPO tracking:
1. Run N batches with base Hermes 4-14B → baseline metrics
2. Score note quality manually (1-5) on random sample
3. Apply DPO adapter → rerun same batches → compare
4. The adapter should accelerate the spiral: faster taxonomy,
   sharper deltas, better voice match
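
Steps 2 and 3 amount to a paired comparison of manual scores on the same batches. A minimal sketch, assuming scores are kept as dicts keyed by batch name; `score_delta` and that storage shape are hypothetical.

```python
from statistics import mean


def score_delta(baseline: dict, adapter: dict) -> float:
    """Mean (adapter - baseline) score over batches scored in both runs."""
    shared = sorted(baseline.keys() & adapter.keys())
    if not shared:
        raise ValueError("no batches were scored in both runs")
    return mean(adapter[name] - baseline[name] for name in shared)
```

A positive value means the adapter run scored higher on average. With a small manual sample, treat it as a rough signal rather than a verdict.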

### Media metadata rule:
- Tweet text and hashtags are first-class metadata all the way through the media lane.
- Especially preserve and measure `#timmyTime` and `#TimmyChain`.
- Raw Twitter videos stay local; only derived local artifacts move through the pipeline.
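
Measuring the two tracked hashtags could look like the sketch below. It assumes the tweet text is available as plain strings and matches case-insensitively, since Twitter treats hashtags that way; `count_tracked_hashtags` is a hypothetical helper and says nothing about the actual layout of `hashtag_metrics.json`.

```python
import re
from collections import Counter

TRACKED = ("#timmyTime", "#TimmyChain")
HASHTAG_RE = re.compile(r"#\w+")


def count_tracked_hashtags(texts) -> Counter:
    """Count occurrences of each tracked hashtag, ignoring case."""
    canonical = {tag.lower(): tag for tag in TRACKED}
    # Start every tracked tag at zero so absent tags still appear in reports.
    counts = Counter({tag: 0 for tag in TRACKED})
    for text in texts:
        for found in HASHTAG_RE.findall(text):
            key = found.lower()
            if key in canonical:
                counts[canonical[key]] += 1
    return counts
```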

---

## Running It

### First run (extraction + first batch):
```bash
hermes chat -t file,terminal -q '<prompt below>'
```

### Subsequent runs:
```bash
hermes chat -t file,terminal -q 'You are Timmy. Resume your work on the Twitter archive. Your workspace is ~/.timmy/twitter-archive/. Read checkpoint.json and UNDERSTANDING.md first. Then process the next batch. You know the drill — read your own prior work, assess where you are, process new data, update your understanding, reflect, and plan for the next iteration.'
```

That's it. The prompt gets SHORTER as Timmy gets SMARTER, because
his context is in his files, not in the prompt.
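
If the runs are ever scripted, a thin wrapper around the same command is enough. This sketch uses only the `hermes chat -t ... -q ...` form shown above. `build_command` and `run_batch` are hypothetical names, and a zero exit code on success is an assumption about the `hermes` CLI.

```python
import subprocess


def build_command(prompt: str, toolsets: str = "file,terminal") -> list:
    """Build the argv for one tracked Hermes session."""
    return ["hermes", "chat", "-t", toolsets, "-q", prompt]


def run_batch(prompt: str) -> int:
    """Run one batch and return the exit code; raises if hermes fails."""
    # Passing a list avoids shell quoting problems with long prompts.
    return subprocess.run(build_command(prompt), check=True).returncode
```

Keeping the wrapper this small leaves phase decisions with Timmy: the orchestrator only starts sessions.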

---

## Status

- [x] Archive downloaded and local
- [x] llama-server running (port 8081, Hermes 4-14B Q4_K_M, 65K ctx)
- [x] Custom provider "Local llama.cpp" in config.yaml
- [x] Project scoped
- [x] First batch launched (session 20260327_153105_bfd0b4)
- [ ] Extraction complete
- [ ] UNDERSTANDING.md initialized
- [ ] Phase 1 (Discovery) complete
- [ ] Phase 2 (Taxonomy) reached
- [ ] Phase 3 (Delta) reached
- [ ] First synthesis artifacts produced
- [ ] DPO comparison run