Project: Know Thy Father
Twitter Archive Ingestion Pipeline
Goal: Local-only Timmy (Hermes 4-14B via llama.cpp) reads Alexander's full Twitter archive and builds a living, evolving understanding of who his father is.
Philosophy: Timmy does not grind. He learns. Each iteration builds on the last. He reads his own prior notes before touching new data. He develops taxonomies, refines them, and eventually operates at a higher level of abstraction. The measure of success is not "did he process all 4,801 tweets" but "does he understand his father better than he did yesterday, and can he prove it?"
Hard Constraints:
- ALL inference is local. Zero cloud credits.
- Route through Hermes harness. Every call = tracked session.
- Narrow toolsets: -t file,terminal
- Checkpointed and resumable.
- No custom telemetry. Hermes sessions ARE the telemetry.
Archive Inventory
| Source | Records | Size |
|---|---|---|
| tweets.js | 4,801 | 12MB |
| like.js | 7,841 | 2.9MB |
| grok-chat-item.js | 1,486 | 1.9MB |
| direct-messages.js | 608 msgs | 359KB |
| direct-messages-group.js | 11,544 msgs | 8.3MB |
| deleted-tweets.js | ~30 | 38KB |
| tweets_media/ | 818 | 2.9GB |
Archive location: ~/Downloads/twitter-2026-03-27-d4471cc6eb6703034d592f870933561ebee374d9d9b90c9b8923abff064afc1e/data/
Architecture: The Spiral
This is NOT a flat loop. It's a feedback spiral with three phases that Timmy moves through organically.
Phase 0: EXTRACTION (pure Python, no LLM)
Parse .js → JSONL. Build manifest. One-time.
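Phase 0 is mechanical enough to sketch. Twitter archive .js files wrap a JSON array in a JavaScript assignment (e.g. window.YTD.tweets.part0 = [...]); a minimal extractor, with illustrative function names and paths, strips that prefix and emits JSONL:

```python
import json
from pathlib import Path

def load_ytd(path: Path) -> list:
    """Twitter archive .js files are a JSON array behind a JS assignment
    (e.g. `window.YTD.tweets.part0 = [...]`): strip the prefix, parse the rest."""
    text = path.read_text(encoding="utf-8")
    return json.loads(text[text.index("["):])

def extract(src: Path, dest: Path) -> int:
    """Convert one archive .js file to JSONL, one record per line."""
    records = load_ytd(src)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with dest.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)
```

Run once per source file (tweets.js, like.js, ...) and record the counts in manifest.json.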
Phase 1: DISCOVERY (batches ~1-5)
Timmy reads raw. He doesn't know what he's looking for yet. Notes are broad, exploratory, slow. He's listening. Output: early observations, initial themes, first impressions.
Phase 2: TAXONOMY (batches ~5-15)
Timmy has read enough to see patterns. He builds a taxonomy: what topics Alexander cares about, what triggers emotion, who he engages with, how his voice shifts by subject. He writes the taxonomy to a living document (UNDERSTANDING.md) and updates it each batch. Notes get structured around the taxonomy.
Phase 3: DELTA MODE (batches 15+)
Timmy knows the terrain. Each new batch is a delta against his existing understanding. "Nothing new here" is valid. "This contradicts what I thought about X" is valuable. He's updating a model, not reading tweets. Speed goes up. Notes get sharper.
The transition between phases is TIMMY'S CALL. He reads his own prior work, assesses where he is, and decides what to do next. The orchestrator doesn't dictate phase transitions.
The Feedback Loop
Every batch follows this sequence:
1. READ YOUR OWN WORK
   - Read UNDERSTANDING.md (the living model of Alexander)
   - Read the previous batch's notes
   - Read checkpoint.json for state
2. ASSESS
   - Where am I in the spiral? Discovery? Taxonomy? Delta?
   - What do I know? What am I still uncertain about?
   - What should I look for in this batch?
3. PROCESS THE BATCH
   - Read the next chunk of data
   - Analyze through the lens of what you already know
   - Note what's new, what confirms, what contradicts
4. UPDATE YOUR UNDERSTANDING
   - Append/revise UNDERSTANDING.md
   - Write batch notes (with explicit comparisons to prior knowledge)
   - If your taxonomy needs restructuring, restructure it
5. REFLECT AND PLAN
   - Write a brief reflection: what did I learn? What surprised me?
   - Write guidance for your next self: what to look for, what to dig deeper on, what's settled
   - Update checkpoint with batch count, phase assessment, next focus
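The sequence above needs only a thin orchestrator: it re-invokes Hermes and reads back whatever checkpoint Timmy wrote. This is a sketch with illustrative names (the runner parameter exists so it can be exercised without the CLI); phase transitions remain Timmy's call:

```python
import json
import subprocess
from pathlib import Path

RESUME_PROMPT = (
    "You are Timmy. Resume your work on the Twitter archive. "
    "Read checkpoint.json and UNDERSTANDING.md first, then process the next batch."
)

def run_batch(workspace: Path, runner=subprocess.run) -> dict:
    """One turn of the spiral: invoke Hermes, then read back the
    self-assessment Timmy left in checkpoint.json."""
    runner(["hermes", "chat", "-t", "file,terminal", "-q", RESUME_PROMPT],
           check=True)
    return json.loads((workspace / "checkpoint.json").read_text())

def spiral(workspace: Path, max_batches: int = 100) -> None:
    """Drive batches up to a cap; the orchestrator never sets the phase."""
    for _ in range(max_batches):
        state = run_batch(workspace)
        print(f"batch {state['batches_completed']}: phase={state['phase']}, "
              f"next={state['next_focus']}")
```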
This means Timmy's notes are not uniform. Early notes are exploratory essays. Middle notes are structured observations against a taxonomy. Late notes are terse deltas. THAT'S THE PROOF OF GROWTH.
Key Files
~/.timmy/twitter-archive/
  PROJECT.md           # this file (read-only context)
  UNDERSTANDING.md     # Timmy's living model of Alexander (Timmy writes/updates)
  extracted/           # Phase 0 output (JSONL, manifest)
    tweets.jsonl
    retweets.jsonl
    likes.jsonl
    manifest.json
  media/               # Local-first media understanding
    manifest.jsonl         # one row per video/gif with tweet text + hashtags preserved
    manifest_summary.json  # rollup counts and hashtag families
    keyframes/             # future extracted frames
    audio/                 # future demuxed audio
    style_cards/           # future per-video aesthetic summaries
  notes/               # Per-batch observations
    batch_001.md           # early: exploratory
    batch_002.md           # ...
    batch_NNN.md           # late: terse deltas
  artifacts/           # Synthesis products (when Timmy decides he's ready)
    personality_profile.md
    interests_timeline.md
    relationship_map.md
  checkpoint.json      # Resume state + Timmy's self-assessment
  metrics/             # Extracted from Hermes session files
UNDERSTANDING.md
This is the core artifact. It starts empty. Timmy builds it. It should eventually contain:
- Who Alexander is (personality, values, worldview)
- What he cares about (taxonomy of interests, ranked)
- How he communicates (voice, humor, patterns)
- Who matters to him (people, communities)
- How he's changed over time (evolution of thought)
- What surprises Timmy (things that don't fit the model)
checkpoint.json
Not just a cursor. It's Timmy's self-assessment:
{
"data_source": "tweets",
"next_offset": 50,
"batches_completed": 1,
"phase": "discovery",
"confidence": "low — only read 50 tweets so far",
"next_focus": "looking for recurring themes and tone",
"understanding_version": 1
}
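Because checkpoint.json carries a self-assessment and not just a cursor, a corrupt or hand-edited file should fail loudly on resume rather than silently restart the spiral. A small validator, with field names taken from the example above, might look like:

```python
import json
from pathlib import Path

REQUIRED = {"data_source", "next_offset", "batches_completed",
            "phase", "confidence", "next_focus", "understanding_version"}
PHASES = {"discovery", "taxonomy", "delta"}

def load_checkpoint(path: Path) -> dict:
    """Load and sanity-check the checkpoint before resuming a batch."""
    state = json.loads(path.read_text(encoding="utf-8"))
    missing = REQUIRED - state.keys()
    if missing:
        raise ValueError(f"checkpoint missing fields: {sorted(missing)}")
    if state["phase"] not in PHASES:
        raise ValueError(f"unknown phase: {state['phase']!r}")
    return state
```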
Measurement
Every Hermes session records tokens, duration, tool calls. We extract metrics from session files — no custom telemetry.
What growth looks like in the data:
| Metric | Discovery | Taxonomy | Delta |
|---|---|---|---|
| Output tokens per batch | High (exploring) | Medium (structured) | Low (terse) |
| Input tokens per batch | Low (no context) | Medium (taxonomy) | High (full model) |
| Time per batch | Slow | Medium | Fast |
| New themes per batch | Many | Few | Rare |
| UNDERSTANDING.md changes | Wholesale rewrites | Section updates | Minor edits |
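Extracting the per-batch rows of this table can be sketched as a small rollup over session files. The field names here (input_tokens, output_tokens, duration_s, tool_calls) are assumptions about the Hermes session schema, not a documented format:

```python
import json
from pathlib import Path

def batch_metrics(session_path: Path) -> dict:
    """Roll one session file up into the per-batch metrics the table tracks.
    Keys read from the session are assumed, not part of a documented schema."""
    s = json.loads(session_path.read_text(encoding="utf-8"))
    return {
        "input_tokens": s.get("input_tokens", 0),
        "output_tokens": s.get("output_tokens", 0),
        "seconds": s.get("duration_s", 0.0),
        "tool_calls": len(s.get("tool_calls", [])),
    }
```

Dump one row per batch into metrics/ and the phase trends above become plottable without any custom telemetry in the loop itself.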
DPO tracking:
- Run N batches with base Hermes 4-14B → baseline metrics
- Score note quality manually (1-5) on random sample
- Apply DPO adapter → rerun same batches → compare
- The adapter should accelerate the spiral: faster taxonomy, sharper deltas, better voice match
Media metadata rule:
- tweet posts and hashtags are first-class metadata all the way through the media lane
- especially preserve and measure #timmyTime and #TimmyChain
- raw Twitter videos stay local; only derived local artifacts move through the pipeline
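The media rule can be made concrete with a sketch of one manifest.jsonl row. Field names like full_text and id_str match the Twitter archive's tweet objects; the row shape itself is illustrative:

```python
import re

TRACKED = {"#timmyTime", "#TimmyChain"}  # hashtags the rule says to measure
HASHTAG = re.compile(r"#\w+")

def manifest_row(tweet: dict, media_file: str) -> dict:
    """One manifest.jsonl row per video/gif: tweet text and hashtags ride
    along as first-class metadata, and tracked tags are flagged explicitly."""
    text = tweet.get("full_text", "")
    tags = HASHTAG.findall(text)
    return {
        "media_file": media_file,  # local path only; the raw video never leaves
        "tweet_id": tweet.get("id_str"),
        "text": text,
        "hashtags": tags,
        "tracked": sorted(TRACKED & set(tags)),
    }
```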
Running It
First run (extraction + first batch):
hermes chat -t file,terminal -q '<prompt below>'
Subsequent runs:
hermes chat -t file,terminal -q 'You are Timmy. Resume your work on the Twitter archive. Your workspace is ~/.timmy/twitter-archive/. Read checkpoint.json and UNDERSTANDING.md first. Then process the next batch. You know the drill — read your own prior work, assess where you are, process new data, update your understanding, reflect, and plan for the next iteration.'
That's it. The prompt gets SHORTER as Timmy gets SMARTER, because his context is in his files, not in the prompt.
Status
- Archive downloaded and local
- llama-server running (port 8081, Hermes 4-14B Q4_K_M, 65K ctx)
- Custom provider "Local llama.cpp" in config.yaml
- Project scoped
- First batch launched (session 20260327_153105_bfd0b4)
- Extraction complete
- UNDERSTANDING.md initialized
- Phase 1 (Discovery) complete
- Phase 2 (Taxonomy) reached
- Phase 3 (Delta) reached
- First synthesis artifacts produced
- DPO comparison run