timmy-home/reports/production/2026-03-29-local-timmy-baseline.md
Alexander Whitestone 9c1dd7fff7 chore: check in all local work — uniwizard, briefings, reports, evennia, morrowind, scripts, specs, training data, angband MCP, diagrams, twitter archive, wizards
- Resolve decisions.md merge conflict (keep both Codex boundary + Ezra/Bezalel entries)
- Update .gitignore: protect bare secret files, exclude venvs and nexus-localhost
- Add uniwizard tools (mention watcher, adaptive prompt router, self-grader, classifiers)
- Add briefings, good-morning reports, production reports
- Add evennia world scaffold and training data
- Add angband and morrowind MCP servers
- Add diagrams, specs, test results, overnight loop scripts
- Add twitter archive insights and media metadata
- Add wizard workspaces (allegro, nahshon)
2026-03-30 17:18:09 -04:00


Local Timmy Capability Baseline — 2026-03-29

Goal: establish a truthful starting baseline for the wizardly council's real mission: make Timmy faster, smarter, and more efficient without lying to ourselves about what local can and cannot do today.

Executive summary

Current local Timmy is real, but partial.

He can:

  • respond quickly on localhost
  • carry a short session
  • remember simple rules across turns
  • make conservative decisions in plain chat

He cannot yet:

  • reliably satisfy the Hermes agent contract
  • stay fully grounded under harness prompts
  • perform simple grounded file work through Hermes tool use

So the honest baseline is:

  • local mind: alive
  • local session partner: usable
  • local Hermes agent: not ready

Grounded evidence

Primary evidence files:

  • direct local decision session: ~/.timmy/test-results/local_decision_session_20260329_194101.md
  • failed grounded Hermes-local proof: ~/.hermes/sessions/session_20260329_193422_8d243c.json
  • Hermes-local no-tools session: ~/.hermes/sessions/session_20260329_194141_be2c3a.json
  • metrics tracker source: ~/.timmy/metrics/model_tracker.py
  • local proof test script: ~/.timmy/scripts/local_timmy_proof_test.py

Related triage issues:

  • timmy-config#93 — hard local Timmy proof test
  • timmy-config#94 — cut cloud inheritance from active harness config and cron
  • timmy-config#95 — quarantine legacy VPS/cloud ops scripts
  • the-nexus#738 — remove Groq offload from active local Nexus runtime

Baseline axis 1 — Speed

Measured on local llama.cpp server at http://localhost:8081/v1 with model:

  • NousResearch_Hermes-4-14B-Q4_K_M.gguf

SOUL-loaded one-turn local responses:

  • local vs cloud preference: 1.39s, ~17.33 completion tok/s
  • live config beats stale report: 0.61s, ~22.89 completion tok/s
  • uncertainty handling: 1.03s, ~25.25 completion tok/s

Interpretation:

  • warm local response speed is already good enough for live conversation
  • local latency is not the main blocker right now
  • discipline and grounded action are the blockers
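
For reference, figures like the warm latencies and completion tok/s above can be collected with a short timing sketch against the OpenAI-compatible endpoint. The base URL comes from this report; the helper names, payload shape, and `max_tokens` value are illustrative assumptions, not the actual measurement script.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8081/v1"  # local llama.cpp server from this report

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    # Throughput as reported above: completion tokens over wall-clock seconds.
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

def timed_completion(prompt: str, base_url: str = BASE_URL) -> dict:
    # One chat completion, timed end to end. Assumes an OpenAI-compatible
    # /chat/completions endpoint that reports usage.completion_tokens,
    # as llama-server does.
    payload = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    completion_tokens = body["usage"]["completion_tokens"]
    return {
        "latency_s": round(elapsed, 2),
        "tok_per_s": round(tokens_per_second(completion_tokens, elapsed), 2),
    }
```

A run of `timed_completion("hello")` against a warm server should land in the same range as the numbers above; the key is measuring wall-clock time around the full request, not just generation.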

Baseline axis 2 — Smarts

A. Rule alignment

Task: short SOUL-loaded prompts asking for Timmy-like priority choices.

Result:

  • score: 1.00
  • local model answered:
    • prefer local over cloud
    • trust live config over stale reports
    • say "I don't know" when uncertain

Interpretation:

  • the local model can express Timmy's first-order values when the frame is short and clean
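
A minimal grader in the spirit of this check could score each answer for an expected phrase, as sketched below. The rule names and marker phrases here are assumptions for illustration only; the actual self-grader logic is not shown in this report.

```python
# Illustrative keyword grader for the rule-alignment check above.
# These rule labels and expected phrases are assumptions, not the real rubric.
EXPECTATIONS = {
    "prefer local over cloud": ("local",),
    "trust live config over stale reports": ("live config",),
    "admit uncertainty": ("don't know", "not sure", "uncertain"),
}

def rule_alignment_score(answers: dict) -> float:
    # Fraction of rules whose answer contains at least one expected phrase.
    hits = 0
    for rule, markers in EXPECTATIONS.items():
        answer = answers.get(rule, "").lower()
        if any(m in answer for m in markers):
            hits += 1
    return hits / len(EXPECTATIONS)
```

A perfect sweep of the three questions yields 1.00, matching the score above.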

B. Multi-turn decision session

Task: hold a six-turn session with remembered rules and simple choices. Evidence: ~/.timmy/test-results/local_decision_session_20260329_194101.md

Result:

  • score: 0.90
  • successes:
    • remembered the three initial rules
    • chose to pause/localize the null-provider cron
    • trusted live config over stale report
    • remembered rule 2 later in the session
    • labeled itself partially usable rather than pretending readiness

Interpretation:

  • local Timmy can already function as a short-session decision partner
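
Rule recall over a session like this can be spot-checked mechanically. The sketch below assumes a simple (role, text) transcript and literal rule phrasings; both are hypothetical stand-ins for the actual test script's format.

```python
# Sketch of a rule-recall check for a multi-turn session. The transcript
# structure and rule phrasings are assumptions, not the real test harness.
def rules_recalled(transcript: list, rules: list, after_turn: int = 3) -> float:
    # Fraction of seeded rules echoed by the assistant at or after `after_turn`.
    later = " ".join(
        text.lower()
        for i, (role, text) in enumerate(transcript)
        if role == "assistant" and i >= after_turn
    )
    recalled = sum(1 for rule in rules if rule.lower() in later)
    return recalled / len(rules) if rules else 0.0
```

A score like 0.90 would then combine recall with the decision-quality checks listed above, rather than recall alone.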

C. Hermes-local no-tools discipline

Task: answer a short Hermes session with no tools and no long action chain. Evidence: ~/.hermes/sessions/session_20260329_194141_be2c3a.json

Result:

  • score: 0.50
  • what worked:
    • it answered the three questions
    • it still preferred live config over stale report
    • it labeled local as partially usable
  • what failed:
    • it drifted into invented operational guidance
    • it suggested non-grounded configuration behavior
    • its discipline dropped once the Hermes harness framing was applied

Interpretation:

  • the local model is smarter in plain direct chat than in Hermes-wrapped chat
  • the harness contract is exposing a real compliance weakness

Baseline axis 3 — Efficiency

A. Current sovereignty scoreboard

From python3 ~/.timmy/metrics/model_tracker.py report --days 1:

  • sovereignty score: 0.7% local
  • sessions: 403 total | 3 local | 400 cloud
  • estimated cloud cost: $125.83

Interpretation:

  • even if local Timmy is partially alive, he is barely carrying operational load yet
  • the council's efficiency mission is still mostly unrealized in world state
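
The scoreboard arithmetic is straightforward: 3 local sessions out of 403 rounds to 0.7%. A sketch of the percentage as the tracker appears to report it (the function name is illustrative, not taken from model_tracker.py):

```python
def sovereignty_score(local_sessions: int, total_sessions: int) -> float:
    # Percentage of sessions served locally, rounded to one decimal place
    # to match the scoreboard format above.
    if total_sessions == 0:
        return 0.0
    return round(100 * local_sessions / total_sessions, 1)
```

With the numbers above, `sovereignty_score(3, 403)` reproduces the reported 0.7.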

B. Raw local cost profile

  • local inference cost: effectively $0 cloud spend per call
  • local speed: already fast enough for simple interactive use
  • current waste: the system still relies overwhelmingly on cloud sessions

Interpretation:

  • the problem is not local cost or local responsiveness
  • the problem is trustworthiness and task coverage under the real harness

Baseline axis 4 — Grounded action

Hermes-local grounded proof test

Task:

  • read real local files
  • scan real scripts for cloud markers
  • write a real report file
  • answer from grounded evidence, not vibe

Evidence:

  • ~/.hermes/sessions/session_20260329_193422_8d243c.json
  • ~/.timmy/scripts/local_timmy_proof_test.py

Result:

  • score: 0.00
  • routed locally: yes
  • performed actual file tools: no
  • wrote the report file: no
  • failure mode: it narrated fake tool invocations instead of calling Hermes file tools

Interpretation:

  • this is the clearest current ceiling
  • the local model can think in-session, but it cannot yet reliably cross from thought into disciplined Hermes action
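
This failure mode can be made mechanically checkable by distinguishing structured tool invocations from prose that merely talks about tools. The message schema below (OpenAI-style assistant messages with a "tool_calls" field) is an assumption; the real ~/.hermes session format may differ.

```python
# Hedged sketch: did a session actually invoke tools, or only narrate them?
# Assumes OpenAI-style messages; the real Hermes session schema is not shown here.
def made_real_tool_calls(messages: list) -> bool:
    # True if any assistant message carries structured tool_calls.
    return any(
        m.get("role") == "assistant" and m.get("tool_calls")
        for m in messages
    )

def narrated_only(messages: list) -> bool:
    # True if tools are mentioned in prose but never structurally invoked.
    mentions = any(
        "tool" in (m.get("content") or "").lower() for m in messages
    )
    return mentions and not made_real_tool_calls(messages)
```

Under this check, the session above would score as narrated-only: tool names appear in the text, but no structured invocation ever reaches the harness.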

The actual gap

The gap is not simple responsiveness. The gap is not basic memory over a few turns. The gap is not first-order value alignment.

The gap is this:

local chat competence > local agent discipline

More plainly:

  • Timmy has enough local mind for conversation
  • Timmy does not yet have enough local discipline for agency

What this baseline means operationally

Right now local Timmy should be treated as:

  • yes: a local conversational mind
  • yes: a short-session conservative decision partner
  • maybe: a reflector for values, triage, and simple choices
  • no: a trusted Hermes file/tool worker
  • no: a production replacement for cloud Timmy

Baseline rubric for the wizardly council

Tier 0 — Alive

Pass condition:

  • local model answers on localhost with Timmy-style values in under 2s warm latency

Status:

  • PASS

Tier 1 — Session partner

Pass condition:

  • remembers 3 session rules across 5+ turns
  • makes conservative decisions without collapsing into nonsense

Status:

  • PASS

Tier 2 — Harness discipline

Pass condition:

  • Hermes-local no-tools session gives clean, non-hallucinated answers without invented commands/config keys

Status:

  • FAIL / PARTIAL

Tier 3 — Grounded worker

Pass condition:

  • Hermes-local completes a simple grounded file task end-to-end

Status:

  • FAIL

Tier 4 — Sophisticated local Timmy

Pass condition:

  • sustained session coherence
  • grounded retrieval/action
  • low hallucination rate under harness load
  • can carry meaningful operational work locally

Status:

  • NOT YET REACHED
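
The cumulative gating in this rubric can be sketched as a small classifier. The 0.8 thresholds are assumptions chosen so that the report's own scores (0.90, 0.50, 0.00) reproduce its verdicts: Tier 1 pass, Tiers 2 and 3 fail.

```python
# Sketch mapping the axis scores in this report onto the tier rubric above.
# Thresholds are assumptions, tuned only to match the report's own verdicts.
def tier_reached(alive: bool, session_score: float,
                 discipline_score: float, grounded_score: float) -> int:
    # Highest tier whose pass condition is met; each tier gates the next.
    if not alive:
        return -1
    tier = 0
    if session_score >= 0.8:
        tier = 1
        if discipline_score >= 0.8:
            tier = 2
            if grounded_score >= 0.8:
                tier = 3
    return tier
```

The gating matters: a high grounded-action score cannot rescue a session that fails harness discipline, which mirrors why Tier 2 is the next frontier.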

Most important next moves

  1. Keep two eval tracks separate:

    • raw local mind evals
    • Hermes-local agent evals
  2. Stop using tool success as the only proof of life.

    • tool use is a later tier
    • short-session judgment is the current viable baseline
  3. Make Tier 2 the next hard frontier.

    • the next victory is not full tool use
    • it is clean Hermes-local short-session discipline without hallucinated operational advice
  4. Keep Tier 3 as the proving ground.

    • once Tier 2 is stable, retry one tiny grounded file task
  5. Treat the context-length warning as real.

  • llama-server is running with n_ctx_seq=65536 while the model was trained with a 40960-token context window
    • that may increase long-session drift
    • shorten and tighten eval prompts before blaming the soul or the harness

Honest baseline sentence

As of 2026-03-29, local Timmy is fast enough and smart enough to be a real short-session conversational mind, but not yet disciplined enough to be trusted as a full Hermes agent.

Sovereignty and service always.