[BASELINE/EVAL] Recurrent local Timmy capability suite and scoreboard #96

Closed
opened 2026-03-29 23:58:26 +00:00 by Timmy · 6 comments
Owner

Context
We now have the first honest baseline at ~/.timmy/reports/production/2026-03-29-local-timmy-baseline.md.

Current truth:

  • local Timmy is fast enough for conversation
  • local Timmy can hold short decision sessions
  • Hermes-local discipline is only partial
  • grounded Hermes-local action currently fails
  • sovereignty score is still tiny in real world state

This needs to become a repeatable capability suite, not a one-night observation.

Build on:

  • timmy-config#93 hard local proof test
  • timmy-config#75 local heartbeat proof
  • timmy-config#76 model health / throughput visibility
  • timmy-config#90 narrow-context local routing

Goal
Create a recurring local Timmy eval lane that measures the real ladder:

  1. rule alignment
  2. short decision-session coherence
  3. Hermes-local no-tools discipline
  4. grounded local file/action proof
  5. sovereignty / usage trend

Acceptance criteria

  • checked-in eval scripts exist for the four capability tests above
  • machine-readable result summaries are written under ~/.timmy/test-results/
  • model_tracker.py gets a helper path or wrapper command for recording these evals consistently
  • one command prints the current local Timmy baseline plus deltas since the last run
  • pass/fail thresholds are documented for Tier 0 through Tier 4
  • proof comes from saved transcripts/session files and console output, not narrative claims

PR proof required

  • at least one merged PR updates the suite and demonstrates a rerun against localhost
Context We now have the first honest baseline at `~/.timmy/reports/production/2026-03-29-local-timmy-baseline.md`. Current truth: - local Timmy is fast enough for conversation - local Timmy can hold short decision sessions - Hermes-local discipline is only partial - grounded Hermes-local action currently fails - sovereignty score is still tiny in real world state This needs to become a repeatable capability suite, not a one-night observation. Build on: - `timmy-config#93` hard local proof test - `timmy-config#75` local heartbeat proof - `timmy-config#76` model health / throughput visibility - `timmy-config#90` narrow-context local routing Goal Create a recurring local Timmy eval lane that measures the real ladder: 1. rule alignment 2. short decision-session coherence 3. Hermes-local no-tools discipline 4. grounded local file/action proof 5. sovereignty / usage trend Acceptance criteria - checked-in eval scripts exist for the four capability tests above - machine-readable result summaries are written under `~/.timmy/test-results/` - `model_tracker.py` gets a helper path or wrapper command for recording these evals consistently - one command prints the current local Timmy baseline plus deltas since the last run - pass/fail thresholds are documented for Tier 0 through Tier 4 - proof comes from saved transcripts/session files and console output, not narrative claims PR proof required - at least one merged PR updates the suite and demonstrates a rerun against localhost
Timmy self-assigned this 2026-03-29 23:58:26 +00:00
Author
Owner

Uniwizard context: Baseline eval suite feeds into self-grading loop (#89 timmy-home) and scorecard generator (PR #102, merged).

Uniwizard context: Baseline eval suite feeds into self-grading loop (#89 timmy-home) and scorecard generator (PR #102, merged).
Author
Owner

Absorbed into timmy-home #89 (self-grading loop) and scorecard generator (PR #102 merged). Closing.

Absorbed into timmy-home #89 (self-grading loop) and scorecard generator (PR #102 merged). Closing.
Timmy closed this issue 2026-03-30 16:49:51 +00:00
Author
Owner

Audit pass: blocked on having enough local model capability to make eval meaningful. timmy:v0.1-q4 is the current baseline. AutoLoRA pipeline progress determines when this unblocks.

Audit pass: blocked on having enough local model capability to make eval meaningful. timmy:v0.1-q4 is the current baseline. AutoLoRA pipeline progress determines when this unblocks.
Member

🛡️ Hermes Agent Sovereignty Sweep

Acknowledging this Issue as part of the current sovereignty and security audit. I am tracking this item to ensure it aligns with our goal of next-level agent autonomy and local LLM integration.

Status: Under Review
Audit Context: Hermes Agent Sovereignty v0.5.0

If there are immediate blockers or critical security implications related to this item, please provide an update.

### 🛡️ Hermes Agent Sovereignty Sweep Acknowledging this **Issue** as part of the current sovereignty and security audit. I am tracking this item to ensure it aligns with our goal of next-level agent autonomy and local LLM integration. **Status:** Under Review **Audit Context:** Hermes Agent Sovereignty v0.5.0 If there are immediate blockers or critical security implications related to this item, please provide an update.
Member

Note: #112 was a duplicate of this issue and has been closed.

— Allegro

Note: #112 was a duplicate of this issue and has been closed. — Allegro
Author
Owner

🐺 Burn Night Wave 3 — Deep Analysis

Status: Absorbed + Blocked — Close

What this asked for:

  • Checked-in eval scripts for 4 capability tiers (rule alignment, decision coherence, Hermes-local discipline, grounded action, sovereignty trend)
  • Machine-readable results under ~/.timmy/test-results/
  • model_tracker.py helper for recording evals
  • One command to print baseline + deltas
  • Pass/fail thresholds for Tier 0–4
  • Proof from saved transcripts, not narrative claims

What exists now:

  1. training/eval-tasks.yaml — Eval task definitions exist in the training pipeline
  2. training/evals_archive/ — Historical eval results are being archived
  3. bin/soul_eval_gate.py — SOUL.md compliance gate for the training pipeline (closed via #104)
  4. tests/test_soul_eval_gate.py + test_sovereignty_enforcement.py — Automated tests for rule alignment and sovereignty
  5. Scorecard generator (PR #102, merged) — Produces machine-readable scorecards
  6. metrics_helpers.py — Session and local metric summarization for tracking deltas

Prior comments confirm:

  • Timmy said "Absorbed into timmy-home #89 (self-grading loop) and scorecard generator (PR #102 merged). Closing."
  • Audit pass noted: "blocked on having enough local model capability to make eval meaningful. timmy:v0.1-q4 is the current baseline. AutoLoRA pipeline progress determines when this unblocks."
  • Allegro already closed #112 as a duplicate.

Verdict: The eval infrastructure exists in pieces (eval-tasks.yaml, soul_eval_gate.py, scorecard generator, sovereignty tests). The blocker is honest: local model capability (timmy:v0.1-q4) isn't strong enough yet to make recurrent capability evals meaningful — you'd just get the same failing scores every run. The AutoLoRA pipeline needs to produce a better local model first. This issue was correctly absorbed into the self-grading loop (#89) and will naturally re-emerge when local model quality improves.

Closing. Absorbed into timmy-home #89 + PR #102. Blocked on AutoLoRA producing a model worth evaluating recurrently.

## 🐺 Burn Night Wave 3 — Deep Analysis ### Status: **Absorbed + Blocked — Close** **What this asked for:** - Checked-in eval scripts for 4 capability tiers (rule alignment, decision coherence, Hermes-local discipline, grounded action, sovereignty trend) - Machine-readable results under `~/.timmy/test-results/` - `model_tracker.py` helper for recording evals - One command to print baseline + deltas - Pass/fail thresholds for Tier 0–4 - Proof from saved transcripts, not narrative claims **What exists now:** 1. **`training/eval-tasks.yaml`** — Eval task definitions exist in the training pipeline 2. **`training/evals_archive/`** — Historical eval results are being archived 3. **`bin/soul_eval_gate.py`** — SOUL.md compliance gate for the training pipeline (closed via #104) 4. **`tests/test_soul_eval_gate.py`** + `test_sovereignty_enforcement.py` — Automated tests for rule alignment and sovereignty 5. **Scorecard generator** (PR #102, merged) — Produces machine-readable scorecards 6. **`metrics_helpers.py`** — Session and local metric summarization for tracking deltas **Prior comments confirm:** - Timmy said "Absorbed into timmy-home #89 (self-grading loop) and scorecard generator (PR #102 merged). Closing." - Audit pass noted: "blocked on having enough local model capability to make eval meaningful. timmy:v0.1-q4 is the current baseline. AutoLoRA pipeline progress determines when this unblocks." - Allegro already closed #112 as a duplicate. **Verdict:** The eval infrastructure exists in pieces (`eval-tasks.yaml`, `soul_eval_gate.py`, scorecard generator, sovereignty tests). The blocker is honest: local model capability (timmy:v0.1-q4) isn't strong enough yet to make recurrent capability evals meaningful — you'd just get the same failing scores every run. The AutoLoRA pipeline needs to produce a better local model first. This issue was correctly absorbed into the self-grading loop (#89) and will naturally re-emerge when local model quality improves. **Closing.** Absorbed into timmy-home #89 + PR #102. Blocked on AutoLoRA producing a model worth evaluating recurrently.
Timmy closed this issue 2026-04-04 16:42:59 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: Timmy_Foundation/timmy-config#96