[BASELINE/EVAL] Recurrent local Timmy capability suite and scoreboard #96
Context
We now have the first honest baseline at ~/.timmy/reports/production/2026-03-29-local-timmy-baseline.md.

Current truth:
This needs to become a repeatable capability suite, not a one-night observation.
Build on:
- timmy-config#93: hard local proof test
- timmy-config#75: local heartbeat proof
- timmy-config#76: model health / throughput visibility
- timmy-config#90: narrow-context local routing

Goal
Create a recurring local Timmy eval lane that measures the real ladder:
Acceptance criteria
- ~/.timmy/test-results/model_tracker.py gets a helper path or wrapper command for recording these evals consistently (see the sketch below)
- PR proof required
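For concreteness, a minimal sketch of what such a recording wrapper might look like. model_tracker.py's actual interface isn't shown in this issue, so the function name `record_eval`, the `capability_suite.jsonl` file, and the record fields are all illustrative assumptions, not the real API:

```python
#!/usr/bin/env python3
"""Minimal sketch of an eval-recording wrapper.

model_tracker.py's real interface isn't shown in this issue, so the
function name, output file name, and record fields below are assumptions.
"""
import json
import time
from pathlib import Path

RESULTS_DIR = Path.home() / ".timmy" / "test-results"  # path from this issue

def record_eval(model: str, task: str, passed: bool, notes: str = "") -> Path:
    """Append one eval result as a JSON line so runs stay comparable."""
    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    out = RESULTS_DIR / "capability_suite.jsonl"  # hypothetical file name
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model,    # e.g. "timmy:v0.1-q4"
        "task": task,      # e.g. "local-heartbeat-proof"
        "passed": passed,
        "notes": notes,
    }
    with out.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return out

if __name__ == "__main__":
    record_eval("timmy:v0.1-q4", "hard-local-proof-test", passed=False)
```

An append-only JSONL file keeps every run comparable over time, which is the point of a recurring suite rather than a one-night observation.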
Uniwizard context: Baseline eval suite feeds into self-grading loop (#89 timmy-home) and scorecard generator (PR #102, merged).
Absorbed into timmy-home #89 (self-grading loop) and scorecard generator (PR #102 merged). Closing.
Audit pass: blocked on having enough local model capability to make the evals meaningful. timmy:v0.1-q4 is the current baseline. AutoLoRA pipeline progress determines when this unblocks.
🛡️ Hermes Agent Sovereignty Sweep
Acknowledging this Issue as part of the current sovereignty and security audit. I am tracking this item to ensure it aligns with our goal of next-level agent autonomy and local LLM integration.
Status: Under Review
Audit Context: Hermes Agent Sovereignty v0.5.0
If there are immediate blockers or critical security implications related to this item, please provide an update.
Note: #112 was a duplicate of this issue and has been closed.
— Allegro
🐺 Burn Night Wave 3 — Deep Analysis
Status: Absorbed + Blocked — Close
What this asked for:
- ~/.timmy/test-results/model_tracker.py helper for recording evals

What exists now (see the sketch after this list for how these pieces could combine):
- training/eval-tasks.yaml: eval task definitions exist in the training pipeline
- training/evals_archive/: historical eval results are being archived
- bin/soul_eval_gate.py: SOUL.md compliance gate for the training pipeline (closed via #104)
- tests/test_soul_eval_gate.py + test_sovereignty_enforcement.py: automated tests for rule alignment and sovereignty
- metrics_helpers.py: session and local metric summarization for tracking deltas

Prior comments confirm:
- Absorbed into timmy-home #89 (self-grading loop) and scorecard generator (PR #102, merged)
- Blocked on local model capability; timmy:v0.1-q4 is the current baseline
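A hedged sketch of the recurring runner this issue asked for, built on the pieces above. Only the paths training/eval-tasks.yaml and training/evals_archive/ come from this thread; the YAML schema, the grading stub, and the output naming are illustrative assumptions:

```python
#!/usr/bin/env python3
"""Hedged sketch of a recurring suite runner tying the existing pieces together.

Only training/eval-tasks.yaml and training/evals_archive/ come from this
thread; the YAML schema, grading stub, and output naming are assumptions.
"""
import datetime
import json
from pathlib import Path

import yaml  # PyYAML

TASKS_FILE = Path("training/eval-tasks.yaml")
ARCHIVE_DIR = Path("training/evals_archive")

def run_task(task: dict) -> bool:
    """Placeholder grader: a real version would prompt the local model
    (e.g. timmy:v0.1-q4) and check the reply against an expected answer."""
    return False  # fail closed until wired to the local model runner

def run_suite(model: str) -> Path:
    # Assumes eval-tasks.yaml parses to a list of task dicts with a "name" key.
    tasks = yaml.safe_load(TASKS_FILE.read_text())
    results = [{"task": t.get("name", "?"), "passed": run_task(t)} for t in tasks]
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    out = ARCHIVE_DIR / f"{stamp}-{model.replace(':', '_')}.json"
    out.write_text(json.dumps({"model": model, "results": results}, indent=2))
    return out
```

Failing closed on unwired tasks matters here: until the grader is connected, every run scores zero, which matches the blocker described in the verdict below.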
Verdict: The eval infrastructure exists in pieces (eval-tasks.yaml, soul_eval_gate.py, scorecard generator, sovereignty tests). The blocker is honest: local model capability (timmy:v0.1-q4) isn't strong enough yet to make recurrent capability evals meaningful; you'd just get the same failing scores every run. The AutoLoRA pipeline needs to produce a better local model first. This issue was correctly absorbed into the self-grading loop (#89) and will naturally re-emerge when local model quality improves.

Closing. Absorbed into timmy-home #89 + PR #102. Blocked on AutoLoRA producing a model worth evaluating recurrently.