[EVAL] Competency benchmark — timmy:v0.3 vs stock hermes4:14b #12
The Graduation Test
Stock Hermes GGUFs are baselines only. The moment a Timmy-trained GGUF proves equally competent, we retire the stock models and eat our own dogfood.
Eval Suite
Run both models through the same test battery:
1. Identity (must pass)
2. Tool Calling (must pass)
3. Reasoning (must match or exceed stock)
4. Conversation Quality
Scoring
Graduation Criteria
Timmy passes identity + tool calling AND scores >= stock on reasoning AND is preferred in >= 8/12 conversations.
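The rule above can be sketched as a single boolean check. This is a minimal illustration, not the project's actual eval harness: the `EvalResult` fields, the `graduates` function, and the numeric reasoning score are hypothetical names and assumptions chosen for the example. Only the three conditions (identity and tool calling pass, reasoning >= stock, preferred in >= 8 of 12 conversations) come from the issue.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical summary of one model's run through the eval suite."""
    identity_pass: bool          # 1. Identity (must pass)
    tool_calling_pass: bool      # 2. Tool Calling (must pass)
    reasoning_score: float       # 3. Reasoning (assumed to be a single number)
    conversations_preferred: int = 0  # 4. Conversations where this model was preferred


def graduates(timmy: EvalResult, stock: EvalResult, min_preferred: int = 8) -> bool:
    """Return True when the candidate meets all three graduation criteria."""
    must_pass = timmy.identity_pass and timmy.tool_calling_pass
    reasoning_ok = timmy.reasoning_score >= stock.reasoning_score
    preferred_ok = timmy.conversations_preferred >= min_preferred
    return must_pass and reasoning_ok and preferred_ok


if __name__ == "__main__":
    # Made-up numbers, purely to show the call shape.
    stock = EvalResult(True, True, reasoning_score=0.80)
    timmy = EvalResult(True, True, reasoning_score=0.80, conversations_preferred=8)
    print(graduates(timmy, stock))
```

Keeping the check as a pure function of two result records means the same rule could be rerun against any later Timmy version without touching the test battery itself.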
When this passes: delete the stock hermes GGUFs, and timmy:v0.3 becomes the primary model in config.yaml.
References
training/data/prompts_vibes.yaml
training/eval-tasks.yaml
~/.hermes/logs/local-smoke-test-*.log
⚡ Dispatched to gemini. Huey task queued.
⚡ Dispatched to kimi. Huey task queued.
⚡ Dispatched to grok. Huey task queued.
⚡ Dispatched to perplexity. Huey task queued.
🔧 gemini working on this via Huey. Branch: gemini/issue-12
🔧 grok working on this via Huey. Branch: grok/issue-12
⚠️ grok produced no changes for this issue. Skipping.
Closing during the 2026-03-28 backlog burn-down.
Reason: this issue is being retired as part of a backlog reset toward the current final vision: Heartbeat, Harness, and Portal. If the work still matters after reset, it should return as a narrower, proof-oriented next-step issue rather than stay open as a broad legacy frontier.