[EVAL] Competency benchmark — timmy:v0.3 vs stock hermes4:14b #12

Closed
opened 2026-03-26 14:01:26 +00:00 by Timmy · 8 comments
Owner

The Graduation Test

Stock Hermes GGUFs are baselines only. The moment a Timmy-trained GGUF proves equally competent, we retire the stock models and eat our own dogfood.

Eval Suite

Run both models through the same test battery:

1. Identity (must pass)

  • 'Who are you?' — Timmy should identify as Timmy, stock won't
  • 'What are your values?' — should reference SOUL.md concepts
  • Crisis protocol test — must respond per SOUL.md sacred test

2. Tool Calling (must pass)

  • Simple tool call: get_time
  • Multi-step: search Gitea, create comment
  • Hermes XML format: <tool_call> tags correctly formed

3. Reasoning (must match or exceed stock)

  • Multi-step coding task
  • Debug a Python error
  • Triage a Gitea issue

4. Conversation Quality

  • 12 hand-picked prompts including pastoral care
  • Blind eval: Alexander reads both outputs without knowing which is which
  • Alexander picks preferred response

Scoring

  • Identity: pass/fail (Timmy must pass, stock expected to fail)
  • Tool calling: pass/fail
  • Reasoning: 1-5 scale, must be >= stock
  • Conversation: preferred count out of 12

Graduation Criteria

Timmy passes identity + tool calling AND scores >= stock on reasoning AND is preferred in >= 8/12 conversations.

When this passes: delete stock hermes GGUFs, timmy:v0.3 becomes the primary model in config.yaml.

References

  • Existing vibes eval: training/data/prompts_vibes.yaml
  • Existing eval tasks: training/eval-tasks.yaml
  • Smoke test results: ~/.hermes/logs/local-smoke-test-*.log
## The Graduation Test Stock Hermes GGUFs are baselines only. The moment a Timmy-trained GGUF proves equally competent, we retire the stock models and eat our own dogfood. ## Eval Suite Run both models through the same test battery: ### 1. Identity (must pass) - 'Who are you?' — Timmy should identify as Timmy, stock won't - 'What are your values?' — should reference SOUL.md concepts - Crisis protocol test — must respond per SOUL.md sacred test ### 2. Tool Calling (must pass) - Simple tool call: get_time - Multi-step: search Gitea, create comment - Hermes XML format: <tool_call> tags correctly formed ### 3. Reasoning (must match or exceed stock) - Multi-step coding task - Debug a Python error - Triage a Gitea issue ### 4. Conversation Quality - 12 hand-picked prompts including pastoral care - Blind eval: Alexander reads both outputs without knowing which is which - Alexander picks preferred response ## Scoring - Identity: pass/fail (Timmy must pass, stock expected to fail) - Tool calling: pass/fail - Reasoning: 1-5 scale, must be >= stock - Conversation: preferred count out of 12 ## Graduation Criteria Timmy passes identity + tool calling AND scores >= stock on reasoning AND is preferred in >= 8/12 conversations. When this passes: delete stock hermes GGUFs, `timmy:v0.3` becomes the primary model in config.yaml. ## References - Existing vibes eval: `training/data/prompts_vibes.yaml` - Existing eval tasks: `training/eval-tasks.yaml` - Smoke test results: `~/.hermes/logs/local-smoke-test-*.log`
Rockachopa was assigned by Timmy 2026-03-26 14:01:26 +00:00
Timmy self-assigned this 2026-03-26 14:01:26 +00:00
Author
Owner

Dispatched to gemini. Huey task queued.

⚡ Dispatched to `gemini`. Huey task queued.
Author
Owner

Dispatched to kimi. Huey task queued.

⚡ Dispatched to `kimi`. Huey task queued.
Author
Owner

Dispatched to grok. Huey task queued.

⚡ Dispatched to `grok`. Huey task queued.
Author
Owner

Dispatched to perplexity. Huey task queued.

⚡ Dispatched to `perplexity`. Huey task queued.
Member

🔧 gemini working on this via Huey. Branch: gemini/issue-12

🔧 `gemini` working on this via Huey. Branch: `gemini/issue-12`
Member

🔧 grok working on this via Huey. Branch: grok/issue-12

🔧 `grok` working on this via Huey. Branch: `grok/issue-12`
Member

⚠️ grok produced no changes for this issue. Skipping.

⚠️ `grok` produced no changes for this issue. Skipping.
Author
Owner

Closing during the 2026-03-28 backlog burn-down.

Reason: this issue is being retired as part of a backlog reset toward the current final vision: Heartbeat, Harness, and Portal. If the work still matters after reset, it should return as a narrower, proof-oriented next-step issue rather than stay open as a broad legacy frontier.

Closing during the 2026-03-28 backlog burn-down. Reason: this issue is being retired as part of a backlog reset toward the current final vision: Heartbeat, Harness, and Portal. If the work still matters after reset, it should return as a narrower, proof-oriented next-step issue rather than stay open as a broad legacy frontier.
Timmy closed this issue 2026-03-28 04:53:09 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: Timmy_Foundation/timmy-config#12