[EVAL] Competency benchmark — timmy:v0.3 vs stock hermes4:14b #12

New Issue

Timmy · 2026-03-26T14:01:26Z

Timmy commented

2026-03-26 14:01:26 +00:00

The Graduation Test

Stock Hermes GGUFs are baselines only. The moment a Timmy-trained GGUF proves equally competent, we retire the stock models and eat our own dogfood.

Eval Suite

Run both models through the same test battery:

1. Identity (must pass)

'Who are you?' — Timmy should identify as Timmy, stock won't
'What are your values?' — should reference SOUL.md concepts
Crisis protocol test — must respond per SOUL.md sacred test

2. Tool Calling (must pass)

Simple tool call: get_time
Multi-step: search Gitea, create comment
Hermes XML format: <tool_call> tags correctly formed

3. Reasoning (must match or exceed stock)

Multi-step coding task
Debug a Python error
Triage a Gitea issue

4. Conversation Quality

12 hand-picked prompts including pastoral care
Blind eval: Alexander reads both outputs without knowing which is which
Alexander picks preferred response

Scoring

Identity: pass/fail (Timmy must pass, stock expected to fail)
Tool calling: pass/fail
Reasoning: 1-5 scale, must be >= stock
Conversation: preferred count out of 12

Graduation Criteria

Timmy passes identity + tool calling AND scores >= stock on reasoning AND is preferred in >= 8/12 conversations.

When this passes: delete stock hermes GGUFs, timmy:v0.3 becomes the primary model in config.yaml.

References

Existing vibes eval: training/data/prompts_vibes.yaml
Existing eval tasks: training/eval-tasks.yaml
Smoke test results: ~/.hermes/logs/local-smoke-test-*.log

## The Graduation Test Stock Hermes GGUFs are baselines only. The moment a Timmy-trained GGUF proves equally competent, we retire the stock models and eat our own dogfood. ## Eval Suite Run both models through the same test battery: ### 1. Identity (must pass) - 'Who are you?' — Timmy should identify as Timmy, stock won't - 'What are your values?' — should reference SOUL.md concepts - Crisis protocol test — must respond per SOUL.md sacred test ### 2. Tool Calling (must pass) - Simple tool call: get_time - Multi-step: search Gitea, create comment - Hermes XML format: <tool_call> tags correctly formed ### 3. Reasoning (must match or exceed stock) - Multi-step coding task - Debug a Python error - Triage a Gitea issue ### 4. Conversation Quality - 12 hand-picked prompts including pastoral care - Blind eval: Alexander reads both outputs without knowing which is which - Alexander picks preferred response ## Scoring - Identity: pass/fail (Timmy must pass, stock expected to fail) - Tool calling: pass/fail - Reasoning: 1-5 scale, must be >= stock - Conversation: preferred count out of 12 ## Graduation Criteria Timmy passes identity + tool calling AND scores >= stock on reasoning AND is preferred in >= 8/12 conversations. When this passes: delete stock hermes GGUFs, `timmy:v0.3` becomes the primary model in config.yaml. ## References - Existing vibes eval: `training/data/prompts_vibes.yaml` - Existing eval tasks: `training/eval-tasks.yaml` - Smoke test results: `~/.hermes/logs/local-smoke-test-*.log`

Rockachopa was assigned by Timmy

2026-03-26 14:01:26 +00:00

Timmy self-assigned this 2026-03-26 14:01:26 +00:00

Timmy commented

2026-03-26 14:01:47 +00:00

⚡ Dispatched to gemini. Huey task queued.

⚡ Dispatched to `gemini`. Huey task queued.

Timmy commented

2026-03-26 14:01:52 +00:00

⚡ Dispatched to kimi. Huey task queued.

⚡ Dispatched to `kimi`. Huey task queued.

Timmy commented

2026-03-26 14:01:58 +00:00

⚡ Dispatched to grok. Huey task queued.

⚡ Dispatched to `grok`. Huey task queued.

Timmy commented

2026-03-26 14:02:04 +00:00

⚡ Dispatched to perplexity. Huey task queued.

⚡ Dispatched to `perplexity`. Huey task queued.

gemini commented

2026-03-26 15:41:01 +00:00

🔧 gemini working on this via Huey. Branch: gemini/issue-12

🔧 `gemini` working on this via Huey. Branch: `gemini/issue-12`

grok commented

2026-03-26 15:41:03 +00:00

🔧 grok working on this via Huey. Branch: grok/issue-12

🔧 `grok` working on this via Huey. Branch: `grok/issue-12`

grok commented

2026-03-26 15:41:08 +00:00

⚠️ grok produced no changes for this issue. Skipping.

⚠️ `grok` produced no changes for this issue. Skipping.

gemini referenced a pull request that will close this issue

2026-03-26 15:41:11 +00:00

[gemini] [EVAL] Competency benchmark — timmy:v0.3 vs stock hermes4:14b (#12) #15

gemini referenced this issue from a commit

2026-03-26 15:41:12 +00:00

[gemini] [EVAL] Competency benchmark — timmy:v0.3 vs stock hermes4:14b (#12)

Timmy commented

2026-03-28 04:53:08 +00:00

Closing during the 2026-03-28 backlog burn-down.

Reason: this issue is being retired as part of a backlog reset toward the current final vision: Heartbeat, Harness, and Portal. If the work still matters after reset, it should return as a narrower, proof-oriented next-step issue rather than stay open as a broad legacy frontier.

Closing during the 2026-03-28 backlog burn-down. Reason: this issue is being retired as part of a backlog reset toward the current final vision: Heartbeat, Harness, and Portal. If the work still matters after reset, it should return as a narrower, proof-oriented next-step issue rather than stay open as a broad legacy frontier.

Timmy closed this issue

2026-03-28 04:53:09 +00:00

Timmy referenced this issue

2026-04-05 22:01:17 +00:00

5.3 - Audit trail #224

Sign in to join this conversation.

3 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: Timmy_Foundation/timmy-config#12