[GEMINI-07] Model evaluation harness — benchmark every GGUF before deploying #405

Closed
opened 2026-04-08 10:53:01 +00:00 by Timmy · 0 comments

Part of Epic: #398

We deploy models blind. We say 'qwen2.5-coder-1.5b on Bezalel' without knowing whether it is actually good enough for the work.

Build an evaluation harness:

  • Standard test suite: code generation, bug fixing, issue triage, documentation
  • Run against any llama-server endpoint (a minimal sketch follows this list)
  • Produce: tok/s, quality score, task-specific pass rate
  • Compare models: is qwen2.5-7b worth 4x the RAM vs 1.5b?
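
The measurement side can be small. Below is a minimal sketch, assuming a llama.cpp `llama-server` exposing its `/completion` endpoint; the `timings` field names match what recent llama.cpp builds return but should be verified against the deployed version, and the host/port in the example are hypothetical.

```python
import time

import requests


def run_completion(base_url: str, prompt: str, n_predict: int = 256) -> dict:
    """Send one prompt to a llama-server /completion endpoint and return
    the generated text plus a tokens-per-second estimate."""
    t0 = time.monotonic()
    resp = requests.post(
        f"{base_url}/completion",
        json={"prompt": prompt, "n_predict": n_predict, "temperature": 0.0},
        timeout=300,
    )
    resp.raise_for_status()
    elapsed = time.monotonic() - t0
    data = resp.json()

    # Prefer the server's own timings when present (recent llama.cpp builds
    # return a "timings" object); otherwise fall back to wall-clock time.
    tok_s = data.get("timings", {}).get("predicted_per_second")
    if tok_s is None:
        n_tokens = data.get("tokens_predicted", n_predict)
        tok_s = n_tokens / elapsed if elapsed > 0 else 0.0

    return {"text": data.get("content", ""), "tok_s": tok_s, "wall_s": elapsed}


if __name__ == "__main__":
    # Hypothetical endpoint; point this at any llama-server instance.
    r = run_completion("http://bezalel:8080",
                       "Write a Python function that reverses a string.")
    print(f"{r['tok_s']:.1f} tok/s over {r['wall_s']:.1f}s wall time")
```

The wall-clock fallback keeps the harness usable even against a server build that omits the timings block.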

Acceptance Criteria

  • Test suite with at least 10 tasks spanning code/docs/triage (example task format sketched after this list)
  • Runs against any llama-server URL
  • Produces scored comparison report
  • Deployed models justified by eval results, not vibes
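
To make the first criterion concrete, here is a sketch of one possible task format plus the scored comparison loop. All task IDs, prompts, checkers, model names, and endpoint URLs are invented for illustration; a real code task would execute the generated code against unit tests rather than string-match the output.

```python
import requests


def run_completion(base_url: str, prompt: str, n_predict: int = 256) -> dict:
    """Same helper as the sketch above, trimmed to what this example needs."""
    resp = requests.post(
        f"{base_url}/completion",
        json={"prompt": prompt, "n_predict": n_predict, "temperature": 0.0},
        timeout=300,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "text": data.get("content", ""),
        "tok_s": data.get("timings", {}).get("predicted_per_second", 0.0),
    }


# Hypothetical task entries, one per capability area. The checkers here are
# deliberately crude stand-ins for real graders.
TASKS = [
    {
        "id": "code-reverse-string",
        "kind": "code",
        "prompt": "Write a Python function reverse(s) that returns s reversed.",
        "check": lambda out: "def reverse" in out,
    },
    {
        "id": "docs-one-liner",
        "kind": "docs",
        "prompt": "In one sentence, explain what a context window is.",
        "check": lambda out: "token" in out.lower() or "context" in out.lower(),
    },
    {
        "id": "triage-label-bug",
        "kind": "triage",
        "prompt": "Label this issue bug, feature, or question: "
                  "'App crashes on startup after update.' Answer with one word.",
        "check": lambda out: "bug" in out.lower(),
    },
]


def evaluate(base_url: str) -> dict:
    """Run every task against one endpoint; return pass rate and mean tok/s."""
    passed, speeds = 0, []
    for task in TASKS:
        result = run_completion(base_url, task["prompt"])
        speeds.append(result["tok_s"])
        if task["check"](result["text"]):
            passed += 1
    return {"pass_rate": passed / len(TASKS),
            "mean_tok_s": sum(speeds) / len(speeds)}


if __name__ == "__main__":
    # Hypothetical deployments: one llama-server per candidate model, so the
    # report answers "is the 7b worth 4x the RAM?" with numbers.
    for name, url in [("qwen2.5-coder-1.5b", "http://bezalel:8080"),
                      ("qwen2.5-coder-7b", "http://bezalel:8081")]:
        s = evaluate(url)
        print(f"{name}: pass {s['pass_rate']:.0%}, {s['mean_tok_s']:.1f} tok/s")
```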
gemini was assigned by Timmy 2026-04-08 10:53:01 +00:00
bezalel was assigned by Timmy 2026-04-08 16:00:24 +00:00
Reference: Timmy_Foundation/timmy-config#405