[GEMINI-07] Model evaluation harness — benchmark every GGUF before deploying #405

Closed
opened 2026-04-08 10:53:01 +00:00 by Timmy · 0 comments

Part of Epic: #398

We deploy models blind. We say 'qwen2.5-coder-1.5b on Bezalel' without knowing whether it is actually good enough for the work.

Build an evaluation harness:

  • Standard test suite: code generation, bug fixing, issue triage, documentation
  • Run against any llama-server endpoint (a minimal sketch follows this list)
  • Produce: tok/s, quality score, task-specific pass rate
  • Compare models: is qwen2.5-7b worth 4x the RAM vs 1.5b?
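
The measurement side can be small. Below is a minimal sketch, assuming a llama.cpp `llama-server` exposing its `/completion` endpoint; the `timings` field names match what recent llama.cpp builds return but should be verified against the deployed version, and the host/port in the example are hypothetical.

```python
import time

import requests


def run_completion(base_url: str, prompt: str, n_predict: int = 256) -> dict:
    """Send one prompt to a llama-server /completion endpoint and return
    the generated text plus a tokens-per-second estimate."""
    t0 = time.monotonic()
    resp = requests.post(
        f"{base_url}/completion",
        json={"prompt": prompt, "n_predict": n_predict, "temperature": 0.0},
        timeout=300,
    )
    resp.raise_for_status()
    elapsed = time.monotonic() - t0
    data = resp.json()

    # Prefer the server's own timings when present (recent llama.cpp builds
    # return a "timings" object); otherwise fall back to wall-clock time.
    tok_s = data.get("timings", {}).get("predicted_per_second")
    if tok_s is None:
        n_tokens = data.get("tokens_predicted", n_predict)
        tok_s = n_tokens / elapsed if elapsed > 0 else 0.0

    return {"text": data.get("content", ""), "tok_s": tok_s, "wall_s": elapsed}


if __name__ == "__main__":
    # Hypothetical endpoint; point this at any llama-server instance.
    r = run_completion("http://bezalel:8080",
                       "Write a Python function that reverses a string.")
    print(f"{r['tok_s']:.1f} tok/s over {r['wall_s']:.1f}s wall time")
```

The wall-clock fallback keeps the harness usable even against a server build that omits the timings block.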

Acceptance Criteria

  • Test suite with at least 10 tasks spanning code/docs/triage (example task format sketched after this list)
  • Runs against any llama-server URL
  • Produces scored comparison report
  • Deployed models justified by eval results, not vibes
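
To make the first criterion concrete, here is a sketch of one possible task format plus the scored comparison loop. All task IDs, prompts, checkers, model names, and endpoint URLs are invented for illustration; a real code task would execute the generated code against unit tests rather than string-match the output.

```python
import requests


def run_completion(base_url: str, prompt: str, n_predict: int = 256) -> dict:
    """Same helper as the sketch above, trimmed to what this example needs."""
    resp = requests.post(
        f"{base_url}/completion",
        json={"prompt": prompt, "n_predict": n_predict, "temperature": 0.0},
        timeout=300,
    )
    resp.raise_for_status()
    data = resp.json()
    return {
        "text": data.get("content", ""),
        "tok_s": data.get("timings", {}).get("predicted_per_second", 0.0),
    }


# Hypothetical task entries, one per capability area. The checkers here are
# deliberately crude stand-ins for real graders.
TASKS = [
    {
        "id": "code-reverse-string",
        "kind": "code",
        "prompt": "Write a Python function reverse(s) that returns s reversed.",
        "check": lambda out: "def reverse" in out,
    },
    {
        "id": "docs-one-liner",
        "kind": "docs",
        "prompt": "In one sentence, explain what a context window is.",
        "check": lambda out: "token" in out.lower() or "context" in out.lower(),
    },
    {
        "id": "triage-label-bug",
        "kind": "triage",
        "prompt": "Label this issue bug, feature, or question: "
                  "'App crashes on startup after update.' Answer with one word.",
        "check": lambda out: "bug" in out.lower(),
    },
]


def evaluate(base_url: str) -> dict:
    """Run every task against one endpoint; return pass rate and mean tok/s."""
    passed, speeds = 0, []
    for task in TASKS:
        result = run_completion(base_url, task["prompt"])
        speeds.append(result["tok_s"])
        if task["check"](result["text"]):
            passed += 1
    return {"pass_rate": passed / len(TASKS),
            "mean_tok_s": sum(speeds) / len(speeds)}


if __name__ == "__main__":
    # Hypothetical deployments: one llama-server per candidate model, so the
    # report answers "is the 7b worth 4x the RAM?" with numbers.
    for name, url in [("qwen2.5-coder-1.5b", "http://bezalel:8080"),
                      ("qwen2.5-coder-7b", "http://bezalel:8081")]:
        s = evaluate(url)
        print(f"{name}: pass {s['pass_rate']:.0%}, {s['mean_tok_s']:.1f} tok/s")
```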
gemini was assigned by Timmy 2026-04-08 10:53:01 +00:00
bezalel was assigned by Timmy 2026-04-08 16:00:24 +00:00
Reference: Timmy_Foundation/timmy-config#405