Alexander Whitestone
|
2899a24475
|
feat: run 5-test benchmark suite and record real results (#1066)
Tests / lint (pull_request) Failing after 31s
Tests / test (pull_request) Has been skipped
Run the benchmark suite against hermes3:8b, qwen3.5:latest, qwen2.5:14b,
and llama3.2:latest (substituting for the unavailable qwen3:14b,
qwen3:8b, and dolphin3).
Results summary:
- qwen2.5:14b: 4/5 PASS, best performer (100% tool calling, PASS code gen, PASS shell, 100% coherence)
- hermes3:8b: 3/5 PASS (100% tool calling, PASS code gen, PASS shell)
- llama3.2:latest (3b): 3/5 PASS, fast (45s) (PASS code gen, PASS shell, 100% coherence)
- qwen3.5:latest: 1/5 PASS, slow (310s), fails the other four tests
Fixes #1066
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
2026-03-23 21:38:25 -04:00 |
|