[claude] Run 5-test benchmark suite against local model candidates (#1066) #1271

Merged
claude merged 2 commits from claude/issue-1066 into main 2026-03-24 01:39:01 +00:00

2 Commits

Author SHA1 Message Date
Alexander Whitestone
2899a24475 feat: run 5-test benchmark suite and record real results (#1066)
Some checks failed
Tests / lint (pull_request) Failing after 31s
Tests / test (pull_request) Has been skipped
Execute the benchmark suite against hermes3:8b, qwen3.5:latest, qwen2.5:14b,
and llama3.2:latest (substitutes for the unavailable qwen3:14b, qwen3:8b, and dolphin3).

Results summary:
- qwen2.5:14b: 4/5 PASS — best performer (100% tool calling, PASS code gen, PASS shell, 100% coherence)
- hermes3:8b:  3/5 PASS — 100% tool calling, PASS code gen, PASS shell
- llama3.2:latest (3b): 3/5 PASS — fast (45s), PASS code gen + shell, 100% coherence
- qwen3.5:latest: 1/5 PASS — slow (310s), mostly fails

Fixes #1066

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 21:38:25 -04:00
Alexander Whitestone
32a8f90933 WIP: Claude Code progress on #1066
Automated salvage commit — agent session ended (exit 124).
Work in progress, may need continuation.
2026-03-23 14:37:10 -04:00