[claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) #1159

Merged
Rockachopa merged 2 commits from claude/issue-1092 into main 2026-03-23 19:38:48 +00:00

2 Commits

Author SHA1 Message Date
Alexander Whitestone
40561b480f feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion
Some checks failed
Tests / lint (pull_request) Failing after 18s
Tests / test (pull_request) Has been skipped
Refs #1092

- Run full 6-level benchmark against hermes3:latest and hermes3:8b
- Fix Level 1 harness: coerce string-digit moves to int (hermes3 emits
  "move": "4" instead of "move": 4; type fidelity is already tested at L0)
- All three models (qwen2.5:14b, hermes3:latest, hermes3:8b) pass L0
  and L2–L5 but fail L1 with a 50% score
- Root cause of the L1 failures: the models pick heuristic moves without
  checking whether the target square is already occupied
- M1 gate: FAILED for all models; qwen3:14b was not available on the instance
- Add SCORECARD.md with full cross-model comparison and recommendations
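
A minimal sketch of the two L1 points above: coercing a string-digit move to an int, and checking whether the target square is occupied. This is not the harness code from this PR. The names `coerce_move` and `is_legal_move`, the flat board list, and the use of `None` for an empty square are all assumptions made for illustration.

```python
from typing import Any, Optional, Sequence


def coerce_move(raw: Any) -> Optional[int]:
    """Return the move as an int, accepting digit-only strings such as "4".

    Hypothetical helper: returns None for anything that is not an int
    or a string of ASCII digits.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        # Restrict to ASCII digits so int() cannot raise on other numerals.
        if text.isascii() and text.isdigit():
            return int(text)
    return None


def is_legal_move(board: Sequence[Optional[str]], move: Optional[int]) -> bool:
    """True only if the move is on the board and the target square is empty."""
    if move is None or not 0 <= move < len(board):
        return False
    return board[move] is None


# Hypothetical position: squares 0 and 4 are already taken.
board = ["X", None, None, None, "O", None, None, None, None]
response = {"move": "4"}  # string form, as described for hermes3 above
move = coerce_move(response.get("move"))
print(move, is_legal_move(board, move))  # prints: 4 False
```

In this sketch the coercion accepts the move, but the occupancy check still rejects it, which is how a harness could separate the formatting issue from the failure described in the root-cause bullet.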

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 15:31:07 -04:00
Alexander Whitestone
5f70fe5a2b WIP: Claude Code progress on #1092
Automated salvage commit — agent session ended (exit 124).
Work in progress, may need continuation.
2026-03-23 14:30:50 -04:00