[claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) #1159
Fixes #1092
What was done
- Benchmarked qwen2.5:14b (results from WIP commit), hermes3:latest, and hermes3:8b.
- hermes3 models emit "move": "4" (string) instead of "move": 4 (int), causing a false FAIL on scenario 0 (empty board). Added string-digit-to-int coercion, since type fidelity is already tested at Level 0.
- Added SCORECARD.md with cross-model comparison and recommendations.
Scorecard
Level 1 failure analysis
All models fail L1 with an identical pattern: they pick heuristic moves (center=4, corner=0) without checking whether the square is occupied. Scenarios 1 and 2 specifically require blocking or winning at position 1, but the models choose already-occupied squares.
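For reviewers, here is a minimal sketch of the two checks discussed in this PR: coercing a digit-only string move to an int, and rejecting a move that targets an occupied square. It is not the benchmark's actual code. The names `coerce_move` and `is_legal_move`, the 3x3 board indexed 0-8, and the use of `None` for an empty square are all assumptions made for illustration.

```python
import re
from typing import Any, List, Optional

# Assumption: a 3x3 board stored as a flat list of 9 cells,
# where None means empty and "X"/"O" mean occupied.
Board = List[Optional[str]]


def coerce_move(raw: Any) -> Optional[int]:
    """Return the move as an int, accepting digit-only strings like "4".

    Returns None for anything else (e.g. "four", 4.0, True, None).
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str) and re.fullmatch(r"[0-9]+", raw.strip()):
        return int(raw.strip())
    return None


def is_legal_move(board: Board, raw_move: Any) -> bool:
    """True if the (coerced) move is on the board and the square is empty."""
    move = coerce_move(raw_move)
    if move is None or not 0 <= move < len(board):
        return False
    return board[move] is None


# Scenario-0 style check: empty board, model answers with a string digit.
empty_board: Board = [None] * 9
print(is_legal_move(empty_board, "4"))

# Hypothetical mid-game board where the center is already taken.
mid_game: Board = ["X", None, "X", None, "O", None, None, None, "O"]
print(is_legal_move(mid_game, 4))
print(is_legal_move(mid_game, "1"))
```

Keeping coercion separate from the legality check means a string-typed answer no longer masks the real L1 failure mode, which is choosing an occupied square.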
qwen3:14b
Not available on this Ollama instance.
qwen3:30b is available but was not benchmarked (much slower; requires explicit decision).
Recommendation
See SCORECARD.md for options: lower the L1 pass threshold, improve prompting, or re-evaluate the L1 gate requirement against the Bannerlord use case.