[claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) #1159

Rockachopa · 2026-03-23T19:35:10Z

Fixes #1092

What was done

- Ran the 6-level cognitive benchmark against qwen2.5:14b (results from the WIP commit), hermes3:latest, and hermes3:8b.
- Fixed a Level 1 harness bug: hermes3 models emit "move": "4" (string) instead of "move": 4 (int), which caused a false FAIL on scenario 0 (empty board). The harness now coerces digit-only strings to int; this is acceptable because type fidelity is already tested at Level 0.
- Removed stale pre-fix result files; kept only the corrected runs.
- Added SCORECARD.md with a cross-model comparison and recommendations.
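
For illustration, the string-digit-to-int coercion described above could look roughly like the sketch below. This is not the harness code from this PR: the function name `coerce_move` and its return convention (`None` for anything that is not an int or a digit-only string) are assumptions made for the example.

```python
def coerce_move(value):
    """Return the move as an int, or None if it cannot be safely coerced.

    Hypothetical helper: accepts ints and digit-only strings such as "4".
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        stripped = value.strip()
        # Only plain ASCII digits are coerced; "four", "4.0" and "-1" are not.
        if stripped.isascii() and stripped.isdigit():
            return int(stripped)
    return None


print(coerce_move("4"))     # digit string, as emitted by hermes3
print(coerce_move(4))       # already an int
print(coerce_move("four"))  # not coercible
```

Keeping the coercion this narrow means Level 1 still rejects malformed answers, while strict type checking stays the job of Level 0.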

Scorecard

Level	qwen2.5:14b	hermes3:latest	hermes3:8b
L0 JSON Compliance [M1 GATE]	✓ PASS	✓ PASS	✓ PASS
L1 Board Tracking [M1 GATE]	✗ FAIL 50%	✗ FAIL 50%	✗ FAIL 50%
L2–L5 (resource/tactics/trade/campaign)	✓ all	✓ all	✓ all
M1 Gate	✗ FAIL	✗ FAIL	✗ FAIL
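
The M1 Gate row follows from the two levels marked [M1 GATE]: the gate passes only if both L0 and L1 pass. A minimal sketch of that rule, assuming a hypothetical per-model mapping from level name to pass/fail (the names `GATE_LEVELS` and `m1_gate` are not taken from the benchmark code):

```python
# Levels marked [M1 GATE] in the scorecard.
GATE_LEVELS = ("L0", "L1")


def m1_gate(level_passed):
    """Return True only if every gate level passed.

    A gate level missing from the mapping counts as a failure
    rather than a silent pass.
    """
    return all(level_passed.get(level, False) for level in GATE_LEVELS)


# Shape of the result reported for each model in the scorecard.
results = {"L0": True, "L1": False, "L2": True, "L3": True, "L4": True, "L5": True}
print(m1_gate(results))
```

Under this rule, passing L2–L5 does not compensate for the L1 failure, which is why every model fails the gate.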

Level 1 failure analysis

All three models fail L1 with an identical pattern: they pick heuristic moves (center = 4, corner = 0) without checking whether the square is occupied. Scenarios 1 and 2 specifically require blocking or winning at position 1, but the models choose already-occupied squares.
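
The missing check is small. The sketch below shows an occupancy test on a 9-cell board indexed 0–8, consistent with center = 4 and corner = 0 above. The board representation (`None` for an empty square) and the example position are assumptions for illustration, not the actual L1 scenarios.

```python
def is_legal_move(board, move):
    """Return True if move is an in-range index of an empty square."""
    if isinstance(move, bool) or not isinstance(move, int):
        return False
    return 0 <= move < len(board) and board[move] is None


# Hypothetical position: X holds squares 0 and 2, O holds 4 and 8.
# X wins by playing square 1; the heuristic picks 4 and 0 are both occupied.
board = ["X", None, "X", None, "O", None, None, None, "O"]

for candidate in (4, 0, 1):
    print(candidate, is_legal_move(board, candidate))
```

A model that ran this check before answering would discard the center and corner heuristics in such a position and be left with the required move.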

qwen3:14b

Not available on this Ollama instance. qwen3:30b is available but was not benchmarked: it is much slower, so running it requires an explicit decision.

Recommendation

See SCORECARD.md for options: lower L1 pass threshold, improve prompting, or re-evaluate the L1 gate requirement vs. the Bannerlord use case.

Rockachopa added 2 commits 2026-03-23 19:35:10 +00:00

WIP: Claude Code progress on #1092 5f70fe5a2b

Automated salvage commit — agent session ended (exit 124).
Work in progress, may need continuation.

feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion

Tests / lint (pull_request) Failing after 18s

Tests / test (pull_request) Has been skipped

40561b480f

Refs #1092

- Run full 6-level benchmark against hermes3:latest and hermes3:8b
- Fix Level 1 harness: coerce string-digit moves to int (hermes3 emits
  "move": "4" instead of "move": 4; type fidelity is already tested at L0)
- All three models (qwen2.5:14b, hermes3:latest, hermes3:8b) pass L0
  and L2–L5 but fail L1 at 50% score
- Root cause: models pick heuristic moves without checking if the target
  square is already occupied
- M1 gate: FAILED for all models; qwen3:14b not available on instance
- Add SCORECARD.md with full cross-model comparison and recommendations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude referenced this pull request

2026-03-23 19:35:37 +00:00

[Bannerlord M0] Run Cognitive Benchmark on Hermes #1092

Rockachopa merged commit 9e08e87312 into main

2026-03-23 19:38:48 +00:00

Rockachopa referenced this issue from a commit

2026-03-23 19:38:49 +00:00

[claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) (#1159)

Rockachopa deleted branch claude/issue-1092

2026-03-23 19:38:49 +00:00
