[claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) #1159

Merged
Rockachopa merged 2 commits from claude/issue-1092 into main 2026-03-23 19:38:48 +00:00
Owner

Fixes #1092

What was done

  • Ran the 6-level cognitive benchmark against qwen2.5:14b (results from WIP commit), hermes3:latest, and hermes3:8b
  • Fixed a Level 1 harness bug: hermes3 models emit "move": "4" (string) instead of "move": 4 (int), which caused a false FAIL on scenario 0 (empty board). The harness now coerces digit-only strings to int, since type fidelity is already tested at Level 0.
  • Removed stale pre-fix result files; kept only the corrected runs
  • Added SCORECARD.md with cross-model comparison and recommendations
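The coercion described above could look roughly like the following. This is a minimal sketch, not the harness code from this PR: the function name `coerce_move` and the choice to reject booleans and non-digit strings are assumptions made for illustration.

```python
from typing import Any, Optional


def coerce_move(raw: Any) -> Optional[int]:
    """Return the move as an int, or None if it cannot be read as one.

    Hypothetical helper: the real harness may name and structure this differently.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        return raw
    # Accept strings made only of ASCII digits, e.g. "4" -> 4.
    if isinstance(raw, str):
        text = raw.strip()
        if text.isascii() and text.isdigit():
            return int(text)
    return None


print(coerce_move("4"))     # string digit is coerced
print(coerce_move(4))       # int passes through unchanged
print(coerce_move("four"))  # anything else is rejected
```

Keeping the coercion narrow (digit-only strings) means a malformed answer such as "4.0" or "center" still fails at Level 1 instead of being silently accepted.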

Scorecard

| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|:-----------:|:--------------:|:----------:|
| L0 JSON Compliance [M1 GATE] | ✓ PASS | ✓ PASS | ✓ PASS |
| L1 Board Tracking [M1 GATE] | ✗ FAIL 50% | ✗ FAIL 50% | ✗ FAIL 50% |
| L2–L5 (resource/tactics/trade/campaign) | ✓ all | ✓ all | ✓ all |
| **M1 Gate** | ✗ FAIL | ✗ FAIL | ✗ FAIL |
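The scorecard suggests that the M1 gate depends only on the levels tagged [M1 GATE] (L0 and L1), which is why passing L2–L5 does not rescue a model. A sketch of that reading, with the names `GATE_LEVELS` and `m1_gate_passed` being hypothetical and the gate rule itself an assumption inferred from the table:

```python
from typing import Dict

# Assumption: only levels tagged [M1 GATE] in the scorecard count toward the gate.
GATE_LEVELS = ("L0", "L1")


def m1_gate_passed(level_passed: Dict[str, bool]) -> bool:
    """The gate passes only if every gate level passed; missing levels count as failed."""
    return all(level_passed.get(level, False) for level in GATE_LEVELS)


# Pass/fail pattern shared by all three models in the scorecard.
results = {"L0": True, "L1": False, "L2": True, "L3": True, "L4": True, "L5": True}
print(m1_gate_passed(results))
```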

Level 1 failure analysis

All three models fail L1 with an identical pattern: they pick heuristic moves (center=4, corner=0) without checking whether the square is occupied. Scenarios 1 & 2 specifically require blocking/winning at position 1, but the models choose already-occupied squares.
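To make the failure mode concrete, the check the models skip can be sketched as follows. The board encoding (a 9-element row-major list with `None` for empty squares), the function names, and the example position are all assumptions for illustration; they are not taken from the benchmark scenarios.

```python
from typing import List, Optional

# Hypothetical board encoding: 9 cells indexed 0-8 (row-major), None = empty.
Board = List[Optional[str]]


def is_legal_move(board: Board, move: int) -> bool:
    """True only if the index is on the board and the square is empty."""
    return 0 <= move < len(board) and board[move] is None


def classify_move(board: Board, move: int, expected: int) -> str:
    """Separate 'picked an occupied square' from 'legal but not the required move'."""
    if not is_legal_move(board, move):
        return "illegal"
    return "correct" if move == expected else "legal_but_wrong"


# Illustrative position: X holds 0 and 2, O holds 4, so the key square is 1.
board: Board = ["X", None, "X", None, "O", None, None, None, None]
print(classify_move(board, 4, expected=1))  # center heuristic lands on an occupied square
print(classify_move(board, 1, expected=1))  # the required move
```

Reporting "illegal" separately from "legal_but_wrong" would show whether a model is failing at board tracking or at tactics, which may matter when deciding among the options in SCORECARD.md.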

qwen3:14b

qwen3:14b is not available on this Ollama instance. qwen3:30b is available but was not benchmarked: it is much slower, so running it requires an explicit decision.

Recommendation

See SCORECARD.md for options: lower L1 pass threshold, improve prompting, or re-evaluate the L1 gate requirement vs. the Bannerlord use case.

Rockachopa added 2 commits 2026-03-23 19:35:10 +00:00
Automated salvage commit — agent session ended (exit 124).
Work in progress, may need continuation.
feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion
Some checks failed
Tests / lint (pull_request) Failing after 18s
Tests / test (pull_request) Has been skipped
40561b480f
Refs #1092

- Run full 6-level benchmark against hermes3:latest and hermes3:8b
- Fix Level 1 harness: coerce string-digit moves to int (hermes3 emits
  "move": "4" instead of "move": 4; type fidelity is already tested at L0)
- All three models (qwen2.5:14b, hermes3:latest, hermes3:8b) pass L0
  and L2–L5 but fail L1 at 50% score
- Root cause: models pick heuristic moves without checking if the target
  square is already occupied
- M1 gate: FAILED for all models; qwen3:14b not available on instance
- Add SCORECARD.md with full cross-model comparison and recommendations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rockachopa merged commit 9e08e87312 into main 2026-03-23 19:38:48 +00:00
Rockachopa deleted branch claude/issue-1092 2026-03-23 19:38:49 +00:00
Reference: Rockachopa/Timmy-time-dashboard#1159