[claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) #1159

Merged

Rockachopa merged 2 commits from claude/issue-1092 into main

2026-03-23 19:38:48 +00:00

Author	SHA1	Message	Date

Alexander Whitestone

40561b480f

feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion

Tests / lint (pull_request) Failing after 18s

Tests / test (pull_request) Has been skipped

Refs #1092

- Run full 6-level benchmark against hermes3:latest and hermes3:8b
- Fix Level 1 harness: coerce string-digit moves to int (hermes3 emits
  "move": "4" instead of "move": 4; type fidelity is already tested at L0)
- All three models (qwen2.5:14b, hermes3:latest, hermes3:8b) pass L0
  and L2–L5 but fail L1 at 50% score
- Root cause: models pick heuristic moves without checking if the target
  square is already occupied
- M1 gate: FAILED for all models; qwen3:14b not available on instance
- Add SCORECARD.md with full cross-model comparison and recommendations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
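
The two Level 1 points in the commit message above (accepting string-digit moves, and the models choosing squares that are already occupied) can be sketched roughly as follows. This is a minimal illustration, not the harness code from this PR: the function names `coerce_move` and `is_legal_move`, the flat-list board representation, and the use of `None` for an empty square are all assumptions made for the example.

```python
from typing import Any, List, Optional


def coerce_move(raw: Any) -> Optional[int]:
    """Return the move as an int, or None if it cannot be read as one.

    Hypothetical helper: accepts 4 as well as "4", since type fidelity
    is said to be tested separately at L0.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        return raw
    # Accept strings made up only of ASCII digits, e.g. "4" or " 4 ".
    if isinstance(raw, str):
        text = raw.strip()
        if text.isascii() and text.isdigit():
            return int(text)
    return None


def is_legal_move(board: List[Optional[str]], move: Optional[int]) -> bool:
    """A move is legal if it indexes an empty (None) square on the board.

    The board layout is an assumption made for this sketch.
    """
    return move is not None and 0 <= move < len(board) and board[move] is None


# Usage: square 4 is already taken, so a model answering "4" fails the
# occupancy check even though the string now coerces cleanly.
board = ["X", None, None, None, "O", None, None, None, None]
move = coerce_move("4")
print(move, is_legal_move(board, move))
```

Keeping coercion separate from the legality check means a coerced but illegal move still counts as a failure, which matches the reported result that the models fail L1 on occupied squares rather than on output type.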

2026-03-23 15:31:07 -04:00

Alexander Whitestone

5f70fe5a2b

WIP: Claude Code progress on #1092

Automated salvage commit — agent session ended (exit 124).
Work in progress, may need continuation.

2026-03-23 14:30:50 -04:00
