forked from Rockachopa/Timmy-time-dashboard

This repository has been archived on 2026-03-24. You can view files and clone it. You cannot open issues or pull requests or push a commit.

Alexander Whitestone 9e08e87312 [claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) (#1159)

Co-authored-by: Alexander Whitestone <alexpaynex@gmail.com>
Co-committed-by: Alexander Whitestone <alexpaynex@gmail.com>

2026-03-23 19:38:48 +00:00

Bannerlord M0 — Cognitive Benchmark Scorecard

Date: 2026-03-23
Benchmark: 6-level cognitive harness (L0–L5)
M1 Gate: Must pass L0 + L1, latency < 10s per decision

Results Summary

Level	Description	qwen2.5:14b	hermes3:latest	hermes3:8b
L0 [M1 GATE]	JSON Compliance	✓ PASS 100%	✓ PASS 100%	✓ PASS 100%
L1 [M1 GATE]	Board State Tracking	✗ FAIL 50%	✗ FAIL 50%	✗ FAIL 50%
L2	Resource Management	✓ PASS 100%	✓ PASS 100%	✓ PASS 100%
L3	Battle Tactics	✓ PASS 100%	✓ PASS 100%	✓ PASS 100%
L4	Trade Route	✓ PASS 100%	✓ PASS 100%	✓ PASS 100%
L5	Mini Campaign	✓ PASS 100%	✓ PASS 100%	✓ PASS 100%
M1 GATE		✗ FAIL	✗ FAIL	✗ FAIL

Latency (p50 / p99)

Level	qwen2.5:14b	hermes3:latest	hermes3:8b
L0	1443ms / 6348ms	1028ms / 1184ms	570ms / 593ms
L1	943ms / 1184ms	1166ms / 1303ms	767ms / 1313ms
L2	2936ms / 3122ms	2032ms / 2232ms	2408ms / 2832ms
L3	2248ms / 3828ms	1614ms / 3525ms	2174ms / 3437ms
L4	3235ms / 3318ms	2724ms / 3038ms	2507ms / 3420ms
L5	3414ms / 3970ms	3137ms / 3433ms	2571ms / 2763ms

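All models are well under the 10s latency threshold for the gated levels L0–L1, and for every other level as well. The slowest p99 overall is 6348ms (qwen2.5:14b at L0).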

Level 1 Failure Analysis

All three models fail L1 with an identical pattern (2/4 scenarios pass):

Scenario	Expected	All Models
Empty board — opening move	Any empty square	✓ center (4)
Block opponent's winning move	Position 1 (only block)	✗ position 4 (occupied!)
Take winning move	Position 1 (win)	✗ position 0 or 2 (occupied!)
Legal move on partially filled board	Any of 6,7,8	✓ position 6

Root cause: Models choose moves by heuristic (center, corners) without checking whether the chosen square is already occupied. They read the board description but don't cross-reference their move choice against it. This is a genuine spatial state-tracking failure.
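The missing cross-reference step can be sketched as a simple legality check. This is a minimal illustration, not the harness's own code: the board encoding (a 9-element list with `None` for empty cells) and the name `is_legal_move` are assumptions, and the example position is hypothetical, chosen only to resemble the "block opponent's winning move" scenario.

```python
from typing import List, Optional

# Assumed encoding: 9 cells indexed 0-8, None = empty, "X"/"O" = taken.
Board = List[Optional[str]]


def is_legal_move(board: Board, move: int) -> bool:
    """Return True only if `move` is an in-range index of an empty cell."""
    return 0 <= move < len(board) and board[move] is None


# Hypothetical position: O threatens the top row, X already holds the center.
board: Board = ["O", None, "O", None, "X", None, None, None, None]

print(is_legal_move(board, 4))  # False: the center is already occupied
print(is_legal_move(board, 1))  # True: the blocking square is empty
```

A check like this could also be used on the scoring side to separate "illegal move" failures from "legal but suboptimal move" failures.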

Note: hermes3 models emit "move": "4" (string) rather than "move": 4 (int). The benchmark was patched to coerce digit-only strings to int for L1, since type fidelity is already tested at L0.
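A coercion of this kind might look like the sketch below. The actual patch is not shown in this scorecard, so the function name `coerce_move` and its exact rules (ASCII digits only, surrounding whitespace ignored, everything else passed through unchanged) are assumptions.

```python
import json
from typing import Any


def coerce_move(value: Any) -> Any:
    """Turn a digit-only string such as "4" into the int 4.

    Any other value (ints, non-numeric strings, None) is returned unchanged,
    so the normal L1 validation still sees and rejects it.
    """
    if isinstance(value, str):
        stripped = value.strip()
        # isascii() guards against characters like "²" that pass isdigit()
        # but cannot be parsed by int().
        if stripped.isascii() and stripped.isdigit():
            return int(stripped)
    return value


reply = json.loads('{"move": "4"}')
reply["move"] = coerce_move(reply["move"])
print(reply)  # {'move': 4}
```

Leaving non-digit values untouched keeps the coercion narrow: it forgives a quoting difference but does not mask a genuinely malformed answer.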

M1 Gate: FAILED (all models)

No model passes the M1 gate. The blocker is Level 1 — Board State Tracking.
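The gate rule stated at the top of this scorecard (pass L0 and L1, latency under 10s per decision) can be expressed as a small check. This is a sketch under assumptions: the function name and input shapes are hypothetical, and p99 latency is used here as a stand-in for "per decision" latency, which the scorecard does not define precisely.

```python
from typing import Dict

LATENCY_LIMIT_MS = 10_000      # "latency < 10s per decision"
GATE_LEVELS = ("L0", "L1")     # levels that make up the M1 gate


def passes_m1_gate(
    pass_rate: Dict[str, float],
    p99_ms: Dict[str, float],
    threshold: float = 1.0,    # 100% pass rate required on each gated level
) -> bool:
    """Return True if every gated level meets the pass threshold and latency limit."""
    return all(
        pass_rate[level] >= threshold and p99_ms[level] < LATENCY_LIMIT_MS
        for level in GATE_LEVELS
    )


# hermes3:8b figures from the results and latency tables.
pass_rate = {"L0": 1.0, "L1": 0.5}
p99_ms = {"L0": 593, "L1": 1313}

print(passes_m1_gate(pass_rate, p99_ms))  # False: L1 is at 50%
```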

Recommendation

The L1 failure is consistent and structural. All models understand the format and can make reasonable opening moves but fail to avoid already-occupied squares. Options for M1:

1. Lower the L1 pass threshold from 100% to ≥ 75%. The scenarios where models fail require recognizing occupied positions from a sparse JSON array, which is a known weakness. This would allow proceeding to M1 with flagged risk, but only once one more scenario passes: all three models currently score 50% (2/4), which is still below 75%.
2. Prompt engineering: add an explicit line such as "The following squares are taken: X at positions [P1, P2]" to the prompt and re-run L1 to see whether board tracking improves.
3. Re-evaluate the L1 gate requirement. Models pass L2–L5 (resource, tactics, trade, campaign), which are more directly relevant to Bannerlord play. Consider whether L1 is the right gate.
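For the prompt-engineering option, the explicit "squares are taken" line could be generated from the board state along these lines. This is a minimal sketch: the board encoding (a 9-element list with `None` for empty cells), the function name, and the exact wording for multiple marks are assumptions, and whether the extra line actually improves L1 results is untested.

```python
from typing import Dict, List, Optional

# Assumed encoding: 9 cells indexed 0-8, None = empty, "X"/"O" = taken.
Board = List[Optional[str]]


def occupied_squares_line(board: Board) -> str:
    """Build an explicit, human-readable list of taken squares for the prompt."""
    taken: Dict[str, List[int]] = {}
    for index, mark in enumerate(board):
        if mark is not None:
            taken.setdefault(mark, []).append(index)
    if not taken:
        return "No squares are taken."
    parts = [
        f"{mark} at positions {positions}"
        for mark, positions in sorted(taken.items())
    ]
    return f"The following squares are taken: {'; '.join(parts)}."


# Hypothetical position used only for illustration.
example: Board = ["O", None, "O", None, "X", None, None, None, None]
print(occupied_squares_line(example))
# The following squares are taken: O at positions [0, 2]; X at positions [4].
```

Appending this line to the existing L1 prompt leaves the JSON board in place, so a before/after comparison isolates the effect of the explicit listing.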

qwen3:14b

qwen3:14b is not available on this Ollama instance. The only qwen3 model available is qwen3:30b, which was not benchmarked: it is significantly slower and requires an explicit decision to run.

Result Files

Model	File
qwen2.5:14b	`results/qwen2.5_14b_20260323_142119.json`
hermes3:latest	`results/hermes3_latest_20260323_152900.json`
hermes3:8b	`results/hermes3_8b_20260323_153000.json`
