Co-authored-by: Alexander Whitestone <alexpaynex@gmail.com> Co-committed-by: Alexander Whitestone <alexpaynex@gmail.com>
Bannerlord M0 — Cognitive Benchmark Scorecard
Date: 2026-03-23
Benchmark: 6-level cognitive harness (L0–L5)
M1 Gate: Must pass L0 + L1, latency < 10s per decision
Results Summary
| Level | Description | qwen2.5:14b | hermes3:latest | hermes3:8b |
|---|---|---|---|---|
| L0 [M1 GATE] | JSON Compliance | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L1 [M1 GATE] | Board State Tracking | ✗ FAIL 50% | ✗ FAIL 50% | ✗ FAIL 50% |
| L2 | Resource Management | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L3 | Battle Tactics | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L4 | Trade Route | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L5 | Mini Campaign | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| M1 GATE | | ✗ FAIL | ✗ FAIL | ✗ FAIL |
Latency (p50 / p99)
| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|---|---|---|---|
| L0 | 1443ms / 6348ms | 1028ms / 1184ms | 570ms / 593ms |
| L1 | 943ms / 1184ms | 1166ms / 1303ms | 767ms / 1313ms |
| L2 | 2936ms / 3122ms | 2032ms / 2232ms | 2408ms / 2832ms |
| L3 | 2248ms / 3828ms | 1614ms / 3525ms | 2174ms / 3437ms |
| L4 | 3235ms / 3318ms | 2724ms / 3038ms | 2507ms / 3420ms |
| L5 | 3414ms / 3970ms | 3137ms / 3433ms | 2571ms / 2763ms |
All models stay under the 10s latency threshold at every level, including the gated L0–L1. The highest p99 observed is 6348ms (qwen2.5:14b at L0).
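As a rough illustration of how p50 / p99 figures like those above can be derived from per-decision timings, here is a minimal sketch. It assumes the nearest-rank percentile method and assumes the 10s budget is checked against p99; the scorecard does not say which method or statistic the harness actually uses, and the sample values are hypothetical.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Nearest-rank: the smallest sample with at least pct% of values at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def within_budget(samples_ms, budget_ms=10_000):
    """Assumption: the gate compares p99 against the 10s budget."""
    return percentile(samples_ms, 99) < budget_ms


# Hypothetical per-decision timings for one level of one model.
samples = [900, 950, 1000, 1100, 6300]
print(percentile(samples, 50), percentile(samples, 99), within_budget(samples))
```

With only a handful of scenarios per level, p99 under this method is effectively the slowest decision, so a single slow response dominates the reported tail.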
Level 1 Failure Analysis
All three models fail L1 with an identical pattern (2 of 4 scenarios pass, 50%):
| Scenario | Expected | All Models |
|---|---|---|
| Empty board — opening move | Any empty square | ✓ center (4) |
| Block opponent's winning move | Position 1 (only block) | ✗ position 4 (occupied!) |
| Take winning move | Position 1 (win) | ✗ position 0 or 2 (occupied!) |
| Legal move on partially filled board | Any of 6,7,8 | ✓ position 6 |
Root cause: Models choose moves by heuristic (center, corners) without checking whether the chosen square is already occupied. They read the board description but don't cross-reference their move choice against it. This is a genuine spatial state-tracking failure.
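The cross-reference step the models skip is small when written out. The sketch below assumes a 9-element board list indexed 0 to 8, with "X", "O", or None in each cell. That representation is a guess for illustration only; the harness's real board encoding is not shown in this scorecard.

```python
def empty_squares(board):
    """Indices of cells that are still free."""
    return [i for i, cell in enumerate(board) if cell is None]


def is_legal(board, move):
    """A move is legal only if it is an in-range int pointing at a free cell."""
    return (
        isinstance(move, int)
        and not isinstance(move, bool)  # bool is a subclass of int; reject it
        and 0 <= move < len(board)
        and board[move] is None
    )


# Hypothetical partially filled board: squares 0-5 taken, 6-8 free.
board = ["X", "O", "X", "O", "X", "O", None, None, None]
print(empty_squares(board))   # [6, 7, 8]
print(is_legal(board, 4))     # False: the center is already occupied
print(is_legal(board, 6))     # True
```

A harness-side check like this can distinguish an illegal move (occupied square) from a legal but suboptimal one, which may be useful when reporting L1 failures.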
Note: hermes3 models emit "move": "4" (string) instead of "move": 4 (int). The benchmark was patched to coerce string digits to int for L1, since type fidelity is already tested at L0.
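The coercion described in the note could look roughly like the following. This is a sketch, not the actual patch: the function name `coerce_move` is hypothetical, and returning None for anything that is neither an int nor a string of ASCII digits is an assumption.

```python
def coerce_move(value):
    """Return the move as an int, accepting ints and digit-only strings."""
    if isinstance(value, bool):
        return None  # JSON true/false must not be read as 1/0
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        text = value.strip()
        # isascii() guards against Unicode digits that int() cannot parse.
        if text.isascii() and text.isdigit():
            return int(text)
    return None


print(coerce_move("4"))     # 4
print(coerce_move(4))       # 4
print(coerce_move("four"))  # None
```

Keeping the coercion strict (digits only, no floats, no signs) means L1 still rejects malformed answers while tolerating the string-versus-int difference.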
M1 Gate: FAILED (all models)
No model passes the M1 gate. The blocker is Level 1 — Board State Tracking.
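The gate decision can be reproduced from the numbers in this scorecard. The sketch below uses the L0 and L1 pass rates and p99 latencies from the tables above; the `LevelResult` type, the function name, and the choice to compare p99 against the 10s budget are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelResult:
    pass_rate: float  # fraction of scenarios passed, 0.0 to 1.0
    p99_ms: int       # p99 latency in milliseconds


def passes_m1_gate(levels, threshold=1.0, budget_ms=10_000):
    """Assumed rule: L0 and L1 must both meet the pass threshold and the latency budget."""
    return all(
        levels[name].pass_rate >= threshold and levels[name].p99_ms < budget_ms
        for name in ("L0", "L1")
    )


# Values copied from the Results Summary and Latency tables.
SCORECARD = {
    "qwen2.5:14b": {"L0": LevelResult(1.0, 6348), "L1": LevelResult(0.5, 1184)},
    "hermes3:latest": {"L0": LevelResult(1.0, 1184), "L1": LevelResult(0.5, 1303)},
    "hermes3:8b": {"L0": LevelResult(1.0, 593), "L1": LevelResult(0.5, 1313)},
}

for model, levels in SCORECARD.items():
    print(model, "PASS" if passes_m1_gate(levels) else "FAIL")
```

Every model prints FAIL here because the L1 pass rate of 0.5 is below the threshold, while latency is within budget in all cases.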
Recommendation
The L1 failure is consistent and structural. All models understand the format and can make reasonable opening moves but fail to avoid already-occupied squares. Options for M1:
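- Lower the L1 pass threshold from 100% to ≥ 75%: the scenarios where models fail require recognizing occupied positions from a sparse JSON array, which is a known weakness. Note that all three models currently score 50% (2/4), so each would still need to pass one more scenario to clear a 75% bar; at that point M1 could proceed with flagged risk.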
- Prompt engineering — add explicit "The following squares are taken: X at positions [P1, P2]" to the prompt to see if board tracking improves.
- Re-evaluate L1 gate requirement — models pass L2–L5 (resource, tactics, trade, campaign) which are more directly relevant to Bannerlord play. Consider whether L1 is the right gate.
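For the prompt engineering option, the explicit occupancy sentence could be generated from the board state along these lines. This is a sketch under assumptions: the board is taken to be a 9-element list of "X", "O", or None, and `describe_occupied` is a hypothetical helper, not part of the existing harness.

```python
def describe_occupied(board):
    """Build the 'squares are taken' sentence suggested in the recommendation."""
    parts = []
    for mark in ("X", "O"):
        positions = [i for i, cell in enumerate(board) if cell == mark]
        if positions:
            parts.append(f"{mark} at positions {positions}")
    taken = "; ".join(parts) if parts else "none"
    return f"The following squares are taken: {taken}."


# Hypothetical board: X on 0 and 4, O on 2, everything else empty.
board = ["X", None, "O", None, "X", None, None, None, None]
print(describe_occupied(board))
# The following squares are taken: X at positions [0, 4]; O at positions [2].
```

Appending this sentence to the existing L1 prompt would make the occupied squares explicit, so a rerun can show whether the failures come from parsing the board or from ignoring it.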
qwen3:14b
Model not available on this Ollama instance. Available qwen3 model: qwen3:30b.
qwen3:30b was not benchmarked (significantly slower; requires explicit decision to run).
Result Files
| Model | File |
|---|---|
| qwen2.5:14b | results/qwen2.5_14b_20260323_142119.json |
| hermes3:latest | results/hermes3_latest_20260323_152900.json |
| hermes3:8b | results/hermes3_8b_20260323_153000.json |