# Bannerlord M0: Cognitive Benchmark Scorecard

**Date:** 2026-03-23

**Benchmark:** 6-level cognitive harness (L0–L5)

**M1 Gate:** Must pass L0 + L1, latency < 10s per decision

---

## Results Summary

| Level | Description | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|:-----------:|:--------------:|:----------:|
| **L0 [M1 GATE]** | JSON Compliance | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **L1 [M1 GATE]** | Board State Tracking | ✗ FAIL 50% | ✗ FAIL 50% | ✗ FAIL 50% |
| L2 | Resource Management | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L3 | Battle Tactics | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L4 | Trade Route | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L5 | Mini Campaign | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **M1 GATE** | | ✗ **FAIL** | ✗ **FAIL** | ✗ **FAIL** |

---

## Latency (p50 / p99)

| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|----------------|------------|
| L0 | 1443ms / 6348ms | 1028ms / 1184ms | 570ms / 593ms |
| L1 | 943ms / 1184ms | 1166ms / 1303ms | 767ms / 1313ms |
| L2 | 2936ms / 3122ms | 2032ms / 2232ms | 2408ms / 2832ms |
| L3 | 2248ms / 3828ms | 1614ms / 3525ms | 2174ms / 3437ms |
| L4 | 3235ms / 3318ms | 2724ms / 3038ms | 2507ms / 3420ms |
| L5 | 3414ms / 3970ms | 3137ms / 3433ms | 2571ms / 2763ms |

All models are **under the 10s latency threshold** at every level, including the gated L0–L1. The slowest p99 is 6348ms (qwen2.5:14b, L0).

---

## Level 1 Failure Analysis

All three models fail L1 with an **identical pattern** (2/4 scenarios pass):

| Scenario | Expected | All Models |
|----------|----------|-----------|
| Empty board, opening move | Any empty square | ✓ center (4) |
| Block opponent's winning move | Position 1 (only block) | ✗ position 4 (occupied!) |
| Take winning move | Position 1 (win) | ✗ position 0 or 2 (occupied!) |
| Legal move on partially filled board | Any of 6,7,8 | ✓ position 6 |

**Root cause:** Models choose moves by heuristic (center, corners) without checking whether the chosen square is already occupied.
They read the board description but don't cross-reference their move choice against it. This is a genuine spatial state-tracking failure.

**Note:** Both `hermes3` models emit `"move": "4"` (string) instead of `"move": 4` (int). The benchmark was patched to coerce string digits to int for L1, since type fidelity is already tested at L0.

---

## M1 Gate: FAILED (all models)

No model passes the M1 gate. The blocker is **Level 1 (Board State Tracking)**.

### Recommendation

The L1 failure is consistent and structural. All models understand the format and can make reasonable *opening* moves but fail to avoid already-occupied squares.

Options for M1:

1. **Lower the L1 pass threshold** from 100% to ≥ 75%. The scenarios where models fail require recognizing occupied positions from a sparse JSON array, which is a known weakness. This would allow proceeding to M1 with flagged risk, but only if scores improve: at the current 50% (2/4), all three models would still fail a 75% threshold.
2. **Prompt engineering.** Add an explicit "The following squares are taken: X at positions [P1, P2]" line to the prompt to see if board tracking improves.
3. **Re-evaluate the L1 gate requirement.** Models pass L2–L5 (resource, tactics, trade, campaign), which are more directly relevant to Bannerlord play. Consider whether L1 is the right gate.

---

## qwen3:14b

Model **not available** on this Ollama instance. Available qwen3 model: `qwen3:30b`.

`qwen3:30b` was not benchmarked (significantly slower; requires explicit decision to run).

---

## Result Files

| Model | File |
|-------|------|
| qwen2.5:14b | `results/qwen2.5_14b_20260323_142119.json` |
| hermes3:latest | `results/hermes3_latest_20260323_152900.json` |
| hermes3:8b | `results/hermes3_8b_20260323_153000.json` |
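
The L1 failure mode and the prompt change proposed in option 2 can be illustrated with a short sketch. It is not the harness code: the board layout (a 9-element list holding `"X"`, `"O"`, or `None`), the function names, and the example board are assumptions made for illustration, since the actual prompt and board format are not shown in this scorecard.

```python
import json
from typing import Dict, List, Optional

# Assumed layout: 9 cells indexed 0-8, each "X", "O", or None when empty.
Board = List[Optional[str]]


def coerce_move(raw: object) -> Optional[int]:
    """Return the move as an int, accepting digit strings such as "4"."""
    if isinstance(raw, bool):  # bool is a subclass of int, so reject it first
        return None
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        if text.isascii() and text.isdigit():
            return int(text)
    return None


def occupied_squares(board: Board) -> Dict[str, List[int]]:
    """Map each mark to the positions it occupies, in board order."""
    taken: Dict[str, List[int]] = {}
    for pos, mark in enumerate(board):
        if mark is not None:
            taken.setdefault(mark, []).append(pos)
    return taken


def occupancy_hint(board: Board) -> str:
    """Build an explicit 'squares are taken' line in the style of option 2."""
    taken = occupied_squares(board)
    if not taken:
        return "No squares are taken."
    parts = [f"{mark} at positions {taken[mark]}" for mark in sorted(taken)]
    return "The following squares are taken: " + "; ".join(parts) + "."


def is_legal_move(board: Board, response_text: str) -> bool:
    """True if the JSON response names an in-range square that is empty."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    move = coerce_move(payload.get("move"))
    return move is not None and 0 <= move < len(board) and board[move] is None


# Hypothetical board consistent with the "block" scenario: O threatens 0-1-2.
board: Board = ["O", None, "O", None, "X", None, None, None, None]
hint = occupancy_hint(board)
occupied_reply_is_legal = is_legal_move(board, '{"move": "4"}')  # square 4 holds X
blocking_reply_is_legal = is_legal_move(board, '{"move": 1}')    # square 1 is empty
```

On this example board, a reply of `"move": "4"` is coerced to the integer 4 and then rejected because that square is occupied, while `"move": 1` is accepted. The same check could also be used to score option 2 by comparing legality rates with and without the hint line in the prompt.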