timmy-benchmark/results/SCORECARD.md

# Bannerlord M0 — Cognitive Benchmark Scorecard

**Date:** 2026-03-23
**Benchmark:** 6-level cognitive harness (L0–L5)
**M1 Gate:** Must pass L0 + L1, latency < 10s per decision

---

## Results Summary

| Level | Description | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|:-----------:|:--------------:|:----------:|
| **L0 [M1 GATE]** | JSON Compliance | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **L1 [M1 GATE]** | Board State Tracking | ✗ FAIL 50% | ✗ FAIL 50% | ✗ FAIL 50% |
| L2 | Resource Management | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L3 | Battle Tactics | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L4 | Trade Route | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L5 | Mini Campaign | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **M1 GATE** | | ✗ **FAIL** | ✗ **FAIL** | ✗ **FAIL** |

---

## Latency (p50 / p99)

| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|----------------|------------|
| L0 | 1443ms / 6348ms | 1028ms / 1184ms | 570ms / 593ms |
| L1 | 943ms / 1184ms | 1166ms / 1303ms | 767ms / 1313ms |
| L2 | 2936ms / 3122ms | 2032ms / 2232ms | 2408ms / 2832ms |
| L3 | 2248ms / 3828ms | 1614ms / 3525ms | 2174ms / 3437ms |
| L4 | 3235ms / 3318ms | 2724ms / 3038ms | 2507ms / 3420ms |
| L5 | 3414ms / 3970ms | 3137ms / 3433ms | 2571ms / 2763ms |

All models are **well under the 10s latency threshold** at every level, including the gate levels L0–L1. The worst p99 observed is 6348ms (qwen2.5:14b at L0).
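The gate rule stated at the top of this scorecard (pass L0 and L1, latency under 10s per decision) can be sketched as a small check. This is an illustrative sketch, not the harness's actual code: the `LevelResult` type, the `m1_gate` function, and the use of p99 as the per-decision latency figure are all assumptions made for the example.

```python
from dataclasses import dataclass

LATENCY_LIMIT_MS = 10_000    # "latency < 10s per decision"
GATE_LEVELS = ("L0", "L1")   # levels that make up the M1 gate


@dataclass
class LevelResult:
    """Hypothetical per-level summary; field names are assumptions."""
    pass_rate: float  # fraction of scenarios passed, 0.0 to 1.0
    p99_ms: float     # p99 latency, used here as a worst-case proxy


def m1_gate(results: dict[str, LevelResult]) -> bool:
    """True only if every gate level passes fully and stays under the latency limit."""
    return all(
        results[level].pass_rate >= 1.0
        and results[level].p99_ms < LATENCY_LIMIT_MS
        for level in GATE_LEVELS
    )


# Figures taken from the tables above for qwen2.5:14b.
qwen = {"L0": LevelResult(1.0, 6348), "L1": LevelResult(0.5, 1184)}
print(m1_gate(qwen))  # False: L1 passes only 50% of scenarios
```

Under this reading, latency is never the blocker for any of the three models; the gate fails on the L1 pass rate alone.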

---

## Level 1 Failure Analysis

All three models fail L1 with an **identical pattern** (2/4 scenarios pass):

| Scenario | Expected | All Models |
|----------|----------|-----------|
| Empty board — opening move | Any empty square | ✓ center (4) |
| Block opponent's winning move | Position 1 (only block) | ✗ position 4 (occupied!) |
| Take winning move | Position 1 (win) | ✗ position 0 or 2 (occupied!) |
| Legal move on partially filled board | Any of 6, 7, 8 | ✓ position 6 |

**Root cause:** Models choose moves by heuristic (center, corners) without checking whether the chosen square is already occupied. They read the board description but don't cross-reference their move choice against it. This is a genuine spatial state-tracking failure.
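The missing cross-reference step is simple to state in code. The sketch below assumes a board encoded as a 9-element list with `None` for empty squares; that encoding, the example board, and the helper names `is_legal_move` and `legal_moves` are assumptions for illustration and may not match the harness's actual format.

```python
from typing import Optional

# Assumed encoding: 9 cells, index 0-8, None means empty.
Board = list[Optional[str]]


def is_legal_move(board: Board, move: int) -> bool:
    """A move is legal only if it is on the board and the square is empty."""
    return 0 <= move < len(board) and board[move] is None


def legal_moves(board: Board) -> list[int]:
    """All empty squares, in index order."""
    return [i for i, cell in enumerate(board) if cell is None]


# Illustrative board (not one of the benchmark scenarios):
# X holds 0 and 2, O holds the center.
board: Board = ["X", None, "X", None, "O", None, None, None, None]

print(is_legal_move(board, 4))  # False: the center is occupied
print(is_legal_move(board, 1))  # True
print(legal_moves(board))       # [1, 3, 5, 6, 7, 8]
```

A check like this could also be used on the harness side to distinguish "illegal move" failures from "legal but suboptimal move" failures when scoring L1.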

**Note:** `hermes3` models emit `"move": "4"` (string) instead of `"move": 4` (int). The benchmark was patched to coerce digit strings to int for L1, since type fidelity is already tested at L0.
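A coercion of this kind might look like the following. This is a minimal sketch of the idea, not the actual patch: the function name `coerce_move` and its handling of edge cases (booleans, non-digit strings) are assumptions.

```python
import json
from typing import Any, Optional


def coerce_move(raw: Any) -> Optional[int]:
    """Return the move as an int if raw is an int or an ASCII digit string, else None."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        return None
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        if text.isascii() and text.isdigit():
            return int(text)
    return None


reply = json.loads('{"move": "4"}')  # the shape hermes3 emits
print(coerce_move(reply["move"]))    # 4
print(coerce_move("center"))         # None
```

Returning `None` for anything that is not clearly a square index keeps malformed replies distinguishable from valid moves in the scoring step.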

---

## M1 Gate: FAILED (all models)

No model passes the M1 gate. The blocker is **Level 1 — Board State Tracking**.

### Recommendation

The L1 failure is consistent and structural. All models understand the format and can make reasonable *opening* moves but fail to avoid already-occupied squares. Options for M1:

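1. **Lower the L1 pass threshold** from 100% to ≥ 75%. The failing scenarios require recognizing occupied positions from a sparse JSON array, which is a known weakness. Note that all three models currently score 50% on L1, so a 75% threshold would not clear the gate by itself; at least one more scenario would have to pass before proceeding to M1 with flagged risk.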
2. **Prompt engineering** — add explicit "The following squares are taken: X at positions [P1, P2]" to the prompt to see if board tracking improves.
3. **Re-evaluate L1 gate requirement** — models pass L2–L5 (resource, tactics, trade, campaign) which are more directly relevant to Bannerlord play. Consider whether L1 is the right gate.
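Option 2 could be prototyped by generating the "squares are taken" sentence from the board and adding it to the prompt. The sketch below assumes a 9-element board list with `None` for empty squares; the function name `describe_occupied` and the exact wording beyond the phrase quoted in option 2 are hypothetical.

```python
from typing import Optional

# Assumed encoding: 9 cells, index 0-8, None means empty.
Board = list[Optional[str]]


def describe_occupied(board: Board) -> str:
    """Build an explicit sentence listing which squares each player holds."""
    parts = []
    for mark in ("X", "O"):
        positions = [i for i, cell in enumerate(board) if cell == mark]
        if positions:
            parts.append(f"{mark} at positions {positions}")
    if not parts:
        return "No squares are taken."
    return "The following squares are taken: " + "; ".join(parts) + "."


# Illustrative board (not one of the benchmark scenarios).
board: Board = ["X", None, "X", None, "O", None, None, None, None]
print(describe_occupied(board))
# The following squares are taken: X at positions [0, 2]; O at positions [4].
```

Re-running the four L1 scenarios with and without this sentence would show whether the failure is in reading the board or in using it.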

---

## qwen3:14b

`qwen3:14b` is **not available** on this Ollama instance. The available qwen3 model is `qwen3:30b`.
`qwen3:30b` was not benchmarked (it is significantly slower and requires an explicit decision to run).

---

## Result Files

| Model | File |
|-------|------|
| qwen2.5:14b | `results/qwen2.5_14b_20260323_142119.json` |
| hermes3:latest | `results/hermes3_latest_20260323_152900.json` |
| hermes3:8b | `results/hermes3_8b_20260323_153000.json` |