This repository has been archived on 2026-03-24. You can view files and clone it. You cannot open issues or pull requests or push a commit.
Timmy-time-dashboard/timmy-benchmark/results/SCORECARD.md
Alexander Whitestone 9e08e87312 [claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) (#1159)
Co-authored-by: Alexander Whitestone <alexpaynex@gmail.com>
Co-committed-by: Alexander Whitestone <alexpaynex@gmail.com>
2026-03-23 19:38:48 +00:00

Bannerlord M0 — Cognitive Benchmark Scorecard

Date: 2026-03-23

Benchmark: 6-level cognitive harness (L0–L5)

M1 Gate: Must pass L0 + L1, latency < 10s per decision


Results Summary

| Level | Description | qwen2.5:14b | hermes3:latest | hermes3:8b |
|---|---|---|---|---|
| L0 [M1 GATE] | JSON Compliance | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L1 [M1 GATE] | Board State Tracking | ✗ FAIL 50% | ✗ FAIL 50% | ✗ FAIL 50% |
| L2 | Resource Management | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L3 | Battle Tactics | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L4 | Trade Route | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L5 | Mini Campaign | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| M1 GATE | | FAIL | FAIL | FAIL |

Latency (p50 / p99)

| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|---|---|---|---|
| L0 | 1443ms / 6348ms | 1028ms / 1184ms | 570ms / 593ms |
| L1 | 943ms / 1184ms | 1166ms / 1303ms | 767ms / 1313ms |
| L2 | 2936ms / 3122ms | 2032ms / 2232ms | 2408ms / 2832ms |
| L3 | 2248ms / 3828ms | 1614ms / 3525ms | 2174ms / 3437ms |
| L4 | 3235ms / 3318ms | 2724ms / 3038ms | 2507ms / 3420ms |
| L5 | 3414ms / 3970ms | 3137ms / 3433ms | 2571ms / 2763ms |

All models are well under the 10s latency threshold for the gate levels L0–L1; the highest p99 there is 6348ms (qwen2.5:14b at L0).
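The gate rule stated at the top of the scorecard (pass L0 and L1, latency under 10s per decision) can be sketched as a small check. This is an illustration only, not the harness's actual code: `LevelResult` and `m1_gate` are hypothetical names, and using p99 as the stand-in for "per decision" latency is an assumption.

```python
from dataclasses import dataclass

LATENCY_LIMIT_MS = 10_000   # "latency < 10s per decision" from the scorecard header
GATE_LEVELS = ("L0", "L1")  # the levels marked [M1 GATE]


@dataclass
class LevelResult:
    """Hypothetical container for one level's outcome."""
    passed: bool
    p99_ms: float  # assumption: p99 is used as the per-decision latency bound


def m1_gate(results: dict[str, LevelResult]) -> bool:
    """Return True only if every gate level passed and stayed under the limit."""
    return all(
        results[level].passed and results[level].p99_ms < LATENCY_LIMIT_MS
        for level in GATE_LEVELS
    )


# Figures for qwen2.5:14b copied from the two tables in this scorecard.
qwen = {
    "L0": LevelResult(passed=True, p99_ms=6348),
    "L1": LevelResult(passed=False, p99_ms=1184),
}
print(m1_gate(qwen))  # False: latency is fine, but L1 did not pass
```

Under this reading, latency is not what blocks the gate for any model; the L1 pass/fail result is.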


Level 1 Failure Analysis

All three models fail L1 with an identical pattern (2/4 scenarios pass):

| Scenario | Expected | All Models |
|---|---|---|
| Empty board (opening move) | Any empty square | ✓ center (4) |
| Block opponent's winning move | Position 1 (only block) | ✗ position 4 (occupied!) |
| Take winning move | Position 1 (win) | ✗ position 0 or 2 (occupied!) |
| Legal move on partially filled board | Any of 6,7,8 | ✓ position 6 |

Root cause: Models choose moves by heuristic (center, corners) without checking whether the chosen square is already occupied. They read the board description but don't cross-reference their move choice against it. This is a genuine spatial state-tracking failure.
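The missing cross-reference step is small when written out. The sketch below assumes a 9-cell board indexed 0 to 8 with `None` for empty squares; that layout, the helper names, and the sample board are illustrative assumptions, not the benchmark's actual schema or scenarios.

```python
from typing import Optional

# Assumed layout: 9 cells, index 0-8; None marks an empty square.
Board = list[Optional[str]]


def legal_moves(board: Board) -> list[int]:
    """Indices of all empty squares."""
    return [i for i, cell in enumerate(board) if cell is None]


def is_legal(board: Board, move: int) -> bool:
    """A move is legal only if it is on the board and the square is empty."""
    return 0 <= move < len(board) and board[move] is None


# Illustrative board (not a benchmark scenario): center already taken.
board: Board = ["X", None, "O", None, "X", None, None, None, None]

print(is_legal(board, 4))   # False: the center heuristic picks an occupied square
print(is_legal(board, 1))   # True
print(legal_moves(board))   # [1, 3, 5, 6, 7, 8]
```

A check like this is what the models skip: they pick a heuristically good square without testing it against the board state.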

Note: hermes3 models emit "move": "4" (string) rather than "move": 4 (int). The benchmark was patched to coerce digit strings to int for L1, since type fidelity is already tested at L0.
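A coercion of this kind might look like the following. It is a minimal sketch, not the actual patch: `coerce_move` is a hypothetical helper, and leaving non-digit values unchanged (so later validation can reject them) is an assumption about the desired behavior.

```python
import json


def coerce_move(value: object) -> object:
    """Convert a string of ASCII digits to int; return anything else unchanged."""
    if isinstance(value, str) and value.isascii() and value.isdigit():
        return int(value)
    return value


# A hermes3-style reply with the move emitted as a string.
reply = json.loads('{"move": "4"}')
reply["move"] = coerce_move(reply["move"])
print(reply)  # {'move': 4}
```

Values such as `"center"` or `""` pass through untouched, so a malformed move still fails the later checks instead of raising inside the coercion step.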


M1 Gate: FAILED (all models)

No model passes the M1 gate. The blocker is Level 1 — Board State Tracking.

Recommendation

The L1 failure is consistent and structural. All models understand the format and can make reasonable opening moves but fail to avoid already-occupied squares. Options for M1:

  1. Lower the L1 pass threshold from 100% to ≥ 75%. The scenarios where models fail require recognizing occupied positions from a sparse JSON array, which is a known weakness. This would allow proceeding to M1 with flagged risk, but only if scores improve: all three models currently score 50% (2/4), which is still below a 75% threshold.
  2. Prompt engineering: add an explicit line such as "The following squares are taken: X at positions [P1, P2]" to the prompt and re-run L1 to see whether board tracking improves.
  3. Re-evaluate the L1 gate requirement. Models pass L2–L5 (resource, tactics, trade, campaign), which are more directly relevant to Bannerlord play. Consider whether L1 is the right gate.
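For option 2, the explicit occupancy line could be generated from the board rather than written by hand. The sketch below assumes the same 9-cell layout with `None` for empty squares; `occupied_summary` and the exact wording beyond the quoted phrase are hypothetical.

```python
from typing import Optional


def occupied_summary(board: list[Optional[str]]) -> str:
    """Build a one-line, explicit list of taken squares for the prompt."""
    parts = []
    for mark in ("X", "O"):
        positions = [i for i, cell in enumerate(board) if cell == mark]
        if positions:
            parts.append(f"{mark} at positions {positions}")
    if not parts:
        return "No squares are taken."
    return "The following squares are taken: " + "; ".join(parts) + "."


# Illustrative board (not a benchmark scenario).
board = ["X", None, "O", None, "X", None, None, None, None]
print(occupied_summary(board))
# The following squares are taken: X at positions [0, 4]; O at positions [2].
```

Appending this line to the L1 prompt removes the need for the model to infer occupancy from the raw array, which makes it a cheap way to test whether the failure is about reading the board or about using it.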

qwen3:14b

qwen3:14b is not available on this Ollama instance. The available qwen3 model is qwen3:30b, which was not benchmarked (significantly slower; requires an explicit decision to run).


Result Files

| Model | File |
|---|---|
| qwen2.5:14b | results/qwen2.5_14b_20260323_142119.json |
| hermes3:latest | results/hermes3_latest_20260323_152900.json |
| hermes3:8b | results/hermes3_8b_20260323_153000.json |