Co-authored-by: Alexander Whitestone <alexpaynex@gmail.com> Co-committed-by: Alexander Whitestone <alexpaynex@gmail.com>
# Bannerlord M0 — Cognitive Benchmark Scorecard

**Date:** 2026-03-23
**Benchmark:** 6-level cognitive harness (L0–L5)
**M1 Gate:** Must pass L0 + L1, latency < 10s per decision

---

## Results Summary

| Level | Description | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|:-----------:|:--------------:|:----------:|
| **L0 [M1 GATE]** | JSON Compliance | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **L1 [M1 GATE]** | Board State Tracking | ✗ FAIL 50% | ✗ FAIL 50% | ✗ FAIL 50% |
| L2 | Resource Management | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L3 | Battle Tactics | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L4 | Trade Route | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L5 | Mini Campaign | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **M1 GATE** | | ✗ **FAIL** | ✗ **FAIL** | ✗ **FAIL** |

---

## Latency (p50 / p99)

| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|----------------|------------|
| L0 | 1443ms / 6348ms | 1028ms / 1184ms | 570ms / 593ms |
| L1 | 943ms / 1184ms | 1166ms / 1303ms | 767ms / 1313ms |
| L2 | 2936ms / 3122ms | 2032ms / 2232ms | 2408ms / 2832ms |
| L3 | 2248ms / 3828ms | 1614ms / 3525ms | 2174ms / 3437ms |
| L4 | 3235ms / 3318ms | 2724ms / 3038ms | 2507ms / 3420ms |
| L5 | 3414ms / 3970ms | 3137ms / 3433ms | 2571ms / 2763ms |

All models are **well under the 10s latency threshold** for the gated levels L0–L1, and for every other level as well. The slowest p99 in the table is 6348ms (qwen2.5:14b at L0).
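The gate rule can be written down in a few lines. The sketch below is an illustration, not the harness's actual code: it applies the stated M1 rule to the gated-level pass rates and p99 latencies reported in this scorecard. The names `m1_gate`, `PASS_RATE`, and `P99_MS` are hypothetical, and two choices are assumptions: a gated level counts as passed only at 100%, and p99 stands in for "latency per decision".

```python
# Values copied from the Results Summary and Latency tables (gated levels only).
PASS_RATE = {
    "qwen2.5:14b": {"L0": 1.0, "L1": 0.5},
    "hermes3:latest": {"L0": 1.0, "L1": 0.5},
    "hermes3:8b": {"L0": 1.0, "L1": 0.5},
}
P99_MS = {
    "qwen2.5:14b": {"L0": 6348, "L1": 1184},
    "hermes3:latest": {"L0": 1184, "L1": 1303},
    "hermes3:8b": {"L0": 593, "L1": 1313},
}

GATE_LEVELS = ("L0", "L1")
LATENCY_LIMIT_MS = 10_000  # "latency < 10s per decision"
PASS_THRESHOLD = 1.0       # assumption: a gated level passes only at 100%


def m1_gate(model: str) -> dict:
    """Report which gate conditions a model meets (hypothetical helper)."""
    accuracy_ok = all(PASS_RATE[model][lvl] >= PASS_THRESHOLD for lvl in GATE_LEVELS)
    latency_ok = all(P99_MS[model][lvl] < LATENCY_LIMIT_MS for lvl in GATE_LEVELS)
    return {
        "accuracy_ok": accuracy_ok,
        "latency_ok": latency_ok,
        "gate_passed": accuracy_ok and latency_ok,
    }


for name in PASS_RATE:
    # Every model meets the latency condition but misses the accuracy condition at L1.
    print(name, m1_gate(name))
```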

---

## Level 1 Failure Analysis

All three models fail L1 with an **identical pattern** (2/4 scenarios pass):

| Scenario | Expected | All Models |
|----------|----------|-----------|
| Empty board: opening move | Any empty square | ✓ center (4) |
| Block opponent's winning move | Position 1 (only block) | ✗ position 4 (occupied!) |
| Take winning move | Position 1 (win) | ✗ position 0 or 2 (occupied!) |
| Legal move on partially filled board | Any of 6,7,8 | ✓ position 6 |

**Root cause:** Models choose moves by heuristic (center, corners) without checking whether the chosen square is already occupied. They read the board description but don't cross-reference their move choice against it. This is a genuine spatial state-tracking failure.
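The check the models skip is simple to state in code. The sketch below assumes the board is a 9-element list indexed 0–8 with `None` for empty squares. That representation, the example position, and the names `is_legal_move` and `empty_squares` are illustrative assumptions, not taken from the benchmark code.

```python
from typing import List, Optional

Board = List[Optional[str]]  # assumed layout: 9 cells, None means empty


def is_legal_move(board: Board, move: int) -> bool:
    """A move is legal only if it is on the board and the square is empty."""
    return 0 <= move < len(board) and board[move] is None


def empty_squares(board: Board) -> List[int]:
    """List the indices a model is actually allowed to choose."""
    return [i for i, cell in enumerate(board) if cell is None]


# Hypothetical "block the opponent" position: O threatens the top row,
# X already holds the center, so position 1 is the only block.
board: Board = ["O", None, "O",
                None, "X", None,
                None, None, None]

print(is_legal_move(board, 4))  # False: the center heuristic lands on an occupied square
print(is_legal_move(board, 1))  # True: the blocking move is on an empty square
print(empty_squares(board))     # [1, 3, 5, 6, 7, 8]
```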

**Note:** `hermes3` models emit `"move": "4"` (string) rather than `"move": 4` (int). The benchmark was patched to coerce string digits to int for L1, since type fidelity is already tested at L0.
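The coercion described in the note might look like the following. This is a guess at the shape of the patch, not the patch itself: `coerce_move` is a hypothetical name, and rejecting booleans and non-digit strings is an assumption about the desired behavior.

```python
import json
from typing import Any, Optional


def coerce_move(value: Any) -> Optional[int]:
    """Accept an int or a string of ASCII digits; return None otherwise."""
    if isinstance(value, bool):  # bool is a subclass of int, so exclude it explicitly
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        text = value.strip()
        if text.isascii() and text.isdigit():
            return int(text)
    return None


# Both response styles mentioned in the note resolve to the same move.
print(coerce_move(json.loads('{"move": "4"}')["move"]))  # 4
print(coerce_move(json.loads('{"move": 4}')["move"]))    # 4
print(coerce_move("center"))                             # None
```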

---

## M1 Gate: FAILED (all models)

No model passes the M1 gate. The blocker is **Level 1 (Board State Tracking)**.

### Recommendation

The L1 failure is consistent and structural. All models understand the format and can make reasonable *opening* moves, but they fail to avoid already-occupied squares. Options for M1:

1. **Lower the L1 pass threshold** from 100% to ≥ 75%. The failing scenarios require recognizing occupied positions from a sparse JSON array, which is a known weakness. This would allow proceeding to M1 with flagged risk, but only once a model passes at least 3 of the 4 scenarios; the current 50% scores would still fall short of a 75% bar.
2. **Prompt engineering**: add an explicit "The following squares are taken: X at positions [P1, P2]" line to the prompt to see whether board tracking improves.
3. **Re-evaluate the L1 gate requirement**: models pass L2–L5 (resource, tactics, trade, campaign), which are more directly relevant to Bannerlord play. Consider whether L1 is the right gate.
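Option 2 can be prototyped with a small prompt helper. The sketch below is one possible wording built around the sentence quoted in that option. The board layout (a 9-element list with `None` for empty squares), the example position, and the name `describe_board` are assumptions for illustration, and whether this wording actually improves L1 scores is untested here.

```python
from typing import List, Optional


def describe_board(board: List[Optional[str]]) -> str:
    """Spell out occupied and free squares so the model need not infer them."""
    lines = []
    for mark in ("X", "O"):
        taken = [i for i, cell in enumerate(board) if cell == mark]
        if taken:
            lines.append(f"The following squares are taken: {mark} at positions {taken}")
    free = [i for i, cell in enumerate(board) if cell is None]
    lines.append(f"You must choose one of the empty squares: {free}")
    return "\n".join(lines)


# Hypothetical partially filled board: O on 0 and 2, X on the center.
board = ["O", None, "O", None, "X", None, None, None, None]
print(describe_board(board))
```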

---

## qwen3:14b

`qwen3:14b` is **not available** on this Ollama instance. The available qwen3 model is `qwen3:30b`, which was not benchmarked: it is significantly slower, and running it requires an explicit decision.

---

## Result Files

| Model | File |
|-------|------|
| qwen2.5:14b | `results/qwen2.5_14b_20260323_142119.json` |
| hermes3:latest | `results/hermes3_latest_20260323_152900.json` |
| hermes3:8b | `results/hermes3_8b_20260323_153000.json` |