# Bannerlord M0 — Cognitive Benchmark Scorecard

**Date:** 2026-03-23

**Benchmark:** 6-level cognitive harness (L0–L5)

**M1 Gate:** Must pass L0 + L1, latency < 10s per decision

---
## Results Summary

| Level | Description | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|:-----------:|:--------------:|:----------:|
| **L0 [M1 GATE]** | JSON Compliance | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **L1 [M1 GATE]** | Board State Tracking | ✗ FAIL 50% | ✗ FAIL 50% | ✗ FAIL 50% |
| L2 | Resource Management | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L3 | Battle Tactics | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L4 | Trade Route | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L5 | Mini Campaign | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **M1 GATE** | | ✗ **FAIL** | ✗ **FAIL** | ✗ **FAIL** |

---
## Latency (p50 / p99)

| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|----------------|------------|
| L0 | 1443ms / 6348ms | 1028ms / 1184ms | 570ms / 593ms |
| L1 | 943ms / 1184ms | 1166ms / 1303ms | 767ms / 1313ms |
| L2 | 2936ms / 3122ms | 2032ms / 2232ms | 2408ms / 2832ms |
| L3 | 2248ms / 3828ms | 1614ms / 3525ms | 2174ms / 3437ms |
| L4 | 3235ms / 3318ms | 2724ms / 3038ms | 2507ms / 3420ms |
| L5 | 3414ms / 3970ms | 3137ms / 3433ms | 2571ms / 2763ms |

All models are **well under the 10s latency threshold** on the gate levels L0–L1, and on every other level as well. The slowest figure in the table is the 6348ms p99 for qwen2.5:14b at L0.
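For readers who want to reproduce the p50 / p99 columns from raw timings, here is a minimal sketch using the nearest-rank percentile definition. How the harness actually computes its percentiles is not stated in this scorecard, so both the method and the sample values below are assumptions for illustration only.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-decision latencies in milliseconds (not taken from the result files).
latencies_ms = [812, 790, 1004, 845, 2310]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"{p50}ms / {p99}ms")
```

With only a handful of samples per level, p99 under this definition is simply the slowest observed decision, which is worth keeping in mind when reading the table.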

---

## Level 1 Failure Analysis

All three models fail L1 with the **same pattern** (2 of 4 scenarios pass):

| Scenario | Expected | All models (observed) |
|----------|----------|-----------------------|
| Empty board: opening move | Any empty square | ✓ center (4) |
| Block opponent's winning move | Position 1 (only block) | ✗ position 4 (occupied!) |
| Take winning move | Position 1 (win) | ✗ position 0 or 2 (occupied!) |
| Legal move on partially filled board | Any of 6, 7, 8 | ✓ position 6 |

**Root cause:** Models choose moves by heuristic (center, corners) without checking whether the chosen square is already occupied. They read the board description but do not cross-reference their move choice against it. Both failures are illegal moves onto occupied squares, not merely suboptimal ones, so this is a genuine spatial state-tracking failure.
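The cross-reference the models skip can be sketched in a few lines. This is an illustration only: it assumes the board is a 9-element list indexed 0–8 with `None` for empty squares, and the helper names (`is_legal`, `winning_moves`) and the example position are hypothetical, not taken from the harness.

```python
# All eight lines on a 3x3 board, as index triples.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def is_legal(board, move):
    """A move is legal only if it is an in-range index of an empty square."""
    return isinstance(move, int) and 0 <= move < len(board) and board[move] is None

def winning_moves(board, player):
    """Empty squares that would complete a line for `player`."""
    moves = set()
    for line in WIN_LINES:
        cells = [board[i] for i in line]
        if cells.count(player) == 2 and cells.count(None) == 1:
            moves.add(line[cells.index(None)])
    return sorted(moves)

# Hypothetical "block" position: X threatens the top row, O holds the center.
board = ["X", None, "X",
         None, "O", None,
         None, None, None]

print(is_legal(board, 4))         # center is occupied, so this is False
print(winning_moves(board, "X"))  # squares O must block
```

In this position the heuristic "take the center" produces an illegal move, while checking the opponent's `winning_moves` yields the single blocking square.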

**Note:** The `hermes3` models emit `"move": "4"` (string) rather than `"move": 4` (int). The benchmark was patched to coerce digit strings to int for L1, since type fidelity is already tested at L0.
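A coercion step of this kind might look like the following sketch. The actual patch is not shown in this scorecard, so the function name `coerce_move` and its exact rules (ASCII digits only, everything else rejected) are assumptions.

```python
import json

def coerce_move(value):
    """Return an int for ints and ASCII digit strings; return None for anything else."""
    if isinstance(value, bool):  # bool is a subclass of int; do not treat it as a move
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        text = value.strip()
        if text.isascii() and text.isdigit():
            return int(text)
    return None

reply = json.loads('{"move": "4"}')  # string-typed move, as the hermes3 models emit
print(coerce_move(reply["move"]))
```

Returning `None` for non-digit input keeps the leniency narrow: only the string-versus-int mismatch is forgiven, while any other malformed value still counts as a failure.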

---

## M1 Gate: FAILED (all models)

No model passes the M1 gate. The blocker is **Level 1 (Board State Tracking)**; L0 passes and latency is within budget for every model.
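The gate decision can be expressed as a small check over per-level results. The sketch below is an assumption about how such a check could be written, not the harness's own code: the `LevelResult` type and `failing_gate_levels` function are hypothetical, and p99 is used as a stand-in for "latency per decision".

```python
from dataclasses import dataclass

@dataclass
class LevelResult:
    pass_rate: float  # fraction of scenarios passed, 0.0 to 1.0
    p99_ms: float     # p99 decision latency in milliseconds

def failing_gate_levels(results, gate_levels=("L0", "L1"),
                        min_pass_rate=1.0, max_latency_ms=10_000):
    """Return the gate levels that miss the pass-rate or latency requirement."""
    failing = []
    for level in gate_levels:
        result = results[level]
        if result.pass_rate < min_pass_rate or result.p99_ms >= max_latency_ms:
            failing.append(level)
    return failing

# qwen2.5:14b figures from the tables in this scorecard.
qwen = {"L0": LevelResult(1.0, 6348), "L1": LevelResult(0.5, 1184)}

blockers = failing_gate_levels(qwen)
print("M1 gate:", "PASS" if not blockers else f"FAIL (blocked by {', '.join(blockers)})")
```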
### Recommendation

The L1 failure is consistent and structural. All models understand the format and can make reasonable *opening* moves, but they fail to avoid already-occupied squares. Options for M1:

1. **Lower the L1 pass threshold** from 100% to ≥ 75%. The scenarios where models fail require recognizing occupied positions from a sparse JSON array, which is a known weakness. Note that all three models currently score 50%, so a 75% threshold alone would not admit them; this option only allows proceeding to M1 (with flagged risk) if another change lifts L1 to at least 3 of 4 scenarios.
2. **Prompt engineering**: add an explicit line such as "The following squares are taken: X at positions [P1, P2]" to the prompt and check whether board tracking improves.
3. **Re-evaluate the L1 gate requirement**: models pass L2–L5 (resource, tactics, trade, campaign), which are more directly relevant to Bannerlord play. Consider whether L1 is the right gate.
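Option 2 could be prototyped with a helper that turns the board into the explicit sentence proposed above. This is a sketch under assumptions: the board is taken to be a 9-element list with `None` for empty squares, and `describe_occupied` is a hypothetical name, not part of the existing harness.

```python
def describe_occupied(board):
    """Build an explicit occupied/empty summary line to append to the L1 prompt."""
    parts = []
    for mark in ("X", "O"):
        positions = [i for i, cell in enumerate(board) if cell == mark]
        if positions:
            parts.append(f"{mark} at positions {positions}")
    empty = [i for i, cell in enumerate(board) if cell is None]
    taken = "; ".join(parts) if parts else "none"
    return f"The following squares are taken: {taken}. Empty squares: {empty}."

# Hypothetical position for illustration.
board = ["X", None, "X",
         None, "O", None,
         None, None, None]

print(describe_occupied(board))
```

Listing the empty squares as well as the taken ones gives the model a direct set to choose from, which makes it easier to tell whether a remaining failure is a state-tracking problem or a reasoning problem.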

---

## qwen3:14b

`qwen3:14b` is **not available** on this Ollama instance. The available qwen3 model is `qwen3:30b`.

`qwen3:30b` was not benchmarked: it is significantly slower, and running it requires an explicit decision.

---
## Result Files

| Model | File |
|-------|------|
| qwen2.5:14b | `results/qwen2.5_14b_20260323_142119.json` |
| hermes3:latest | `results/hermes3_latest_20260323_152900.json` |
| hermes3:8b | `results/hermes3_8b_20260323_153000.json` |