Timmy-time-dashboard/timmy-benchmark/results/SCORECARD.md
Alexander Whitestone 9e08e87312 [claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) (#1159)
Co-authored-by: Alexander Whitestone <alexpaynex@gmail.com>
Co-committed-by: Alexander Whitestone <alexpaynex@gmail.com>
2026-03-23 19:38:48 +00:00

# Bannerlord M0 — Cognitive Benchmark Scorecard
**Date:** 2026-03-23
**Benchmark:** 6-level cognitive harness (L0-L5)
**M1 Gate:** Must pass L0 + L1, latency < 10s per decision

---
## Results Summary
| Level | Description | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|:-----------:|:--------------:|:----------:|
| **L0 [M1 GATE]** | JSON Compliance | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **L1 [M1 GATE]** | Board State Tracking | ✗ FAIL 50% | ✗ FAIL 50% | ✗ FAIL 50% |
| L2 | Resource Management | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L3 | Battle Tactics | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L4 | Trade Route | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L5 | Mini Campaign | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **M1 GATE** | | ✗ **FAIL** | ✗ **FAIL** | ✗ **FAIL** |
---
## Latency (p50 / p99)
| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|----------------|------------|
| L0 | 1443ms / 6348ms | 1028ms / 1184ms | 570ms / 593ms |
| L1 | 943ms / 1184ms | 1166ms / 1303ms | 767ms / 1313ms |
| L2 | 2936ms / 3122ms | 2032ms / 2232ms | 2408ms / 2832ms |
| L3 | 2248ms / 3828ms | 1614ms / 3525ms | 2174ms / 3437ms |
| L4 | 3235ms / 3318ms | 2724ms / 3038ms | 2507ms / 3420ms |
| L5 | 3414ms / 3970ms | 3137ms / 3433ms | 2571ms / 2763ms |

All models are **well under the 10s latency threshold** at every level, including the gate levels L0-L1. The slowest p99 is 6348ms (qwen2.5:14b at L0).

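
For readers who want to reproduce the p50 / p99 columns from raw timings, here is a minimal sketch using the nearest-rank percentile method. The scorecard does not say which percentile method the benchmark uses, so `percentile`, `within_threshold`, and the sample values below are assumptions for illustration, not the harness's own code or data.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def within_threshold(samples_ms, limit_ms=10_000):
    """Assumption: the 10s-per-decision rule is checked against p99."""
    return percentile(samples_ms, 99) < limit_ms


# Hypothetical per-decision latencies in milliseconds (not from the result files).
samples = [900, 950, 1000, 1100, 1200]
print(f"{percentile(samples, 50)}ms / {percentile(samples, 99)}ms")
print("under threshold:", within_threshold(samples))
```

With only a handful of scenarios per level, p99 under nearest-rank is simply the slowest sample, which is worth keeping in mind when comparing the p99 column across models.
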
---
## Level 1 Failure Analysis
All three models fail L1 with an **identical pattern** (2/4 scenarios pass):

| Scenario | Expected | All Models |
|----------|----------|-----------|
| Empty board — opening move | Any empty square | ✓ center (4) |
| Block opponent's winning move | Position 1 (only block) | ✗ position 4 (occupied!) |
| Take winning move | Position 1 (win) | ✗ position 0 or 2 (occupied!) |
| Legal move on partially filled board | Any of 6,7,8 | ✓ position 6 |

**Root cause:** Models choose moves by heuristic (center, corners) without checking whether the chosen square is already occupied. They read the board description but don't cross-reference their move choice against it. This is a genuine spatial state-tracking failure.

**Note:** `hermes3` models emit `"move": "4"` (string) vs `"move": 4` (int). The benchmark was patched to coerce string digits to int for L1, since type fidelity is already tested at L0.

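
The coercion and the occupancy check described above can be sketched as follows. This is not the benchmark's actual patch: `coerce_move` and `is_legal` are hypothetical names, and the board is assumed to be a 9-element list with `None` for empty squares, which may differ from the harness's real representation.

```python
import json


def coerce_move(raw):
    """Return the move as an int. Accepts ints and digit-only strings;
    returns None for anything else."""
    if isinstance(raw, bool):  # bool is a subclass of int, so reject it first
        return None
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        # isascii() guards against Unicode digits that int() cannot parse.
        if text.isascii() and text.isdigit():
            return int(text)
    return None


def is_legal(board, move):
    """A move is legal if it is on the board and the square is empty."""
    return move is not None and 0 <= move < len(board) and board[move] is None


# Assumed board layout: index 0-8, None means empty.
board = ["X", None, None, None, "O", None, None, None, None]
response = json.loads('{"move": "4"}')  # string-typed move, as hermes3 emits

move = coerce_move(response["move"])
print(move, is_legal(board, move))
```

Here the string `"4"` is accepted as a move, but the move is still rejected because square 4 is occupied, which mirrors the failure mode in the table.
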
---
## M1 Gate: FAILED (all models)
No model passes the M1 gate. The blocker is **Level 1 — Board State Tracking**.
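
The gate rule stated at the top of this scorecard (pass L0 and L1, latency under 10s per decision) can be sketched as a small check. The function name `m1_gate` and the input shape are hypothetical, and using p99 as the latency figure is an assumption. The example values are the hermes3:8b rows from the tables above.

```python
GATE_LEVELS = ("L0", "L1")
LATENCY_LIMIT_MS = 10_000


def m1_gate(levels):
    """Evaluate the M1 gate.

    levels maps a level name to {"passed": bool, "p99_ms": number}.
    Returns (gate_passed, list_of_blockers).
    """
    blockers = []
    for name in GATE_LEVELS:
        result = levels[name]
        if not result["passed"]:
            blockers.append(f"{name}: level not passed")
        if result["p99_ms"] >= LATENCY_LIMIT_MS:
            blockers.append(f"{name}: p99 latency at or above {LATENCY_LIMIT_MS}ms")
    return (not blockers, blockers)


# hermes3:8b gate levels as reported in this scorecard.
hermes3_8b = {
    "L0": {"passed": True, "p99_ms": 593},
    "L1": {"passed": False, "p99_ms": 1313},
}
print(m1_gate(hermes3_8b))
```

Returning the list of blockers alongside the verdict makes it easy to see that latency is not the problem here; only the L1 result blocks the gate.
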
### Recommendation
The L1 failure is consistent and structural. All models understand the format and can make reasonable *opening* moves but fail to avoid already-occupied squares. Options for M1:
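1. **Lower the L1 pass threshold** from 100% to ≥ 75%. The scenarios where models fail require recognizing occupied positions from a sparse JSON array, which is a known weakness. Note that every model currently scores 50% (2/4), so a 75% threshold would not clear the gate by itself; it would allow proceeding to M1 with flagged risk only once one more scenario passes, for example after option 2.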
2. **Prompt engineering** — add explicit "The following squares are taken: X at positions [P1, P2]" to the prompt to see if board tracking improves.
3. **Re-evaluate the L1 gate requirement**: models pass L2-L5 (resource, tactics, trade, campaign), which are more directly relevant to Bannerlord play. Consider whether L1 is the right gate.

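
As an illustration of option 2, the explicit occupied-squares sentence could be generated from the board before it is appended to the prompt. This is only a sketch: `occupied_summary` is a hypothetical helper, the board is assumed to be a 9-element list with `None` for empty squares, and the trailing "Empty squares" sentence is an addition beyond the wording proposed above.

```python
def occupied_summary(board):
    """Build a one-line statement of taken and empty squares for the prompt."""
    parts = []
    for mark in ("X", "O"):
        positions = [i for i, cell in enumerate(board) if cell == mark]
        if positions:
            parts.append(f"{mark} at positions {positions}")
    empty = [i for i, cell in enumerate(board) if cell is None]
    taken = "; ".join(parts) if parts else "none"
    return f"The following squares are taken: {taken}. Empty squares: {empty}."


# Assumed board layout: index 0-8, None means empty.
board = ["X", "X", None, None, "O", None, None, None, None]
print(occupied_summary(board))
```

Whether this actually improves L1 scores is exactly what option 2 proposes to test; the sketch only shows how the extra prompt line might be derived mechanically from the board state.
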
---
## qwen3:14b
Model **not available** on this Ollama instance. Available qwen3 model: `qwen3:30b`.
`qwen3:30b` was not benchmarked (significantly slower; requires an explicit decision to run).

---
## Result Files
| Model | File |
|-------|------|
| qwen2.5:14b | `results/qwen2.5_14b_20260323_142119.json` |
| hermes3:latest | `results/hermes3_latest_20260323_152900.json` |
| hermes3:8b | `results/hermes3_8b_20260323_153000.json` |