feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion
Some checks failed
Tests / lint (pull_request) Failing after 18s
Tests / test (pull_request) Has been skipped

Refs #1092

- Run full 6-level benchmark against hermes3:latest and hermes3:8b
- Fix Level 1 harness: coerce string-digit moves to int (hermes3 emits
  "move": "4" instead of "move": 4; type fidelity is already tested at L0)
- All three models (qwen2.5:14b, hermes3:latest, hermes3:8b) pass L0
  and L2–L5 but fail L1 at 50% score
- Root cause: models pick heuristic moves without checking if the target
  square is already occupied
- M1 gate: FAILED for all models; qwen3:14b not available on instance
- Add SCORECARD.md with full cross-model comparison and recommendations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Alexander Whitestone
2026-03-23 15:31:07 -04:00
parent 5f70fe5a2b
commit 40561b480f
4 changed files with 1037 additions and 0 deletions


@@ -163,6 +163,9 @@ def run(client: Any, model: str, verbose: bool = False) -> LevelResult:
    if "move" in parsed:
        move = parsed["move"]
        # Coerce string digits to int (some models emit "4" instead of 4)
        if isinstance(move, str) and move.strip().lstrip("-").isdigit():
            move = int(move.strip())
        if isinstance(move, int):
            board = scenario["board"]
            move_legal = 0 <= move <= 8 and board[move] is None


@@ -0,0 +1,82 @@
# Bannerlord M0 — Cognitive Benchmark Scorecard
**Date:** 2026-03-23
**Benchmark:** 6-level cognitive harness (L0–L5)
**M1 Gate:** Must pass L0 + L1, latency < 10s per decision
---
## Results Summary
| Level | Description | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|:-----------:|:--------------:|:----------:|
| **L0 [M1 GATE]** | JSON Compliance | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **L1 [M1 GATE]** | Board State Tracking | ✗ FAIL 50% | ✗ FAIL 50% | ✗ FAIL 50% |
| L2 | Resource Management | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L3 | Battle Tactics | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L4 | Trade Route | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| L5 | Mini Campaign | ✓ PASS 100% | ✓ PASS 100% | ✓ PASS 100% |
| **M1 GATE** | | ✗ **FAIL** | ✗ **FAIL** | ✗ **FAIL** |
---
## Latency (p50 / p99)
| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|-------------|----------------|------------|
| L0 | 1443ms / 6348ms | 1028ms / 1184ms | 570ms / 593ms |
| L1 | 943ms / 1184ms | 1166ms / 1303ms | 767ms / 1313ms |
| L2 | 2936ms / 3122ms | 2032ms / 2232ms | 2408ms / 2832ms |
| L3 | 2248ms / 3828ms | 1614ms / 3525ms | 2174ms / 3437ms |
| L4 | 3235ms / 3318ms | 2724ms / 3038ms | 2507ms / 3420ms |
| L5 | 3414ms / 3970ms | 3137ms / 3433ms | 2571ms / 2763ms |
All models are **well under the 10s latency threshold** for L0–L1.
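
The p50 and p99 figures in the result files are consistent with a simple index-based percentile over the sorted per-trial latencies; with only three or four trials per level, p99 is simply the slowest trial. A minimal sketch, assuming that method. `percentile` is a hypothetical helper inferred from the reported numbers, not necessarily the harness's own function:

```python
def percentile(samples: list[float], q: float) -> float:
    """Return sorted(samples)[int(n * q)], clamped to the last element.

    Assumption: this indexing rule was inferred from the reported values,
    not read from the harness source.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    index = min(int(len(ordered) * q), len(ordered) - 1)
    return ordered[index]


# hermes3:8b, Level 1 latencies in ms, rounded from the result file
l1_latencies = [766.86, 734.17, 1312.80, 751.05]
p50 = percentile(l1_latencies, 0.50)
p99 = percentile(l1_latencies, 0.99)
print(f"{p50:.0f}ms / {p99:.0f}ms")  # prints 767ms / 1313ms
```

With samples this small, the p99 column is better read as a worst-case latency than as a true tail estimate.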
---
## Level 1 Failure Analysis
All three models fail L1 with an **identical pattern** (2/4 scenarios pass):
| Scenario | Expected | All Models |
|----------|----------|-----------|
| Empty board — opening move | Any empty square | ✓ center (4) |
| Block opponent's winning move | Position 1 (only block) | ✗ position 4 (occupied!) |
| Take winning move | Position 1 (win) | ✗ position 0 or 2 (occupied!) |
| Legal move on partially filled board | Any of 6,7,8 | ✓ position 6 |
**Root cause:** Models choose moves by heuristic (center, corners) without checking whether the chosen square is already occupied. They read the board description but don't cross-reference their move choice against it. This is a genuine spatial state-tracking failure.
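
The check the models skip is small. A minimal sketch of it, using the board encoding from the L1 prompts (a 9-element list with `None` for empty squares); `legal_moves` is a hypothetical helper name, not part of the harness:

```python
from typing import Optional

Board = list[Optional[str]]


def legal_moves(board: Board) -> list[int]:
    """Indices of the empty squares on a 9-element tic-tac-toe board."""
    return [i for i, cell in enumerate(board) if cell is None]


# Scenario 1 from the benchmark: O holds 0 and 2, X holds the center.
board: Board = ["O", None, "O", None, "X", None, None, None, None]
print(4 in legal_moves(board))  # False: the center is already taken
print(1 in legal_moves(board))  # True: the only blocking square is free
```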
**Note:** `hermes3` models emit `"move": "4"` (string) vs `"move": 4` (int). The benchmark was patched to coerce string digits to int for L1, since type fidelity is already tested at L0.
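
A stricter variant of that coercion can be sketched as follows. This is not the patched harness code: `coerce_move` is a hypothetical helper, and rejecting booleans and non-ASCII digits is an assumption about what the harness should accept.

```python
import re
from typing import Any, Optional

# Optional minus sign followed by ASCII digits only
_INT_RE = re.compile(r"-?[0-9]+")


def coerce_move(value: Any) -> Optional[int]:
    """Return value as an int, or None if it is not an integer-like move."""
    if isinstance(value, bool):  # bool is a subclass of int in Python
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str) and _INT_RE.fullmatch(value.strip()):
        return int(value.strip())
    return None


print(coerce_move("4"))     # 4
print(coerce_move(4))       # 4
print(coerce_move("--4"))   # None
print(coerce_move("four"))  # None
```

A full-match pattern avoids inputs such as `"--4"`, which pass a `lstrip("-").isdigit()` test but make `int()` raise `ValueError`.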
---
## M1 Gate: FAILED (all models)
No model passes the M1 gate. The blocker is **Level 1 — Board State Tracking**.
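
The gate rule stated at the top of this scorecard (pass L0 and L1, with latency under 10s per decision) can be sketched as below. The field names follow the result JSON; `evaluate_m1_gate` and the use of `latency_p99_ms` as the per-decision latency are assumptions, not the harness's actual implementation.

```python
def evaluate_m1_gate(levels: dict, threshold_ms: float = 10_000) -> tuple[bool, list[str]]:
    """Check gate levels 0 and 1 for pass status and latency."""
    notes = []
    for key in ("0", "1"):
        level = levels[key]
        if not level["passed"]:
            notes.append(f"Level {key} FAILED (score={level['score']:.2f})")
        if level["latency_p99_ms"] >= threshold_ms:
            # Hypothetical note format for a latency breach
            notes.append(f"Level {key} latency over {threshold_ms:.0f}ms")
    return (not notes, notes)


# Values taken from the hermes3:8b result file (latencies rounded).
levels = {
    "0": {"passed": True, "score": 1.0, "latency_p99_ms": 592.68},
    "1": {"passed": False, "score": 0.5, "latency_p99_ms": 1312.80},
}
print(evaluate_m1_gate(levels))  # (False, ['Level 1 FAILED (score=0.50)'])
```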
### Recommendation
The L1 failure is consistent and structural. All models understand the format and can make reasonable *opening* moves but fail to avoid already-occupied squares. Options for M1:
1. **Lower the L1 pass threshold** from 100% to ≥ 75%. The scenarios where models fail require recognizing occupied positions from a sparse JSON array, which is a known weakness. This would allow proceeding to M1 with flagged risk, but only for a model that passes 3 of 4 scenarios; all three models tested here score 50%.
2. **Prompt engineering** — add explicit "The following squares are taken: X at positions [P1, P2]" to the prompt to see if board tracking improves.
3. **Re-evaluate L1 gate requirement** — models pass L2–L5 (resource, tactics, trade, campaign) which are more directly relevant to Bannerlord play. Consider whether L1 is the right gate.
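
Option 2 could be prototyped by listing occupied and free squares explicitly in the prompt. A minimal sketch; `describe_board` is a hypothetical helper, and the wording is only one possible phrasing that has not been tested against any model:

```python
from typing import Optional


def describe_board(board: list[Optional[str]]) -> str:
    """Spell out which squares each player holds and which are free."""
    parts = []
    for mark in ("X", "O"):
        taken = [i for i, cell in enumerate(board) if cell == mark]
        if taken:
            parts.append(f"{mark} at positions {taken}")
    free = [i for i, cell in enumerate(board) if cell is None]
    lines = []
    if parts:
        lines.append("The following squares are taken: " + "; ".join(parts) + ".")
    lines.append(f"Empty squares: {free}. Your move must be one of these.")
    return "\n".join(lines)


# Scenario 1 board from the L1 benchmark
board = ["O", None, "O", None, "X", None, None, None, None]
print(describe_board(board))
```

The returned text would be appended to the existing L1 prompt, so the scenarios and scoring stay unchanged and any improvement can be attributed to the added description.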
---
## qwen3:14b
Model **not available** on this Ollama instance. Available qwen3 model: `qwen3:30b`.
`qwen3:30b` was not benchmarked (significantly slower; requires explicit decision to run).
---
## Result Files
| Model | File |
|-------|------|
| qwen2.5:14b | `results/qwen2.5_14b_20260323_142119.json` |
| hermes3:latest | `results/hermes3_latest_20260323_152900.json` |
| hermes3:8b | `results/hermes3_8b_20260323_153000.json` |
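
The result files share one layout (`levels` keyed by level number, each with `passed`, `score`, and latency fields), so a scorecard row can be rebuilt from them. A minimal sketch, assuming that layout; `summarize_levels` is a hypothetical helper, and the inline dictionary stands in for `json.load` on one of the files above:

```python
def summarize_levels(result: dict) -> list[str]:
    """Format one line per level: status, score, and p50/p99 latency."""
    rows = []
    for key in sorted(result["levels"], key=int):
        level = result["levels"][key]
        status = "PASS" if level["passed"] else "FAIL"
        rows.append(
            f"L{key} {status} {level['score']:.0%} "
            f"{level['latency_p50_ms']:.0f}ms / {level['latency_p99_ms']:.0f}ms"
        )
    return rows


# Trimmed excerpt of the hermes3:8b results (latencies rounded).
result = {
    "model": "hermes3:8b",
    "levels": {
        "0": {"passed": True, "score": 1.0,
              "latency_p50_ms": 569.81, "latency_p99_ms": 592.68},
        "1": {"passed": False, "score": 0.5,
              "latency_p50_ms": 766.86, "latency_p99_ms": 1312.80},
    },
}
for row in summarize_levels(result):
    print(row)
# L0 PASS 100% 570ms / 593ms
# L1 FAIL 50% 767ms / 1313ms
```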


@@ -0,0 +1,476 @@
{
"model": "hermes3:8b",
"timestamp": "2026-03-23T19:29:22.731080+00:00",
"skipped": false,
"levels": {
"0": {
"level": 0,
"name": "JSON Compliance (Coin Flip)",
"trials": [
{
"trial_index": 0,
"prompt": "A coin is flipped. Respond with exactly: {\"choice\": \"heads\"} or {\"choice\": \"tails\"}. Pick one. JSON only.",
"raw_response": "{\"choice\": \"heads\"}",
"parsed": {
"choice": "heads"
},
"valid_json": true,
"schema_valid": true,
"value_valid": true,
"latency_ms": 522.9010581970215,
"error": ""
},
{
"trial_index": 1,
"prompt": "You must attack or defend. Respond with: {\"action\": \"attack\", \"confidence\": 0.8} or {\"action\": \"defend\", \"confidence\": 0.6}. Replace confidence with your own value 0.0-1.0. JSON only.",
"raw_response": "{\"action\": \"defend\", \"confidence\": 0.7}",
"parsed": {
"action": "defend",
"confidence": 0.7
},
"valid_json": true,
"schema_valid": true,
"value_valid": true,
"latency_ms": 592.6809310913086,
"error": ""
},
{
"trial_index": 2,
"prompt": "Choose a direction to march. Respond with exactly: {\"direction\": \"north\", \"reason\": \"string explaining why\"}. Pick north/south/east/west. JSON only.",
"raw_response": "{\"direction\": \"north\", \"reason\": \"string explaining why\"}",
"parsed": {
"direction": "north",
"reason": "string explaining why"
},
"valid_json": true,
"schema_valid": true,
"value_valid": true,
"latency_ms": 569.8130130767822,
"error": ""
}
],
"passed": true,
"score": 1.0,
"latency_p50_ms": 569.8130130767822,
"latency_p99_ms": 592.6809310913086
},
"1": {
"level": 1,
"name": "Board State Tracking (Tic-Tac-Toe)",
"trials": [
{
"scenario_index": 0,
"description": "Empty board \u2014 opening move",
"prompt": "Board state: [null,null,null,null,null,null,null,null,null]. You are X. It is your turn. Choose a move. Respond: {\"move\": <0-8>, \"reason\": \"<why>\"}",
"raw_response": "{\"move\": \"4\", \"reason\": \"Center move to establish dominance\"}",
"parsed": {
"move": "4",
"reason": "Center move to establish dominance"
},
"valid_json": true,
"move_legal": true,
"move_correct": true,
"latency_ms": 766.8571472167969,
"error": ""
},
{
"scenario_index": 1,
"description": "Block opponent's winning move",
"prompt": "Board: [\"O\",null,\"O\",null,\"X\",null,null,null,null]. O has positions 0 and 2. You are X. O will win on next turn unless you block. Respond: {\"move\": <0-8>, \"reason\": \"<why>\"}",
"raw_response": "{\"move\": 4, \"reason\": \"Block the winning move for O in the center\"}",
"parsed": {
"move": 4,
"reason": "Block the winning move for O in the center"
},
"valid_json": true,
"move_legal": false,
"move_correct": false,
"latency_ms": 734.1668605804443,
"error": ""
},
{
"scenario_index": 2,
"description": "Take winning move",
"prompt": "Board: [\"X\",null,\"X\",null,\"O\",null,null,\"O\",null]. You are X. You have positions 0 and 2. You can win this turn. Respond: {\"move\": <0-8>, \"reason\": \"<why>\"}",
"raw_response": "{\"move\": 0, \"reason\": \"I can win this turn by placing my X in position 0, creating a winning line with X at positions 0, 4, and 8.\"}",
"parsed": {
"move": 0,
"reason": "I can win this turn by placing my X in position 0, creating a winning line with X at positions 0, 4, and 8."
},
"valid_json": true,
"move_legal": false,
"move_correct": false,
"latency_ms": 1312.8020763397217,
"error": ""
},
{
"scenario_index": 3,
"description": "Legal move on partially filled board",
"prompt": "Board: [\"X\",\"O\",\"X\",\"O\",\"X\",\"O\",null,null,null]. You are O. Choose a legal move (positions 6, 7, or 8 are available). Respond: {\"move\": <0-8>, \"reason\": \"<why>\"}",
"raw_response": "{\"move\": 6, \"reason\": \"Block X's potential winning move in the center.\"}",
"parsed": {
"move": 6,
"reason": "Block X's potential winning move in the center."
},
"valid_json": true,
"move_legal": true,
"move_correct": true,
"latency_ms": 751.0488033294678,
"error": ""
}
],
"passed": false,
"score": 0.5,
"latency_p50_ms": 766.8571472167969,
"latency_p99_ms": 1312.8020763397217
},
"2": {
"level": 2,
"name": "Resource Management (Party Economy)",
"trials": [
{
"scenario_index": 0,
"description": "Budget allocation under constraint",
"raw_response": "{\"choices\": [\"buy_food\", \"repair_armor\"], \"gold_spent\": 350, \"reason\": \"The priority is to address the critical morale issue and reduce casualty rates. Buying food for 20 days will stabilize morale, preventing potential desertions. Repairing armor will also help reduce the casualty rate, making the remaining troops more effective. Recruit infantry is not chosen as it would deplete all the gold, leaving no funds for food or armor repair.\"}",
"parsed": {
"choices": [
"buy_food",
"repair_armor"
],
"gold_spent": 350,
"reason": "The priority is to address the critical morale issue and reduce casualty rates. Buying food for 20 days will stabilize morale, preventing potential desertions. Repairing armor will also help reduce the casualty rate, making the remaining troops more effective. Recruit infantry is not chosen as it would deplete all the gold, leaving no funds for food or armor repair."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2832.072973251343,
"error": ""
},
{
"scenario_index": 1,
"description": "Troop tier upgrade decision",
"raw_response": "{\"action\": \"upgrade_recruits\", \"reason\": \"Upgrading recruits to Tier-2 will improve our party's combat effectiveness against the upcoming raider camp attack. The moderate difficulty suggests we need a stronger force to succeed.\", \"expected_outcome\": \"Increased chances of victory in the upcoming battle, potentially saving resources in the long run by avoiding higher upkeep costs of dismissing recruits.\"}",
"parsed": {
"action": "upgrade_recruits",
"reason": "Upgrading recruits to Tier-2 will improve our party's combat effectiveness against the upcoming raider camp attack. The moderate difficulty suggests we need a stronger force to succeed.",
"expected_outcome": "Increased chances of victory in the upcoming battle, potentially saving resources in the long run by avoiding higher upkeep costs of dismissing recruits."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2407.7491760253906,
"error": ""
},
{
"scenario_index": 2,
"description": "Multi-turn planning horizon",
"raw_response": "{\"extra_food_days\": 5, \"cost\": 50, \"remaining_gold\": 250, \"reason\": \"Buying 5 extra food days will ensure we have enough food to cross the desert, and we will still have enough gold left to reach the town on day 10 to buy more supplies.\"}",
"parsed": {
"extra_food_days": 5,
"cost": 50,
"remaining_gold": 250,
"reason": "Buying 5 extra food days will ensure we have enough food to cross the desert, and we will still have enough gold left to reach the town on day 10 to buy more supplies."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2028.775930404663,
"error": ""
}
],
"passed": true,
"score": 1.0,
"latency_p50_ms": 2407.7491760253906,
"latency_p99_ms": 2832.072973251343
},
"3": {
"level": 3,
"name": "Battle Tactics (Formation Commands)",
"trials": [
{
"scenario_index": 0,
"description": "Ranged vs infantry \u2014 defensive opening",
"raw_response": "{\"infantry_order\": \"skirmish\", \"archer_order\": \"skirmish\", \"reason\": \"The enemy is advancing with heavy infantry, so our priority should be to disrupt their formation and inflict casualties from range before they close the gap. The archers and infantry will pepper the enemy with arrows as they approach, aiming to break their cohesion and lower their morale.\", \"expected_outcome\": \"The enemy's heavy infantry will take significant casualties and lose cohesion before reaching our lines, giving us a tactical advantage when the two forces meet in melee combat.\"}",
"parsed": {
"infantry_order": "skirmish",
"archer_order": "skirmish",
"reason": "The enemy is advancing with heavy infantry, so our priority should be to disrupt their formation and inflict casualties from range before they close the gap. The archers and infantry will pepper the enemy with arrows as they approach, aiming to break their cohesion and lower their morale.",
"expected_outcome": "The enemy's heavy infantry will take significant casualties and lose cohesion before reaching our lines, giving us a tactical advantage when the two forces meet in melee combat."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": false,
"latency_ms": 3436.897039413452,
"error": ""
},
{
"scenario_index": 1,
"description": "Outnumbered \u2014 retreat decision",
"raw_response": "{\"order\": \"fallback\", \"direction\": \"east\", \"reason\": \"Our troops are heavily wounded and outmatched by the 60 bandits surrounding us. We must retreat immediately to the east where there is an escape route. Fallback is the best option to preserve our troops and live to fight another day.\", \"priority\": \"preserve_troops\"}",
"parsed": {
"order": "fallback",
"direction": "east",
"reason": "Our troops are heavily wounded and outmatched by the 60 bandits surrounding us. We must retreat immediately to the east where there is an escape route. Fallback is the best option to preserve our troops and live to fight another day.",
"priority": "preserve_troops"
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2173.5548973083496,
"error": ""
},
{
"scenario_index": 2,
"description": "Cavalry flanking opportunity",
"raw_response": "{\"cavalry_order\": \"flank_left\", \"timing\": \"now\", \"reason\": \"Enemy left flank is unprotected and our cavalry can break their line quickly, routing them in ~30 seconds.\", \"risk\": \"low\"}",
"parsed": {
"cavalry_order": "flank_left",
"timing": "now",
"reason": "Enemy left flank is unprotected and our cavalry can break their line quickly, routing them in ~30 seconds.",
"risk": "low"
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 1566.443920135498,
"error": ""
}
],
"passed": true,
"score": 1.0,
"latency_p50_ms": 2173.5548973083496,
"latency_p99_ms": 3436.897039413452
},
"4": {
"level": 4,
"name": "Trade Route (Campaign Navigation)",
"trials": [
{
"scenario_index": 0,
"description": "Optimal trade route selection",
"raw_response": "{\"route\": \"A\", \"expected_profit\": 500, \"days_used\": 8, \"reason\": \"Route A offers the highest profit with a manageable travel time and low risk. Selling grain in Vlandia and furs in Sturgia provides a good balance of goods and markets.\", \"risk\": \"low\"}",
"parsed": {
"route": "A",
"expected_profit": 500,
"days_used": 8,
"reason": "Route A offers the highest profit with a manageable travel time and low risk. Selling grain in Vlandia and furs in Sturgia provides a good balance of goods and markets.",
"risk": "low"
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2507.443904876709,
"error": ""
},
{
"scenario_index": 1,
"description": "Adapt plan when war declared",
"raw_response": "{\"decision\": \"divert_to_sturgia\", \"reason\": \"Vlandia declaring war makes entering their territory too dangerous for trade. Sturgia is neutral but closer than the Empire, so less time and gold at risk.\", \"gold_at_risk\": 400}",
"parsed": {
"decision": "divert_to_sturgia",
"reason": "Vlandia declaring war makes entering their territory too dangerous for trade. Sturgia is neutral but closer than the Empire, so less time and gold at risk.",
"gold_at_risk": 400
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2011.15083694458,
"error": ""
},
{
"scenario_index": 2,
"description": "Multi-stop route planning with constraints",
"raw_response": "{\n \"stops\": [\n \"Rhotae\",\n \"Ortysia\",\n \"Epicrotea\"\n ],\n \"total_days\": 7,\n \"estimated_profit\": 800,\n \"reason\": \"Starting in Rhotae allows buying cheap leather and selling well in the south. Moving to Ortysia to buy grain cheap. Finally, Epicrotea for iron market. 7 days total. 800 gold budget should be enough with this plan.\"\n}",
"parsed": {
"stops": [
"Rhotae",
"Ortysia",
"Epicrotea"
],
"total_days": 7,
"estimated_profit": 800,
"reason": "Starting in Rhotae allows buying cheap leather and selling well in the south. Moving to Ortysia to buy grain cheap. Finally, Epicrotea for iron market. 7 days total. 800 gold budget should be enough with this plan."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 3420.1788902282715,
"error": ""
}
],
"passed": true,
"score": 1.0,
"latency_p50_ms": 2507.443904876709,
"latency_p99_ms": 3420.1788902282715
},
"5": {
"level": 5,
"name": "Mini Campaign (Full Campaign Loop)",
"turns": [
{
"turn": 1,
"state_before": {
"turn": 1,
"gold": 1000,
"troops": 25,
"renown": 150,
"settlements": 0,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"raw_response": "{\"action\": \"trade_circuit\", \"target_faction\": null, \"reason\": \"Establishing trade relations will bring in gold and renown without risking retaliation or damaging relations with any faction. This sets a strong financial foundation for future expansion.\", \"goal\": \"Increase gold and renown\"}",
"parsed": {
"action": "trade_circuit",
"target_faction": null,
"reason": "Establishing trade relations will bring in gold and renown without risking retaliation or damaging relations with any faction. This sets a strong financial foundation for future expansion.",
"goal": "Increase gold and renown"
},
"valid_json": true,
"valid_action": true,
"action": "trade_circuit",
"latency_ms": 2763.1959915161133,
"error": ""
},
{
"turn": 2,
"state_before": {
"turn": 2,
"gold": 1300,
"troops": 25,
"renown": 160,
"settlements": 0,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"raw_response": "{\"action\": \"tournament\", \"target_faction\": null, \"reason\": \"Participating in a tournament will boost my renown and relations without spending gold. This positive publicity can help attract potential vassals or allies in the future.\", \"goal\": \"Increase renown and relations\"}",
"parsed": {
"action": "tournament",
"target_faction": null,
"reason": "Participating in a tournament will boost my renown and relations without spending gold. This positive publicity can help attract potential vassals or allies in the future.",
"goal": "Increase renown and relations"
},
"valid_json": true,
"valid_action": true,
"action": "tournament",
"latency_ms": 2335.5319499969482,
"error": ""
},
{
"turn": 3,
"state_before": {
"turn": 3,
"gold": 1200,
"troops": 25,
"renown": 220,
"settlements": 0,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"raw_response": "{\"action\": \"recruit_troops\", \"target_faction\": null, \"reason\": \"Recruiting more troops will strengthen my military force, which is essential for future expansion and defending my settlements once I have them. This investment in my army will pay off in the long run.\", \"goal\": \"Increase troops\"}",
"parsed": {
"action": "recruit_troops",
"target_faction": null,
"reason": "Recruiting more troops will strengthen my military force, which is essential for future expansion and defending my settlements once I have them. This investment in my army will pay off in the long run.",
"goal": "Increase troops"
},
"valid_json": true,
"valid_action": true,
"action": "recruit_troops",
"latency_ms": 2570.871114730835,
"error": ""
},
{
"turn": 4,
"state_before": {
"turn": 4,
"gold": 1000,
"troops": 40,
"renown": 220,
"settlements": 0,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"raw_response": "{\"action\": \"siege_castle\", \"target_faction\": null, \"reason\": \"Capturing a castle will give me my first settlement, putting me closer to winning the game. This physical foothold in Calradia will provide a base of operations and resources to further my ambitions of becoming King.\", \"goal\": \"Gain first settlement\"}",
"parsed": {
"action": "siege_castle",
"target_faction": null,
"reason": "Capturing a castle will give me my first settlement, putting me closer to winning the game. This physical foothold in Calradia will provide a base of operations and resources to further my ambitions of becoming King.",
"goal": "Gain first settlement"
},
"valid_json": true,
"valid_action": true,
"action": "siege_castle",
"latency_ms": 2367.640972137451,
"error": ""
}
],
"final_state": {
"turn": 4,
"gold": 500,
"troops": 40,
"renown": 300,
"settlements": 1,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"passed": true,
"score": 1.0,
"reached_gold_target": false,
"reached_renown_target": false,
"reached_settlement_target": true,
"latency_p50_ms": 2570.871114730835,
"latency_p99_ms": 2763.1959915161133
}
},
"summary": {
"total_elapsed_s": 37.7,
"levels_run": [
0,
1,
2,
3,
4,
5
],
"levels_passed": [
0,
2,
3,
4,
5
],
"levels_failed": [
1
],
"m1_gate_passed": false,
"m1_gate_notes": [
"Level 1 FAILED (score=0.50)"
],
"m1_latency_threshold_ms": 10000
}
}


@@ -0,0 +1,476 @@
{
"model": "hermes3:latest",
"timestamp": "2026-03-23T19:28:19.717773+00:00",
"skipped": false,
"levels": {
"0": {
"level": 0,
"name": "JSON Compliance (Coin Flip)",
"trials": [
{
"trial_index": 0,
"prompt": "A coin is flipped. Respond with exactly: {\"choice\": \"heads\"} or {\"choice\": \"tails\"}. Pick one. JSON only.",
"raw_response": "{\"choice\": \"heads\"}",
"parsed": {
"choice": "heads"
},
"valid_json": true,
"schema_valid": true,
"value_valid": true,
"latency_ms": 1028.2011032104492,
"error": ""
},
{
"trial_index": 1,
"prompt": "You must attack or defend. Respond with: {\"action\": \"attack\", \"confidence\": 0.8} or {\"action\": \"defend\", \"confidence\": 0.6}. Replace confidence with your own value 0.0-1.0. JSON only.",
"raw_response": "{\"action\": \"defend\", \"confidence\": 0.7}",
"parsed": {
"action": "defend",
"confidence": 0.7
},
"valid_json": true,
"schema_valid": true,
"value_valid": true,
"latency_ms": 700.0582218170166,
"error": ""
},
{
"trial_index": 2,
"prompt": "Choose a direction to march. Respond with exactly: {\"direction\": \"north\", \"reason\": \"string explaining why\"}. Pick north/south/east/west. JSON only.",
"raw_response": "{\"direction\": \"north\", \"reason\": \"The northern direction allows for the most efficient and direct path towards the goal, minimizing potential obstacles and maximizing progress.\"}",
"parsed": {
"direction": "north",
"reason": "The northern direction allows for the most efficient and direct path towards the goal, minimizing potential obstacles and maximizing progress."
},
"valid_json": true,
"schema_valid": true,
"value_valid": true,
"latency_ms": 1184.3459606170654,
"error": ""
}
],
"passed": true,
"score": 1.0,
"latency_p50_ms": 1028.2011032104492,
"latency_p99_ms": 1184.3459606170654
},
"1": {
"level": 1,
"name": "Board State Tracking (Tic-Tac-Toe)",
"trials": [
{
"scenario_index": 0,
"description": "Empty board \u2014 opening move",
"prompt": "Board state: [null,null,null,null,null,null,null,null,null]. You are X. It is your turn. Choose a move. Respond: {\"move\": <0-8>, \"reason\": \"<why>\"}",
"raw_response": "{\"move\": \"4\", \"reason\": \"Center move to establish dominance\"}",
"parsed": {
"move": "4",
"reason": "Center move to establish dominance"
},
"valid_json": true,
"move_legal": true,
"move_correct": true,
"latency_ms": 932.2080612182617,
"error": ""
},
{
"scenario_index": 1,
"description": "Block opponent's winning move",
"prompt": "Board: [\"O\",null,\"O\",null,\"X\",null,null,null,null]. O has positions 0 and 2. You are X. O will win on next turn unless you block. Respond: {\"move\": <0-8>, \"reason\": \"<why>\"}",
"raw_response": "{\"move\": 4, \"reason\": \"Block the winning move for O in the center, creating a chance for a winning X cross\"}",
"parsed": {
"move": 4,
"reason": "Block the winning move for O in the center, creating a chance for a winning X cross"
},
"valid_json": true,
"move_legal": false,
"move_correct": false,
"latency_ms": 1165.644884109497,
"error": ""
},
{
"scenario_index": 2,
"description": "Take winning move",
"prompt": "Board: [\"X\",null,\"X\",null,\"O\",null,null,\"O\",null]. You are X. You have positions 0 and 2. You can win this turn. Respond: {\"move\": <0-8>, \"reason\": \"<why>\"}",
"raw_response": "{\"move\": 0, \"reason\": \"I can win this turn by placing my X in position 0, creating a winning line of three X's horizontally.\"}",
"parsed": {
"move": 0,
"reason": "I can win this turn by placing my X in position 0, creating a winning line of three X's horizontally."
},
"valid_json": true,
"move_legal": false,
"move_correct": false,
"latency_ms": 1302.915096282959,
"error": ""
},
{
"scenario_index": 3,
"description": "Legal move on partially filled board",
"prompt": "Board: [\"X\",\"O\",\"X\",\"O\",\"X\",\"O\",null,null,null]. You are O. Choose a legal move (positions 6, 7, or 8 are available). Respond: {\"move\": <0-8>, \"reason\": \"<why>\"}",
"raw_response": "{\"move\": 6, \"reason\": \"Block X's potential win in the center.\"}",
"parsed": {
"move": 6,
"reason": "Block X's potential win in the center."
},
"valid_json": true,
"move_legal": true,
"move_correct": true,
"latency_ms": 851.456880569458,
"error": ""
}
],
"passed": false,
"score": 0.5,
"latency_p50_ms": 1165.644884109497,
"latency_p99_ms": 1302.915096282959
},
"2": {
"level": 2,
"name": "Resource Management (Party Economy)",
"trials": [
{
"scenario_index": 0,
"description": "Budget allocation under constraint",
"raw_response": "{\"choices\": [\"buy_food\", \"repair_armor\"], \"gold_spent\": 350, \"reason\": \"Buying food is crucial to prevent desertion, and repairing armor will help reduce casualties. Recruit infantry is put on hold due to limited resources.\"}",
"parsed": {
"choices": [
"buy_food",
"repair_armor"
],
"gold_spent": 350,
"reason": "Buying food is crucial to prevent desertion, and repairing armor will help reduce casualties. Recruit infantry is put on hold due to limited resources."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2031.9790840148926,
"error": ""
},
{
"scenario_index": 1,
"description": "Troop tier upgrade decision",
"raw_response": "{\"action\": \"save_gold\", \"reason\": \"Upgrading recruits now would leave little gold for emergencies, especially with the upcoming raiders camp attack. It's better to maintain the current troop composition and save gold for potential future threats or opportunities.\", \"expected_outcome\": \"Having enough gold to respond to unexpected situations without needing to dismiss or downgrade troops.\"}",
"parsed": {
"action": "save_gold",
"reason": "Upgrading recruits now would leave little gold for emergencies, especially with the upcoming raiders camp attack. It's better to maintain the current troop composition and save gold for potential future threats or opportunities.",
"expected_outcome": "Having enough gold to respond to unexpected situations without needing to dismiss or downgrade troops."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2231.9281101226807,
"error": ""
},
{
"scenario_index": 2,
"description": "Multi-turn planning horizon",
"raw_response": "{\"extra_food_days\": 5, \"cost\": 50, \"remaining_gold\": 250, \"reason\": \"Buying 5 extra food days will ensure we have enough food to cross the desert and reach the town on day 10, while still having enough gold left for potential expenses.\"}",
"parsed": {
"extra_food_days": 5,
"cost": 50,
"remaining_gold": 250,
"reason": "Buying 5 extra food days will ensure we have enough food to cross the desert and reach the town on day 10, while still having enough gold left for potential expenses."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 1943.3560371398926,
"error": ""
}
],
"passed": true,
"score": 1.0,
"latency_p50_ms": 2031.9790840148926,
"latency_p99_ms": 2231.9281101226807
},
"3": {
"level": 3,
"name": "Battle Tactics (Formation Commands)",
"trials": [
{
"scenario_index": 0,
"description": "Ranged vs infantry \u2014 defensive opening",
"raw_response": "{\"infantry_order\": \"skirmish\", \"archer_order\": \"skirmish\", \"reason\": \"The enemy heavy infantry is advancing, so we should use our archers and infantry to pepper them with ranged attacks before they close to melee range. This will soften them up and reduce their combat effectiveness when our forces meet.\", \"expected_outcome\": \"By the time our forces engage in melee, the enemy heavy infantry will have suffered significant casualties from the ranged attacks, giving us a tactical advantage and maximizing the damage we can inflict before contact is made.\"}",
"parsed": {
"infantry_order": "skirmish",
"archer_order": "skirmish",
"reason": "The enemy heavy infantry is advancing, so we should use our archers and infantry to pepper them with ranged attacks before they close to melee range. This will soften them up and reduce their combat effectiveness when our forces meet.",
"expected_outcome": "By the time our forces engage in melee, the enemy heavy infantry will have suffered significant casualties from the ranged attacks, giving us a tactical advantage and maximizing the damage we can inflict before contact is made."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": false,
"latency_ms": 3524.5888233184814,
"error": ""
},
{
"scenario_index": 1,
"description": "Outnumbered \u2014 retreat decision",
"raw_response": "{\"order\": \"fallback\", \"direction\": \"east\", \"reason\": \"Outnumbered and with many wounded, falling back to the escape route is the best option to preserve our troops.\", \"priority\": \"preserve_troops\"}",
"parsed": {
"order": "fallback",
"direction": "east",
"reason": "Outnumbered and with many wounded, falling back to the escape route is the best option to preserve our troops.",
"priority": "preserve_troops"
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 1523.0557918548584,
"error": ""
},
{
"scenario_index": 2,
"description": "Cavalry flanking opportunity",
"raw_response": "{\"cavalry_order\": \"flank_left\", \"timing\": \"now\", \"reason\": \"Enemy left flank is unprotected and engaging our main force. A swift cavalry charge will catch them off guard and route them in ~30 seconds.\", \"risk\": \"low\"}",
"parsed": {
"cavalry_order": "flank_left",
"timing": "now",
"reason": "Enemy left flank is unprotected and engaging our main force. A swift cavalry charge will catch them off guard and route them in ~30 seconds.",
"risk": "low"
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 1614.2308712005615,
"error": ""
}
],
"passed": true,
"score": 1.0,
"latency_p50_ms": 1614.2308712005615,
"latency_p99_ms": 3524.5888233184814
},
"4": {
"level": 4,
"name": "Trade Route (Campaign Navigation)",
"trials": [
{
"scenario_index": 0,
"description": "Optimal trade route selection",
"raw_response": "{\"route\": \"A\", \"expected_profit\": 500, \"days_used\": 8, \"reason\": \"Route A offers the highest profit with a manageable travel time and low risk. Selling grain in Vlandia and furs in Sturgia provides a good balance of profitability and travel distance.\", \"risk\": \"low\"}",
"parsed": {
"route": "A",
"expected_profit": 500,
"days_used": 8,
"reason": "Route A offers the highest profit with a manageable travel time and low risk. Selling grain in Vlandia and furs in Sturgia provides a good balance of profitability and travel distance.",
"risk": "low"
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2335.57391166687,
"error": ""
},
{
"scenario_index": 1,
"description": "Adapt plan when war declared",
"raw_response": "{\"decision\": \"divert_to_sturgia\", \"reason\": \"Given the new information that Vlandia has declared war on our faction, entering their territory would be too dangerous. Sturgia is a neutral but friendly region, so it should be safe to trade there. The Empire is allied but further away, and while safe, the extra travel time could reduce the profitability of the trade mission.\", \"gold_at_risk\": 400}",
"parsed": {
"decision": "divert_to_sturgia",
"reason": "Given the new information that Vlandia has declared war on our faction, entering their territory would be too dangerous. Sturgia is a neutral but friendly region, so it should be safe to trade there. The Empire is allied but further away, and while safe, the extra travel time could reduce the profitability of the trade mission.",
"gold_at_risk": 400
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 2723.7470149993896,
"error": ""
},
{
"scenario_index": 2,
"description": "Multi-stop route planning with constraints",
"raw_response": "{\n \"stops\": [\n \"Rhotae\",\n \"Ortysia\",\n \"Epicrotea\"\n ],\n \"total_days\": 7,\n \"estimated_profit\": 1200,\n \"reason\": \"Starting in Rhotae allows buying cheap leather and selling well in the south. Moving to Ortysia to buy grain cheap. Finally, Epicrotea for iron market. Total 7 days, 800 gold budget sufficient. Expected profit of 1200 gold.\"\n}",
"parsed": {
"stops": [
"Rhotae",
"Ortysia",
"Epicrotea"
],
"total_days": 7,
"estimated_profit": 1200,
"reason": "Starting in Rhotae allows buying cheap leather and selling well in the south. Moving to Ortysia to buy grain cheap. Finally, Epicrotea for iron market. Total 7 days, 800 gold budget sufficient. Expected profit of 1200 gold."
},
"valid_json": true,
"schema_valid": true,
"strategically_sound": true,
"latency_ms": 3038.438320159912,
"error": ""
}
],
"passed": true,
"score": 1.0,
"latency_p50_ms": 2723.7470149993896,
"latency_p99_ms": 3038.438320159912
},
"5": {
"level": 5,
"name": "Mini Campaign (Full Campaign Loop)",
"turns": [
{
"turn": 1,
"state_before": {
"turn": 1,
"gold": 1000,
"troops": 25,
"renown": 150,
"settlements": 0,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"raw_response": "{\"action\": \"trade_circuit\", \"target_faction\": null, \"reason\": \"Establishing trade relations will bring in gold without risking relations or retaliation. This sets a foundation for economic growth and allows me to assess other factions before committing to any alliances or conflicts.\", \"goal\": \"Increase gold reserves and build a power base through trade and diplomacy\"}",
"parsed": {
"action": "trade_circuit",
"target_faction": null,
"reason": "Establishing trade relations will bring in gold without risking relations or retaliation. This sets a foundation for economic growth and allows me to assess other factions before committing to any alliances or conflicts.",
"goal": "Increase gold reserves and build a power base through trade and diplomacy"
},
"valid_json": true,
"valid_action": true,
"action": "trade_circuit",
"latency_ms": 2898.583173751831,
"error": ""
},
{
"turn": 2,
"state_before": {
"turn": 2,
"gold": 1300,
"troops": 25,
"renown": 160,
"settlements": 0,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"raw_response": "{\"action\": \"raid_village\", \"target_faction\": null, \"reason\": \"Raiding a village will provide a quick boost to my gold reserves and renown. This will also help me gauge the strength of my troops and the potential resistance from the targeted faction. It's a bold move that shows my power and willingness to act, which may intimidate other factions and open up opportunities for future alliances or conflicts.\", \"goal\": \"Increase gold and renown, assess troop strength and potential faction responses\"}",
"parsed": {
"action": "raid_village",
"target_faction": null,
"reason": "Raiding a village will provide a quick boost to my gold reserves and renown. This will also help me gauge the strength of my troops and the potential resistance from the targeted faction. It's a bold move that shows my power and willingness to act, which may intimidate other factions and open up opportunities for future alliances or conflicts.",
"goal": "Increase gold and renown, assess troop strength and potential faction responses"
},
"valid_json": true,
"valid_action": true,
"action": "raid_village",
"latency_ms": 3112.649917602539,
"error": ""
},
{
"turn": 3,
"state_before": {
"turn": 3,
"gold": 1500,
"troops": 25,
"renown": 190,
"settlements": 0,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"raw_response": "{\"action\": \"recruit_troops\", \"target_faction\": null, \"reason\": \"Recruiting more troops is essential for expanding my military strength and increasing my chances of successfully sieging a castle in the next turn. This will also help me prepare for potential conflicts with other factions and solidify my power base. Strengthening my army is a crucial step towards achieving my goal of becoming the King of Calradia.\", \"goal\": \"Increase troop count for future military expansion and defense\"}",
"parsed": {
"action": "recruit_troops",
"target_faction": null,
"reason": "Recruiting more troops is essential for expanding my military strength and increasing my chances of successfully sieging a castle in the next turn. This will also help me prepare for potential conflicts with other factions and solidify my power base. Strengthening my army is a crucial step towards achieving my goal of becoming the King of Calradia.",
"goal": "Increase troop count for future military expansion and defense"
},
"valid_json": true,
"valid_action": true,
"action": "recruit_troops",
"latency_ms": 3136.754035949707,
"error": ""
},
{
"turn": 4,
"state_before": {
"turn": 4,
"gold": 1300,
"troops": 40,
"renown": 190,
"settlements": 0,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"raw_response": "{\"action\": \"siege_castle\", \"target_faction\": null, \"reason\": \"Sieging a castle will allow me to establish a settlement, which is crucial for meeting one of my win conditions. This will also demonstrate my military might and potentially intimidate other factions, opening up opportunities for future alliances or conflicts. While it comes with a risk of losing troops morale and straining relations, it's a necessary step towards achieving my goal of becoming the King of Calradia.\", \"goal\": \"Establish a settlement and demonstrate military strength to intimidate other factions\"}",
"parsed": {
"action": "siege_castle",
"target_faction": null,
"reason": "Sieging a castle will allow me to establish a settlement, which is crucial for meeting one of my win conditions. This will also demonstrate my military might and potentially intimidate other factions, opening up opportunities for future alliances or conflicts. While it comes with a risk of losing troops morale and straining relations, it's a necessary step towards achieving my goal of becoming the King of Calradia.",
"goal": "Establish a settlement and demonstrate military strength to intimidate other factions"
},
"valid_json": true,
"valid_action": true,
"action": "siege_castle",
"latency_ms": 3433.211326599121,
"error": ""
}
],
"final_state": {
"turn": 4,
"gold": 800,
"troops": 40,
"renown": 270,
"settlements": 1,
"relations": {
"vlandia": 0,
"sturgia": 0,
"empire": 0,
"aserai": 0,
"battania": 0
}
},
"passed": true,
"score": 1.0,
"reached_gold_target": false,
"reached_renown_target": false,
"reached_settlement_target": true,
"latency_p50_ms": 3136.754035949707,
"latency_p99_ms": 3433.211326599121
}
},
"summary": {
"total_elapsed_s": 40.7,
"levels_run": [
0,
1,
2,
3,
4,
5
],
"levels_passed": [
0,
2,
3,
4,
5
],
"levels_failed": [
1
],
"m1_gate_passed": false,
"m1_gate_notes": [
"Level 1 FAILED (score=0.50)"
],
"m1_latency_threshold_ms": 10000
}
}