[Bannerlord M0] Run Cognitive Benchmark on Hermes #1092


perplexity · 2026-03-23T16:49:23Z

perplexity commented

2026-03-23 16:49:23 +00:00

Parent: #1091

## Task

Run the 6-level cognitive benchmark harness against Timmy's current model (qwen2.5:14b) and candidate upgrades.

Steps:

1. Copy `timmy-benchmark/` to Hermes
2. `pip install ollama`
3. `python run_benchmark.py --model qwen2.5:14b --verbose`
4. `python run_benchmark.py --model qwen3:14b --verbose` (if available)
5. Save results JSON for each model
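The run steps above could be scripted roughly as follows. This is a minimal sketch, assuming `run_benchmark.py` accepts only the `--model` and `--verbose` flags shown in the steps. The issue does not say how the harness writes its results JSON, so this sketch only captures console output to a per-model log file; the helper names, the `results/` directory, and the log naming scheme are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Models named in the issue; qwen3:14b is only run "if available".
MODELS = ["qwen2.5:14b", "qwen3:14b"]


def build_command(model: str) -> list[str]:
    """Build the benchmark invocation using only the flags shown in the steps."""
    return [sys.executable, "run_benchmark.py", "--model", model, "--verbose"]


def log_path(model: str, out_dir: Path = Path("results")) -> Path:
    """Hypothetical naming scheme: replace characters that are awkward in filenames."""
    safe = model.replace(":", "_").replace("/", "_")
    return out_dir / f"{safe}.log"


def run_all(models=MODELS) -> None:
    """Run each model in turn and keep its console output (assumed layout)."""
    for model in models:
        out = log_path(model)
        out.parent.mkdir(parents=True, exist_ok=True)
        proc = subprocess.run(build_command(model), capture_output=True, text=True)
        out.write_text(proc.stdout + proc.stderr)
        if proc.returncode != 0:
            # A missing model (e.g. qwen3:14b) would likely surface here.
            print(f"{model}: run failed with exit code {proc.returncode}")

# Call run_all() from inside timmy-benchmark/ to execute the runs.
```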

Expected outcome: A clear scorecard showing which levels Timmy passes or fails, with latency data.

Pass criteria for proceeding to M1:

- Must pass Level 0 (JSON compliance) and Level 1 (tic-tac-toe state tracking)
- Latency < 10s per decision for Levels 0 and 1
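The pass criteria above amount to a small gate check. The following is a sketch only: the `LevelResult` record, its field names, and the use of the slowest single decision as the latency figure are assumptions, not the harness's actual result schema.

```python
from dataclasses import dataclass


@dataclass
class LevelResult:
    """Hypothetical per-level summary; the real results JSON may differ."""
    level: int
    passed: bool
    max_latency_s: float  # slowest single decision observed at this level


GATE_LEVELS = (0, 1)   # Level 0 (JSON compliance), Level 1 (board state tracking)
MAX_LATENCY_S = 10.0   # "Latency < 10s per decision"


def m1_gate(results: list[LevelResult]) -> tuple[bool, list[str]]:
    """Return (gate_passed, reasons_for_failure) for the M1 criteria."""
    by_level = {r.level: r for r in results}
    reasons = []
    for level in GATE_LEVELS:
        result = by_level.get(level)
        if result is None:
            reasons.append(f"L{level}: no result")
            continue
        if not result.passed:
            reasons.append(f"L{level}: failed")
        if result.max_latency_s >= MAX_LATENCY_S:
            reasons.append(f"L{level}: latency {result.max_latency_s:.1f}s is not under 10s")
    return (not reasons, reasons)
```

Returning the list of reasons alongside the boolean keeps the scorecard explicit about which criterion blocked the gate.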

Files: timmy-benchmark/run_benchmark.py, timmy-benchmark/levels/


claude was assigned by Rockachopa

2026-03-23 17:57:41 +00:00

claude referenced this issue from a commit

2026-03-23 18:31:13 +00:00

WIP: Claude Code progress on #1092

Rockachopa referenced this issue from a commit

2026-03-23 19:34:45 +00:00

feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion

Rockachopa referenced a pull request that will close this issue

2026-03-23 19:35:10 +00:00

[claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) #1159

claude commented

2026-03-23 19:35:37 +00:00

PR created: #1159

## Benchmark Results Summary

Ran the 6-level cognitive harness against qwen2.5:14b, hermes3:latest, and hermes3:8b.

M1 Gate: FAILED for all models. Blocker is Level 1 (Board State Tracking).

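| Level | qwen2.5:14b | hermes3:latest | hermes3:8b |
|-------|:-----------:|:--------------:|:----------:|
| L0 JSON [M1 GATE] | ✓ 100% | ✓ 100% | ✓ 100% |
| L1 Board [M1 GATE] | ✗ 50% | ✗ 50% | ✗ 50% |
| L2–L5 | ✓ all | ✓ all | ✓ all |
| **M1 Gate** | ✗ | ✗ | ✗ |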

Root cause: All models choose moves by heuristic (center=4, corner=0) without checking whether that square is already occupied. They fail scenarios that require blocking/winning at position 1.
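For reference, the behaviour the models lack can be sketched as below: check occupancy first, take a winning move, then block, and only then fall back to the center/corner preference. This is an illustration under assumptions, not the harness's scoring logic. The board is assumed to be a 9-element list indexed 0 to 8 in row-major order (consistent with center=4 and corner=0 above), with `" "` marking an empty square; all function names are hypothetical.

```python
# The eight winning lines on a 3x3 board indexed 0..8 row by row.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]


def completing_move(board, player):
    """Return the empty square that completes a line for `player`, or None."""
    for line in WIN_LINES:
        marks = [board[i] for i in line]
        if marks.count(player) == 2 and marks.count(" ") == 1:
            return line[marks.index(" ")]
    return None


def choose_move(board, player):
    """Win if possible, else block, else take the first free preferred square."""
    opponent = "O" if player == "X" else "X"
    win = completing_move(board, player)
    if win is not None:
        return win
    block = completing_move(board, opponent)
    if block is not None:
        return block
    # Same center-then-corner preference the models use, but occupancy-checked.
    for square in (4, 0, 2, 6, 8, 1, 3, 5, 7):
        if board[square] == " ":
            return square
    raise ValueError("board is full")
```

On a board where X holds squares 0 and 2 and O holds the center, an unchecked center-first heuristic would pick the occupied square 4, while this sketch picks position 1.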

Fix applied: Level 1 harness was patched to coerce "move": "4" (string) → 4 (int) since hermes3 models emit strings, and type fidelity is already tested at L0.
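The coercion described above might look something like the following. This is a sketch of the idea rather than the patch in #1159; the function name and the 0 to 8 range check are assumptions.

```python
import json


def coerce_move(value):
    """Accept an int move or a digit string such as "4"; reject anything else."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python, so exclude it explicitly.
        raise ValueError(f"not a move: {value!r}")
    if isinstance(value, str):
        text = value.strip()
        if not (text.isascii() and text.isdigit()):
            raise ValueError(f"not a move: {value!r}")
        value = int(text)
    if not isinstance(value, int):
        raise ValueError(f"not a move: {value!r}")
    if not 0 <= value <= 8:
        raise ValueError(f"move out of range: {value}")
    return value


# A hermes3-style reply with a string move, and a reply with an int move.
string_reply = json.loads('{"move": "4"}')
int_reply = json.loads('{"move": 4}')
print(coerce_move(string_reply["move"]), coerce_move(int_reply["move"]))
```

Keeping the coercion this narrow means Level 1 still rejects non-numeric or out-of-range moves, while strict type fidelity remains a Level 0 concern.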

qwen3:14b: not available on this Ollama instance, so it was not benchmarked.

Full scorecard with latency data and recommendations: timmy-benchmark/results/SCORECARD.md


Rockachopa closed this issue

2026-03-23 19:38:49 +00:00

Rockachopa referenced this issue from a commit

2026-03-23 19:38:49 +00:00

[claude] Bannerlord M0: Run cognitive benchmark on hermes3, fix L1 string-int coercion (#1092) (#1159)

Reference: Rockachopa/Timmy-time-dashboard#1092