Timmy-time-dashboard/timmy-benchmark
Alexander Whitestone 40561b480f
Some checks failed
Tests / lint (pull_request) Failing after 18s
Tests / test (pull_request) Has been skipped
feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion
Refs #1092

- Run full 6-level benchmark against hermes3:latest and hermes3:8b
- Fix Level 1 harness: coerce digit-string moves to int (hermes3 emits
  "move": "4" instead of "move": 4; type fidelity is already tested at L0,
  so L1 does not need to penalize it again)
- All three models (qwen2.5:14b, hermes3:latest, hermes3:8b) pass L0
  and L2–L5 but fail L1 with a score of 50%
- Root cause: models pick heuristic moves without checking whether the
  target square is already occupied
- M1 gate: FAILED for all models; qwen3:14b was not available on the instance
- Add SCORECARD.md with full cross-model comparison and recommendations
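
The coercion fix and the occupied-square failure described above can be
sketched roughly as follows. This is a minimal illustration, not the harness
code from this commit: the names `coerce_move` and `is_open_square` are
hypothetical, and the board is assumed to be a flat list in which `None`
marks an empty square.

```python
from typing import Any, Optional, Sequence


def coerce_move(raw: Any) -> Optional[int]:
    """Return the move as an int, accepting digit strings such as "4".

    Hypothetical helper: the real harness may name or structure this differently.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        # Accept plain ASCII digits only; anything else is not a valid move.
        if text.isascii() and text.isdigit():
            return int(text)
    return None


def is_open_square(board: Sequence[Optional[str]], move: Optional[int]) -> bool:
    """True if the move is on the board and the target square is empty.

    Assumes a flat board where None marks an empty square.
    """
    return move is not None and 0 <= move < len(board) and board[move] is None


# A model reply with a string-typed move, as hermes3 was observed to emit.
reply = {"move": "4"}
board = ["X", None, None, None, "O", None, None, None, None]

move = coerce_move(reply["move"])
legal = is_open_square(board, move)  # square 4 is already taken in this board
```

In this sketch the coercion step only normalizes the type; whether the move
is legal is checked separately, which mirrors the split between the L0 type
test and the L1 occupancy failure noted above.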

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-23 15:31:07 -04:00