Timmy-time-dashboard

Rockachopa/Timmy-time-dashboard

Alexander Whitestone 40561b480f

Tests / lint (pull_request) Failing after 18s

Tests / test (pull_request) Has been skipped

feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion

Refs #1092

- Run full 6-level benchmark against hermes3:latest and hermes3:8b
- Fix Level 1 harness: coerce string-digit moves to int (hermes3 emits
  "move": "4" instead of "move": 4; type fidelity is already tested at L0)
- All three models (qwen2.5:14b, hermes3:latest, hermes3:8b) pass L0
  and L2–L5 but fail L1 at 50% score
- Root cause: models pick heuristic moves without checking if the target
  square is already occupied
- M1 gate: FAILED for all models; qwen3:14b not available on instance
- Add SCORECARD.md with full cross-model comparison and recommendations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-23 15:31:07 -04:00
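
The Level 1 fix and the root cause described in the commit message can be illustrated with a minimal sketch. This is not the code in run_benchmark.py: the function names `coerce_move` and `is_legal_move`, the 9-cell board layout, and the use of `None` for an empty square are all assumptions made for illustration. It shows a string-digit move such as "4" being coerced to an int, followed by the occupancy check that the models apparently skip.

```python
from typing import Any, Optional, Sequence


def coerce_move(raw: Any) -> Optional[int]:
    """Return the move as an int, accepting digit strings such as "4".

    Hypothetical helper: returns None for anything that is not an int
    or an ASCII digit string.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(raw, bool):
        return None
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        text = raw.strip()
        # isascii() guards against digits like "²" that int() cannot parse.
        if text.isascii() and text.isdigit():
            return int(text)
    return None


def is_legal_move(board: Sequence[Optional[str]], move: Optional[int]) -> bool:
    """A move is legal only if it is in range and the target square is empty."""
    return move is not None and 0 <= move < len(board) and board[move] is None


# Assumed board encoding: index = square, None = empty.
board = ["X", None, None, None, "O", None, None, None, None]
response = {"move": "4"}  # string-digit move, as hermes3 is reported to emit

move = coerce_move(response.get("move"))
print(move, is_legal_move(board, move))  # 4 False (square 4 is occupied)
```

Keeping coercion separate from the legality check means a model is not penalized at L1 for a type slip already covered at L0, while a move onto an occupied square still counts as a failure.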

Name | Last commit | Date
levels | feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion | 2026-03-23 15:31:07 -04:00
results | feat: run cognitive benchmark on hermes3 models, fix L1 string-int coercion | 2026-03-23 15:31:07 -04:00
run_benchmark.py | WIP: Claude Code progress on #1092 | 2026-03-23 14:30:50 -04:00