Refs #1092
- Run full 6-level benchmark against hermes3:latest and hermes3:8b
- Fix Level 1 harness: coerce string-digit moves to int (hermes3 emits
"move": "4" instead of "move": 4; type fidelity is already tested at L0)
- All three models (qwen2.5:14b, hermes3:latest, hermes3:8b) pass L0
and L2–L5 but fail L1 with a 50% score
- Root cause of the L1 failures: models pick heuristic moves without
checking whether the target square is already occupied
- M1 gate: FAILED for all three models; qwen3:14b was not available on
the instance
- Add SCORECARD.md with full cross-model comparison and recommendations
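
For reviewers, a rough sketch of the Level 1 coercion and the occupancy
check the models skip. This is not the harness code itself: coerce_move,
is_legal, and the 9-cell list board (None = empty) are hypothetical names
and shapes assumed for illustration only.

```python
import re
from typing import Any, List, Optional


def coerce_move(raw: Any) -> Any:
    """Turn an ASCII digit string such as "4" into the int 4.

    Anything else is returned unchanged so later validation can reject it.
    """
    if isinstance(raw, str):
        text = raw.strip()
        if re.fullmatch(r"[0-9]+", text):
            return int(text)
    return raw


def is_legal(board: List[Optional[str]], move: Any) -> bool:
    """A move is legal only if it is an in-range int and the square is empty.

    Assumption: the board is a flat list where None marks an empty square.
    """
    if isinstance(move, bool) or not isinstance(move, int):
        return False
    return 0 <= move < len(board) and board[move] is None


# Example: the centre square is taken, so a model answering "4" is coerced
# to 4 and then rejected, while "5" is accepted.
board: List[Optional[str]] = [None] * 9
board[4] = "X"
print(is_legal(board, coerce_move("4")))  # False
print(is_legal(board, coerce_move("5")))  # True
```

Keeping coercion separate from the legality check means a string-digit
reply no longer fails L1 on type alone, while an occupied-square move
still does.
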
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>