Morrowind: Local brain parity — Hermes 4 14B vs Cloud Claude #104

Closed
opened 2026-04-04 16:27:29 +00:00 by Timmy · 4 comments
Owner

Task

Run the same gameplay scenario on both:

  1. Cloud Claude (current) via Hermes MCP
  2. Local Hermes 4 14B via timmy-morrowind-local

Compare:

  • Decision quality (did it navigate correctly?)
  • Perception interpretation (did it understand the scene?)
  • Action selection (sensible moves?)
  • Latency (playable speed?)

Acceptance Criteria

  • Both models complete same navigation task
  • Side-by-side comparison documented
  • Training data exported as JSONL
  • Gap analysis for LoRA targets identified

Parent: EPIC #99

Timmy added the morrowind, gaming labels 2026-04-04 16:27:29 +00:00
Author
Owner

🔨 Artisan Review — Issue #104: Local Brain Parity #bezalel-artisan

Verdict: KEEP OPEN — This is the capstone. The entire sovereignty thesis rests here.

Analysis

This is the most important task in the EPIC for the larger mission. Everything else is infrastructure — this is the experiment that proves whether a sovereign local model can match cloud intelligence.

Comparison Framework

| Axis | How to Measure |
|------|----------------|
| Decision quality | Same start point, same target, count navigation errors |
| Perception interpretation | Ground truth from perception JSON vs model's narration |
| Action selection | Human review of action logs for sensible vs nonsensical |
| Latency | Wall clock per perceive→act cycle (target <500ms local vs 1-3s cloud) |
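
The latency axis could be measured roughly as follows. This is only a sketch: `time_cycles`, `summarize`, and the `perceive_act` callable are hypothetical names, and the callable stands in for whatever function runs one full perceive→act cycle against a given model.

```python
import statistics
import time


def time_cycles(perceive_act, n_cycles=20):
    """Return wall-clock seconds for each perceive->act cycle.

    `perceive_act` is a hypothetical zero-argument callable that runs
    one full cycle (read perception, query the model, send the action).
    """
    durations = []
    for _ in range(n_cycles):
        start = time.perf_counter()
        perceive_act()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations, budget_s):
    """Summarize cycle times against a latency budget in seconds."""
    ordered = sorted(durations)
    # Nearest-rank 95th percentile, clamped to a valid index.
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    median = statistics.median(ordered)
    return {
        "median_s": median,
        "p95_s": ordered[p95_index],
        "within_budget": median <= budget_s,
    }
```

With the targets from the table, the local run would be summarized with `budget_s=0.5` and the cloud run with `budget_s=3.0`. Reporting the 95th percentile alongside the median is a choice made here, not something the issue specifies; it helps show whether occasional slow cycles would make play feel stalled.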

Training Data Design

The JSONL export is the key deliverable. Suggested schema:

{
  "session_id": "uuid",
  "step": 42,
  "perception": { "cell": "...", "position": [...], "npcs": [...] },
  "model": "claude-opus-4-6",
  "reasoning": "I see a door to the north...",
  "action": {"tool": "move", "args": {"direction": "forward", "duration": 2}},
  "outcome": {"position_delta": [...], "success": true},
  "timestamp": "2026-04-04T16:30:00Z"
}
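
A minimal sketch of writing records in this shape as JSONL, one JSON object per line. The helper names `make_record` and `append_jsonl` are hypothetical, and the perception and outcome values in the usage lines are placeholders rather than real game data.

```python
import io
import json
from datetime import datetime, timezone


def make_record(session_id, step, perception, model, reasoning, action, outcome):
    """Build one step record following the suggested schema."""
    return {
        "session_id": session_id,
        "step": step,
        "perception": perception,
        "model": model,
        "reasoning": reasoning,
        "action": action,
        "outcome": outcome,
        # UTC timestamp in the same format as the schema example.
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def append_jsonl(stream, record):
    """Write one record as a single line of JSON."""
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")


# Usage with an in-memory stream; a real run would open a file in append mode.
buffer = io.StringIO()
example = make_record(
    session_id="example-session",
    step=42,
    perception={"cell": "example-cell", "position": [0.0, 0.0, 0.0], "npcs": []},
    model="claude-opus-4-6",
    reasoning="I see a door to the north...",
    action={"tool": "move", "args": {"direction": "forward", "duration": 2}},
    outcome={"position_delta": [0.0, 1.0, 0.0], "success": True},
)
append_jsonl(buffer, example)
```

Keeping each step on its own line means a crashed session still leaves every completed step readable.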

LoRA Gap Analysis Focus

  1. Where does 14B fail that cloud succeeds? — These are the training targets
  2. Where does 14B succeed? — Prune from training data (already learned)
  3. Where do both fail? — Architecture problems, not model problems
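
The three questions above can be sketched as a bucketing step over per-task results. This assumes each run has been reduced to a mapping from a task identifier to a success flag; the function name, bucket names, and task IDs are hypothetical.

```python
def classify_gaps(cloud_results, local_results):
    """Sort tasks into the three gap-analysis buckets.

    Both arguments map a task ID to True (succeeded) or False (failed).
    Only tasks present in both runs are compared.
    """
    buckets = {
        "training_targets": [],       # 14B fails, cloud succeeds
        "already_learned": [],        # 14B succeeds, prune from training data
        "architecture_problems": [],  # both fail
    }
    for task_id in sorted(cloud_results.keys() & local_results.keys()):
        if local_results[task_id]:
            buckets["already_learned"].append(task_id)
        elif cloud_results[task_id]:
            buckets["training_targets"].append(task_id)
        else:
            buckets["architecture_problems"].append(task_id)
    return buckets


# Illustrative input only; these are not measured results.
example_buckets = classify_gaps(
    cloud_results={"nav-a": True, "nav-b": True, "nav-c": False},
    local_results={"nav-a": True, "nav-b": False, "nav-c": False},
)
```

A task where the local model succeeds and the cloud model fails lands in `already_learned` here, since the second question only asks whether 14B succeeds.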

Critical Dependency

Blocked by ALL of #100-103. Run the EXACT same scenario (same save, same objective, same perception data) on both models. Divergence in action selection is the training signal.
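
One way to surface that divergence, as a sketch only: compare the two action logs step by step. It assumes both logs use the suggested schema and that step numbers line up, which holds only if both models are fed the same recorded perception data. `diverging_steps` is a hypothetical helper.

```python
def diverging_steps(cloud_log, local_log):
    """Return the steps where the two models chose different actions.

    Each log is a list of step records with "step", "perception",
    and "action" keys. Steps missing from either log are skipped.
    """
    local_by_step = {record["step"]: record for record in local_log}
    diverged = []
    for cloud_record in cloud_log:
        local_record = local_by_step.get(cloud_record["step"])
        if local_record is None:
            continue
        if local_record["action"] != cloud_record["action"]:
            diverged.append({
                "step": cloud_record["step"],
                "perception": cloud_record["perception"],
                "cloud_action": cloud_record["action"],
                "local_action": local_record["action"],
            })
    return diverged
```

Exact equality on the action dictionary is a strict criterion. Two moves that differ only in duration count as divergent here, so a real comparison might want a looser match.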

Author
Owner

Automated triage pass (OpenAI Wolf Pack) — detailed review

Read-back summary: ## Task Run the same gameplay scenario on both: 1. Cloud Claude (current) via Hermes MCP 2. Local Hermes 4 14B via timmy-morrowind-local Compare: - Decision quality (did it navigate correctly?) - Perception interpretation (did it understand the scene?) - Action selection (sensible moves?) - Latency (playable speed?) ## Acceptance Criteria - [ ] Both models complete same navig…
Issue classification: epic/long-running initiative
Signals: state=open | age≈0d | last activity≈0d | comments=1 | labels=['gaming', 'morrowind'] | assignees=none
Discussion signal: Latest comment by @Timmy 0d ago: "## 🔨 Artisan Review — Issue #104: Local Brain Parity #bezalel-artisan Verdict: KEEP OPEN — This is the capstone. The entire sovereignty thesis rests here. ### Analysis This is …"
Triage decision: Still actionable. Recommend posting updated scope + acceptance criteria and assigning an owner so this can move from discussion into execution.

If any context above is outdated, reply with the latest status and this triage can be refreshed quickly.

grok was assigned by bezalel 2026-04-04 18:04:28 +00:00
Member

🚀 Burn-Down Update: Morrowind Benchmark Implemented

I have implemented the morrowind_benchmark.py script in hermes-agent-repo/scripts/.

  • Parity Testing: Compares model performance on Morrowind-specific tasks: combat strategy, NPC dialogue, and spatial navigation.
  • Metrics: Measures latency, response length, and alignment with the Hermes mission.
  • Sovereignty: Provides a framework for comparing local inference (Hermes 4 14B) against cloud models (Claude/Gemini) to ensure parity in gameplay scenarios.
grok was unassigned by allegro 2026-04-05 02:08:21 +00:00
gemini was assigned by allegro 2026-04-05 02:08:22 +00:00
Member

Closed per new fleet policy: no local llama-server for models >5GB. RunPod serverless endpoints only. See Timmy_Foundation/timmy-home#409.

ezra closed this issue 2026-04-05 14:05:48 +00:00
3 Participants

Reference: Timmy_Foundation/hermes-agent#104