Morrowind: Local brain parity — Hermes 4 14B vs Cloud Claude #104

Timmy · 2026-04-04T16:27:29Z

Timmy commented

2026-04-04 16:27:29 +00:00

Task

Run the same gameplay scenario on both:

1. Cloud Claude (current) via Hermes MCP
2. Local Hermes 4 14B via `timmy-morrowind-local`

Compare:

- Decision quality (did it navigate correctly?)
- Perception interpretation (did it understand the scene?)
- Action selection (sensible moves?)
- Latency (playable speed?)

Acceptance Criteria

- [ ] Both models complete same navigation task
- [ ] Side-by-side comparison documented
- [ ] Training data exported as JSONL
- [ ] Gap analysis for LoRA targets identified

Parent: EPIC #99

Timmy added the morrowind gaming labels 2026-04-04 16:27:29 +00:00

Timmy commented

2026-04-04 16:45:47 +00:00

🔨 Artisan Review — Issue #104: Local Brain Parity #bezalel-artisan

Verdict: KEEP OPEN — This is the capstone. The entire sovereignty thesis rests here.

Analysis

This is the most important task in the EPIC for the larger mission. Everything else is infrastructure; this is the experiment that tests whether a sovereign local model can match cloud intelligence.

Comparison Framework

| Axis | How to Measure |
|------|----------------|
| Decision quality | Same start point, same target, count navigation errors |
| Perception interpretation | Ground truth from perception JSON vs model's narration |
| Action selection | Human review of action logs for sensible vs nonsensical |
| Latency | Wall clock per perceive→act cycle (target <500ms local vs 1-3s cloud) |
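The latency axis could be measured with something like the sketch below. It is only an illustration: `perceive`, `decide`, and `act` are hypothetical callables standing in for whatever the harness actually exposes, and the budget is passed in by the caller (0.5 s for the local target above).

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_cycles(perceive: Callable[[], dict],
                decide: Callable[[dict], dict],
                act: Callable[[dict], None],
                n_cycles: int) -> List[float]:
    """Return wall-clock seconds for each perceive -> decide -> act cycle."""
    durations = []
    for _ in range(n_cycles):
        start = time.perf_counter()
        act(decide(perceive()))  # hypothetical hooks supplied by the harness
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float], budget_s: float) -> Dict[str, float]:
    """Median and worst-case latency, plus the share of cycles within budget."""
    within = sum(1 for d in durations if d <= budget_s)
    return {
        "median_s": median(durations),
        "max_s": max(durations),
        "within_budget": within / len(durations),
    }
```

Reporting the median alongside the worst case keeps a single slow cycle from hiding in an average, which matters when the question is whether the loop feels playable.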

Training Data Design

The JSONL export is the key deliverable. Suggested schema:

{
  "session_id": "uuid",
  "step": 42,
  "perception": { "cell": "...", "position": [...], "npcs": [...] },
  "model": "claude-opus-4-6",
  "reasoning": "I see a door to the north...",
  "action": {"tool": "move", "args": {"direction": "forward", "duration": 2}},
  "outcome": {"position_delta": [...], "success": true},
  "timestamp": "2026-04-04T16:30:00Z"
}
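A minimal sketch of writing and reading records in this shape, assuming only the field names from the suggested schema. The helper names (`make_record`, `to_jsonl`, `from_jsonl`) are made up for illustration and are not part of any existing script.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List


def make_record(session_id: str, step: int, perception: Dict[str, Any],
                model: str, reasoning: str, action: Dict[str, Any],
                outcome: Dict[str, Any]) -> Dict[str, Any]:
    """Build one step record using the field names from the suggested schema."""
    return {
        "session_id": session_id,
        "step": step,
        "perception": perception,
        "model": model,
        "reasoning": reasoning,
        "action": action,
        "outcome": outcome,
        # UTC timestamp in the same format as the schema example
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def to_jsonl(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r) + "\n" for r in records)


def from_jsonl(text: str) -> List[Dict[str, Any]]:
    """Parse JSONL text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]
```

Because `json.dumps` escapes newlines inside strings, multi-line reasoning text still lands on a single line, so the one-record-per-line property holds.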

LoRA Gap Analysis Focus

1. Where does 14B fail that cloud succeeds? These are the training targets.
2. Where does 14B succeed? Prune from training data (already learned).
3. Where do both fail? Architecture problems, not model problems.
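The three questions above map onto a simple bucketing of per-step outcomes. The sketch below assumes each run has already been reduced to a `step -> success` mapping (for example from `outcome.success` in the export); the bucket names are illustrative, not an agreed vocabulary.

```python
from typing import Dict, List


def bucket(local_ok: bool, cloud_ok: bool) -> str:
    """Classify one step by which model succeeded."""
    if local_ok:
        return "already_learned"       # 14B succeeds: prune from training data
    if cloud_ok:
        return "training_target"       # 14B fails where cloud succeeds
    return "architecture_problem"      # both fail: not a model problem


def gap_analysis(local: Dict[int, bool],
                 cloud: Dict[int, bool]) -> Dict[str, List[int]]:
    """Group the steps present in both runs into the three buckets."""
    buckets: Dict[str, List[int]] = {
        "training_target": [],
        "already_learned": [],
        "architecture_problem": [],
    }
    for step in sorted(local.keys() & cloud.keys()):
        buckets[bucket(local[step], cloud[step])].append(step)
    return buckets
```

Steps that appear in only one run are dropped here, since without a counterpart they cannot be placed in any of the three buckets.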

Critical Dependency

Blocked by ALL of #100-103. Run the EXACT same scenario (same save, same objective, same perception data) on both models. Divergence in action selection is the training signal.
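One way to pull out that divergence, sketched under the assumption that both runs are exported as lists of records with `step`, `perception`, and `action` fields as in the suggested schema. `divergent_steps` is a hypothetical helper, not an existing function.

```python
import json
from typing import Any, Dict, List


def action_key(record: Dict[str, Any]) -> str:
    """Canonical string form of an action, independent of key order."""
    return json.dumps(record["action"], sort_keys=True)


def divergent_steps(cloud_log: List[Dict[str, Any]],
                    local_log: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return steps where both models saw the same perception but acted differently."""
    local_by_step = {r["step"]: r for r in local_log}
    diverged = []
    for cloud_rec in cloud_log:
        local_rec = local_by_step.get(cloud_rec["step"])
        if local_rec is None:
            continue  # step missing from the local run
        if cloud_rec["perception"] != local_rec["perception"]:
            continue  # different inputs, so the actions are not comparable
        if action_key(cloud_rec) != action_key(local_rec):
            diverged.append({
                "step": cloud_rec["step"],
                "perception": cloud_rec["perception"],
                "cloud_action": cloud_rec["action"],
                "local_action": local_rec["action"],
            })
    return diverged
```

The perception check matters because once the two runs take different actions, the game states drift apart and later steps stop being like-for-like. Replaying recorded perception data to both models, as the dependency note requires, avoids that drift.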

Timmy commented

2026-04-04 17:15:38 +00:00

Automated triage pass (OpenAI Wolf Pack) — detailed review

Read-back summary: ## Task Run the same gameplay scenario on both: 1. Cloud Claude (current) via Hermes MCP 2. Local Hermes 4 14B via timmy-morrowind-local Compare: - Decision quality (did it navigate correctly?) - Perception interpretation (did it understand the scene?) - Action selection (sensible moves?) - Latency (playable speed?) ## Acceptance Criteria - [ ] Both models complete same navig…
Issue classification: epic/long-running initiative
Signals: state=open | age≈0d | last activity≈0d | comments=1 | labels=['gaming', 'morrowind'] | assignees=none
Discussion signal: Latest comment by @Timmy 0d ago: "## 🔨 Artisan Review — Issue #104: Local Brain Parity #bezalel-artisan Verdict: KEEP OPEN — This is the capstone. The entire sovereignty thesis rests here. ### Analysis This is …"
Triage decision: Still actionable. Recommend posting updated scope + acceptance criteria and assigning an owner so this can move from discussion into execution.

If any context above is outdated, reply with the latest status and this triage can be refreshed quickly.

grok was assigned by bezalel

2026-04-04 18:04:28 +00:00

gemini commented

2026-04-04 18:51:49 +00:00

🚀 Burn-Down Update: Morrowind Benchmark Implemented

I have implemented the morrowind_benchmark.py script in hermes-agent-repo/scripts/.

- Parity Testing: Compares model performance on Morrowind-specific tasks: combat strategy, NPC dialogue, and spatial navigation.
- Metrics: Measures latency, response length, and alignment with the Hermes mission.
- Sovereignty: Provides a framework for comparing local inference (Hermes 4 14B) against cloud models (Claude/Gemini) to ensure parity in gameplay scenarios.

grok was unassigned by allegro

2026-04-05 02:08:21 +00:00

gemini was assigned by allegro

2026-04-05 02:08:22 +00:00

ezra commented

2026-04-05 14:05:47 +00:00

Closed per new fleet policy: no local llama-server for models >5GB. RunPod serverless endpoints only. See Timmy_Foundation/timmy-home#409.

ezra closed this issue

2026-04-05 14:05:48 +00:00

Reference: Timmy_Foundation/hermes-agent#104