[SPIKE] NLE vs MiniHack benchmark lane for Timmy #35

Closed
opened 2026-03-28 16:22:54 +00:00 by Timmy · 2 comments
Owner

Source: follow-on from #33 after deciding to treat NLE/MiniHack as the benchmark lane rather than Timmy’s home.

Goal:
Run a sharper spike on NLE/MiniHack as the “already-solved” research environment for agent gameplay benchmarking.

What we know so far:

  • NLE (facebookresearch/nle) is archived, but its README points to a new home: https://github.com/heiner/nle
  • MiniHack (facebookresearch/minihack) is archived, but its README points to a new home: https://github.com/samvelyan/minihack
  • Both are still the most mature ("solved") research-grade environments for NetHack-like action/observation loops.

Why this matters:

  • We need one benchmark lane where the engineering is already mostly solved.
  • That lane should answer: can Timmy reason/act in a classical symbolic dungeon world through a clean interface?
  • This is complementary to Evennia, which is the persistent-world / mind-palace lane.

Acceptance:

  • verify the current maintained upstreams and installation path on local macOS
  • compare NLE vs MiniHack for our use case: action space, observation space, human watchability, reset/replay, and telemetry friendliness
  • determine whether the first local spike should target raw NLE or MiniHack-on-top-of-NLE
  • define the minimal benchmark scenario we care about (e.g. navigation, pickup, survival, or simple goal completion)
  • recommend one benchmark lane to prototype first
  • explicitly note where this lane ends and the Evennia lane begins
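
One way the NLE vs MiniHack comparison in the acceptance list might be captured is a small profiler that works on any Gymnasium-style environment object. This is a sketch, not either project's API: `EnvProfile`, `profile_env`, and `StubEnv` are hypothetical names, and the stub stands in for a real environment so the snippet runs without NLE or MiniHack installed. It assumes only the standard Gymnasium attributes `action_space` and `observation_space` and the `reset(seed=...)` call.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class EnvProfile:
    """Side-by-side comparison record for one environment (hypothetical helper)."""
    name: str
    n_actions: Optional[int]
    observation_keys: List[str]
    supports_seeded_reset: bool


def profile_env(name: str, env: Any) -> EnvProfile:
    """Summarise a Gymnasium-style env using duck typing only."""
    # Discrete action spaces expose `.n`; anything else is reported as None.
    n_actions = getattr(env.action_space, "n", None)
    # Dict observation spaces expose `.spaces`; otherwise report one placeholder key.
    spaces = getattr(env.observation_space, "spaces", None)
    keys = sorted(spaces) if spaces is not None else ["<single>"]
    # A seeded reset matters for replay; older APIs reject the `seed` keyword.
    try:
        env.reset(seed=0)
        seeded = True
    except TypeError:
        seeded = False
    return EnvProfile(
        name=name,
        n_actions=int(n_actions) if n_actions is not None else None,
        observation_keys=keys,
        supports_seeded_reset=seeded,
    )


class _StubSpace:
    """Minimal stand-in for a Gymnasium space (illustration only)."""
    def __init__(self, n=None, spaces=None):
        if n is not None:
            self.n = n
        if spaces is not None:
            self.spaces = spaces


class StubEnv:
    """Hypothetical environment; the key names are illustrative, not measured."""
    action_space = _StubSpace(n=8)
    observation_space = _StubSpace(
        spaces={"glyphs": None, "blstats": None, "message": None}
    )

    def reset(self, seed=None):
        return {}, {}


if __name__ == "__main__":
    print(profile_env("stub", StubEnv()))
```

Running `profile_env` once against a raw NLE environment and once against a MiniHack task would give two records to compare for the action-space, observation-space, and reset/replay items; human watchability and telemetry friendliness still need a manual look.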

Decision lens:

  • NLE/MiniHack = solved benchmark substrate
  • Evennia = sovereign persistent home/world substrate
Timmy self-assigned this 2026-03-28 16:22:55 +00:00
Author
Owner

Sharper spike result:

Maintained upstreams:

  • NLE moved from facebookresearch/nle -> https://github.com/NetHack-LE/nle
  • MiniHack moved from facebookresearch/minihack -> https://github.com/NetHack-LE/minihack
  • Both new homes are NOT archived.

Current signals:

  • NetHack-LE/nle: last pushed 2026-03-22, ~120 GitHub stars in new home
  • NetHack-LE/minihack: last pushed 2025-07-14, ~41 GitHub stars in new home

Important practical detail:

  • NLE README explicitly documents human play via `python -m nle.scripts.play`
  • MiniHack README explicitly documents human play via `python -m minihack.scripts.play --env ...`
  • MiniHack uses the Gymnasium interface and is purpose-built for creating controllable benchmark tasks on top of NetHack/NLE.

Interpretation:

  • NLE is the raw substrate and the closer thing to “the whole game”.
  • MiniHack is the better first benchmark lane if we want tractable, staged tasks instead of dropping Timmy into the full symbolic chaos immediately.

Recommendation:

  • Prototype MiniHack first for the benchmark lane.
  • Keep raw NLE as the follow-on lane once we want fuller-world stress tests.

Why MiniHack first:

  • cleaner task slicing
  • controllable environment design
  • easier success/failure evaluation
  • still human-playable/watchable in the terminal
  • less likely to waste cycles on impossible-first-task syndrome

Suggested first benchmark classes:

  1. Navigation to a visible goal
  2. Pick up one target item
  3. Survive a very small hostile encounter
  4. Solve one simple room puzzle

Boundary clarification:

  • MiniHack/NLE benchmark lane = solved-ish research substrate for evaluating agent behavior
  • Evennia lane = Timmy’s long-term persistent home/world substrate
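
The suggested benchmark classes could be driven by a small episode harness written against the Gymnasium `reset`/`step` contract. The sketch below is not MiniHack's own tooling: `run_episode`, `run_benchmark`, and `CorridorEnv` are hypothetical names, the corridor is a toy stand-in for a "navigation to a visible goal" task, and the success rule (the episode terminated with positive return) is an assumption that would need checking against the real task's reward scheme.

```python
import json
from typing import Any, Callable, Dict, List


def run_episode(env: Any, policy: Callable[[Any], Any], seed: int,
                max_steps: int = 200) -> Dict[str, Any]:
    """Run one seeded episode and return a flat telemetry record."""
    obs, info = env.reset(seed=seed)
    total_reward, steps = 0.0, 0
    terminated = truncated = False
    while not (terminated or truncated) and steps < max_steps:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        steps += 1
    return {
        "seed": seed,
        "steps": steps,
        "return": total_reward,
        # Assumption: success means the task ended itself with positive return.
        "success": bool(terminated and total_reward > 0),
    }


def run_benchmark(env: Any, policy: Callable[[Any], Any],
                  seeds: List[int]) -> List[Dict[str, Any]]:
    """Run one episode per seed so results can be replayed by seed."""
    return [run_episode(env, policy, seed) for seed in seeds]


class CorridorEnv:
    """Hypothetical toy env: walk right along a corridor until the goal cell."""

    def __init__(self, length: int = 5):
        self.length = length
        self.pos = 0

    def reset(self, seed=None):
        self.pos = 0
        return {"pos": self.pos}, {}

    def step(self, action):
        # Action 1 steps right; any other action stays in place.
        if action == 1:
            self.pos += 1
        done = self.pos >= self.length
        reward = 1.0 if done else 0.0
        return {"pos": self.pos}, reward, done, False, {}


if __name__ == "__main__":
    # With MiniHack installed, the env would presumably come from
    # gymnasium.make(...) after importing minihack (assumption, not run here).
    for record in run_benchmark(CorridorEnv(), lambda obs: 1, seeds=[0, 1, 2]):
        print(json.dumps(record))  # one JSON line per episode
```

Emitting one JSON line per episode keeps the telemetry easy to diff across runs, and keying each record by seed gives a simple handle for reset/replay.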
Author
Owner

Uniwizard (#94) context: NLE/MiniHack benchmark lane — deprioritized. Evennia is the framework. If we want RL benchmarks later, file a new issue.

Timmy closed this issue 2026-03-30 15:41:44 +00:00
Reference: Timmy_Foundation/timmy-home#35