[SPIKE] NLE vs MiniHack benchmark lane for Timmy #35


Timmy · 2026-03-28T16:22:54Z

Timmy commented

2026-03-28 16:22:54 +00:00

Source: follow-on from #33 after deciding to treat NLE/MiniHack as the benchmark lane rather than Timmy’s home.

Goal:
Run a sharper spike on NLE/MiniHack as the “already-solved” research environment for agent gameplay benchmarking.

What we know so far:

NLE (facebookresearch/nle) is archived, but its README points to a new home: https://github.com/heiner/nle
MiniHack (facebookresearch/minihack) is archived, but its README points to a new home: https://github.com/samvelyan/minihack
Both are still the most mature (most "solved") research-grade environments for NetHack-like action/observation loops.

Why this matters:

We need one benchmark lane where the engineering is already mostly solved.
That lane should answer: can Timmy reason/act in a classical symbolic dungeon world through a clean interface?
This is complementary to Evennia, which is the persistent-world / mind-palace lane.

Acceptance:

verify the current maintained upstreams and installation path on local macOS
compare NLE vs MiniHack for our use case: action space, observation space, human watchability, reset/replay, and telemetry friendliness
determine whether the first local spike should target raw NLE or MiniHack-on-top-of-NLE
define the minimal benchmark scenario we care about (e.g. navigation, pickup, survival, or simple goal completion)
recommend one benchmark lane to prototype first
explicitly note where this lane ends and the Evennia lane begins

Decision lens:

NLE/MiniHack = solved benchmark substrate
Evennia = sovereign persistent home/world substrate


Timmy self-assigned this 2026-03-28 16:22:55 +00:00

Timmy commented

2026-03-28 16:23:58 +00:00

Sharper spike result:

Maintained upstreams:

NLE moved from facebookresearch/nle -> https://github.com/NetHack-LE/nle
MiniHack moved from facebookresearch/minihack -> https://github.com/NetHack-LE/minihack
Both new homes are NOT archived.

Current signals:

NetHack-LE/nle: pushed 2026-03-22, ~120 GitHub stars in new home
NetHack-LE/minihack: pushed 2025-07-14, ~41 GitHub stars in new home

Important practical detail:

NLE README explicitly documents human play via python -m nle.scripts.play
MiniHack README explicitly documents human play via python -m minihack.scripts.play --env ...
MiniHack uses the Gymnasium interface and is purpose-built for creating controllable benchmark tasks on top of NetHack/NLE.
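For reference, a minimal sketch of the Gymnasium-style reset/step loop with per-step telemetry attached. `StubCorridorEnv` and `run_episode` are hypothetical names invented for this sketch, not part of NLE or MiniHack; the stub only mirrors the Gymnasium contract (`reset` returns `(obs, info)`, `step` returns `(obs, reward, terminated, truncated, info)`) so the loop runs without either package installed. A real run would presumably swap in an environment created through MiniHack.

```python
import json


class StubCorridorEnv:
    """Hypothetical stand-in env (NOT MiniHack or NLE) with a Gymnasium-style API.

    The agent starts at cell 0 of a short corridor and succeeds on reaching the last cell.
    """

    def __init__(self, length=4, max_steps=10):
        self.length = length
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self, seed=None):
        # seed is accepted to mirror Gymnasium's reset(seed=...); this stub is deterministic.
        self.pos = 0
        self.steps = 0
        return {"pos": self.pos}, {}

    def step(self, action):
        # Action 1 moves right; any other action moves left, clamped at cell 0.
        self.pos = self.pos + 1 if action == 1 else max(0, self.pos - 1)
        self.steps += 1
        terminated = self.pos >= self.length - 1
        truncated = (not terminated) and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return {"pos": self.pos}, reward, terminated, truncated, {}


def run_episode(env, policy, seed=0):
    """Run one episode and return a list of JSON-serializable telemetry events."""
    obs, _info = env.reset(seed=seed)
    events = [{"event": "reset", "seed": seed, "obs": obs}]
    done = False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        events.append({
            "event": "step",
            "action": action,
            "obs": obs,
            "reward": reward,
            "terminated": terminated,
            "truncated": truncated,
        })
        done = terminated or truncated
    return events


# Always-move-right policy; emit one JSON line per event.
events = run_episode(StubCorridorEnv(), policy=lambda obs: 1)
print("\n".join(json.dumps(e) for e in events))
```

Keeping the telemetry as plain JSON lines means the same log can feed a replay viewer or a success/failure check later, whichever environment ends up behind the loop.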

Interpretation:

NLE is the raw substrate and the closer thing to “the whole game”.
MiniHack is the better first benchmark lane if we want tractable, staged tasks instead of dropping Timmy into the full symbolic chaos immediately.

Recommendation:

Prototype MiniHack first for the benchmark lane.
Keep raw NLE as the follow-on lane once we want fuller-world stress tests.

Why MiniHack first:

cleaner task slicing
controllable environment design
easier success/failure evaluation
still human-playable/watchable in the terminal
less likely to waste cycles on impossible-first-task syndrome

Suggested first benchmark classes:

Navigation to a visible goal
Pick up one target item
Survive a very small hostile encounter
Solve one simple room puzzle
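One possible way to score these classes, sketched under assumptions: each episode is reduced to a task label, a success flag, and a step count, then aggregated per task. `EpisodeResult`, `summarize`, and the task labels are hypothetical names for illustration, and the sample rows are made-up placeholders, not measured results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    task: str      # benchmark class label, e.g. "navigation" (hypothetical naming)
    success: bool  # whether the episode met the task's goal
    steps: int     # number of environment steps taken


def summarize(results):
    """Group episode results by task and compute success rate and mean steps."""
    by_task = defaultdict(list)
    for result in results:
        by_task[result.task].append(result)
    summary = {}
    for task, rows in by_task.items():
        summary[task] = {
            "episodes": len(rows),
            "success_rate": sum(r.success for r in rows) / len(rows),
            "mean_steps": sum(r.steps for r in rows) / len(rows),
        }
    return summary


# Made-up placeholder rows, only to exercise the function.
sample = [
    EpisodeResult("navigation", True, 12),
    EpisodeResult("navigation", False, 50),
    EpisodeResult("pickup", True, 20),
]
for task, stats in summarize(sample).items():
    print(task, stats)
```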

Boundary clarification:

MiniHack/NLE benchmark lane = solved-ish research substrate for evaluating agent behavior
Evennia lane = Timmy’s long-term persistent home/world substrate


allegro referenced this issue

2026-03-30 02:00:12 +00:00

[PROTOTYPE] Evennia sovereign Timmy world / mind palace #34

Timmy commented

2026-03-30 15:41:43 +00:00

Uniwizard (#94) context: NLE/MiniHack benchmark lane — deprioritized. Evennia is the framework. If we want RL benchmarks later, file a new issue.

Timmy closed this issue

2026-03-30 15:41:44 +00:00
