[RESEARCH] RoboOmni — proactive robot manipulation from speech, sound, and vision (no explicit instructions) #655

Closed
opened 2026-03-27 16:03:28 +00:00 by perplexity · 35 comments
Member

Paper

"RoboOmni: Proactive Robot Manipulation in Omni-modal Context"
— OpenMOSS (Fudan University + NUS), arXiv:2510.23763

  • Paper: https://arxiv.org/abs/2510.23763
  • Code: https://github.com/OpenMOSS/RoboOmni
  • Website: https://openmoss.github.io/RoboOmni/
  • Models: HuggingFace fnlp/RoboOmni
  • Dataset: HuggingFace fnlp/OmniAction (140k episodes)

What It Is

A robot that figures out what you want without you telling it. Instead of explicit instructions ("pick up the cup"), it infers intent from:

  • Spoken dialogue (overhearing conversation)
  • Environmental sounds (doorbell, glass breaking)
  • Visual cues (someone reaching, pointing)
  • Sentiment cues (tone of voice)

Architecture: Perceiver → Thinker → Talker → Executor

The robot perceives multimodal context, thinks about what the human probably wants, confirms by talking ("would you like me to get that?"), then executes the action. End-to-end, one model.
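The four stages can be sketched as a minimal loop. This is purely illustrative: the paper implements the pipeline end-to-end in a single model, not as separate Python functions, and all names below (Context, perceive, think, talk, execute) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    speech: str = ""                      # overheard dialogue (transcribed here for illustration)
    sounds: list = field(default_factory=list)   # non-verbal events, e.g. ["doorbell"]
    visual: list = field(default_factory=list)   # visual cues, e.g. ["person_reaching_for_cup"]

def perceive(raw_inputs: dict) -> Context:
    """Perceiver: fuse the raw audio/visual streams into one context snapshot."""
    return Context(raw_inputs.get("speech", ""),
                   raw_inputs.get("sounds", []),
                   raw_inputs.get("visual", []))

def think(ctx: Context):
    """Thinker: infer the likely intended action, or None if nothing to do."""
    if "person_reaching_for_cup" in ctx.visual:
        return "hand_over_cup"
    return None

def talk(action: str) -> bool:
    """Talker: confirm with the human before acting."""
    reply = input(f"Would you like me to {action.replace('_', ' ')}? [y/n] ")
    return reply.strip().lower().startswith("y")

def execute(action: str) -> None:
    """Executor: run the confirmed action (tool call / motor command)."""
    print(f"executing: {action}")

def loop_once(raw_inputs: dict) -> None:
    """One pass through Perceiver -> Thinker -> Talker -> Executor."""
    action = think(perceive(raw_inputs))
    if action and talk(action):
        execute(action)
```

The key structural point is that Talker sits between inference and execution: the robot proposes, the human disposes.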

Why This Matters for Timmy

1. Proactive Intent vs Explicit Instruction

Most current agent frameworks (including Hermes Agent) wait for explicit instructions. RoboOmni's insight: real collaboration means inferring intent from ambient context.

For Timmy: instead of "check the Gitea backlog", Timmy observes you've been coding for 3 hours without a commit and proactively asks "want me to check what's open?" That's the difference between a tool and a companion.

2. The Perceiver-Thinker-Talker-Executor Pattern

Maps cleanly to the Hermes harness:

  • Perceiver = heartbeat tick (gather state from environment)
  • Thinker = Hermes Agent reasoning (what should I do?)
  • Talker = TUI/Discord/Telegram response (confirm with human)
  • Executor = tool calls (do the thing)

The heartbeat loop already does Perceiver + Thinker. Adding a Talker confirmation step before Executor would make Timmy proactive but not reckless.
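One way to wire in that confirmation gate, as a hedged sketch: handle_tick, SAFE_ACTIONS, and the confirm callable are hypothetical names, not existing harness APIs. The idea is that low-risk actions run proactively while anything else requires a Talker confirmation.

```python
# Hypothetical Talker gate between the heartbeat's proposal and execution.
SAFE_ACTIONS = {"report_status"}  # reversible/low-risk actions may skip confirmation

def handle_tick(proposal, confirm) -> str:
    """Decide what a heartbeat tick does with the Thinker's proposal.

    `confirm` is any callable (e.g. a TUI/Discord prompt) that takes the
    proposed action string and returns True/False.
    """
    if proposal is None:
        return "idle"                      # nothing to surface this tick
    if proposal in SAFE_ACTIONS:
        return f"executed:{proposal}"      # low-risk: act proactively
    if confirm(proposal):                  # Talker step: ask before acting
        return f"executed:{proposal}"
    return f"declined:{proposal}"
```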

3. OmniAction Dataset — Training Data Format

140k episodes, RLDS format, with:

  • 112 skills, 748 objects
  • 5,096 speaker timbres
  • 2,482 non-verbal sound events
  • 6 contextual instruction types: sentiment cues, overlapping voices, non-verbal cues, identity cues, dyadic dialogue, triadic dialogue

The dataset structure is worth studying even if we don't use their data. The 6 contextual instruction types are a taxonomy for how Timmy could learn to infer intent beyond explicit commands.
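If we adopt the taxonomy, it could live as a simple enum (a sketch; the identifier names below are paraphrases of the paper's six types, not their official labels):

```python
from enum import Enum

class ContextualInstruction(Enum):
    """The six contextual instruction types from OmniAction (names paraphrased)."""
    SENTIMENT_CUE = "sentiment_cue"            # tone of voice implies a need
    OVERLAPPING_VOICES = "overlapping_voices"  # intent buried in concurrent speech
    NONVERBAL_CUE = "nonverbal_cue"            # doorbell, glass breaking, etc.
    IDENTITY_CUE = "identity_cue"              # who is speaking changes the meaning
    DYADIC_DIALOGUE = "dyadic_dialogue"        # two-party conversation
    TRIADIC_DIALOGUE = "triadic_dialogue"      # three-party conversation
```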

4. Speech-to-Action Without ASR Intermediate

RoboOmni processes speech end-to-end — no separate speech-to-text step. This is faster and preserves tone/sentiment that text strips out. Relevant when Timmy has voice (the local voice stack from the SOTA report: Whisper + LLM + Piper).
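The loss is easy to see in miniature. This is illustrative only (not the paper's code, and not real Whisper/Piper APIs): a cascaded pipeline passes only the transcript across the ASR boundary, so prosody never reaches the LLM.

```python
# Illustrative sketch: what a cascaded ASR pipeline loses vs. end-to-end speech.
utterance = {
    "text": "I guess that's fine.",
    "prosody": {"pitch": "flat", "energy": "low"},  # a reluctant tone
}

def cascaded(u: dict) -> dict:
    """ASR -> LLM: only the transcript crosses the speech-to-text boundary."""
    return {"text": u["text"]}  # prosody is stripped at this step

def end_to_end(u: dict) -> dict:
    """RoboOmni-style: the model consumes the full audio representation."""
    return u  # tone and sentiment cues remain available
```

The same transcript, "I guess that's fine.", can mean agreement or reluctance; only the end-to-end path retains the signal that disambiguates it.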

What to Steal

  • The Perceiver-Thinker-Talker-Executor framing — adopt as the canonical names for Timmy's heartbeat loop stages
  • The 6 contextual instruction types taxonomy — use as a rubric for training Timmy to infer intent
  • The concept of proactive assistance — Timmy shouldn't just wait for commands. The heartbeat loop should surface opportunities, not just report status.
  • Study the RLDS dataset format — could our session_export Huey task produce RLDS-compatible trajectories?
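On the last point, an RLDS-shaped export could be sketched as follows. The step field names (observation, action, is_first, is_last, is_terminal) follow the RLDS step convention; session_to_rlds and the input session-log shape are hypothetical, not the current session_export schema.

```python
# Hypothetical sketch: mapping an agent session log to an RLDS-shaped episode.
def session_to_rlds(session: list) -> dict:
    """Convert [{'context': ..., 'tool_call': ...}, ...] into one episode dict."""
    steps = []
    n = len(session)
    for i, turn in enumerate(session):
        steps.append({
            "observation": {"context": turn["context"]},
            "action": turn["tool_call"],
            "is_first": i == 0,
            "is_last": i == n - 1,
            "is_terminal": i == n - 1,  # assume sessions end by completion, not truncation
        })
    return {"steps": steps}
```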

Comparison: RPG2Robot (#653) vs RoboOmni

  Dimension   RPG2Robot                             RoboOmni
  Focus       Game → sim → real transfer            Proactive intent inference
  Input       Game state + dialogue                 Speech + sound + vision
  Intent      Explicit (game quests)                Inferred (ambient context)
  Training    Semantic trajectories from RPG        OmniAction dataset (140k episodes)
  Transfer    Game → MuJoCo → real robot            Simulation → real robot
  For Timmy   Portal pattern + trajectory logging   Proactive heartbeat + intent inference

Both papers contribute different pieces. RPG2Robot gives us the transfer architecture. RoboOmni gives us the intent inference pattern. Together they describe a system that learns through games AND proactively assists without being told — which is the full Timmy vision.


Source: github.com/OpenMOSS/RoboOmni — triaged by Perplexity

perplexity added the needs-design, harness, portal, p1-important labels 2026-03-27 16:03:29 +00:00
Owner

⚡ Dispatched to `claude`. Huey task queued.
Owner

⚡ Dispatched to `gemini`. Huey task queued.
Owner

⚡ Dispatched to `kimi`. Huey task queued.
Owner

⚡ Dispatched to `grok`. Huey task queued.
Owner

⚡ Dispatched to `perplexity`. Huey task queued.
Owner

🔍 Triaged by Huey — needs assignment.
Member

🔧 `gemini` working on this via Huey. Branch: `gemini/issue-655`
Member

🔧 `grok` working on this via Huey. Branch: `grok/issue-655`
Member

⚠️ `grok` produced no changes for this issue. Skipping.
Timmy was assigned by Rockachopa 2026-03-28 03:54:22 +00:00
Owner

Closing as duplicate during backlog burn-down. Canonical issue: #654.

Reason: this workstream already exists with materially the same title/scope. Keeping one canonical thread prevents agent churn and review waste.

Timmy closed this issue 2026-03-28 04:45:31 +00:00
Reference: Timmy_Foundation/the-nexus#655