[RESEARCH] RoboOmni — proactive robot manipulation from speech, sound, and vision (no explicit instructions) #655

Closed
opened 2026-03-27 16:03:28 +00:00 by perplexity · 35 comments
Member

Paper

"RoboOmni: Proactive Robot Manipulation in Omni-modal Context"
— OpenMOSS (Fudan University + NUS), arXiv:2510.23763

  • Paper: https://arxiv.org/abs/2510.23763
  • Code: https://github.com/OpenMOSS/RoboOmni
  • Website: https://openmoss.github.io/RoboOmni/
  • Models: HuggingFace fnlp/RoboOmni
  • Dataset: HuggingFace fnlp/OmniAction (140k episodes)

What It Is

A robot that figures out what you want without you telling it. Instead of explicit instructions ("pick up the cup"), it infers intent from:

  • Spoken dialogue (overhearing conversation)
  • Environmental sounds (doorbell, glass breaking)
  • Visual cues (someone reaching, pointing)
  • Sentiment cues (tone of voice)

Architecture: Perceiver → Thinker → Talker → Executor

The robot perceives multimodal context, thinks about what the human probably wants, confirms by talking ("would you like me to get that?"), then executes the action. End-to-end, one model.
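The four stages can be sketched as a minimal loop. This is purely illustrative: the paper implements the pipeline end-to-end in a single model, not as separate Python functions, and all names below (Context, perceive, think, talk, execute) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    speech: str = ""                      # overheard dialogue (transcribed here for illustration)
    sounds: list = field(default_factory=list)   # non-verbal events, e.g. ["doorbell"]
    visual: list = field(default_factory=list)   # visual cues, e.g. ["person_reaching_for_cup"]

def perceive(raw_inputs: dict) -> Context:
    """Perceiver: fuse the raw audio/visual streams into one context snapshot."""
    return Context(raw_inputs.get("speech", ""),
                   raw_inputs.get("sounds", []),
                   raw_inputs.get("visual", []))

def think(ctx: Context):
    """Thinker: infer the likely intended action, or None if nothing to do."""
    if "person_reaching_for_cup" in ctx.visual:
        return "hand_over_cup"
    return None

def talk(action: str) -> bool:
    """Talker: confirm with the human before acting."""
    reply = input(f"Would you like me to {action.replace('_', ' ')}? [y/n] ")
    return reply.strip().lower().startswith("y")

def execute(action: str) -> None:
    """Executor: run the confirmed action (tool call / motor command)."""
    print(f"executing: {action}")

def loop_once(raw_inputs: dict) -> None:
    """One pass through Perceiver -> Thinker -> Talker -> Executor."""
    action = think(perceive(raw_inputs))
    if action and talk(action):
        execute(action)
```

The key structural point is that Talker sits between inference and execution: the robot proposes, the human disposes.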

Why This Matters for Timmy

1. Proactive Intent vs Explicit Instruction

Most current agent frameworks (including Hermes Agent) wait for explicit instructions. RoboOmni's insight: real collaboration means inferring intent from ambient context.

For Timmy: instead of "check the Gitea backlog", Timmy observes you've been coding for 3 hours without a commit and proactively asks "want me to check what's open?" That's the difference between a tool and a companion.

2. The Perceiver-Thinker-Talker-Executor Pattern

Maps cleanly to the Hermes harness:

  • Perceiver = heartbeat tick (gather state from environment)
  • Thinker = Hermes Agent reasoning (what should I do?)
  • Talker = TUI/Discord/Telegram response (confirm with human)
  • Executor = tool calls (do the thing)

The heartbeat loop already does Perceiver + Thinker. Adding a Talker confirmation step before Executor would make Timmy proactive but not reckless.
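One way to wire in that confirmation gate, as a hedged sketch: handle_tick, SAFE_ACTIONS, and the confirm callable are hypothetical names, not existing harness APIs. The idea is that low-risk actions run proactively while anything else requires a Talker confirmation.

```python
# Hypothetical Talker gate between the heartbeat's proposal and execution.
SAFE_ACTIONS = {"report_status"}  # reversible/low-risk actions may skip confirmation

def handle_tick(proposal, confirm) -> str:
    """Decide what a heartbeat tick does with the Thinker's proposal.

    `confirm` is any callable (e.g. a TUI/Discord prompt) that takes the
    proposed action string and returns True/False.
    """
    if proposal is None:
        return "idle"                      # nothing to surface this tick
    if proposal in SAFE_ACTIONS:
        return f"executed:{proposal}"      # low-risk: act proactively
    if confirm(proposal):                  # Talker step: ask before acting
        return f"executed:{proposal}"
    return f"declined:{proposal}"
```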

3. OmniAction Dataset — Training Data Format

140k episodes, RLDS format, with:

  • 112 skills, 748 objects
  • 5,096 speaker timbres
  • 2,482 non-verbal sound events
  • 6 contextual instruction types: sentiment cues, overlapping voices, non-verbal cues, identity cues, dyadic dialogue, triadic dialogue

The dataset structure is worth studying even if we don't use their data. The 6 contextual instruction types are a taxonomy for how Timmy could learn to infer intent beyond explicit commands.
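If we adopt the taxonomy, it could live as a simple enum (a sketch; the identifier names below are paraphrases of the paper's six types, not their official labels):

```python
from enum import Enum

class ContextualInstruction(Enum):
    """The six contextual instruction types from OmniAction (names paraphrased)."""
    SENTIMENT_CUE = "sentiment_cue"            # tone of voice implies a need
    OVERLAPPING_VOICES = "overlapping_voices"  # intent buried in concurrent speech
    NONVERBAL_CUE = "nonverbal_cue"            # doorbell, glass breaking, etc.
    IDENTITY_CUE = "identity_cue"              # who is speaking changes the meaning
    DYADIC_DIALOGUE = "dyadic_dialogue"        # two-party conversation
    TRIADIC_DIALOGUE = "triadic_dialogue"      # three-party conversation
```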

4. Speech-to-Action Without ASR Intermediate

RoboOmni processes speech end-to-end — no separate speech-to-text step. This is faster and preserves tone/sentiment that text strips out. Relevant when Timmy has voice (the local voice stack from the SOTA report: Whisper + LLM + Piper).
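The loss is easy to see in miniature. This is illustrative only (not the paper's code, and not real Whisper/Piper APIs): a cascaded pipeline passes only the transcript across the ASR boundary, so prosody never reaches the LLM.

```python
# Illustrative sketch: what a cascaded ASR pipeline loses vs. end-to-end speech.
utterance = {
    "text": "I guess that's fine.",
    "prosody": {"pitch": "flat", "energy": "low"},  # a reluctant tone
}

def cascaded(u: dict) -> dict:
    """ASR -> LLM: only the transcript crosses the speech-to-text boundary."""
    return {"text": u["text"]}  # prosody is stripped at this step

def end_to_end(u: dict) -> dict:
    """RoboOmni-style: the model consumes the full audio representation."""
    return u  # tone and sentiment cues remain available
```

The same transcript, "I guess that's fine.", can mean agreement or reluctance; only the end-to-end path retains the signal that disambiguates it.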

What to Steal

  • The Perceiver-Thinker-Talker-Executor framing — adopt as the canonical names for Timmy's heartbeat loop stages
  • The 6 contextual instruction types taxonomy — use as a rubric for training Timmy to infer intent
  • The concept of proactive assistance — Timmy shouldn't just wait for commands. The heartbeat loop should surface opportunities, not just report status.
  • Study the RLDS dataset format — could our session_export Huey task produce RLDS-compatible trajectories?
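On the last point, an RLDS-shaped export could be sketched as follows. The step field names (observation, action, is_first, is_last, is_terminal) follow the RLDS step convention; session_to_rlds and the input session-log shape are hypothetical, not the current session_export schema.

```python
# Hypothetical sketch: mapping an agent session log to an RLDS-shaped episode.
def session_to_rlds(session: list) -> dict:
    """Convert [{'context': ..., 'tool_call': ...}, ...] into one episode dict."""
    steps = []
    n = len(session)
    for i, turn in enumerate(session):
        steps.append({
            "observation": {"context": turn["context"]},
            "action": turn["tool_call"],
            "is_first": i == 0,
            "is_last": i == n - 1,
            "is_terminal": i == n - 1,  # assume sessions end by completion, not truncation
        })
    return {"steps": steps}
```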

Comparison: RPG2Robot (#653) vs RoboOmni

  Dimension   RPG2Robot                             RoboOmni
  Focus       Game → sim → real transfer            Proactive intent inference
  Input       Game state + dialogue                 Speech + sound + vision
  Intent      Explicit (game quests)                Inferred (ambient context)
  Training    Semantic trajectories from RPG        OmniAction dataset (140k episodes)
  Transfer    Game → MuJoCo → real robot            Simulation → real robot
  For Timmy   Portal pattern + trajectory logging   Proactive heartbeat + intent inference

Both papers contribute different pieces. RPG2Robot gives us the transfer architecture. RoboOmni gives us the intent inference pattern. Together they describe a system that learns through games AND proactively assists without being told — which is the full Timmy vision.


Source: github.com/OpenMOSS/RoboOmni — triaged by Perplexity

perplexity added the needs-design, harness, portal, p1-important labels 2026-03-27 16:03:29 +00:00
Owner

⚡ Dispatched to `claude`. Huey task queued.
Owner

⚡ Dispatched to `gemini`. Huey task queued.
Owner

⚡ Dispatched to `kimi`. Huey task queued.
Owner

⚡ Dispatched to `grok`. Huey task queued.
Owner

⚡ Dispatched to `perplexity`. Huey task queued.
Owner

🔍 Triaged by Huey — needs assignment.
Member

🔧 `gemini` working on this via Huey. Branch: `gemini/issue-655`
Member

🔧 `grok` working on this via Huey. Branch: `grok/issue-655`
Member

⚠️ `grok` produced no changes for this issue. Skipping.
Timmy was assigned by Rockachopa 2026-03-28 03:54:22 +00:00
Owner

Closing as duplicate during backlog burn-down. Canonical issue: #654.

Reason: this workstream already exists with materially the same title/scope. Keeping one canonical thread prevents agent churn and review waste.

Timmy closed this issue 2026-03-28 04:45:31 +00:00
Reference: Timmy_Foundation/the-nexus#655