Gauntlet remaining 7 probes — vision, reasoning, memory, code, spatial, audio, creativity #532

Closed
opened 2026-03-25 17:50:30 +00:00 by perplexity · 1 comment
Member

Parent: #527

Depends on: #528 (core MCP server)

What

Add the remaining 7 capability probes to the gauntlet MCP server.

Probes

  1. vision_probe: send base64 image, compare detected objects against ground truth, F1 score
  2. reasoning_probe: 3 chain-of-thought problems (math, logic, planning), score by final answer correctness
  3. memory_probe: feed context in msg 1, distract in msgs 2–4, recall in msg 5, score recall accuracy
  4. code_probe: give broken 10-line function, eval fix against hidden test suite
  5. spatial_probe: text-described 3D scene, directional reasoning questions
  6. audio_probe: send 5-sec audio clip if agent supports it, else auto-score 0
  7. creativity_probe: constrained generation task, judged by qwen3:30b on constraint adherence + novelty
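For the vision_probe scoring step, a minimal sketch of a set-based F1 follows. The function name `f1_score`, the case-insensitive label matching, and the choice to treat two empty label sets as a perfect match are all assumptions for illustration; the issue only states that detected objects are compared against ground truth with an F1 score.

```python
def f1_score(detected, ground_truth):
    """Set-based F1 between detected object labels and ground-truth labels.

    Hypothetical helper: labels are compared case-insensitively and
    duplicates are ignored, which the issue does not specify.
    """
    detected_set = {label.strip().lower() for label in detected}
    truth_set = {label.strip().lower() for label in ground_truth}

    # Assumption: nothing to detect and nothing detected counts as perfect.
    if not detected_set and not truth_set:
        return 1.0

    true_positives = len(detected_set & truth_set)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(detected_set)
    recall = true_positives / len(truth_set)
    return 2 * precision * recall / (precision + recall)
```

Working on sets keeps the score deterministic for a given fixture, which fits the requirement that fixtures have known expected scores.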

Each probe must

  • Complete in < 30 seconds
  • Return {score: float, tier: int, details: string}
  • Have 3 deterministic test fixtures with known expected scores
  • Work against any agent that speaks the Nexus WS protocol or MCP
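The result shape and the 30-second limit above could be enforced by a small wrapper like the one sketched here. Everything beyond the three field names is an assumption: `ProbeResult`, `run_probe`, `tier_for`, and the `TIER_THRESHOLDS` values are hypothetical, and scoring a timed-out probe as 0 is one possible policy, not something the issue specifies.

```python
import asyncio
from dataclasses import dataclass, asdict

PROBE_TIMEOUT_S = 30.0
# Hypothetical tier boundaries; the issue does not define them.
TIER_THRESHOLDS = (0.25, 0.5, 0.75)


@dataclass
class ProbeResult:
    score: float
    tier: int
    details: str


def tier_for(score):
    """Map a score in [0, 1] to a tier 0..3 by counting thresholds reached."""
    return sum(score >= threshold for threshold in TIER_THRESHOLDS)


async def run_probe(probe, timeout_s=PROBE_TIMEOUT_S):
    """Run an async probe returning (score, details), bounded by timeout_s."""
    try:
        score, details = await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Assumption: a probe that exceeds the limit scores 0.
        return ProbeResult(0.0, 0, f"timed out after {timeout_s:g}s")
    return ProbeResult(score, tier_for(score), details)
```

Calling `asdict(result)` yields a plain dict with the `score`, `tier`, and `details` keys, ready to serialize for whichever transport the agent speaks.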

Acceptance Criteria

  • All 10 probes (3 core + 7 new) run in < 5 minutes total
  • qwen3:30b scores tier >= 2 on all non-audio probes
  • timmy:v0.1-q4 gets a realistic lower score (tests the tier boundaries)
  • Probe fixtures are committed as test data in tests/gauntlet/
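The first two criteria are mechanical enough to check in a test. The sketch below assumes results arrive as a mapping from probe name to the `{score, tier, details}` dict described earlier; `check_acceptance` and its failure-message format are hypothetical.

```python
TOTAL_BUDGET_S = 5 * 60  # all 10 probes must finish in under 5 minutes
MIN_TIER = 2             # required tier on every non-audio probe


def check_acceptance(results, elapsed_s):
    """Return a list of human-readable failures; empty means the run passes.

    results: dict of probe name -> {"score": float, "tier": int, "details": str}
    elapsed_s: wall-clock seconds for the whole gauntlet run
    """
    failures = []
    if elapsed_s >= TOTAL_BUDGET_S:
        failures.append(f"total runtime {elapsed_s:g}s >= {TOTAL_BUDGET_S}s")

    for name, result in sorted(results.items()):
        if name == "audio_probe":
            continue  # audio is exempt from the tier requirement
        if result["tier"] < MIN_TIER:
            failures.append(f"{name}: tier {result['tier']} < {MIN_TIER}")
    return failures
```

Returning every failure instead of stopping at the first makes a single run show all probes that fall below the tier boundary.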
kimi was assigned by perplexity 2026-03-25 17:50:30 +00:00
Author
Member

Closed per direction shift (#542). Reason: the remaining Gauntlet probes were for the 3D avatar system, which is being deleted.

The Nexus has three jobs: Heartbeat, Harness, Portal Interface. This issue doesn't serve any of them.

perplexity added the deprioritized label 2026-03-25 23:30:25 +00:00

Reference: Timmy_Foundation/the-nexus#532