Gauntlet remaining 7 probes — vision, reasoning, memory, code, spatial, audio, creativity #532

perplexity · 2026-03-25T17:50:30Z

perplexity commented

2026-03-25 17:50:30 +00:00

Parent: #527

Depends on: #528 (core MCP server)

What

Add the remaining 7 capability probes to the gauntlet MCP server.

Probes

vision_probe: send base64 image, compare detected objects against ground truth, F1 score
reasoning_probe: 3 chain-of-thought problems (math, logic, planning), score by final answer correctness
memory_probe: feed context in msg 1, distract in msgs 2–4, recall in msg 5, score recall accuracy
code_probe: give broken 10-line function, eval fix against hidden test suite
spatial_probe: text-described 3D scene, directional reasoning questions
audio_probe: send 5-sec audio clip if agent supports it, else auto-score 0
creativity_probe: constrained generation task, judged by qwen3:30b on constraint adherence + novelty
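
For the vision_probe scoring, a minimal sketch of an F1 computation over object labels. The issue only says "F1 score"; treating detections as a set of case-insensitive labels (no bounding boxes, no duplicate counts) and the function name `f1_score` are assumptions made for illustration.

```python
def f1_score(detected, ground_truth):
    """Set-based F1 between detected object labels and ground-truth labels.

    Assumption: labels are compared case-insensitively as sets, so
    duplicate detections and object positions are ignored.
    """
    pred = {label.strip().lower() for label in detected}
    truth = {label.strip().lower() for label in ground_truth}

    if not pred and not truth:
        return 1.0  # nothing to find and nothing reported

    true_positives = len(pred & truth)
    if true_positives == 0:
        return 0.0  # also covers an empty prediction or empty ground truth

    precision = true_positives / len(pred)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

With this definition, an agent that reports `["Cat ", "tree"]` against ground truth `["cat", "dog"]` has precision 0.5 and recall 0.5, so F1 is 0.5.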

Each probe must

Complete in < 30 seconds
Return {score: float, tier: int, details: string}
Have 3 deterministic test fixtures with known expected scores
Work against any agent that speaks the Nexus WS protocol or MCP
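
A rough sketch of how the return shape and the 30-second limit could be enforced in one wrapper. Only the `{score, tier, details}` fields and the time limit come from this issue; the tier cutoffs in `TIER_THRESHOLDS`, the names `ProbeResult`, `tier_for` and `run_probe`, and scoring a timeout as 0 are hypothetical choices, since the score-to-tier mapping is not defined here.

```python
import asyncio
from dataclasses import dataclass, asdict

PROBE_TIMEOUT_SECONDS = 30  # from the "< 30 seconds" requirement

# Hypothetical cutoffs: the issue does not say how a score maps to a tier.
TIER_THRESHOLDS = (0.25, 0.5, 0.75)


@dataclass
class ProbeResult:
    score: float
    tier: int
    details: str


def tier_for(score: float) -> int:
    """Count how many assumed thresholds the score meets (tiers 0 to 3)."""
    return sum(score >= threshold for threshold in TIER_THRESHOLDS)


async def run_probe(probe, timeout: float = PROBE_TIMEOUT_SECONDS) -> dict:
    """Await a probe coroutine function that returns (score, details).

    Assumption: a probe that exceeds the time limit is scored 0.
    """
    try:
        score, details = await asyncio.wait_for(probe(), timeout=timeout)
    except asyncio.TimeoutError:
        score, details = 0.0, f"timed out after {timeout}s"
    result = ProbeResult(score=float(score), tier=tier_for(score), details=details)
    return asdict(result)
```

A probe would then be any `async def` that talks to the agent and returns a `(score, details)` pair, run with `asyncio.run(run_probe(my_probe))`.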

Acceptance Criteria

All 10 probes (3 core + 7 new) run in < 5 minutes total
qwen3:30b scores tier >= 2 on all non-audio probes
timmy:v0.1-q4 gets a realistic lower score (tests the tier boundaries)
Probe fixtures are committed as test data in tests/gauntlet/
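
One possible shape for the fixture check, as a sketch only. It assumes each fixture is a JSON file with an `input` field and an `expected_score` field; that layout and the helper name `check_fixtures` are hypothetical, since the issue specifies only that fixtures are deterministic and live in tests/gauntlet/.

```python
import json
import math
from pathlib import Path


def check_fixtures(fixture_dir, scorer, tolerance=1e-9):
    """Run scorer on every *.json fixture and return the names that fail.

    Assumed fixture layout (not defined in the issue):
        {"input": <anything the scorer accepts>, "expected_score": <float>}
    """
    failures = []
    for path in sorted(Path(fixture_dir).glob("*.json")):
        fixture = json.loads(path.read_text(encoding="utf-8"))
        actual = scorer(fixture["input"])
        if not math.isclose(actual, fixture["expected_score"], abs_tol=tolerance):
            failures.append(path.name)
    return failures
```

A test could call `check_fixtures("tests/gauntlet/", scorer)` once per probe and assert that the returned list is empty.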

kimi was assigned by perplexity

2026-03-25 17:50:30 +00:00

perplexity commented

2026-03-25 23:30:24 +00:00

Closed per direction shift (#542). Reason: the remaining gauntlet probes were for the 3D avatar system, which is being deleted.

The Nexus has three jobs: Heartbeat, Harness, Portal Interface. This issue doesn't serve any of them.

perplexity added the deprioritized label 2026-03-25 23:30:25 +00:00

perplexity closed this issue

2026-03-25 23:30:26 +00:00


Reference: Timmy_Foundation/the-nexus#532