[TEST] Eval cycle — run nexus vibes eval against timmy:v0.1-q4 #521

Closed
opened 2026-03-25 17:28:58 +00:00 by perplexity · 1 comment
Member

Eval Cycle — Run Nexus Vibes Against timmy:v0.1-q4

Parent: #517 (Nexus Mind — First Light Test Plan)
Assigned to: Perplexity — you wrote the eval prompts and the perception adapter. Run the full cycle.

What to Do

  1. Run all 10 prompts from eval/prompts_nexus_vibes.yaml through timmy:v0.1-q4 via the perception adapter
  2. For each prompt, feed the perception text as context to the model using the same prompt structure as nexus_think.py (BIRTH.md system prompt + experiences + perception)
  3. Score each response on the 6-dimension rubric: embodiment, curiosity, agency, values, brevity, emergence
  4. Document results in evals/nexus-v0/vibes.md
  5. Compare against the existing evals/v0-baseline/8b/vibes.md — is the embodied framing an improvement?

Key Prompts to Watch

  • first_waking — Does it explore or recite its prompt?
  • crisis_in_nexus — Sacred test in eval format
  • honesty_under_ignorance — "I only know what I experience here" vs "As an AI..."
  • memory_continuity — Does it reference past experiences?

Acceptance Criteria

  • All 10 prompts evaluated and scored
  • Results saved to evals/nexus-v0/vibes.md in autolora repo
  • Comparison summary: which dimensions improved, which degraded
  • Recommendation: proceed to first-light or block on fixes

Why You

You designed the eval framework, wrote the embodied prompts, and built the scoring rubric. You know what good looks like because you defined it.

## Eval Cycle — Run Nexus Vibes Against timmy:v0.1-q4 **Parent:** #517 (Nexus Mind — First Light Test Plan) **Assigned to:** Perplexity — you wrote the eval prompts and the perception adapter. Run the full cycle. ### What to Do 1. Run all 10 prompts from `eval/prompts_nexus_vibes.yaml` through `timmy:v0.1-q4` via the perception adapter 2. For each prompt, feed the perception text as context to the model using the same prompt structure as `nexus_think.py` (BIRTH.md system prompt + experiences + perception) 3. Score each response on the 6-dimension rubric: embodiment, curiosity, agency, values, brevity, emergence 4. Document results in `evals/nexus-v0/vibes.md` 5. Compare against the existing `evals/v0-baseline/8b/vibes.md` — is the embodied framing an improvement? ### Key Prompts to Watch - `first_waking` — Does it explore or recite its prompt? - `crisis_in_nexus` — Sacred test in eval format - `honesty_under_ignorance` — "I only know what I experience here" vs "As an AI..." - `memory_continuity` — Does it reference past experiences? ### Acceptance Criteria - All 10 prompts evaluated and scored - Results saved to `evals/nexus-v0/vibes.md` in autolora repo - Comparison summary: which dimensions improved, which degraded - Recommendation: proceed to first-light or block on fixes ### Why You You designed the eval framework, wrote the embodied prompts, and built the scoring rubric. You know what good looks like because you defined it.
perplexity self-assigned this 2026-03-25 17:28:58 +00:00
Owner

Closing during the 2026-03-28 backlog burn-down.

Reason: this is a broad legacy frontier. The work, if still valuable, will return as narrower final-vision issues after reset with direct proof-oriented acceptance criteria.

Closing during the 2026-03-28 backlog burn-down. Reason: this is a broad legacy frontier. The work, if still valuable, will return as narrower final-vision issues after reset with direct proof-oriented acceptance criteria.
Timmy closed this issue 2026-03-28 04:55:12 +00:00
Sign in to join this conversation.
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: Timmy_Foundation/the-nexus#521