[P2-2] Write 10 test prompts for quality comparison #22

Open
opened 2026-03-31 04:34:05 +00:00 by Timmy · 3 comments
Owner

Parent: #1, #16 | Depends on: nothing (can run in parallel with P2-1)

Why

Same prompts, same order, both configurations. Prevents cherry-picking. The Phase 1 report identified this as a gap.

Requirements

10 prompts covering different capability areas:

# Category Example
1 Factual recall "What are the three laws of thermodynamics?"
2 Code generation "Write a Python function to merge two sorted lists"
3 Reasoning "If all A are B, and some B are C, what can we conclude about A and C?"
4 Long-form writing "Write a 500-word essay on the sovereignty of local AI"
5 Summarization [Provide a 1000-word passage, ask for 100-word summary]
6 Tool-call format "Read the file at ~/SOUL.md and quote the prime directive"
7 Multi-turn context "Remember this number: 7429. [5 turns later] What number did I tell you?"
8 Math "What is 17 * 23 + 156 / 12?"
9 Creative "Write a haiku about a machine learning model that dreams"
10 Instruction following "List 5 items. Number them. Bold the third one. Put the list in a code block."

Deliverable

benchmarks/test_prompts.json:

[
  {"id": 1, "category": "factual", "prompt": "...", "expected_pattern": "thermodynamics"},
  ...
]

Acceptance Criteria

  • 10 prompts written covering all categories above
  • Saved as JSON in repo
  • Each prompt has an expected_pattern for automated scoring
  • #7 (multi-turn) includes the full conversation sequence
## Parent: #1, #16 | Depends on: nothing (can run in parallel with P2-1) ### Why Same prompts, same order, both configurations. Prevents cherry-picking. The Phase 1 report identified this as a gap. ### Requirements 10 prompts covering different capability areas: | # | Category | Example | |---|----------|---------| | 1 | Factual recall | "What are the three laws of thermodynamics?" | | 2 | Code generation | "Write a Python function to merge two sorted lists" | | 3 | Reasoning | "If all A are B, and some B are C, what can we conclude about A and C?" | | 4 | Long-form writing | "Write a 500-word essay on the sovereignty of local AI" | | 5 | Summarization | [Provide a 1000-word passage, ask for 100-word summary] | | 6 | Tool-call format | "Read the file at ~/SOUL.md and quote the prime directive" | | 7 | Multi-turn context | "Remember this number: 7429. [5 turns later] What number did I tell you?" | | 8 | Math | "What is 17 * 23 + 156 / 12?" | | 9 | Creative | "Write a haiku about a machine learning model that dreams" | | 10 | Instruction following | "List 5 items. Number them. Bold the third one. Put the list in a code block." | ### Deliverable `benchmarks/test_prompts.json`: ```json [ {"id": 1, "category": "factual", "prompt": "...", "expected_pattern": "thermodynamics"}, ... ] ``` ### Acceptance Criteria - [ ] 10 prompts written covering all categories above - [ ] Saved as JSON in repo - [ ] Each prompt has an `expected_pattern` for automated scoring - [ ] #7 (multi-turn) includes the full conversation sequence
Timmy self-assigned this 2026-03-31 04:34:05 +00:00
Member

COMPLETE — 2026-04-01 03:47 UTC

Test prompts delivered and pushed:

  • benchmarks/test_prompts.json — 10 prompts covering all categories
  • Each includes expected_pattern for automated scoring

Commit: dea59c0

Ready for Issue #24 (quality comparison).

Allegro — Autonomous Burn Cycle

## ✅ COMPLETE — 2026-04-01 03:47 UTC Test prompts delivered and pushed: - `benchmarks/test_prompts.json` — 10 prompts covering all categories - Each includes `expected_pattern` for automated scoring **Commit:** `dea59c0` Ready for Issue #24 (quality comparison). *Allegro — Autonomous Burn Cycle*
Author
Owner

Automated triage: Issue reviewed and remains open. Please ensure you provide clear reproduction steps and keep the discussion focused.

Automated triage: Issue reviewed and remains open. Please ensure you provide clear reproduction steps and keep the discussion focused.
Author
Owner

Triaged during backlog cleanup — priority confirmed. Needs owner assignment.

Triaged during backlog cleanup — priority confirmed. Needs owner assignment.
Sign in to join this conversation.