Implement few-shot example curation from successful tasks #90

Open
opened 2026-03-30 15:24:22 +00:00 by Timmy · 1 comment
Owner

Objective

The cheapest way to make a small model smarter is better examples. Build a system that automatically curates the best task outputs into few-shot examples injected into prompts.

How It Works

  1. Self-grading scores every task output (from previous ticket)
  2. Outputs scoring 5/5 on accuracy become few-shot candidates
  3. Candidates are deduplicated and compressed
  4. Top 2-3 examples per task type are injected into the system prompt
  5. Examples rotate as better ones are found
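
Steps 2 through 4 above could be sketched roughly as follows. This is only an illustration, not the planned `example_curator.py`: the `GradedOutput` record, the `normalize` helper, and the "shorter output first" tie-break (standing in for compression, since all candidates share a 5/5 score) are assumptions not specified in this ticket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedOutput:
    """Hypothetical record produced by the self-grading step."""
    task_id: str
    task_type: str
    input: str
    output: str
    score: int  # accuracy score, 1-5


def normalize(text: str) -> str:
    """Collapse case and whitespace so near-identical texts compare equal."""
    return " ".join(text.lower().split())


def select_candidates(graded, max_examples=3):
    """Keep 5/5 outputs, drop duplicates, return up to max_examples per task type."""
    seen = set()
    by_type = {}
    for item in graded:
        if item.score != 5:
            continue  # only perfect accuracy scores become candidates
        key = (item.task_type, normalize(item.input), normalize(item.output))
        if key in seen:
            continue  # duplicate of an earlier candidate
        seen.add(key)
        by_type.setdefault(item.task_type, []).append(item)
    # Assumed tie-break: prefer shorter outputs, since every candidate scores 5.
    return {
        task_type: sorted(items, key=lambda i: len(i.output))[:max_examples]
        for task_type, items in by_type.items()
    }
```

Exact-match deduplication after normalization is the simplest option; a real curator might need fuzzier matching.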

Example Bank Structure

{
    "task_type": "read-file-and-summarize",
    "examples": [
        {
            "input": "Read SOUL.md and quote the mission statement",
            "output": "The mission statement from SOUL.md is: 'Sovereignty and service always.'",
            "score": 5,
            "source_task_id": "overnight_042",
            "tokens": 45
        }
    ],
    "max_examples": 3,
    "total_token_budget": 200
}
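
A bank entry shaped like the one above can be sanity-checked before it is used. The sketch below is an assumption about how such a check might look; `validate_bank_entry` is a hypothetical helper, and only the field names come from the structure shown in this ticket.

```python
import json


def validate_bank_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means the entry is consistent."""
    problems = []
    examples = entry["examples"]
    if len(examples) > entry["max_examples"]:
        problems.append(
            f"{len(examples)} examples exceeds max_examples={entry['max_examples']}"
        )
    total_tokens = sum(ex["tokens"] for ex in examples)
    if total_tokens > entry["total_token_budget"]:
        problems.append(
            f"{total_tokens} tokens exceeds total_token_budget={entry['total_token_budget']}"
        )
    for ex in examples:
        if ex["score"] != 5:
            problems.append(f"{ex['source_task_id']} has score {ex['score']}, expected 5")
    return problems


# The entry from this ticket, parsed from JSON text.
SAMPLE_ENTRY = json.loads("""
{
    "task_type": "read-file-and-summarize",
    "examples": [
        {
            "input": "Read SOUL.md and quote the mission statement",
            "output": "The mission statement from SOUL.md is: 'Sovereignty and service always.'",
            "score": 5,
            "source_task_id": "overnight_042",
            "tokens": 45
        }
    ],
    "max_examples": 3,
    "total_token_budget": 200
}
""")
```

Running `validate_bank_entry(SAMPLE_ENTRY)` on the entry above should report no problems, since it holds one 45-token example against a 200-token budget.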

Token Budget

Few-shot examples eat context. Enforce a strict budget:

  • Reflex tier: 0 examples (too fast to bother)
  • Standard tier: 2 examples, max 150 tokens total
  • Deep tier: 3 examples, max 300 tokens total
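
The tier limits above could be enforced with a small greedy filter before injection. This is a sketch under assumptions: the `TIER_LIMITS` table mirrors the three bullets, but `fit_to_budget`, `build_system_prompt`, and the "Example input / Example output" prompt layout are hypothetical and not the actual `prompt_router` interface.

```python
# (max example count, max total tokens) per tier, taken from the list above.
TIER_LIMITS = {
    "reflex": (0, 0),
    "standard": (2, 150),
    "deep": (3, 300),
}


def fit_to_budget(examples, tier):
    """Pick examples in order, skipping any that would exceed the tier's token budget."""
    max_count, max_tokens = TIER_LIMITS[tier]
    chosen, used = [], 0
    for ex in examples:
        if len(chosen) >= max_count:
            break
        if used + ex["tokens"] > max_tokens:
            continue  # too large for the remaining budget; try the next one
        chosen.append(ex)
        used += ex["tokens"]
    return chosen


def build_system_prompt(base_prompt, examples, tier):
    """Append the examples that fit the tier to the base prompt (assumed layout)."""
    chosen = fit_to_budget(examples, tier)
    if not chosen:
        return base_prompt
    blocks = [
        f"Example input: {ex['input']}\nExample output: {ex['output']}"
        for ex in chosen
    ]
    return base_prompt + "\n\n" + "\n\n".join(blocks)
```

The filter trusts the stored `tokens` field; if the count depends on the model's tokenizer, it would need to be recomputed when the model changes.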

In Evennia

  • Examples stored as Objects in the Library, tagged `few-shot`
  • `curate examples` command triggers manual review
  • Auto-curation runs after every grading cycle

Deliverables

  • scripts/example_curator.py — selection and compression
  • agent/few_shot_bank.json — current example bank
  • Integration with prompt_router (inject examples per tier)
  • Rotation logic (replace weak examples with stronger ones)
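
The rotation deliverable could look roughly like the sketch below. The ticket does not define what makes one example "stronger" than another, so the ranking used here (higher score first, then fewer tokens) is an assumption, and `rank` and `rotate_in` are hypothetical names. The entry dict follows the bank structure shown earlier in this ticket.

```python
def rank(example):
    """Assumed strength ordering: higher score wins, then fewer tokens."""
    return (example["score"], -example["tokens"])


def rotate_in(entry, candidate):
    """Try to place candidate in the bank entry; return True if the entry changed."""
    examples = entry["examples"]
    budget = entry["total_token_budget"]
    used = sum(ex["tokens"] for ex in examples)

    # Room left: append if the token budget allows it.
    if len(examples) < entry["max_examples"]:
        if used + candidate["tokens"] > budget:
            return False
        examples.append(candidate)
        return True

    # Bank is full: replace the weakest example only if the candidate is stronger.
    weakest = min(examples, key=rank)
    if rank(candidate) <= rank(weakest):
        return False
    if used - weakest["tokens"] + candidate["tokens"] > budget:
        return False
    examples[examples.index(weakest)] = candidate
    return True
```

Because every candidate already scores 5/5, the token-count tie-break does most of the work here; a finer-grained quality signal would make rotation more meaningful.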

Acceptance Criteria

  • Top-scoring outputs automatically become candidates
  • Token budget enforced per tier
  • Examples actually injected into prompts
  • A/B test shows improvement with examples vs without
ezra was assigned by Timmy 2026-03-30 15:24:22 +00:00
Author
Owner

Role Transition

Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.

Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.

Timmy — curate few-shot examples from your highest-scoring outputs. Inject the best 2-3 examples per task type into prompts. You teach yourself by example.

ezra was unassigned by Timmy 2026-03-30 16:03:27 +00:00
Timmy self-assigned this 2026-03-30 16:03:27 +00:00
Reference: Timmy_Foundation/timmy-home#90