Implement few-shot example curation from successful tasks #90

Timmy · 2026-03-30T15:24:22Z

Timmy commented

2026-03-30 15:24:22 +00:00

Objective

The cheapest way to make a small model smarter is better examples. Build a system that automatically curates the best task outputs into few-shot examples injected into prompts.

How It Works

1. Self-grading scores every task output (from previous ticket)
2. Outputs scoring 5/5 on accuracy become few-shot candidates
3. Candidates are deduplicated and compressed
4. Top 2-3 examples per task type are injected into the system prompt
5. Examples rotate as better ones are found
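
A minimal sketch of steps 2 to 4, not the final `example_curator.py`. The `GradedOutput` record, the function names, and the tie-break (shorter outputs first) are assumptions for illustration; "compression" here is only whitespace and case normalization for duplicate detection.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedOutput:
    # Hypothetical record shape; the real grader output may differ.
    task_id: str
    task_type: str
    prompt: str
    output: str
    accuracy: int  # self-grade, 1 to 5


def normalize(text: str) -> str:
    """Collapse whitespace and case so near-identical texts compare equal."""
    return " ".join(text.lower().split())


def select_candidates(graded, min_accuracy=5):
    """Keep outputs at or above min_accuracy, dropping duplicates per task type."""
    seen = set()
    kept = []
    for item in graded:
        if item.accuracy < min_accuracy:
            continue
        key = (item.task_type, normalize(item.prompt), normalize(item.output))
        if key in seen:
            continue
        seen.add(key)
        kept.append(item)
    return kept


def top_per_task_type(candidates, limit=3):
    """Group candidates by task type and keep at most `limit` of each.

    Tie-break is an assumption: shorter outputs first, since they cost
    fewer prompt tokens.
    """
    grouped = defaultdict(list)
    for item in candidates:
        grouped[item.task_type].append(item)
    return {
        task_type: sorted(items, key=lambda i: len(i.output))[:limit]
        for task_type, items in grouped.items()
    }
```

Since every candidate already scores 5/5, the score alone cannot rank them, so some secondary ordering is needed. Output length is only one option.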

Example Bank Structure

{
    "task_type": "read-file-and-summarize",
    "examples": [
        {
            "input": "Read SOUL.md and quote the mission statement",
            "output": "The mission statement from SOUL.md is: 'Sovereignty and service always.'",
            "score": 5,
            "source_task_id": "overnight_042",
            "tokens": 45
        }
    ],
    "max_examples": 3,
    "total_token_budget": 200
}
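
A bank entry in this shape can be checked against its own limits before it is written back. The sketch below is one possible check; `validate_bank_entry` and the `required_score` parameter are hypothetical names, not part of any existing module.

```python
import json

# The sample entry from this ticket, parsed from JSON.
SAMPLE_ENTRY = json.loads("""
{
    "task_type": "read-file-and-summarize",
    "examples": [
        {
            "input": "Read SOUL.md and quote the mission statement",
            "output": "The mission statement from SOUL.md is: 'Sovereignty and service always.'",
            "score": 5,
            "source_task_id": "overnight_042",
            "tokens": 45
        }
    ],
    "max_examples": 3,
    "total_token_budget": 200
}
""")


def validate_bank_entry(entry, required_score=5):
    """Return a list of problems found in one bank entry (empty if none)."""
    problems = []
    examples = entry["examples"]

    if len(examples) > entry["max_examples"]:
        problems.append(
            f"{len(examples)} examples exceeds max_examples={entry['max_examples']}"
        )

    total = sum(ex["tokens"] for ex in examples)
    if total > entry["total_token_budget"]:
        problems.append(
            f"{total} tokens exceeds total_token_budget={entry['total_token_budget']}"
        )

    for ex in examples:
        if ex["score"] != required_score:
            problems.append(
                f"{ex['source_task_id']} has score {ex['score']}, "
                f"expected {required_score}"
            )
    return problems
```

Returning a list of problems rather than raising lets the curator log everything wrong with an entry in one pass.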

Token Budget

Few-shot examples eat context. Enforce a strict budget:

Reflex tier: 0 examples (this tier is meant to be fast, so examples are not worth the extra context)
Standard tier: 2 examples, max 150 tokens total
Deep tier: 3 examples, max 300 tokens total
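
The tier limits above could be enforced with a small greedy selector. This is a sketch under assumptions: `TIER_BUDGETS` and `pick_for_tier` are hypothetical names, the lowercase tier keys are illustrative, and the input list is assumed to be ordered best first with a `tokens` field as in the example bank.

```python
# (max examples, max total tokens) per tier, taken from the list above.
TIER_BUDGETS = {
    "reflex": (0, 0),
    "standard": (2, 150),
    "deep": (3, 300),
}


def pick_for_tier(examples, tier):
    """Greedily pick examples, best first, without exceeding the tier budget."""
    max_examples, max_tokens = TIER_BUDGETS[tier]
    chosen = []
    used = 0
    for ex in examples:
        if len(chosen) == max_examples:
            break
        if used + ex["tokens"] > max_tokens:
            # Too large for the remaining budget; a later, smaller one may fit.
            continue
        chosen.append(ex)
        used += ex["tokens"]
    return chosen
```

Skipping an oversized example instead of stopping means a smaller, lower-ranked example can still fill the remaining budget. Stopping at the first miss would be the stricter alternative.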

In Evennia

Examples stored as Objects in the Library, tagged `few-shot`
`curate examples` command triggers manual review
Auto-curation runs after every grading cycle

Deliverables

`scripts/example_curator.py` — selection and compression
`agent/few_shot_bank.json` — current example bank
Integration with prompt_router (inject examples per tier)
Rotation logic (replace weak examples with stronger ones)
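
One way the rotation logic could look, operating on a bank entry in the structure shown earlier. `rotate_in` and `strength` are hypothetical names, and the definition of "weak" (lower score, then more tokens on a tie) is an assumption the ticket does not specify.

```python
def strength(example):
    """Rank key: higher score wins; on equal score, fewer tokens wins (assumption)."""
    return (example["score"], -example["tokens"])


def rotate_in(entry, candidate):
    """Add candidate to entry["examples"] or swap it for the weakest one.

    Returns True if the bank entry changed. Both max_examples and
    total_token_budget from the entry are respected.
    """
    examples = entry["examples"]
    budget = entry["total_token_budget"]
    used = sum(ex["tokens"] for ex in examples)

    # Free slot and enough budget: just append.
    if len(examples) < entry["max_examples"] and used + candidate["tokens"] <= budget:
        examples.append(candidate)
        return True

    if not examples:
        return False

    # Otherwise try to replace the weakest example currently in the bank.
    weakest = min(examples, key=strength)
    new_total = used - weakest["tokens"] + candidate["tokens"]
    if strength(candidate) > strength(weakest) and new_total <= budget:
        examples[examples.index(weakest)] = candidate
        return True
    return False
```

Calling this once per new candidate after each grading cycle would keep the bank at or under its limits while gradually swapping in stronger examples.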

Acceptance Criteria

Top-scoring outputs automatically become candidates
Token budget enforced per tier
Examples actually injected into prompts
A/B test shows improvement with examples vs without

ezra was assigned by Timmy

2026-03-30 15:24:22 +00:00

Timmy referenced this issue

2026-03-30 15:39:09 +00:00

[EPIC] Grand Timmy — The Uniwizard #94

Timmy referenced this issue

2026-03-30 15:41:41 +00:00

[INSPIRATION] Build reusable prompt packs and storyboard seeds from archive-derived style memory #45

Rockachopa referenced this issue

2026-03-30 15:48:45 +00:00

[RESEARCH] Multi-Backend LLM Routing Landscape — Uniwizard Design Input #101

Timmy referenced this issue

2026-03-30 15:58:50 +00:00

[EPIC] Grand Timmy — The Uniwizard #94

Timmy commented

2026-03-30 16:03:26 +00:00

Role Transition

Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.

Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.

Timmy — curate few-shot examples from your highest-scoring outputs. Inject the best 2-3 examples per task type into prompts. You teach yourself by example.

ezra was unassigned by Timmy

2026-03-30 16:03:27 +00:00

Timmy self-assigned this 2026-03-30 16:03:27 +00:00

ezra referenced this issue

2026-03-31 16:30:08 +00:00

[STUDY] Memory extraction — learn from Claude Code's auto-memory system #156

ezra referenced this issue

2026-03-31 17:03:24 +00:00

[EXTRACT P3-2] Write adaptation spec: Auto-memory extraction for Hermes #180