Build self-improvement loop: Timmy grades and learns from his own outputs #89
Objective
Timmy should be able to evaluate his own task outputs, identify patterns in failures, and adjust his approach. This is the core "getting smarter" loop that doesn't require retraining.
The Loop
Implementation
1. Grading Script
After each task, Timmy runs a grading prompt:
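The grading prompt itself is not preserved in this thread. As a rough sketch only, a grading step could build a prompt and validate the model's reply as shown here. The three dimensions come from the later comment in this issue (accuracy/completeness/efficiency); the 1-5 scale, the JSON reply format, and the function names `build_grading_prompt` and `parse_grade` are assumptions for illustration, not the ticket's actual design.

```python
import json

# Dimensions named in the ticket comments; the 1-5 scale is an assumption.
DIMENSIONS = ("accuracy", "completeness", "efficiency")


def build_grading_prompt(task: str, output: str) -> str:
    """Return a prompt asking the model to score its own output per dimension."""
    keys = ", ".join(f'"{d}"' for d in DIMENSIONS)
    return (
        "You are grading your own work.\n"
        f"Task:\n{task}\n\n"
        f"Output:\n{output}\n\n"
        f"Reply with a JSON object holding integer scores 1-5 for the keys {keys}, "
        'plus a short "feedback" string.'
    )


def parse_grade(reply: str) -> dict:
    """Validate the model's JSON reply; raise ValueError if it is malformed."""
    data = json.loads(reply)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("grade reply must be a JSON object")
    grade = {}
    for dim in DIMENSIONS:
        score = data.get(dim)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {dim}: {score!r}")
        grade[dim] = score
    grade["feedback"] = str(data.get("feedback", ""))
    return grade
```

Rejecting out-of-range or missing scores at parse time keeps bad grades out of the pattern database, which matters because later analysis averages these numbers.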
2. Pattern Database
Store grades + feedback in SQLite:
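The schema is not preserved in this thread either. A minimal sketch using Python's standard `sqlite3` module might look like the following. The table name `grades`, its columns, and the helpers `open_db` and `store_grade` are hypothetical, and the 1-5 score range is an assumption.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per graded task.
SCHEMA = """
CREATE TABLE IF NOT EXISTS grades (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    graded_at    TEXT    NOT NULL,
    task_type    TEXT    NOT NULL,
    accuracy     INTEGER NOT NULL CHECK (accuracy BETWEEN 1 AND 5),
    completeness INTEGER NOT NULL CHECK (completeness BETWEEN 1 AND 5),
    efficiency   INTEGER NOT NULL CHECK (efficiency BETWEEN 1 AND 5),
    feedback     TEXT    NOT NULL DEFAULT ''
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the pattern database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_grade(conn: sqlite3.Connection, task_type: str, grade: dict) -> int:
    """Insert one grade with a UTC timestamp and return its row id."""
    cur = conn.execute(
        "INSERT INTO grades "
        "(graded_at, task_type, accuracy, completeness, efficiency, feedback) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            task_type,
            grade["accuracy"],
            grade["completeness"],
            grade["efficiency"],
            grade.get("feedback", ""),
        ),
    )
    conn.commit()
    return cur.lastrowid
```

The `CHECK` constraints act as a second line of defence: an out-of-range score raises `sqlite3.IntegrityError` instead of silently skewing later averages.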
3. Strategy Adjustment
After every N tasks (e.g., 10), analyze the pattern database:
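The analysis step is not shown in this thread. One possible sketch groups stored grades by task type and dimension and flags low averages. The function names, the `task_type` field, the threshold of 3.0, and the minimum sample count are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "efficiency")


def should_analyze(task_count: int, every_n: int = 10) -> bool:
    """True on every N-th completed task (N = 10 follows the ticket's example)."""
    return task_count > 0 and task_count % every_n == 0


def weak_spots(grades, threshold: float = 3.0, min_samples: int = 3):
    """Return (task_type, dimension, average) triples averaging below threshold.

    `grades` is an iterable of dicts with a "task_type" key and one integer
    score per dimension. Groups with fewer than `min_samples` scores are
    skipped so a single bad run does not count as a pattern.
    """
    buckets = defaultdict(list)
    for g in grades:
        for dim in DIMENSIONS:
            buckets[(g["task_type"], dim)].append(g[dim])
    findings = [
        (task_type, dim, round(mean(scores), 2))
        for (task_type, dim), scores in buckets.items()
        if len(scores) >= min_samples and mean(scores) < threshold
    ]
    # Worst averages first, so the strategy update can address them in order.
    return sorted(findings, key=lambda f: f[2])
```

Each finding could then be turned into a short instruction appended to the system prompt (for example, a reminder to reduce redundant tool calls on coding tasks), which adjusts behaviour without retraining.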
4. In Evennia
- review performance command shows recent grades
- lessons learned shows extracted patterns
Deliverables
- scripts/self_grade.py - grading pipeline
- scripts/pattern_analyzer.py - pattern extraction
- scripts/strategy_updater.py - prompt/approach adjustment
Acceptance Criteria
Role Transition
Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.
Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.
Timmy — build the self-grading loop. After each task, grade your own output on accuracy/completeness/efficiency. Store patterns. Identify what you're good and bad at.
Self-Grader Module Implementation Complete ✅
I have built the self-improvement loop for grading and learning from session outputs.
Files Created
- ~/.timmy/uniwizard/self_grader.py
- ~/.timmy/uniwizard/test_self_grader.py
- ~/.timmy/uniwizard/self_grader_design.md
- ~/.timmy/uniwizard/session_grades.db
Features Implemented
Core Grading Dimensions (5 metrics):
- task_completed (bool) - Whether the task appears finished
- tool_calls_efficient (1-5) - Tool usage efficiency
- response_quality (1-5) - Overall response quality
- errors_recovered (bool) - Whether errors were recovered from
- total_api_calls (int) - Total API calls made
Error Analysis:
Pattern Identification:
Weekly Reports:
Usage
Initial Results (25 sessions graded)
Test Results
All 24 tests passing:
Integration Points
This module connects to:
Next Steps
🟠 Kimi picking up this task via OpenClaw heartbeat.
Backend: kimi/kimi-code
Timestamp: 2026-03-30T19:57:51Z