Self-Grader Design Document
Issue: timmy-home #89 - "Build self-improvement loop: Timmy grades and learns from his own outputs"
Related Issues: #88 (Adaptive Routing), #98 (Quality Scoring)
1. Overview
The Self-Grader module enables Timmy to automatically evaluate his own task outputs, identify patterns in failures, and generate actionable improvement insights. This creates a closed feedback loop for continuous self-improvement.
Goals
- Automatically grade completed sessions on multiple quality dimensions
- Identify recurring error patterns and their root causes
- Track performance trends over time
- Generate actionable weekly improvement reports
- Feed insights into adaptive routing decisions
2. Architecture
┌─────────────────────────────────────────────────────────────┐
│ Self-Grader Module │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Parser │───▶│ Analyzer │───▶│ Grader │ │
│ │ │ │ │ │ │ │
│ │ Reads session│ │ Extracts │ │ Scores on 5 │ │
│ │ JSON files │ │ metrics │ │ dimensions │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ SQLite Database Layer │ │
│ │ • session_grades table (individual scores) │ │
│ │ • weekly_reports table (aggregated insights) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Pattern Identification │ │
│ │ • Task type performance analysis │ │
│ │ • Tool error frequency tracking │ │
│ │ • Error classification and clustering │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Report Generator │ │
│ │ • Weekly summary with trends │ │
│ │ • Improvement suggestions │ │
│ │ • Performance alerts │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Downstream Consumers │
│ • Adaptive Routing (#88) - route based on task type │
│ • Quality Scoring (#98) - external quality validation │
│ • Skill Recommendations - identify skill gaps │
│ • Alert System - notify on quality degradation │
└─────────────────────────────────────────────────────────────┘
3. Grading Dimensions
3.1 Core Metrics (1-5 scale where applicable)
| Metric | Type | Description |
|---|---|---|
| `task_completed` | boolean | Whether the task appears to have been completed successfully |
| `tool_calls_efficient` | int (1-5) | Efficiency of tool usage (error rate, call count) |
| `response_quality` | int (1-5) | Overall quality of the final response |
| `errors_recovered` | boolean | Whether errors were successfully recovered from |
| `total_api_calls` | int | Total number of API/assistant calls made |
3.2 Derived Metrics
| Metric | Description |
|---|---|
| `total_errors` | Count of tool errors detected |
| `error_types` | Categorized error types (JSON list) |
| `tools_with_errors` | Tools that generated errors |
| `had_repeated_errors` | Flag for 3+ consecutive errors |
| `had_infinite_loop_risk` | Flag for 5+ consecutive errors or >50 calls |
| `had_user_clarification` | Whether clarification was requested |
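The two streak flags above can be computed in a single pass over the ordered call results. A minimal sketch, assuming a hypothetical `call_results` list of booleans (True meaning the tool call errored); the real module's session data structures may differ:

```python
def flag_error_streaks(call_results):
    """Derive the streak-based pattern flags from an ordered list of
    per-call error indicators (True = the tool call errored)."""
    longest = streak = 0
    for errored in call_results:
        streak = streak + 1 if errored else 0  # reset on any success
        longest = max(longest, streak)
    return {
        "had_repeated_errors": longest >= 3,          # 3+ consecutive errors
        "had_infinite_loop_risk": longest >= 5 or len(call_results) > 50,
    }
```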
4. Error Classification
The system classifies errors into categories for pattern analysis:
| Category | Pattern | Example |
|---|---|---|
| `file_not_found` | File/path errors | "No such file or directory" |
| `permission_denied` | Access errors | "Permission denied" |
| `timeout` | Time limit exceeded | "Request timed out" |
| `api_error` | External API failures | "Rate limit exceeded" |
| `syntax_error` | Code/parsing errors | "Invalid syntax" |
| `command_failed` | Command execution | `"exit_code": 1` |
| `network_error` | Connectivity issues | "Connection refused" |
| `tool_not_found` | Tool resolution | "Unknown tool" |
| `unknown` | Unclassified | Any other error |
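The mapping above can be sketched as a case-insensitive substring matcher. The pattern strings below are illustrative, taken from the example column; the production patterns are assumed to be richer:

```python
# Illustrative needle lists per category; first match wins.
ERROR_PATTERNS = {
    "file_not_found": ["no such file or directory"],
    "permission_denied": ["permission denied"],
    "timeout": ["timed out", "timeout"],
    "api_error": ["rate limit"],
    "syntax_error": ["invalid syntax", "syntaxerror"],
    "command_failed": ['"exit_code": 1'],
    "network_error": ["connection refused"],
    "tool_not_found": ["unknown tool"],
}

def classify_error(message: str) -> str:
    """Return the first matching category, or 'unknown'."""
    lowered = message.lower()
    for category, needles in ERROR_PATTERNS.items():
        if any(needle in lowered for needle in needles):
            return category
    return "unknown"
```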
5. Task Type Inference
Sessions are categorized by task type for comparative analysis:
| Task Type | Pattern |
|---|---|
| `code_review` | "review", "code review", "PR" |
| `debugging` | "debug", "fix", "troubleshoot" |
| `feature_impl` | "implement", "add feature", "build" |
| `refactoring` | "refactor", "clean up", "reorganize" |
| `documentation` | "document", "readme", "docstring" |
| `testing` | "test", "pytest", "unit test" |
| `research` | "research", "investigate", "look up" |
| `deployment` | "deploy", "release", "publish" |
| `data_analysis` | "analyze data", "process file", "parse" |
| `infrastructure` | "server", "docker", "kubernetes" |
| `general` | Default catch-all |
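Inference can be sketched as an ordered keyword scan over the task summary, falling through to `general`. A minimal sketch; note that naive substring matching is fragile ("pr" also matches "process"), so the real module is assumed to use word-boundary matching or careful pattern ordering:

```python
# Ordered (task_type, keywords) pairs; earlier entries take priority.
TASK_PATTERNS = [
    ("code_review", ["review", "code review", "pr"]),
    ("debugging", ["debug", "fix", "troubleshoot"]),
    ("feature_impl", ["implement", "add feature", "build"]),
    ("refactoring", ["refactor", "clean up", "reorganize"]),
    ("documentation", ["document", "readme", "docstring"]),
    ("testing", ["test", "pytest", "unit test"]),
    ("research", ["research", "investigate", "look up"]),
    ("deployment", ["deploy", "release", "publish"]),
    ("data_analysis", ["analyze data", "process file", "parse"]),
    ("infrastructure", ["server", "docker", "kubernetes"]),
]

def infer_task_type(summary: str) -> str:
    """Return the first task type whose keywords appear in the summary."""
    lowered = summary.lower()
    for task_type, keywords in TASK_PATTERNS:
        if any(kw in lowered for kw in keywords):
            return task_type
    return "general"
```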
6. Database Schema
6.1 session_grades Table
CREATE TABLE session_grades (
id INTEGER PRIMARY KEY AUTOINCREMENT,
session_id TEXT UNIQUE NOT NULL,
session_file TEXT NOT NULL,
graded_at TEXT NOT NULL,
-- Core metrics
task_completed INTEGER NOT NULL,
tool_calls_efficient INTEGER NOT NULL,
response_quality INTEGER NOT NULL,
errors_recovered INTEGER NOT NULL,
total_api_calls INTEGER NOT NULL,
-- Metadata
model TEXT,
platform TEXT,
session_start TEXT,
duration_seconds REAL,
task_summary TEXT,
-- Error analysis
total_errors INTEGER NOT NULL,
error_types TEXT, -- JSON array
tools_with_errors TEXT, -- JSON array
-- Pattern flags
had_repeated_errors INTEGER NOT NULL,
had_infinite_loop_risk INTEGER NOT NULL,
had_user_clarification INTEGER NOT NULL
);
6.2 weekly_reports Table
CREATE TABLE weekly_reports (
id INTEGER PRIMARY KEY AUTOINCREMENT,
week_start TEXT UNIQUE NOT NULL,
week_end TEXT NOT NULL,
generated_at TEXT NOT NULL,
report_json TEXT NOT NULL -- Serialized WeeklyReport
);
7. Scoring Algorithms
7.1 Task Completion Detection
Positive indicators:
- Final message contains completion phrases: "done", "completed", "success", "finished"
- References to created outputs: "saved to", "here is", "output:"
- Low error rate relative to total calls
Negative indicators:
- Explicit failure phrases: "failed", "unable to", "could not"
- Error rate > 30% of total calls
- Empty or very short final response
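The indicators above combine into a boolean heuristic roughly like the following sketch. The 20-character cutoff for "very short" and the phrase lists are assumptions drawn from the bullets:

```python
COMPLETION_PHRASES = ("done", "completed", "success", "finished",
                      "saved to", "here is", "output:")
FAILURE_PHRASES = ("failed", "unable to", "could not")

def detect_task_completed(final_message, total_errors, total_api_calls):
    """Heuristic completion check: negative indicators veto,
    positive indicators confirm."""
    text = (final_message or "").lower()
    if len(text.strip()) < 20:                 # empty or very short response
        return False
    if any(p in text for p in FAILURE_PHRASES):
        return False
    error_rate = total_errors / max(total_api_calls, 1)
    if error_rate > 0.3:                       # error rate > 30% of calls
        return False
    return any(p in text for p in COMPLETION_PHRASES)
```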
7.2 Tool Efficiency Scoring
def score_tool_efficiency(total_errors, tool_calls, total_api_calls):
    error_rate = total_errors / max(total_api_calls, 1)  # guard against zero calls
    if error_rate == 0 and tool_calls <= 10:
        return 5  # Perfect
    elif error_rate <= 0.1 and tool_calls <= 15:
        return 4  # Good
    elif error_rate <= 0.25 and tool_calls <= 25:
        return 3  # Average
    elif error_rate <= 0.4:
        return 2  # Poor
    else:
        return 1  # Very poor
7.3 Response Quality Scoring
Base score: 3 (average)
Additions:
- Content length > 500 chars: +1
- Content length > 1000 chars: +1
- Contains code blocks: +1
- Contains links/references: +0.5
Penalties:
- Repeated errors: -1
- Total errors > 5: -1
- Infinite loop risk: -2
Range clamped to 1-5.
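As a sketch, the additive rules translate to the function below. The code-block and link detection heuristics (backtick fences, `http(s)://` prefixes) and the final rounding are assumptions:

```python
def score_response_quality(content, total_errors,
                           had_repeated_errors, had_infinite_loop_risk):
    """Apply the additive quality rules and clamp the result to 1-5."""
    score = 3.0                                   # base: average
    if len(content) > 500:
        score += 1
    if len(content) > 1000:
        score += 1
    if "```" in content:                          # contains code blocks
        score += 1
    if "http://" in content or "https://" in content:  # links/references
        score += 0.5
    if had_repeated_errors:
        score -= 1
    if total_errors > 5:
        score -= 1
    if had_infinite_loop_risk:
        score -= 2
    return max(1, min(5, round(score)))
```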
8. Pattern Identification
8.1 Per-Task-Type Analysis
Tracks average scores per task type to identify weak areas:
task_scores = {
'code_review': 4.2,
'debugging': 2.8, # <-- Needs attention
'feature_impl': 3.5,
}
8.2 Tool Error Frequency
Identifies which tools are most error-prone:
tool_errors = {
'browser_navigate': 15, # <-- High error rate
'terminal': 5,
'file_read': 2,
}
8.3 Error Pattern Clustering
Groups errors by type to identify systemic issues:
error_types = {
'file_not_found': 12, # <-- Need better path handling
'timeout': 8,
'api_error': 3,
}
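Against the `session_grades` schema from Section 6, this clustering is a small aggregation over recent rows. A sketch, assuming `graded_at` is stored as an ISO-8601 string (as the schema's TEXT columns suggest):

```python
import json
import sqlite3
from collections import Counter

def identify_error_patterns(conn, days=7):
    """Count error types across sessions graded in the last `days` days."""
    rows = conn.execute(
        "SELECT error_types FROM session_grades "
        "WHERE graded_at >= datetime('now', ?)",
        (f"-{days} days",),
    ).fetchall()
    counts = Counter()
    for (error_types_json,) in rows:
        counts.update(json.loads(error_types_json or "[]"))
    return counts
```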
9. Weekly Report Generation
9.1 Report Contents
- Summary Statistics
  - Total sessions graded
  - Average tool efficiency
  - Average response quality
  - Task completion rate
  - Error recovery rate
- Problem Areas
  - Lowest-scoring task types
  - Most error-prone tools
  - Common error patterns
- Improvement Suggestions
  - Actionable recommendations based on patterns
  - Skill gap identification
  - Process improvement tips
9.2 Suggestion Generation Rules
| Condition | Suggestion |
|---|---|
| completion_rate < 70% | "Add pre-task planning steps" |
| avg_tool_efficiency < 3 | "Review error recovery patterns" |
| error_recovery_rate < 80% | "Implement better fallback strategies" |
| Specific task type low | "Consider skill enhancement for {task}" |
| Specific tool high errors | "Review usage patterns for {tool}" |
| Specific error common | "Add targeted error handling for {error}" |
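The rule table maps directly onto a list of threshold checks. A sketch, assuming a weekly `stats` dict with illustrative keys; the cutoff of 10 errors for a "high errors" tool is an assumption:

```python
def generate_suggestions(stats):
    """Translate aggregated weekly stats into suggestion strings."""
    suggestions = []
    if stats["completion_rate"] < 0.70:
        suggestions.append("Add pre-task planning steps")
    if stats["avg_tool_efficiency"] < 3:
        suggestions.append("Review error recovery patterns")
    if stats["error_recovery_rate"] < 0.80:
        suggestions.append("Implement better fallback strategies")
    for task, score in stats.get("task_scores", {}).items():
        if score < 3:
            suggestions.append(f"Consider skill enhancement for {task}")
    for tool, errors in stats.get("tool_errors", {}).items():
        if errors >= 10:  # "high errors" threshold is an assumption
            suggestions.append(f"Review usage patterns for {tool}")
    return suggestions
```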
10. Integration Points
10.1 With Adaptive Routing (#88)
The grader feeds task-type performance data to the router:
# Router uses grader insights
if task_type == 'debugging' and grader.get_task_score('debugging') < 3:
# Route to more capable model for debugging tasks
model = 'claude-opus-4'
10.2 With Quality Scoring (#98)
Grader scores feed into external quality validation:
# Quality scorer validates grader accuracy
external_score = quality_scorer.validate(session, grader_score)
discrepancy = abs(external_score - grader_score)
if discrepancy > threshold:
grader.calibrate() # Adjust scoring algorithms
10.3 With Skill System
Identifies skills that could improve low-scoring areas:
if grader.get_task_score('debugging') < 3:
recommend_skill('systematic-debugging')
11. Usage
11.1 Command Line
# Grade latest 10 ungraded sessions
python self_grader.py -g 10
# Show summary of last 7 days
python self_grader.py -s
# Show summary of last 30 days
python self_grader.py -s -d 30
# Generate weekly report
python self_grader.py -r
# Grade specific session file
python self_grader.py -f /path/to/session.json
11.2 Python API
from self_grader import SelfGrader
grader = SelfGrader()
# Grade latest sessions
grades = grader.grade_latest_sessions(n=10)
# Get pattern insights
patterns = grader.identify_patterns(days=7)
# Generate report
report = grader.generate_weekly_report()
# Get human-readable summary
print(grader.get_grades_summary(days=7))
12. Testing
Comprehensive test suite covers:
- Unit Tests
  - Error detection and classification
  - Scoring algorithms
  - Task type inference
- Integration Tests
  - Full session grading pipeline
  - Database operations
  - Report generation
- Edge Cases
  - Empty sessions
  - Sessions with infinite loops
  - Malformed session files
Run tests:
python -m pytest test_self_grader.py -v
13. Future Enhancements
- Machine Learning Integration
  - Train models to predict session success
  - Learn optimal tool sequences
  - Predict error likelihood
- Human-in-the-Loop Validation
  - Allow user override of grades
  - Collect explicit feedback
  - Calibrate scoring with human judgments
- Real-time Monitoring
  - Grade sessions as they complete
  - Alert on quality degradation
  - Live dashboard of metrics
- Cross-Session Learning
  - Identify recurring issues across similar tasks
  - Suggest skill improvements
  - Recommend tool alternatives
14. Files
| File | Description |
|---|---|
| `self_grader.py` | Main module with the SelfGrader class |
| `test_self_grader.py` | Comprehensive test suite |
| `self_grader_design.md` | This design document |
| `~/.timmy/uniwizard/session_grades.db` | SQLite database (created at runtime) |
Document Version: 1.0
Created: 2026-03-30
Author: Timmy (UniWizard)