# Self-Grader Design Document

**Issue:** timmy-home #89 - "Build self-improvement loop: Timmy grades and learns from his own outputs"

**Related Issues:** #88 (Adaptive Routing), #98 (Quality Scoring)

---
## 1. Overview

The Self-Grader module enables Timmy to automatically evaluate his own task outputs, identify patterns in failures, and generate actionable improvement insights. This creates a closed feedback loop for continuous self-improvement.

### Goals

- Automatically grade completed sessions on multiple quality dimensions
- Identify recurring error patterns and their root causes
- Track performance trends over time
- Generate actionable weekly improvement reports
- Feed insights into adaptive routing decisions

---
## 2. Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Self-Grader Module                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │    Parser    │───▶│   Analyzer   │───▶│    Grader    │   │
│  │              │    │              │    │              │   │
│  │ Reads session│    │ Extracts     │    │ Scores on 5  │   │
│  │ JSON files   │    │ metrics      │    │ dimensions   │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                    │                   │          │
│         ▼                    ▼                   ▼          │
│  ┌──────────────────────────────────────────────────────┐   │
│  │               SQLite Database Layer                  │   │
│  │  • session_grades table (individual scores)          │   │
│  │  • weekly_reports table (aggregated insights)        │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │               Pattern Identification                 │   │
│  │  • Task type performance analysis                    │   │
│  │  • Tool error frequency tracking                     │   │
│  │  • Error classification and clustering               │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  Report Generator                    │   │
│  │  • Weekly summary with trends                        │   │
│  │  • Improvement suggestions                           │   │
│  │  • Performance alerts                                │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Downstream Consumers                     │
│  • Adaptive Routing (#88) - route based on task type        │
│  • Quality Scoring (#98) - external quality validation      │
│  • Skill Recommendations - identify skill gaps              │
│  • Alert System - notify on quality degradation             │
└─────────────────────────────────────────────────────────────┘
```

---
## 3. Grading Dimensions

### 3.1 Core Metrics (1-5 scale where applicable)

| Metric | Type | Description |
|--------|------|-------------|
| `task_completed` | boolean | Whether the task appears to have been completed successfully |
| `tool_calls_efficient` | int (1-5) | Efficiency of tool usage (error rate, call count) |
| `response_quality` | int (1-5) | Overall quality of the final response |
| `errors_recovered` | boolean | Whether errors were successfully recovered from |
| `total_api_calls` | int | Total number of API/assistant calls made |

### 3.2 Derived Metrics

| Metric | Description |
|--------|-------------|
| `total_errors` | Count of tool errors detected |
| `error_types` | Categorized error types (JSON list) |
| `tools_with_errors` | Tools that generated errors |
| `had_repeated_errors` | Flag for 3+ consecutive errors |
| `had_infinite_loop_risk` | Flag for 5+ consecutive errors or >50 calls |
| `had_user_clarification` | Whether clarification was requested |

---
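The metrics above can be carried as a single record per session. A minimal sketch of such a record (the class name, field order, and defaults are illustrative assumptions, not the module's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class SessionGrade:
    """One graded session: core metrics plus derived error analysis."""
    session_id: str
    task_completed: bool
    tool_calls_efficient: int   # 1-5
    response_quality: int       # 1-5
    errors_recovered: bool
    total_api_calls: int
    total_errors: int = 0
    error_types: list[str] = field(default_factory=list)
    tools_with_errors: list[str] = field(default_factory=list)
    had_repeated_errors: bool = False
    had_infinite_loop_risk: bool = False
    had_user_clarification: bool = False
```

Keeping the derived metrics optional with safe defaults lets the parser emit a grade even when error analysis is skipped.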
## 4. Error Classification

The system classifies errors into categories for pattern analysis:

| Category | Pattern | Example |
|----------|---------|---------|
| `file_not_found` | File/path errors | "No such file or directory" |
| `permission_denied` | Access errors | "Permission denied" |
| `timeout` | Time limit exceeded | "Request timed out" |
| `api_error` | External API failures | "Rate limit exceeded" |
| `syntax_error` | Code/parsing errors | "Invalid syntax" |
| `command_failed` | Command execution | `"exit_code": 1` |
| `network_error` | Connectivity issues | "Connection refused" |
| `tool_not_found` | Tool resolution | "Unknown tool" |
| `unknown` | Unclassified | Any other error |

---
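The table above can be implemented as ordered substring matching. A minimal sketch (the exact patterns and the function name are assumptions for illustration, not the module's real implementation):

```python
# First matching category wins; order therefore encodes precedence.
ERROR_PATTERNS = [
    ("file_not_found", ["no such file", "file not found"]),
    ("permission_denied", ["permission denied"]),
    ("timeout", ["timed out", "timeout"]),
    ("api_error", ["rate limit", "api error"]),
    ("syntax_error", ["invalid syntax", "syntaxerror"]),
    ("command_failed", ['"exit_code": 1', "command failed"]),
    ("network_error", ["connection refused", "network unreachable"]),
    ("tool_not_found", ["unknown tool"]),
]

def classify_error(message: str) -> str:
    """Return the first matching category, or 'unknown'."""
    text = message.lower()
    for category, needles in ERROR_PATTERNS:
        if any(needle in text for needle in needles):
            return category
    return "unknown"
```

Case-folding once up front keeps the patterns themselves lowercase and simple.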
## 5. Task Type Inference

Sessions are categorized by task type for comparative analysis:

| Task Type | Pattern |
|-----------|---------|
| `code_review` | "review", "code review", "PR" |
| `debugging` | "debug", "fix", "troubleshoot" |
| `feature_impl` | "implement", "add feature", "build" |
| `refactoring` | "refactor", "clean up", "reorganize" |
| `documentation` | "document", "readme", "docstring" |
| `testing` | "test", "pytest", "unit test" |
| `research` | "research", "investigate", "look up" |
| `deployment` | "deploy", "release", "publish" |
| `data_analysis` | "analyze data", "process file", "parse" |
| `infrastructure` | "server", "docker", "kubernetes" |
| `general` | Default catch-all |

---
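Inference over this table can be sketched as first-match keyword lookup over the task summary (the keyword lists and precedence order are assumptions for illustration):

```python
# Earlier entries win ties, so more specific task types should come first.
TASK_KEYWORDS = [
    ("code_review", ["review", "code review"]),
    ("debugging", ["debug", "fix", "troubleshoot"]),
    ("feature_impl", ["implement", "add feature"]),
    ("refactoring", ["refactor", "clean up", "reorganize"]),
    ("documentation", ["document", "readme", "docstring"]),
    ("testing", ["test", "pytest", "unit test"]),
    ("research", ["research", "investigate", "look up"]),
    ("deployment", ["deploy", "release", "publish"]),
    ("data_analysis", ["analyze data", "process file", "parse"]),
    ("infrastructure", ["server", "docker", "kubernetes"]),
]

def infer_task_type(summary: str) -> str:
    """Return the first matching task type, falling back to 'general'."""
    text = summary.lower()
    for task_type, keywords in TASK_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return task_type
    return "general"
```

Because a summary like "fix the failing tests" matches both `debugging` and `testing`, the list order effectively decides the category.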
## 6. Database Schema

### 6.1 session_grades Table

```sql
CREATE TABLE session_grades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT UNIQUE NOT NULL,
    session_file TEXT NOT NULL,
    graded_at TEXT NOT NULL,

    -- Core metrics
    task_completed INTEGER NOT NULL,
    tool_calls_efficient INTEGER NOT NULL,
    response_quality INTEGER NOT NULL,
    errors_recovered INTEGER NOT NULL,
    total_api_calls INTEGER NOT NULL,

    -- Metadata
    model TEXT,
    platform TEXT,
    session_start TEXT,
    duration_seconds REAL,
    task_summary TEXT,

    -- Error analysis
    total_errors INTEGER NOT NULL,
    error_types TEXT,        -- JSON array
    tools_with_errors TEXT,  -- JSON array

    -- Pattern flags
    had_repeated_errors INTEGER NOT NULL,
    had_infinite_loop_risk INTEGER NOT NULL,
    had_user_clarification INTEGER NOT NULL
);
```

### 6.2 weekly_reports Table

```sql
CREATE TABLE weekly_reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    week_start TEXT UNIQUE NOT NULL,
    week_end TEXT NOT NULL,
    generated_at TEXT NOT NULL,
    report_json TEXT NOT NULL  -- Serialized WeeklyReport
);
```

---
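Because `session_id` is UNIQUE, re-grading a session can be made idempotent with `INSERT OR REPLACE`. A sketch of persisting a grade (the `save_grade` helper and the dict shape are assumptions; metadata columns are left NULL for brevity):

```python
import json
import sqlite3

def save_grade(conn: sqlite3.Connection, grade: dict) -> None:
    """Upsert one grade row; JSON-encode the list-valued columns."""
    conn.execute(
        """INSERT OR REPLACE INTO session_grades
           (session_id, session_file, graded_at, task_completed,
            tool_calls_efficient, response_quality, errors_recovered,
            total_api_calls, total_errors, error_types, tools_with_errors,
            had_repeated_errors, had_infinite_loop_risk,
            had_user_clarification)
           VALUES (:session_id, :session_file, :graded_at, :task_completed,
                   :tool_calls_efficient, :response_quality,
                   :errors_recovered, :total_api_calls, :total_errors,
                   :error_types, :tools_with_errors, :had_repeated_errors,
                   :had_infinite_loop_risk, :had_user_clarification)""",
        {**grade,
         "error_types": json.dumps(grade["error_types"]),
         "tools_with_errors": json.dumps(grade["tools_with_errors"])},
    )
    conn.commit()
```

Storing the arrays as JSON text matches the schema comments above while keeping the table queryable with plain SQL.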
## 7. Scoring Algorithms

### 7.1 Task Completion Detection

Positive indicators:

- Final message contains completion phrases: "done", "completed", "success", "finished"
- References to created outputs: "saved to", "here is", "output:"
- Low error rate relative to total calls

Negative indicators:

- Explicit failure phrases: "failed", "unable to", "could not"
- Error rate > 30% of total calls
- Empty or very short final response

### 7.2 Tool Efficiency Scoring

```python
# Guard against sessions that made no API calls
error_rate = total_errors / total_api_calls if total_api_calls else 0.0

if error_rate == 0 and tool_calls <= 10:
    score = 5  # Perfect
elif error_rate <= 0.1 and tool_calls <= 15:
    score = 4  # Good
elif error_rate <= 0.25 and tool_calls <= 25:
    score = 3  # Average
elif error_rate <= 0.4:
    score = 2  # Poor
else:
    score = 1  # Very poor
```

### 7.3 Response Quality Scoring

Base score: 3 (average).

Additions:

- Content length > 500 chars: +1
- Content length > 1000 chars: +1
- Contains code blocks: +1
- Contains links/references: +0.5

Penalties:

- Repeated errors: -1
- Total errors > 5: -1
- Infinite loop risk: -2

The final score is clamped to the 1-5 range.

---
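The Section 7.3 rules can be sketched as a single function (the function name, signature, and the exact markers checked for code blocks and links are illustrative assumptions):

```python
def score_response_quality(content: str, total_errors: int,
                           had_repeated_errors: bool,
                           had_infinite_loop_risk: bool) -> int:
    """Apply the additive/penalty heuristic, clamped to 1-5."""
    score = 3.0  # base: average
    if len(content) > 500:
        score += 1
    if len(content) > 1000:
        score += 1
    fence = "`" * 3              # fenced code block marker
    if fence in content:
        score += 1
    if "http://" in content or "https://" in content:
        score += 0.5             # links/references
    if had_repeated_errors:
        score -= 1
    if total_errors > 5:
        score -= 1
    if had_infinite_loop_risk:
        score -= 2
    return int(max(1, min(5, round(score))))
```

Clamping after rounding guarantees the stored `response_quality` always fits the 1-5 schema constraint, even when penalties stack.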
## 8. Pattern Identification

### 8.1 Per-Task-Type Analysis

Tracks average scores per task type to identify weak areas:

```python
task_scores = {
    'code_review': 4.2,
    'debugging': 2.8,    # <-- Needs attention
    'feature_impl': 3.5,
}
```

### 8.2 Tool Error Frequency

Identifies which tools are most error-prone:

```python
tool_errors = {
    'browser_navigate': 15,  # <-- High error rate
    'terminal': 5,
    'file_read': 2,
}
```

### 8.3 Error Pattern Clustering

Groups errors by type to identify systemic issues:

```python
error_types = {
    'file_not_found': 12,  # <-- Need better path handling
    'timeout': 8,
    'api_error': 3,
}
```

---
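Structures like `task_scores` above can be derived from graded sessions by simple bucketing. A sketch (the record shape and function name are assumptions; the real module would aggregate from the `session_grades` table):

```python
from collections import defaultdict

def average_scores_by_task(grades: list[dict]) -> dict[str, float]:
    """Bucket response_quality by task_type and average each bucket."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for grade in grades:
        buckets[grade["task_type"]].append(grade["response_quality"])
    return {task: round(sum(scores) / len(scores), 2)
            for task, scores in buckets.items()}
```

The same bucketing pattern applies to tool error frequency (8.2) and error clustering (8.3), keyed on tool name or error category instead of task type.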
## 9. Weekly Report Generation

### 9.1 Report Contents

1. **Summary Statistics**
   - Total sessions graded
   - Average tool efficiency
   - Average response quality
   - Task completion rate
   - Error recovery rate

2. **Problem Areas**
   - Lowest-scoring task types
   - Most error-prone tools
   - Common error patterns

3. **Improvement Suggestions**
   - Actionable recommendations based on patterns
   - Skill gap identification
   - Process improvement tips

### 9.2 Suggestion Generation Rules

| Condition | Suggestion |
|-----------|------------|
| completion_rate < 70% | "Add pre-task planning steps" |
| avg_tool_efficiency < 3 | "Review error recovery patterns" |
| error_recovery_rate < 80% | "Implement better fallback strategies" |
| Specific task type scores low | "Consider skill enhancement for {task}" |
| Specific tool has high errors | "Review usage patterns for {tool}" |
| Specific error is common | "Add targeted error handling for {error}" |

---
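The rule table translates directly into code. A sketch covering the first four rules (threshold values come from the table; the function name and the shape of the `stats` dict are assumptions):

```python
def generate_suggestions(stats: dict) -> list[str]:
    """Emit one suggestion per triggered rule from Section 9.2."""
    suggestions = []
    if stats["completion_rate"] < 0.70:
        suggestions.append("Add pre-task planning steps")
    if stats["avg_tool_efficiency"] < 3:
        suggestions.append("Review error recovery patterns")
    if stats["error_recovery_rate"] < 0.80:
        suggestions.append("Implement better fallback strategies")
    for task, score in stats.get("task_scores", {}).items():
        if score < 3:
            suggestions.append(f"Consider skill enhancement for {task}")
    return suggestions
```

The per-tool and per-error rules follow the same pattern: iterate a frequency map and emit the templated suggestion when a threshold is crossed.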
## 10. Integration Points

### 10.1 With Adaptive Routing (#88)

The grader feeds task-type performance data to the router:

```python
# Router uses grader insights
if task_type == 'debugging' and grader.get_task_score('debugging') < 3:
    # Route to a more capable model for debugging tasks
    model = 'claude-opus-4'
```

### 10.2 With Quality Scoring (#98)

Grader scores feed into external quality validation:

```python
# Quality scorer validates grader accuracy
external_score = quality_scorer.validate(session, grader_score)
discrepancy = abs(external_score - grader_score)
if discrepancy > threshold:
    grader.calibrate()  # Adjust scoring algorithms
```

### 10.3 With the Skill System

Identifies skills that could improve low-scoring areas:

```python
if grader.get_task_score('debugging') < 3:
    recommend_skill('systematic-debugging')
```

---
## 11. Usage

### 11.1 Command Line

```bash
# Grade the latest 10 ungraded sessions
python self_grader.py -g 10

# Show a summary of the last 7 days
python self_grader.py -s

# Show a summary of the last 30 days
python self_grader.py -s -d 30

# Generate the weekly report
python self_grader.py -r

# Grade a specific session file
python self_grader.py -f /path/to/session.json
```

### 11.2 Python API

```python
from self_grader import SelfGrader

grader = SelfGrader()

# Grade the latest sessions
grades = grader.grade_latest_sessions(n=10)

# Get pattern insights
patterns = grader.identify_patterns(days=7)

# Generate a report
report = grader.generate_weekly_report()

# Get a human-readable summary
print(grader.get_grades_summary(days=7))
```

---
## 12. Testing

The test suite covers:

1. **Unit Tests**
   - Error detection and classification
   - Scoring algorithms
   - Task type inference

2. **Integration Tests**
   - Full session grading pipeline
   - Database operations
   - Report generation

3. **Edge Cases**
   - Empty sessions
   - Sessions with infinite loops
   - Malformed session files

Run tests:

```bash
python -m pytest test_self_grader.py -v
```

---
## 13. Future Enhancements

1. **Machine Learning Integration**
   - Train models to predict session success
   - Learn optimal tool sequences
   - Predict error likelihood

2. **Human-in-the-Loop Validation**
   - Allow user override of grades
   - Collect explicit feedback
   - Calibrate scoring with human judgments

3. **Real-time Monitoring**
   - Grade sessions as they complete
   - Alert on quality degradation
   - Live dashboard of metrics

4. **Cross-Session Learning**
   - Identify recurring issues across similar tasks
   - Suggest skill improvements
   - Recommend tool alternatives

---
## 14. Files

| File | Description |
|------|-------------|
| `self_grader.py` | Main module with the SelfGrader class |
| `test_self_grader.py` | Test suite |
| `self_grader_design.md` | This design document |
| `~/.timmy/uniwizard/session_grades.db` | SQLite database (created at runtime) |

---

*Document Version: 1.0*
*Created: 2026-03-30*
*Author: Timmy (UniWizard)*