# Self-Grader Design Document
**Issue:** timmy-home #89 - "Build self-improvement loop: Timmy grades and learns from his own outputs"
**Related Issues:** #88 (Adaptive Routing), #98 (Quality Scoring)
---
## 1. Overview
The Self-Grader module enables Timmy to automatically evaluate his own task outputs, identify patterns in failures, and generate actionable improvement insights. This creates a closed feedback loop for continuous self-improvement.
### Goals
- Automatically grade completed sessions on multiple quality dimensions
- Identify recurring error patterns and their root causes
- Track performance trends over time
- Generate actionable weekly improvement reports
- Feed insights into adaptive routing decisions
---
## 2. Architecture
```
                      Self-Grader Module
                      ──────────────────

  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
  │    Parser    │────▶│   Analyzer   │────▶│    Grader    │
  │ Reads session│     │   Extracts   │     │ Scores on 5  │
  │  JSON files  │     │   metrics    │     │  dimensions  │
  └──────────────┘     └──────────────┘     └──────────────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              ▼
  ┌───────────────────────────────────────────────────────┐
  │                 SQLite Database Layer                 │
  │  • session_grades table (individual scores)           │
  │  • weekly_reports table (aggregated insights)         │
  └───────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌───────────────────────────────────────────────────────┐
  │                 Pattern Identification                │
  │  • Task type performance analysis                     │
  │  • Tool error frequency tracking                      │
  │  • Error classification and clustering                │
  └───────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌───────────────────────────────────────────────────────┐
  │                   Report Generator                    │
  │  • Weekly summary with trends                         │
  │  • Improvement suggestions                            │
  │  • Performance alerts                                 │
  └───────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌───────────────────────────────────────────────────────┐
  │                 Downstream Consumers                  │
  │  • Adaptive Routing (#88) - route based on task type  │
  │  • Quality Scoring (#98) - external quality validation│
  │  • Skill Recommendations - identify skill gaps        │
  │  • Alert System - notify on quality degradation       │
  └───────────────────────────────────────────────────────┘
```
---
## 3. Grading Dimensions
### 3.1 Core Metrics (1-5 scale where applicable)
| Metric | Type | Description |
|--------|------|-------------|
| `task_completed` | boolean | Whether the task appears to have been completed successfully |
| `tool_calls_efficient` | int (1-5) | Efficiency of tool usage (error rate, call count) |
| `response_quality` | int (1-5) | Overall quality of final response |
| `errors_recovered` | boolean | Whether errors were successfully recovered from |
| `total_api_calls` | int | Total number of API/assistant calls made |
### 3.2 Derived Metrics
| Metric | Description |
|--------|-------------|
| `total_errors` | Count of tool errors detected |
| `error_types` | Categorized error types (JSON list) |
| `tools_with_errors` | Tools that generated errors |
| `had_repeated_errors` | Flag for 3+ consecutive errors |
| `had_infinite_loop_risk` | Flag for 5+ consecutive errors or >50 calls |
| `had_user_clarification` | Whether clarification was requested |
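The per-session record implied by the two tables above can be sketched as a dataclass. This is illustrative only; the field names mirror the metric tables, but the actual `SelfGrader` internals may differ.

```python
from dataclasses import dataclass, field

@dataclass
class SessionGrade:
    # Core metrics (Section 3.1)
    task_completed: bool
    tool_calls_efficient: int   # 1-5
    response_quality: int       # 1-5
    errors_recovered: bool
    total_api_calls: int
    # Derived metrics (Section 3.2)
    total_errors: int = 0
    error_types: list = field(default_factory=list)
    tools_with_errors: list = field(default_factory=list)
    had_repeated_errors: bool = False      # 3+ consecutive errors
    had_infinite_loop_risk: bool = False   # 5+ consecutive errors or >50 calls
    had_user_clarification: bool = False
```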
---
## 4. Error Classification
The system classifies errors into categories for pattern analysis:
| Category | Pattern | Example |
|----------|---------|---------|
| `file_not_found` | File/path errors | "No such file or directory" |
| `permission_denied` | Access errors | "Permission denied" |
| `timeout` | Time limit exceeded | "Request timed out" |
| `api_error` | External API failures | "Rate limit exceeded" |
| `syntax_error` | Code/parsing errors | "Invalid syntax" |
| `command_failed` | Command execution | "exit_code": 1 |
| `network_error` | Connectivity issues | "Connection refused" |
| `tool_not_found` | Tool resolution | "Unknown tool" |
| `unknown` | Unclassified | Any other error |
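A minimal sketch of such a classifier, assuming regex matching against lowercased error messages. The pattern lists here are illustrative abbreviations of the table above, not the exact patterns in `self_grader.py`.

```python
import re

# Illustrative (category, pattern) pairs; checked in order, first match wins.
ERROR_PATTERNS = [
    ("file_not_found", r"no such file or directory|file not found"),
    ("permission_denied", r"permission denied|access denied"),
    ("timeout", r"timed? ?out|time limit"),
    ("api_error", r"rate limit|api error"),
    ("syntax_error", r"invalid syntax|syntaxerror|parse error"),
    ("command_failed", r'"exit_code":\s*[1-9]|command failed'),
    ("network_error", r"connection refused|network unreachable"),
    ("tool_not_found", r"unknown tool|tool not found"),
]

def classify_error(message: str) -> str:
    """Return the first matching category, or 'unknown'."""
    lowered = message.lower()
    for category, pattern in ERROR_PATTERNS:
        if re.search(pattern, lowered):
            return category
    return "unknown"
```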
---
## 5. Task Type Inference
Sessions are categorized by task type for comparative analysis:
| Task Type | Pattern |
|-----------|---------|
| `code_review` | "review", "code review", "PR" |
| `debugging` | "debug", "fix", "troubleshoot" |
| `feature_impl` | "implement", "add feature", "build" |
| `refactoring` | "refactor", "clean up", "reorganize" |
| `documentation` | "document", "readme", "docstring" |
| `testing` | "test", "pytest", "unit test" |
| `research` | "research", "investigate", "look up" |
| `deployment` | "deploy", "release", "publish" |
| `data_analysis` | "analyze data", "process file", "parse" |
| `infrastructure` | "server", "docker", "kubernetes" |
| `general` | Default catch-all |
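A hedged sketch of keyword-based inference over the task summary. The keyword lists are taken from the table above but abbreviated, and the priority order is an assumption.

```python
# Checked in order; the first category with a keyword hit wins,
# falling back to the 'general' catch-all.
TASK_KEYWORDS = [
    ("code_review", ["review", "code review", " pr "]),
    ("debugging", ["debug", "fix", "troubleshoot"]),
    ("feature_impl", ["implement", "add feature", "build"]),
    ("refactoring", ["refactor", "clean up", "reorganize"]),
    ("documentation", ["document", "readme", "docstring"]),
    ("testing", ["test", "pytest", "unit test"]),
    ("research", ["research", "investigate", "look up"]),
    ("deployment", ["deploy", "release", "publish"]),
    ("data_analysis", ["analyze data", "process file", "parse"]),
    ("infrastructure", ["server", "docker", "kubernetes"]),
]

def infer_task_type(summary: str) -> str:
    lowered = summary.lower()
    for task_type, keywords in TASK_KEYWORDS:
        if any(kw in lowered for kw in keywords):
            return task_type
    return "general"
```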
---
## 6. Database Schema
### 6.1 session_grades Table
```sql
CREATE TABLE session_grades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT UNIQUE NOT NULL,
    session_file TEXT NOT NULL,
    graded_at TEXT NOT NULL,
    -- Core metrics
    task_completed INTEGER NOT NULL,
    tool_calls_efficient INTEGER NOT NULL,
    response_quality INTEGER NOT NULL,
    errors_recovered INTEGER NOT NULL,
    total_api_calls INTEGER NOT NULL,
    -- Metadata
    model TEXT,
    platform TEXT,
    session_start TEXT,
    duration_seconds REAL,
    task_summary TEXT,
    -- Error analysis
    total_errors INTEGER NOT NULL,
    error_types TEXT,        -- JSON array
    tools_with_errors TEXT,  -- JSON array
    -- Pattern flags
    had_repeated_errors INTEGER NOT NULL,
    had_infinite_loop_risk INTEGER NOT NULL,
    had_user_clarification INTEGER NOT NULL
);
```
### 6.2 weekly_reports Table
```sql
CREATE TABLE weekly_reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    week_start TEXT UNIQUE NOT NULL,
    week_end TEXT NOT NULL,
    generated_at TEXT NOT NULL,
    report_json TEXT NOT NULL  -- Serialized WeeklyReport
);
```
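As an illustration of how a grade row flows into this schema, the snippet below writes and reads one record against an abbreviated copy of `session_grades` (metadata columns omitted) using Python's built-in `sqlite3`. The session values are made up for the example.

```python
import json
import sqlite3

# In-memory database with an abbreviated session_grades schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE session_grades (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        session_id TEXT UNIQUE NOT NULL,
        session_file TEXT NOT NULL,
        graded_at TEXT NOT NULL,
        task_completed INTEGER NOT NULL,
        tool_calls_efficient INTEGER NOT NULL,
        response_quality INTEGER NOT NULL,
        errors_recovered INTEGER NOT NULL,
        total_api_calls INTEGER NOT NULL,
        total_errors INTEGER NOT NULL,
        error_types TEXT,
        tools_with_errors TEXT,
        had_repeated_errors INTEGER NOT NULL,
        had_infinite_loop_risk INTEGER NOT NULL,
        had_user_clarification INTEGER NOT NULL
    )
""")

# JSON-array columns are stored as serialized strings.
conn.execute(
    "INSERT INTO session_grades (session_id, session_file, graded_at, "
    "task_completed, tool_calls_efficient, response_quality, errors_recovered, "
    "total_api_calls, total_errors, error_types, tools_with_errors, "
    "had_repeated_errors, had_infinite_loop_risk, had_user_clarification) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("sess-001", "/tmp/sess-001.json", "2026-03-30T12:00:00",
     1, 4, 4, 1, 12, 1, json.dumps(["timeout"]), json.dumps(["terminal"]),
     0, 0, 0),
)

row = conn.execute(
    "SELECT response_quality, error_types FROM session_grades").fetchone()
```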
---
## 7. Scoring Algorithms
### 7.1 Task Completion Detection
Positive indicators:

- Final message contains completion phrases: "done", "completed", "success", "finished"
- References to created outputs: "saved to", "here is", "output:"
- Low error rate relative to total calls

Negative indicators:

- Explicit failure phrases: "failed", "unable to", "could not"
- Error rate > 30% of total calls
- Empty or very short final response
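These heuristics could be combined roughly as follows. The phrase lists and the 20-character "very short" floor are illustrative assumptions, not the tuned values in `self_grader.py`.

```python
POSITIVE = ("done", "completed", "success", "finished",
            "saved to", "here is", "output:")
NEGATIVE = ("failed", "unable to", "could not")

def detect_completion(final_message: str, total_errors: int,
                      total_calls: int) -> bool:
    text = final_message.lower()
    if len(text.strip()) < 20:          # empty or very short response
        return False
    if any(p in text for p in NEGATIVE):
        return False
    error_rate = total_errors / total_calls if total_calls else 0.0
    if error_rate > 0.3:                # error rate > 30% of total calls
        return False
    return any(p in text for p in POSITIVE)
```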
### 7.2 Tool Efficiency Scoring
```python
# tool_calls counts tool invocations within the session; guard against
# division by zero for sessions with no API calls.
error_rate = total_errors / total_api_calls if total_api_calls else 0.0
if error_rate == 0 and tool_calls <= 10:
    score = 5  # Perfect
elif error_rate <= 0.1 and tool_calls <= 15:
    score = 4  # Good
elif error_rate <= 0.25 and tool_calls <= 25:
    score = 3  # Average
elif error_rate <= 0.4:
    score = 2  # Poor
else:
    score = 1  # Very poor
```
### 7.3 Response Quality Scoring
Base score: 3 (average)

Additions:

- Content length > 500 chars: +1
- Content length > 1000 chars: +1
- Contains code blocks: +1
- Contains links/references: +0.5

Penalties:

- Repeated errors: -1
- Total errors > 5: -1
- Infinite loop risk: -2

Range clamped to 1-5.
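A direct transcription of these rules as a sketch. The markers used to detect code blocks and links are assumed heuristics; the real detector may differ.

```python
def score_response_quality(content: str, total_errors: int,
                           had_repeated_errors: bool,
                           had_infinite_loop_risk: bool) -> int:
    score = 3.0                           # base: average
    if len(content) > 500:
        score += 1
    if len(content) > 1000:
        score += 1
    if "```" in content:                  # contains code blocks (assumed marker)
        score += 1
    if "http://" in content or "https://" in content:
        score += 0.5                      # links/references (assumed marker)
    if had_repeated_errors:
        score -= 1
    if total_errors > 5:
        score -= 1
    if had_infinite_loop_risk:
        score -= 2
    # Clamp to the 1-5 range.
    return int(max(1, min(5, round(score))))
```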
---
## 8. Pattern Identification
### 8.1 Per-Task-Type Analysis
Tracks average scores per task type to identify weak areas:
```python
task_scores = {
    'code_review': 4.2,
    'debugging': 2.8,    # <-- Needs attention
    'feature_impl': 3.5,
}
```
### 8.2 Tool Error Frequency
Identifies which tools are most error-prone:
```python
tool_errors = {
    'browser_navigate': 15,  # <-- High error rate
    'terminal': 5,
    'file_read': 2,
}
```
### 8.3 Error Pattern Clustering
Groups errors by type to identify systemic issues:
```python
error_types = {
    'file_not_found': 12,  # <-- Need better path handling
    'timeout': 8,
    'api_error': 3,
}
```
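Aggregating the per-session `error_types` lists into cluster counts like the ones above is a plain frequency count, e.g. with `collections.Counter`. The function name is illustrative.

```python
from collections import Counter

def cluster_errors(session_error_lists):
    """Merge per-session error-type lists into overall counts,
    most frequent first."""
    counts = Counter()
    for error_list in session_error_lists:
        counts.update(error_list)
    return dict(counts.most_common())
```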
---
## 9. Weekly Report Generation
### 9.1 Report Contents
1. **Summary Statistics**
   - Total sessions graded
   - Average tool efficiency
   - Average response quality
   - Task completion rate
   - Error recovery rate
2. **Problem Areas**
   - Lowest scoring task types
   - Most error-prone tools
   - Common error patterns
3. **Improvement Suggestions**
   - Actionable recommendations based on patterns
   - Skill gap identification
   - Process improvement tips
### 9.2 Suggestion Generation Rules
| Condition | Suggestion |
|-----------|------------|
| completion_rate < 70% | "Add pre-task planning steps" |
| avg_tool_efficiency < 3 | "Review error recovery patterns" |
| error_recovery_rate < 80% | "Implement better fallback strategies" |
| Specific task type low | "Consider skill enhancement for {task}" |
| Specific tool high errors | "Review usage patterns for {tool}" |
| Specific error common | "Add targeted error handling for {error}" |
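The rule table can be sketched as code. Thresholds and suggestion strings come from the table above; the `stats` field names and function name are assumptions.

```python
def generate_suggestions(stats: dict) -> list:
    """Apply the Section 9.2 rule table to weekly summary stats."""
    suggestions = []
    if stats.get("completion_rate", 1.0) < 0.70:
        suggestions.append("Add pre-task planning steps")
    if stats.get("avg_tool_efficiency", 5.0) < 3:
        suggestions.append("Review error recovery patterns")
    if stats.get("error_recovery_rate", 1.0) < 0.80:
        suggestions.append("Implement better fallback strategies")
    # Per-task-type rule: flag any task type scoring below average.
    for task, score in stats.get("task_scores", {}).items():
        if score < 3:
            suggestions.append(f"Consider skill enhancement for {task}")
    return suggestions
```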
---
## 10. Integration Points
### 10.1 With Adaptive Routing (#88)
The grader feeds task-type performance data to the router:
```python
# Router uses grader insights
if task_type == 'debugging' and grader.get_task_score('debugging') < 3:
    # Route to more capable model for debugging tasks
    model = 'claude-opus-4'
```
### 10.2 With Quality Scoring (#98)
Grader scores feed into external quality validation:
```python
# Quality scorer validates grader accuracy
external_score = quality_scorer.validate(session, grader_score)
discrepancy = abs(external_score - grader_score)
if discrepancy > threshold:
    grader.calibrate()  # Adjust scoring algorithms
```
### 10.3 With Skill System
Identifies skills that could improve low-scoring areas:
```python
if grader.get_task_score('debugging') < 3:
    recommend_skill('systematic-debugging')
```
---
## 11. Usage
### 11.1 Command Line
```bash
# Grade latest 10 ungraded sessions
python self_grader.py -g 10

# Show summary of last 7 days
python self_grader.py -s

# Show summary of last 30 days
python self_grader.py -s -d 30

# Generate weekly report
python self_grader.py -r

# Grade specific session file
python self_grader.py -f /path/to/session.json
```
### 11.2 Python API
```python
from self_grader import SelfGrader

grader = SelfGrader()

# Grade latest sessions
grades = grader.grade_latest_sessions(n=10)

# Get pattern insights
patterns = grader.identify_patterns(days=7)

# Generate report
report = grader.generate_weekly_report()

# Get human-readable summary
print(grader.get_grades_summary(days=7))
```
---
## 12. Testing
Comprehensive test suite covers:
1. **Unit Tests**
   - Error detection and classification
   - Scoring algorithms
   - Task type inference
2. **Integration Tests**
   - Full session grading pipeline
   - Database operations
   - Report generation
3. **Edge Cases**
   - Empty sessions
   - Sessions with infinite loops
   - Malformed session files
Run tests:
```bash
python -m pytest test_self_grader.py -v
```
---
## 13. Future Enhancements
1. **Machine Learning Integration**
   - Train models to predict session success
   - Learn optimal tool sequences
   - Predict error likelihood
2. **Human-in-the-Loop Validation**
   - Allow user override of grades
   - Collect explicit feedback
   - Calibrate scoring with human judgments
3. **Real-time Monitoring**
   - Grade sessions as they complete
   - Alert on quality degradation
   - Live dashboard of metrics
4. **Cross-Session Learning**
   - Identify recurring issues across similar tasks
   - Suggest skill improvements
   - Recommend tool alternatives
---
## 14. Files
| File | Description |
|------|-------------|
| `self_grader.py` | Main module with SelfGrader class |
| `test_self_grader.py` | Comprehensive test suite |
| `self_grader_design.md` | This design document |
| `~/.timmy/uniwizard/session_grades.db` | SQLite database (created at runtime) |
---
*Document Version: 1.0*
*Created: 2026-03-30*
*Author: Timmy (UniWizard)*