# Self-Grader Design Document

**Issue:** timmy-home #89 - "Build self-improvement loop: Timmy grades and learns from his own outputs"

**Related Issues:** #88 (Adaptive Routing), #98 (Quality Scoring)

---
## 1. Overview

The Self-Grader module enables Timmy to automatically evaluate his own task outputs, identify patterns in failures, and generate actionable improvement insights. This creates a closed feedback loop for continuous self-improvement.

### Goals

- Automatically grade completed sessions on multiple quality dimensions
- Identify recurring error patterns and their root causes
- Track performance trends over time
- Generate actionable weekly improvement reports
- Feed insights into adaptive routing decisions

---
## 2. Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Self-Grader Module                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │    Parser    │───▶│   Analyzer   │───▶│    Grader    │   │
│  │              │    │              │    │              │   │
│  │ Reads session│    │ Extracts     │    │ Scores on 5  │   │
│  │ JSON files   │    │ metrics      │    │ dimensions   │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                    │                   │          │
│         ▼                    ▼                   ▼          │
│  ┌──────────────────────────────────────────────────────┐   │
│  │               SQLite Database Layer                  │   │
│  │  • session_grades table (individual scores)          │   │
│  │  • weekly_reports table (aggregated insights)        │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │               Pattern Identification                 │   │
│  │  • Task type performance analysis                    │   │
│  │  • Tool error frequency tracking                     │   │
│  │  • Error classification and clustering               │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  Report Generator                    │   │
│  │  • Weekly summary with trends                        │   │
│  │  • Improvement suggestions                           │   │
│  │  • Performance alerts                                │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Downstream Consumers                     │
│  • Adaptive Routing (#88) - route based on task type        │
│  • Quality Scoring (#98) - external quality validation      │
│  • Skill Recommendations - identify skill gaps              │
│  • Alert System - notify on quality degradation             │
└─────────────────────────────────────────────────────────────┘
```

---
## 3. Grading Dimensions

### 3.1 Core Metrics (1-5 scale where applicable)

| Metric | Type | Description |
|--------|------|-------------|
| `task_completed` | boolean | Whether the task appears to have been completed successfully |
| `tool_calls_efficient` | int (1-5) | Efficiency of tool usage (error rate, call count) |
| `response_quality` | int (1-5) | Overall quality of the final response |
| `errors_recovered` | boolean | Whether errors were successfully recovered from |
| `total_api_calls` | int | Total number of API/assistant calls made |

### 3.2 Derived Metrics

| Metric | Description |
|--------|-------------|
| `total_errors` | Count of tool errors detected |
| `error_types` | Categorized error types (JSON list) |
| `tools_with_errors` | Tools that generated errors |
| `had_repeated_errors` | Flag for 3+ consecutive errors |
| `had_infinite_loop_risk` | Flag for 5+ consecutive errors or >50 calls |
| `had_user_clarification` | Whether clarification was requested |

---
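The metrics above can be carried as a single record per session. A minimal sketch of such a record (the class name, field order, and defaults are illustrative assumptions, not the module's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class SessionGrade:
    """One graded session: core metrics plus derived error analysis."""
    session_id: str
    task_completed: bool
    tool_calls_efficient: int   # 1-5
    response_quality: int       # 1-5
    errors_recovered: bool
    total_api_calls: int
    total_errors: int = 0
    error_types: list[str] = field(default_factory=list)
    tools_with_errors: list[str] = field(default_factory=list)
    had_repeated_errors: bool = False
    had_infinite_loop_risk: bool = False
    had_user_clarification: bool = False
```

Keeping the derived metrics optional with safe defaults lets the parser emit a grade even when error analysis is skipped.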
## 4. Error Classification

The system classifies errors into categories for pattern analysis:

| Category | Pattern | Example |
|----------|---------|---------|
| `file_not_found` | File/path errors | "No such file or directory" |
| `permission_denied` | Access errors | "Permission denied" |
| `timeout` | Time limit exceeded | "Request timed out" |
| `api_error` | External API failures | "Rate limit exceeded" |
| `syntax_error` | Code/parsing errors | "Invalid syntax" |
| `command_failed` | Command execution | `"exit_code": 1` |
| `network_error` | Connectivity issues | "Connection refused" |
| `tool_not_found` | Tool resolution | "Unknown tool" |
| `unknown` | Unclassified | Any other error |

---
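The table above can be implemented as ordered substring matching. A minimal sketch (the exact patterns and the function name are assumptions for illustration, not the module's real implementation):

```python
# First matching category wins; order therefore encodes precedence.
ERROR_PATTERNS = [
    ("file_not_found", ["no such file", "file not found"]),
    ("permission_denied", ["permission denied"]),
    ("timeout", ["timed out", "timeout"]),
    ("api_error", ["rate limit", "api error"]),
    ("syntax_error", ["invalid syntax", "syntaxerror"]),
    ("command_failed", ['"exit_code": 1', "command failed"]),
    ("network_error", ["connection refused", "network unreachable"]),
    ("tool_not_found", ["unknown tool"]),
]

def classify_error(message: str) -> str:
    """Return the first matching category, or 'unknown'."""
    text = message.lower()
    for category, needles in ERROR_PATTERNS:
        if any(needle in text for needle in needles):
            return category
    return "unknown"
```

Case-folding once up front keeps the patterns themselves lowercase and simple.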
## 5. Task Type Inference

Sessions are categorized by task type for comparative analysis:

| Task Type | Pattern |
|-----------|---------|
| `code_review` | "review", "code review", "PR" |
| `debugging` | "debug", "fix", "troubleshoot" |
| `feature_impl` | "implement", "add feature", "build" |
| `refactoring` | "refactor", "clean up", "reorganize" |
| `documentation` | "document", "readme", "docstring" |
| `testing` | "test", "pytest", "unit test" |
| `research` | "research", "investigate", "look up" |
| `deployment` | "deploy", "release", "publish" |
| `data_analysis` | "analyze data", "process file", "parse" |
| `infrastructure` | "server", "docker", "kubernetes" |
| `general` | Default catch-all |

---
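Inference over this table can be sketched as first-match keyword lookup over the task summary (the keyword lists and precedence order are assumptions for illustration):

```python
# Earlier entries win ties, so more specific task types should come first.
TASK_KEYWORDS = [
    ("code_review", ["review", "code review"]),
    ("debugging", ["debug", "fix", "troubleshoot"]),
    ("feature_impl", ["implement", "add feature"]),
    ("refactoring", ["refactor", "clean up", "reorganize"]),
    ("documentation", ["document", "readme", "docstring"]),
    ("testing", ["test", "pytest", "unit test"]),
    ("research", ["research", "investigate", "look up"]),
    ("deployment", ["deploy", "release", "publish"]),
    ("data_analysis", ["analyze data", "process file", "parse"]),
    ("infrastructure", ["server", "docker", "kubernetes"]),
]

def infer_task_type(summary: str) -> str:
    """Return the first matching task type, falling back to 'general'."""
    text = summary.lower()
    for task_type, keywords in TASK_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return task_type
    return "general"
```

Because a summary like "fix the failing tests" matches both `debugging` and `testing`, the list order effectively decides the category.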
## 6. Database Schema

### 6.1 session_grades Table

```sql
CREATE TABLE session_grades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT UNIQUE NOT NULL,
    session_file TEXT NOT NULL,
    graded_at TEXT NOT NULL,

    -- Core metrics
    task_completed INTEGER NOT NULL,
    tool_calls_efficient INTEGER NOT NULL,
    response_quality INTEGER NOT NULL,
    errors_recovered INTEGER NOT NULL,
    total_api_calls INTEGER NOT NULL,

    -- Metadata
    model TEXT,
    platform TEXT,
    session_start TEXT,
    duration_seconds REAL,
    task_summary TEXT,

    -- Error analysis
    total_errors INTEGER NOT NULL,
    error_types TEXT,        -- JSON array
    tools_with_errors TEXT,  -- JSON array

    -- Pattern flags
    had_repeated_errors INTEGER NOT NULL,
    had_infinite_loop_risk INTEGER NOT NULL,
    had_user_clarification INTEGER NOT NULL
);
```

### 6.2 weekly_reports Table

```sql
CREATE TABLE weekly_reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    week_start TEXT UNIQUE NOT NULL,
    week_end TEXT NOT NULL,
    generated_at TEXT NOT NULL,
    report_json TEXT NOT NULL  -- Serialized WeeklyReport
);
```

---
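Because `session_id` is UNIQUE, re-grading a session can be made idempotent with `INSERT OR REPLACE`. A sketch of persisting a grade (the `save_grade` helper and the dict shape are assumptions; metadata columns are left NULL for brevity):

```python
import json
import sqlite3

def save_grade(conn: sqlite3.Connection, grade: dict) -> None:
    """Upsert one grade row; JSON-encode the list-valued columns."""
    conn.execute(
        """INSERT OR REPLACE INTO session_grades
           (session_id, session_file, graded_at, task_completed,
            tool_calls_efficient, response_quality, errors_recovered,
            total_api_calls, total_errors, error_types, tools_with_errors,
            had_repeated_errors, had_infinite_loop_risk,
            had_user_clarification)
           VALUES (:session_id, :session_file, :graded_at, :task_completed,
                   :tool_calls_efficient, :response_quality,
                   :errors_recovered, :total_api_calls, :total_errors,
                   :error_types, :tools_with_errors, :had_repeated_errors,
                   :had_infinite_loop_risk, :had_user_clarification)""",
        {**grade,
         "error_types": json.dumps(grade["error_types"]),
         "tools_with_errors": json.dumps(grade["tools_with_errors"])},
    )
    conn.commit()
```

Storing the arrays as JSON text matches the schema comments above while keeping the table queryable with plain SQL.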
## 7. Scoring Algorithms

### 7.1 Task Completion Detection

Positive indicators:

- Final message contains completion phrases: "done", "completed", "success", "finished"
- References to created outputs: "saved to", "here is", "output:"
- Low error rate relative to total calls

Negative indicators:

- Explicit failure phrases: "failed", "unable to", "could not"
- Error rate > 30% of total calls
- Empty or very short final response

### 7.2 Tool Efficiency Scoring

```python
# Guard against sessions that made no API calls
error_rate = total_errors / total_api_calls if total_api_calls else 0.0

if error_rate == 0 and tool_calls <= 10:
    score = 5  # Perfect
elif error_rate <= 0.1 and tool_calls <= 15:
    score = 4  # Good
elif error_rate <= 0.25 and tool_calls <= 25:
    score = 3  # Average
elif error_rate <= 0.4:
    score = 2  # Poor
else:
    score = 1  # Very poor
```

### 7.3 Response Quality Scoring

Base score: 3 (average).

Additions:

- Content length > 500 chars: +1
- Content length > 1000 chars: +1
- Contains code blocks: +1
- Contains links/references: +0.5

Penalties:

- Repeated errors: -1
- Total errors > 5: -1
- Infinite loop risk: -2

The final score is clamped to the 1-5 range.

---
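The Section 7.3 rules can be sketched as a single function (the function name, signature, and the exact markers checked for code blocks and links are illustrative assumptions):

```python
def score_response_quality(content: str, total_errors: int,
                           had_repeated_errors: bool,
                           had_infinite_loop_risk: bool) -> int:
    """Apply the additive/penalty heuristic, clamped to 1-5."""
    score = 3.0  # base: average
    if len(content) > 500:
        score += 1
    if len(content) > 1000:
        score += 1
    fence = "`" * 3              # fenced code block marker
    if fence in content:
        score += 1
    if "http://" in content or "https://" in content:
        score += 0.5             # links/references
    if had_repeated_errors:
        score -= 1
    if total_errors > 5:
        score -= 1
    if had_infinite_loop_risk:
        score -= 2
    return int(max(1, min(5, round(score))))
```

Clamping after rounding guarantees the stored `response_quality` always fits the 1-5 schema constraint, even when penalties stack.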
## 8. Pattern Identification

### 8.1 Per-Task-Type Analysis

Tracks average scores per task type to identify weak areas:

```python
task_scores = {
    'code_review': 4.2,
    'debugging': 2.8,    # <-- Needs attention
    'feature_impl': 3.5,
}
```

### 8.2 Tool Error Frequency

Identifies which tools are most error-prone:

```python
tool_errors = {
    'browser_navigate': 15,  # <-- High error rate
    'terminal': 5,
    'file_read': 2,
}
```

### 8.3 Error Pattern Clustering

Groups errors by type to identify systemic issues:

```python
error_types = {
    'file_not_found': 12,  # <-- Need better path handling
    'timeout': 8,
    'api_error': 3,
}
```

---
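Structures like `task_scores` above can be derived from graded sessions by simple bucketing. A sketch (the record shape and function name are assumptions; the real module would aggregate from the `session_grades` table):

```python
from collections import defaultdict

def average_scores_by_task(grades: list[dict]) -> dict[str, float]:
    """Bucket response_quality by task_type and average each bucket."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for grade in grades:
        buckets[grade["task_type"]].append(grade["response_quality"])
    return {task: round(sum(scores) / len(scores), 2)
            for task, scores in buckets.items()}
```

The same bucketing pattern applies to tool error frequency (8.2) and error clustering (8.3), keyed on tool name or error category instead of task type.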
## 9. Weekly Report Generation

### 9.1 Report Contents

1. **Summary Statistics**
   - Total sessions graded
   - Average tool efficiency
   - Average response quality
   - Task completion rate
   - Error recovery rate

2. **Problem Areas**
   - Lowest-scoring task types
   - Most error-prone tools
   - Common error patterns

3. **Improvement Suggestions**
   - Actionable recommendations based on patterns
   - Skill gap identification
   - Process improvement tips

### 9.2 Suggestion Generation Rules

| Condition | Suggestion |
|-----------|------------|
| completion_rate < 70% | "Add pre-task planning steps" |
| avg_tool_efficiency < 3 | "Review error recovery patterns" |
| error_recovery_rate < 80% | "Implement better fallback strategies" |
| Specific task type scores low | "Consider skill enhancement for {task}" |
| Specific tool has high errors | "Review usage patterns for {tool}" |
| Specific error is common | "Add targeted error handling for {error}" |

---
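The rule table translates directly into code. A sketch covering the first four rules (threshold values come from the table; the function name and the shape of the `stats` dict are assumptions):

```python
def generate_suggestions(stats: dict) -> list[str]:
    """Emit one suggestion per triggered rule from Section 9.2."""
    suggestions = []
    if stats["completion_rate"] < 0.70:
        suggestions.append("Add pre-task planning steps")
    if stats["avg_tool_efficiency"] < 3:
        suggestions.append("Review error recovery patterns")
    if stats["error_recovery_rate"] < 0.80:
        suggestions.append("Implement better fallback strategies")
    for task, score in stats.get("task_scores", {}).items():
        if score < 3:
            suggestions.append(f"Consider skill enhancement for {task}")
    return suggestions
```

The per-tool and per-error rules follow the same pattern: iterate a frequency map and emit the templated suggestion when a threshold is crossed.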
## 10. Integration Points

### 10.1 With Adaptive Routing (#88)

The grader feeds task-type performance data to the router:

```python
# Router uses grader insights
if task_type == 'debugging' and grader.get_task_score('debugging') < 3:
    # Route to a more capable model for debugging tasks
    model = 'claude-opus-4'
```

### 10.2 With Quality Scoring (#98)

Grader scores feed into external quality validation:

```python
# Quality scorer validates grader accuracy
external_score = quality_scorer.validate(session, grader_score)
discrepancy = abs(external_score - grader_score)
if discrepancy > threshold:
    grader.calibrate()  # Adjust scoring algorithms
```

### 10.3 With the Skill System

Identifies skills that could improve low-scoring areas:

```python
if grader.get_task_score('debugging') < 3:
    recommend_skill('systematic-debugging')
```

---
## 11. Usage

### 11.1 Command Line

```bash
# Grade the latest 10 ungraded sessions
python self_grader.py -g 10

# Show a summary of the last 7 days
python self_grader.py -s

# Show a summary of the last 30 days
python self_grader.py -s -d 30

# Generate the weekly report
python self_grader.py -r

# Grade a specific session file
python self_grader.py -f /path/to/session.json
```

### 11.2 Python API

```python
from self_grader import SelfGrader

grader = SelfGrader()

# Grade the latest sessions
grades = grader.grade_latest_sessions(n=10)

# Get pattern insights
patterns = grader.identify_patterns(days=7)

# Generate a report
report = grader.generate_weekly_report()

# Get a human-readable summary
print(grader.get_grades_summary(days=7))
```

---
## 12. Testing

The test suite covers:

1. **Unit Tests**
   - Error detection and classification
   - Scoring algorithms
   - Task type inference

2. **Integration Tests**
   - Full session grading pipeline
   - Database operations
   - Report generation

3. **Edge Cases**
   - Empty sessions
   - Sessions with infinite loops
   - Malformed session files

Run tests:

```bash
python -m pytest test_self_grader.py -v
```

---
## 13. Future Enhancements

1. **Machine Learning Integration**
   - Train models to predict session success
   - Learn optimal tool sequences
   - Predict error likelihood

2. **Human-in-the-Loop Validation**
   - Allow user override of grades
   - Collect explicit feedback
   - Calibrate scoring with human judgments

3. **Real-time Monitoring**
   - Grade sessions as they complete
   - Alert on quality degradation
   - Live dashboard of metrics

4. **Cross-Session Learning**
   - Identify recurring issues across similar tasks
   - Suggest skill improvements
   - Recommend tool alternatives

---
## 14. Files

| File | Description |
|------|-------------|
| `self_grader.py` | Main module with the SelfGrader class |
| `test_self_grader.py` | Test suite |
| `self_grader_design.md` | This design document |
| `~/.timmy/uniwizard/session_grades.db` | SQLite database (created at runtime) |

---

*Document Version: 1.0*
*Created: 2026-03-30*
*Author: Timmy (UniWizard)*