
Self-Grader Design Document

Issue: timmy-home #89 - "Build self-improvement loop: Timmy grades and learns from his own outputs"

Related Issues: #88 (Adaptive Routing), #98 (Quality Scoring)


1. Overview

The Self-Grader module enables Timmy to automatically evaluate his own task outputs, identify patterns in failures, and generate actionable improvement insights. This creates a closed feedback loop for continuous self-improvement.

Goals

  • Automatically grade completed sessions on multiple quality dimensions
  • Identify recurring error patterns and their root causes
  • Track performance trends over time
  • Generate actionable weekly improvement reports
  • Feed insights into adaptive routing decisions

2. Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Self-Grader Module                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   Parser     │───▶│   Analyzer   │───▶│   Grader     │  │
│  │              │    │              │    │              │  │
│  │ Reads session│    │ Extracts     │    │ Scores on 5  │  │
│  │ JSON files   │    │ metrics      │    │ dimensions   │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│         │                    │                   │          │
│         ▼                    ▼                   ▼          │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              SQLite Database Layer                   │  │
│  │  • session_grades table (individual scores)          │  │
│  │  • weekly_reports table (aggregated insights)        │  │
│  └──────────────────────────────────────────────────────┘  │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Pattern Identification                  │  │
│  │  • Task type performance analysis                    │  │
│  │  • Tool error frequency tracking                     │  │
│  │  • Error classification and clustering               │  │
│  └──────────────────────────────────────────────────────┘  │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Report Generator                        │  │
│  │  • Weekly summary with trends                        │  │
│  │  • Improvement suggestions                           │  │
│  │  • Performance alerts                                │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Downstream Consumers                           │
│  • Adaptive Routing (#88) - route based on task type        │
│  • Quality Scoring (#98) - external quality validation      │
│  • Skill Recommendations - identify skill gaps              │
│  • Alert System - notify on quality degradation             │
└─────────────────────────────────────────────────────────────┘

3. Grading Dimensions

3.1 Core Metrics (1-5 scale where applicable)

| Metric | Type | Description |
|--------|------|-------------|
| task_completed | boolean | Whether the task appears to have been completed successfully |
| tool_calls_efficient | int (1-5) | Efficiency of tool usage (error rate, call count) |
| response_quality | int (1-5) | Overall quality of the final response |
| errors_recovered | boolean | Whether errors were successfully recovered from |
| total_api_calls | int | Total number of API/assistant calls made |

3.2 Derived Metrics

| Metric | Description |
|--------|-------------|
| total_errors | Count of tool errors detected |
| error_types | Categorized error types (JSON list) |
| tools_with_errors | Tools that generated errors |
| had_repeated_errors | Flag for 3+ consecutive errors |
| had_infinite_loop_risk | Flag for 5+ consecutive errors or >50 calls |
| had_user_clarification | Whether clarification was requested |

4. Error Classification

The system classifies errors into categories for pattern analysis:

| Category | Pattern | Example |
|----------|---------|---------|
| file_not_found | File/path errors | "No such file or directory" |
| permission_denied | Access errors | "Permission denied" |
| timeout | Time limit exceeded | "Request timed out" |
| api_error | External API failures | "Rate limit exceeded" |
| syntax_error | Code/parsing errors | "Invalid syntax" |
| command_failed | Command execution | "exit_code": 1 |
| network_error | Connectivity issues | "Connection refused" |
| tool_not_found | Tool resolution | "Unknown tool" |
| unknown | Unclassified | Any other error |
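A minimal sketch of this classifier, assuming first-match substring/regex checks; the pattern strings below are illustrative, not the exact ones in self_grader.py:

```python
import re

# Illustrative (category, regex) pairs; checked in order, first match wins.
ERROR_PATTERNS = [
    ("file_not_found", r"no such file or directory|file not found"),
    ("permission_denied", r"permission denied|access denied"),
    ("timeout", r"timed? out|deadline exceeded"),
    ("api_error", r"rate limit|api error|quota"),
    ("syntax_error", r"invalid syntax|syntaxerror|parse error"),
    ("command_failed", r'"exit_code":\s*[1-9]|non-zero exit'),
    ("network_error", r"connection refused|connection reset|dns"),
    ("tool_not_found", r"unknown tool|tool not found"),
]

def classify_error(message: str) -> str:
    """Return the first matching category, or 'unknown' if nothing matches."""
    lowered = message.lower()
    for category, pattern in ERROR_PATTERNS:
        if re.search(pattern, lowered):
            return category
    return "unknown"
```

Ordering matters: more specific categories should precede broad ones so an ambiguous message lands in the most useful bucket.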

5. Task Type Inference

Sessions are categorized by task type for comparative analysis:

| Task Type | Pattern |
|-----------|---------|
| code_review | "review", "code review", "PR" |
| debugging | "debug", "fix", "troubleshoot" |
| feature_impl | "implement", "add feature", "build" |
| refactoring | "refactor", "clean up", "reorganize" |
| documentation | "document", "readme", "docstring" |
| testing | "test", "pytest", "unit test" |
| research | "research", "investigate", "look up" |
| deployment | "deploy", "release", "publish" |
| data_analysis | "analyze data", "process file", "parse" |
| infrastructure | "server", "docker", "kubernetes" |
| general | Default catch-all |
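The inference above can be sketched as a first-match keyword scan. The keyword lists mirror the table; word-boundary matching is an assumption here, added to avoid false hits such as "pr" inside "process":

```python
import re

# Checked in order; the first task type with a matching keyword wins.
TASK_KEYWORDS = [
    ("code_review", ["review", "code review", "pr"]),
    ("debugging", ["debug", "fix", "troubleshoot"]),
    ("feature_impl", ["implement", "add feature", "build"]),
    ("refactoring", ["refactor", "clean up", "reorganize"]),
    ("documentation", ["document", "readme", "docstring"]),
    ("testing", ["test", "pytest", "unit test"]),
    ("research", ["research", "investigate", "look up"]),
    ("deployment", ["deploy", "release", "publish"]),
    ("data_analysis", ["analyze data", "process file", "parse"]),
    ("infrastructure", ["server", "docker", "kubernetes"]),
]

def infer_task_type(summary: str) -> str:
    """Categorize a session by its task summary; 'general' is the catch-all."""
    lowered = summary.lower()
    for task_type, keywords in TASK_KEYWORDS:
        for kw in keywords:
            if re.search(r"\b" + re.escape(kw) + r"\b", lowered):
                return task_type
    return "general"
```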

6. Database Schema

6.1 session_grades Table

CREATE TABLE session_grades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT UNIQUE NOT NULL,
    session_file TEXT NOT NULL,
    graded_at TEXT NOT NULL,
    
    -- Core metrics
    task_completed INTEGER NOT NULL,
    tool_calls_efficient INTEGER NOT NULL,
    response_quality INTEGER NOT NULL,
    errors_recovered INTEGER NOT NULL,
    total_api_calls INTEGER NOT NULL,
    
    -- Metadata
    model TEXT,
    platform TEXT,
    session_start TEXT,
    duration_seconds REAL,
    task_summary TEXT,
    
    -- Error analysis
    total_errors INTEGER NOT NULL,
    error_types TEXT,  -- JSON array
    tools_with_errors TEXT,  -- JSON array
    
    -- Pattern flags
    had_repeated_errors INTEGER NOT NULL,
    had_infinite_loop_risk INTEGER NOT NULL,
    had_user_clarification INTEGER NOT NULL
);

6.2 weekly_reports Table

CREATE TABLE weekly_reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    week_start TEXT UNIQUE NOT NULL,
    week_end TEXT NOT NULL,
    generated_at TEXT NOT NULL,
    report_json TEXT NOT NULL  -- Serialized WeeklyReport
);
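A sketch of the persistence path for a single grade, assuming the schema in 6.1 has already been applied; `save_grade` and the grade dict keys are illustrative names, not the module's exact API:

```python
import json
import sqlite3

def save_grade(conn: sqlite3.Connection, grade: dict) -> None:
    """Upsert one session grade; list-valued fields are stored as JSON text."""
    conn.execute(
        """INSERT OR REPLACE INTO session_grades
           (session_id, session_file, graded_at, task_completed,
            tool_calls_efficient, response_quality, errors_recovered,
            total_api_calls, total_errors, error_types, tools_with_errors,
            had_repeated_errors, had_infinite_loop_risk, had_user_clarification)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            grade["session_id"], grade["session_file"], grade["graded_at"],
            int(grade["task_completed"]),            # booleans stored as 0/1
            grade["tool_calls_efficient"], grade["response_quality"],
            int(grade["errors_recovered"]),
            grade["total_api_calls"], grade["total_errors"],
            json.dumps(grade["error_types"]),        # JSON array column
            json.dumps(grade["tools_with_errors"]),  # JSON array column
            int(grade["had_repeated_errors"]),
            int(grade["had_infinite_loop_risk"]),
            int(grade["had_user_clarification"]),
        ),
    )
    conn.commit()
```

INSERT OR REPLACE leverages the UNIQUE constraint on session_id, so re-grading a session overwrites its previous row rather than duplicating it.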

7. Scoring Algorithms

7.1 Task Completion Detection

Positive indicators:

  • Final message contains completion phrases: "done", "completed", "success", "finished"
  • References to created outputs: "saved to", "here is", "output:"
  • Low error rate relative to total calls

Negative indicators:

  • Explicit failure phrases: "failed", "unable to", "could not"
  • Error rate > 30% of total calls
  • Empty or very short final response
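The indicators above can be combined into a single boolean heuristic. This is a hedged sketch: the check ordering, the phrase lists' exact contents, and the 20-character "very short" cutoff are assumptions:

```python
COMPLETION_PHRASES = ("done", "completed", "success", "finished",
                      "saved to", "here is", "output:")
FAILURE_PHRASES = ("failed", "unable to", "could not")

def detect_task_completion(final_message: str, total_errors: int,
                           total_api_calls: int) -> bool:
    """Heuristic completion check: negative indicators veto, then we look
    for any positive completion phrase in the final message."""
    text = final_message.lower().strip()
    if len(text) < 20:                      # empty or very short response
        return False
    if any(p in text for p in FAILURE_PHRASES):
        return False
    error_rate = total_errors / total_api_calls if total_api_calls else 0.0
    if error_rate > 0.3:                    # error rate > 30% of total calls
        return False
    return any(p in text for p in COMPLETION_PHRASES)
```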

7.2 Tool Efficiency Scoring

def score_tool_efficiency(total_errors: int, total_api_calls: int) -> int:
    # Guard against empty sessions rather than dividing by zero
    if total_api_calls == 0:
        return 3
    error_rate = total_errors / total_api_calls
    if error_rate == 0 and total_api_calls <= 10:
        return 5  # Perfect
    elif error_rate <= 0.1 and total_api_calls <= 15:
        return 4  # Good
    elif error_rate <= 0.25 and total_api_calls <= 25:
        return 3  # Average
    elif error_rate <= 0.4:
        return 2  # Poor
    else:
        return 1  # Very poor

7.3 Response Quality Scoring

Base score: 3 (average)

Additions:

  • Content length > 500 chars: +1
  • Content length > 1000 chars: +1
  • Contains code blocks: +1
  • Contains links/references: +0.5

Penalties:

  • Repeated errors: -1
  • Total errors > 5: -1
  • Infinite loop risk: -2

Range clamped to 1-5.
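A sketch of this additive scoring. Three details here are assumptions: Markdown fences are used to detect code blocks, URL prefixes to detect links, and the final float is rounded to the integer stored in response_quality:

```python
def score_response_quality(content: str, total_errors: int,
                           had_repeated_errors: bool,
                           had_infinite_loop_risk: bool) -> int:
    score = 3.0                                        # base: average
    if len(content) > 500:
        score += 1
    if len(content) > 1000:
        score += 1                                     # stacks with the +1 above
    if "```" in content:                               # code blocks (assumed marker)
        score += 1
    if "http://" in content or "https://" in content:  # links (assumed marker)
        score += 0.5
    if had_repeated_errors:
        score -= 1
    if total_errors > 5:
        score -= 1
    if had_infinite_loop_risk:
        score -= 2
    # Clamp to the 1-5 range, then round to the stored integer
    return int(round(max(1.0, min(5.0, score))))
```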


8. Pattern Identification

8.1 Per-Task-Type Analysis

Tracks average scores per task type to identify weak areas:

task_scores = {
    'code_review': 4.2,
    'debugging': 2.8,  # <-- Needs attention
    'feature_impl': 3.5,
}

8.2 Tool Error Frequency

Identifies which tools are most error-prone:

tool_errors = {
    'browser_navigate': 15,  # <-- High error rate
    'terminal': 5,
    'file_read': 2,
}

8.3 Error Pattern Clustering

Groups errors by type to identify systemic issues:

error_types = {
    'file_not_found': 12,  # <-- Need better path handling
    'timeout': 8,
    'api_error': 3,
}
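The three analyses above reduce to simple dictionary queries over the aggregated data. A sketch, where `find_problem_areas` and its thresholds are illustrative rather than the module's exact API:

```python
def find_problem_areas(task_scores: dict, tool_errors: dict, error_types: dict,
                       score_threshold: float = 3.0, top_n: int = 3):
    """Return (weak task types, noisiest tools, most common error types)."""
    # Task types scoring below threshold, weakest first
    weak_tasks = sorted(
        (t for t, s in task_scores.items() if s < score_threshold),
        key=lambda t: task_scores[t],
    )
    # Tools and error categories ranked by error count, descending
    noisy_tools = sorted(tool_errors, key=tool_errors.get, reverse=True)[:top_n]
    common_errors = sorted(error_types, key=error_types.get, reverse=True)[:top_n]
    return weak_tasks, noisy_tools, common_errors
```

Run against the example data above, this would flag debugging as a weak task type, browser_navigate as the noisiest tool, and file_not_found as the dominant error cluster.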

9. Weekly Report Generation

9.1 Report Contents

  1. Summary Statistics

    • Total sessions graded
    • Average tool efficiency
    • Average response quality
    • Task completion rate
    • Error recovery rate
  2. Problem Areas

    • Lowest scoring task types
    • Most error-prone tools
    • Common error patterns
  3. Improvement Suggestions

    • Actionable recommendations based on patterns
    • Skill gap identification
    • Process improvement tips
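One possible shape for the WeeklyReport that gets serialized into weekly_reports.report_json; the field names here are illustrative, not necessarily the module's actual dataclass:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class WeeklyReport:
    """Aggregated weekly insights, serializable to the report_json column."""
    week_start: str
    week_end: str
    total_sessions: int = 0
    avg_tool_efficiency: float = 0.0
    avg_response_quality: float = 0.0
    completion_rate: float = 0.0
    error_recovery_rate: float = 0.0
    problem_areas: list = field(default_factory=list)
    suggestions: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```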

9.2 Suggestion Generation Rules

| Condition | Suggestion |
|-----------|------------|
| completion_rate < 70% | "Add pre-task planning steps" |
| avg_tool_efficiency < 3 | "Review error recovery patterns" |
| error_recovery_rate < 80% | "Implement better fallback strategies" |
| Specific task type scores low | "Consider skill enhancement for {task}" |
| Specific tool has high errors | "Review usage patterns for {tool}" |
| Specific error type is common | "Add targeted error handling for {error}" |
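These rules map directly onto a rule table in code. In this sketch the stats dict keys, and the numeric cutoffs for "high errors" and "common", are assumptions:

```python
def generate_suggestions(stats: dict) -> list:
    """Apply the suggestion rules above to aggregate weekly stats."""
    suggestions = []
    if stats["completion_rate"] < 0.70:
        suggestions.append("Add pre-task planning steps")
    if stats["avg_tool_efficiency"] < 3:
        suggestions.append("Review error recovery patterns")
    if stats["error_recovery_rate"] < 0.80:
        suggestions.append("Implement better fallback strategies")
    for task, score in stats.get("task_scores", {}).items():
        if score < 3:
            suggestions.append(f"Consider skill enhancement for {task}")
    for tool, errors in stats.get("tool_errors", {}).items():
        if errors >= 10:   # "high errors" cutoff is an assumption
            suggestions.append(f"Review usage patterns for {tool}")
    for err, count in stats.get("error_types", {}).items():
        if count >= 10:    # "common" cutoff is an assumption
            suggestions.append(f"Add targeted error handling for {err}")
    return suggestions
```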

10. Integration Points

10.1 With Adaptive Routing (#88)

The grader feeds task-type performance data to the router:

# Router uses grader insights
if task_type == 'debugging' and grader.get_task_score('debugging') < 3:
    # Route to more capable model for debugging tasks
    model = 'claude-opus-4'

10.2 With Quality Scoring (#98)

Grader scores feed into external quality validation:

# Quality scorer validates grader accuracy
external_score = quality_scorer.validate(session, grader_score)
discrepancy = abs(external_score - grader_score)
if discrepancy > threshold:
    grader.calibrate()  # Adjust scoring algorithms

10.3 With Skill System

Identifies skills that could improve low-scoring areas:

if grader.get_task_score('debugging') < 3:
    recommend_skill('systematic-debugging')

11. Usage

11.1 Command Line

# Grade latest 10 ungraded sessions
python self_grader.py -g 10

# Show summary of last 7 days
python self_grader.py -s

# Show summary of last 30 days
python self_grader.py -s -d 30

# Generate weekly report
python self_grader.py -r

# Grade specific session file
python self_grader.py -f /path/to/session.json

11.2 Python API

from self_grader import SelfGrader

grader = SelfGrader()

# Grade latest sessions
grades = grader.grade_latest_sessions(n=10)

# Get pattern insights
patterns = grader.identify_patterns(days=7)

# Generate report
report = grader.generate_weekly_report()

# Get human-readable summary
print(grader.get_grades_summary(days=7))

12. Testing

Comprehensive test suite covers:

  1. Unit Tests

    • Error detection and classification
    • Scoring algorithms
    • Task type inference
  2. Integration Tests

    • Full session grading pipeline
    • Database operations
    • Report generation
  3. Edge Cases

    • Empty sessions
    • Sessions with infinite loops
    • Malformed session files

Run tests:

python -m pytest test_self_grader.py -v

13. Future Enhancements

  1. Machine Learning Integration

    • Train models to predict session success
    • Learn optimal tool sequences
    • Predict error likelihood
  2. Human-in-the-Loop Validation

    • Allow user override of grades
    • Collect explicit feedback
    • Calibrate scoring with human judgments
  3. Real-time Monitoring

    • Grade sessions as they complete
    • Alert on quality degradation
    • Live dashboard of metrics
  4. Cross-Session Learning

    • Identify recurring issues across similar tasks
    • Suggest skill improvements
    • Recommend tool alternatives

14. Files

| File | Description |
|------|-------------|
| self_grader.py | Main module with the SelfGrader class |
| test_self_grader.py | Comprehensive test suite |
| self_grader_design.md | This design document |
| ~/.timmy/uniwizard/session_grades.db | SQLite database (created at runtime) |

Document Version: 1.0
Created: 2026-03-30
Author: Timmy (UniWizard)