# Self-Grader Design Document

**Issue:** timmy-home #89 - "Build self-improvement loop: Timmy grades and learns from his own outputs"
**Related Issues:** #88 (Adaptive Routing), #98 (Quality Scoring)

---

## 1. Overview

The Self-Grader module enables Timmy to automatically evaluate his own task outputs, identify patterns in failures, and generate actionable improvement insights. This creates a closed feedback loop for continuous self-improvement.

### Goals

- Automatically grade completed sessions on multiple quality dimensions
- Identify recurring error patterns and their root causes
- Track performance trends over time
- Generate actionable weekly improvement reports
- Feed insights into adaptive routing decisions

---

## 2. Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     Self-Grader Module                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │    Parser    │───▶│   Analyzer   │───▶│    Grader    │   │
│  │              │    │              │    │              │   │
│  │ Reads session│    │   Extracts   │    │ Scores on 5  │   │
│  │  JSON files  │    │   metrics    │    │  dimensions  │   │
│  └──────────────┘    └──────────────┘    └──────────────┘   │
│         │                   │                   │           │
│         ▼                   ▼                   ▼           │
│  ┌──────────────────────────────────────────────────────┐   │
│  │               SQLite Database Layer                  │   │
│  │  • session_grades table (individual scores)          │   │
│  │  • weekly_reports table (aggregated insights)        │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │               Pattern Identification                 │   │
│  │  • Task type performance analysis                    │   │
│  │  • Tool error frequency tracking                     │   │
│  │  • Error classification and clustering               │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                  Report Generator                    │   │
│  │  • Weekly summary with trends                        │   │
│  │  • Improvement suggestions                           │   │
│  │  • Performance alerts                                │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│                    Downstream Consumers                     │
│  • Adaptive Routing (#88) - route based on task type        │
│  • Quality Scoring (#98) - external quality validation      │
│  • Skill Recommendations - identify skill gaps              │
│  • Alert System - notify on quality degradation             │
└─────────────────────────────────────────────────────────────┘
```

---

## 3. Grading Dimensions

### 3.1 Core Metrics (1-5 scale where applicable)

| Metric | Type | Description |
|--------|------|-------------|
| `task_completed` | boolean | Whether the task appears to have been completed successfully |
| `tool_calls_efficient` | int (1-5) | Efficiency of tool usage (error rate, call count) |
| `response_quality` | int (1-5) | Overall quality of the final response |
| `errors_recovered` | boolean | Whether errors were successfully recovered from |
| `total_api_calls` | int | Total number of API/assistant calls made |

### 3.2 Derived Metrics

| Metric | Description |
|--------|-------------|
| `total_errors` | Count of tool errors detected |
| `error_types` | Categorized error types (JSON list) |
| `tools_with_errors` | Tools that generated errors |
| `had_repeated_errors` | Flag for 3+ consecutive errors |
| `had_infinite_loop_risk` | Flag for 5+ consecutive errors or >50 calls |
| `had_user_clarification` | Whether clarification was requested |

---
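The error-streak flags in §3.2 can be derived from the ordered per-call outcomes of a session. A minimal sketch, assuming a helper that takes a list of booleans (`True` = the call errored); the function name and input shape are illustrative, not the actual implementation:

```python
def derive_pattern_flags(call_outcomes):
    """Compute the section 3.2 error-streak flags.

    call_outcomes: ordered list of booleans, True if that call errored.
    (had_user_clarification is not derivable from outcomes alone.)
    """
    max_streak = streak = 0
    for errored in call_outcomes:
        streak = streak + 1 if errored else 0
        max_streak = max(max_streak, streak)
    return {
        "total_errors": sum(call_outcomes),
        "had_repeated_errors": max_streak >= 3,      # 3+ consecutive errors
        "had_infinite_loop_risk": max_streak >= 5    # 5+ consecutive errors
        or len(call_outcomes) > 50,                  # ...or >50 calls total
    }
```

Tracking the maximum streak (rather than only the current one) is what lets a session that recovered mid-way still be flagged for repeated errors.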
## 4. Error Classification

The system classifies errors into categories for pattern analysis:

| Category | Pattern | Example |
|----------|---------|---------|
| `file_not_found` | File/path errors | "No such file or directory" |
| `permission_denied` | Access errors | "Permission denied" |
| `timeout` | Time limit exceeded | "Request timed out" |
| `api_error` | External API failures | "Rate limit exceeded" |
| `syntax_error` | Code/parsing errors | "Invalid syntax" |
| `command_failed` | Command execution | `"exit_code": 1` |
| `network_error` | Connectivity issues | "Connection refused" |
| `tool_not_found` | Tool resolution | "Unknown tool" |
| `unknown` | Unclassified | Any other error |

---

## 5. Task Type Inference

Sessions are categorized by task type for comparative analysis:

| Task Type | Pattern |
|-----------|---------|
| `code_review` | "review", "code review", "PR" |
| `debugging` | "debug", "fix", "troubleshoot" |
| `feature_impl` | "implement", "add feature", "build" |
| `refactoring` | "refactor", "clean up", "reorganize" |
| `documentation` | "document", "readme", "docstring" |
| `testing` | "test", "pytest", "unit test" |
| `research` | "research", "investigate", "look up" |
| `deployment` | "deploy", "release", "publish" |
| `data_analysis` | "analyze data", "process file", "parse" |
| `infrastructure` | "server", "docker", "kubernetes" |
| `general` | Default catch-all |

---
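Both the error-classification and task-inference tables describe first-match keyword lookups. A minimal sketch of such a matcher, with abbreviated pattern tables (the full keyword lists and any regex handling in the real implementation may differ; list order determines precedence):

```python
# Abbreviated from the section 4 and section 5 tables; first match wins.
ERROR_PATTERNS = [
    ("file_not_found", ["no such file", "file not found"]),
    ("permission_denied", ["permission denied"]),
    ("timeout", ["timed out", "timeout"]),
    ("api_error", ["rate limit"]),
    ("network_error", ["connection refused"]),
]

TASK_PATTERNS = [
    ("debugging", ["debug", "fix", "troubleshoot"]),
    ("testing", ["pytest", "unit test"]),
    ("documentation", ["readme", "docstring"]),
]

def classify(text, patterns, default):
    """Return the first category whose keywords appear in text (substring match)."""
    lowered = text.lower()
    for category, keywords in patterns:
        if any(keyword in lowered for keyword in keywords):
            return category
    return default
```

Plain substring matching is deliberately naive here; short keywords like "PR" would need word-boundary handling to avoid matching inside unrelated words.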
## 6. Database Schema

### 6.1 session_grades Table

```sql
CREATE TABLE session_grades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT UNIQUE NOT NULL,
    session_file TEXT NOT NULL,
    graded_at TEXT NOT NULL,

    -- Core metrics
    task_completed INTEGER NOT NULL,
    tool_calls_efficient INTEGER NOT NULL,
    response_quality INTEGER NOT NULL,
    errors_recovered INTEGER NOT NULL,
    total_api_calls INTEGER NOT NULL,

    -- Metadata
    model TEXT,
    platform TEXT,
    session_start TEXT,
    duration_seconds REAL,
    task_summary TEXT,

    -- Error analysis
    total_errors INTEGER NOT NULL,
    error_types TEXT,        -- JSON array
    tools_with_errors TEXT,  -- JSON array

    -- Pattern flags
    had_repeated_errors INTEGER NOT NULL,
    had_infinite_loop_risk INTEGER NOT NULL,
    had_user_clarification INTEGER NOT NULL
);
```

### 6.2 weekly_reports Table

```sql
CREATE TABLE weekly_reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    week_start TEXT UNIQUE NOT NULL,
    week_end TEXT NOT NULL,
    generated_at TEXT NOT NULL,
    report_json TEXT NOT NULL  -- Serialized WeeklyReport
);
```

---
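Because `session_id` is declared `UNIQUE`, re-grading a session can be handled as an upsert rather than a failing insert. A minimal sketch against a trimmed version of the schema (the helper name and column subset are illustrative; requires SQLite 3.24+ for `ON CONFLICT ... DO UPDATE`):

```python
import sqlite3

# Trimmed-down session_grades for illustration; the real table has more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS session_grades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT UNIQUE NOT NULL,
    graded_at TEXT NOT NULL,
    response_quality INTEGER NOT NULL,
    total_errors INTEGER NOT NULL
);
"""

def record_grade(conn, session_id, graded_at, quality, errors):
    """Insert a grade, replacing any previous grade for the same session."""
    conn.execute(
        """INSERT INTO session_grades
               (session_id, graded_at, response_quality, total_errors)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(session_id) DO UPDATE SET
               graded_at = excluded.graded_at,
               response_quality = excluded.response_quality,
               total_errors = excluded.total_errors""",
        (session_id, graded_at, quality, errors),
    )
    conn.commit()
```

The upsert keeps `id` stable across re-grades, so downstream consumers holding row references are unaffected when a session is graded again.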
## 7. Scoring Algorithms

### 7.1 Task Completion Detection

Positive indicators:

- Final message contains completion phrases: "done", "completed", "success", "finished"
- References to created outputs: "saved to", "here is", "output:"
- Low error rate relative to total calls

Negative indicators:

- Explicit failure phrases: "failed", "unable to", "could not"
- Error rate > 30% of total calls
- Empty or very short final response

### 7.2 Tool Efficiency Scoring

```python
error_rate = total_errors / total_api_calls if total_api_calls else 0.0

if error_rate == 0 and total_api_calls <= 10:
    score = 5  # Perfect
elif error_rate <= 0.1 and total_api_calls <= 15:
    score = 4  # Good
elif error_rate <= 0.25 and total_api_calls <= 25:
    score = 3  # Average
elif error_rate <= 0.4:
    score = 2  # Poor
else:
    score = 1  # Very poor
```

### 7.3 Response Quality Scoring

Base score: 3 (average).

Additions:

- Content length > 500 chars: +1
- Content length > 1000 chars: +1
- Contains code blocks: +1
- Contains links/references: +0.5

Penalties:

- Repeated errors: -1
- Total errors > 5: -1
- Infinite loop risk: -2

The result is clamped to the 1-5 range.

---

## 8. Pattern Identification

### 8.1 Per-Task-Type Analysis

Tracks average scores per task type to identify weak areas:

```python
task_scores = {
    'code_review': 4.2,
    'debugging': 2.8,    # <-- Needs attention
    'feature_impl': 3.5,
}
```

### 8.2 Tool Error Frequency

Identifies which tools are most error-prone:

```python
tool_errors = {
    'browser_navigate': 15,  # <-- High error rate
    'terminal': 5,
    'file_read': 2,
}
```

### 8.3 Error Pattern Clustering

Groups errors by type to identify systemic issues:

```python
error_types = {
    'file_not_found': 12,  # <-- Need better path handling
    'timeout': 8,
    'api_error': 3,
}
```

---

## 9. Weekly Report Generation

### 9.1 Report Contents

1. **Summary Statistics**
   - Total sessions graded
   - Average tool efficiency
   - Average response quality
   - Task completion rate
   - Error recovery rate
2. **Problem Areas**
   - Lowest-scoring task types
   - Most error-prone tools
   - Common error patterns
3. **Improvement Suggestions**
   - Actionable recommendations based on patterns
   - Skill gap identification
   - Process improvement tips

### 9.2 Suggestion Generation Rules

| Condition | Suggestion |
|-----------|------------|
| completion_rate < 70% | "Add pre-task planning steps" |
| avg_tool_efficiency < 3 | "Review error recovery patterns" |
| error_recovery_rate < 80% | "Implement better fallback strategies" |
| Specific task type scores low | "Consider skill enhancement for {task}" |
| Specific tool has high errors | "Review usage patterns for {tool}" |
| Specific error is common | "Add targeted error handling for {error}" |

---

## 10. Integration Points

### 10.1 With Adaptive Routing (#88)

The grader feeds task-type performance data to the router:

```python
# Router uses grader insights
if task_type == 'debugging' and grader.get_task_score('debugging') < 3:
    # Route to a more capable model for debugging tasks
    model = 'claude-opus-4'
```

### 10.2 With Quality Scoring (#98)

Grader scores feed into external quality validation:

```python
# Quality scorer validates grader accuracy
external_score = quality_scorer.validate(session, grader_score)
discrepancy = abs(external_score - grader_score)
if discrepancy > threshold:
    grader.calibrate()  # Adjust scoring algorithms
```

### 10.3 With Skill System

Identifies skills that could improve low-scoring areas:

```python
if grader.get_task_score('debugging') < 3:
    recommend_skill('systematic-debugging')
```

---
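Circling back to §9.2, the suggestion rules are plain threshold checks over the weekly aggregates. A minimal sketch, assuming the aggregates arrive as a dict whose keys mirror the condition column (rates expressed as fractions; the per-tool and per-error rules would follow the same shape as the per-task loop):

```python
def generate_suggestions(stats):
    """Map weekly aggregate stats to suggestions per the section 9.2 rules."""
    suggestions = []
    if stats["completion_rate"] < 0.70:
        suggestions.append("Add pre-task planning steps")
    if stats["avg_tool_efficiency"] < 3:
        suggestions.append("Review error recovery patterns")
    if stats["error_recovery_rate"] < 0.80:
        suggestions.append("Implement better fallback strategies")
    for task, score in stats.get("task_scores", {}).items():
        if score < 3:  # task type scoring below average
            suggestions.append(f"Consider skill enhancement for {task}")
    return suggestions
```

Keeping the rules as flat, independent checks makes it easy to add or tune a threshold without touching report assembly.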
## 11. Usage

### 11.1 Command Line

```bash
# Grade latest 10 ungraded sessions
python self_grader.py -g 10

# Show summary of last 7 days
python self_grader.py -s

# Show summary of last 30 days
python self_grader.py -s -d 30

# Generate weekly report
python self_grader.py -r

# Grade specific session file
python self_grader.py -f /path/to/session.json
```

### 11.2 Python API

```python
from self_grader import SelfGrader

grader = SelfGrader()

# Grade latest sessions
grades = grader.grade_latest_sessions(n=10)

# Get pattern insights
patterns = grader.identify_patterns(days=7)

# Generate report
report = grader.generate_weekly_report()

# Get human-readable summary
print(grader.get_grades_summary(days=7))
```

---

## 12. Testing

A comprehensive test suite covers:

1. **Unit Tests**
   - Error detection and classification
   - Scoring algorithms
   - Task type inference
2. **Integration Tests**
   - Full session grading pipeline
   - Database operations
   - Report generation
3. **Edge Cases**
   - Empty sessions
   - Sessions with infinite loops
   - Malformed session files

Run tests:

```bash
python -m pytest test_self_grader.py -v
```

---

## 13. Future Enhancements

1. **Machine Learning Integration**
   - Train models to predict session success
   - Learn optimal tool sequences
   - Predict error likelihood
2. **Human-in-the-Loop Validation**
   - Allow user override of grades
   - Collect explicit feedback
   - Calibrate scoring with human judgments
3. **Real-time Monitoring**
   - Grade sessions as they complete
   - Alert on quality degradation
   - Live dashboard of metrics
4. **Cross-Session Learning**
   - Identify recurring issues across similar tasks
   - Suggest skill improvements
   - Recommend tool alternatives

---

## 14. Files

| File | Description |
|------|-------------|
| `self_grader.py` | Main module with the `SelfGrader` class |
| `test_self_grader.py` | Comprehensive test suite |
| `self_grader_design.md` | This design document |
| `~/.timmy/uniwizard/session_grades.db` | SQLite database (created at runtime) |

---

*Document Version: 1.0*
*Created: 2026-03-30*
*Author: Timmy (UniWizard)*