
Self-Grader Design Document

Issue: timmy-home #89 - "Build self-improvement loop: Timmy grades and learns from his own outputs"

Related Issues: #88 (Adaptive Routing), #98 (Quality Scoring)


1. Overview

The Self-Grader module enables Timmy to automatically evaluate his own task outputs, identify patterns in failures, and generate actionable improvement insights. This creates a closed feedback loop for continuous self-improvement.

Goals

  • Automatically grade completed sessions on multiple quality dimensions
  • Identify recurring error patterns and their root causes
  • Track performance trends over time
  • Generate actionable weekly improvement reports
  • Feed insights into adaptive routing decisions

2. Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Self-Grader Module                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   Parser     │───▶│   Analyzer   │───▶│   Grader     │  │
│  │              │    │              │    │              │  │
│  │ Reads session│    │ Extracts     │    │ Scores on 5  │  │
│  │ JSON files   │    │ metrics      │    │ dimensions   │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│         │                    │                   │          │
│         ▼                    ▼                   ▼          │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              SQLite Database Layer                   │  │
│  │  • session_grades table (individual scores)          │  │
│  │  • weekly_reports table (aggregated insights)        │  │
│  └──────────────────────────────────────────────────────┘  │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Pattern Identification                  │  │
│  │  • Task type performance analysis                    │  │
│  │  • Tool error frequency tracking                     │  │
│  │  • Error classification and clustering               │  │
│  └──────────────────────────────────────────────────────┘  │
│                            │                                │
│                            ▼                                │
│  ┌──────────────────────────────────────────────────────┐  │
│  │              Report Generator                        │  │
│  │  • Weekly summary with trends                        │  │
│  │  • Improvement suggestions                           │  │
│  │  • Performance alerts                                │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Downstream Consumers                           │
│  • Adaptive Routing (#88) - route based on task type        │
│  • Quality Scoring (#98) - external quality validation      │
│  • Skill Recommendations - identify skill gaps              │
│  • Alert System - notify on quality degradation             │
└─────────────────────────────────────────────────────────────┘

3. Grading Dimensions

3.1 Core Metrics (1-5 scale where applicable)

| Metric | Type | Description |
|--------|------|-------------|
| task_completed | boolean | Whether the task appears to have been completed successfully |
| tool_calls_efficient | int (1-5) | Efficiency of tool usage (error rate, call count) |
| response_quality | int (1-5) | Overall quality of the final response |
| errors_recovered | boolean | Whether errors were successfully recovered from |
| total_api_calls | int | Total number of API/assistant calls made |

3.2 Derived Metrics

| Metric | Description |
|--------|-------------|
| total_errors | Count of tool errors detected |
| error_types | Categorized error types (JSON list) |
| tools_with_errors | Tools that generated errors |
| had_repeated_errors | Flag for 3+ consecutive errors |
| had_infinite_loop_risk | Flag for 5+ consecutive errors or >50 calls |
| had_user_clarification | Whether clarification was requested |

4. Error Classification

The system classifies errors into categories for pattern analysis:

| Category | Pattern | Example |
|----------|---------|---------|
| file_not_found | File/path errors | "No such file or directory" |
| permission_denied | Access errors | "Permission denied" |
| timeout | Time limit exceeded | "Request timed out" |
| api_error | External API failures | "Rate limit exceeded" |
| syntax_error | Code/parsing errors | "Invalid syntax" |
| command_failed | Command execution | "exit_code": 1 |
| network_error | Connectivity issues | "Connection refused" |
| tool_not_found | Tool resolution | "Unknown tool" |
| unknown | Unclassified | Any other error |
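A minimal sketch of this classifier, assuming first-match substring/regex checks; the pattern strings below are illustrative, not the exact ones in self_grader.py:

```python
import re

# Illustrative (category, regex) pairs; checked in order, first match wins.
ERROR_PATTERNS = [
    ("file_not_found", r"no such file or directory|file not found"),
    ("permission_denied", r"permission denied|access denied"),
    ("timeout", r"timed? out|deadline exceeded"),
    ("api_error", r"rate limit|api error|quota"),
    ("syntax_error", r"invalid syntax|syntaxerror|parse error"),
    ("command_failed", r'"exit_code":\s*[1-9]|non-zero exit'),
    ("network_error", r"connection refused|connection reset|dns"),
    ("tool_not_found", r"unknown tool|tool not found"),
]

def classify_error(message: str) -> str:
    """Return the first matching category, or 'unknown' if nothing matches."""
    lowered = message.lower()
    for category, pattern in ERROR_PATTERNS:
        if re.search(pattern, lowered):
            return category
    return "unknown"
```

Ordering matters: more specific categories should precede broad ones so an ambiguous message lands in the most useful bucket.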

5. Task Type Inference

Sessions are categorized by task type for comparative analysis:

| Task Type | Pattern |
|-----------|---------|
| code_review | "review", "code review", "PR" |
| debugging | "debug", "fix", "troubleshoot" |
| feature_impl | "implement", "add feature", "build" |
| refactoring | "refactor", "clean up", "reorganize" |
| documentation | "document", "readme", "docstring" |
| testing | "test", "pytest", "unit test" |
| research | "research", "investigate", "look up" |
| deployment | "deploy", "release", "publish" |
| data_analysis | "analyze data", "process file", "parse" |
| infrastructure | "server", "docker", "kubernetes" |
| general | Default catch-all |
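The inference above can be sketched as a first-match keyword scan. The keyword lists mirror the table; word-boundary matching is an assumption here, added to avoid false hits such as "pr" inside "process":

```python
import re

# Checked in order; the first task type with a matching keyword wins.
TASK_KEYWORDS = [
    ("code_review", ["review", "code review", "pr"]),
    ("debugging", ["debug", "fix", "troubleshoot"]),
    ("feature_impl", ["implement", "add feature", "build"]),
    ("refactoring", ["refactor", "clean up", "reorganize"]),
    ("documentation", ["document", "readme", "docstring"]),
    ("testing", ["test", "pytest", "unit test"]),
    ("research", ["research", "investigate", "look up"]),
    ("deployment", ["deploy", "release", "publish"]),
    ("data_analysis", ["analyze data", "process file", "parse"]),
    ("infrastructure", ["server", "docker", "kubernetes"]),
]

def infer_task_type(summary: str) -> str:
    """Categorize a session by its task summary; 'general' is the catch-all."""
    lowered = summary.lower()
    for task_type, keywords in TASK_KEYWORDS:
        for kw in keywords:
            if re.search(r"\b" + re.escape(kw) + r"\b", lowered):
                return task_type
    return "general"
```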

6. Database Schema

6.1 session_grades Table

CREATE TABLE session_grades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT UNIQUE NOT NULL,
    session_file TEXT NOT NULL,
    graded_at TEXT NOT NULL,
    
    -- Core metrics
    task_completed INTEGER NOT NULL,
    tool_calls_efficient INTEGER NOT NULL,
    response_quality INTEGER NOT NULL,
    errors_recovered INTEGER NOT NULL,
    total_api_calls INTEGER NOT NULL,
    
    -- Metadata
    model TEXT,
    platform TEXT,
    session_start TEXT,
    duration_seconds REAL,
    task_summary TEXT,
    
    -- Error analysis
    total_errors INTEGER NOT NULL,
    error_types TEXT,  -- JSON array
    tools_with_errors TEXT,  -- JSON array
    
    -- Pattern flags
    had_repeated_errors INTEGER NOT NULL,
    had_infinite_loop_risk INTEGER NOT NULL,
    had_user_clarification INTEGER NOT NULL
);

6.2 weekly_reports Table

CREATE TABLE weekly_reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    week_start TEXT UNIQUE NOT NULL,
    week_end TEXT NOT NULL,
    generated_at TEXT NOT NULL,
    report_json TEXT NOT NULL  -- Serialized WeeklyReport
);
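A sketch of the persistence path for a single grade, assuming the schema in 6.1 has already been applied; `save_grade` and the grade dict keys are illustrative names, not the module's exact API:

```python
import json
import sqlite3

def save_grade(conn: sqlite3.Connection, grade: dict) -> None:
    """Upsert one session grade; list-valued fields are stored as JSON text."""
    conn.execute(
        """INSERT OR REPLACE INTO session_grades
           (session_id, session_file, graded_at, task_completed,
            tool_calls_efficient, response_quality, errors_recovered,
            total_api_calls, total_errors, error_types, tools_with_errors,
            had_repeated_errors, had_infinite_loop_risk, had_user_clarification)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            grade["session_id"], grade["session_file"], grade["graded_at"],
            int(grade["task_completed"]),            # booleans stored as 0/1
            grade["tool_calls_efficient"], grade["response_quality"],
            int(grade["errors_recovered"]),
            grade["total_api_calls"], grade["total_errors"],
            json.dumps(grade["error_types"]),        # JSON array column
            json.dumps(grade["tools_with_errors"]),  # JSON array column
            int(grade["had_repeated_errors"]),
            int(grade["had_infinite_loop_risk"]),
            int(grade["had_user_clarification"]),
        ),
    )
    conn.commit()
```

INSERT OR REPLACE leverages the UNIQUE constraint on session_id, so re-grading a session overwrites its previous row rather than duplicating it.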

7. Scoring Algorithms

7.1 Task Completion Detection

Positive indicators:

  • Final message contains completion phrases: "done", "completed", "success", "finished"
  • References to created outputs: "saved to", "here is", "output:"
  • Low error rate relative to total calls

Negative indicators:

  • Explicit failure phrases: "failed", "unable to", "could not"
  • Error rate > 30% of total calls
  • Empty or very short final response
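The indicators above can be combined into a single boolean heuristic. This is a hedged sketch: the check ordering, the phrase lists' exact contents, and the 20-character "very short" cutoff are assumptions:

```python
COMPLETION_PHRASES = ("done", "completed", "success", "finished",
                      "saved to", "here is", "output:")
FAILURE_PHRASES = ("failed", "unable to", "could not")

def detect_task_completion(final_message: str, total_errors: int,
                           total_api_calls: int) -> bool:
    """Heuristic completion check: negative indicators veto, then we look
    for any positive completion phrase in the final message."""
    text = final_message.lower().strip()
    if len(text) < 20:                      # empty or very short response
        return False
    if any(p in text for p in FAILURE_PHRASES):
        return False
    error_rate = total_errors / total_api_calls if total_api_calls else 0.0
    if error_rate > 0.3:                    # error rate > 30% of total calls
        return False
    return any(p in text for p in COMPLETION_PHRASES)
```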

7.2 Tool Efficiency Scoring

def score_tool_efficiency(total_errors: int, total_api_calls: int) -> int:
    # Guard against empty sessions rather than dividing by zero
    if total_api_calls == 0:
        return 3
    error_rate = total_errors / total_api_calls
    if error_rate == 0 and total_api_calls <= 10:
        return 5  # Perfect
    elif error_rate <= 0.1 and total_api_calls <= 15:
        return 4  # Good
    elif error_rate <= 0.25 and total_api_calls <= 25:
        return 3  # Average
    elif error_rate <= 0.4:
        return 2  # Poor
    else:
        return 1  # Very poor

7.3 Response Quality Scoring

Base score: 3 (average)

Additions:

  • Content length > 500 chars: +1
  • Content length > 1000 chars: +1
  • Contains code blocks: +1
  • Contains links/references: +0.5

Penalties:

  • Repeated errors: -1
  • Total errors > 5: -1
  • Infinite loop risk: -2

Range clamped to 1-5.
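A sketch of this additive scoring. Three details here are assumptions: Markdown fences are used to detect code blocks, URL prefixes to detect links, and the final float is rounded to the integer stored in response_quality:

```python
def score_response_quality(content: str, total_errors: int,
                           had_repeated_errors: bool,
                           had_infinite_loop_risk: bool) -> int:
    score = 3.0                                        # base: average
    if len(content) > 500:
        score += 1
    if len(content) > 1000:
        score += 1                                     # stacks with the +1 above
    if "```" in content:                               # code blocks (assumed marker)
        score += 1
    if "http://" in content or "https://" in content:  # links (assumed marker)
        score += 0.5
    if had_repeated_errors:
        score -= 1
    if total_errors > 5:
        score -= 1
    if had_infinite_loop_risk:
        score -= 2
    # Clamp to the 1-5 range, then round to the stored integer
    return int(round(max(1.0, min(5.0, score))))
```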


8. Pattern Identification

8.1 Per-Task-Type Analysis

Tracks average scores per task type to identify weak areas:

task_scores = {
    'code_review': 4.2,
    'debugging': 2.8,  # <-- Needs attention
    'feature_impl': 3.5,
}

8.2 Tool Error Frequency

Identifies which tools are most error-prone:

tool_errors = {
    'browser_navigate': 15,  # <-- High error rate
    'terminal': 5,
    'file_read': 2,
}

8.3 Error Pattern Clustering

Groups errors by type to identify systemic issues:

error_types = {
    'file_not_found': 12,  # <-- Need better path handling
    'timeout': 8,
    'api_error': 3,
}
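The three analyses above reduce to simple dictionary queries over the aggregated data. A sketch, where `find_problem_areas` and its thresholds are illustrative rather than the module's exact API:

```python
def find_problem_areas(task_scores: dict, tool_errors: dict, error_types: dict,
                       score_threshold: float = 3.0, top_n: int = 3):
    """Return (weak task types, noisiest tools, most common error types)."""
    # Task types scoring below threshold, weakest first
    weak_tasks = sorted(
        (t for t, s in task_scores.items() if s < score_threshold),
        key=lambda t: task_scores[t],
    )
    # Tools and error categories ranked by error count, descending
    noisy_tools = sorted(tool_errors, key=tool_errors.get, reverse=True)[:top_n]
    common_errors = sorted(error_types, key=error_types.get, reverse=True)[:top_n]
    return weak_tasks, noisy_tools, common_errors
```

Run against the example data above, this would flag debugging as a weak task type, browser_navigate as the noisiest tool, and file_not_found as the dominant error cluster.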

9. Weekly Report Generation

9.1 Report Contents

  1. Summary Statistics

    • Total sessions graded
    • Average tool efficiency
    • Average response quality
    • Task completion rate
    • Error recovery rate
  2. Problem Areas

    • Lowest scoring task types
    • Most error-prone tools
    • Common error patterns
  3. Improvement Suggestions

    • Actionable recommendations based on patterns
    • Skill gap identification
    • Process improvement tips
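One possible shape for the WeeklyReport that gets serialized into weekly_reports.report_json; the field names here are illustrative, not necessarily the module's actual dataclass:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class WeeklyReport:
    """Aggregated weekly insights, serializable to the report_json column."""
    week_start: str
    week_end: str
    total_sessions: int = 0
    avg_tool_efficiency: float = 0.0
    avg_response_quality: float = 0.0
    completion_rate: float = 0.0
    error_recovery_rate: float = 0.0
    problem_areas: list = field(default_factory=list)
    suggestions: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```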

9.2 Suggestion Generation Rules

| Condition | Suggestion |
|-----------|------------|
| completion_rate < 70% | "Add pre-task planning steps" |
| avg_tool_efficiency < 3 | "Review error recovery patterns" |
| error_recovery_rate < 80% | "Implement better fallback strategies" |
| Specific task type scores low | "Consider skill enhancement for {task}" |
| Specific tool has high errors | "Review usage patterns for {tool}" |
| Specific error type is common | "Add targeted error handling for {error}" |
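These rules map directly onto a rule table in code. In this sketch the stats dict keys, and the numeric cutoffs for "high errors" and "common", are assumptions:

```python
def generate_suggestions(stats: dict) -> list:
    """Apply the suggestion rules above to aggregate weekly stats."""
    suggestions = []
    if stats["completion_rate"] < 0.70:
        suggestions.append("Add pre-task planning steps")
    if stats["avg_tool_efficiency"] < 3:
        suggestions.append("Review error recovery patterns")
    if stats["error_recovery_rate"] < 0.80:
        suggestions.append("Implement better fallback strategies")
    for task, score in stats.get("task_scores", {}).items():
        if score < 3:
            suggestions.append(f"Consider skill enhancement for {task}")
    for tool, errors in stats.get("tool_errors", {}).items():
        if errors >= 10:   # "high errors" cutoff is an assumption
            suggestions.append(f"Review usage patterns for {tool}")
    for err, count in stats.get("error_types", {}).items():
        if count >= 10:    # "common" cutoff is an assumption
            suggestions.append(f"Add targeted error handling for {err}")
    return suggestions
```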

10. Integration Points

10.1 With Adaptive Routing (#88)

The grader feeds task-type performance data to the router:

# Router uses grader insights
if task_type == 'debugging' and grader.get_task_score('debugging') < 3:
    # Route to more capable model for debugging tasks
    model = 'claude-opus-4'

10.2 With Quality Scoring (#98)

Grader scores feed into external quality validation:

# Quality scorer validates grader accuracy
external_score = quality_scorer.validate(session, grader_score)
discrepancy = abs(external_score - grader_score)
if discrepancy > threshold:
    grader.calibrate()  # Adjust scoring algorithms

10.3 With Skill System

Identifies skills that could improve low-scoring areas:

if grader.get_task_score('debugging') < 3:
    recommend_skill('systematic-debugging')

11. Usage

11.1 Command Line

# Grade latest 10 ungraded sessions
python self_grader.py -g 10

# Show summary of last 7 days
python self_grader.py -s

# Show summary of last 30 days
python self_grader.py -s -d 30

# Generate weekly report
python self_grader.py -r

# Grade specific session file
python self_grader.py -f /path/to/session.json

11.2 Python API

from self_grader import SelfGrader

grader = SelfGrader()

# Grade latest sessions
grades = grader.grade_latest_sessions(n=10)

# Get pattern insights
patterns = grader.identify_patterns(days=7)

# Generate report
report = grader.generate_weekly_report()

# Get human-readable summary
print(grader.get_grades_summary(days=7))

12. Testing

Comprehensive test suite covers:

  1. Unit Tests

    • Error detection and classification
    • Scoring algorithms
    • Task type inference
  2. Integration Tests

    • Full session grading pipeline
    • Database operations
    • Report generation
  3. Edge Cases

    • Empty sessions
    • Sessions with infinite loops
    • Malformed session files

Run tests:

python -m pytest test_self_grader.py -v

13. Future Enhancements

  1. Machine Learning Integration

    • Train models to predict session success
    • Learn optimal tool sequences
    • Predict error likelihood
  2. Human-in-the-Loop Validation

    • Allow user override of grades
    • Collect explicit feedback
    • Calibrate scoring with human judgments
  3. Real-time Monitoring

    • Grade sessions as they complete
    • Alert on quality degradation
    • Live dashboard of metrics
  4. Cross-Session Learning

    • Identify recurring issues across similar tasks
    • Suggest skill improvements
    • Recommend tool alternatives

14. Files

| File | Description |
|------|-------------|
| self_grader.py | Main module with the SelfGrader class |
| test_self_grader.py | Comprehensive test suite |
| self_grader_design.md | This design document |
| ~/.timmy/uniwizard/session_grades.db | SQLite database (created at runtime) |

Document Version: 1.0
Created: 2026-03-30
Author: Timmy (UniWizard)