Compare commits
16 Commits
feature/sy
...
feat/kimic
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
02f1a39e2a | ||
|
|
a79c0c9d5d | ||
|
|
a0ec802403 | ||
|
|
ee7f37c5c7 | ||
| 83e400d4aa | |||
|
|
24bab6f882 | ||
|
|
100e3fc416 | ||
|
|
8494ee344b | ||
|
|
9a100be8d1 | ||
| 276f2c32dd | |||
| 973f3bbe5a | |||
|
|
5f549bf1f6 | ||
|
|
6685388357 | ||
| a95da9e73d | |||
| 5e8380b858 | |||
|
|
266d6ec008 |
22
configs/llama-server.service
Normal file
22
configs/llama-server.service
Normal file
@@ -0,0 +1,22 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=llama.cpp inference server for Timmy
|
||||||
|
After=network.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=simple
|
||||||
|
User=root
|
||||||
|
WorkingDirectory=/root/timmy
|
||||||
|
ExecStart=/root/timmy/llama-server \
|
||||||
|
-m /root/timmy/models/hermes-3-8b.Q4_K_M.gguf \
|
||||||
|
--host 127.0.0.1 \
|
||||||
|
--port 8081 \
|
||||||
|
-c 8192 \
|
||||||
|
-np 1 \
|
||||||
|
--jinja \
|
||||||
|
-ngl 0
|
||||||
|
Restart=always
|
||||||
|
RestartSec=10
|
||||||
|
Environment="HOME=/root"
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
17
configs/timmy-agent.service
Normal file
17
configs/timmy-agent.service
Normal file
@@ -0,0 +1,17 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Timmy Agent Harness
|
||||||
|
After=llama-server.service
|
||||||
|
Requires=llama-server.service
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=simple
|
||||||
|
User=root
|
||||||
|
WorkingDirectory=/root/timmy
|
||||||
|
ExecStart=/root/timmy/venv/bin/python /root/timmy/timmy-home/agent/agent_daemon.py
|
||||||
|
Restart=always
|
||||||
|
RestartSec=30
|
||||||
|
Environment="HOME=/root"
|
||||||
|
Environment="TIMMY_MODEL_URL=http://127.0.0.1:8081"
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
16
configs/timmy-health.service
Normal file
16
configs/timmy-health.service
Normal file
@@ -0,0 +1,16 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Timmy Health Check Daemon
|
||||||
|
After=network.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=simple
|
||||||
|
User=root
|
||||||
|
WorkingDirectory=/root/timmy
|
||||||
|
ExecStart=/root/timmy/venv/bin/python /root/timmy/uni-wizard/daemons/health_daemon.py
|
||||||
|
Restart=always
|
||||||
|
RestartSec=10
|
||||||
|
Environment="HOME=/root"
|
||||||
|
Environment="PYTHONPATH=/root/timmy/uni-wizard"
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
16
configs/timmy-task-router.service
Normal file
16
configs/timmy-task-router.service
Normal file
@@ -0,0 +1,16 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Timmy Task Router Daemon
|
||||||
|
After=network.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=simple
|
||||||
|
User=root
|
||||||
|
WorkingDirectory=/root/timmy
|
||||||
|
ExecStart=/root/timmy/venv/bin/python /root/timmy/uni-wizard/daemons/task_router.py
|
||||||
|
Restart=always
|
||||||
|
RestartSec=10
|
||||||
|
Environment="HOME=/root"
|
||||||
|
Environment="PYTHONPATH=/root/timmy/uni-wizard"
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
125
docs/SCORECARD.md
Normal file
125
docs/SCORECARD.md
Normal file
@@ -0,0 +1,125 @@
|
|||||||
|
# Scorecard Generator Documentation
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The Scorecard Generator analyzes overnight loop JSONL data and produces comprehensive reports with statistics, trends, and recommendations.
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
### Basic Usage
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Generate scorecard from default input directory
|
||||||
|
python uni-wizard/scripts/generate_scorecard.py
|
||||||
|
|
||||||
|
# Specify custom input/output directories
|
||||||
|
python uni-wizard/scripts/generate_scorecard.py \
|
||||||
|
--input ~/shared/overnight-loop \
|
||||||
|
--output ~/timmy/reports
|
||||||
|
```
|
||||||
|
|
||||||
|
### Cron Setup
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Generate scorecard every morning at 6 AM
|
||||||
|
0 6 * * * /root/timmy/venv/bin/python /root/timmy/uni-wizard/scripts/generate_scorecard.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Input Format
|
||||||
|
|
||||||
|
JSONL files in `~/shared/overnight-loop/*.jsonl`:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{"task": "read-soul", "status": "pass", "duration_s": 19.7, "timestamp": "2026-03-29T21:54:12Z"}
|
||||||
|
{"task": "check-health", "status": "fail", "duration_s": 5.2, "error": "timeout", "timestamp": "2026-03-29T22:15:33Z"}
|
||||||
|
```
|
||||||
|
|
||||||
|
Fields:
|
||||||
|
- `task`: Task identifier
|
||||||
|
- `status`: "pass" or "fail"
|
||||||
|
- `duration_s`: Execution time in seconds
|
||||||
|
- `timestamp`: ISO 8601 timestamp
|
||||||
|
- `error`: Error message (for failed tasks)
|
||||||
|
|
||||||
|
## Output
|
||||||
|
|
||||||
|
### JSON Report
|
||||||
|
|
||||||
|
`~/timmy/reports/scorecard_YYYYMMDD.json`:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"generated_at": "2026-03-30T06:00:00Z",
|
||||||
|
"summary": {
|
||||||
|
"total_tasks": 100,
|
||||||
|
"passed": 95,
|
||||||
|
"failed": 5,
|
||||||
|
"pass_rate": 95.0,
|
||||||
|
"duration_stats": {
|
||||||
|
"avg": 12.5,
|
||||||
|
"median": 10.2,
|
||||||
|
"p95": 45.0,
|
||||||
|
"min": 1.2,
|
||||||
|
"max": 120.5
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"by_task": {...},
|
||||||
|
"by_hour": {...},
|
||||||
|
"errors": {...},
|
||||||
|
"recommendations": [...]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Markdown Report
|
||||||
|
|
||||||
|
`~/timmy/reports/scorecard_YYYYMMDD.md`:
|
||||||
|
|
||||||
|
- Executive summary with pass/fail counts
|
||||||
|
- Duration statistics (avg, median, p95)
|
||||||
|
- Per-task breakdown with pass rates
|
||||||
|
- Hourly timeline showing performance trends
|
||||||
|
- Error analysis with frequency counts
|
||||||
|
- Actionable recommendations
|
||||||
|
|
||||||
|
## Report Interpretation
|
||||||
|
|
||||||
|
### Pass Rate Thresholds
|
||||||
|
|
||||||
|
| Pass Rate | Status | Action |
|
||||||
|
|-----------|--------|--------|
|
||||||
|
| 95%+ | ✅ Excellent | Continue current operations |
|
||||||
|
| 85-94% | ⚠️ Good | Monitor for degradation |
|
||||||
|
| 70-84% | ⚠️ Fair | Review failing tasks |
|
||||||
|
| <70% | ❌ Poor | Immediate investigation required |
|
||||||
|
|
||||||
|
### Duration Guidelines
|
||||||
|
|
||||||
|
| Duration | Assessment |
|
||||||
|
|----------|------------|
|
||||||
|
| <5s | Fast |
|
||||||
|
| 5-15s | Normal |
|
||||||
|
| 15-30s | Slow |
|
||||||
|
| >30s | Very slow - consider optimization |
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### No JSONL files found
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check input directory
|
||||||
|
ls -la ~/shared/overnight-loop/
|
||||||
|
|
||||||
|
# Ensure Syncthing is syncing
|
||||||
|
systemctl status syncthing@root
|
||||||
|
```
|
||||||
|
|
||||||
|
### Malformed lines
|
||||||
|
|
||||||
|
The generator skips malformed lines with a warning. Check the JSONL files for syntax errors.
|
||||||
|
|
||||||
|
### Empty reports
|
||||||
|
|
||||||
|
If no data exists, verify:
|
||||||
|
1. Overnight loop is running and writing JSONL
|
||||||
|
2. File permissions allow reading
|
||||||
|
3. Input path is correct
|
||||||
@@ -1,15 +1,24 @@
|
|||||||
{
|
{
|
||||||
"tick_id": "20260328_015026",
|
"tick_id": "20260330_212052",
|
||||||
"timestamp": "2026-03-28T01:50:26.595915+00:00",
|
"timestamp": "2026-03-30T21:20:52.930215+00:00",
|
||||||
"perception": {
|
"perception": {
|
||||||
"gitea_alive": true,
|
"gitea_alive": true,
|
||||||
"model_health": {
|
"model_health": {
|
||||||
"ollama_running": true,
|
"provider": "local-llama.cpp",
|
||||||
"models_loaded": [],
|
"provider_base_url": "http://localhost:8081/v1",
|
||||||
|
"provider_model": "hermes4:14b",
|
||||||
|
"local_inference_running": true,
|
||||||
|
"models_loaded": [
|
||||||
|
"NousResearch_Hermes-4-14B-Q4_K_M.gguf"
|
||||||
|
],
|
||||||
"api_responding": true,
|
"api_responding": true,
|
||||||
"inference_ok": false,
|
"inference_ok": false,
|
||||||
"inference_error": "HTTP Error 404: Not Found",
|
"inference_error": "HTTP Error 500: Internal Server Error",
|
||||||
"timestamp": "2026-03-28T01:50:26.594893+00:00"
|
"latest_session": "session_d8c25163-9934-4ab2-9158-ff18a31e30f5.json",
|
||||||
|
"latest_export": "session_d8c25163-9934-4ab2-9158-ff18a31e30f5.json",
|
||||||
|
"export_lag_minutes": 0,
|
||||||
|
"export_fresh": true,
|
||||||
|
"timestamp": "2026-03-30T21:20:52.929294+00:00"
|
||||||
},
|
},
|
||||||
"Timmy_Foundation/the-nexus": {
|
"Timmy_Foundation/the-nexus": {
|
||||||
"open_issues": 1,
|
"open_issues": 1,
|
||||||
@@ -21,7 +30,7 @@
|
|||||||
},
|
},
|
||||||
"huey_alive": true
|
"huey_alive": true
|
||||||
},
|
},
|
||||||
"previous_tick": "20260328_014026",
|
"previous_tick": "20260328_015026",
|
||||||
"decision": {
|
"decision": {
|
||||||
"actions": [],
|
"actions": [],
|
||||||
"severity": "fallback",
|
"severity": "fallback",
|
||||||
|
|||||||
260
scripts/provision-timmy-vps.sh
Normal file
260
scripts/provision-timmy-vps.sh
Normal file
@@ -0,0 +1,260 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# Timmy VPS Provisioning Script
|
||||||
|
# Transforms fresh Ubuntu 22.04+ VPS into sovereign local-first wizard
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
TIMMY_USER="${TIMMY_USER:-root}"
|
||||||
|
TIMMY_HOME="${TIMMY_HOME:-/root}"
|
||||||
|
TIMMY_DIR="$TIMMY_HOME/timmy"
|
||||||
|
REPO_URL="${REPO_URL:-http://143.198.27.163:3000/Timmy_Foundation/timmy-home.git}"
|
||||||
|
MODEL_URL="${MODEL_URL:-https://huggingface.co/TheBloke/Hermes-3-Llama-3.1-8B-GGUF/resolve/main/hermes-3-llama-3.1-8b.Q4_K_M.gguf}"
|
||||||
|
MODEL_NAME="${MODEL_NAME:-hermes-3-8b.Q4_K_M.gguf}"
|
||||||
|
|
||||||
|
echo "========================================"
|
||||||
|
echo " Timmy VPS Provisioning"
|
||||||
|
echo "========================================"
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
# Colors
|
||||||
|
RED='\033[0;31m'
|
||||||
|
GREEN='\033[0;32m'
|
||||||
|
YELLOW='\033[1;33m'
|
||||||
|
NC='\033[0m' # No Color
|
||||||
|
|
||||||
|
# log MESSAGE -- print an informational progress message with a green [TIMMY] tag.
log() {
    echo -e "${GREEN}[TIMMY]${NC} $1"
}

# warn MESSAGE -- print a non-fatal warning with a yellow [WARN] tag; does not exit.
warn() {
    echo -e "${YELLOW}[WARN]${NC} $1"
}

# error MESSAGE -- print an error with a red [ERROR] tag; caller decides whether to exit.
error() {
    echo -e "${RED}[ERROR]${NC} $1"
}
|
||||||
|
|
||||||
|
# Check if running as root
|
||||||
|
if [ "$EUID" -ne 0 ]; then
|
||||||
|
error "Please run as root"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Check Ubuntu version
|
||||||
|
if ! grep -q "Ubuntu 22.04\|Ubuntu 24.04" /etc/os-release; then
|
||||||
|
warn "Not Ubuntu 22.04/24.04 - may not work correctly"
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "Step 1/8: Installing system dependencies..."
|
||||||
|
export DEBIAN_FRONTEND=noninteractive
|
||||||
|
apt-get update -qq
|
||||||
|
apt-get install -y -qq \
|
||||||
|
build-essential \
|
||||||
|
cmake \
|
||||||
|
git \
|
||||||
|
curl \
|
||||||
|
wget \
|
||||||
|
python3 \
|
||||||
|
python3-pip \
|
||||||
|
python3-venv \
|
||||||
|
libopenblas-dev \
|
||||||
|
pkg-config \
|
||||||
|
ufw \
|
||||||
|
jq \
|
||||||
|
sqlite3 \
|
||||||
|
libsqlite3-dev \
|
||||||
|
2>&1 | tail -5
|
||||||
|
|
||||||
|
log "Step 2/8: Setting up directory structure..."
|
||||||
|
mkdir -p "$TIMMY_DIR"/{soul,scripts,logs,shared,models,configs}
|
||||||
|
mkdir -p "$TIMMY_HOME/.config/systemd/user"
|
||||||
|
|
||||||
|
log "Step 3/8: Building llama.cpp from source..."
|
||||||
|
if [ ! -f "$TIMMY_DIR/llama-server" ]; then
|
||||||
|
cd /tmp
|
||||||
|
git clone --depth 1 https://github.com/ggerganov/llama.cpp.git 2>/dev/null || true
|
||||||
|
cd llama.cpp
|
||||||
|
|
||||||
|
# Build with OpenBLAS for CPU optimization
|
||||||
|
cmake -B build \
|
||||||
|
-DGGML_BLAS=ON \
|
||||||
|
-DGGML_BLAS_VENDOR=OpenBLAS \
|
||||||
|
-DLLAMA_BUILD_TESTS=OFF \
|
||||||
|
-DLLAMA_BUILD_EXAMPLES=OFF \
|
||||||
|
-DCMAKE_BUILD_TYPE=Release
|
||||||
|
|
||||||
|
cmake --build build --config Release -j$(nproc)
|
||||||
|
|
||||||
|
# Copy binaries
|
||||||
|
cp build/bin/llama-server "$TIMMY_DIR/"
|
||||||
|
cp build/bin/llama-cli "$TIMMY_DIR/"
|
||||||
|
|
||||||
|
log "llama.cpp built successfully"
|
||||||
|
else
|
||||||
|
log "llama.cpp already exists, skipping build"
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "Step 4/8: Downloading model weights..."
|
||||||
|
if [ ! -f "$TIMMY_DIR/models/$MODEL_NAME" ]; then
|
||||||
|
cd "$TIMMY_DIR/models"
|
||||||
|
wget -q --show-progress "$MODEL_URL" -O "$MODEL_NAME" || {
|
||||||
|
error "Failed to download model. Continuing anyway..."
|
||||||
|
}
|
||||||
|
log "Model downloaded"
|
||||||
|
else
|
||||||
|
log "Model already exists, skipping download"
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "Step 5/8: Setting up llama-server systemd service..."
|
||||||
|
cat > /etc/systemd/system/llama-server.service << EOF
|
||||||
|
[Unit]
|
||||||
|
Description=llama.cpp inference server for Timmy
|
||||||
|
After=network.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=simple
|
||||||
|
User=$TIMMY_USER
|
||||||
|
WorkingDirectory=$TIMMY_DIR
|
||||||
|
ExecStart=$TIMMY_DIR/llama-server \\
|
||||||
|
-m $TIMMY_DIR/models/$MODEL_NAME \\
|
||||||
|
--host 127.0.0.1 \\
|
||||||
|
--port 8081 \\
|
||||||
|
-c 8192 \\
|
||||||
|
-np 1 \\
|
||||||
|
--jinja \\
|
||||||
|
-ngl 0
|
||||||
|
Restart=always
|
||||||
|
RestartSec=10
|
||||||
|
Environment="HOME=$TIMMY_HOME"
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
|
EOF
|
||||||
|
|
||||||
|
systemctl daemon-reload
|
||||||
|
systemctl enable llama-server.service
|
||||||
|
|
||||||
|
log "Step 6/8: Cloning timmy-home repo and setting up agent..."
|
||||||
|
if [ ! -d "$TIMMY_DIR/timmy-home" ]; then
|
||||||
|
cd "$TIMMY_DIR"
|
||||||
|
git clone "$REPO_URL" timmy-home 2>/dev/null || warn "Could not clone repo"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Create minimal Python environment for agent
|
||||||
|
if [ ! -d "$TIMMY_DIR/venv" ]; then
|
||||||
|
python3 -m venv "$TIMMY_DIR/venv"
|
||||||
|
"$TIMMY_DIR/venv/bin/pip" install -q requests pyyaml 2>&1 | tail -3
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "Step 7/8: Setting up Timmy agent systemd service..."
|
||||||
|
cat > /etc/systemd/system/timmy-agent.service << EOF
|
||||||
|
[Unit]
|
||||||
|
Description=Timmy Agent Harness
|
||||||
|
After=llama-server.service
|
||||||
|
Requires=llama-server.service
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=simple
|
||||||
|
User=$TIMMY_USER
|
||||||
|
WorkingDirectory=$TIMMY_DIR
|
||||||
|
ExecStart=$TIMMY_DIR/venv/bin/python $TIMMY_DIR/timmy-home/agent/agent_daemon.py
|
||||||
|
Restart=always
|
||||||
|
RestartSec=30
|
||||||
|
Environment="HOME=$TIMMY_HOME"
|
||||||
|
Environment="TIMMY_MODEL_URL=http://127.0.0.1:8081"
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
|
EOF
|
||||||
|
|
||||||
|
systemctl daemon-reload
|
||||||
|
systemctl enable timmy-agent.service
|
||||||
|
|
||||||
|
log "Step 8/8: Configuring firewall..."
|
||||||
|
# Reset UFW
|
||||||
|
ufw --force reset 2>/dev/null || true
|
||||||
|
ufw default deny incoming
|
||||||
|
ufw default allow outgoing
|
||||||
|
|
||||||
|
# Allow SSH
|
||||||
|
ufw allow 22/tcp
|
||||||
|
|
||||||
|
# Allow Syncthing (sync protocol)
|
||||||
|
ufw allow 22000/tcp
|
||||||
|
ufw allow 22000/udp
|
||||||
|
|
||||||
|
# Allow Syncthing (discovery)
|
||||||
|
ufw allow 21027/udp
|
||||||
|
|
||||||
|
# Note: llama-server on 8081 is NOT exposed (localhost only)
|
||||||
|
|
||||||
|
ufw --force enable
|
||||||
|
|
||||||
|
log "Starting services..."
|
||||||
|
systemctl start llama-server.service || warn "llama-server failed to start (may need model)"
|
||||||
|
|
||||||
|
# Wait for llama-server to be ready
|
||||||
|
log "Waiting for llama-server to be ready..."
|
||||||
|
for i in {1..30}; do
|
||||||
|
if curl -s http://127.0.0.1:8081/health >/dev/null 2>&1; then
|
||||||
|
log "llama-server is healthy!"
|
||||||
|
break
|
||||||
|
fi
|
||||||
|
sleep 2
|
||||||
|
done
|
||||||
|
|
||||||
|
# Create status script
|
||||||
|
cat > "$TIMMY_DIR/scripts/status.sh" << 'EOF'
|
||||||
|
#!/bin/bash
|
||||||
|
echo "=== Timmy VPS Status ==="
|
||||||
|
echo ""
|
||||||
|
echo "Services:"
|
||||||
|
systemctl is-active llama-server.service && echo " llama-server: RUNNING" || echo " llama-server: STOPPED"
|
||||||
|
systemctl is-active timmy-agent.service && echo " timmy-agent: RUNNING" || echo " timmy-agent: STOPPED"
|
||||||
|
echo ""
|
||||||
|
echo "Inference Health:"
|
||||||
|
curl -s http://127.0.0.1:8081/health | jq . 2>/dev/null || echo " Not responding"
|
||||||
|
echo ""
|
||||||
|
echo "Disk Usage:"
|
||||||
|
df -h $HOME | tail -1
|
||||||
|
echo ""
|
||||||
|
echo "Memory:"
|
||||||
|
free -h | grep Mem
|
||||||
|
EOF
|
||||||
|
chmod +x "$TIMMY_DIR/scripts/status.sh"
|
||||||
|
|
||||||
|
# Create README
|
||||||
|
cat > "$TIMMY_DIR/README.txt" << EOF
|
||||||
|
Timmy Sovereign Wizard VPS
|
||||||
|
==========================
|
||||||
|
|
||||||
|
Quick Commands:
|
||||||
|
$TIMMY_DIR/scripts/status.sh - Check system status
|
||||||
|
systemctl status llama-server - Check inference service
|
||||||
|
systemctl status timmy-agent - Check agent service
|
||||||
|
|
||||||
|
Directories:
|
||||||
|
$TIMMY_DIR/models/ - AI model weights
|
||||||
|
$TIMMY_DIR/soul/ - SOUL.md and conscience files
|
||||||
|
$TIMMY_DIR/logs/ - Agent logs
|
||||||
|
$TIMMY_DIR/shared/ - Syncthing shared folder
|
||||||
|
|
||||||
|
Inference Endpoint:
|
||||||
|
http://127.0.0.1:8081 (localhost only)
|
||||||
|
|
||||||
|
Provisioning complete!
|
||||||
|
EOF
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "========================================"
|
||||||
|
log "Provisioning Complete!"
|
||||||
|
echo "========================================"
|
||||||
|
echo ""
|
||||||
|
echo "Status:"
|
||||||
|
"$TIMMY_DIR/scripts/status.sh"
|
||||||
|
echo ""
|
||||||
|
echo "Next steps:"
|
||||||
|
echo " 1. Run syncthing setup: curl -sL $REPO_URL/raw/branch/main/scripts/setup-syncthing.sh | bash"
|
||||||
|
echo " 2. Check inference: curl http://127.0.0.1:8081/health"
|
||||||
|
echo " 3. Review logs: journalctl -u llama-server -f"
|
||||||
|
echo ""
|
||||||
127
uni-wizard/README.md
Normal file
127
uni-wizard/README.md
Normal file
@@ -0,0 +1,127 @@
|
|||||||
|
# Uni-Wizard Architecture
|
||||||
|
|
||||||
|
## Vision
|
||||||
|
|
||||||
|
A single wizard harness that elegantly routes all API interactions through one unified interface. No more fragmented wizards - one consciousness, infinite capabilities.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────┐
|
||||||
|
│ UNI-WIZARD HARNESS │
|
||||||
|
│ │
|
||||||
|
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
|
||||||
|
│ │ System │ │ Git │ │ Network │ │
|
||||||
|
│ │ Tools │◄──►│ Tools │◄──►│ Tools │ │
|
||||||
|
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
|
||||||
|
│ │ │ │ │
|
||||||
|
│ └──────────────────┼──────────────────┘ │
|
||||||
|
│ ▼ │
|
||||||
|
│ ┌───────────────┐ │
|
||||||
|
│ │ Tool Router │ │
|
||||||
|
│ │ (Registry) │ │
|
||||||
|
│ └───────┬───────┘ │
|
||||||
|
│ │ │
|
||||||
|
│ ┌──────────────────┼──────────────────┐ │
|
||||||
|
│ ▼ ▼ ▼ │
|
||||||
|
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
|
||||||
|
│ │ Local │ │ Gitea │ │ Relay │ │
|
||||||
|
│ │ llama.cpp │ │ API │ │ Nostr │ │
|
||||||
|
│ └─────────────┘ └─────────────┘ └─────────────┘ │
|
||||||
|
│ │
|
||||||
|
└─────────────────────────────────────────────────────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌───────────────┐
|
||||||
|
│ LLM (local) │
|
||||||
|
│ Hermes-3 8B │
|
||||||
|
└───────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## Design Principles
|
||||||
|
|
||||||
|
1. **Single Entry Point**: One harness, all capabilities
|
||||||
|
2. **Unified Registry**: All tools registered centrally
|
||||||
|
3. **Elegant Routing**: Tools discover and route automatically
|
||||||
|
4. **Local-First**: No cloud dependencies
|
||||||
|
5. **Self-Healing**: Tools can restart, reconnect, recover
|
||||||
|
|
||||||
|
## Tool Categories
|
||||||
|
|
||||||
|
### System Layer
|
||||||
|
- `system_info` — OS, CPU, RAM, disk, uptime
|
||||||
|
- `process_manager` — list, start, stop processes
|
||||||
|
- `service_controller` — systemd service management
|
||||||
|
- `health_monitor` — system health checks
|
||||||
|
|
||||||
|
### Git Layer
|
||||||
|
- `git_operations` — status, log, commit, push, pull
|
||||||
|
- `repo_manager` — clone, branch, merge
|
||||||
|
- `pr_handler` — create, review, merge PRs
|
||||||
|
|
||||||
|
### Network Layer
|
||||||
|
- `http_client` — GET, POST, PUT, DELETE
|
||||||
|
- `gitea_client` — full Gitea API wrapper
|
||||||
|
- `nostr_client` — relay communication
|
||||||
|
- `api_router` — generic API endpoint handler
|
||||||
|
|
||||||
|
### File Layer
|
||||||
|
- `file_operations` — read, write, append, search
|
||||||
|
- `directory_manager` — tree, list, navigate
|
||||||
|
- `archive_handler` — zip, tar, compress
|
||||||
|
|
||||||
|
## Registry System
|
||||||
|
|
||||||
|
```python
|
||||||
|
# tools/registry.py
|
||||||
|
class ToolRegistry:
|
||||||
|
def __init__(self):
|
||||||
|
self.tools = {}
|
||||||
|
|
||||||
|
def register(self, name, handler, schema):
|
||||||
|
self.tools[name] = {
|
||||||
|
'handler': handler,
|
||||||
|
'schema': schema,
|
||||||
|
'description': handler.__doc__
|
||||||
|
}
|
||||||
|
|
||||||
|
def execute(self, name, params):
|
||||||
|
tool = self.tools.get(name)
|
||||||
|
if not tool:
|
||||||
|
return f"Error: Tool '{name}' not found"
|
||||||
|
try:
|
||||||
|
return tool['handler'](**params)
|
||||||
|
except Exception as e:
|
||||||
|
return f"Error executing {name}: {str(e)}"
|
||||||
|
```
|
||||||
|
|
||||||
|
## API Flow
|
||||||
|
|
||||||
|
1. **User Request** → Natural language task
|
||||||
|
2. **LLM Planning** → Breaks into tool calls
|
||||||
|
3. **Registry Lookup** → Finds appropriate tools
|
||||||
|
4. **Execution** → Tools run in sequence/parallel
|
||||||
|
5. **Response** → Results synthesized and returned
|
||||||
|
|
||||||
|
## Example Usage
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Single harness, multiple capabilities
|
||||||
|
result = harness.execute("""
|
||||||
|
Check system health, pull latest git changes,
|
||||||
|
and create a Gitea issue if tests fail
|
||||||
|
""")
|
||||||
|
```
|
||||||
|
|
||||||
|
This becomes:
|
||||||
|
1. `system_info` → check health
|
||||||
|
2. `git_pull` → update repo
|
||||||
|
3. `run_tests` → execute tests
|
||||||
|
4. `gitea_create_issue` → report failures
|
||||||
|
|
||||||
|
## Benefits
|
||||||
|
|
||||||
|
- **Simplicity**: One harness to maintain
|
||||||
|
- **Power**: All capabilities unified
|
||||||
|
- **Elegance**: Clean routing, no fragmentation
|
||||||
|
- **Resilience**: Self-contained, local-first
|
||||||
9
uni-wizard/daemons/__init__.py
Normal file
9
uni-wizard/daemons/__init__.py
Normal file
@@ -0,0 +1,9 @@
|
|||||||
|
"""
|
||||||
|
Uni-Wizard Daemons Package
|
||||||
|
Background services for the uni-wizard architecture
|
||||||
|
"""
|
||||||
|
|
||||||
|
from .health_daemon import HealthDaemon
|
||||||
|
from .task_router import TaskRouter
|
||||||
|
|
||||||
|
__all__ = ['HealthDaemon', 'TaskRouter']
|
||||||
180
uni-wizard/daemons/health_daemon.py
Normal file
180
uni-wizard/daemons/health_daemon.py
Normal file
@@ -0,0 +1,180 @@
|
|||||||
|
"""
|
||||||
|
Health Check Daemon for Uni-Wizard
|
||||||
|
Monitors VPS status and exposes health endpoint
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
import threading
|
||||||
|
from http.server import HTTPServer, BaseHTTPRequestHandler
|
||||||
|
from datetime import datetime
|
||||||
|
from pathlib import Path
|
||||||
|
import sys
|
||||||
|
|
||||||
|
# Add parent to path
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent))
|
||||||
|
|
||||||
|
from harness import get_harness
|
||||||
|
|
||||||
|
|
||||||
|
class HealthCheckHandler(BaseHTTPRequestHandler):
    """HTTP handler for the uni-wizard health endpoints.

    Routes:
        GET /health -- quick probe; 200 when the harness reports "healthy",
                       503 otherwise.
        GET /status -- full harness/system/health snapshot (always 200).

    Any other path gets a 404.
    """

    def log_message(self, format, *args):
        # Suppress the default per-request stderr logging.
        pass

    def do_GET(self):
        """Dispatch GET requests to the matching endpoint handler."""
        if self.path == '/health':
            self.send_health_response()
        elif self.path == '/status':
            self.send_full_status()
        else:
            self.send_error(404)

    def send_health_response(self):
        """Send a simple health check.

        Asks the harness for a "health_check" result and maps it to an HTTP
        status: 200 only when the parsed payload reports overall == "healthy";
        503 for anything else, including unparseable harness output.
        """
        harness = get_harness()
        result = harness.execute("health_check")

        try:
            health_data = json.loads(result)
            status_code = 200 if health_data.get("overall") == "healthy" else 503
        except (ValueError, TypeError, AttributeError):
            # Fix: narrowed from a bare `except:`, which also swallowed
            # SystemExit/KeyboardInterrupt. These three cover what can
            # actually go wrong here: ValueError (json.JSONDecodeError) for
            # non-JSON harness output, TypeError when `result` is not
            # str/bytes, AttributeError when the JSON is a scalar/list
            # without `.get`. All are treated as "unhealthy".
            status_code = 503
            health_data = {"error": "Health check failed"}

        self.send_response(status_code)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps(health_data).encode())

    def send_full_status(self):
        """Send full system status.

        Aggregates harness status plus "system_info" and "health_check" tool
        results into one JSON document. NOTE(review): unlike /health, the
        json.loads calls here are unguarded -- malformed harness output will
        surface as a 500 from the base handler; presumably acceptable for a
        localhost-only diagnostics endpoint.
        """
        harness = get_harness()

        status = {
            "timestamp": datetime.now().isoformat(),
            "harness": json.loads(harness.get_status()),
            "system": json.loads(harness.execute("system_info")),
            "health": json.loads(harness.execute("health_check"))
        }

        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps(status, indent=2).encode())
|
||||||
|
|
||||||
|
|
||||||
|
class HealthDaemon:
    """
    Health monitoring daemon.

    Runs continuously, monitoring:
    - System resources
    - Service status
    - Inference endpoint

    Exposes:
    - HTTP endpoint on port 8082
    - JSON status file at ~/timmy/logs/health.json
    """

    def __init__(self, port: int = 8082, check_interval: int = 60):
        self.port = port
        self.check_interval = check_interval
        self.running = False
        self.server = None
        self.monitor_thread = None
        self.last_health = None

        # Make sure the log directory exists before the first write.
        self.log_path = Path.home() / "timmy" / "logs"
        self.log_path.mkdir(parents=True, exist_ok=True)
        self.health_file = self.log_path / "health.json"

    def start(self):
        """Bring up the HTTP endpoint and the background monitor thread."""
        self.running = True

        # HTTP server runs on its own daemon thread so it never blocks
        # shutdown of the process.
        self.server = HTTPServer(('127.0.0.1', self.port), HealthCheckHandler)
        http_thread = threading.Thread(target=self.server.serve_forever, daemon=True)
        http_thread.start()

        # Periodic health-file writer, also a daemon thread.
        self.monitor_thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self.monitor_thread.start()

        print(f"Health daemon started on http://127.0.0.1:{self.port}")
        print(f" - /health - Quick health check")
        print(f" - /status - Full system status")
        print(f"Health file: {self.health_file}")

    def stop(self):
        """Stop the monitor loop and shut down the HTTP server, if any."""
        self.running = False
        if self.server is not None:
            self.server.shutdown()
        print("Health daemon stopped")

    def _monitor_loop(self):
        """Refresh the health file every check_interval seconds while running."""
        while self.running:
            try:
                self._update_health_file()
            except Exception as e:
                # Brief back-off on unexpected failure, then keep monitoring.
                print(f"Monitor error: {e}")
                time.sleep(5)
            else:
                time.sleep(self.check_interval)

    def _update_health_file(self):
        """Snapshot harness health + system info and write it to disk."""
        harness = get_harness()

        try:
            snapshot = {
                "timestamp": datetime.now().isoformat(),
                "health": json.loads(harness.execute("health_check")),
                "system": json.loads(harness.execute("system_info")),
            }
            self.health_file.write_text(json.dumps(snapshot, indent=2))
            self.last_health = snapshot
        except Exception as e:
            # Best-effort: a failed snapshot must not kill the monitor loop.
            print(f"Failed to update health file: {e}")
|
||||||
|
|
||||||
|
|
||||||
|
def main():
    """Run the health daemon until interrupted.

    Installs SIGINT/SIGTERM handlers so a systemd stop (or Ctrl-C) shuts the
    daemon down cleanly, then parks the main thread in a sleep loop -- both
    the HTTP server and the monitor run on daemon threads, so the main
    thread must stay alive to keep the process running.
    """
    import signal

    daemon = HealthDaemon()

    def signal_handler(sig, frame):
        # Graceful shutdown path: stop threads, then exit with status 0.
        print("\nShutting down...")
        daemon.stop()
        sys.exit(0)

    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)

    daemon.start()

    # Keep main thread alive
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        # Fallback for interrupts that bypass the signal handler.
        daemon.stop()


if __name__ == "__main__":
    main()
|
||||||
222
uni-wizard/daemons/task_router.py
Normal file
222
uni-wizard/daemons/task_router.py
Normal file
@@ -0,0 +1,222 @@
|
|||||||
|
"""
|
||||||
|
Task Router for Uni-Wizard
|
||||||
|
Polls Gitea for assigned issues and executes them
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
# Add parent to path
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent))
|
||||||
|
|
||||||
|
from harness import get_harness
|
||||||
|
|
||||||
|
|
||||||
|
class TaskRouter:
    """
    Gitea Task Router.

    Polls Gitea for issues assigned to Timmy and routes them
    to appropriate tools for execution.

    Flow:
    1. Poll Gitea API for open issues assigned to Timmy
    2. Parse issue body for commands/tasks
    3. Route to appropriate tool via harness
    4. Post results back as comments
    5. Close issue if task complete
    """

    def __init__(
        self,
        gitea_url: str = "http://143.198.27.163:3000",
        repo: str = "Timmy_Foundation/timmy-home",
        assignee: str = "timmy",
        poll_interval: int = 60
    ):
        """
        Args:
            gitea_url: Base URL of the Gitea instance to poll.
            repo: "owner/name" of the repository whose issues are routed.
            assignee: Username whose assigned issues are processed.
            poll_interval: Seconds to sleep between successful polls.
        """
        self.gitea_url = gitea_url
        self.repo = repo
        self.assignee = assignee
        self.poll_interval = poll_interval
        self.running = False
        self.harness = get_harness()
        # Issue numbers already handled this process lifetime; prevents
        # re-processing the same open issue on every poll cycle.
        self.processed_issues = set()

        # JSONL event log under ~/timmy/logs/
        self.log_path = Path.home() / "timmy" / "logs"
        self.log_path.mkdir(parents=True, exist_ok=True)
        self.router_log = self.log_path / "task_router.jsonl"

    def start(self):
        """Start the polling loop; blocks until stop() flips self.running."""
        self.running = True
        # Fixed: was an f-string with no placeholders.
        print("Task router started")
        print(f"  Polling: {self.gitea_url}")
        print(f"  Assignee: {self.assignee}")
        print(f"  Interval: {self.poll_interval}s")

        while self.running:
            try:
                self._poll_and_route()
                time.sleep(self.poll_interval)
            except Exception as e:
                # Log and back off briefly so a transient failure
                # (network blip, Gitea restart) cannot spin the loop.
                self._log_event("error", {"message": str(e)})
                time.sleep(5)

    def stop(self):
        """Stop the task router (takes effect after the current cycle)."""
        self.running = False
        print("Task router stopped")

    def _poll_and_route(self):
        """Poll for assigned open issues and route any not yet processed."""
        result = self.harness.execute(
            "gitea_list_issues",
            repo=self.repo,
            state="open",
            assignee=self.assignee
        )

        try:
            issues = json.loads(result)
        except (json.JSONDecodeError, TypeError):
            # Fixed: was a bare `except:` that swallowed everything,
            # including KeyboardInterrupt/SystemExit. Non-JSON output
            # means the listing failed; skip this poll cycle.
            return

        for issue in issues.get("issues", []):
            issue_num = issue["number"]

            # Skip already processed
            if issue_num in self.processed_issues:
                continue

            # Process the issue
            self._process_issue(issue)
            self.processed_issues.add(issue_num)

    def _process_issue(self, issue: dict):
        """Route a single issue to a handler and post the result back."""
        issue_num = issue["number"]
        title = issue["title"]

        self._log_event("issue_received", {
            "number": issue_num,
            "title": title
        })

        # Parse title for command hints
        # Format: "[ACTION] Description" or just "Description"
        action = self._parse_action(title)

        # Route to appropriate handler
        if action == "system_check":
            result = self._handle_system_check(issue_num)
        elif action == "git_operation":
            result = self._handle_git_operation(issue_num, issue)
        elif action == "health_report":
            result = self._handle_health_report(issue_num)
        else:
            result = self._handle_generic(issue_num, issue)

        # Post result as comment
        self._post_comment(issue_num, result)

        self._log_event("issue_processed", {
            "number": issue_num,
            "action": action,
            "result": "success" if result else "failed"
        })

    def _parse_action(self, title: str) -> str:
        """Classify an issue title into an action via keyword matching.

        Health keywords win over system keywords, which win over git
        keywords; anything unmatched is "generic".
        """
        title_lower = title.lower()

        if any(kw in title_lower for kw in ["health", "status", "check"]):
            return "health_report"
        elif any(kw in title_lower for kw in ["system", "resource", "disk", "memory"]):
            return "system_check"
        elif any(kw in title_lower for kw in ["git", "commit", "push", "pull", "branch"]):
            return "git_operation"

        return "generic"

    def _handle_system_check(self, issue_num: int) -> str:
        """Run the system_info tool and format the output as markdown."""
        result = self.harness.execute("system_info")
        return f"## System Check Results\n\n```json\n{result}\n```"

    def _handle_health_report(self, issue_num: int) -> str:
        """Run the health_check tool and format the output as markdown."""
        result = self.harness.execute("health_check")
        return f"## Health Report\n\n```json\n{result}\n```"

    def _handle_git_operation(self, issue_num: int, issue: dict) -> str:
        """Run git operations mentioned in the issue body (status/pull)."""
        body = issue.get("body", "")

        # Parse body for git commands
        results = []

        # Check for status request
        if "status" in body.lower():
            result = self.harness.execute("git_status", repo_path="/root/timmy/timmy-home")
            results.append(f"**Git Status:**\n```json\n{result}\n```")

        # Check for pull request
        if "pull" in body.lower():
            result = self.harness.execute("git_pull", repo_path="/root/timmy/timmy-home")
            results.append(f"**Git Pull:**\n{result}")

        if not results:
            results.append("No specific git operation detected in issue body.")

        return "\n\n".join(results)

    def _handle_generic(self, issue_num: int, issue: dict) -> str:
        """Acknowledge an issue that matched no specific handler."""
        return f"Received issue #{issue_num}: {issue['title']}\n\nI'll process this and update shortly."

    def _post_comment(self, issue_num: int, body: str):
        """Post *body* as a comment on the given issue via the harness."""
        result = self.harness.execute(
            "gitea_comment",
            repo=self.repo,
            issue_number=issue_num,
            body=body
        )
        return result

    def _log_event(self, event_type: str, data: dict):
        """Append one event record (timestamp + payload) to the JSONL log."""
        log_entry = {
            "timestamp": datetime.now().isoformat(),
            "event": event_type,
            **data
        }

        with open(self.router_log, "a") as f:
            f.write(json.dumps(log_entry) + "\n")
|
||||||
|
|
||||||
|
|
||||||
|
def main():
    """Run the task router until SIGINT/SIGTERM arrives."""
    import signal

    router = TaskRouter()

    def _shutdown(signum, frame):
        # Graceful exit: stop the poll loop, then terminate the process.
        print("\nShutting down...")
        router.stop()
        sys.exit(0)

    for signum in (signal.SIGINT, signal.SIGTERM):
        signal.signal(signum, _shutdown)

    router.start()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
174
uni-wizard/harness.py
Normal file
174
uni-wizard/harness.py
Normal file
@@ -0,0 +1,174 @@
|
|||||||
|
"""
|
||||||
|
Uni-Wizard Harness
|
||||||
|
Single entry point for all capabilities
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
from typing import Dict, Any, Optional
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# Add tools to path
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent))
|
||||||
|
|
||||||
|
from tools import registry, call_tool
|
||||||
|
|
||||||
|
|
||||||
|
class UniWizardHarness:
    """Single entry point that routes every capability call through one place.

    Wraps the shared tool registry: system monitoring and control, git
    operations, network requests, the Gitea API and local inference are
    all invoked via ``execute(tool_name, **params)``. Every call is
    recorded in ``self.history``.

    Example:
        harness = UniWizardHarness()
        harness.execute("system_info")
        harness.execute("git_status", repo_path="/path/to/repo")
    """

    def __init__(self):
        # The registry is the module-level singleton; history is per-instance.
        self.registry = registry
        self.history = []

    def list_capabilities(self) -> str:
        """Render every registered tool, grouped by category, as plain text."""
        lines = []
        for cat in self.registry.get_categories():
            lines.append(f"\n{cat.upper()}:")
            for entry in self.registry.get_tools_by_category(cat):
                lines.append(f"  - {entry['name']}: {entry['description']}")
        return "\n".join(lines)

    def execute(self, tool_name: str, **params) -> str:
        """Record the call in history, then dispatch it via the registry.

        Args:
            tool_name: Registered name of the tool to run.
            **params: Keyword arguments forwarded to the tool.

        Returns:
            The tool's string result.
        """
        self.history.append({"tool": tool_name, "params": params})
        return call_tool(tool_name, **params)

    def execute_plan(self, plan: list) -> Dict[str, str]:
        """Run a sequence of tool calls and collect their outputs.

        Args:
            plan: List of dicts with 'tool' and 'params',
                e.g. ``[{"tool": "system_info", "params": {}}]``.

        Returns:
            Mapping of tool name -> result string.
        """
        outcome = {}
        for step in plan:
            name = step.get("tool")
            outcome[name] = self.execute(name, **step.get("params", {}))
        return outcome

    def get_tool_definitions(self) -> str:
        """Tool definitions formatted for an LLM system prompt."""
        return self.registry.get_tool_definitions()

    def get_status(self) -> str:
        """JSON snapshot of registry contents and call-history size."""
        snapshot = {
            "total_tools": len(self.registry.list_tools()),
            "categories": self.registry.get_categories(),
            "tools_by_category": {
                cat: self.registry.list_tools(cat)
                for cat in self.registry.get_categories()
            },
            "execution_history_count": len(self.history)
        }
        return json.dumps(snapshot, indent=2)
|
||||||
|
|
||||||
|
|
||||||
|
# Lazily-created process-wide singleton.
_harness = None


def get_harness() -> UniWizardHarness:
    """Return the singleton harness, creating it on first use."""
    global _harness
    if _harness is not None:
        return _harness
    _harness = UniWizardHarness()
    return _harness
|
||||||
|
|
||||||
|
|
||||||
|
def main():
    """CLI interface for the harness.

    Commands: list, status, tools, exec <tool> key=value...
    With no arguments, prints usage and exits.
    """
    harness = get_harness()

    # No arguments: print usage text and exit.
    if len(sys.argv) < 2:
        print("Uni-Wizard Harness")
        print("==================")
        print("\nUsage: python harness.py <command> [args]")
        print("\nCommands:")
        print("  list   - List all capabilities")
        print("  status - Show harness status")
        print("  tools  - Show tool definitions (for LLM)")
        print("  exec <tool> - Execute a tool")
        print("\nExamples:")
        print('  python harness.py exec system_info')
        print('  python harness.py exec git_status repo_path=/tmp/timmy-home')
        return

    command = sys.argv[1]

    if command == "list":
        print(harness.list_capabilities())

    elif command == "status":
        print(harness.get_status())

    elif command == "tools":
        print(harness.get_tool_definitions())

    elif command == "exec" and len(sys.argv) >= 3:
        tool_name = sys.argv[2]

        # Parse params from args (key=value format); args without '='
        # are silently ignored.
        params = {}
        for arg in sys.argv[3:]:
            if '=' in arg:
                key, value = arg.split('=', 1)
                # Try to parse as int/bool; everything else stays a string.
                # NOTE(review): isdigit() misses negative numbers — confirm
                # no tool needs a negative int parameter from the CLI.
                if value.isdigit():
                    value = int(value)
                elif value.lower() == 'true':
                    value = True
                elif value.lower() == 'false':
                    value = False
                params[key] = value

        result = harness.execute(tool_name, **params)
        print(result)

    else:
        # Unknown command, or "exec" without a tool name.
        print(f"Unknown command: {command}")
        print("Run without arguments for help")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
388
uni-wizard/scripts/generate_scorecard.py
Normal file
388
uni-wizard/scripts/generate_scorecard.py
Normal file
@@ -0,0 +1,388 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
JSONL Scorecard Generator for Uni-Wizard
|
||||||
|
Analyzes overnight loop results and produces comprehensive reports
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
from datetime import datetime
|
||||||
|
from collections import defaultdict
|
||||||
|
from typing import Dict, List, Any
|
||||||
|
import statistics
|
||||||
|
|
||||||
|
|
||||||
|
class ScorecardGenerator:
    """
    Generates scorecards from overnight loop JSONL data.

    Analyzes:
    - Pass/fail rates
    - Response times (avg, median, p95)
    - Per-task breakdowns
    - Error patterns
    - Timeline trends
    """

    def __init__(self, input_dir: str = "~/shared/overnight-loop"):
        # Directory holding *.jsonl task records (one JSON object per line).
        self.input_dir = Path(input_dir).expanduser()
        # Raw task dicts, populated by load_all().
        self.tasks = []
        # Aggregated counters, populated by analyze()/_process_task().
        self.stats = {
            "total": 0,
            "passed": 0,
            "failed": 0,
            "pass_rate": 0.0,
            "durations": [],
            "by_task": defaultdict(lambda: {"total": 0, "passed": 0, "failed": 0, "durations": []}),
            "by_hour": defaultdict(lambda: {"total": 0, "passed": 0, "durations": []}),
            "errors": defaultdict(int)
        }

    def load_jsonl(self, filepath: Path) -> List[Dict]:
        """Load and parse a JSONL file, handling errors gracefully"""
        tasks = []
        with open(filepath, 'r') as f:
            for line_num, line in enumerate(f, 1):
                line = line.strip()
                if not line:
                    continue
                try:
                    task = json.loads(line)
                    tasks.append(task)
                except json.JSONDecodeError:
                    # Malformed lines are reported but never abort the load.
                    print(f"Warning: Skipping malformed line {line_num} in {filepath}")
                    continue
        return tasks

    def load_all(self):
        """Load all JSONL files from input directory"""
        if not self.input_dir.exists():
            print(f"Input directory not found: {self.input_dir}")
            return

        jsonl_files = list(self.input_dir.glob("*.jsonl"))
        if not jsonl_files:
            print(f"No .jsonl files found in {self.input_dir}")
            return

        # Sorted for deterministic ordering across runs.
        for filepath in sorted(jsonl_files):
            print(f"Loading: {filepath.name}")
            tasks = self.load_jsonl(filepath)
            self.tasks.extend(tasks)

        print(f"Loaded {len(self.tasks)} tasks from {len(jsonl_files)} files")

    def analyze(self):
        """Analyze all loaded tasks"""
        if not self.tasks:
            print("No tasks to analyze")
            return

        for task in self.tasks:
            self._process_task(task)

        # Calculate overall pass rate
        if self.stats["total"] > 0:
            self.stats["pass_rate"] = (self.stats["passed"] / self.stats["total"]) * 100

        print(f"Analysis complete: {self.stats['passed']}/{self.stats['total']} passed ({self.stats['pass_rate']:.1f}%)")

    def _process_task(self, task: Dict):
        """Process a single task record.

        Updates overall, per-task and per-hour counters. Anything whose
        status is not exactly "pass" counts as failed.
        """
        # Basic stats
        self.stats["total"] += 1

        status = task.get("status", "unknown")
        duration = task.get("duration_s", 0)
        task_type = task.get("task", "unknown")
        timestamp = task.get("timestamp", "")

        # Pass/fail
        if status == "pass":
            self.stats["passed"] += 1
            self.stats["by_task"][task_type]["passed"] += 1
        else:
            self.stats["failed"] += 1
            self.stats["by_task"][task_type]["failed"] += 1

            # Track error patterns
            error = task.get("error", "unknown_error")
            self.stats["errors"][error] += 1

        # Durations
        self.stats["durations"].append(duration)
        self.stats["by_task"][task_type]["durations"].append(duration)
        self.stats["by_task"][task_type]["total"] += 1

        # Hourly breakdown
        if timestamp:
            try:
                hour = timestamp[:13]  # YYYY-MM-DDTHH
                self.stats["by_hour"][hour]["total"] += 1
                if status == "pass":
                    self.stats["by_hour"][hour]["passed"] += 1
                self.stats["by_hour"][hour]["durations"].append(duration)
            except:
                # Best-effort: a malformed timestamp must not abort analysis.
                # NOTE(review): bare except also hides real bugs — consider
                # narrowing to (TypeError, KeyError).
                pass

    def calculate_duration_stats(self, durations: List[float]) -> Dict[str, float]:
        """Calculate duration statistics (avg/median/p95/min/max, rounded)."""
        if not durations:
            return {"avg": 0, "median": 0, "p95": 0, "min": 0, "max": 0}

        sorted_durations = sorted(durations)
        n = len(sorted_durations)

        return {
            "avg": round(statistics.mean(durations), 2),
            "median": round(statistics.median(durations), 2),
            # Nearest-rank p95; single-element lists fall back to that element.
            "p95": round(sorted_durations[int(n * 0.95)] if n > 1 else sorted_durations[0], 2),
            "min": round(min(durations), 2),
            "max": round(max(durations), 2)
        }

    def generate_json(self) -> Dict:
        """Generate structured JSON report"""
        duration_stats = self.calculate_duration_stats(self.stats["durations"])

        report = {
            "generated_at": datetime.now().isoformat(),
            "summary": {
                "total_tasks": self.stats["total"],
                "passed": self.stats["passed"],
                "failed": self.stats["failed"],
                "pass_rate": round(self.stats["pass_rate"], 2),
                "duration_stats": duration_stats
            },
            "by_task": {},
            "by_hour": {},
            "errors": dict(self.stats["errors"]),
            "recommendations": self._generate_recommendations()
        }

        # Per-task breakdown
        for task_type, data in self.stats["by_task"].items():
            if data["total"] > 0:
                pass_rate = (data["passed"] / data["total"]) * 100
                report["by_task"][task_type] = {
                    "total": data["total"],
                    "passed": data["passed"],
                    "failed": data["failed"],
                    "pass_rate": round(pass_rate, 2),
                    "duration_stats": self.calculate_duration_stats(data["durations"])
                }

        # Hourly breakdown (chronological order)
        for hour, data in sorted(self.stats["by_hour"].items()):
            if data["total"] > 0:
                pass_rate = (data["passed"] / data["total"]) * 100
                report["by_hour"][hour] = {
                    "total": data["total"],
                    "passed": data["passed"],
                    "pass_rate": round(pass_rate, 2),
                    "avg_duration": round(statistics.mean(data["durations"]), 2) if data["durations"] else 0
                }

        return report

    def generate_markdown(self) -> str:
        """Generate markdown report (built on top of generate_json())."""
        json_report = self.generate_json()

        md = f"""# Overnight Loop Scorecard

**Generated:** {json_report['generated_at']}

---

## Summary

| Metric | Value |
|--------|-------|
| Total Tasks | {json_report['summary']['total_tasks']} |
| Passed | {json_report['summary']['passed']} ✅ |
| Failed | {json_report['summary']['failed']} ❌ |
| **Pass Rate** | **{json_report['summary']['pass_rate']:.1f}%** |

### Duration Statistics

| Metric | Value (seconds) |
|--------|-----------------|
| Average | {json_report['summary']['duration_stats']['avg']} |
| Median | {json_report['summary']['duration_stats']['median']} |
| P95 | {json_report['summary']['duration_stats']['p95']} |
| Min | {json_report['summary']['duration_stats']['min']} |
| Max | {json_report['summary']['duration_stats']['max']} |

---

## Per-Task Breakdown

| Task | Total | Passed | Failed | Pass Rate | Avg Duration |
|------|-------|--------|--------|-----------|--------------|
"""

        # Sort by pass rate (ascending - worst first)
        sorted_tasks = sorted(
            json_report['by_task'].items(),
            key=lambda x: x[1]['pass_rate']
        )

        for task_type, data in sorted_tasks:
            status = "✅" if data['pass_rate'] >= 90 else "⚠️" if data['pass_rate'] >= 70 else "❌"
            md += f"| {task_type} | {data['total']} | {data['passed']} | {data['failed']} | {status} {data['pass_rate']:.1f}% | {data['duration_stats']['avg']}s |\n"

        md += """
---

## Timeline (Hourly)

| Hour | Tasks | Passed | Pass Rate | Avg Duration |
|------|-------|--------|-----------|--------------|
"""

        for hour, data in sorted(json_report['by_hour'].items()):
            trend = "📈" if data['pass_rate'] >= 90 else "📊" if data['pass_rate'] >= 70 else "📉"
            md += f"| {hour} | {data['total']} | {data['passed']} | {trend} {data['pass_rate']:.1f}% | {data['avg_duration']}s |\n"

        md += """
---

## Error Analysis

| Error Pattern | Count |
|---------------|-------|
"""

        # Most frequent errors first.
        for error, count in sorted(json_report['errors'].items(), key=lambda x: x[1], reverse=True):
            md += f"| {error} | {count} |\n"

        md += """
---

## Recommendations

"""

        for rec in json_report['recommendations']:
            md += f"- {rec}\n"

        md += """
---

*Generated by Uni-Wizard Scorecard Generator*
"""

        return md

    def _generate_recommendations(self) -> List[str]:
        """Generate human-readable recommendations based on the stats."""
        recommendations = []

        # Check overall pass rate
        if self.stats["pass_rate"] < 70:
            recommendations.append(f"⚠️ Overall pass rate ({self.stats['pass_rate']:.1f}%) is concerning. Review infrastructure health.")
        elif self.stats["pass_rate"] >= 95:
            recommendations.append(f"✅ Excellent pass rate ({self.stats['pass_rate']:.1f}%). System is performing well.")

        # Check for failing tasks (<50% pass rate)
        failing_tasks = []
        for task_type, data in self.stats["by_task"].items():
            if data["total"] > 0:
                pass_rate = (data["passed"] / data["total"]) * 100
                if pass_rate < 50:
                    failing_tasks.append(task_type)

        if failing_tasks:
            recommendations.append(f"❌ Tasks with <50% pass rate: {', '.join(failing_tasks)}. Consider debugging or removing.")

        # Check for slow tasks
        slow_tasks = []
        for task_type, data in self.stats["by_task"].items():
            if data["durations"]:
                avg = statistics.mean(data["durations"])
                if avg > 30:  # Tasks taking >30s on average
                    slow_tasks.append(f"{task_type} ({avg:.1f}s)")

        if slow_tasks:
            recommendations.append(f"⏱️ Slow tasks detected: {', '.join(slow_tasks)}. Consider optimization.")

        # Check error patterns: surface the single most frequent error.
        if self.stats["errors"]:
            top_error = max(self.stats["errors"].items(), key=lambda x: x[1])
            recommendations.append(f"🔍 Most common error: '{top_error[0]}' ({top_error[1]} occurrences). Investigate root cause.")

        # Timeline trend: compare first and last hour's pass rates.
        if len(self.stats["by_hour"]) >= 2:
            hours = sorted(self.stats["by_hour"].keys())
            first_hour = hours[0]
            last_hour = hours[-1]

            first_rate = (self.stats["by_hour"][first_hour]["passed"] / self.stats["by_hour"][first_hour]["total"]) * 100
            last_rate = (self.stats["by_hour"][last_hour]["passed"] / self.stats["by_hour"][last_hour]["total"]) * 100

            if last_rate > first_rate + 10:
                recommendations.append(f"📈 Performance improving over time (+{last_rate - first_rate:.1f}% pass rate).")
            elif last_rate < first_rate - 10:
                recommendations.append(f"📉 Performance degrading over time (-{first_rate - last_rate:.1f}% pass rate). Check for resource exhaustion.")

        return recommendations

    def save_reports(self, output_dir: str = "~/timmy/reports"):
        """Save JSON and markdown reports; returns (json_path, md_path)."""
        output_path = Path(output_dir).expanduser()
        output_path.mkdir(parents=True, exist_ok=True)

        # One report pair per calendar day; a rerun overwrites.
        date_str = datetime.now().strftime("%Y%m%d")

        # Save JSON
        json_file = output_path / f"scorecard_{date_str}.json"
        json_report = self.generate_json()
        with open(json_file, 'w') as f:
            json.dump(json_report, f, indent=2)
        print(f"JSON report saved: {json_file}")

        # Save Markdown
        md_file = output_path / f"scorecard_{date_str}.md"
        md_report = self.generate_markdown()
        with open(md_file, 'w') as f:
            f.write(md_report)
        print(f"Markdown report saved: {md_file}")

        return json_file, md_file
|
||||||
|
|
||||||
|
|
||||||
|
def main():
    """CLI entry point: parse arguments, crunch the JSONL, write reports."""
    import argparse

    def banner(text):
        # Boxed section header for console output.
        print("=" * 60)
        print(text)
        print("=" * 60)

    cli = argparse.ArgumentParser(description="Generate scorecard from overnight loop JSONL")
    cli.add_argument("--input", "-i", default="~/shared/overnight-loop", help="Input directory with JSONL files")
    cli.add_argument("--output", "-o", default="~/timmy/reports", help="Output directory for reports")
    opts = cli.parse_args()

    banner("UNI-WIZARD SCORECARD GENERATOR")
    print()

    gen = ScorecardGenerator(input_dir=opts.input)
    gen.load_all()
    gen.analyze()

    if gen.stats["total"] == 0:
        print("No data to report")
        return

    json_file, md_file = gen.save_reports(output_dir=opts.output)
    print()
    banner("REPORTS GENERATED")
    print(f"JSON: {json_file}")
    print(f"Markdown: {md_file}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
114
uni-wizard/test_harness.py
Normal file
114
uni-wizard/test_harness.py
Normal file
@@ -0,0 +1,114 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Test script for Uni-Wizard Harness
|
||||||
|
Exercises all tool categories
|
||||||
|
"""
|
||||||
|
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent))
|
||||||
|
|
||||||
|
from harness import get_harness
|
||||||
|
|
||||||
|
|
||||||
|
def test_system_tools():
    """Exercise the system-monitoring tool set."""
    print("\n" + "=" * 60)
    print("TESTING SYSTEM TOOLS")
    print("=" * 60)

    hz = get_harness()

    cases = [
        ("system_info", {}),
        ("health_check", {}),
        ("process_list", {"filter_name": "python"}),
        ("disk_usage", {}),
    ]

    for name, kwargs in cases:
        print(f"\n>>> {name}()")
        out = hz.execute(name, **kwargs)
        # Truncate long results so console output stays readable.
        print(out if len(out) <= 500 else out[:500] + "...")
|
||||||
|
|
||||||
|
|
||||||
|
def test_git_tools():
    """Exercise the git-operation tool set."""
    print("\n" + "=" * 60)
    print("TESTING GIT TOOLS")
    print("=" * 60)

    hz = get_harness()

    # Test with timmy-home repo if it exists
    repo_path = "/tmp/timmy-home"

    cases = [
        ("git_status", {"repo_path": repo_path}),
        ("git_log", {"repo_path": repo_path, "count": 5}),
        ("git_branch_list", {"repo_path": repo_path}),
    ]

    for name, kwargs in cases:
        print(f"\n>>> {name}()")
        out = hz.execute(name, **kwargs)
        # Truncate long results so console output stays readable.
        print(out if len(out) <= 500 else out[:500] + "...")
|
||||||
|
|
||||||
|
|
||||||
|
def test_network_tools():
    """Exercise the network/Gitea tool set."""
    print("\n" + "=" * 60)
    print("TESTING NETWORK TOOLS")
    print("=" * 60)

    hz = get_harness()

    cases = [
        ("http_get", {"url": "http://143.198.27.163:3000/api/v1/repos/Timmy_Foundation/timmy-home"}),
        ("gitea_list_issues", {"state": "open"}),
    ]

    for name, kwargs in cases:
        print(f"\n>>> {name}()")
        out = hz.execute(name, **kwargs)
        # Truncate long results so console output stays readable.
        print(out if len(out) <= 500 else out[:500] + "...")
|
||||||
|
|
||||||
|
|
||||||
|
def test_harness_features():
    """Exercise the harness's own management surface (no tool calls)."""
    print("\n" + "=" * 60)
    print("TESTING HARNESS FEATURES")
    print("=" * 60)

    hz = get_harness()

    print("\n>>> list_capabilities()")
    print(hz.list_capabilities())

    print("\n>>> get_status()")
    print(hz.get_status())
|
||||||
|
|
||||||
|
|
||||||
|
def run_all_tests():
    """Run every test group; report the first failure with a traceback."""
    print("UNI-WIZARD HARNESS TEST SUITE")
    print("=============================")

    try:
        for suite in (test_system_tools, test_git_tools,
                      test_network_tools, test_harness_features):
            suite()

        print("\n" + "=" * 60)
        print("✓ ALL TESTS COMPLETED")
        print("=" * 60)

    except Exception as e:
        print(f"\n✗ TEST FAILED: {e}")
        import traceback
        traceback.print_exc()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
run_all_tests()
|
||||||
24
uni-wizard/tools/__init__.py
Normal file
24
uni-wizard/tools/__init__.py
Normal file
@@ -0,0 +1,24 @@
|
|||||||
|
"""
|
||||||
|
Uni-Wizard Tools Package
|
||||||
|
All tools for self-sufficient operation
|
||||||
|
"""
|
||||||
|
|
||||||
|
from .registry import registry, ToolRegistry, ToolResult, tool, call_tool

# Import all tool modules to register them
from . import system_tools
from . import git_tools
from . import network_tools

# Public API of the tools package.
__all__ = [
    'registry',
    'ToolRegistry',
    'ToolResult',
    'tool',
    'call_tool'
]

# Ensure all tools are registered:
# each module's register_all() adds its tools to the shared registry,
# so importing this package is enough to make every tool callable.
system_tools.register_all()
git_tools.register_all()
network_tools.register_all()
|
||||||
448
uni-wizard/tools/git_tools.py
Normal file
448
uni-wizard/tools/git_tools.py
Normal file
@@ -0,0 +1,448 @@
|
|||||||
|
"""
|
||||||
|
Git Tools for Uni-Wizard
|
||||||
|
Repository operations and version control
|
||||||
|
"""
|
||||||
|
|
||||||
|
import os
|
||||||
|
import json
|
||||||
|
import subprocess
|
||||||
|
from typing import Dict, List, Optional
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from .registry import registry
|
||||||
|
|
||||||
|
|
||||||
|
def run_git_command(args: List[str], cwd: str = None) -> tuple:
    """Run ``git`` with the given arguments and capture its output.

    Args:
        args: Arguments passed to git (e.g. ``['status', '--porcelain']``).
        cwd: Working directory for the command, or None for the current one.

    Returns:
        A ``(stdout, stderr, returncode)`` tuple. Never raises: any
        failure to launch the process is reported as returncode 1 with
        the exception text in the stderr slot.
    """
    cmd = ['git', *args]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, cwd=cwd)
    except Exception as exc:
        return "", str(exc), 1
    return proc.stdout, proc.stderr, proc.returncode
|
||||||
|
|
||||||
|
|
||||||
|
def git_status(repo_path: str = ".") -> str:
    """
    Get git repository status.

    Args:
        repo_path: Path to git repository (default: current directory)

    Returns:
        JSON string with branch, last commit, changed files and remotes,
        or an error string if the path is not a git repository.
    """
    try:
        status = {"repo_path": os.path.abspath(repo_path)}

        # The branch query doubles as the "is this a repo?" probe.
        out, _, code = run_git_command(['branch', '--show-current'], cwd=repo_path)
        if code != 0:
            return f"Error: Not a git repository at {repo_path}"
        status["branch"] = out.strip()

        # Most recent commit, pipe-delimited for trivial parsing.
        out, _, code = run_git_command(
            ['log', '-1', '--format=%H|%s|%an|%ad', '--date=short'], cwd=repo_path
        )
        if code == 0:
            fields = out.strip().split('|')
            if len(fields) >= 4:
                status["last_commit"] = {
                    "hash": fields[0][:8],
                    "message": fields[1],
                    "author": fields[2],
                    "date": fields[3]
                }

        # Working-tree changes via the stable porcelain format
        # (two status chars, a space, then the path).
        out, _, code = run_git_command(['status', '--porcelain'], cwd=repo_path)
        if code == 0:
            changes = [
                {"file": entry[3:], "status": entry[:2].strip()}
                for entry in out.strip().split('\n')
                if entry
            ]
            status["changes"] = changes
            status["has_changes"] = len(changes) > 0

        # Configured remotes (fetch/push lines collapse to name + url).
        out, _, code = run_git_command(['remote', '-v'], cwd=repo_path)
        if code == 0:
            remotes = []
            for entry in out.strip().split('\n'):
                fields = entry.split()
                if len(fields) >= 2:
                    remotes.append({"name": fields[0], "url": fields[1]})
            status["remotes"] = remotes

        return json.dumps(status, indent=2)

    except Exception as e:
        return f"Error getting git status: {str(e)}"
def git_log(repo_path: str = ".", count: int = 10) -> str:
    """
    Get recent commit history.

    Args:
        repo_path: Path to git repository
        count: Number of commits to show (default: 10)

    Returns:
        JSON string with the commit count and a list of commits.
    """
    try:
        out, err, code = run_git_command(
            ['log', f'-{count}', '--format=%H|%s|%an|%ad', '--date=short'],
            cwd=repo_path
        )
        if code != 0:
            return f"Error: {err}"

        commits = []
        for entry in out.strip().split('\n'):
            if not entry:
                continue
            fields = entry.split('|')
            if len(fields) >= 4:
                commits.append({
                    "hash": fields[0][:8],
                    "message": fields[1],
                    "author": fields[2],
                    "date": fields[3]
                })

        return json.dumps({"count": len(commits), "commits": commits}, indent=2)

    except Exception as e:
        return f"Error getting git log: {str(e)}"
def git_pull(repo_path: str = ".") -> str:
    """
    Pull latest changes from remote.

    Args:
        repo_path: Path to git repository

    Returns:
        Human-readable pull result.
    """
    try:
        out, err, code = run_git_command(['pull'], cwd=repo_path)
        if code != 0:
            return f"✗ Pull failed:\n{err}"
        if 'Already up to date' in out:
            return "✓ Already up to date"
        return f"✓ Pull successful:\n{out}"
    except Exception as e:
        return f"Error pulling: {str(e)}"
def git_commit(repo_path: str = ".", message: str = None, files: List[str] = None) -> str:
    """
    Stage and commit changes.

    Args:
        repo_path: Path to git repository
        message: Commit message (required)
        files: Specific files to commit (default: all changes)

    Returns:
        Commit result
    """
    if not message:
        return "Error: commit message is required"

    try:
        # Stage files
        if files:
            for f in files:
                _, stderr, rc = run_git_command(['add', f], cwd=repo_path)
                if rc != 0:
                    return f"✗ Failed to stage {f}: {stderr}"
        else:
            _, stderr, rc = run_git_command(['add', '.'], cwd=repo_path)
            if rc != 0:
                return f"✗ Failed to stage changes: {stderr}"

        # Commit
        stdout, stderr, rc = run_git_command(['commit', '-m', message], cwd=repo_path)

        if rc == 0:
            return f"✓ Commit successful:\n{stdout}"

        # BUG FIX: git reports "nothing to commit" on *stdout*, not stderr,
        # so inspect both streams before declaring the commit a failure.
        if 'nothing to commit' in (stdout + stderr).lower():
            return "✓ Nothing to commit (working tree clean)"
        return f"✗ Commit failed:\n{stderr}"

    except Exception as e:
        return f"Error committing: {str(e)}"
def git_push(repo_path: str = ".", remote: str = "origin", branch: str = None) -> str:
    """
    Push to remote repository.

    Args:
        repo_path: Path to git repository
        remote: Remote name (default: origin)
        branch: Branch to push (default: current branch)

    Returns:
        Push result.
    """
    try:
        target = branch
        if not target:
            # Resolve the currently checked-out branch.
            out, _, code = run_git_command(['branch', '--show-current'], cwd=repo_path)
            if code != 0:
                return "Error: Could not determine current branch"
            target = out.strip()

        _, err, code = run_git_command(['push', remote, target], cwd=repo_path)
        if code == 0:
            return f"✓ Push successful to {remote}/{target}"
        return f"✗ Push failed:\n{err}"

    except Exception as e:
        return f"Error pushing: {str(e)}"
def git_checkout(repo_path: str = ".", branch: str = None, create: bool = False) -> str:
    """
    Checkout a branch.

    Args:
        repo_path: Path to git repository
        branch: Branch name to checkout
        create: Create the branch if it doesn't exist

    Returns:
        Checkout result.
    """
    if not branch:
        return "Error: branch name is required"

    try:
        args = ['checkout', '-b', branch] if create else ['checkout', branch]
        _, err, code = run_git_command(args, cwd=repo_path)
        if code == 0:
            return f"✓ Checked out branch: {branch}"
        return f"✗ Checkout failed:\n{err}"
    except Exception as e:
        return f"Error checking out: {str(e)}"
def git_branch_list(repo_path: str = ".") -> str:
    """
    List all branches.

    Args:
        repo_path: Path to git repository

    Returns:
        JSON string of branches, the current one flagged.
    """
    try:
        out, err, code = run_git_command(['branch', '-a'], cwd=repo_path)
        if code != 0:
            return f"Error: {err}"

        branches = []
        for raw in out.strip().split('\n'):
            if not raw:
                continue
            entry = raw.strip()
            # The checked-out branch is prefixed with '*'.
            is_current = entry.startswith('*')
            if is_current:
                entry = entry[1:].strip()
            branches.append({"name": entry, "current": is_current})

        return json.dumps({"branches": branches}, indent=2)

    except Exception as e:
        return f"Error listing branches: {str(e)}"
# Register all git tools
def register_all():
    """Register every git tool with the shared registry."""
    # Every git tool takes the same optional repo_path parameter.
    repo_param = {
        "type": "string",
        "description": "Path to git repository",
        "default": "."
    }

    # (name, handler, description, extra properties, required names)
    specs = [
        ("git_status", git_status,
         "Get git repository status (branch, changes, last commit)",
         {}, []),
        ("git_log", git_log,
         "Get recent commit history",
         {"count": {
             "type": "integer",
             "description": "Number of commits to show",
             "default": 10
         }}, []),
        ("git_pull", git_pull,
         "Pull latest changes from remote",
         {}, []),
        ("git_commit", git_commit,
         "Stage and commit changes",
         {"message": {
             "type": "string",
             "description": "Commit message (required)"
         },
          "files": {
             "type": "array",
             "description": "Specific files to commit (default: all changes)",
             "items": {"type": "string"}
         }}, ["message"]),
        ("git_push", git_push,
         "Push to remote repository",
         {"remote": {
             "type": "string",
             "description": "Remote name",
             "default": "origin"
         },
          "branch": {
             "type": "string",
             "description": "Branch to push (default: current)"
         }}, []),
        ("git_checkout", git_checkout,
         "Checkout a branch",
         {"branch": {
             "type": "string",
             "description": "Branch name to checkout"
         },
          "create": {
             "type": "boolean",
             "description": "Create branch if it doesn't exist",
             "default": False
         }}, ["branch"]),
        ("git_branch_list", git_branch_list,
         "List all branches",
         {}, []),
    ]

    for name, handler, description, extra_props, required in specs:
        properties = {"repo_path": dict(repo_param)}
        properties.update(extra_props)
        parameters = {"type": "object", "properties": properties}
        if required:
            parameters["required"] = required
        registry.register(
            name=name,
            handler=handler,
            description=description,
            parameters=parameters,
            category="git"
        )


register_all()
459
uni-wizard/tools/network_tools.py
Normal file
459
uni-wizard/tools/network_tools.py
Normal file
@@ -0,0 +1,459 @@
|
|||||||
|
"""
|
||||||
|
Network Tools for Uni-Wizard
|
||||||
|
HTTP client and Gitea API integration
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import urllib.request
|
||||||
|
import urllib.error
|
||||||
|
from typing import Dict, Optional, Any
|
||||||
|
from base64 import b64encode
|
||||||
|
|
||||||
|
from .registry import registry
|
||||||
|
|
||||||
|
|
||||||
|
class HTTPClient:
    """Simple HTTP client for API calls, built on urllib."""

    def __init__(self, base_url: str = None, auth: tuple = None):
        self.base_url = base_url  # optional prefix for relative URLs
        self.auth = auth          # optional (username, password) pair

    def _make_request(
        self,
        method: str,
        url: str,
        data: Dict = None,
        headers: Dict = None
    ) -> tuple:
        """Make HTTP request and return (body, status_code, error)."""
        try:
            # Resolve relative paths against the configured base URL.
            if self.base_url and not url.startswith('http'):
                target = f"{self.base_url.rstrip('/')}/{url.lstrip('/')}"
            else:
                target = url

            payload = json.dumps(data).encode('utf-8') if data else None

            request = urllib.request.Request(target, data=payload, method=method)
            request.add_header('Content-Type', 'application/json')
            for key, value in (headers or {}).items():
                request.add_header(key, value)

            # HTTP Basic auth when credentials were supplied.
            if self.auth:
                user, password = self.auth
                token = b64encode(f"{user}:{password}".encode()).decode()
                request.add_header('Authorization', f'Basic {token}')

            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read().decode('utf-8'), response.status, None

        except urllib.error.HTTPError as e:
            # HTTP-level errors still carry a useful response body.
            return e.read().decode('utf-8'), e.code, str(e)
        except Exception as e:
            return None, 0, str(e)

    def get(self, url: str) -> tuple:
        return self._make_request('GET', url)

    def post(self, url: str, data: Dict) -> tuple:
        return self._make_request('POST', url, data)

    def put(self, url: str, data: Dict) -> tuple:
        return self._make_request('PUT', url, data)

    def delete(self, url: str) -> tuple:
        return self._make_request('DELETE', url)
def http_get(url: str) -> str:
    """
    Perform HTTP GET request.

    Args:
        url: URL to fetch

    Returns:
        Response body or error message
    """
    body, status, error = HTTPClient().get(url)
    if error:
        return f"Error (HTTP {status}): {error}"
    return body
def http_post(url: str, body: Dict) -> str:
    """
    Perform HTTP POST request with JSON body.

    Args:
        url: URL to post to
        body: JSON body as dictionary

    Returns:
        Response body or error message
    """
    response_body, status, error = HTTPClient().post(url, body)
    if error:
        return f"Error (HTTP {status}): {error}"
    return response_body
# Gitea API Tools
#
# Connection settings for the self-hosted Gitea instance. Defaults match
# the previous hardcoded values; each can be overridden via environment
# variables so credentials never need to live in source control.
import os

GITEA_URL = os.environ.get("GITEA_URL", "http://143.198.27.163:3000")
GITEA_USER = os.environ.get("GITEA_USER", "timmy")
# Empty password disables authentication (callers pass auth=None then).
GITEA_PASS = os.environ.get("GITEA_PASS", "")
def gitea_create_issue(
    repo: str = "Timmy_Foundation/timmy-home",
    title: str = None,
    body: str = None,
    labels: list = None
) -> str:
    """
    Create a Gitea issue.

    Args:
        repo: Repository path (owner/repo)
        title: Issue title (required)
        body: Issue body
        labels: List of label names

    Returns:
        Created issue URL or error
    """
    if not title:
        return "Error: title is required"

    try:
        client = HTTPClient(
            base_url=GITEA_URL,
            auth=(GITEA_USER, GITEA_PASS) if GITEA_PASS else None
        )

        payload = {"title": title, "body": body or ""}
        if labels:
            payload["labels"] = labels

        response, status, error = client.post(f"/api/v1/repos/{repo}/issues", payload)
        if error:
            return f"Error creating issue: {error}"

        created = json.loads(response)
        return f"✓ Issue created: #{created['number']} - {created['html_url']}"

    except Exception as e:
        return f"Error: {str(e)}"
def gitea_comment(
    repo: str = "Timmy_Foundation/timmy-home",
    issue_number: int = None,
    body: str = None
) -> str:
    """
    Comment on a Gitea issue.

    Args:
        repo: Repository path
        issue_number: Issue number (required)
        body: Comment body (required)

    Returns:
        Comment result
    """
    if not issue_number or not body:
        return "Error: issue_number and body are required"

    try:
        client = HTTPClient(
            base_url=GITEA_URL,
            auth=(GITEA_USER, GITEA_PASS) if GITEA_PASS else None
        )

        response, status, error = client.post(
            f"/api/v1/repos/{repo}/issues/{issue_number}/comments",
            {"body": body}
        )
        if error:
            return f"Error posting comment: {error}"

        posted = json.loads(response)
        return f"✓ Comment posted: {posted['html_url']}"

    except Exception as e:
        return f"Error: {str(e)}"
def gitea_list_issues(
    repo: str = "Timmy_Foundation/timmy-home",
    state: str = "open",
    assignee: str = None
) -> str:
    """
    List Gitea issues.

    Args:
        repo: Repository path
        state: open, closed, or all
        assignee: Filter by assignee username

    Returns:
        JSON list of issues
    """
    try:
        client = HTTPClient(
            base_url=GITEA_URL,
            auth=(GITEA_USER, GITEA_PASS) if GITEA_PASS else None
        )

        query = f"/api/v1/repos/{repo}/issues?state={state}"
        if assignee:
            query += f"&assignee={assignee}"

        response, status, error = client.get(query)
        if error:
            return f"Error fetching issues: {error}"

        # Reduce each issue to the fields the agent actually needs.
        simplified = [
            {
                "number": issue["number"],
                "title": issue["title"],
                "state": issue["state"],
                "assignee": issue.get("assignee", {}).get("login") if issue.get("assignee") else None,
                "url": issue["html_url"]
            }
            for issue in json.loads(response)
        ]

        return json.dumps({
            "count": len(simplified),
            "issues": simplified
        }, indent=2)

    except Exception as e:
        return f"Error: {str(e)}"
def gitea_get_issue(repo: str = "Timmy_Foundation/timmy-home", issue_number: int = None) -> str:
    """
    Get details of a specific Gitea issue.

    Args:
        repo: Repository path
        issue_number: Issue number (required)

    Returns:
        Issue details
    """
    if not issue_number:
        return "Error: issue_number is required"

    try:
        client = HTTPClient(
            base_url=GITEA_URL,
            auth=(GITEA_USER, GITEA_PASS) if GITEA_PASS else None
        )

        response, status, error = client.get(
            f"/api/v1/repos/{repo}/issues/{issue_number}"
        )

        if error:
            return f"Error fetching issue: {error}"

        issue = json.loads(response)

        # BUG FIX: Gitea returns null for an empty issue body, which made
        # len(issue["body"]) raise TypeError. Normalize to "" first.
        raw_body = issue.get("body") or ""

        return json.dumps({
            "number": issue["number"],
            "title": issue["title"],
            # Truncate long bodies so the output stays prompt-friendly.
            "body": raw_body[:500] + "..." if len(raw_body) > 500 else raw_body,
            "state": issue["state"],
            "assignee": issue.get("assignee", {}).get("login") if issue.get("assignee") else None,
            "created_at": issue["created_at"],
            "url": issue["html_url"]
        }, indent=2)

    except Exception as e:
        return f"Error: {str(e)}"
# Register all network tools
def register_all():
    """Register HTTP and Gitea tools with the shared registry."""

    def repo_prop(description: str) -> dict:
        """Common 'repo' parameter, varying only in its description."""
        return {
            "type": "string",
            "description": description,
            "default": "Timmy_Foundation/timmy-home"
        }

    # (name, handler, description, properties, required names)
    specs = [
        ("http_get", http_get, "Perform HTTP GET request",
         {"url": {"type": "string", "description": "URL to fetch"}},
         ["url"]),
        ("http_post", http_post, "Perform HTTP POST request with JSON body",
         {"url": {"type": "string", "description": "URL to post to"},
          "body": {"type": "object", "description": "JSON body as dictionary"}},
         ["url", "body"]),
        ("gitea_create_issue", gitea_create_issue, "Create a Gitea issue",
         {"repo": repo_prop("Repository path (owner/repo)"),
          "title": {"type": "string", "description": "Issue title"},
          "body": {"type": "string", "description": "Issue body"},
          "labels": {"type": "array",
                     "description": "List of label names",
                     "items": {"type": "string"}}},
         ["title"]),
        ("gitea_comment", gitea_comment, "Comment on a Gitea issue",
         {"repo": repo_prop("Repository path"),
          "issue_number": {"type": "integer", "description": "Issue number"},
          "body": {"type": "string", "description": "Comment body"}},
         ["issue_number", "body"]),
        ("gitea_list_issues", gitea_list_issues, "List Gitea issues",
         {"repo": repo_prop("Repository path"),
          "state": {"type": "string",
                    "enum": ["open", "closed", "all"],
                    "description": "Issue state",
                    "default": "open"},
          "assignee": {"type": "string",
                       "description": "Filter by assignee username"}},
         []),
        ("gitea_get_issue", gitea_get_issue,
         "Get details of a specific Gitea issue",
         {"repo": repo_prop("Repository path"),
          "issue_number": {"type": "integer", "description": "Issue number"}},
         ["issue_number"]),
    ]

    for name, handler, description, properties, required in specs:
        parameters = {"type": "object", "properties": properties}
        if required:
            parameters["required"] = required
        registry.register(
            name=name,
            handler=handler,
            description=description,
            parameters=parameters,
            category="network"
        )


register_all()
265
uni-wizard/tools/registry.py
Normal file
265
uni-wizard/tools/registry.py
Normal file
@@ -0,0 +1,265 @@
|
|||||||
|
"""
|
||||||
|
Uni-Wizard Tool Registry
|
||||||
|
Central registry for all tool capabilities
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import inspect
|
||||||
|
from typing import Dict, Callable, Any, Optional
|
||||||
|
from dataclasses import dataclass, asdict
|
||||||
|
from functools import wraps
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
class ToolSchema:
    """Schema definition for a tool."""
    name: str                        # tool identifier used in tool calls
    description: str                 # human/LLM-readable summary
    parameters: Dict[str, Any]       # JSON Schema describing the parameters
    returns: str                     # description of the return value
    # FIX: annotated Optional[list] — the default is None, not a list.
    examples: Optional[list] = None  # example invocations, if any

    def to_dict(self) -> Dict[str, Any]:
        """Return the schema as a plain dictionary of its fields."""
        return asdict(self)
@dataclass
class ToolResult:
    """Standardized tool execution result."""
    success: bool                              # did the tool run without error?
    data: Any                                  # tool output (any JSON-able value)
    error: Optional[str] = None                # error text when success is False
    execution_time_ms: Optional[float] = None  # wall-clock duration

    def to_json(self) -> str:
        """Serialize all four fields as pretty-printed JSON."""
        payload = {
            'success': self.success,
            'data': self.data,
            'error': self.error,
            'execution_time_ms': self.execution_time_ms
        }
        return json.dumps(payload, indent=2)

    def __str__(self) -> str:
        """Render the data on success, or the error otherwise."""
        return str(self.data) if self.success else f"Error: {self.error}"
class ToolRegistry:
|
||||||
|
"""
|
||||||
|
Central registry for all uni-wizard tools.
|
||||||
|
|
||||||
|
All tools register here with their schemas.
|
||||||
|
The LLM queries available tools via get_tool_definitions().
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self._tools: Dict[str, Dict] = {}
|
||||||
|
self._categories: Dict[str, list] = {}
|
||||||
|
|
||||||
|
def register(
|
||||||
|
self,
|
||||||
|
name: str,
|
||||||
|
handler: Callable,
|
||||||
|
description: str = None,
|
||||||
|
parameters: Dict = None,
|
||||||
|
category: str = "general",
|
||||||
|
examples: list = None
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Register a tool in the registry.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
name: Tool name (used in tool calls)
|
||||||
|
handler: Function to execute
|
||||||
|
description: What the tool does
|
||||||
|
parameters: JSON Schema for parameters
|
||||||
|
category: Tool category (system, git, network, file)
|
||||||
|
examples: Example usages
|
||||||
|
"""
|
||||||
|
# Auto-extract description from docstring if not provided
|
||||||
|
if description is None and handler.__doc__:
|
||||||
|
description = handler.__doc__.strip().split('\n')[0]
|
||||||
|
|
||||||
|
# Auto-extract parameters from function signature
|
||||||
|
if parameters is None:
|
||||||
|
parameters = self._extract_params(handler)
|
||||||
|
|
||||||
|
self._tools[name] = {
|
||||||
|
'name': name,
|
||||||
|
'handler': handler,
|
||||||
|
'description': description or f"Execute {name}",
|
||||||
|
'parameters': parameters,
|
||||||
|
'category': category,
|
||||||
|
'examples': examples or []
|
||||||
|
}
|
||||||
|
|
||||||
|
# Add to category
|
||||||
|
if category not in self._categories:
|
||||||
|
self._categories[category] = []
|
||||||
|
self._categories[category].append(name)
|
||||||
|
|
||||||
|
return self # For chaining
|
||||||
|
|
||||||
|
def _extract_params(self, handler: Callable) -> Dict:
|
||||||
|
"""Extract parameter schema from function signature"""
|
||||||
|
sig = inspect.signature(handler)
|
||||||
|
params = {
|
||||||
|
"type": "object",
|
||||||
|
"properties": {},
|
||||||
|
"required": []
|
||||||
|
}
|
||||||
|
|
||||||
|
for name, param in sig.parameters.items():
|
||||||
|
# Skip 'self', 'cls', and params with defaults
|
||||||
|
if name in ('self', 'cls'):
|
||||||
|
continue
|
||||||
|
|
||||||
|
param_info = {"type": "string"} # Default
|
||||||
|
|
||||||
|
# Try to infer type from annotation
|
||||||
|
if param.annotation != inspect.Parameter.empty:
|
||||||
|
if param.annotation == int:
|
||||||
|
param_info["type"] = "integer"
|
||||||
|
elif param.annotation == float:
|
||||||
|
param_info["type"] = "number"
|
||||||
|
elif param.annotation == bool:
|
||||||
|
param_info["type"] = "boolean"
|
||||||
|
elif param.annotation == list:
|
||||||
|
param_info["type"] = "array"
|
||||||
|
elif param.annotation == dict:
|
||||||
|
param_info["type"] = "object"
|
||||||
|
|
||||||
|
# Add description if in docstring
|
||||||
|
if handler.__doc__:
|
||||||
|
# Simple param extraction from docstring
|
||||||
|
for line in handler.__doc__.split('\n'):
|
||||||
|
if f'{name}:' in line or f'{name} (' in line:
|
||||||
|
desc = line.split(':', 1)[-1].strip()
|
||||||
|
param_info["description"] = desc
|
||||||
|
break
|
||||||
|
|
||||||
|
params["properties"][name] = param_info
|
||||||
|
|
||||||
|
# Required if no default
|
||||||
|
if param.default == inspect.Parameter.empty:
|
||||||
|
params["required"].append(name)
|
||||||
|
|
||||||
|
return params
|
||||||
|
|
||||||
|
def execute(self, name: str, **params) -> ToolResult:
    """
    Execute a registered tool by name.

    Args:
        name: Tool name
        **params: Keyword arguments forwarded to the tool handler

    Returns:
        ToolResult carrying success/failure, data/error, and elapsed time
    """
    import time
    started = time.time()

    def _elapsed_ms() -> float:
        # Wall-clock time since entry, in milliseconds.
        return (time.time() - started) * 1000

    entry = self._tools.get(name)
    if entry is None:
        return ToolResult(
            success=False,
            data=None,
            error=f"Tool '{name}' not found in registry",
            execution_time_ms=_elapsed_ms()
        )

    try:
        data = entry['handler'](**params)
    except Exception as exc:
        # Surface the exception type and message instead of raising.
        return ToolResult(
            success=False,
            data=None,
            error=f"{type(exc).__name__}: {str(exc)}",
            execution_time_ms=_elapsed_ms()
        )

    return ToolResult(
        success=True,
        data=data,
        execution_time_ms=_elapsed_ms()
    )
|
||||||
|
|
||||||
|
def get_tool(self, name: str) -> Optional[Dict]:
    """Return a serializable copy of a tool definition, or None if unknown."""
    entry = self._tools.get(name)
    if entry is None:
        return None
    # Strip the handler: callables are not serializable.
    public_keys = ('name', 'description', 'parameters', 'category', 'examples')
    return {key: entry[key] for key in public_keys}
|
||||||
|
|
||||||
|
def get_tools_by_category(self, category: str) -> list:
    """Return serializable definitions for every tool in *category*.

    Unknown categories yield an empty list; tools whose definition cannot
    be resolved are dropped. Each tool is looked up exactly once (the
    previous version called get_tool twice per name).
    """
    definitions = (self.get_tool(name) for name in self._categories.get(category, []))
    return [tool_def for tool_def in definitions if tool_def]
|
||||||
|
|
||||||
|
def list_tools(self, category: str = None) -> list:
    """List registered tool names; restrict to *category* when given."""
    if not category:
        return list(self._tools.keys())
    return self._categories.get(category, [])
|
||||||
|
|
||||||
|
def get_tool_definitions(self) -> str:
    """
    Render every registered tool (name, description, parameter schema)
    as an indented JSON string for the LLM system prompt.
    """
    payload = [
        {
            "name": name,
            "description": entry['description'],
            "parameters": entry['parameters']
        }
        for name, entry in self._tools.items()
    ]
    return json.dumps(payload, indent=2)
|
||||||
|
|
||||||
|
def get_categories(self) -> list:
    """Return the names of every tool category."""
    return [category for category in self._categories]
|
||||||
|
|
||||||
|
|
||||||
|
# Global registry instance, shared by the @tool decorator and call_tool() below.
registry = ToolRegistry()
|
||||||
|
|
||||||
|
|
||||||
|
def tool(name: str = None, category: str = "general", examples: list = None):
    """
    Decorator that registers a function as a tool in the global registry.

    The registered name defaults to the function's own name.

    Usage:
        @tool(category="system")
        def system_info():
            return {...}
    """
    def decorator(func: Callable):
        registry.register(
            name=name or func.__name__,
            handler=func,
            category=category,
            examples=examples
        )
        # Return the function unchanged so it stays directly callable.
        return func
    return decorator
|
||||||
|
|
||||||
|
|
||||||
|
# Convenience function for quick tool execution
|
||||||
|
def call_tool(name: str, **params) -> str:
    """Execute *name* through the global registry and return str(result)."""
    outcome = registry.execute(name, **params)
    return str(outcome)
|
||||||
377
uni-wizard/tools/system_tools.py
Normal file
377
uni-wizard/tools/system_tools.py
Normal file
@@ -0,0 +1,377 @@
|
|||||||
|
"""
|
||||||
|
System Tools for Uni-Wizard
|
||||||
|
Monitor and control the VPS environment
|
||||||
|
"""
|
||||||
|
|
||||||
|
import os
|
||||||
|
import json
|
||||||
|
import subprocess
|
||||||
|
import platform
|
||||||
|
import psutil
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
from typing import Dict, List, Optional
|
||||||
|
|
||||||
|
from .registry import tool, registry
|
||||||
|
|
||||||
|
|
||||||
|
@tool(category="system")
def system_info() -> str:
    """
    Get comprehensive system information.

    Returns:
        JSON string with OS, CPU, memory, disk, uptime, and load-average
        info, or an "Error ..." string if collection fails.
    """
    gib = 1024 ** 3  # bytes per GiB, used for all size conversions
    try:
        cpu_freq = psutil.cpu_freq()
        memory = psutil.virtual_memory()
        disk = psutil.disk_usage('/')
        boot_time = datetime.fromtimestamp(psutil.boot_time())
        uptime_secs = int((datetime.now() - boot_time).total_seconds())
        # os.getloadavg is POSIX-only; fall back to zeros elsewhere.
        load_avg = os.getloadavg() if hasattr(os, 'getloadavg') else [0, 0, 0]

        info = {
            "hostname": platform.node(),
            "os": {
                "system": platform.system(),
                "release": platform.release(),
                "version": platform.version(),
                "machine": platform.machine()
            },
            "cpu": {
                "count": psutil.cpu_count(),
                # interval=1 blocks for one second while sampling CPU load.
                "percent": psutil.cpu_percent(interval=1),
                "frequency_mhz": cpu_freq.current if cpu_freq else None
            },
            "memory": {
                "total_gb": round(memory.total / gib, 2),
                "available_gb": round(memory.available / gib, 2),
                "percent_used": memory.percent
            },
            "disk": {
                "total_gb": round(disk.total / gib, 2),
                "free_gb": round(disk.free / gib, 2),
                "percent_used": round((disk.used / disk.total) * 100, 1)
            },
            "uptime": {
                "boot_time": boot_time.isoformat(),
                "uptime_seconds": uptime_secs,
                "uptime_human": str(timedelta(seconds=uptime_secs))
            },
            "load_average": {
                "1min": round(load_avg[0], 2),
                "5min": round(load_avg[1], 2),
                "15min": round(load_avg[2], 2)
            }
        }
        return json.dumps(info, indent=2)

    except Exception as e:
        return f"Error getting system info: {str(e)}"
|
||||||
|
|
||||||
|
|
||||||
|
@tool(category="system")
def process_list(filter_name: str = None) -> str:
    """
    List running processes with optional name filter.

    Args:
        filter_name: Optional process name to filter by (case-insensitive
            substring match)

    Returns:
        JSON with process count, the filter used, and the top-50 processes
        by CPU usage (PID, name, CPU%, memory%, status), or an "Error ..."
        string on failure.
    """
    try:
        processes = []
        for proc in psutil.process_iter(['pid', 'name', 'cpu_percent', 'memory_percent', 'status']):
            try:
                info = proc.info
                # psutil may report None for name/cpu_percent (e.g. zombie
                # or just-spawned processes); normalize so the filter and
                # the sort below cannot raise.
                proc_name = info['name'] or ''
                if filter_name and filter_name.lower() not in proc_name.lower():
                    continue
                processes.append({
                    "pid": info['pid'],
                    "name": info['name'],
                    "cpu_percent": info['cpu_percent'] if info['cpu_percent'] is not None else 0,
                    "memory_percent": round(info['memory_percent'], 2) if info['memory_percent'] else 0,
                    "status": info['status']
                })
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue

        # Sort by CPU usage, busiest first
        processes.sort(key=lambda x: x['cpu_percent'], reverse=True)

        return json.dumps({
            "count": len(processes),
            "filter": filter_name,
            "processes": processes[:50]  # Limit to top 50
        }, indent=2)

    except Exception as e:
        return f"Error listing processes: {str(e)}"
|
||||||
|
|
||||||
|
|
||||||
|
@tool(category="system")
def service_status(service_name: str) -> str:
    """
    Check systemd service status.

    Args:
        service_name: Name of the service (e.g., 'llama-server', 'syncthing@root')

    Returns:
        JSON with fields parsed from `systemctl status` output plus the
        systemctl exit code, or an "Error ..." string on failure.
    """
    # Output markers pulled from `systemctl status`, in match-priority order.
    fields = (
        ('Active:', 'active'),
        ('Loaded:', 'loaded'),
        ('Main PID:', 'pid'),
        ('Memory:', 'memory'),
        ('CPU:', 'cpu'),
    )
    try:
        result = subprocess.run(
            ['systemctl', 'status', service_name, '--no-pager'],
            capture_output=True,
            text=True
        )

        status_info = {"service": service_name}
        for line in result.stdout.split('\n'):
            for marker, key in fields:
                if marker in line:
                    # Keep everything after the first colon on the line.
                    status_info[key] = line.split(':', 1)[1].strip()
                    break

        status_info['exit_code'] = result.returncode
        return json.dumps(status_info, indent=2)

    except Exception as e:
        return f"Error checking service status: {str(e)}"
|
||||||
|
|
||||||
|
|
||||||
|
@tool(category="system")
def service_control(service_name: str, action: str) -> str:
    """
    Control a systemd service (start, stop, restart, enable, disable).

    Args:
        service_name: Name of the service
        action: start, stop, restart, enable, disable, status

    Returns:
        Human-readable success/failure message for the action.
    """
    valid_actions = ['start', 'stop', 'restart', 'enable', 'disable', 'status']
    if action not in valid_actions:
        return f"Invalid action. Use: {', '.join(valid_actions)}"

    try:
        outcome = subprocess.run(
            ['systemctl', action, service_name],
            capture_output=True,
            text=True
        )
    except Exception as e:
        return f"Error controlling service: {str(e)}"

    if outcome.returncode == 0:
        return f"✓ Service '{service_name}' {action} successful"
    return f"✗ Service '{service_name}' {action} failed: {outcome.stderr}"
|
||||||
|
|
||||||
|
|
||||||
|
@tool(category="system")
def health_check() -> str:
    """
    Comprehensive health check of the VPS.

    Checks:
    - System resources (memory, disk)
    - Critical services (llama-server, syncthing@root)
    - Local inference endpoint on port 8081

    Returns:
        JSON health report; "overall" is "healthy" unless the inference
        endpoint or any checked service is down, in which case "degraded".
    """
    try:
        health = {
            "timestamp": datetime.now().isoformat(),
            "overall": "healthy",
            "checks": {}
        }

        # System resources
        memory = psutil.virtual_memory()
        disk = psutil.disk_usage('/')

        health["checks"]["memory"] = {
            "status": "healthy" if memory.percent < 90 else "warning",
            "percent_used": memory.percent,
            "available_gb": round(memory.available / (1024**3), 2)
        }

        health["checks"]["disk"] = {
            "status": "healthy" if disk.percent < 90 else "warning",
            "percent_used": disk.percent,
            "free_gb": round(disk.free / (1024**3), 2)
        }

        # Check inference endpoint. Was a bare `except:` (which also caught
        # KeyboardInterrupt/SystemExit) with a leaked response object; now
        # catches Exception only and closes the connection deterministically.
        try:
            import urllib.request
            with urllib.request.urlopen('http://127.0.0.1:8081/health', timeout=5):
                pass
            health["checks"]["inference"] = {"status": "healthy", "port": 8081}
        except Exception:
            health["checks"]["inference"] = {"status": "down", "port": 8081}
            health["overall"] = "degraded"

        # Check services
        services = ['llama-server', 'syncthing@root']
        for svc in services:
            result = subprocess.run(['systemctl', 'is-active', svc], capture_output=True, text=True)
            health["checks"][svc] = {
                "status": "healthy" if result.returncode == 0 else "down"
            }
            if result.returncode != 0:
                health["overall"] = "degraded"

        return json.dumps(health, indent=2)

    except Exception as e:
        return f"Error running health check: {str(e)}"
|
||||||
|
|
||||||
|
|
||||||
|
@tool(category="system")
def disk_usage(path: str = "/") -> str:
    """
    Get disk usage for a path.

    Args:
        path: Path to check (default: /)

    Returns:
        JSON disk usage statistics, or an "Error ..." string on failure.
    """
    try:
        usage = psutil.disk_usage(path)
        gib = 1024 ** 3
        stats = {
            "path": path,
            "total_gb": round(usage.total / gib, 2),
            "used_gb": round(usage.used / gib, 2),
            "free_gb": round(usage.free / gib, 2),
            "percent_used": round((usage.used / usage.total) * 100, 1)
        }
        return json.dumps(stats, indent=2)
    except Exception as e:
        return f"Error checking disk usage: {str(e)}"
|
||||||
|
|
||||||
|
|
||||||
|
# Auto-register all tools in this module
|
||||||
|
def register_all():
    """Explicitly register all system tools with curated descriptions.

    NOTE(review): the @tool decorators above have already auto-registered
    these handlers; this pass re-registers them with hand-written
    descriptions and JSON-Schema parameter blocks — assumes
    registry.register overwrites existing entries (confirm in registry.py).
    """
    # (name, handler, description, explicit parameter schema or None)
    specs = [
        (
            "system_info",
            system_info,
            "Get comprehensive system information (OS, CPU, memory, disk, uptime)",
            None,
        ),
        (
            "process_list",
            process_list,
            "List running processes with optional name filter",
            {
                "type": "object",
                "properties": {
                    "filter_name": {
                        "type": "string",
                        "description": "Optional process name to filter by"
                    }
                }
            },
        ),
        (
            "service_status",
            service_status,
            "Check systemd service status",
            {
                "type": "object",
                "properties": {
                    "service_name": {
                        "type": "string",
                        "description": "Name of the systemd service"
                    }
                },
                "required": ["service_name"]
            },
        ),
        (
            "service_control",
            service_control,
            "Control a systemd service (start, stop, restart, enable, disable)",
            {
                "type": "object",
                "properties": {
                    "service_name": {
                        "type": "string",
                        "description": "Name of the service"
                    },
                    "action": {
                        "type": "string",
                        "enum": ["start", "stop", "restart", "enable", "disable", "status"],
                        "description": "Action to perform"
                    }
                },
                "required": ["service_name", "action"]
            },
        ),
        (
            "health_check",
            health_check,
            "Comprehensive health check of VPS (resources, services, inference)",
            None,
        ),
        (
            "disk_usage",
            disk_usage,
            "Get disk usage for a path",
            {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "Path to check",
                        "default": "/"
                    }
                }
            },
        ),
    ]

    for name, handler, description, parameters in specs:
        kwargs = {
            "name": name,
            "handler": handler,
            "description": description,
            "category": "system",
        }
        # Only pass `parameters` when an explicit schema exists, mirroring
        # the original per-call signatures (some calls omitted it).
        if parameters is not None:
            kwargs["parameters"] = parameters
        registry.register(**kwargs)


register_all()
|
||||||
401
uniwizard/context_compression_review.md
Normal file
401
uniwizard/context_compression_review.md
Normal file
@@ -0,0 +1,401 @@
|
|||||||
|
# Context Compression Review
|
||||||
|
## Gitea Issue: timmy-home #92
|
||||||
|
|
||||||
|
**Date:** 2026-03-30
|
||||||
|
**Reviewer:** Timmy (Agent)
|
||||||
|
**Scope:** `~/.hermes/hermes-agent/agent/context_compressor.py`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
The Hermes context compressor is a **mature, well-architected implementation** with sophisticated handling of tool call pairs, iterative summary updates, and token-aware tail protection. However, there are several **high-impact gaps** related to fallback chain awareness, early warning systems, and checkpoint integration that should be addressed for production reliability.
|
||||||
|
|
||||||
|
**Overall Grade:** B+ (Solid foundation, needs edge-case hardening)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What the Current Implementation Does Well
|
||||||
|
|
||||||
|
### 1. Structured Summary Template (Lines 276-303)
|
||||||
|
The compressor uses a Pi-mono/OpenCode-inspired structured format:
|
||||||
|
- **Goal**: What the user is trying to accomplish
|
||||||
|
- **Constraints & Preferences**: User preferences, coding style
|
||||||
|
- **Progress**: Done / In Progress / Blocked sections
|
||||||
|
- **Key Decisions**: Important technical decisions with rationale
|
||||||
|
- **Relevant Files**: Files read/modified/created with notes
|
||||||
|
- **Next Steps**: What needs to happen next
|
||||||
|
- **Critical Context**: Values, error messages, config details that would be lost
|
||||||
|
|
||||||
|
This is **best-in-class** compared to most context compression implementations.
|
||||||
|
|
||||||
|
### 2. Iterative Summary Updates (Lines 264-304)
|
||||||
|
The `_previous_summary` mechanism preserves information across multiple compactions:
|
||||||
|
- On first compaction: Summarizes from scratch
|
||||||
|
- On subsequent compactions: Updates previous summary with new progress
|
||||||
|
- Moves items from "In Progress" to "Done" when completed
|
||||||
|
- Accumulates constraints and file references across compactions
|
||||||
|
|
||||||
|
### 3. Token-Budget Tail Protection (Lines 490-539)
|
||||||
|
Instead of fixed message counts, protects the most recent N tokens:
|
||||||
|
```python
|
||||||
|
tail_token_budget = threshold_tokens * summary_target_ratio
|
||||||
|
# Default: 50% of 128K context = 64K threshold → ~13K token tail
|
||||||
|
```
|
||||||
|
This scales automatically with model context window.
|
||||||
|
|
||||||
|
### 4. Tool Call/Result Pair Integrity (Lines 392-450)
|
||||||
|
Sophisticated handling of orphaned tool pairs:
|
||||||
|
- `_sanitize_tool_pairs()`: Removes orphaned results, adds stubs for missing results
|
||||||
|
- `_align_boundary_forward/backward()`: Prevents splitting tool groups
|
||||||
|
- Protects the integrity of the message sequence for API compliance
|
||||||
|
|
||||||
|
### 5. Tool Output Pruning Pre-Pass (Lines 152-182)
|
||||||
|
Cheap first pass that replaces old tool results with placeholders:
|
||||||
|
```python
|
||||||
|
_PRUNED_TOOL_PLACEHOLDER = "[Old tool output cleared to save context space]"
|
||||||
|
```
|
||||||
|
Only prunes content >200 chars, preserving smaller results.
|
||||||
|
|
||||||
|
### 6. Rich Serialization for Summary Input (Lines 199-248)
|
||||||
|
Includes tool call arguments and truncates intelligently:
|
||||||
|
- Tool results: Up to 3000 chars (with smart truncation keeping head/tail)
|
||||||
|
- Tool calls: Function name AND arguments (truncated to 400 chars if needed)
|
||||||
|
- All roles: 3000 char limit with ellipses
|
||||||
|
|
||||||
|
### 7. Proper Integration with Agent Loop
|
||||||
|
- Initialized in `AIAgent.__init__()` (lines 1191-1203)
|
||||||
|
- Triggered in `_compress_context()` (line 5259)
|
||||||
|
- Resets state in `reset_session_state()` (lines 1263-1271)
|
||||||
|
- Updates token counts via `update_from_response()` (lines 122-126)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What's Missing or Broken
|
||||||
|
|
||||||
|
### 🔴 CRITICAL: No Fallback Chain Context Window Awareness
|
||||||
|
|
||||||
|
**Issue:** When the agent falls back to a model with a smaller context window (e.g., primary Claude 1M tokens → fallback GPT-4 128K tokens), the compressor's threshold is based on the **original model**, not the fallback model.
|
||||||
|
|
||||||
|
**Location:** `run_agent.py` compression initialization (lines 1191-1203)
|
||||||
|
|
||||||
|
**Impact:**
|
||||||
|
- Fallback model may hit context limits before compression triggers
|
||||||
|
- Or compression may trigger too aggressively for smaller models
|
||||||
|
|
||||||
|
**Evidence:**
|
||||||
|
```python
|
||||||
|
# In AIAgent.__init__():
|
||||||
|
self.context_compressor = ContextCompressor(
|
||||||
|
model=self.model, # Original model only
|
||||||
|
# ... no fallback context lengths passed
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Fix Needed:** Pass fallback chain context lengths and use minimum:
|
||||||
|
```python
|
||||||
|
# Suggested approach:
|
||||||
|
context_lengths = [get_model_context_length(m) for m in [primary] + fallbacks]
|
||||||
|
effective_context = min(context_lengths) # Conservative
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🔴 HIGH: No Pre-Compression Checkpoint
|
||||||
|
|
||||||
|
**Issue:** When compression occurs, the pre-compression state is lost. Users cannot "rewind" to before compression if the summary loses critical information.
|
||||||
|
|
||||||
|
**Location:** `run_agent.py` `_compress_context()` (line 5259)
|
||||||
|
|
||||||
|
**Impact:**
|
||||||
|
- Information loss is irreversible
|
||||||
|
- If summary misses critical context, conversation is corrupted
|
||||||
|
- No audit trail of what was removed
|
||||||
|
|
||||||
|
**Fix Needed:** Create checkpoint before compression:
|
||||||
|
```python
|
||||||
|
def _compress_context(self, messages, system_message, ...):
|
||||||
|
# Create checkpoint BEFORE compression
|
||||||
|
if self._checkpoint_mgr:
|
||||||
|
self._checkpoint_mgr.create_checkpoint(
|
||||||
|
name=f"pre-compression-{self.context_compressor.compression_count}",
|
||||||
|
messages=messages, # Full pre-compression state
|
||||||
|
)
|
||||||
|
compressed = self.context_compressor.compress(messages, ...)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🟡 MEDIUM: No Progressive Context Pressure Warnings
|
||||||
|
|
||||||
|
**Issue:** Only a single warning fires, at 85% of the compression threshold (line 7871); compression then triggers abruptly once the threshold itself is crossed. There is no graduated alert system.
|
||||||
|
|
||||||
|
**Location:** `run_agent.py` context pressure check (lines 7865-7872)
|
||||||
|
|
||||||
|
**Current:**
|
||||||
|
```python
|
||||||
|
if _compaction_progress >= 0.85 and not self._context_pressure_warned:
|
||||||
|
self._context_pressure_warned = True
|
||||||
|
```
|
||||||
|
|
||||||
|
**Better:**
|
||||||
|
```python
|
||||||
|
# Progressive warnings at 60%, 75%, 85%, 95%
|
||||||
|
warning_levels = [(0.60, "info"), (0.75, "notice"),
|
||||||
|
(0.85, "warning"), (0.95, "critical")]
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🟡 MEDIUM: Summary Validation Missing
|
||||||
|
|
||||||
|
**Issue:** No verification that the generated summary actually contains the critical information from the compressed turns.
|
||||||
|
|
||||||
|
**Location:** `context_compressor.py` `_generate_summary()` (lines 250-369)
|
||||||
|
|
||||||
|
**Risk:** If the summarization model fails or produces low-quality output, critical information is silently lost.
|
||||||
|
|
||||||
|
**Fix Needed:** Add summary quality checks:
|
||||||
|
```python
|
||||||
|
def _validate_summary(self, summary: str, turns: list) -> bool:
|
||||||
|
"""Verify summary captures critical information."""
|
||||||
|
# Check for key file paths mentioned in turns
|
||||||
|
# Check for error messages that were present
|
||||||
|
# Check for specific values/IDs
|
||||||
|
# Return False if validation fails, trigger fallback
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🟡 MEDIUM: No Semantic Deduplication
|
||||||
|
|
||||||
|
**Issue:** Same information may be repeated across the original turns and the previous summary, leading to bloated input to the summarizer.
|
||||||
|
|
||||||
|
**Location:** `_generate_summary()` iterative update path (lines 264-304)
|
||||||
|
|
||||||
|
**Example:** If the previous summary already mentions "file X was modified", and new turns also mention it, the information appears twice in the summarizer input.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🟢 LOW: Tool Result Placeholder Not Actionable
|
||||||
|
|
||||||
|
**Issue:** The placeholder `[Old tool output cleared to save context space]` tells neither the model nor a human reviewing the transcript anything about what was removed.
|
||||||
|
|
||||||
|
**Location:** Line 45
|
||||||
|
|
||||||
|
**Better:**
|
||||||
|
```python
|
||||||
|
# Include tool name and truncated preview
|
||||||
|
_PRUNED_TOOL_PLACEHOLDER_TEMPLATE = (
|
||||||
|
"[Tool output for {tool_name} cleared. "
|
||||||
|
"Preview: {preview}... ({original_chars} chars removed)]"
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 🟢 LOW: Compression Metrics Not Tracked
|
||||||
|
|
||||||
|
**Issue:** No tracking of compression ratio, frequency, or information density over time.
|
||||||
|
|
||||||
|
**Useful metrics to track:**
|
||||||
|
- Tokens saved per compression
|
||||||
|
- Compression ratio (input tokens / output tokens)
|
||||||
|
- Frequency of compression (compressions per 100 turns)
|
||||||
|
- Average summary length
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Specific Code Improvements
|
||||||
|
|
||||||
|
### 1. Add Fallback Context Length Detection
|
||||||
|
|
||||||
|
**File:** `run_agent.py` (~line 1191)
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Before initializing compressor, collect all context lengths
|
||||||
|
def _get_fallback_context_lengths(self, _agent_cfg: dict) -> list:
|
||||||
|
"""Get context lengths for all models in fallback chain."""
|
||||||
|
lengths = []
|
||||||
|
|
||||||
|
# Primary model
|
||||||
|
lengths.append(get_model_context_length(
|
||||||
|
self.model, base_url=self.base_url,
|
||||||
|
api_key=self.api_key, provider=self.provider
|
||||||
|
))
|
||||||
|
|
||||||
|
# Fallback models from config
|
||||||
|
fallback_providers = _agent_cfg.get("fallback_providers", [])
|
||||||
|
for fb in fallback_providers:
|
||||||
|
if isinstance(fb, dict):
|
||||||
|
fb_model = fb.get("model", "")
|
||||||
|
fb_base = fb.get("base_url", "")
|
||||||
|
fb_provider = fb.get("provider", "")
|
||||||
|
fb_key_env = fb.get("api_key_env", "")
|
||||||
|
fb_key = os.getenv(fb_key_env, "")
|
||||||
|
if fb_model:
|
||||||
|
lengths.append(get_model_context_length(
|
||||||
|
fb_model, base_url=fb_base,
|
||||||
|
api_key=fb_key, provider=fb_provider
|
||||||
|
))
|
||||||
|
|
||||||
|
return [l for l in lengths if l and l > 0]
|
||||||
|
|
||||||
|
# Use minimum context length for conservative compression
|
||||||
|
_fallback_contexts = self._get_fallback_context_lengths(_agent_cfg)
|
||||||
|
_effective_context = min(_fallback_contexts) if _fallback_contexts else None
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Add Pre-Compression Checkpoint
|
||||||
|
|
||||||
|
**File:** `run_agent.py` `_compress_context()` method
|
||||||
|
|
||||||
|
See patch file for implementation.
|
||||||
|
|
||||||
|
### 3. Add Summary Validation
|
||||||
|
|
||||||
|
**File:** `context_compressor.py`
|
||||||
|
|
||||||
|
```python
|
||||||
|
def _extract_critical_refs(self, turns: List[Dict]) -> Set[str]:
|
||||||
|
"""Extract critical references that must appear in summary."""
|
||||||
|
critical = set()
|
||||||
|
for msg in turns:
|
||||||
|
content = msg.get("content", "") or ""
|
||||||
|
# File paths
|
||||||
|
for match in re.finditer(r'[\w\-./]+\.(py|js|ts|json|yaml|md)\b', content):
|
||||||
|
critical.add(match.group(0))
|
||||||
|
# Error messages
|
||||||
|
if "error" in content.lower() or "exception" in content.lower():
|
||||||
|
lines = content.split('\n')
|
||||||
|
for line in lines:
|
||||||
|
if any(k in line.lower() for k in ["error", "exception", "traceback"]):
|
||||||
|
critical.add(line[:100]) # First 100 chars of error line
|
||||||
|
return critical
|
||||||
|
|
||||||
|
def _validate_summary(self, summary: str, turns: List[Dict]) -> Tuple[bool, List[str]]:
|
||||||
|
"""Validate that summary captures critical information.
|
||||||
|
|
||||||
|
Returns (is_valid, missing_items).
|
||||||
|
"""
|
||||||
|
if not summary or len(summary) < 100:
|
||||||
|
return False, ["summary too short"]
|
||||||
|
|
||||||
|
critical = self._extract_critical_refs(turns)
|
||||||
|
missing = [ref for ref in critical if ref not in summary]
|
||||||
|
|
||||||
|
# Allow some loss but not too much
|
||||||
|
if len(missing) > len(critical) * 0.5:
|
||||||
|
return False, missing[:5] # Return first 5 missing
|
||||||
|
|
||||||
|
return True, []
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Progressive Context Pressure Warnings
|
||||||
|
|
||||||
|
**File:** `run_agent.py` context pressure section (~line 7865)
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Replace single warning with progressive system
|
||||||
|
_CONTEXT_PRESSURE_LEVELS = [
|
||||||
|
(0.60, "ℹ️ Context usage at 60% — monitoring"),
|
||||||
|
(0.75, "📊 Context usage at 75% — consider wrapping up soon"),
|
||||||
|
(0.85, "⚠️ Context usage at 85% — compression imminent"),
|
||||||
|
(0.95, "🔴 Context usage at 95% — compression will trigger soon"),
|
||||||
|
]
|
||||||
|
|
||||||
|
# Track which levels have been reported
|
||||||
|
if not hasattr(self, '_context_pressure_reported'):
|
||||||
|
self._context_pressure_reported = set()
|
||||||
|
|
||||||
|
for threshold, message in _CONTEXT_PRESSURE_LEVELS:
|
||||||
|
if _compaction_progress >= threshold and threshold not in self._context_pressure_reported:
|
||||||
|
self._context_pressure_reported.add(threshold)
|
||||||
|
if self.status_callback:
|
||||||
|
self.status_callback("warning", message)
|
||||||
|
if not self.quiet_mode:
|
||||||
|
print(f"\n{message}\n")
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Interaction with Fallback Chain
|
||||||
|
|
||||||
|
### Current Behavior
|
||||||
|
|
||||||
|
The compressor is initialized once at agent startup with the primary model's context length:
|
||||||
|
|
||||||
|
```python
|
||||||
|
self.context_compressor = ContextCompressor(
|
||||||
|
model=self.model, # Primary model only
|
||||||
|
threshold_percent=compression_threshold, # Default 50%
|
||||||
|
# ...
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Problems
|
||||||
|
|
||||||
|
1. **No dynamic adjustment:** If fallback occurs to a smaller model, compression threshold is wrong
|
||||||
|
2. **No re-initialization on model switch:** `/model` command doesn't update compressor
|
||||||
|
3. **Context probe affects wrong model:** If primary probe fails, fallback models may have already been used
|
||||||
|
|
||||||
|
### Recommended Architecture
|
||||||
|
|
||||||
|
```python
|
||||||
|
class AIAgent:
|
||||||
|
def _update_compressor_for_model(self, model: str, base_url: str, provider: str):
|
||||||
|
"""Reconfigure compressor when model changes (fallback or /model command)."""
|
||||||
|
new_context = get_model_context_length(model, base_url=base_url, provider=provider)
|
||||||
|
if new_context != self.context_compressor.context_length:
|
||||||
|
self.context_compressor.context_length = new_context
|
||||||
|
self.context_compressor.threshold_tokens = int(
|
||||||
|
new_context * self.context_compressor.threshold_percent
|
||||||
|
)
|
||||||
|
logger.info(f"Compressor adjusted for {model}: {new_context:,} tokens")
|
||||||
|
|
||||||
|
def _handle_fallback(self, fallback_model: str, ...):
|
||||||
|
"""Update compressor when falling back to different model."""
|
||||||
|
self._update_compressor_for_model(fallback_model, ...)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Testing Gaps
|
||||||
|
|
||||||
|
1. **No fallback chain test:** Tests don't verify behavior when context limits differ
|
||||||
|
2. **No checkpoint integration test:** Pre-compression checkpoint not tested
|
||||||
|
3. **No summary validation test:** No test for detecting poor-quality summaries
|
||||||
|
4. **No progressive warning test:** Only tests the 85% threshold
|
||||||
|
5. **No tool result deduplication test:** Tests verify pairs are preserved but not deduplicated
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommendations Priority
|
||||||
|
|
||||||
|
| Priority | Item | Effort | Impact |
|
||||||
|
|----------|------|--------|--------|
|
||||||
|
| P0 | Pre-compression checkpoint | Medium | Critical |
|
||||||
|
| P0 | Fallback context awareness | Medium | High |
|
||||||
|
| P1 | Progressive warnings | Low | Medium |
|
||||||
|
| P1 | Summary validation | Medium | High |
|
||||||
|
| P2 | Semantic deduplication | High | Medium |
|
||||||
|
| P2 | Better pruning placeholders | Low | Low |
|
||||||
|
| P3 | Compression metrics | Low | Low |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
The context compressor is a **solid, production-ready implementation** with sophisticated handling of the core compression problem. The structured summary format and iterative update mechanism are particularly well-designed.
|
||||||
|
|
||||||
|
The main gaps are in **edge-case hardening**:
|
||||||
|
1. Fallback chain awareness needs to be addressed for multi-model reliability
|
||||||
|
2. Pre-compression checkpoint is essential for information recovery
|
||||||
|
3. Summary validation would prevent silent information loss
|
||||||
|
|
||||||
|
These are incremental improvements to an already strong foundation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Review conducted by Timmy Agent*
|
||||||
|
*For Gitea issue timmy-home #92*
|
||||||
332
uniwizard/context_compressor.patch
Normal file
332
uniwizard/context_compressor.patch
Normal file
@@ -0,0 +1,332 @@
|
|||||||
|
From: Timmy Agent <timmy@uniwizard.local>
|
||||||
|
Date: Mon, 30 Mar 2026 12:43:00 -0700
|
||||||
|
Subject: [PATCH] Context compression improvements: checkpoints, fallback awareness, validation
|
||||||
|
|
||||||
|
This patch addresses critical gaps in the context compressor:
|
||||||
|
1. Pre-compression checkpoints for recovery
|
||||||
|
2. Progressive context pressure warnings
|
||||||
|
3. Summary validation to detect information loss
|
||||||
|
4. Better tool pruning placeholders
|
||||||
|
|
||||||
|
---
|
||||||
|
agent/context_compressor.py | 102 +++++++++++++++++++++++++++++++++++-
|
||||||
|
run_agent.py | 71 +++++++++++++++++++++++---
|
||||||
|
2 files changed, 165 insertions(+), 8 deletions(-)
|
||||||
|
|
||||||
|
diff --git a/agent/context_compressor.py b/agent/context_compressor.py
|
||||||
|
index abc123..def456 100644
|
||||||
|
--- a/agent/context_compressor.py
|
||||||
|
+++ b/agent/context_compressor.py
|
||||||
|
@@ -15,6 +15,7 @@ Improvements over v1:
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from typing import Any, Dict, List, Optional
|
||||||
|
+import re
|
||||||
|
|
||||||
|
from agent.auxiliary_client import call_llm
|
||||||
|
from agent.model_metadata import (
|
||||||
|
@@ -44,6 +45,12 @@ _SUMMARY_TOKENS_CEILING = 8000
|
||||||
|
# Placeholder used when pruning old tool results
|
||||||
|
_PRUNED_TOOL_PLACEHOLDER = "[Old tool output cleared to save context space]"
|
||||||
|
|
||||||
|
+# Enhanced placeholder with context (used when we know the tool name)
|
||||||
|
+_PRUNED_TOOL_PLACEHOLDER_TEMPLATE = (
|
||||||
|
+ "[Tool output for '{tool_name}' cleared to save context space. "
|
||||||
|
+ "Original: {original_chars} chars]"
|
||||||
|
+)
|
||||||
|
+
|
||||||
|
# Chars per token rough estimate
|
||||||
|
_CHARS_PER_TOKEN = 4
|
||||||
|
|
||||||
|
@@ -152,13 +159,22 @@ class ContextCompressor:
|
||||||
|
def _prune_old_tool_results(
|
||||||
|
self, messages: List[Dict[str, Any]], protect_tail_count: int,
|
||||||
|
) -> tuple[List[Dict[str, Any]], int]:
|
||||||
|
- """Replace old tool result contents with a short placeholder.
|
||||||
|
+ """Replace old tool result contents with an informative placeholder.
|
||||||
|
|
||||||
|
Walks backward from the end, protecting the most recent
|
||||||
|
- ``protect_tail_count`` messages. Older tool results get their
|
||||||
|
- content replaced with a placeholder string.
|
||||||
|
+ ``protect_tail_count`` messages. Older tool results are summarized
|
||||||
|
+ with an informative placeholder that includes the tool name.
|
||||||
|
|
||||||
|
Returns (pruned_messages, pruned_count).
|
||||||
|
+
|
||||||
|
+ Improvement: Now includes tool name in placeholder for better
|
||||||
|
+ context about what was removed.
|
||||||
|
"""
|
||||||
|
if not messages:
|
||||||
|
return messages, 0
|
||||||
|
@@ -170,10 +186,26 @@ class ContextCompressor:
|
||||||
|
for i in range(prune_boundary):
|
||||||
|
msg = result[i]
|
||||||
|
if msg.get("role") != "tool":
|
||||||
|
continue
|
||||||
|
content = msg.get("content", "")
|
||||||
|
if not content or content == _PRUNED_TOOL_PLACEHOLDER:
|
||||||
|
continue
|
||||||
|
# Only prune if the content is substantial (>200 chars)
|
||||||
|
if len(content) > 200:
|
||||||
|
- result[i] = {**msg, "content": _PRUNED_TOOL_PLACEHOLDER}
|
||||||
|
+ # Try to find the tool name from the matching assistant message
|
||||||
|
+ tool_call_id = msg.get("tool_call_id", "")
|
||||||
|
+ tool_name = "unknown"
|
||||||
|
+ for m in messages:
|
||||||
|
+ if m.get("role") == "assistant" and m.get("tool_calls"):
|
||||||
|
+ for tc in m.get("tool_calls", []):
|
||||||
|
+ tc_id = tc.get("id", "") if isinstance(tc, dict) else getattr(tc, "id", "")
|
||||||
|
+ if tc_id == tool_call_id:
|
||||||
|
+ fn = tc.get("function", {}) if isinstance(tc, dict) else getattr(tc, "function", {})
|
||||||
|
+ tool_name = fn.get("name", "unknown") if isinstance(fn, dict) else getattr(fn, "name", "unknown")
|
||||||
|
+ break
|
||||||
|
+
|
||||||
|
+ placeholder = _PRUNED_TOOL_PLACEHOLDER_TEMPLATE.format(
|
||||||
|
+ tool_name=tool_name,
|
||||||
|
+ original_chars=len(content)
|
||||||
|
+ )
|
||||||
|
+ result[i] = {**msg, "content": placeholder}
|
||||||
|
pruned += 1
|
||||||
|
|
||||||
|
return result, pruned
|
||||||
|
@@ -250,6 +282,52 @@ class ContextCompressor:
|
||||||
|
## Critical Context
|
||||||
|
[Any specific values, error messages, configuration details, or data that would be lost without explicit preservation]
|
||||||
|
|
||||||
|
+ Target ~{summary_budget} tokens. Be specific — include file paths, command outputs, error messages, and concrete values rather than vague descriptions.
|
||||||
|
+
|
||||||
|
+ Write only the summary body. Do not include any preamble or prefix."""
|
||||||
|
+
|
||||||
|
+ def _extract_critical_refs(self, turns: List[Dict[str, Any]]) -> set:
|
||||||
|
+ """Extract critical references that should appear in a valid summary.
|
||||||
|
+
|
||||||
|
+ Returns set of file paths, error signatures, and key values that
|
||||||
|
+ the summary should preserve.
|
||||||
|
+ """
|
||||||
|
+ critical = set()
|
||||||
|
+ for msg in turns:
|
||||||
|
+ content = msg.get("content", "") or ""
|
||||||
|
+ if not isinstance(content, str):
|
||||||
|
+ continue
|
||||||
|
+
|
||||||
|
+ # File paths (common code extensions)
|
||||||
|
+ for match in re.finditer(r'[\w\-./]+\.(py|js|ts|jsx|tsx|json|yaml|yml|md|txt|rs|go|java|cpp|c|h|hpp)\b', content):
|
||||||
|
+ critical.add(match.group(0))
|
||||||
|
+
|
||||||
|
+ # Error patterns
|
||||||
|
+ lines = content.split('\n')
|
||||||
|
+ for line in lines:
|
||||||
|
+ line_lower = line.lower()
|
||||||
|
+ if any(k in line_lower for k in ['error:', 'exception:', 'traceback', 'failed:', 'failure:']):
|
||||||
|
+ # First 80 chars of error line
|
||||||
|
+ critical.add(line[:80].strip())
|
||||||
|
+
|
||||||
|
+ # URLs
|
||||||
|
+ for match in re.finditer(r'https?://[^\s<>"\']+', content):
|
||||||
|
+ critical.add(match.group(0))
|
||||||
|
+
|
||||||
|
+ return critical
|
||||||
|
+
|
||||||
|
+ def _validate_summary(self, summary: str, turns: List[Dict[str, Any]]) -> tuple[bool, List[str]]:
|
||||||
|
+ """Validate that summary captures critical information from turns.
|
||||||
|
+
|
||||||
|
+ Returns (is_valid, missing_critical_items).
|
||||||
|
+ """
|
||||||
|
+ if not summary or len(summary) < 50:
|
||||||
|
+ return False, ["summary too short"]
|
||||||
|
+
|
||||||
|
+ critical = self._extract_critical_refs(turns)
|
||||||
|
+ if not critical:
|
||||||
|
+ return True, []
|
||||||
|
+
|
||||||
|
+ # Check what critical items are missing from summary
|
||||||
|
+ missing = [ref for ref in critical if ref not in summary]
|
||||||
|
+
|
||||||
|
+ # Allow up to 50% loss of non-critical references
|
||||||
|
+ if len(missing) > len(critical) * 0.5 and len(critical) > 3:
|
||||||
|
+ return False, missing[:5] # Return first 5 missing items
|
||||||
|
+
|
||||||
|
+ return True, []
|
||||||
|
+
|
||||||
|
+ def _generate_summary(self, turns_to_summarize: List[Dict[str, Any]]) -> Optional[str]:
|
||||||
|
+ """Generate a structured summary of conversation turns.
|
||||||
|
+
|
||||||
|
+ NEW: Added validation step to detect low-quality summaries.
|
||||||
|
+ Falls back to extended summarization if validation fails.
|
||||||
|
+ """
|
||||||
|
+ summary_budget = self._compute_summary_budget(turns_to_summarize)
|
||||||
|
+ content_to_summarize = self._serialize_for_summary(turns_to_summarize)
|
||||||
|
+
|
||||||
|
if self._previous_summary:
|
||||||
|
# Iterative update: preserve existing info, add new progress
|
||||||
|
prompt = f"""You are updating a context compaction summary...
|
||||||
|
@@ -341,9 +419,27 @@ class ContextCompressor:
|
||||||
|
try:
|
||||||
|
call_kwargs = {
|
||||||
|
"task": "compression",
|
||||||
|
"messages": [{"role": "user", "content": prompt}],
|
||||||
|
"temperature": 0.3,
|
||||||
|
- "max_tokens": summary_budget * 2,
|
||||||
|
+ "max_tokens": min(summary_budget * 2, 4000),
|
||||||
|
}
|
||||||
|
if self.summary_model:
|
||||||
|
call_kwargs["model"] = self.summary_model
|
||||||
|
response = call_llm(**call_kwargs)
|
||||||
|
content = response.choices[0].message.content
|
||||||
|
# Handle cases where content is not a string (e.g., dict from llama.cpp)
|
||||||
|
if not isinstance(content, str):
|
||||||
|
content = str(content) if content else ""
|
||||||
|
summary = content.strip()
|
||||||
|
+
|
||||||
|
+ # NEW: Validate the generated summary
|
||||||
|
+ is_valid, missing = self._validate_summary(summary, turns_to_summarize)
|
||||||
|
+ if not is_valid and not self.quiet_mode:
|
||||||
|
+ logger.warning(
|
||||||
|
+ "Summary validation detected potential information loss. "
|
||||||
|
+ "Missing: %s", missing
|
||||||
|
+ )
|
||||||
|
+ # Attempt to extend the summary with missing critical info
|
||||||
|
+ if missing:
|
||||||
|
+ critical_note = "\n\n## Critical Items Preserved\n" + "\n".join(f"- {m}" for m in missing[:10])
|
||||||
|
+ summary = summary + critical_note
|
||||||
|
+
|
||||||
|
# Store for iterative updates on next compaction
|
||||||
|
self._previous_summary = summary
|
||||||
|
return self._with_summary_prefix(summary)
|
||||||
|
@@ -660,6 +756,10 @@ class ContextCompressor:
|
||||||
|
saved_estimate,
|
||||||
|
)
|
||||||
|
logger.info("Compression #%d complete", self.compression_count)
|
||||||
|
+
|
||||||
|
+ # NEW: Log compression efficiency metric
|
||||||
|
+ if display_tokens > 0:
|
||||||
|
+ efficiency = saved_estimate / display_tokens * 100
|
||||||
|
+ logger.info("Compression efficiency: %.1f%% tokens saved", efficiency)
|
||||||
|
|
||||||
|
return compressed
|
||||||
|
|
||||||
|
diff --git a/run_agent.py b/run_agent.py
|
||||||
|
index abc123..def456 100644
|
||||||
|
--- a/run_agent.py
|
||||||
|
+++ b/run_agent.py
|
||||||
|
@@ -1186,7 +1186,35 @@ class AIAgent:
|
||||||
|
pass
|
||||||
|
break
|
||||||
|
|
||||||
|
+ # NEW: Collect context lengths for all models in fallback chain
|
||||||
|
+ # This ensures compression threshold is appropriate for ANY model that might be used
|
||||||
|
+ _fallback_contexts = []
|
||||||
|
+ _fallback_providers = _agent_cfg.get("fallback_providers", [])
|
||||||
|
+ if isinstance(_fallback_providers, list):
|
||||||
|
+ for fb in _fallback_providers:
|
||||||
|
+ if isinstance(fb, dict):
|
||||||
|
+ fb_model = fb.get("model", "")
|
||||||
|
+ fb_base = fb.get("base_url", "")
|
||||||
|
+ fb_provider = fb.get("provider", "")
|
||||||
|
+ fb_key_env = fb.get("api_key_env", "")
|
||||||
|
+ fb_key = os.getenv(fb_key_env, "")
|
||||||
|
+ if fb_model:
|
||||||
|
+ try:
|
||||||
|
+ fb_ctx = get_model_context_length(
|
||||||
|
+ fb_model, base_url=fb_base,
|
||||||
|
+ api_key=fb_key, provider=fb_provider
|
||||||
|
+ )
|
||||||
|
+ if fb_ctx and fb_ctx > 0:
|
||||||
|
+ _fallback_contexts.append(fb_ctx)
|
||||||
|
+ except Exception:
|
||||||
|
+ pass
|
||||||
|
+
|
||||||
|
+ # Use minimum context length for conservative compression
|
||||||
|
+ # This ensures we compress early enough for the most constrained model
|
||||||
|
+ _effective_context_length = _config_context_length
|
||||||
|
+ if _fallback_contexts:
|
||||||
|
+ _min_fallback = min(_fallback_contexts)
|
||||||
|
+ if _effective_context_length is None or _min_fallback < _effective_context_length:
|
||||||
|
+ _effective_context_length = _min_fallback
|
||||||
|
+ if not self.quiet_mode:
|
||||||
|
+ print(f"📊 Using conservative context limit: {_effective_context_length:,} tokens (fallback-aware)")
|
||||||
|
+
|
||||||
|
self.context_compressor = ContextCompressor(
|
||||||
|
model=self.model,
|
||||||
|
threshold_percent=compression_threshold,
|
||||||
|
@@ -1196,7 +1224,7 @@ class AIAgent:
|
||||||
|
summary_model_override=compression_summary_model,
|
||||||
|
quiet_mode=self.quiet_mode,
|
||||||
|
base_url=self.base_url,
|
||||||
|
api_key=self.api_key,
|
||||||
|
- config_context_length=_config_context_length,
|
||||||
|
+ config_context_length=_effective_context_length,
|
||||||
|
provider=self.provider,
|
||||||
|
)
|
||||||
|
self.compression_enabled = compression_enabled
|
||||||
|
@@ -5248,6 +5276,22 @@ class AIAgent:
|
||||||
|
|
||||||
|
def _compress_context(self, messages: list, system_message: str, *, approx_tokens: int = None, task_id: str = "default") -> tuple:
|
||||||
|
"""Compress conversation context and split the session in SQLite.
|
||||||
|
+
|
||||||
|
+ NEW: Creates a checkpoint before compression for recovery.
|
||||||
|
+ This allows rewinding if the summary loses critical information.
|
||||||
|
+
|
||||||
|
+ Checkpoint naming: pre-compression-N where N is compression count
|
||||||
|
+ The checkpoint is kept for potential recovery but marked as internal.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
(compressed_messages, new_system_prompt) tuple
|
||||||
|
"""
|
||||||
|
+ # NEW: Create checkpoint BEFORE compression
|
||||||
|
+ if self._checkpoint_mgr and hasattr(self._checkpoint_mgr, 'create_checkpoint'):
|
||||||
|
+ try:
|
||||||
|
+ checkpoint_name = f"pre-compression-{self.context_compressor.compression_count}"
|
||||||
|
+ self._checkpoint_mgr.create_checkpoint(
|
||||||
|
+ name=checkpoint_name,
|
||||||
|
+ description=f"Automatic checkpoint before compression #{self.context_compressor.compression_count}"
|
||||||
|
+ )
|
||||||
|
+ if not self.quiet_mode:
|
||||||
|
+ logger.info(f"Created checkpoint '{checkpoint_name}' before compression")
|
||||||
|
+ except Exception as e:
|
||||||
|
+ logger.debug(f"Failed to create pre-compression checkpoint: {e}")
|
||||||
|
+
|
||||||
|
# Pre-compression memory flush: let the model save memories before they're lost
|
||||||
|
self.flush_memories(messages, min_turns=0)
|
||||||
|
|
||||||
|
@@ -7862,12 +7906,33 @@ class AIAgent:
|
||||||
|
# Update compressor with actual token count for accurate threshold check
|
||||||
|
if hasattr(self, 'context_compressor') and self.context_compressor:
|
||||||
|
self.context_compressor.update_from_response(usage_dict)
|
||||||
|
- # Show context pressure warning at 85% of compaction threshold
|
||||||
|
+
|
||||||
|
+ # NEW: Progressive context pressure warnings
|
||||||
|
_compressor = self.context_compressor
|
||||||
|
if _compressor.threshold_tokens > 0:
|
||||||
|
_compaction_progress = _real_tokens / _compressor.threshold_tokens
|
||||||
|
- if _compaction_progress >= 0.85 and not self._context_pressure_warned:
|
||||||
|
- self._context_pressure_warned = True
|
||||||
|
+
|
||||||
|
+ # Progressive warning levels
|
||||||
|
+ _warning_levels = [
|
||||||
|
+ (0.60, "info", "ℹ️ Context usage at 60%"),
|
||||||
|
+ (0.75, "notice", "📊 Context usage at 75% — consider wrapping up"),
|
||||||
|
+ (0.85, "warning", "⚠️ Context usage at 85% — compression imminent"),
|
||||||
|
+ (0.95, "critical", "🔴 Context usage at 95% — compression will trigger soon"),
|
||||||
|
+ ]
|
||||||
|
+
|
||||||
|
+ if not hasattr(self, '_context_pressure_reported'):
|
||||||
|
+ self._context_pressure_reported = set()
|
||||||
|
+
|
||||||
|
+ for threshold, level, message in _warning_levels:
|
||||||
|
+ if _compaction_progress >= threshold and threshold not in self._context_pressure_reported:
|
||||||
|
+ self._context_pressure_reported.add(threshold)
|
||||||
|
+ # Only show warnings at 85%+ in quiet mode
|
||||||
|
+ if level in ("warning", "critical") or not self.quiet_mode:
|
||||||
|
+ if self.status_callback:
|
||||||
|
+ self.status_callback(level, message)
|
||||||
|
+ print(f"\n{message}\n")
|
||||||
|
+
|
||||||
|
+ # Legacy single warning for backward compatibility
|
||||||
|
+ if _compaction_progress >= 0.85 and not getattr(self, '_context_pressure_warned', False):
|
||||||
|
+ self._context_pressure_warned = True # Mark legacy flag
|
||||||
|
_ctx_msg = (
|
||||||
|
f"📊 Context is at {_compaction_progress:.0%} of compression threshold "
|
||||||
|
f"({_real_tokens:,} / {_compressor.threshold_tokens:,} tokens). "
|
||||||
|
--
|
||||||
|
2.40.0
|
||||||
174
uniwizard/job_profiles.yaml
Normal file
174
uniwizard/job_profiles.yaml
Normal file
@@ -0,0 +1,174 @@
|
|||||||
|
# Job-Specific Toolset Profiles for Local Timmy Automation
|
||||||
|
# Location: ~/.timmy/uniwizard/job_profiles.yaml
|
||||||
|
#
|
||||||
|
# Purpose: Narrow the tool surface per job type to prevent context thrashing
|
||||||
|
# and reduce token usage in local Hermes sessions.
|
||||||
|
#
|
||||||
|
# Usage in cron jobs:
|
||||||
|
# agent = AIAgent(
|
||||||
|
# enabled_toolsets=JOB_PROFILES["code-work"]["toolsets"],
|
||||||
|
# disabled_toolsets=JOB_PROFILES["code-work"].get("disabled_toolsets", []),
|
||||||
|
# ...
|
||||||
|
# )
|
||||||
|
#
|
||||||
|
# Token savings are calculated against full toolset (~9,261 tokens, 40 tools)
|
||||||
|
|
||||||
|
profiles:
|
||||||
|
# ============================================================================
|
||||||
|
# CODE-WORK: Software development tasks
|
||||||
|
# ============================================================================
|
||||||
|
code-work:
|
||||||
|
description: "Terminal-based coding with file operations and git"
|
||||||
|
use_case: "Code reviews, refactoring, debugging, git operations, builds"
|
||||||
|
toolsets:
|
||||||
|
- terminal # shell commands, git, builds
|
||||||
|
- file # read, write, patch, search
|
||||||
|
tools_enabled: 6
|
||||||
|
token_estimate: "~2,194 tokens"
|
||||||
|
token_savings: "~76% reduction vs full toolset"
|
||||||
|
notes: |
|
||||||
|
Git operations run via terminal. Includes patch for targeted edits.
|
||||||
|
No web access - assumes code and docs are local or git-managed.
|
||||||
|
Process management included for background builds/tasks.
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# RESEARCH: Information gathering and analysis
|
||||||
|
# ============================================================================
|
||||||
|
research:
|
||||||
|
description: "Web research with browser automation and file persistence"
|
||||||
|
use_case: "Documentation lookup, API research, competitive analysis, fact-checking"
|
||||||
|
toolsets:
|
||||||
|
- web # web_search, web_extract
|
||||||
|
- browser # full browser automation
|
||||||
|
- file # save findings, read local files
|
||||||
|
tools_enabled: 15
|
||||||
|
token_estimate: "~2,518 tokens"
|
||||||
|
token_savings: "~73% reduction vs full toolset"
|
||||||
|
notes: |
|
||||||
|
Browser + web search combination allows deep research workflows.
|
||||||
|
File tools for saving research artifacts and reading local sources.
|
||||||
|
No terminal to prevent accidental local changes during research.
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# TRIAGE: Read-only issue and status checking
|
||||||
|
# ============================================================================
|
||||||
|
triage:
|
||||||
|
description: "Read-only operations for status checks and issue triage"
|
||||||
|
use_case: "Gitea issue monitoring, CI status checks, log analysis, health checks"
|
||||||
|
toolsets:
|
||||||
|
- terminal # curl for API calls, status checks
|
||||||
|
- file # read local files, logs
|
||||||
|
disabled_toolsets:
|
||||||
|
# Note: file toolset includes write/patch - triage jobs should
|
||||||
|
# be instructed via prompt to only use read_file and search_files
|
||||||
|
# For truly read-only, use disabled_tools=['write_file', 'patch']
|
||||||
|
# (requires AIAgent support or custom toolset)
|
||||||
|
tools_enabled: 6
|
||||||
|
token_estimate: "~2,194 tokens"
|
||||||
|
token_savings: "~76% reduction vs full toolset"
|
||||||
|
read_only_hint: |
|
||||||
|
IMPORTANT: Triage jobs should only READ. Do not modify files.
|
||||||
|
Use read_file and search_files only. Do NOT use write_file or patch.
|
||||||
|
notes: |
|
||||||
|
Gitea API accessed via curl in terminal (token in env).
|
||||||
|
File reading for log analysis, config inspection.
|
||||||
|
Prompt must include read_only_hint to prevent modifications.
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# CREATIVE: Content creation and editing
|
||||||
|
# ============================================================================
|
||||||
|
creative:
|
||||||
|
description: "Content creation with web lookup for reference"
|
||||||
|
use_case: "Writing, editing, content generation, documentation"
|
||||||
|
toolsets:
|
||||||
|
- file # read/write content
|
||||||
|
- web # research, fact-checking, references
|
||||||
|
tools_enabled: 4
|
||||||
|
token_estimate: "~1,185 tokens"
|
||||||
|
token_savings: "~87% reduction vs full toolset"
|
||||||
|
notes: |
|
||||||
|
Minimal toolset for focused writing tasks.
|
||||||
|
Web for looking up references, quotes, facts.
|
||||||
|
No terminal to prevent accidental system changes.
|
||||||
|
No browser to keep context minimal.
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# OPS: System operations and maintenance
|
||||||
|
# ============================================================================
|
||||||
|
ops:
|
||||||
|
description: "System operations, process management, deployment"
|
||||||
|
use_case: "Server maintenance, process monitoring, log rotation, backups"
|
||||||
|
toolsets:
|
||||||
|
- terminal # shell commands, ssh, docker
|
||||||
|
- process # background process management
|
||||||
|
- file # config files, log files
|
||||||
|
tools_enabled: 6
|
||||||
|
token_estimate: "~2,194 tokens"
|
||||||
|
token_savings: "~76% reduction vs full toolset"
|
||||||
|
notes: |
|
||||||
|
Process management explicitly included for service control.
|
||||||
|
Terminal for docker, systemctl, deployment commands.
|
||||||
|
File tools for config editing and log inspection.
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# MINIMAL: Absolute minimum for simple tasks
|
||||||
|
# ============================================================================
|
||||||
|
minimal:
|
||||||
|
description: "Absolute minimum toolset for single-purpose tasks"
|
||||||
|
use_case: "Simple file reading, status reports with no external access"
|
||||||
|
toolsets:
|
||||||
|
- file
|
||||||
|
tools_enabled: 4
|
||||||
|
token_estimate: "~800 tokens"
|
||||||
|
token_savings: "~91% reduction vs full toolset"
|
||||||
|
notes: |
|
||||||
|
For tasks that only need to read/write local files.
|
||||||
|
No network access, no terminal, no browser.
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# WIRING INSTRUCTIONS FOR CRON DISPATCH
|
||||||
|
# ============================================================================
|
||||||
|
#
|
||||||
|
# In your cron job runner or dispatcher, load the profile and pass to AIAgent:
|
||||||
|
#
|
||||||
|
# import yaml
|
||||||
|
#
|
||||||
|
# def load_job_profile(profile_name: str) -> dict:
|
||||||
|
# with open("~/.timmy/uniwizard/job_profiles.yaml") as f:
|
||||||
|
# profiles = yaml.safe_load(f)["profiles"]
|
||||||
|
# return profiles.get(profile_name, profiles["minimal"])
|
||||||
|
#
|
||||||
|
# # In job execution:
|
||||||
|
# profile = load_job_profile(job.get("tool_profile", "minimal"))
|
||||||
|
#
|
||||||
|
# agent = AIAgent(
|
||||||
|
# model=model,
|
||||||
|
# enabled_toolsets=profile["toolsets"],
|
||||||
|
# disabled_toolsets=profile.get("disabled_toolsets", []),
|
||||||
|
# quiet_mode=True,
|
||||||
|
# platform="cron",
|
||||||
|
# ...
|
||||||
|
# )
|
||||||
|
#
|
||||||
|
# Add to job definition in ~/.hermes/cron/jobs.yaml:
|
||||||
|
#
|
||||||
|
# - id: daily-backup-check
|
||||||
|
# name: "Check backup status"
|
||||||
|
# schedule: "0 9 * * *"
|
||||||
|
# tool_profile: ops # <-- NEW FIELD
|
||||||
|
# prompt: "Check backup logs and report status..."
|
||||||
|
#
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# COMPARISON: Toolset Size Reference
|
||||||
|
# ============================================================================
|
||||||
|
#
|
||||||
|
# Full toolset (all): 40 tools ~9,261 tokens
|
||||||
|
# code-work, ops, triage: 6 tools ~2,194 tokens (-76%)
|
||||||
|
# research: 15 tools ~2,518 tokens (-73%)
|
||||||
|
# creative: 4 tools ~1,185 tokens (-87%)
|
||||||
|
# minimal: 4 tools ~800 tokens (-91%)
|
||||||
|
#
|
||||||
|
# Token estimates based on JSON schema serialization (chars/4 approximation).
|
||||||
|
# Actual token counts vary by model tokenizer.
|
||||||
363
uniwizard/job_profiles_design.md
Normal file
363
uniwizard/job_profiles_design.md
Normal file
@@ -0,0 +1,363 @@
|
|||||||
|
# Job Profiles Design Document
|
||||||
|
## [ROUTING] Streamline local Timmy automation context per job
|
||||||
|
|
||||||
|
**Issue:** timmy-config #90
|
||||||
|
**Author:** Timmy (AI Agent)
|
||||||
|
**Date:** 2026-03-30
|
||||||
|
**Status:** Design Complete - Ready for Implementation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
Local Hermes sessions experience context thrashing when all 40 tools (~9,261 tokens of schema) are loaded for every job. This design introduces **job-specific toolset profiles** that narrow the tool surface based on task type, achieving **73-91% token reduction** and preventing the "loop or thrash" behavior observed in long-running automation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem Statement
|
||||||
|
|
||||||
|
When `toolsets: [all]` is enabled (current default in `~/.hermes/config.yaml`), every AIAgent instantiation loads:
|
||||||
|
|
||||||
|
- **40 tools** across 12+ toolsets
|
||||||
|
- **~9,261 tokens** of JSON schema
|
||||||
|
- Full browser automation (12 tools)
|
||||||
|
- Vision, image generation, TTS, MoA reasoning
|
||||||
|
- All MCP servers (Morrowind, etc.)
|
||||||
|
|
||||||
|
For a simple cron job checking Gitea issues, this is massive overkill. The LLM:
|
||||||
|
1. Sees too many options
|
||||||
|
2. Hallucinates tool calls that aren't needed
|
||||||
|
3. Gets confused about which tool to use
|
||||||
|
4. Loops trying different approaches
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Solution Overview
|
||||||
|
|
||||||
|
Leverage the existing `enabled_toolsets` parameter in `AIAgent.__init__()` to create **job profiles**—pre-defined toolset combinations optimized for specific automation types.
|
||||||
|
|
||||||
|
### Key Design Decisions
|
||||||
|
|
||||||
|
| Decision | Rationale |
|
||||||
|
|----------|-----------|
|
||||||
|
| Use YAML profiles, not code | Easy to extend without deployment |
|
||||||
|
| Map to existing toolsets | No changes needed to Hermes core |
|
||||||
|
| 5 base profiles | Covers 95% of automation needs |
|
||||||
|
| Token estimates in comments | Helps users understand trade-offs |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Profile Specifications
|
||||||
|
|
||||||
|
### 1. CODE-WORK Profile
|
||||||
|
**Purpose:** Software development, git operations, code review
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
toolsets: [terminal, file]
|
||||||
|
tools_enabled: 6
|
||||||
|
token_estimate: "~2,194 tokens"
|
||||||
|
token_savings: "~76%"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Included Tools:**
|
||||||
|
- `terminal`, `process` - git, builds, shell commands
|
||||||
|
- `read_file`, `search_files`, `write_file`, `patch`
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- Automated code review
|
||||||
|
- Refactoring tasks
|
||||||
|
- Build and test automation
|
||||||
|
- Git branch management
|
||||||
|
|
||||||
|
**Not Included:**
|
||||||
|
- Web search (assumes local docs/code)
|
||||||
|
- Browser automation
|
||||||
|
- Vision/image generation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 2. RESEARCH Profile
|
||||||
|
**Purpose:** Information gathering, documentation lookup, analysis
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
toolsets: [web, browser, file]
|
||||||
|
tools_enabled: 15
|
||||||
|
token_estimate: "~2,518 tokens"
|
||||||
|
token_savings: "~73%"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Included Tools:**
|
||||||
|
- `web_search`, `web_extract` - quick lookups
|
||||||
|
- Full browser suite (12 tools) - deep research
|
||||||
|
- File tools - save findings, read local docs
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- API documentation research
|
||||||
|
- Competitive analysis
|
||||||
|
- Fact-checking reports
|
||||||
|
- Technical due diligence
|
||||||
|
|
||||||
|
**Not Included:**
|
||||||
|
- Terminal (prevents accidental local changes)
|
||||||
|
- Vision/image generation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 3. TRIAGE Profile
|
||||||
|
**Purpose:** Read-only status checking, issue monitoring, health checks
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
toolsets: [terminal, file]
|
||||||
|
tools_enabled: 6
|
||||||
|
token_estimate: "~2,194 tokens"
|
||||||
|
token_savings: "~76%"
|
||||||
|
read_only: true # enforced via prompt
|
||||||
|
```
|
||||||
|
|
||||||
|
**Included Tools:**
|
||||||
|
- `terminal` - curl for Gitea API, status commands
|
||||||
|
- `read_file`, `search_files` - log analysis, config inspection
|
||||||
|
|
||||||
|
**Critical Note on Write Safety:**
|
||||||
|
The `file` toolset includes `write_file` and `patch`. For truly read-only triage, the job prompt **MUST** include:
|
||||||
|
|
||||||
|
```
|
||||||
|
[SYSTEM: This is a READ-ONLY triage job. Only use read_file and search_files.
|
||||||
|
Do NOT use write_file, patch, or terminal commands that modify state.]
|
||||||
|
```
|
||||||
|
|
||||||
|
**Future Enhancement:**
|
||||||
|
Consider adding a `disabled_tools` parameter to AIAgent for granular control without creating new toolsets.
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- Gitea issue triage
|
||||||
|
- CI/CD status monitoring
|
||||||
|
- Log file analysis
|
||||||
|
- System health checks
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 4. CREATIVE Profile
|
||||||
|
**Purpose:** Content creation, writing, editing
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
toolsets: [file, web]
|
||||||
|
tools_enabled: 4
|
||||||
|
token_estimate: "~1,185 tokens"
|
||||||
|
token_savings: "~87%"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Included Tools:**
|
||||||
|
- `read_file`, `search_files`, `write_file`, `patch`
|
||||||
|
- `web_search`, `web_extract` - references, fact-checking
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- Documentation writing
|
||||||
|
- Content generation
|
||||||
|
- Editing and proofreading
|
||||||
|
- Newsletter/article composition
|
||||||
|
|
||||||
|
**Not Included:**
|
||||||
|
- Terminal (no system access needed)
|
||||||
|
- Browser (web_extract sufficient for text)
|
||||||
|
- Vision/image generation
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 5. OPS Profile
|
||||||
|
**Purpose:** System operations, maintenance, deployment
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
toolsets: [terminal, process, file]
|
||||||
|
tools_enabled: 6
|
||||||
|
token_estimate: "~2,194 tokens"
|
||||||
|
token_savings: "~76%"
|
||||||
|
```
|
||||||
|
|
||||||
|
**Included Tools:**
|
||||||
|
- `terminal`, `process` - service management, background tasks
|
||||||
|
- File tools - config editing, log inspection
|
||||||
|
|
||||||
|
**Use Cases:**
|
||||||
|
- Server maintenance
|
||||||
|
- Log rotation
|
||||||
|
- Service restart
|
||||||
|
- Deployment automation
|
||||||
|
- Docker container management
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## How Toolset Filtering Works
|
||||||
|
|
||||||
|
The Hermes harness already supports this via `AIAgent.__init__`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
...
|
||||||
|
enabled_toolsets: List[str] = None, # Only these toolsets
|
||||||
|
disabled_toolsets: List[str] = None, # Exclude these toolsets
|
||||||
|
...
|
||||||
|
):
|
||||||
|
```
|
||||||
|
|
||||||
|
The filtering happens in `model_tools.get_tool_definitions()`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def get_tool_definitions(
|
||||||
|
enabled_toolsets: List[str] = None,
|
||||||
|
disabled_toolsets: List[str] = None,
|
||||||
|
...
|
||||||
|
) -> List[Dict[str, Any]]:
|
||||||
|
# 1. Resolve toolsets to tool names via toolsets.resolve_toolset()
|
||||||
|
# 2. Filter by availability (check_fn for each tool)
|
||||||
|
# 3. Return OpenAI-format tool definitions
|
||||||
|
```
|
||||||
|
|
||||||
|
### Current Cron Usage (Line 423-443 in `cron/scheduler.py`):
|
||||||
|
|
||||||
|
```python
|
||||||
|
agent = AIAgent(
|
||||||
|
model=turn_route["model"],
|
||||||
|
...
|
||||||
|
disabled_toolsets=["cronjob", "messaging", "clarify"], # Hardcoded
|
||||||
|
quiet_mode=True,
|
||||||
|
platform="cron",
|
||||||
|
...
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Wiring into Cron Dispatch
|
||||||
|
|
||||||
|
### Step 1: Load Profile
|
||||||
|
|
||||||
|
```python
|
||||||
|
import yaml
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
def load_job_profile(profile_name: str) -> dict:
|
||||||
|
"""Load a job profile from ~/.timmy/uniwizard/job_profiles.yaml"""
|
||||||
|
profile_path = Path.home() / ".timmy/uniwizard/job_profiles.yaml"
|
||||||
|
with open(profile_path) as f:
|
||||||
|
config = yaml.safe_load(f)
|
||||||
|
profiles = config.get("profiles", {})
|
||||||
|
return profiles.get(profile_name, profiles.get("minimal", {"toolsets": ["file"]}))
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 2: Modify `run_job()` in `cron/scheduler.py`
|
||||||
|
|
||||||
|
```python
|
||||||
|
def run_job(job: dict) -> tuple[bool, str, str, Optional[str]]:
|
||||||
|
...
|
||||||
|
# Load job profile (default to minimal if not specified)
|
||||||
|
profile_name = job.get("tool_profile", "minimal")
|
||||||
|
profile = load_job_profile(profile_name)
|
||||||
|
|
||||||
|
# Build toolset filter
|
||||||
|
enabled_toolsets = profile.get("toolsets", ["file"])
|
||||||
|
disabled_toolsets = profile.get("disabled_toolsets", ["cronjob", "messaging", "clarify"])
|
||||||
|
|
||||||
|
agent = AIAgent(
|
||||||
|
model=turn_route["model"],
|
||||||
|
...
|
||||||
|
enabled_toolsets=enabled_toolsets, # NEW
|
||||||
|
disabled_toolsets=disabled_toolsets, # MODIFIED
|
||||||
|
quiet_mode=True,
|
||||||
|
platform="cron",
|
||||||
|
...
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 3: Update Job Definition Format
|
||||||
|
|
||||||
|
Add `tool_profile` field to `~/.hermes/cron/jobs.yaml`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
jobs:
|
||||||
|
- id: daily-issue-triage
|
||||||
|
name: "Triage Gitea Issues"
|
||||||
|
schedule: "0 9 * * *"
|
||||||
|
tool_profile: triage # <-- NEW
|
||||||
|
prompt: "Check timmy-config repo for new issues..."
|
||||||
|
deliver: telegram
|
||||||
|
|
||||||
|
- id: weekly-docs-review
|
||||||
|
name: "Review Documentation"
|
||||||
|
schedule: "0 10 * * 1"
|
||||||
|
tool_profile: creative # <-- NEW
|
||||||
|
prompt: "Review and improve README files..."
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Token Savings Summary
|
||||||
|
|
||||||
|
| Profile | Tools | Tokens | Savings |
|---------|-------|--------|---------|
| Full (`all`) | 40 | ~9,261 | 0% |
| code-work | 6 | ~2,194 | ~76% |
| research | 15 | ~2,518 | ~73% |
| triage | 6 | ~2,194 | ~76% |
| creative | 4 | ~1,185 | ~87% |
| ops | 6 | ~2,194 | ~76% |
| minimal | 4 | ~800 | ~91% |
|
||||||
|
|
||||||
|
**Benefits:**
|
||||||
|
1. Faster prompt processing (less context to scan)
|
||||||
|
2. Reduced API costs (fewer input tokens)
|
||||||
|
3. More focused tool selection (less confusion)
|
||||||
|
4. Faster tool calls (smaller schema to parse)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Migration Path
|
||||||
|
|
||||||
|
### Phase 1: Deploy Profiles (This PR)
|
||||||
|
- [x] Create `~/.timmy/uniwizard/job_profiles.yaml`
|
||||||
|
- [x] Create design document
|
||||||
|
- [ ] Post Gitea issue comment
|
||||||
|
|
||||||
|
### Phase 2: Cron Integration (Next PR)
|
||||||
|
- [ ] Modify `cron/scheduler.py` to load profiles
|
||||||
|
- [ ] Add `tool_profile` field to job schema
|
||||||
|
- [ ] Update existing jobs to use appropriate profiles
|
||||||
|
|
||||||
|
### Phase 3: CLI Integration (Future)
|
||||||
|
- [ ] Add `/profile` slash command to switch profiles
|
||||||
|
- [ ] Show active profile in CLI banner
|
||||||
|
- [ ] Profile-specific skills loading
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Files Changed
|
||||||
|
|
||||||
|
| File | Purpose |
|
||||||
|
|------|---------|
|
||||||
|
| `~/.timmy/uniwizard/job_profiles.yaml` | Profile definitions |
|
||||||
|
| `~/.timmy/uniwizard/job_profiles_design.md` | This design document |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Open Questions
|
||||||
|
|
||||||
|
1. **Should we add `disabled_tools` parameter to AIAgent?**
|
||||||
|
- Would enable true read-only triage without prompt hacks
|
||||||
|
- Requires changes to `model_tools.py` and `run_agent.py`
|
||||||
|
|
||||||
|
2. **Should profiles include model recommendations?**
|
||||||
|
- e.g., `recommended_model: claude-opus-4` for code-work
|
||||||
|
- Could help route simple jobs to cheaper models
|
||||||
|
|
||||||
|
3. **Should we support profile composition?**
|
||||||
|
- e.g., `profiles: [ops, web]` for ops jobs that need web lookup
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- Hermes toolset system: `~/.hermes/hermes-agent/toolsets.py`
|
||||||
|
- Tool filtering logic: `~/.hermes/hermes-agent/model_tools.py:get_tool_definitions()`
|
||||||
|
- Cron scheduler: `~/.hermes/hermes-agent/cron/scheduler.py:run_job()`
|
||||||
|
- AIAgent initialization: `~/.hermes/hermes-agent/run_agent.py:AIAgent.__init__()`
|
||||||
345
uniwizard/kimi-heartbeat.sh
Executable file
345
uniwizard/kimi-heartbeat.sh
Executable file
@@ -0,0 +1,345 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# kimi-heartbeat.sh — Polls Gitea for assigned-kimi issues, dispatches to KimiClaw via OpenClaw
|
||||||
|
# Zero LLM cost for polling — only calls kimi/kimi-code for actual work.
|
||||||
|
#
|
||||||
|
# Run manually: bash ~/.timmy/uniwizard/kimi-heartbeat.sh
|
||||||
|
# Runs via launchd every 2 minutes: ai.timmy.kimi-heartbeat.plist
|
||||||
|
#
|
||||||
|
# Workflow for humans:
|
||||||
|
# 1. Create or open a Gitea issue in any tracked repo
|
||||||
|
# 2. Add the "assigned-kimi" label
|
||||||
|
# 3. This script picks it up, dispatches to KimiClaw, posts results back
|
||||||
|
# 4. Label transitions: assigned-kimi → kimi-in-progress → kimi-done
|
||||||
|
#
|
||||||
|
# PLANNING: If the issue body is >500 chars or contains "##" headers,
|
||||||
|
# KimiClaw first runs a 2-minute planning pass to decompose the task.
|
||||||
|
# If it needs subtasks, it creates child issues and labels them assigned-kimi
|
||||||
|
# for the next heartbeat cycle. This prevents 10-minute timeouts on complex work.
|
||||||
|
|
||||||
|
set -euo pipefail

# --- Config ---
# API tokens, stripped of stray whitespace/newlines.
# (Redirection instead of `cat file | tr` — same result, one fewer process.)
TOKEN=$(tr -d '[:space:]' < "$HOME/.timmy/kimi_gitea_token")
TIMMY_TOKEN=$(tr -d '[:space:]' < "$HOME/.config/gitea/timmy-token")
# Prefer Tailscale (private network) over public IP
if curl -sf --connect-timeout 2 "http://100.126.61.75:3000/api/v1/version" > /dev/null 2>&1; then
    BASE="http://100.126.61.75:3000/api/v1"
else
    BASE="http://143.198.27.163:3000/api/v1"
fi
LOG="/tmp/kimi-heartbeat.log"
LOCKFILE="/tmp/kimi-heartbeat.lock"
MAX_DISPATCH=10               # max issues dispatched per heartbeat run
PLAN_TIMEOUT=120              # 2 minutes for planning pass
EXEC_TIMEOUT=480              # 8 minutes for execution pass
BODY_COMPLEXITY_THRESHOLD=500 # chars — above this triggers planning
STALE_PROGRESS_SECONDS=3600   # reclaim kimi-in-progress after 1 hour of silence

# Repos polled for assigned-kimi issues.
REPOS=(
    "Timmy_Foundation/timmy-home"
    "Timmy_Foundation/timmy-config"
    "Timmy_Foundation/the-nexus"
    "Timmy_Foundation/hermes-agent"
)
|
||||||
|
|
||||||
|
# --- Helpers ---
|
||||||
|
# Append a timestamped line to the run log, echoing it to stdout as well.
log() {
    printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" | tee -a "$LOG"
}
|
||||||
|
|
||||||
|
# True when the issue text looks like implementation work that should
# produce a PR/commit as proof rather than prose-only output.
# FIX: lowercase via tr instead of ${1,,} — the ,, expansion requires
# bash >= 4, and this script also runs under macOS's stock /bin/bash 3.2
# (it is scheduled by launchd), where ${1,,} is a fatal bad substitution.
needs_pr_proof() {
    local haystack
    haystack=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    [[ "$haystack" =~ implement|fix|refactor|feature|perf|performance|rebase|deploy|integration|module|script|pipeline|benchmark|cache|test|bug|build|port ]]
}
|
||||||
|
|
||||||
|
# True when the response references a PR, commit, or explicit proof marker.
# FIX: same bash-3.2 portability issue as needs_pr_proof — ${1,,} needs
# bash >= 4; use tr for the lowercase conversion instead.
has_pr_proof() {
    local haystack
    haystack=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    [[ "$haystack" == *"proof:"* || "$haystack" == *"pr:"* || "$haystack" == *"/pulls/"* || "$haystack" == *"commit:"* ]]
}
|
||||||
|
|
||||||
|
# Post a comment on a Gitea issue, JSON-encoding the body safely.
# Arguments: repo ("owner/name"), issue number, auth token, markdown body.
# Encoding is delegated to python3 so quotes/newlines in the body can never
# break the request payload. Failures are swallowed (best-effort comment).
post_issue_comment_json() {
    local repo="$1" issue_num="$2" token="$3" body="$4"
    local payload
    payload=$(python3 - "$body" <<'PY'
import json, sys
print(json.dumps({"body": sys.argv[1]}))
PY
)
    curl -sf -X POST \
        -H "Authorization: token $token" \
        -H "Content-Type: application/json" \
        -d "$payload" \
        "$BASE/repos/$repo/issues/$issue_num/comments" > /dev/null 2>&1 || true
}
|
||||||
|
|
||||||
|
# Prevent overlapping runs
# NOTE(review): the check-then-touch sequence below is not atomic — two
# heartbeats starting in the same instant could both pass the check. With
# the 2-minute scheduling interval this is unlikely; switch to a
# mkdir-based lock if it ever matters.
if [ -f "$LOCKFILE" ]; then
    # stat -f %m is the BSD/macOS form; the `|| echo 0` fallback makes an
    # unreadable lockfile look maximally stale so it gets cleaned up.
    lock_age=$(( $(date +%s) - $(stat -f %m "$LOCKFILE" 2>/dev/null || echo 0) ))
    if [ "$lock_age" -lt 600 ]; then
        # A run younger than 10 minutes is assumed to still be working.
        log "SKIP: previous run still active (lock age: ${lock_age}s)"
        exit 0
    else
        log "WARN: stale lock (${lock_age}s), removing"
        rm -f "$LOCKFILE"
    fi
fi
# Install the cleanup trap before creating the lock so it can't be leaked.
trap 'rm -f "$LOCKFILE"' EXIT
touch "$LOCKFILE"
|
||||||
|
|
||||||
|
# --- Main dispatch loop: poll each repo, filter issues, plan/execute ---
dispatched=0

for repo in "${REPOS[@]}"; do
    # Fetch open issues with assigned-kimi label
    response=$(curl -sf -H "Authorization: token $TIMMY_TOKEN" \
        "$BASE/repos/$repo/issues?state=open&labels=assigned-kimi&limit=20" 2>/dev/null || echo "[]")

    # Filter: skip done tasks, but reclaim stale kimi-in-progress work automatically.
    # Emits one pipe-delimited line per actionable issue:
    #   number|title|body_len|reclaim_flag|updated_at|body  (pipes stripped from fields)
    issues=$(echo "$response" | python3 -c "
import json, sys, datetime
STALE = int(${STALE_PROGRESS_SECONDS})

def parse_ts(value):
    if not value:
        return None
    try:
        return datetime.datetime.fromisoformat(value.replace('Z', '+00:00'))
    except Exception:
        return None

try:
    data = json.loads(sys.stdin.buffer.read())
except:
    sys.exit(0)

now = datetime.datetime.now(datetime.timezone.utc)
for i in data:
    labels = [l['name'] for l in i.get('labels', [])]
    if 'kimi-done' in labels:
        continue

    reclaim = False
    updated_at = i.get('updated_at', '') or ''
    if 'kimi-in-progress' in labels:
        ts = parse_ts(updated_at)
        age = (now - ts).total_seconds() if ts else (STALE + 1)
        if age < STALE:
            continue
        reclaim = True

    body = (i.get('body', '') or '')
    body_len = len(body)
    body_clean = body[:1500].replace('\n', ' ').replace('|', ' ')
    title = i['title'].replace('|', ' ')
    updated_clean = updated_at.replace('|', ' ')
    reclaim_flag = 'reclaim' if reclaim else 'fresh'
    print(f\"{i['number']}|{title}|{body_len}|{reclaim_flag}|{updated_clean}|{body_clean}\")
" 2>/dev/null)

    [ -z "$issues" ] && continue

    while IFS='|' read -r issue_num title body_len reclaim_flag updated_at body; do
        [ -z "$issue_num" ] && continue
        log "FOUND: $repo #$issue_num — $title (body: ${body_len} chars, mode: ${reclaim_flag}, updated: ${updated_at})"

        # --- Get label IDs for this repo ---
        label_json=$(curl -sf -H "Authorization: token $TIMMY_TOKEN" \
            "$BASE/repos/$repo/labels" 2>/dev/null || echo "[]")

        progress_id=$(echo "$label_json" | python3 -c "import json,sys; [print(l['id']) for l in json.load(sys.stdin) if l['name']=='kimi-in-progress']" 2>/dev/null)
        done_id=$(echo "$label_json" | python3 -c "import json,sys; [print(l['id']) for l in json.load(sys.stdin) if l['name']=='kimi-done']" 2>/dev/null)
        kimi_id=$(echo "$label_json" | python3 -c "import json,sys; [print(l['id']) for l in json.load(sys.stdin) if l['name']=='assigned-kimi']" 2>/dev/null)

        if [ "$reclaim_flag" = "reclaim" ]; then
            log "RECLAIM: $repo #$issue_num — stale kimi-in-progress since $updated_at"
            [ -n "$progress_id" ] && curl -sf -X DELETE -H "Authorization: token $TIMMY_TOKEN" \
                "$BASE/repos/$repo/issues/$issue_num/labels/$progress_id" > /dev/null 2>&1 || true
            curl -sf -X POST -H "Authorization: token $TOKEN" -H "Content-Type: application/json" \
                -d "{\"body\":\"🟡 **KimiClaw reclaiming stale task.**\\nPrevious kimi-in-progress state exceeded ${STALE_PROGRESS_SECONDS}s without resolution.\\nLast update: $updated_at\\nTimestamp: $(date -u '+%Y-%m-%dT%H:%M:%SZ')\"}" \
                "$BASE/repos/$repo/issues/$issue_num/comments" > /dev/null 2>&1 || true
        fi

        # --- Add kimi-in-progress label ---
        if [ -n "$progress_id" ]; then
            curl -sf -X POST -H "Authorization: token $TIMMY_TOKEN" -H "Content-Type: application/json" \
                -d "{\"labels\":[$progress_id]}" \
                "$BASE/repos/$repo/issues/$issue_num/labels" > /dev/null 2>&1 || true
        fi

        # --- Decide: plan first or execute directly ---
        needs_planning=false
        if [ "$body_len" -gt "$BODY_COMPLEXITY_THRESHOLD" ]; then
            needs_planning=true
        fi

        if [ "$needs_planning" = true ]; then
            # =============================================
            # PHASE 1: PLANNING PASS (2 min timeout)
            # =============================================
            log "PLAN: $repo #$issue_num — complex task, running planning pass"

            curl -sf -X POST -H "Authorization: token $TOKEN" -H "Content-Type: application/json" \
                -d "{\"body\":\"🟠 **KimiClaw picking up this task** via heartbeat.\\nBackend: kimi/kimi-code (Moonshot AI)\\nMode: **Planning first** (task is complex)\\nTimestamp: $(date -u '+%Y-%m-%dT%H:%M:%SZ')\"}" \
                "$BASE/repos/$repo/issues/$issue_num/comments" > /dev/null 2>&1 || true

            plan_prompt="You are KimiClaw, a planning agent. You have 2 MINUTES.\n\nTASK: Analyze this Gitea issue and decide if you can complete it in under 8 minutes, or if it needs to be broken into subtasks.\n\nISSUE #$issue_num in $repo: $title\n\nBODY:\n$body\n\nRULES:\n- If you CAN complete this in one pass (research, write analysis, answer a question): respond with EXECUTE followed by a one-line plan.\n- If the task is TOO BIG (needs git operations, multiple repos, >2000 words of output, or multi-step implementation): respond with DECOMPOSE followed by a numbered list of 2-5 smaller subtasks. Each subtask must be completable in under 8 minutes by itself.\n- Each subtask line format: SUBTASK: <title> | <one-line description>\n- Be realistic about what fits in 8 minutes with no terminal access.\n- You CANNOT clone repos, run git, or execute code. You CAN research, analyze, write specs, review code via API, and produce documents.\n\nRespond with ONLY your decision. No preamble."

            # FIX: the fallback must be valid JSON — the old '{\"status\":\"error\"}'
            # inside single quotes kept literal backslashes and could never parse.
            plan_result=$(openclaw agent --agent main --message "$plan_prompt" --timeout $PLAN_TIMEOUT --json 2>/dev/null || echo '{"status":"error"}')
            # FIX: extract the plan with a real multi-line program. The old
            # one-liner embedded literal '\n' characters into the python source
            # (bash does not translate \n inside double quotes), so it always
            # raised SyntaxError, plan_text came back empty, and every complex
            # task silently fell through to single-pass EXECUTE.
            plan_text=$(echo "$plan_result" | python3 -c "
import json, sys
d = json.load(sys.stdin)
payloads = d.get('result', {}).get('payloads', [])
print(payloads[0]['text'] if payloads else '')
" 2>/dev/null || echo "")

            if echo "$plan_text" | grep -qi "^DECOMPOSE"; then
                # --- Create subtask issues ---
                log "DECOMPOSE: $repo #$issue_num — creating subtasks"

                # Post the plan as a comment. FIX: use the JSON-encoding helper —
                # raw interpolation of plan_text into a JSON string broke on
                # quotes/newlines (the old escaped_plan variable was never used).
                post_issue_comment_json "$repo" "$issue_num" "$TOKEN" "📝 **Planning complete — decomposing into subtasks:**

$plan_text"

                # Extract SUBTASK lines and create child issues
                echo "$plan_text" | grep -i "^SUBTASK:" | head -5 | while IFS='|' read -r sub_title sub_desc; do
                    sub_title=$(echo "$sub_title" | sed 's/^SUBTASK: *//')
                    sub_desc=$(echo "${sub_desc:-$sub_title}" | sed 's/^ *//')

                    if [ -n "$sub_title" ]; then
                        sub_body="## Parent Issue\\nChild of #$issue_num: $title\\n\\n## Task\\n$sub_desc\\n\\n## Constraints\\n- Must complete in under 8 minutes\\n- No git/terminal operations\\n- Post results as analysis/documentation\\n\\n## Assignee\\n@KimiClaw"

                        # NOTE(review): sub_title/sub_body are interpolated raw into
                        # JSON — a double quote in a subtask title would break this
                        # request. Acceptable for model-generated titles; consider
                        # a python json.dumps payload if it ever fails.
                        curl -sf -X POST -H "Authorization: token $TIMMY_TOKEN" -H "Content-Type: application/json" \
                            -d "{\"title\":\"[SUB] $sub_title\",\"body\":\"$sub_body\"}" \
                            "$BASE/repos/$repo/issues" > /dev/null 2>&1

                        # Get the issue number of what we just created and label it
                        new_num=$(curl -sf -H "Authorization: token $TIMMY_TOKEN" \
                            "$BASE/repos/$repo/issues?state=open&limit=1&type=issues" | \
                            python3 -c "import json,sys; d=json.load(sys.stdin); print(d[0]['number'] if d else '')" 2>/dev/null)

                        if [ -n "$new_num" ] && [ -n "$kimi_id" ]; then
                            curl -sf -X POST -H "Authorization: token $TIMMY_TOKEN" -H "Content-Type: application/json" \
                                -d "{\"labels\":[$kimi_id]}" \
                                "$BASE/repos/$repo/issues/$new_num/labels" > /dev/null 2>&1 || true
                            log "SUBTASK: $repo #$new_num — $sub_title"
                        fi
                    fi
                done

                # Mark parent as kimi-done (subtasks will be picked up next cycle)
                [ -n "$progress_id" ] && curl -sf -X DELETE -H "Authorization: token $TIMMY_TOKEN" \
                    "$BASE/repos/$repo/issues/$issue_num/labels/$progress_id" > /dev/null 2>&1 || true
                [ -n "$done_id" ] && curl -sf -X POST -H "Authorization: token $TIMMY_TOKEN" -H "Content-Type: application/json" \
                    -d "{\"labels\":[$done_id]}" \
                    "$BASE/repos/$repo/issues/$issue_num/labels" > /dev/null 2>&1 || true

                dispatched=$((dispatched + 1))
                log "PLANNED: $repo #$issue_num — subtasks created, parent marked done"

            else
                # --- Plan says EXECUTE — proceed to execution ---
                log "EXECUTE: $repo #$issue_num — planning pass says single-pass OK"
                # Fall through to execution below
                needs_planning=false
            fi
        fi

        if [ "$needs_planning" = false ]; then
            # =============================================
            # PHASE 2: EXECUTION PASS (8 min timeout)
            # =============================================

            # Post pickup comment if we didn't already (simple tasks skip planning)
            if [ "$body_len" -le "$BODY_COMPLEXITY_THRESHOLD" ]; then
                curl -sf -X POST -H "Authorization: token $TOKEN" -H "Content-Type: application/json" \
                    -d "{\"body\":\"🟠 **KimiClaw picking up this task** via heartbeat.\\nBackend: kimi/kimi-code (Moonshot AI)\\nMode: **Direct execution** (task fits in one pass)\\nTimestamp: $(date -u '+%Y-%m-%dT%H:%M:%SZ')\"}" \
                    "$BASE/repos/$repo/issues/$issue_num/comments" > /dev/null 2>&1 || true
            fi

            log "DISPATCH: $repo #$issue_num to openclaw (timeout: ${EXEC_TIMEOUT}s)"

            exec_prompt="You are KimiClaw, an AI agent powered by Kimi K2.5 (Moonshot AI).
You are working on Gitea issue #$issue_num in repo $repo.
You have 8 MINUTES maximum. Be concise and focused.

ISSUE TITLE: $title

ISSUE BODY:
$body

YOUR TASK:
1. Read the issue carefully and do the work described
2. Stay focused — deliver the core ask, skip nice-to-haves
3. Provide your COMPLETE results as your response (use markdown)
4. If you realize mid-task this will take longer than 8 minutes, STOP and summarize what you've done so far plus what remains"

            # --- Dispatch to OpenClaw (background) ---
            (
                result=$(openclaw agent --agent main --message "$exec_prompt" --timeout $EXEC_TIMEOUT --json 2>/dev/null || echo '{"status":"error"}')
                status=$(echo "$result" | python3 -c "import json,sys; print(json.load(sys.stdin).get('status','error'))" 2>/dev/null || echo "error")

                # Extract response text
                response_text=$(echo "$result" | python3 -c "
import json,sys
d = json.load(sys.stdin)
payloads = d.get('result',{}).get('payloads',[])
print(payloads[0]['text'][:3000] if payloads else 'No response')
" 2>/dev/null || echo "No response")

                if [ "$status" = "ok" ] && [ "$response_text" != "No response" ]; then
                    # FIX: pass raw text to post_issue_comment_json, which
                    # JSON-encodes internally. The old pre-escaping pass got
                    # encoded twice and rendered literal \n in the comments.
                    if needs_pr_proof "$title $body" && ! has_pr_proof "$response_text"; then
                        log "BLOCKED: $repo #$issue_num — response lacked PR/proof for code task"
                        post_issue_comment_json "$repo" "$issue_num" "$TOKEN" "🟡 **KimiClaw produced analysis only — no PR/proof detected.**

This issue looks like implementation work, so it is NOT being marked kimi-done.
Kimi response excerpt:

$response_text

Action: removing Kimi queue labels so a code-capable agent can pick it up."
                        [ -n "$progress_id" ] && curl -sf -X DELETE -H "Authorization: token $TIMMY_TOKEN" \
                            "$BASE/repos/$repo/issues/$issue_num/labels/$progress_id" > /dev/null 2>&1 || true
                        [ -n "$kimi_id" ] && curl -sf -X DELETE -H "Authorization: token $TIMMY_TOKEN" \
                            "$BASE/repos/$repo/issues/$issue_num/labels/$kimi_id" > /dev/null 2>&1 || true
                    else
                        log "COMPLETED: $repo #$issue_num"
                        post_issue_comment_json "$repo" "$issue_num" "$TOKEN" "🟢 **KimiClaw result:**

$response_text"

                        [ -n "$progress_id" ] && curl -sf -X DELETE -H "Authorization: token $TIMMY_TOKEN" \
                            "$BASE/repos/$repo/issues/$issue_num/labels/$progress_id" > /dev/null 2>&1 || true
                        [ -n "$done_id" ] && curl -sf -X POST -H "Authorization: token $TIMMY_TOKEN" -H "Content-Type: application/json" \
                            -d "{\"labels\":[$done_id]}" \
                            "$BASE/repos/$repo/issues/$issue_num/labels" > /dev/null 2>&1 || true
                    fi
                else
                    log "FAILED: $repo #$issue_num — status=$status"

                    curl -sf -X POST -H "Authorization: token $TOKEN" -H "Content-Type: application/json" \
                        -d "{\"body\":\"\ud83d\udd34 **KimiClaw failed/timed out.**\\nStatus: $status\\nTimestamp: $(date -u '+%Y-%m-%dT%H:%M:%SZ')\\n\\nTask may be too complex for single-pass execution. Consider breaking into smaller subtasks.\"}" \
                        "$BASE/repos/$repo/issues/$issue_num/comments" > /dev/null 2>&1 || true

                    # Remove kimi-in-progress on failure
                    [ -n "$progress_id" ] && curl -sf -X DELETE -H "Authorization: token $TIMMY_TOKEN" \
                        "$BASE/repos/$repo/issues/$issue_num/labels/$progress_id" > /dev/null 2>&1 || true
                fi
            ) &

            dispatched=$((dispatched + 1))
            log "DISPATCHED: $repo #$issue_num (background PID $!)"
        fi

        # Enforce dispatch cap
        if [ "$dispatched" -ge "$MAX_DISPATCH" ]; then
            log "CAPPED: reached $MAX_DISPATCH dispatches, remaining issues deferred to next heartbeat"
            break 2 # Break out of both loops
        fi

        # Stagger dispatches to avoid overwhelming kimi
        sleep 3

    done <<< "$issues"
done
|
||||||
|
|
||||||
|
# Final heartbeat summary line.
if [ "$dispatched" -gt 0 ]; then
    log "Heartbeat: dispatched $dispatched task(s)"
else
    log "Heartbeat: no pending tasks"
fi
|
||||||
642
uniwizard/quality_scorer.py
Normal file
642
uniwizard/quality_scorer.py
Normal file
@@ -0,0 +1,642 @@
|
|||||||
|
"""
|
||||||
|
Uniwizard Backend Quality Scorer
|
||||||
|
|
||||||
|
Tracks per-backend performance metrics and provides intelligent routing recommendations.
|
||||||
|
Uses a rolling window of last 100 responses per backend across 5 task types.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import sqlite3
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
from dataclasses import dataclass, asdict
|
||||||
|
from datetime import datetime
|
||||||
|
from enum import Enum
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Optional, List, Dict, Tuple
|
||||||
|
from contextlib import contextmanager
|
||||||
|
|
||||||
|
|
||||||
|
class TaskType(Enum):
    """Task types for backend specialization tracking.

    The string values are what record_response() stores in the
    responses.task_type column, so they must stay stable.
    """
    CODE = "code"
    REASONING = "reasoning"
    RESEARCH = "research"
    CREATIVE = "creative"
    FAST_OPS = "fast_ops"
|
||||||
|
|
||||||
|
|
||||||
|
class ResponseStatus(Enum):
    """Status of a backend response.

    The string values are persisted in the responses.status column and
    matched literally by the score-aggregation SQL, so they must stay stable.
    """
    SUCCESS = "success"
    ERROR = "error"
    REFUSAL = "refusal"
    TIMEOUT = "timeout"
|
||||||
|
|
||||||
|
|
||||||
|
# The 7 Uniwizard backends.
# QualityScorer rejects (ValueError) any backend name not in this list.
BACKENDS = [
    "anthropic",
    "openai-codex",
    "gemini",
    "groq",
    "grok",
    "kimi-coding",
    "openrouter",
]

# Default DB path — per-response metrics live in a SQLite file under ~/.timmy.
DEFAULT_DB_PATH = Path.home() / ".timmy" / "uniwizard" / "quality_scores.db"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
class BackendScore:
    """Aggregated score card for a backend on a specific task type."""
    backend: str            # backend name (one of BACKENDS)
    task_type: str          # task type this aggregate covers
    total_requests: int     # responses counted in the rolling window
    success_count: int      # responses with status == "success"
    error_count: int        # responses with status == "error"
    refusal_count: int      # responses with status == "refusal"
    timeout_count: int      # responses with status == "timeout"
    avg_latency_ms: float   # mean total latency over the window
    avg_ttft_ms: float      # mean time-to-first-token over the window
    p95_latency_ms: float   # 95th-percentile latency over the window
    score: float  # Composite quality score (0-100)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
class ResponseRecord:
    """Single response record for storage (mirrors the responses table)."""
    id: Optional[int]        # SQLite rowid; None before the row is inserted
    backend: str             # backend name (one of BACKENDS)
    task_type: str           # task type string (TaskType value)
    status: str              # response status string (ResponseStatus value)
    latency_ms: float        # total response latency
    ttft_ms: float  # Time to first token
    timestamp: float         # Unix epoch seconds (time.time())
    metadata: Optional[str]  # JSON string for extensibility
|
||||||
|
|
||||||
|
|
||||||
|
class QualityScorer:
    """
    Tracks backend quality metrics with rolling windows.

    Stores per-response data in SQLite, computes aggregated scores
    on-demand for routing decisions.
    """

    # Only the most recent N responses per (backend, task_type) pair are
    # retained; older rows are pruned on each record_response() call.
    ROLLING_WINDOW_SIZE = 100

    # Score weights for composite calculation (they sum to 1.0).
    WEIGHTS = {
        "success_rate": 0.35,
        "low_error_rate": 0.20,
        "low_refusal_rate": 0.15,
        "low_timeout_rate": 0.10,
        "low_latency": 0.20,
    }
|
||||||
|
|
||||||
|
def __init__(self, db_path: Optional[Path] = None):
|
||||||
|
self.db_path = Path(db_path) if db_path else DEFAULT_DB_PATH
|
||||||
|
self._init_db()
|
||||||
|
|
||||||
|
    @contextmanager
    def _get_conn(self):
        """Get a database connection with proper cleanup.

        Yields a sqlite3.Connection whose rows support access by column
        name. Commits only when the managed block completes without
        raising; the connection is closed in all cases, so changes made
        before an exception are not persisted.
        """
        conn = sqlite3.connect(str(self.db_path))
        conn.row_factory = sqlite3.Row  # rows usable as row['column']
        try:
            yield conn
            conn.commit()  # reached only if the with-body did not raise
        finally:
            conn.close()
|
||||||
|
|
||||||
|
    def _init_db(self):
        """Initialize the SQLite database schema.

        Idempotent: uses IF NOT EXISTS for the table and both indexes,
        and creates the DB file's parent directory when missing.
        """
        self.db_path.parent.mkdir(parents=True, exist_ok=True)

        with self._get_conn() as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS responses (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    backend TEXT NOT NULL,
                    task_type TEXT NOT NULL,
                    status TEXT NOT NULL,
                    latency_ms REAL NOT NULL,
                    ttft_ms REAL NOT NULL,
                    timestamp REAL NOT NULL,
                    metadata TEXT
                )
            """)

            # Index for fast rolling window queries
            conn.execute("""
                CREATE INDEX IF NOT EXISTS idx_backend_task_time
                ON responses(backend, task_type, timestamp DESC)
            """)

            # Covers backend-wide queries that do not filter on task_type.
            conn.execute("""
                CREATE INDEX IF NOT EXISTS idx_backend_time
                ON responses(backend, timestamp DESC)
            """)
|
||||||
|
|
||||||
|
def record_response(
|
||||||
|
self,
|
||||||
|
backend: str,
|
||||||
|
task_type: str,
|
||||||
|
status: ResponseStatus,
|
||||||
|
latency_ms: float,
|
||||||
|
ttft_ms: float,
|
||||||
|
metadata: Optional[Dict] = None
|
||||||
|
) -> None:
|
||||||
|
"""
|
||||||
|
Record a response from a backend.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
backend: Backend name (must be in BACKENDS)
|
||||||
|
task_type: Task type string or TaskType enum
|
||||||
|
status: ResponseStatus (success/error/refusal/timeout)
|
||||||
|
latency_ms: Total response latency in milliseconds
|
||||||
|
ttft_ms: Time to first token in milliseconds
|
||||||
|
metadata: Optional dict with additional context
|
||||||
|
"""
|
||||||
|
if backend not in BACKENDS:
|
||||||
|
raise ValueError(f"Unknown backend: {backend}. Must be one of: {BACKENDS}")
|
||||||
|
|
||||||
|
task_str = task_type.value if isinstance(task_type, TaskType) else task_type
|
||||||
|
|
||||||
|
with self._get_conn() as conn:
|
||||||
|
conn.execute("""
|
||||||
|
INSERT INTO responses (backend, task_type, status, latency_ms, ttft_ms, timestamp, metadata)
|
||||||
|
VALUES (?, ?, ?, ?, ?, ?, ?)
|
||||||
|
""", (
|
||||||
|
backend,
|
||||||
|
task_str,
|
||||||
|
status.value,
|
||||||
|
latency_ms,
|
||||||
|
ttft_ms,
|
||||||
|
time.time(),
|
||||||
|
json.dumps(metadata) if metadata else None
|
||||||
|
))
|
||||||
|
|
||||||
|
# Prune old records to maintain rolling window
|
||||||
|
self._prune_rolling_window(conn, backend, task_str)
|
||||||
|
|
||||||
|
def _prune_rolling_window(self, conn: sqlite3.Connection, backend: str, task_type: str) -> None:
|
||||||
|
"""Remove records beyond the rolling window size for this backend/task combo."""
|
||||||
|
# Get IDs to keep (most recent ROLLING_WINDOW_SIZE)
|
||||||
|
cursor = conn.execute("""
|
||||||
|
SELECT id FROM responses
|
||||||
|
WHERE backend = ? AND task_type = ?
|
||||||
|
ORDER BY timestamp DESC
|
||||||
|
LIMIT ? OFFSET ?
|
||||||
|
""", (backend, task_type, self.ROLLING_WINDOW_SIZE, self.ROLLING_WINDOW_SIZE))
|
||||||
|
|
||||||
|
ids_to_delete = [row[0] for row in cursor.fetchall()]
|
||||||
|
|
||||||
|
if ids_to_delete:
|
||||||
|
placeholders = ','.join('?' * len(ids_to_delete))
|
||||||
|
conn.execute(f"""
|
||||||
|
DELETE FROM responses
|
||||||
|
WHERE id IN ({placeholders})
|
||||||
|
""", ids_to_delete)
|
||||||
|
|
||||||
|
def get_backend_score(
    self,
    backend: str,
    task_type: Optional[str] = None
) -> BackendScore:
    """
    Get aggregated score for a backend, optionally filtered by task type.

    Args:
        backend: Backend name
        task_type: Optional task type filter

    Returns:
        BackendScore with aggregated metrics over the rolling window.

    Raises:
        ValueError: If ``backend`` is not a known backend.
    """
    if backend not in BACKENDS:
        raise ValueError(f"Unknown backend: {backend}")

    # Build the filter once instead of duplicating the whole aggregate
    # query for the with/without-task_type cases (the two copies in the
    # original had already started to invite divergence).
    where = "WHERE backend = ?"
    params = [backend]
    if task_type:
        where += " AND task_type = ?"
        params.append(task_type)
    params.append(self.ROLLING_WINDOW_SIZE)

    with self._get_conn() as conn:
        row = conn.execute(f"""
            SELECT
                COUNT(*) as total,
                SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) as successes,
                SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) as errors,
                SUM(CASE WHEN status = 'refusal' THEN 1 ELSE 0 END) as refusals,
                SUM(CASE WHEN status = 'timeout' THEN 1 ELSE 0 END) as timeouts,
                AVG(latency_ms) as avg_latency,
                AVG(ttft_ms) as avg_ttft,
                MAX(latency_ms) as max_latency
            FROM (
                SELECT * FROM responses
                {where}
                ORDER BY timestamp DESC
                LIMIT ?
            )
        """, params).fetchone()

        total = row[0] or 0

        if total == 0:
            # No observations yet: return a zeroed score card.
            return BackendScore(
                backend=backend,
                task_type=task_type or "all",
                total_requests=0,
                success_count=0,
                error_count=0,
                refusal_count=0,
                timeout_count=0,
                avg_latency_ms=0.0,
                avg_ttft_ms=0.0,
                p95_latency_ms=0.0,
                score=0.0
            )

        successes = row[1] or 0
        errors = row[2] or 0
        refusals = row[3] or 0
        timeouts = row[4] or 0
        avg_latency = row[5] or 0.0
        avg_ttft = row[6] or 0.0

        # Calculate P95 latency over the same rolling window.
        p95 = self._get_p95_latency(conn, backend, task_type)

        # Composite 0-100 quality score.
        score = self._calculate_score(
            total, successes, errors, refusals, timeouts, avg_latency
        )

        return BackendScore(
            backend=backend,
            task_type=task_type or "all",
            total_requests=total,
            success_count=successes,
            error_count=errors,
            refusal_count=refusals,
            timeout_count=timeouts,
            avg_latency_ms=round(avg_latency, 2),
            avg_ttft_ms=round(avg_ttft, 2),
            p95_latency_ms=round(p95, 2),
            score=round(score, 2)
        )
def _get_p95_latency(
|
||||||
|
self,
|
||||||
|
conn: sqlite3.Connection,
|
||||||
|
backend: str,
|
||||||
|
task_type: Optional[str]
|
||||||
|
) -> float:
|
||||||
|
"""Calculate P95 latency from rolling window."""
|
||||||
|
if task_type:
|
||||||
|
row = conn.execute("""
|
||||||
|
SELECT latency_ms FROM responses
|
||||||
|
WHERE backend = ? AND task_type = ?
|
||||||
|
ORDER BY timestamp DESC
|
||||||
|
LIMIT ?
|
||||||
|
""", (backend, task_type, self.ROLLING_WINDOW_SIZE)).fetchall()
|
||||||
|
else:
|
||||||
|
row = conn.execute("""
|
||||||
|
SELECT latency_ms FROM responses
|
||||||
|
WHERE backend = ?
|
||||||
|
ORDER BY timestamp DESC
|
||||||
|
LIMIT ?
|
||||||
|
""", (backend, self.ROLLING_WINDOW_SIZE)).fetchall()
|
||||||
|
|
||||||
|
if not row:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
latencies = sorted([r[0] for r in row])
|
||||||
|
idx = int(len(latencies) * 0.95)
|
||||||
|
return latencies[min(idx, len(latencies) - 1)]
|
||||||
|
|
||||||
|
def _calculate_score(
|
||||||
|
self,
|
||||||
|
total: int,
|
||||||
|
successes: int,
|
||||||
|
errors: int,
|
||||||
|
refusals: int,
|
||||||
|
timeouts: int,
|
||||||
|
avg_latency: float
|
||||||
|
) -> float:
|
||||||
|
"""
|
||||||
|
Calculate composite quality score (0-100).
|
||||||
|
|
||||||
|
Higher is better. Considers success rate, error/refusal/timeout rates,
|
||||||
|
and normalized latency.
|
||||||
|
"""
|
||||||
|
if total == 0:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
success_rate = successes / total
|
||||||
|
error_rate = errors / total
|
||||||
|
refusal_rate = refusals / total
|
||||||
|
timeout_rate = timeouts / total
|
||||||
|
|
||||||
|
# Normalize latency: assume 5000ms is "bad" (score 0), 100ms is "good" (score 1)
|
||||||
|
# Using exponential decay for latency scoring
|
||||||
|
latency_score = max(0, min(1, 1 - (avg_latency / 10000)))
|
||||||
|
|
||||||
|
score = (
|
||||||
|
self.WEIGHTS["success_rate"] * success_rate * 100 +
|
||||||
|
self.WEIGHTS["low_error_rate"] * (1 - error_rate) * 100 +
|
||||||
|
self.WEIGHTS["low_refusal_rate"] * (1 - refusal_rate) * 100 +
|
||||||
|
self.WEIGHTS["low_timeout_rate"] * (1 - timeout_rate) * 100 +
|
||||||
|
self.WEIGHTS["low_latency"] * latency_score * 100
|
||||||
|
)
|
||||||
|
|
||||||
|
return max(0, min(100, score))
|
||||||
|
|
||||||
|
def recommend_backend(
    self,
    task_type: Optional[str] = None,
    min_samples: int = 5
) -> List[Tuple[str, float]]:
    """
    Get a ranked list of backends for a task type.

    Args:
        task_type: Optional task type to specialize for
        min_samples: Minimum samples before considering a backend

    Returns:
        List of (backend_name, score) tuples, sorted by score descending.
        Backends with fewer than ``min_samples`` observations are
        down-weighted proportionally rather than excluded.
    """
    ranked: List[Tuple[str, float]] = []
    for name in BACKENDS:
        card = self.get_backend_score(name, task_type)
        # confidence == 1.0 once the backend has enough samples.
        confidence = min(1.0, card.total_requests / min_samples)
        ranked.append((name, round(card.score * confidence, 2)))

    return sorted(ranked, key=lambda item: item[1], reverse=True)
def get_all_scores(
    self,
    task_type: Optional[str] = None
) -> Dict[str, BackendScore]:
    """Get a score card for every known backend, keyed by backend name."""
    cards: Dict[str, BackendScore] = {}
    for name in BACKENDS:
        cards[name] = self.get_backend_score(name, task_type)
    return cards
def get_task_breakdown(self, backend: str) -> Dict[str, BackendScore]:
    """Get per-task-type score cards for a single backend.

    Raises:
        ValueError: If ``backend`` is not a known backend.
    """
    if backend not in BACKENDS:
        raise ValueError(f"Unknown backend: {backend}")

    breakdown: Dict[str, BackendScore] = {}
    for task in TaskType:
        breakdown[task.value] = self.get_backend_score(backend, task.value)
    return breakdown
def get_stats(self) -> Dict:
    """Get overall database statistics.

    Returns:
        Dict with total record count, per-backend and per-task-type
        counts (zero-filled for combinations with no data), and
        ISO-formatted oldest/newest record timestamps (None when the
        table is empty).
    """
    with self._get_conn() as conn:
        # One GROUP BY per dimension instead of a separate COUNT query
        # per backend and per task type (the original issued N+M queries).
        backend_counts = dict(conn.execute(
            "SELECT backend, COUNT(*) FROM responses GROUP BY backend"
        ).fetchall())
        by_backend = {backend: backend_counts.get(backend, 0) for backend in BACKENDS}

        task_counts = dict(conn.execute(
            "SELECT task_type, COUNT(*) FROM responses GROUP BY task_type"
        ).fetchall())
        by_task = {task.value: task_counts.get(task.value, 0) for task in TaskType}

        # COUNT/MIN/MAX in a single scan.
        total, oldest, newest = conn.execute(
            "SELECT COUNT(*), MIN(timestamp), MAX(timestamp) FROM responses"
        ).fetchone()

        return {
            "total_records": total,
            "by_backend": by_backend,
            "by_task_type": by_task,
            "oldest_record": datetime.fromtimestamp(oldest).isoformat() if oldest else None,
            "newest_record": datetime.fromtimestamp(newest).isoformat() if newest else None,
        }
def clear_data(self) -> None:
    """Delete every recorded response (primarily for tests)."""
    with self._get_conn() as conn:
        conn.execute("DELETE FROM responses")
def print_score_report(scorer: QualityScorer, task_type: Optional[str] = None) -> None:
    """
    Print a formatted score report to console.

    Renders a ranked table of backends (score, success rate, average
    latency, sample count) with a score bar per row, then a detailed
    per-backend breakdown, and finally a top-3 recommendation list.

    Args:
        scorer: QualityScorer instance
        task_type: Optional task type filter
    """
    print("\n" + "=" * 80)
    print(" UNIWIZARD BACKEND QUALITY SCORES")
    print("=" * 80)

    if task_type:
        print(f"\n Task Type: {task_type.upper()}")
    else:
        print("\n Overall Performance (all task types)")

    print("-" * 80)

    # Ranked (backend, adjusted score) pairs plus full cards for detail rows.
    scores = scorer.recommend_backend(task_type)
    all_scores = scorer.get_all_scores(task_type)

    # Header
    print(f"\n {'Rank':<6} {'Backend':<16} {'Score':<8} {'Success':<10} {'Latency':<12} {'Samples':<8}")
    print(" " + "-" * 72)

    # Rankings
    for rank, (backend, score) in enumerate(scores, 1):
        card = all_scores[backend]
        # Guard against division by zero for backends with no data yet.
        success_pct = (card.success_count / card.total_requests * 100) if card.total_requests > 0 else 0

        bar_len = int(score / 5)  # 20 chars = 100
        bar = "█" * bar_len + "░" * (20 - bar_len)

        print(f" {rank:<6} {backend:<16} {score:>6.1f} {success_pct:>6.1f}% {card.avg_latency_ms:>7.1f}ms {card.total_requests:>6}")
        print(f" [{bar}]")

    # Per-backend breakdown
    print("\n" + "-" * 80)
    print(" DETAILED BREAKDOWN")
    print("-" * 80)

    for backend in BACKENDS:
        card = all_scores[backend]
        if card.total_requests == 0:
            print(f"\n {backend}: No data yet")
            continue

        print(f"\n {backend.upper()}:")
        print(f" Requests: {card.total_requests} | "
              f"Success: {card.success_count} | "
              f"Errors: {card.error_count} | "
              f"Refusals: {card.refusal_count} | "
              f"Timeouts: {card.timeout_count}")
        print(f" Avg Latency: {card.avg_latency_ms}ms | "
              f"TTFT: {card.avg_ttft_ms}ms | "
              f"P95: {card.p95_latency_ms}ms")
        print(f" Quality Score: {card.score}/100")

    # Recommendations
    print("\n" + "=" * 80)
    print(" RECOMMENDATIONS")
    print("=" * 80)

    recommendations = scorer.recommend_backend(task_type)
    # Only surface backends that actually accumulated a positive score.
    top_3 = [b for b, s in recommendations[:3] if s > 0]

    if top_3:
        print(f"\n Best backends{f' for {task_type}' if task_type else ''}:")
        for i, backend in enumerate(top_3, 1):
            score = next(s for b, s in recommendations if b == backend)
            print(f" {i}. {backend} (score: {score})")
    else:
        print("\n Not enough data for recommendations yet.")

    print("\n" + "=" * 80)
def print_full_report(scorer: QualityScorer) -> None:
    """Print a complete report including per-task-type breakdowns.

    Combines database statistics, the overall score report, and the
    top-3 backends for each TaskType.
    """
    print("\n" + "=" * 80)
    print(" UNIWIZARD BACKEND QUALITY SCORECARD")
    print("=" * 80)

    stats = scorer.get_stats()
    print(f"\n Database: {scorer.db_path}")
    print(f" Total Records: {stats['total_records']}")
    print(f" Date Range: {stats['oldest_record'] or 'N/A'} to {stats['newest_record'] or 'N/A'}")

    # Overall scores
    print_score_report(scorer)

    # Per-task breakdown
    print("\n" + "=" * 80)
    print(" PER-TASK SPECIALIZATION")
    print("=" * 80)

    for task in TaskType:
        print(f"\n{'─' * 80}")
        scores = scorer.recommend_backend(task.value)
        print(f"\n {task.value.upper()}:")

        # Show only the top three backends with a non-zero score.
        for rank, (backend, score) in enumerate(scores[:3], 1):
            if score > 0:
                print(f" {rank}. {backend}: {score}")

    print("\n" + "=" * 80)
# Convenience functions for CLI usage
|
||||||
|
def get_scorer(db_path: Optional[Path] = None) -> QualityScorer:
    """Construct a QualityScorer for *db_path* (default location when None)."""
    return QualityScorer(db_path)
def record(
    backend: str,
    task_type: str,
    status: str,
    latency_ms: float,
    ttft_ms: float = 0.0,
    metadata: Optional[Dict] = None
) -> None:
    """Convenience wrapper: record one response via a default scorer."""
    get_scorer().record_response(
        backend=backend,
        task_type=task_type,
        status=ResponseStatus(status),
        latency_ms=latency_ms,
        ttft_ms=ttft_ms,
        metadata=metadata
    )
def recommend(task_type: Optional[str] = None) -> List[Tuple[str, float]]:
    """Convenience wrapper: ranked (backend, score) list from a default scorer."""
    return get_scorer().recommend_backend(task_type)
def report(task_type: Optional[str] = None) -> None:
    """Convenience wrapper: print the score report using a default scorer."""
    print_score_report(get_scorer(), task_type)
def full_report() -> None:
    """Convenience wrapper: print the full report using a default scorer."""
    print_full_report(get_scorer())
if __name__ == "__main__":
    # Demo mode: seed the database with synthetic traffic when it is
    # empty, then print the full scorecard.
    scorer = QualityScorer()

    stats = scorer.get_stats()
    if stats["total_records"] == 0:
        print("Generating demo data...")
        import random

        for _ in range(50):
            scorer.record_response(
                backend=random.choice(BACKENDS),
                task_type=random.choice([t.value for t in TaskType]),
                status=random.choices(
                    [ResponseStatus.SUCCESS, ResponseStatus.ERROR, ResponseStatus.REFUSAL, ResponseStatus.TIMEOUT],
                    weights=[0.85, 0.08, 0.05, 0.02]
                )[0],
                # BUG FIX: random.gauss can return negative samples, which
                # would produce impossible negative latencies and skew the
                # demo averages — clamp both timings at zero.
                latency_ms=max(0.0, random.gauss(1500, 500)),
                ttft_ms=max(0.0, random.gauss(200, 100))
            )

    full_report()
769
uniwizard/self_grader.py
Normal file
769
uniwizard/self_grader.py
Normal file
@@ -0,0 +1,769 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Self-Grader Module for Timmy/UniWizard
|
||||||
|
|
||||||
|
Grades Hermes session logs to identify patterns in failures and track improvement.
|
||||||
|
Connects to quality scoring (#98) and adaptive routing (#88).
|
||||||
|
|
||||||
|
Author: Timmy (UniWizard)
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sqlite3
|
||||||
|
import re
|
||||||
|
from pathlib import Path
|
||||||
|
from dataclasses import dataclass, asdict
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
from typing import List, Dict, Optional, Any, Tuple
|
||||||
|
from collections import defaultdict
|
||||||
|
import statistics
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
class SessionGrade:
    """Grade for a single session.

    One record per graded Hermes session log; field layout mirrors a row
    of the ``session_grades`` SQLite table.
    """
    session_id: str  # session identifier from the log (UNIQUE in the DB)
    session_file: str  # basename of the graded session file
    graded_at: str  # ISO timestamp of when grading ran

    # Core metrics
    task_completed: bool
    tool_calls_efficient: int  # 1-5 scale
    response_quality: int  # 1-5 scale
    errors_recovered: bool
    total_api_calls: int  # number of assistant turns in the session

    # Additional metadata
    model: str
    platform: Optional[str]
    session_start: str
    duration_seconds: Optional[float]  # None when message timestamps are missing/unparsable
    task_summary: str  # first 200 chars of the first user message

    # Error analysis
    total_errors: int
    error_types: str  # JSON list of error categories
    tools_with_errors: str  # JSON list of tool names

    # Pattern flags
    had_repeated_errors: bool  # 3+ consecutive tool errors
    had_infinite_loop_risk: bool  # 5+ consecutive errors or >50 assistant turns
    had_user_clarification: bool
@dataclass
class WeeklyReport:
    """Weekly improvement report.

    Aggregated view over one week of session grades; presumably
    serialized into ``weekly_reports.report_json`` — confirm against the
    writer, which is outside this chunk.
    """
    week_start: str
    week_end: str
    total_sessions: int
    avg_tool_efficiency: float  # mean of the 1-5 efficiency grades
    avg_response_quality: float  # mean of the 1-5 quality grades
    completion_rate: float
    error_recovery_rate: float

    # Patterns
    worst_task_types: List[Tuple[str, float]]
    most_error_prone_tools: List[Tuple[str, int]]
    common_error_patterns: List[Tuple[str, int]]

    # Trends
    improvement_suggestions: List[str]
class SelfGrader:
|
||||||
|
"""Grades Hermes sessions and tracks improvement patterns."""
|
||||||
|
|
||||||
|
# Error pattern regexes
|
||||||
|
# Category -> compiled pattern. _classify_error returns the first key
# whose pattern matches a tool result, checked in insertion order, so
# earlier entries take precedence.
ERROR_PATTERNS = {
    'file_not_found': re.compile(r'file.*not found|no such file|does not exist', re.I),
    'permission_denied': re.compile(r'permission denied|access denied|unauthorized', re.I),
    'timeout': re.compile(r'time(d)?\s*out|deadline exceeded', re.I),
    'api_error': re.compile(r'api.*error|rate limit|too many requests', re.I),
    'syntax_error': re.compile(r'syntax error|invalid syntax|parse error', re.I),
    'command_failed': re.compile(r'exit_code.*[1-9]|command.*failed|failed to', re.I),
    'network_error': re.compile(r'network|connection|unreachable|refused', re.I),
    'tool_not_found': re.compile(r'tool.*not found|unknown tool|no tool named', re.I),
}
|
||||||
|
# Task type patterns
|
||||||
|
# Task category -> compiled pattern, matched case-insensitively.
# NOTE(review): presumably matched against the session's task text to
# bucket sessions by kind of work — usage is outside this chunk; verify.
TASK_PATTERNS = {
    'code_review': re.compile(r'code review|review.*code|review.*pr|pull request', re.I),
    'debugging': re.compile(r'debug|fix.*bug|troubleshoot|error.*fix', re.I),
    'feature_impl': re.compile(r'implement|add.*feature|build.*function', re.I),
    'refactoring': re.compile(r'refactor|clean.*up|reorganize|restructure', re.I),
    'documentation': re.compile(r'document|readme|docstring|comment', re.I),
    'testing': re.compile(r'test|pytest|unit test|integration test', re.I),
    'research': re.compile(r'research|investigate|look up|find.*about', re.I),
    'deployment': re.compile(r'deploy|release|publish|push.*prod', re.I),
    'data_analysis': re.compile(r'analyze.*data|process.*file|parse.*json|csv', re.I),
    'infrastructure': re.compile(r'server|docker|kubernetes|terraform|ansible', re.I),
}
def __init__(self, grades_db_path: Optional[Path] = None,
|
||||||
|
sessions_dir: Optional[Path] = None):
|
||||||
|
"""Initialize the grader with database and sessions directory."""
|
||||||
|
self.grades_db_path = Path(grades_db_path) if grades_db_path else Path.home() / ".timmy" / "uniwizard" / "session_grades.db"
|
||||||
|
self.sessions_dir = Path(sessions_dir) if sessions_dir else Path.home() / ".hermes" / "sessions"
|
||||||
|
self._init_database()
|
||||||
|
|
||||||
|
def _init_database(self):
    """Initialize the SQLite database with schema.

    Creates the parent directory, the ``session_grades`` table (one row
    per graded session), supporting indexes, and the ``weekly_reports``
    table. All statements are idempotent (IF NOT EXISTS), so this is
    safe to call on every construction.
    """
    self.grades_db_path.parent.mkdir(parents=True, exist_ok=True)

    with sqlite3.connect(self.grades_db_path) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS session_grades (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                session_id TEXT UNIQUE NOT NULL,
                session_file TEXT NOT NULL,
                graded_at TEXT NOT NULL,

                -- Core metrics
                task_completed INTEGER NOT NULL,
                tool_calls_efficient INTEGER NOT NULL,
                response_quality INTEGER NOT NULL,
                errors_recovered INTEGER NOT NULL,
                total_api_calls INTEGER NOT NULL,

                -- Metadata
                model TEXT,
                platform TEXT,
                session_start TEXT,
                duration_seconds REAL,
                task_summary TEXT,

                -- Error analysis
                total_errors INTEGER NOT NULL,
                error_types TEXT,
                tools_with_errors TEXT,

                -- Pattern flags
                had_repeated_errors INTEGER NOT NULL,
                had_infinite_loop_risk INTEGER NOT NULL,
                had_user_clarification INTEGER NOT NULL
            )
        """)

        # Index for efficient queries
        conn.execute("CREATE INDEX IF NOT EXISTS idx_graded_at ON session_grades(graded_at)")
        conn.execute("CREATE INDEX IF NOT EXISTS idx_session_start ON session_grades(session_start)")
        conn.execute("CREATE INDEX IF NOT EXISTS idx_model ON session_grades(model)")

        # Weekly reports table
        conn.execute("""
            CREATE TABLE IF NOT EXISTS weekly_reports (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                week_start TEXT UNIQUE NOT NULL,
                week_end TEXT NOT NULL,
                generated_at TEXT NOT NULL,
                report_json TEXT NOT NULL
            )
        """)

        conn.commit()
def grade_session_file(self, session_path: Path) -> Optional[SessionGrade]:
    """Grade a single session file.

    Loads the JSON session log, runs the message-flow analysis, and
    derives the four core grades plus metadata.

    Args:
        session_path: Path to a Hermes session JSON file.

    Returns:
        A populated SessionGrade, or None if the file is unreadable,
        not valid JSON, or contains no messages.
    """
    try:
        with open(session_path) as f:
            data = json.load(f)
    except (json.JSONDecodeError, IOError) as e:
        # Best-effort batch grading: report and skip unreadable files.
        print(f"Error reading {session_path}: {e}")
        return None

    session_id = data.get('session_id', '')
    messages = data.get('messages', [])

    if not messages:
        return None

    # Analyze message flow (errors, tool calls, pattern flags).
    analysis = self._analyze_messages(messages)

    # Calculate grades
    task_completed = self._assess_task_completion(messages, analysis)
    tool_efficiency = self._assess_tool_efficiency(analysis)
    response_quality = self._assess_response_quality(messages, analysis)
    errors_recovered = self._assess_error_recovery(messages, analysis)

    # Extract task summary from first user message
    task_summary = ""
    for msg in messages:
        if msg.get('role') == 'user':
            task_summary = msg.get('content', '')[:200]
            break

    # Calculate duration if timestamps available
    duration = None
    if messages and 'timestamp' in messages[0] and 'timestamp' in messages[-1]:
        try:
            # 'Z' suffix is not accepted by fromisoformat before 3.11;
            # normalize it to an explicit UTC offset.
            start = datetime.fromisoformat(messages[0]['timestamp'].replace('Z', '+00:00'))
            end = datetime.fromisoformat(messages[-1]['timestamp'].replace('Z', '+00:00'))
            duration = (end - start).total_seconds()
        except (ValueError, KeyError):
            pass

    return SessionGrade(
        session_id=session_id,
        session_file=str(session_path.name),
        graded_at=datetime.now().isoformat(),
        task_completed=task_completed,
        tool_calls_efficient=tool_efficiency,
        response_quality=response_quality,
        errors_recovered=errors_recovered,
        total_api_calls=analysis['total_api_calls'],
        model=data.get('model', 'unknown'),
        platform=data.get('platform'),
        session_start=data.get('session_start', ''),
        duration_seconds=duration,
        task_summary=task_summary,
        total_errors=analysis['total_errors'],
        error_types=json.dumps(list(analysis['error_types'])),
        tools_with_errors=json.dumps(list(analysis['tools_with_errors'])),
        had_repeated_errors=analysis['had_repeated_errors'],
        had_infinite_loop_risk=analysis['had_infinite_loop_risk'],
        had_user_clarification=analysis['had_user_clarification']
    )
def _analyze_messages(self, messages: List[Dict]) -> Dict[str, Any]:
|
||||||
|
"""Analyze message flow to extract metrics."""
|
||||||
|
analysis = {
|
||||||
|
'total_api_calls': 0,
|
||||||
|
'total_errors': 0,
|
||||||
|
'error_types': set(),
|
||||||
|
'tools_with_errors': set(),
|
||||||
|
'tool_call_counts': defaultdict(int),
|
||||||
|
'error_sequences': [],
|
||||||
|
'had_repeated_errors': False,
|
||||||
|
'had_infinite_loop_risk': False,
|
||||||
|
'had_user_clarification': False,
|
||||||
|
'final_assistant_msg': None,
|
||||||
|
'consecutive_errors': 0,
|
||||||
|
'max_consecutive_errors': 0,
|
||||||
|
}
|
||||||
|
|
||||||
|
last_tool_was_error = False
|
||||||
|
|
||||||
|
for i, msg in enumerate(messages):
|
||||||
|
role = msg.get('role')
|
||||||
|
|
||||||
|
if role == 'assistant':
|
||||||
|
analysis['total_api_calls'] += 1
|
||||||
|
|
||||||
|
# Check for clarification requests
|
||||||
|
content = msg.get('content', '')
|
||||||
|
tool_calls = msg.get('tool_calls', [])
|
||||||
|
if tool_calls and tool_calls[0].get('function', {}).get('name') == 'clarify':
|
||||||
|
analysis['had_user_clarification'] = True
|
||||||
|
if 'clarify' in content.lower() and 'need clarification' in content.lower():
|
||||||
|
analysis['had_user_clarification'] = True
|
||||||
|
|
||||||
|
# Track tool calls
|
||||||
|
for tc in tool_calls:
|
||||||
|
tool_name = tc.get('function', {}).get('name', 'unknown')
|
||||||
|
analysis['tool_call_counts'][tool_name] += 1
|
||||||
|
|
||||||
|
# Track final assistant message
|
||||||
|
analysis['final_assistant_msg'] = msg
|
||||||
|
|
||||||
|
# Don't reset consecutive errors here - they continue until a tool succeeds
|
||||||
|
|
||||||
|
elif role == 'tool':
|
||||||
|
content = msg.get('content', '')
|
||||||
|
tool_name = msg.get('name', 'unknown')
|
||||||
|
|
||||||
|
# Check for errors
|
||||||
|
is_error = self._detect_error(content)
|
||||||
|
if is_error:
|
||||||
|
analysis['total_errors'] += 1
|
||||||
|
analysis['tools_with_errors'].add(tool_name)
|
||||||
|
|
||||||
|
# Classify error
|
||||||
|
error_type = self._classify_error(content)
|
||||||
|
analysis['error_types'].add(error_type)
|
||||||
|
|
||||||
|
# Track consecutive errors (consecutive tool messages with errors)
|
||||||
|
analysis['consecutive_errors'] += 1
|
||||||
|
analysis['max_consecutive_errors'] = max(
|
||||||
|
analysis['max_consecutive_errors'],
|
||||||
|
analysis['consecutive_errors']
|
||||||
|
)
|
||||||
|
|
||||||
|
last_tool_was_error = True
|
||||||
|
else:
|
||||||
|
# Reset consecutive errors on success
|
||||||
|
analysis['consecutive_errors'] = 0
|
||||||
|
last_tool_was_error = False
|
||||||
|
|
||||||
|
# Detect patterns
|
||||||
|
analysis['had_repeated_errors'] = analysis['max_consecutive_errors'] >= 3
|
||||||
|
analysis['had_infinite_loop_risk'] = (
|
||||||
|
analysis['max_consecutive_errors'] >= 5 or
|
||||||
|
analysis['total_api_calls'] > 50
|
||||||
|
)
|
||||||
|
|
||||||
|
return analysis
|
||||||
|
|
||||||
|
def _detect_error(self, content: str) -> bool:
|
||||||
|
"""Detect if tool result contains an error."""
|
||||||
|
if not content:
|
||||||
|
return False
|
||||||
|
|
||||||
|
content_lower = content.lower()
|
||||||
|
|
||||||
|
# Check for explicit error indicators
|
||||||
|
error_indicators = [
|
||||||
|
'"error":', '"error" :', 'error:', 'exception:',
|
||||||
|
'"exit_code": 1', '"exit_code": 2', '"exit_code": -1',
|
||||||
|
'traceback', 'failed', 'failure',
|
||||||
|
]
|
||||||
|
|
||||||
|
for indicator in error_indicators:
|
||||||
|
if indicator in content_lower:
|
||||||
|
return True
|
||||||
|
|
||||||
|
return False
|
||||||
|
|
||||||
|
def _classify_error(self, content: str) -> str:
|
||||||
|
"""Classify the type of error."""
|
||||||
|
content_lower = content.lower()
|
||||||
|
|
||||||
|
for error_type, pattern in self.ERROR_PATTERNS.items():
|
||||||
|
if pattern.search(content_lower):
|
||||||
|
return error_type
|
||||||
|
|
||||||
|
return 'unknown'
|
||||||
|
|
||||||
|
def _assess_task_completion(self, messages: List[Dict], analysis: Dict) -> bool:
|
||||||
|
"""Assess whether the task was likely completed."""
|
||||||
|
if not messages:
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Check final assistant message
|
||||||
|
final_msg = analysis.get('final_assistant_msg')
|
||||||
|
if not final_msg:
|
||||||
|
return False
|
||||||
|
|
||||||
|
content = final_msg.get('content', '')
|
||||||
|
|
||||||
|
# Positive completion indicators
|
||||||
|
completion_phrases = [
|
||||||
|
'done', 'completed', 'success', 'finished', 'created',
|
||||||
|
'implemented', 'fixed', 'resolved', 'saved to', 'here is',
|
||||||
|
'here are', 'the result', 'output:', 'file:', 'pr:', 'pull request'
|
||||||
|
]
|
||||||
|
|
||||||
|
for phrase in completion_phrases:
|
||||||
|
if phrase in content.lower():
|
||||||
|
return True
|
||||||
|
|
||||||
|
# Check if there were many errors
|
||||||
|
if analysis['total_errors'] > analysis['total_api_calls'] * 0.3:
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Check for explicit failure
|
||||||
|
failure_phrases = ['failed', 'unable to', 'could not', 'error:', 'sorry, i cannot']
|
||||||
|
for phrase in failure_phrases:
|
||||||
|
if phrase in content.lower()[:200]:
|
||||||
|
return False
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
def _assess_tool_efficiency(self, analysis: Dict) -> int:
|
||||||
|
"""Rate tool call efficiency on 1-5 scale."""
|
||||||
|
tool_calls = analysis['total_api_calls']
|
||||||
|
errors = analysis['total_errors']
|
||||||
|
|
||||||
|
if tool_calls == 0:
|
||||||
|
return 3 # Neutral if no tool calls
|
||||||
|
|
||||||
|
error_rate = errors / tool_calls
|
||||||
|
|
||||||
|
# Score based on error rate and total calls
|
||||||
|
if error_rate == 0 and tool_calls <= 10:
|
||||||
|
return 5 # Perfect efficiency
|
||||||
|
elif error_rate <= 0.1 and tool_calls <= 15:
|
||||||
|
return 4 # Good efficiency
|
||||||
|
elif error_rate <= 0.25 and tool_calls <= 25:
|
||||||
|
return 3 # Average
|
||||||
|
elif error_rate <= 0.4:
|
||||||
|
return 2 # Poor
|
||||||
|
else:
|
||||||
|
return 1 # Very poor
|
||||||
|
|
||||||
|
def _assess_response_quality(self, messages: List[Dict], analysis: Dict) -> int:
|
||||||
|
"""Rate response quality on 1-5 scale."""
|
||||||
|
final_msg = analysis.get('final_assistant_msg')
|
||||||
|
if not final_msg:
|
||||||
|
return 1
|
||||||
|
|
||||||
|
content = final_msg.get('content', '')
|
||||||
|
content_len = len(content)
|
||||||
|
|
||||||
|
# Quality indicators
|
||||||
|
score = 3 # Start at average
|
||||||
|
|
||||||
|
# Length heuristics
|
||||||
|
if content_len > 500:
|
||||||
|
score += 1
|
||||||
|
if content_len > 1000:
|
||||||
|
score += 1
|
||||||
|
|
||||||
|
# Code blocks indicate substantive response
|
||||||
|
if '```' in content:
|
||||||
|
score += 1
|
||||||
|
|
||||||
|
# Links/references indicate thoroughness
|
||||||
|
if 'http' in content or 'see ' in content.lower():
|
||||||
|
score += 0.5
|
||||||
|
|
||||||
|
# Error penalties
|
||||||
|
if analysis['had_repeated_errors']:
|
||||||
|
score -= 1
|
||||||
|
if analysis['total_errors'] > 5:
|
||||||
|
score -= 1
|
||||||
|
|
||||||
|
# Loop risk is severe
|
||||||
|
if analysis['had_infinite_loop_risk']:
|
||||||
|
score -= 2
|
||||||
|
|
||||||
|
return max(1, min(5, int(score)))
|
||||||
|
|
||||||
|
def _assess_error_recovery(self, messages: List[Dict], analysis: Dict) -> bool:
    """Return True when the session's errors did not derail the task.

    Trivially True when no errors occurred. Otherwise, recovery is
    inferred either from apparent task completion or from the absence
    of repeated errors.
    """
    if analysis['total_errors'] == 0:
        # No errors means nothing needed recovering from.
        return True

    # Completing the task despite errors counts as recovery.
    if self._assess_task_completion(messages, analysis):
        return True

    # Isolated (non-repeated) errors are treated as recovered.
    return not analysis['had_repeated_errors']
|
||||||
|
|
||||||
|
def save_grade(self, grade: SessionGrade) -> bool:
    """Persist a SessionGrade to the grades database.

    Uses INSERT OR REPLACE, keyed on the session_grades table's unique
    session_id, so regrading a session overwrites its previous row.

    Args:
        grade: Fully-populated SessionGrade to store. List-valued
            fields (error_types, tools_with_errors) are expected to be
            pre-serialized strings -- TODO confirm against the grader.

    Returns:
        True on success; False on a sqlite3.Error (printed, not raised).
    """
    try:
        with sqlite3.connect(self.grades_db_path) as conn:
            # Booleans are stored as 0/1 integers for SQLite.
            conn.execute("""
                INSERT OR REPLACE INTO session_grades (
                    session_id, session_file, graded_at,
                    task_completed, tool_calls_efficient, response_quality,
                    errors_recovered, total_api_calls, model, platform,
                    session_start, duration_seconds, task_summary,
                    total_errors, error_types, tools_with_errors,
                    had_repeated_errors, had_infinite_loop_risk, had_user_clarification
                ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                grade.session_id, grade.session_file, grade.graded_at,
                int(grade.task_completed), grade.tool_calls_efficient,
                grade.response_quality, int(grade.errors_recovered),
                grade.total_api_calls, grade.model, grade.platform,
                grade.session_start, grade.duration_seconds, grade.task_summary,
                grade.total_errors, grade.error_types, grade.tools_with_errors,
                int(grade.had_repeated_errors), int(grade.had_infinite_loop_risk),
                int(grade.had_user_clarification)
            ))
            # Explicit commit; the connection context manager also
            # commits on clean exit, so this is belt-and-braces.
            conn.commit()
            return True
    except sqlite3.Error as e:
        print(f"Database error saving grade: {e}")
        return False
|
||||||
|
|
||||||
|
def grade_latest_sessions(self, n: int = 10) -> List[SessionGrade]:
    """Grade the most recent sessions that have no grade yet.

    Scans the sessions directory for session_*.json files (newest
    first by mtime), skips any session already present in the grades
    database, grades the remainder, and persists each grade.

    Args:
        n: Maximum number of recent session files to consider.

    Returns:
        The grades that were both computed AND successfully saved.
    """
    # Newest session files first; exclude the aggregate sessions.json.
    session_files = sorted(
        [f for f in self.sessions_dir.glob("session_*.json")
         if not f.name.endswith("sessions.json")],
        key=lambda x: x.stat().st_mtime,
        reverse=True
    )

    # Collect IDs that already have a grade; tolerate a missing/empty DB.
    graded_ids = set()
    try:
        with sqlite3.connect(self.grades_db_path) as conn:
            cursor = conn.execute("SELECT session_id FROM session_grades")
            graded_ids = {row[0] for row in cursor.fetchall()}
    except sqlite3.Error:
        pass

    # Grade only the ungraded files among the n most recent.
    grades = []
    for sf in session_files[:n]:
        # Session ID is the filename stem minus the 'session_' prefix.
        session_id = sf.stem.replace('session_', '')
        if session_id in graded_ids:
            continue

        grade = self.grade_session_file(sf)
        if grade:
            # Only report grades that actually made it into the DB.
            if self.save_grade(grade):
                grades.append(grade)

    return grades
|
||||||
|
|
||||||
|
def identify_patterns(self, days: int = 7) -> Dict[str, Any]:
    """Aggregate graded sessions from the last `days` days into stats.

    Returns a dict with overall averages/rates plus per-tool error
    counts, per-error-type counts, and average response quality per
    inferred task type.
    """
    since = (datetime.now() - timedelta(days=days)).isoformat()

    with sqlite3.connect(self.grades_db_path) as conn:
        # Overall stats
        cursor = conn.execute("""
            SELECT
                COUNT(*),
                AVG(tool_calls_efficient),
                AVG(response_quality),
                AVG(CASE WHEN task_completed THEN 1.0 ELSE 0.0 END),
                AVG(CASE WHEN errors_recovered THEN 1.0 ELSE 0.0 END)
            FROM session_grades
            WHERE graded_at > ?
        """, (since,))

        row = cursor.fetchone()
        # AVG() yields NULL on an empty window, hence the `or 0` guards.
        stats = {
            'total_sessions': row[0] or 0,
            'avg_tool_efficiency': round(row[1] or 0, 2),
            'avg_response_quality': round(row[2] or 0, 2),
            'completion_rate': round((row[3] or 0) * 100, 1),
            'error_recovery_rate': round((row[4] or 0) * 100, 1),
        }

        # Tool error analysis: tools_with_errors holds a JSON array, so
        # rows are grouped per distinct array string and fanned out here.
        cursor = conn.execute("""
            SELECT tools_with_errors, COUNT(*)
            FROM session_grades
            WHERE graded_at > ? AND tools_with_errors != '[]'
            GROUP BY tools_with_errors
        """, (since,))

        tool_errors = defaultdict(int)
        for row in cursor.fetchall():
            tools = json.loads(row[0])
            for tool in tools:
                tool_errors[tool] += row[1]

        # Error type analysis (same JSON-array fan-out as above).
        cursor = conn.execute("""
            SELECT error_types, COUNT(*)
            FROM session_grades
            WHERE graded_at > ? AND error_types != '[]'
            GROUP BY error_types
        """, (since,))

        error_types = defaultdict(int)
        for row in cursor.fetchall():
            types = json.loads(row[0])
            for et in types:
                error_types[et] += row[1]

        # Task type performance (infer from task_summary)
        cursor = conn.execute("""
            SELECT task_summary, response_quality
            FROM session_grades
            WHERE graded_at > ?
        """, (since,))

        task_scores = defaultdict(list)
        for row in cursor.fetchall():
            summary = row[0] or ''
            score = row[1]
            task_type = self._infer_task_type(summary)
            task_scores[task_type].append(score)

        # Collapse per-type score lists into rounded averages.
        avg_task_scores = {
            tt: round(sum(scores) / len(scores), 2)
            for tt, scores in task_scores.items()
        }

        return {
            **stats,
            'tool_error_counts': dict(tool_errors),
            'error_type_counts': dict(error_types),
            'task_type_scores': avg_task_scores,
        }
|
||||||
|
|
||||||
|
def _infer_task_type(self, summary: str) -> str:
    """Classify a task summary into a known task type.

    Returns the first TASK_PATTERNS category whose regex matches the
    summary, falling back to 'general'.
    """
    return next(
        (kind for kind, rx in self.TASK_PATTERNS.items() if rx.search(summary)),
        'general',
    )
|
||||||
|
|
||||||
|
def generate_weekly_report(self) -> WeeklyReport:
    """Build, persist, and return this week's improvement report.

    Aggregates the last 7 days of grades, picks the worst task types,
    most error-prone tools, and most common error patterns, derives
    suggestions, and upserts the serialized report into the
    weekly_reports table (keyed on week_start).
    """
    # Calculate week boundaries (Monday to Sunday)
    today = datetime.now()
    monday = today - timedelta(days=today.weekday())
    sunday = monday + timedelta(days=6)

    patterns = self.identify_patterns(days=7)

    # Three lowest-scoring task types (ascending by score).
    task_scores = patterns.get('task_type_scores', {})
    worst_tasks = sorted(task_scores.items(), key=lambda x: x[1])[:3]

    # Three tools with the most recorded errors.
    tool_errors = patterns.get('tool_error_counts', {})
    worst_tools = sorted(tool_errors.items(), key=lambda x: x[1], reverse=True)[:3]

    # Three most frequent error categories.
    error_types = patterns.get('error_type_counts', {})
    common_errors = sorted(error_types.items(), key=lambda x: x[1], reverse=True)[:3]

    # Turn the aggregates into actionable suggestion strings.
    suggestions = self._generate_suggestions(patterns, worst_tasks, worst_tools, common_errors)

    report = WeeklyReport(
        week_start=monday.strftime('%Y-%m-%d'),
        week_end=sunday.strftime('%Y-%m-%d'),
        total_sessions=patterns['total_sessions'],
        avg_tool_efficiency=patterns['avg_tool_efficiency'],
        avg_response_quality=patterns['avg_response_quality'],
        completion_rate=patterns['completion_rate'],
        error_recovery_rate=patterns['error_recovery_rate'],
        worst_task_types=worst_tasks,
        most_error_prone_tools=worst_tools,
        common_error_patterns=common_errors,
        improvement_suggestions=suggestions
    )

    # Save report: one row per week thanks to INSERT OR REPLACE on
    # the unique week_start column.
    with sqlite3.connect(self.grades_db_path) as conn:
        conn.execute("""
            INSERT OR REPLACE INTO weekly_reports
            (week_start, week_end, generated_at, report_json)
            VALUES (?, ?, ?, ?)
        """, (
            report.week_start,
            report.week_end,
            datetime.now().isoformat(),
            json.dumps(asdict(report))
        ))
        conn.commit()

    return report
|
||||||
|
|
||||||
|
def _generate_suggestions(self, patterns: Dict, worst_tasks: List,
|
||||||
|
worst_tools: List, common_errors: List) -> List[str]:
|
||||||
|
"""Generate improvement suggestions based on patterns."""
|
||||||
|
suggestions = []
|
||||||
|
|
||||||
|
if patterns['completion_rate'] < 70:
|
||||||
|
suggestions.append("Task completion rate is below 70%. Consider adding pre-task planning steps.")
|
||||||
|
|
||||||
|
if patterns['avg_tool_efficiency'] < 3:
|
||||||
|
suggestions.append("Tool efficiency is low. Review error recovery patterns and add retry logic.")
|
||||||
|
|
||||||
|
if worst_tasks:
|
||||||
|
task_names = ', '.join([t[0] for t in worst_tasks])
|
||||||
|
suggestions.append(f"Lowest scoring task types: {task_names}. Consider skill enhancement.")
|
||||||
|
|
||||||
|
if worst_tools:
|
||||||
|
tool_names = ', '.join([t[0] for t in worst_tools])
|
||||||
|
suggestions.append(f"Most error-prone tools: {tool_names}. Review usage patterns.")
|
||||||
|
|
||||||
|
if common_errors:
|
||||||
|
error_names = ', '.join([e[0] for e in common_errors])
|
||||||
|
suggestions.append(f"Common error types: {error_names}. Add targeted error handling.")
|
||||||
|
|
||||||
|
if patterns['error_recovery_rate'] < 80:
|
||||||
|
suggestions.append("Error recovery rate needs improvement. Implement better fallback strategies.")
|
||||||
|
|
||||||
|
if not suggestions:
|
||||||
|
suggestions.append("Performance is stable. Focus on expanding task coverage.")
|
||||||
|
|
||||||
|
return suggestions
|
||||||
|
|
||||||
|
def get_grades_summary(self, days: int = 30) -> str:
    """Render recent grade statistics as a multi-line text summary."""
    patterns = self.identify_patterns(days=days)

    lines = [
        f"=== Session Grades Summary (Last {days} days) ===",
        "",
        f"Total Sessions Graded: {patterns['total_sessions']}",
        f"Average Tool Efficiency: {patterns['avg_tool_efficiency']}/5",
        f"Average Response Quality: {patterns['avg_response_quality']}/5",
        f"Task Completion Rate: {patterns['completion_rate']}%",
        f"Error Recovery Rate: {patterns['error_recovery_rate']}%",
        "",
    ]

    # Per-task-type scores, best first.
    task_scores = patterns.get('task_type_scores')
    if task_scores:
        lines.append("Task Type Performance:")
        lines.extend(
            f"  - {task}: {score}/5"
            for task, score in sorted(task_scores.items(), key=lambda kv: -kv[1])
        )
        lines.append("")

    # Per-tool error counts, most errors first.
    tool_counts = patterns.get('tool_error_counts')
    if tool_counts:
        lines.append("Tool Error Counts:")
        lines.extend(
            f"  - {tool}: {count}"
            for tool, count in sorted(tool_counts.items(), key=lambda kv: -kv[1])
        )
        lines.append("")

    return '\n'.join(lines)
|
||||||
|
|
||||||
|
|
||||||
|
def main():
    """CLI entry point for self-grading.

    Modes: grade one file (-f), grade the latest N ungraded sessions
    (-g), generate the weekly report (-r), or -- by default -- print a
    summary for the last --days days. -f takes precedence over -g,
    which takes precedence over -r.
    """
    import argparse

    parser = argparse.ArgumentParser(description='Grade Hermes sessions')
    parser.add_argument('--grade-latest', '-g', type=int, metavar='N',
                        help='Grade the last N ungraded sessions')
    parser.add_argument('--summary', '-s', action='store_true',
                        help='Show summary of recent grades')
    parser.add_argument('--days', '-d', type=int, default=7,
                        help='Number of days for summary (default: 7)')
    parser.add_argument('--report', '-r', action='store_true',
                        help='Generate weekly report')
    parser.add_argument('--file', '-f', type=Path,
                        help='Grade a specific session file')

    args = parser.parse_args()

    grader = SelfGrader()

    if args.file:
        # Single-file mode: grade, persist, and print the key scores.
        grade = grader.grade_session_file(args.file)
        if grade:
            grader.save_grade(grade)
            print(f"Graded session: {grade.session_id}")
            print(f"  Task completed: {grade.task_completed}")
            print(f"  Tool efficiency: {grade.tool_calls_efficient}/5")
            print(f"  Response quality: {grade.response_quality}/5")
            print(f"  Errors recovered: {grade.errors_recovered}")
        else:
            print("Failed to grade session")

    elif args.grade_latest:
        # Batch mode: grade recent ungraded sessions and list results.
        grades = grader.grade_latest_sessions(args.grade_latest)
        print(f"Graded {len(grades)} sessions")
        for g in grades:
            print(f"  - {g.session_id}: quality={g.response_quality}/5, "
                  f"completed={g.task_completed}")

    elif args.report:
        # Weekly report mode: aggregate and print headline metrics.
        report = grader.generate_weekly_report()
        print(f"\n=== Weekly Report ({report.week_start} to {report.week_end}) ===")
        print(f"Total Sessions: {report.total_sessions}")
        print(f"Avg Tool Efficiency: {report.avg_tool_efficiency}/5")
        print(f"Avg Response Quality: {report.avg_response_quality}/5")
        print(f"Completion Rate: {report.completion_rate}%")
        print(f"Error Recovery Rate: {report.error_recovery_rate}%")
        print("\nSuggestions:")
        for s in report.improvement_suggestions:
            print(f"  - {s}")

    else:
        # Default: human-readable summary for the requested window.
        print(grader.get_grades_summary(days=args.days))
|
||||||
|
|
||||||
|
|
||||||
|
# Allow the module to be run directly as a CLI tool.
if __name__ == '__main__':
    main()
|
||||||
453
uniwizard/self_grader_design.md
Normal file
453
uniwizard/self_grader_design.md
Normal file
@@ -0,0 +1,453 @@
|
|||||||
|
# Self-Grader Design Document
|
||||||
|
|
||||||
|
**Issue:** timmy-home #89 - "Build self-improvement loop: Timmy grades and learns from his own outputs"
|
||||||
|
|
||||||
|
**Related Issues:** #88 (Adaptive Routing), #98 (Quality Scoring)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Overview
|
||||||
|
|
||||||
|
The Self-Grader module enables Timmy to automatically evaluate his own task outputs, identify patterns in failures, and generate actionable improvement insights. This creates a closed feedback loop for continuous self-improvement.
|
||||||
|
|
||||||
|
### Goals
|
||||||
|
- Automatically grade completed sessions on multiple quality dimensions
|
||||||
|
- Identify recurring error patterns and their root causes
|
||||||
|
- Track performance trends over time
|
||||||
|
- Generate actionable weekly improvement reports
|
||||||
|
- Feed insights into adaptive routing decisions
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────────┐
|
||||||
|
│ Self-Grader Module │
|
||||||
|
├─────────────────────────────────────────────────────────────┤
|
||||||
|
│ │
|
||||||
|
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
|
||||||
|
│ │ Parser │───▶│ Analyzer │───▶│ Grader │ │
|
||||||
|
│ │ │ │ │ │ │ │
|
||||||
|
│ │ Reads session│ │ Extracts │ │ Scores on 5 │ │
|
||||||
|
│ │ JSON files │ │ metrics │ │ dimensions │ │
|
||||||
|
│ └──────────────┘ └──────────────┘ └──────────────┘ │
|
||||||
|
│ │ │ │ │
|
||||||
|
│ ▼ ▼ ▼ │
|
||||||
|
│ ┌──────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ SQLite Database Layer │ │
|
||||||
|
│ │ • session_grades table (individual scores) │ │
|
||||||
|
│ │ • weekly_reports table (aggregated insights) │ │
|
||||||
|
│ └──────────────────────────────────────────────────────┘ │
|
||||||
|
│ │ │
|
||||||
|
│ ▼ │
|
||||||
|
│ ┌──────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Pattern Identification │ │
|
||||||
|
│ │ • Task type performance analysis │ │
|
||||||
|
│ │ • Tool error frequency tracking │ │
|
||||||
|
│ │ • Error classification and clustering │ │
|
||||||
|
│ └──────────────────────────────────────────────────────┘ │
|
||||||
|
│ │ │
|
||||||
|
│ ▼ │
|
||||||
|
│ ┌──────────────────────────────────────────────────────┐ │
|
||||||
|
│ │ Report Generator │ │
|
||||||
|
│ │ • Weekly summary with trends │ │
|
||||||
|
│ │ • Improvement suggestions │ │
|
||||||
|
│ │ • Performance alerts │ │
|
||||||
|
│ └──────────────────────────────────────────────────────┘ │
|
||||||
|
│ │
|
||||||
|
└─────────────────────────────────────────────────────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────────────────────────────────────────────────────┐
|
||||||
|
│ Downstream Consumers │
|
||||||
|
│ • Adaptive Routing (#88) - route based on task type │
|
||||||
|
│ • Quality Scoring (#98) - external quality validation │
|
||||||
|
│ • Skill Recommendations - identify skill gaps │
|
||||||
|
│ • Alert System - notify on quality degradation │
|
||||||
|
└─────────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Grading Dimensions
|
||||||
|
|
||||||
|
### 3.1 Core Metrics (1-5 scale where applicable)
|
||||||
|
|
||||||
|
| Metric | Type | Description |
|
||||||
|
|--------|------|-------------|
|
||||||
|
| `task_completed` | boolean | Whether the task appears to have been completed successfully |
|
||||||
|
| `tool_calls_efficient` | int (1-5) | Efficiency of tool usage (error rate, call count) |
|
||||||
|
| `response_quality` | int (1-5) | Overall quality of final response |
|
||||||
|
| `errors_recovered` | boolean | Whether errors were successfully recovered from |
|
||||||
|
| `total_api_calls` | int | Total number of API/assistant calls made |
|
||||||
|
|
||||||
|
### 3.2 Derived Metrics
|
||||||
|
|
||||||
|
| Metric | Description |
|
||||||
|
|--------|-------------|
|
||||||
|
| `total_errors` | Count of tool errors detected |
|
||||||
|
| `error_types` | Categorized error types (JSON list) |
|
||||||
|
| `tools_with_errors` | Tools that generated errors |
|
||||||
|
| `had_repeated_errors` | Flag for 3+ consecutive errors |
|
||||||
|
| `had_infinite_loop_risk` | Flag for 5+ consecutive errors or >50 calls |
|
||||||
|
| `had_user_clarification` | Whether clarification was requested |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Error Classification
|
||||||
|
|
||||||
|
The system classifies errors into categories for pattern analysis:
|
||||||
|
|
||||||
|
| Category | Pattern | Example |
|
||||||
|
|----------|---------|---------|
|
||||||
|
| `file_not_found` | File/path errors | "No such file or directory" |
|
||||||
|
| `permission_denied` | Access errors | "Permission denied" |
|
||||||
|
| `timeout` | Time limit exceeded | "Request timed out" |
|
||||||
|
| `api_error` | External API failures | "Rate limit exceeded" |
|
||||||
|
| `syntax_error` | Code/parsing errors | "Invalid syntax" |
|
||||||
|
| `command_failed` | Command execution | "exit_code": 1 |
|
||||||
|
| `network_error` | Connectivity issues | "Connection refused" |
|
||||||
|
| `tool_not_found` | Tool resolution | "Unknown tool" |
|
||||||
|
| `unknown` | Unclassified | Any other error |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Task Type Inference
|
||||||
|
|
||||||
|
Sessions are categorized by task type for comparative analysis:
|
||||||
|
|
||||||
|
| Task Type | Pattern |
|
||||||
|
|-----------|---------|
|
||||||
|
| `code_review` | "review", "code review", "PR" |
|
||||||
|
| `debugging` | "debug", "fix", "troubleshoot" |
|
||||||
|
| `feature_impl` | "implement", "add feature", "build" |
|
||||||
|
| `refactoring` | "refactor", "clean up", "reorganize" |
|
||||||
|
| `documentation` | "document", "readme", "docstring" |
|
||||||
|
| `testing` | "test", "pytest", "unit test" |
|
||||||
|
| `research` | "research", "investigate", "look up" |
|
||||||
|
| `deployment` | "deploy", "release", "publish" |
|
||||||
|
| `data_analysis` | "analyze data", "process file", "parse" |
|
||||||
|
| `infrastructure` | "server", "docker", "kubernetes" |
|
||||||
|
| `general` | Default catch-all |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Database Schema
|
||||||
|
|
||||||
|
### 6.1 session_grades Table
|
||||||
|
|
||||||
|
```sql
|
||||||
|
CREATE TABLE session_grades (
|
||||||
|
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||||
|
session_id TEXT UNIQUE NOT NULL,
|
||||||
|
session_file TEXT NOT NULL,
|
||||||
|
graded_at TEXT NOT NULL,
|
||||||
|
|
||||||
|
-- Core metrics
|
||||||
|
task_completed INTEGER NOT NULL,
|
||||||
|
tool_calls_efficient INTEGER NOT NULL,
|
||||||
|
response_quality INTEGER NOT NULL,
|
||||||
|
errors_recovered INTEGER NOT NULL,
|
||||||
|
total_api_calls INTEGER NOT NULL,
|
||||||
|
|
||||||
|
-- Metadata
|
||||||
|
model TEXT,
|
||||||
|
platform TEXT,
|
||||||
|
session_start TEXT,
|
||||||
|
duration_seconds REAL,
|
||||||
|
task_summary TEXT,
|
||||||
|
|
||||||
|
-- Error analysis
|
||||||
|
total_errors INTEGER NOT NULL,
|
||||||
|
error_types TEXT, -- JSON array
|
||||||
|
tools_with_errors TEXT, -- JSON array
|
||||||
|
|
||||||
|
-- Pattern flags
|
||||||
|
had_repeated_errors INTEGER NOT NULL,
|
||||||
|
had_infinite_loop_risk INTEGER NOT NULL,
|
||||||
|
had_user_clarification INTEGER NOT NULL
|
||||||
|
);
|
||||||
|
```
|
||||||
|
|
||||||
|
### 6.2 weekly_reports Table
|
||||||
|
|
||||||
|
```sql
|
||||||
|
CREATE TABLE weekly_reports (
|
||||||
|
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||||
|
week_start TEXT UNIQUE NOT NULL,
|
||||||
|
week_end TEXT NOT NULL,
|
||||||
|
generated_at TEXT NOT NULL,
|
||||||
|
report_json TEXT NOT NULL -- Serialized WeeklyReport
|
||||||
|
);
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. Scoring Algorithms
|
||||||
|
|
||||||
|
### 7.1 Task Completion Detection
|
||||||
|
|
||||||
|
Positive indicators:
|
||||||
|
- Final message contains completion phrases: "done", "completed", "success", "finished"
|
||||||
|
- References to created outputs: "saved to", "here is", "output:"
|
||||||
|
- Low error rate relative to total calls
|
||||||
|
|
||||||
|
Negative indicators:
|
||||||
|
- Explicit failure phrases: "failed", "unable to", "could not"
|
||||||
|
- Error rate > 30% of total calls
|
||||||
|
- Empty or very short final response
|
||||||
|
|
||||||
|
### 7.2 Tool Efficiency Scoring
|
||||||
|
|
||||||
|
```python
|
||||||
|
tool_calls = total_api_calls
error_rate = total_errors / tool_calls
|
||||||
|
|
||||||
|
if error_rate == 0 and tool_calls <= 10:
|
||||||
|
score = 5 # Perfect
|
||||||
|
elif error_rate <= 0.1 and tool_calls <= 15:
|
||||||
|
score = 4 # Good
|
||||||
|
elif error_rate <= 0.25 and tool_calls <= 25:
|
||||||
|
score = 3 # Average
|
||||||
|
elif error_rate <= 0.4:
|
||||||
|
score = 2 # Poor
|
||||||
|
else:
|
||||||
|
score = 1 # Very poor
|
||||||
|
```
|
||||||
|
|
||||||
|
### 7.3 Response Quality Scoring
|
||||||
|
|
||||||
|
Base score: 3 (average)
|
||||||
|
|
||||||
|
Additions:
|
||||||
|
- Content length > 500 chars: +1
|
||||||
|
- Content length > 1000 chars: +1
|
||||||
|
- Contains code blocks: +1
|
||||||
|
- Contains links/references: +0.5
|
||||||
|
|
||||||
|
Penalties:
|
||||||
|
- Repeated errors: -1
|
||||||
|
- Total errors > 5: -1
|
||||||
|
- Infinite loop risk: -2
|
||||||
|
|
||||||
|
Range clamped to 1-5.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Pattern Identification
|
||||||
|
|
||||||
|
### 8.1 Per-Task-Type Analysis
|
||||||
|
|
||||||
|
Tracks average scores per task type to identify weak areas:
|
||||||
|
|
||||||
|
```python
|
||||||
|
task_scores = {
|
||||||
|
'code_review': 4.2,
|
||||||
|
'debugging': 2.8, # <-- Needs attention
|
||||||
|
'feature_impl': 3.5,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### 8.2 Tool Error Frequency
|
||||||
|
|
||||||
|
Identifies which tools are most error-prone:
|
||||||
|
|
||||||
|
```python
|
||||||
|
tool_errors = {
|
||||||
|
'browser_navigate': 15, # <-- High error rate
|
||||||
|
'terminal': 5,
|
||||||
|
'file_read': 2,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### 8.3 Error Pattern Clustering
|
||||||
|
|
||||||
|
Groups errors by type to identify systemic issues:
|
||||||
|
|
||||||
|
```python
|
||||||
|
error_types = {
|
||||||
|
'file_not_found': 12, # <-- Need better path handling
|
||||||
|
'timeout': 8,
|
||||||
|
'api_error': 3,
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Weekly Report Generation
|
||||||
|
|
||||||
|
### 9.1 Report Contents
|
||||||
|
|
||||||
|
1. **Summary Statistics**
|
||||||
|
- Total sessions graded
|
||||||
|
- Average tool efficiency
|
||||||
|
- Average response quality
|
||||||
|
- Task completion rate
|
||||||
|
- Error recovery rate
|
||||||
|
|
||||||
|
2. **Problem Areas**
|
||||||
|
- Lowest scoring task types
|
||||||
|
- Most error-prone tools
|
||||||
|
- Common error patterns
|
||||||
|
|
||||||
|
3. **Improvement Suggestions**
|
||||||
|
- Actionable recommendations based on patterns
|
||||||
|
- Skill gap identification
|
||||||
|
- Process improvement tips
|
||||||
|
|
||||||
|
### 9.2 Suggestion Generation Rules
|
||||||
|
|
||||||
|
| Condition | Suggestion |
|
||||||
|
|-----------|------------|
|
||||||
|
| completion_rate < 70% | "Add pre-task planning steps" |
|
||||||
|
| avg_tool_efficiency < 3 | "Review error recovery patterns" |
|
||||||
|
| error_recovery_rate < 80% | "Implement better fallback strategies" |
|
||||||
|
| Specific task type low | "Consider skill enhancement for {task}" |
|
||||||
|
| Specific tool high errors | "Review usage patterns for {tool}" |
|
||||||
|
| Specific error common | "Add targeted error handling for {error}" |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 10. Integration Points
|
||||||
|
|
||||||
|
### 10.1 With Adaptive Routing (#88)
|
||||||
|
|
||||||
|
The grader feeds task-type performance data to the router:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Router uses grader insights
|
||||||
|
if task_type == 'debugging' and grader.get_task_score('debugging') < 3:
|
||||||
|
# Route to more capable model for debugging tasks
|
||||||
|
model = 'claude-opus-4'
|
||||||
|
```
|
||||||
|
|
||||||
|
### 10.2 With Quality Scoring (#98)
|
||||||
|
|
||||||
|
Grader scores feed into external quality validation:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Quality scorer validates grader accuracy
|
||||||
|
external_score = quality_scorer.validate(session, grader_score)
|
||||||
|
discrepancy = abs(external_score - grader_score)
|
||||||
|
if discrepancy > threshold:
|
||||||
|
grader.calibrate() # Adjust scoring algorithms
|
||||||
|
```
|
||||||
|
|
||||||
|
### 10.3 With Skill System
|
||||||
|
|
||||||
|
Identifies skills that could improve low-scoring areas:
|
||||||
|
|
||||||
|
```python
|
||||||
|
if grader.get_task_score('debugging') < 3:
|
||||||
|
recommend_skill('systematic-debugging')
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 11. Usage
|
||||||
|
|
||||||
|
### 11.1 Command Line
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Grade latest 10 ungraded sessions
|
||||||
|
python self_grader.py -g 10
|
||||||
|
|
||||||
|
# Show summary of last 7 days
|
||||||
|
python self_grader.py -s
|
||||||
|
|
||||||
|
# Show summary of last 30 days
|
||||||
|
python self_grader.py -s -d 30
|
||||||
|
|
||||||
|
# Generate weekly report
|
||||||
|
python self_grader.py -r
|
||||||
|
|
||||||
|
# Grade specific session file
|
||||||
|
python self_grader.py -f /path/to/session.json
|
||||||
|
```
|
||||||
|
|
||||||
|
### 11.2 Python API
|
||||||
|
|
||||||
|
```python
|
||||||
|
from self_grader import SelfGrader
|
||||||
|
|
||||||
|
grader = SelfGrader()
|
||||||
|
|
||||||
|
# Grade latest sessions
|
||||||
|
grades = grader.grade_latest_sessions(n=10)
|
||||||
|
|
||||||
|
# Get pattern insights
|
||||||
|
patterns = grader.identify_patterns(days=7)
|
||||||
|
|
||||||
|
# Generate report
|
||||||
|
report = grader.generate_weekly_report()
|
||||||
|
|
||||||
|
# Get human-readable summary
|
||||||
|
print(grader.get_grades_summary(days=7))
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 12. Testing
|
||||||
|
|
||||||
|
Comprehensive test suite covers:
|
||||||
|
|
||||||
|
1. **Unit Tests**
|
||||||
|
- Error detection and classification
|
||||||
|
- Scoring algorithms
|
||||||
|
- Task type inference
|
||||||
|
|
||||||
|
2. **Integration Tests**
|
||||||
|
- Full session grading pipeline
|
||||||
|
- Database operations
|
||||||
|
- Report generation
|
||||||
|
|
||||||
|
3. **Edge Cases**
|
||||||
|
- Empty sessions
|
||||||
|
- Sessions with infinite loops
|
||||||
|
- Malformed session files
|
||||||
|
|
||||||
|
Run tests:
|
||||||
|
```bash
|
||||||
|
python -m pytest test_self_grader.py -v
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 13. Future Enhancements
|
||||||
|
|
||||||
|
1. **Machine Learning Integration**
|
||||||
|
- Train models to predict session success
|
||||||
|
- Learn optimal tool sequences
|
||||||
|
- Predict error likelihood
|
||||||
|
|
||||||
|
2. **Human-in-the-Loop Validation**
|
||||||
|
- Allow user override of grades
|
||||||
|
- Collect explicit feedback
|
||||||
|
- Calibrate scoring with human judgments
|
||||||
|
|
||||||
|
3. **Real-time Monitoring**
|
||||||
|
- Grade sessions as they complete
|
||||||
|
- Alert on quality degradation
|
||||||
|
- Live dashboard of metrics
|
||||||
|
|
||||||
|
4. **Cross-Session Learning**
|
||||||
|
- Identify recurring issues across similar tasks
|
||||||
|
- Suggest skill improvements
|
||||||
|
- Recommend tool alternatives
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 14. Files
|
||||||
|
|
||||||
|
| File | Description |
|
||||||
|
|------|-------------|
|
||||||
|
| `self_grader.py` | Main module with SelfGrader class |
|
||||||
|
| `test_self_grader.py` | Comprehensive test suite |
|
||||||
|
| `self_grader_design.md` | This design document |
|
||||||
|
| `~/.timmy/uniwizard/session_grades.db` | SQLite database (created at runtime) |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Document Version: 1.0*
|
||||||
|
*Created: 2026-03-30*
|
||||||
|
*Author: Timmy (UniWizard)*
|
||||||
655
uniwizard/task_classifier.py
Normal file
655
uniwizard/task_classifier.py
Normal file
@@ -0,0 +1,655 @@
|
|||||||
|
"""
|
||||||
|
Enhanced Task Classifier for Uniwizard
|
||||||
|
|
||||||
|
Classifies incoming prompts into task types and maps them to ranked backend preferences.
|
||||||
|
Integrates with the 7-backend fallback chain defined in config.yaml.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import re
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from enum import Enum
|
||||||
|
from typing import Any, Dict, List, Optional, Set, Tuple
|
||||||
|
|
||||||
|
|
||||||
|
class TaskType(Enum):
    """Classification categories for incoming prompts.

    NOTE: declaration order matters downstream — ``_classify_task_type``
    builds its score dict by iterating this enum and a tie on the highest
    score goes to the earlier member.
    """
    CODE = "code"            # programming, debugging, infra work
    REASONING = "reasoning"  # analysis, evaluation, multi-step thinking
    RESEARCH = "research"    # literature / web / data investigation
    CREATIVE = "creative"    # writing, art, brainstorming
    FAST_OPS = "fast_ops"    # short, simple queries and quick commands
    TOOL_USE = "tool_use"    # explicit tool / shell / agent invocation
    UNKNOWN = "unknown"      # fallback when no signal scores above zero
|
||||||
|
|
||||||
|
|
||||||
|
class ComplexityLevel(Enum):
    """Complexity tiers for prompt analysis.

    Assigned by ``TaskClassifier._assess_complexity`` from length, URL and
    code-block metrics compared against COMPLEXITY_THRESHOLDS.
    """
    LOW = "low"        # short, simple prompt
    MEDIUM = "medium"  # two metrics above "low", or one above "medium"
    HIGH = "high"      # two metrics above "medium", or >2 code blocks
|
||||||
|
|
||||||
|
|
||||||
|
# Backend identifiers (match fallback_providers chain order)
BACKEND_ANTHROPIC = "anthropic"
BACKEND_OPENAI_CODEX = "openai-codex"
BACKEND_GEMINI = "gemini"
BACKEND_GROQ = "groq"
BACKEND_GROK = "grok"
BACKEND_KIMI = "kimi-coding"
BACKEND_OPENROUTER = "openrouter"

# Full 7-backend roster in fallback-chain order. Every TASK_BACKEND_MAP entry
# re-ranks (never filters) this same set.
# NOTE(review): ALL_BACKENDS itself is not referenced elsewhere in this
# module's visible code — presumably consumed by the router; confirm.
ALL_BACKENDS = [
    BACKEND_ANTHROPIC,
    BACKEND_OPENAI_CODEX,
    BACKEND_GEMINI,
    BACKEND_GROQ,
    BACKEND_GROK,
    BACKEND_KIMI,
    BACKEND_OPENROUTER,
]
|
||||||
|
|
||||||
|
# Task-specific keyword mappings
#
# Matching semantics (see TaskClassifier._classify_task_type): the prompt is
# lowercased and whitespace-tokenized, then each set is intersected with the
# resulting token set.
# NOTE(review): multi-word phrases (e.g. "pull request", "state of the art",
# "how do i") and prefix stems ("librar", "histor") can never match a
# single-token set intersection — confirm whether phrase/substring matching
# was intended.

# Signals a programming / debugging / infrastructure task.
CODE_KEYWORDS: Set[str] = {
    "code", "coding", "program", "programming", "function", "class",
    "implement", "implementation", "refactor", "debug", "debugging",
    "error", "exception", "traceback", "stacktrace", "test", "tests",
    "pytest", "unittest", "import", "module", "package", "library",
    "api", "endpoint", "route", "middleware", "database", "query",
    "sql", "orm", "migration", "deploy", "docker", "kubernetes",
    "k8s", "ci/cd", "pipeline", "build", "compile", "syntax",
    "lint", "format", "black", "flake8", "mypy", "type", "typing",
    "async", "await", "callback", "promise", "thread", "process",
    "concurrency", "parallel", "optimization", "optimize", "performance",
    "memory", "leak", "bug", "fix", "patch", "commit", "git",
    "repository", "repo", "clone", "fork", "merge", "conflict",
    "branch", "pull request", "pr", "review", "crud", "rest",
    "graphql", "json", "xml", "yaml", "toml", "csv", "parse",
    "regex", "regular expression", "string", "bytes", "encoding",
    "decoding", "serialize", "deserialize", "marshal", "unmarshal",
    "encrypt", "decrypt", "hash", "checksum", "signature", "jwt",
    "oauth", "authentication", "authorization", "auth", "login",
    "logout", "session", "cookie", "token", "permission", "role",
    "rbac", "acl", "security", "vulnerability", "cve", "exploit",
    "sandbox", "isolate", "container", "vm", "virtual machine",
}

# Signals analysis / evaluation / multi-step reasoning.
REASONING_KEYWORDS: Set[str] = {
    "analyze", "analysis", "investigate", "investigation",
    "compare", "comparison", "contrast", "evaluate", "evaluation",
    "assess", "assessment", "reason", "reasoning", "logic",
    "logical", "deduce", "deduction", "infer", "inference",
    "synthesize", "synthesis", "critique", "criticism", "review",
    "argument", "premise", "conclusion", "evidence", "proof",
    "theorem", "axiom", "corollary", "lemma", "proposition",
    "hypothesis", "theory", "model", "framework", "paradigm",
    "philosophy", "ethical", "ethics", "moral", "morality",
    "implication", "consequence", "trade-off", "tradeoff",
    "pros and cons", "advantage", "disadvantage", "benefit",
    "drawback", "risk", "mitigation", "strategy", "strategic",
    "plan", "planning", "design", "architecture", "system",
    "complex", "complicated", "nuanced", "subtle", "sophisticated",
    "rigorous", "thorough", "comprehensive", "exhaustive",
    "step by step", "chain of thought", "think through",
    "work through", "figure out", "understand", "comprehend",
}

# Signals literature / web / data research.
RESEARCH_KEYWORDS: Set[str] = {
    "research", "find", "search", "look up", "lookup",
    "investigate", "study", "explore", "discover",
    "paper", "publication", "journal", "article", "study",
    "arxiv", "scholar", "academic", "scientific", "literature",
    "review", "survey", "meta-analysis", "bibliography",
    "citation", "reference", "source", "primary source",
    "secondary source", "peer review", "empirical", "experiment",
    "experimental", "observational", "longitudinal", "cross-sectional",
    "qualitative", "quantitative", "mixed methods", "case study",
    "dataset", "data", "statistics", "statistical", "correlation",
    "causation", "regression", "machine learning", "ml", "ai",
    "neural network", "deep learning", "transformer", "llm",
    "benchmark", "evaluation", "metric", "sota", "state of the art",
    "survey", "poll", "interview", "focus group", "ethnography",
    "field work", "archive", "archival", "repository", "collection",
    "index", "catalog", "database", "librar", "museum", "histor",
    "genealogy", "ancestry", "patent", "trademark", "copyright",
    "legislation", "regulation", "policy", "compliance",
}

# Signals creative writing / art / brainstorming.
CREATIVE_KEYWORDS: Set[str] = {
    "create", "creative", "creativity", "design", "designer",
    "art", "artistic", "artist", "paint", "painting", "draw",
    "drawing", "sketch", "illustration", "illustrator", "graphic",
    "visual", "image", "photo", "photography", "photographer",
    "video", "film", "movie", "animation", "animate", "motion",
    "music", "musical", "song", "lyric", "compose", "composition",
    "melody", "harmony", "rhythm", "beat", "sound", "audio",
    "write", "writing", "writer", "author", "story", "storytelling",
    "narrative", "plot", "character", "dialogue", "scene",
    "novel", "fiction", "short story", "poem", "poetry", "poet",
    "verse", "prose", "essay", "blog", "article", "content",
    "copy", "copywriting", "marketing", "brand", "branding",
    "slogan", "tagline", "headline", "title", "name", "naming",
    "brainstorm", "ideate", "concept", "conceptualize", "imagine",
    "imagination", "inspire", "inspiration", "muse", "vision",
    "aesthetic", "style", "theme", "mood", "tone", "voice",
    "unique", "original", "fresh", "novel", "innovative",
    "unconventional", "experimental", "avant-garde", "edgy",
    "humor", "funny", "comedy", "satire", "parody", "wit",
    "romance", "romantic", "drama", "dramatic", "thriller",
    "mystery", "horror", "sci-fi", "science fiction", "fantasy",
    "adventure", "action", "documentary", "biopic", "memoir",
}

# Signals short, simple queries and quick commands. Only scored when no
# other keyword set matched (see _classify_task_type gating).
FAST_OPS_KEYWORDS: Set[str] = {
    "quick", "fast", "brief", "short", "simple", "easy",
    "status", "check", "list", "ls", "show", "display",
    "get", "fetch", "retrieve", "read", "cat", "view",
    "summary", "summarize", "tl;dr", "tldr", "overview",
    "count", "number", "how many", "total", "sum", "average",
    "min", "max", "sort", "filter", "grep", "search",
    "find", "locate", "which", "where", "what is", "what's",
    "who", "when", "yes/no", "confirm", "verify", "validate",
    "ping", "health", "alive", "up", "running", "online",
    "date", "time", "timezone", "clock", "timer", "alarm",
    "remind", "reminder", "note", "jot", "save", "store",
    "delete", "remove", "rm", "clean", "clear", "purge",
    "start", "stop", "restart", "enable", "disable", "toggle",
    "on", "off", "open", "close", "switch", "change", "set",
    "update", "upgrade", "install", "uninstall", "download",
    "upload", "sync", "backup", "restore", "export", "import",
    "convert", "transform", "format", "parse", "extract",
    "compress", "decompress", "zip", "unzip", "tar", "archive",
    "copy", "cp", "move", "mv", "rename", "link", "symlink",
    "permission", "chmod", "chown", "access", "ownership",
    "hello", "hi", "hey", "greeting", "thanks", "thank you",
    "bye", "goodbye", "help", "?", "how to", "how do i",
}

# Signals explicit tool / shell / agent / infrastructure invocation.
TOOL_USE_KEYWORDS: Set[str] = {
    "tool", "tools", "use tool", "call tool", "invoke",
    "run command", "execute", "terminal", "shell", "bash",
    "zsh", "powershell", "cmd", "command line", "cli",
    "file", "files", "directory", "folder", "path", "fs",
    "read file", "write file", "edit file", "patch file",
    "search files", "find files", "grep", "rg", "ack",
    "browser", "web", "navigate", "click", "scroll",
    "screenshot", "vision", "image", "analyze image",
    "delegate", "subagent", "agent", "spawn", "task",
    "mcp", "server", "mcporter", "protocol",
    "process", "background", "kill", "signal", "pid",
    "git", "commit", "push", "pull", "clone", "branch",
    "docker", "container", "compose", "dockerfile",
    "kubernetes", "kubectl", "k8s", "pod", "deployment",
    "aws", "gcp", "azure", "cloud", "s3", "bucket",
    "database", "db", "sql", "query", "migrate", "seed",
    "api", "endpoint", "request", "response", "curl",
    "http", "https", "rest", "graphql", "websocket",
    "json", "xml", "yaml", "csv", "parse", "serialize",
    "scrap", "crawl", "extract", "parse html", "xpath",
    "schedule", "cron", "job", "task queue", "worker",
    "notification", "alert", "webhook", "event", "trigger",
}
|
||||||
|
|
||||||
|
# URL pattern for detecting web/research tasks.
# Matches http/https URLs with optional port, path, query and fragment.
_URL_PATTERN = re.compile(
    r'https?://(?:[-\w.])+(?:[:\d]+)?(?:/(?:[\w/_.])*(?:\?(?:[\w&=%.])*)?(?:#(?:[\w.])*)?)?',
    re.IGNORECASE
)

# Code block detection (count ``` blocks, not individual lines).
# NOTE(review): this pattern is not used by _count_code_blocks (which counts
# raw ``` fences) and is only stored on TaskClassifier instances without a
# visible call site — confirm before removing.
_CODE_BLOCK_PATTERN = re.compile(r'```[\w]*\n', re.MULTILINE)
|
||||||
|
|
||||||
|
|
||||||
|
def _count_code_blocks(text: str) -> int:
|
||||||
|
"""Count complete code blocks (opening ``` to closing ```)."""
|
||||||
|
# Count pairs of ``` - each pair is one code block
|
||||||
|
fence_count = text.count('```')
|
||||||
|
return fence_count // 2
|
||||||
|
# Matches `inline code` spans delimited by single backticks.
_INLINE_CODE_PATTERN = re.compile(r'`[^`]+`')

# Complexity thresholds.
# Per-metric cutoffs used by TaskClassifier._assess_complexity: a value at or
# below "low" counts as low, above "low" up to "medium" counts as medium,
# and above "medium" counts toward a high-complexity vote.
COMPLEXITY_THRESHOLDS = {
    "chars": {"low": 200, "medium": 800},
    "words": {"low": 35, "medium": 150},
    "lines": {"low": 3, "medium": 15},
    "urls": {"low": 0, "medium": 2},
    "code_blocks": {"low": 0, "medium": 1},
}
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
class ClassificationResult:
    """Result of task classification, as produced by TaskClassifier.classify."""
    task_type: TaskType              # winning TaskType category
    preferred_backends: List[str]    # backend identifiers, most preferred first
    complexity: ComplexityLevel      # assessed LOW / MEDIUM / HIGH tier
    reason: str                      # human-readable routing explanation
    confidence: float                # 0.0-1.0 (raw score capped at 4.0)
    features: Dict[str, Any]         # raw features from _extract_features
|
||||||
|
|
||||||
|
|
||||||
|
class TaskClassifier:
    """
    Enhanced task classifier for routing prompts to appropriate backends.

    Maps task types to ranked backend preferences based on:
    - Backend strengths (coding, reasoning, speed, context length, etc.)
    - Message complexity (length, structure, keywords)
    - Detected features (URLs, code blocks, specific terminology)

    Fix over the previous version: multi-word keyword phrases in the keyword
    sets (e.g. "pull request", "state of the art") are now matched via
    substring search; a plain token-set intersection could never match them.
    Reason strings now list matched keywords in sorted (deterministic) order.
    """

    # Backend preference rankings by task type.
    # Order matters: first is most preferred. Every entry contains all seven
    # backends so the fallback chain is never truncated, only re-ranked.
    TASK_BACKEND_MAP: Dict[TaskType, List[str]] = {
        TaskType.CODE: [
            BACKEND_OPENAI_CODEX,   # Best for code generation
            BACKEND_ANTHROPIC,      # Excellent for code review, complex analysis
            BACKEND_KIMI,           # Long context for large codebases
            BACKEND_GEMINI,         # Good multimodal code understanding
            BACKEND_GROQ,           # Fast for simple code tasks
            BACKEND_OPENROUTER,     # Overflow option
            BACKEND_GROK,           # General knowledge backup
        ],
        TaskType.REASONING: [
            BACKEND_ANTHROPIC,      # Deep reasoning champion
            BACKEND_GEMINI,         # Strong analytical capabilities
            BACKEND_KIMI,           # Long context for complex reasoning chains
            BACKEND_GROK,           # Broad knowledge for reasoning
            BACKEND_OPENAI_CODEX,   # Structured reasoning
            BACKEND_OPENROUTER,     # Overflow
            BACKEND_GROQ,           # Fast fallback
        ],
        TaskType.RESEARCH: [
            BACKEND_GEMINI,         # Research and multimodal leader
            BACKEND_KIMI,           # 262K context for long documents
            BACKEND_ANTHROPIC,      # Deep analysis
            BACKEND_GROK,           # Broad knowledge
            BACKEND_OPENROUTER,     # Broadest model access
            BACKEND_OPENAI_CODEX,   # Structured research
            BACKEND_GROQ,           # Fast triage
        ],
        TaskType.CREATIVE: [
            BACKEND_GROK,           # Creative writing and drafting
            BACKEND_ANTHROPIC,      # Nuanced creative work
            BACKEND_GEMINI,         # Multimodal creativity
            BACKEND_OPENAI_CODEX,   # Creative coding
            BACKEND_KIMI,           # Long-form creative
            BACKEND_OPENROUTER,     # Variety of creative models
            BACKEND_GROQ,           # Fast creative ops
        ],
        TaskType.FAST_OPS: [
            BACKEND_GROQ,           # 284ms response time champion
            BACKEND_OPENROUTER,     # Fast mini models
            BACKEND_GEMINI,         # Flash models
            BACKEND_GROK,           # Fast for simple queries
            BACKEND_ANTHROPIC,      # If precision needed
            BACKEND_OPENAI_CODEX,   # Structured ops
            BACKEND_KIMI,           # Overflow
        ],
        TaskType.TOOL_USE: [
            BACKEND_ANTHROPIC,      # Excellent tool use capabilities
            BACKEND_OPENAI_CODEX,   # Good tool integration
            BACKEND_GEMINI,         # Multimodal tool use
            BACKEND_GROQ,           # Fast tool chaining
            BACKEND_KIMI,           # Long context tool sessions
            BACKEND_OPENROUTER,     # Overflow
            BACKEND_GROK,           # General tool use
        ],
        TaskType.UNKNOWN: [
            BACKEND_ANTHROPIC,      # Default to strongest general model
            BACKEND_GEMINI,         # Good all-rounder
            BACKEND_OPENAI_CODEX,   # Structured approach
            BACKEND_KIMI,           # Long context safety
            BACKEND_GROK,           # Broad knowledge
            BACKEND_GROQ,           # Fast fallback
            BACKEND_OPENROUTER,     # Ultimate overflow
        ],
    }

    def __init__(self):
        """Initialize the classifier with the module-level compiled patterns."""
        self.url_pattern = _URL_PATTERN
        self.code_block_pattern = _CODE_BLOCK_PATTERN
        self.inline_code_pattern = _INLINE_CODE_PATTERN

    @staticmethod
    def _match_keywords(words: Set[str], lowered: str, keywords: Set[str]) -> Set[str]:
        """Return the entries of *keywords* that appear in the prompt.

        Single-token keywords are matched by set intersection against the
        punctuation-stripped token set. Multi-word phrases — which a token
        intersection can never match — are found by substring search in the
        lowercased text.

        Args:
            words: Lowercased, punctuation-stripped tokens of the prompt.
            lowered: The full lowercased prompt text.
            keywords: Keyword/phrase set to test against.

        Returns:
            The subset of *keywords* present in the prompt.
        """
        matched = words & keywords
        matched.update(kw for kw in keywords if " " in kw and kw in lowered)
        return matched

    def classify(
        self,
        prompt: str,
        context: Optional[Dict[str, Any]] = None
    ) -> ClassificationResult:
        """
        Classify a prompt and return routing recommendation.

        Args:
            prompt: The user message to classify
            context: Optional context (previous messages, session state, etc.).
                Currently unused; accepted for forward compatibility.

        Returns:
            ClassificationResult with task type, preferred backends, complexity, and reasoning
        """
        text = (prompt or "").strip()
        if not text:
            return self._default_result("Empty prompt")

        # Pipeline: features -> complexity -> task type -> backend ranking.
        features = self._extract_features(text)
        complexity = self._assess_complexity(features)
        task_type, task_confidence, task_reason = self._classify_task_type(text, features)
        preferred_backends = self._get_backends_for_task(task_type, complexity, features)
        reason = self._build_reason(task_type, complexity, task_reason, features)

        return ClassificationResult(
            task_type=task_type,
            preferred_backends=preferred_backends,
            complexity=complexity,
            reason=reason,
            confidence=task_confidence,
            features=features,
        )

    def _extract_features(self, text: str) -> Dict[str, Any]:
        """Extract length, code, URL and token features from the prompt text."""
        lowered = text.lower()
        # Strip common surrounding punctuation so "debug," matches "debug".
        words = set(token.strip(".,:;!?()[]{}\"'`") for token in lowered.split())

        # Count complete ``` fence pairs and `inline code` spans.
        code_blocks = _count_code_blocks(text)
        inline_code = len(self.inline_code_pattern.findall(text))

        urls = self.url_pattern.findall(text)

        lines = text.count('\n') + 1

        return {
            "char_count": len(text),
            "word_count": len(text.split()),
            "line_count": lines,
            "url_count": len(urls),
            "urls": urls,
            "code_block_count": code_blocks,
            "inline_code_count": inline_code,
            "has_code": code_blocks > 0 or inline_code > 0,
            "unique_words": words,
            "lowercased_text": lowered,
        }

    def _assess_complexity(self, features: Dict[str, Any]) -> ComplexityLevel:
        """Assess the complexity level of the prompt.

        Each metric casts a "medium" vote when it exceeds its low threshold
        and a "high" vote when it exceeds its medium threshold; the votes are
        then aggregated (see COMPLEXITY_THRESHOLDS).
        """
        scores = {
            "chars": features["char_count"],
            "words": features["word_count"],
            "lines": features["line_count"],
            "urls": features["url_count"],
            "code_blocks": features["code_block_count"],
        }

        medium_count = 0
        high_count = 0

        for metric, value in scores.items():
            thresholds = COMPLEXITY_THRESHOLDS.get(metric, {"low": 0, "medium": 0})
            if value > thresholds["medium"]:
                high_count += 1
            elif value > thresholds["low"]:
                medium_count += 1

        # More than two code blocks is treated as high complexity outright.
        if high_count >= 2 or scores["code_blocks"] > 2:
            return ComplexityLevel.HIGH
        elif medium_count >= 2 or high_count >= 1:
            return ComplexityLevel.MEDIUM
        else:
            return ComplexityLevel.LOW

    def _classify_task_type(
        self,
        text: str,
        features: Dict[str, Any]
    ) -> Tuple[TaskType, float, str]:
        """
        Classify the task type based on keywords and features.

        Returns:
            Tuple of (task_type, confidence, reason)
        """
        words = features["unique_words"]
        lowered = features["lowercased_text"]

        # Match every keyword set up front (single tokens + multi-word
        # phrases); the results are reused for scoring and for gating the
        # FAST_OPS fallback below.
        code_matches = self._match_keywords(words, lowered, CODE_KEYWORDS)
        reasoning_matches = self._match_keywords(words, lowered, REASONING_KEYWORDS)
        research_matches = self._match_keywords(words, lowered, RESEARCH_KEYWORDS)
        creative_matches = self._match_keywords(words, lowered, CREATIVE_KEYWORDS)
        fast_ops_matches = self._match_keywords(words, lowered, FAST_OPS_KEYWORDS)
        tool_matches = self._match_keywords(words, lowered, TOOL_USE_KEYWORDS)

        # Score each task type; ties go to the earlier TaskType member.
        scores: Dict[TaskType, float] = {task: 0.0 for task in TaskType}
        reasons: Dict[TaskType, str] = {}

        # CODE scoring: code blocks are the strongest signal (+2.0), keywords
        # add up to 3.0 more.
        if features["has_code"]:
            scores[TaskType.CODE] += 2.0
            reasons[TaskType.CODE] = "Contains code blocks"
        if code_matches:
            scores[TaskType.CODE] += min(len(code_matches) * 0.5, 3.0)
            if TaskType.CODE not in reasons:
                reasons[TaskType.CODE] = f"Code keywords: {', '.join(sorted(code_matches)[:3])}"

        # REASONING scoring: keyword hits plus a bonus for an explicit
        # reasoning request.
        if reasoning_matches:
            scores[TaskType.REASONING] += min(len(reasoning_matches) * 0.4, 2.5)
            reasons[TaskType.REASONING] = f"Reasoning keywords: {', '.join(sorted(reasoning_matches)[:3])}"
        if any(phrase in lowered for phrase in ["step by step", "chain of thought", "think through"]):
            scores[TaskType.REASONING] += 1.5
            reasons[TaskType.REASONING] = "Explicit reasoning request"

        # RESEARCH scoring: URLs are a strong research signal.
        if features["url_count"] > 0:
            scores[TaskType.RESEARCH] += 1.5
            reasons[TaskType.RESEARCH] = f"Contains {features['url_count']} URL(s)"
        if research_matches:
            scores[TaskType.RESEARCH] += min(len(research_matches) * 0.4, 2.0)
            if TaskType.RESEARCH not in reasons:
                reasons[TaskType.RESEARCH] = f"Research keywords: {', '.join(sorted(research_matches)[:3])}"

        # CREATIVE scoring
        if creative_matches:
            scores[TaskType.CREATIVE] += min(len(creative_matches) * 0.4, 2.5)
            reasons[TaskType.CREATIVE] = f"Creative keywords: {', '.join(sorted(creative_matches)[:3])}"

        # FAST_OPS scoring (simple queries) — ONLY when no other task type has
        # any signal, so a short code/tool prompt is not misrouted.
        is_very_short = features["word_count"] <= 5 and features["char_count"] < 50
        other_scores_possible = bool(
            code_matches or reasoning_matches or research_matches
            or creative_matches or tool_matches or features["has_code"]
        )

        if is_very_short and not other_scores_possible:
            scores[TaskType.FAST_OPS] += 1.5
            reasons[TaskType.FAST_OPS] = "Very short, simple query"
        elif not other_scores_possible and fast_ops_matches and features["word_count"] < 30:
            scores[TaskType.FAST_OPS] += min(len(fast_ops_matches) * 0.3, 1.0)
            reasons[TaskType.FAST_OPS] = f"Simple query keywords: {', '.join(sorted(fast_ops_matches)[:3])}"

        # TOOL_USE scoring: keyword hits plus a small bonus for imperative
        # command verbs.
        if tool_matches:
            scores[TaskType.TOOL_USE] += min(len(tool_matches) * 0.4, 2.0)
            reasons[TaskType.TOOL_USE] = f"Tool keywords: {', '.join(sorted(tool_matches)[:3])}"
        if any(cmd in lowered for cmd in ["run ", "execute ", "call ", "use "]):
            scores[TaskType.TOOL_USE] += 0.5

        # Pick the highest-scoring task type (first one wins ties).
        best_task = TaskType.UNKNOWN
        best_score = 0.0
        for task, score in scores.items():
            if score > best_score:
                best_score = score
                best_task = task

        # Confidence saturates at a raw score of 4.0.
        confidence = min(best_score / 4.0, 1.0) if best_score > 0 else 0.0
        reason = reasons.get(best_task, "No strong indicators")

        return best_task, confidence, reason

    def _get_backends_for_task(
        self,
        task_type: TaskType,
        complexity: ComplexityLevel,
        features: Dict[str, Any]
    ) -> List[str]:
        """Get ranked list of preferred backends for the task.

        Starts from TASK_BACKEND_MAP and applies complexity- and
        feature-based re-ranking (long-context models for heavy research/code,
        Groq first for trivial ops, Codex boost when code is present).
        """
        base_backends = self.TASK_BACKEND_MAP.get(task_type, self.TASK_BACKEND_MAP[TaskType.UNKNOWN])

        if complexity == ComplexityLevel.HIGH and task_type in (TaskType.RESEARCH, TaskType.CODE):
            # For high complexity, prioritize long-context models.
            if BACKEND_KIMI in base_backends:
                base_backends = self._prioritize_backend(base_backends, BACKEND_KIMI, 2)
            if BACKEND_GEMINI in base_backends:
                base_backends = self._prioritize_backend(base_backends, BACKEND_GEMINI, 3)

        elif complexity == ComplexityLevel.LOW and task_type == TaskType.FAST_OPS:
            # For simple ops, ensure the fastest backend is first.
            base_backends = self._prioritize_backend(base_backends, BACKEND_GROQ, 0)

        if features["has_code"] and task_type != TaskType.CODE:
            # Boost OpenAI Codex if there's code but not explicitly a code task.
            base_backends = self._prioritize_backend(base_backends, BACKEND_OPENAI_CODEX, 2)

        return list(base_backends)

    def _prioritize_backend(
        self,
        backends: List[str],
        target: str,
        target_index: int
    ) -> List[str]:
        """Return a copy of *backends* with *target* moved to *target_index*.

        No-op (returns the input list unchanged) when *target* is absent;
        the index is clamped to the end of the list.
        """
        if target not in backends:
            return backends

        new_backends = list(backends)
        new_backends.remove(target)
        new_backends.insert(min(target_index, len(new_backends)), target)
        return new_backends

    def _build_reason(
        self,
        task_type: TaskType,
        complexity: ComplexityLevel,
        task_reason: str,
        features: Dict[str, Any]
    ) -> str:
        """Build a human-readable, semicolon-joined reason string."""
        parts = [
            f"Task: {task_type.value}",
            f"Complexity: {complexity.value}",
        ]

        if task_reason:
            parts.append(f"Indicators: {task_reason}")

        # Append a compact feature summary (code blocks, URLs, long prompts).
        feature_parts = []
        if features["has_code"]:
            feature_parts.append(f"{features['code_block_count']} code block(s)")
        if features["url_count"] > 0:
            feature_parts.append(f"{features['url_count']} URL(s)")
        if features["word_count"] > 100:
            feature_parts.append(f"{features['word_count']} words")

        if feature_parts:
            parts.append(f"Features: {', '.join(feature_parts)}")

        return "; ".join(parts)

    def _default_result(self, reason: str) -> ClassificationResult:
        """Return a low-confidence UNKNOWN result for edge cases (empty prompt)."""
        return ClassificationResult(
            task_type=TaskType.UNKNOWN,
            preferred_backends=list(self.TASK_BACKEND_MAP[TaskType.UNKNOWN]),
            complexity=ComplexityLevel.LOW,
            reason=reason,
            confidence=0.0,
            features={},
        )

    def to_dict(self, result: ClassificationResult) -> Dict[str, Any]:
        """Convert a ClassificationResult to a JSON-friendly dictionary.

        Bulky/internal feature entries (token set, lowercased text, raw URL
        list) are excluded from the output.
        """
        return {
            "task_type": result.task_type.value,
            "preferred_backends": result.preferred_backends,
            "complexity": result.complexity.value,
            "reason": result.reason,
            "confidence": round(result.confidence, 2),
            "features": {
                k: v for k, v in result.features.items()
                if k not in ("unique_words", "lowercased_text", "urls")
            },
        }
|
||||||
|
|
||||||
|
|
||||||
|
# Convenience function for direct usage
|
||||||
|
def classify_prompt(
    prompt: str,
    context: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """
    Classify a prompt and return the routing recommendation as a dictionary.

    Convenience wrapper: builds a throwaway TaskClassifier, runs the
    classification, and serializes the result via ``to_dict``.

    Args:
        prompt: The user message to classify
        context: Optional context (previous messages, session state, etc.)

    Returns:
        Dictionary with task_type, preferred_backends, complexity, reason, confidence
    """
    clf = TaskClassifier()
    return clf.to_dict(clf.classify(prompt, context))
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
    # Example usage and quick test: one representative prompt per task type,
    # printing the routing decision for each.
    test_prompts = [
        "Hello, how are you?",
        "Implement a Python function to calculate fibonacci numbers",
        "Analyze the architectural trade-offs between microservices and monoliths",
        "Research the latest papers on transformer architectures",
        "Write a creative story about AI",
        "Check the status of the server and list running processes",
        "Use the browser to navigate to https://example.com and take a screenshot",
        "Refactor this large codebase: [2000 lines of code]",
    ]

    classifier = TaskClassifier()

    # Print type, complexity, confidence, top-3 backends and reason per prompt.
    for prompt in test_prompts:
        result = classifier.classify(prompt)
        print(f"\nPrompt: {prompt[:60]}...")
        print(f" Type: {result.task_type.value}")
        print(f" Complexity: {result.complexity.value}")
        print(f" Confidence: {result.confidence:.2f}")
        print(f" Backends: {', '.join(result.preferred_backends[:3])}")
        print(f" Reason: {result.reason}")
||||||
379
uniwizard/task_classifier_design.md
Normal file
379
uniwizard/task_classifier_design.md
Normal file
@@ -0,0 +1,379 @@
|
|||||||
|
# Task Classifier Design Document
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The **Task Classifier** is an enhanced prompt routing system for the Uniwizard agent harness. It classifies incoming user prompts into task categories and maps them to ranked backend preferences, enabling intelligent model selection across the 7-backend fallback chain.
|
||||||
|
|
||||||
|
## Goals
|
||||||
|
|
||||||
|
1. **Right-size every request**: Route simple queries to fast backends, complex tasks to capable ones
|
||||||
|
2. **Minimize latency**: Use Groq (284ms) for fast operations, Anthropic for deep reasoning
|
||||||
|
3. **Maximize quality**: Match task type to backend strengths
|
||||||
|
4. **Provide transparency**: Return clear reasoning for routing decisions
|
||||||
|
5. **Enable fallback**: Support the full 7-backend chain with intelligent ordering
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────┐
|
||||||
|
│ User Prompt │
|
||||||
|
└────────┬────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────────────┐
|
||||||
|
│ Feature Extraction │
|
||||||
|
│ - Length metrics │
|
||||||
|
│ - Code detection │
|
||||||
|
│ - URL extraction │
|
||||||
|
│ - Keyword tokenize │
|
||||||
|
└────────┬────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────────────┐
|
||||||
|
│ Complexity Assess │
|
||||||
|
│ - Low/Medium/High │
|
||||||
|
└────────┬────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────────────┐
|
||||||
|
│ Task Classification│
|
||||||
|
│ - Code │
|
||||||
|
│ - Reasoning │
|
||||||
|
│ - Research │
|
||||||
|
│ - Creative │
|
||||||
|
│ - Fast Ops │
|
||||||
|
│ - Tool Use │
|
||||||
|
└────────┬────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────────────┐
|
||||||
|
│ Backend Selection │
|
||||||
|
│ - Ranked by task │
|
||||||
|
│ - Complexity adj. │
|
||||||
|
│ - Feature boosts │
|
||||||
|
└────────┬────────────┘
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
┌─────────────────────┐
|
||||||
|
│ ClassificationResult
|
||||||
|
│ - task_type │
|
||||||
|
│ - preferred_backends
|
||||||
|
│ - complexity │
|
||||||
|
│ - reason │
|
||||||
|
│ - confidence │
|
||||||
|
└─────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
## Task Types
|
||||||
|
|
||||||
|
| Task Type | Description | Primary Indicators |
|
||||||
|
|-----------|-------------|-------------------|
|
||||||
|
| `code` | Programming tasks, debugging, refactoring | Keywords: implement, debug, refactor, test, function, class, API |
|
||||||
|
| `reasoning` | Analysis, comparison, evaluation | Keywords: analyze, compare, evaluate, step by step, trade-offs |
|
||||||
|
| `research` | Information gathering, literature review | Keywords: research, find, paper, study, URLs present |
|
||||||
|
| `creative` | Writing, design, content creation | Keywords: write, create, design, story, poem, brainstorm |
|
||||||
|
| `fast_ops` | Quick status checks, simple queries | Short length (<20 words), simple keywords |
|
||||||
|
| `tool_use` | Tool invocations, commands, API calls | Keywords: run, execute, use tool, browser, delegate |
|
||||||
|
| `unknown` | No clear indicators | Fallback classification |
|
||||||
|
|
||||||
|
## Backend Strengths Mapping
|
||||||
|
|
||||||
|
### 1. Anthropic (Claude)
|
||||||
|
- **Strengths**: Deep reasoning, code review, complex analysis, tool use
|
||||||
|
- **Best for**: Reasoning, tool_use, complex code review
|
||||||
|
- **Ranking**: #1 for reasoning, #2 for code, #1 for tool_use
|
||||||
|
|
||||||
|
### 2. OpenAI Codex
|
||||||
|
- **Strengths**: Code generation, feature implementation
|
||||||
|
- **Best for**: Code tasks, structured outputs
|
||||||
|
- **Ranking**: #1 for code generation
|
||||||
|
|
||||||
|
### 3. Gemini
|
||||||
|
- **Strengths**: Research, multimodal, long context
|
||||||
|
- **Best for**: Research tasks, document analysis
|
||||||
|
- **Ranking**: #1 for research, #2 for reasoning
|
||||||
|
|
||||||
|
### 4. Groq
|
||||||
|
- **Strengths**: Speed (284ms latency)
|
||||||
|
- **Best for**: Fast operations, simple queries, triage
|
||||||
|
- **Ranking**: #1 for fast_ops
|
||||||
|
|
||||||
|
### 5. Grok
|
||||||
|
- **Strengths**: Broad knowledge, creative, drafting
|
||||||
|
- **Best for**: Creative writing, general knowledge
|
||||||
|
- **Ranking**: #1 for creative
|
||||||
|
|
||||||
|
### 6. Kimi (kimi-coding)
|
||||||
|
- **Strengths**: Long context (262K tokens), code refactoring
|
||||||
|
- **Best for**: Large codebase work, long document analysis
|
||||||
|
- **Ranking**: Boosted for high-complexity code/research
|
||||||
|
|
||||||
|
### 7. OpenRouter
|
||||||
|
- **Strengths**: Broadest model access, overflow handling
|
||||||
|
- **Best for**: Fallback, variety of model choices
|
||||||
|
- **Ranking**: #6 or #7 across all task types
|
||||||
|
|
||||||
|
## Backend Rankings by Task Type
|
||||||
|
|
||||||
|
```python
|
||||||
|
CODE = [
|
||||||
|
openai-codex, # Best generation
|
||||||
|
anthropic, # Review & analysis
|
||||||
|
kimi, # Large codebases
|
||||||
|
gemini, # Multimodal
|
||||||
|
groq, # Fast simple tasks
|
||||||
|
openrouter, # Overflow
|
||||||
|
grok, # General backup
|
||||||
|
]
|
||||||
|
|
||||||
|
REASONING = [
|
||||||
|
anthropic, # Deep reasoning
|
||||||
|
gemini, # Analysis
|
||||||
|
kimi, # Long chains
|
||||||
|
grok, # Broad knowledge
|
||||||
|
openai-codex, # Structured
|
||||||
|
openrouter,
|
||||||
|
groq,
|
||||||
|
]
|
||||||
|
|
||||||
|
RESEARCH = [
|
||||||
|
gemini, # Research leader
|
||||||
|
kimi, # 262K context
|
||||||
|
anthropic, # Deep analysis
|
||||||
|
grok, # Knowledge
|
||||||
|
openrouter, # Broad access
|
||||||
|
openai-codex,
|
||||||
|
groq, # Triage
|
||||||
|
]
|
||||||
|
|
||||||
|
CREATIVE = [
|
||||||
|
grok, # Creative writing
|
||||||
|
anthropic, # Nuanced
|
||||||
|
gemini, # Multimodal
|
||||||
|
openai-codex, # Creative coding
|
||||||
|
kimi, # Long-form
|
||||||
|
openrouter,
|
||||||
|
groq,
|
||||||
|
]
|
||||||
|
|
||||||
|
FAST_OPS = [
|
||||||
|
groq, # 284ms champion
|
||||||
|
openrouter, # Fast mini models
|
||||||
|
gemini, # Flash
|
||||||
|
grok, # Simple queries
|
||||||
|
anthropic,
|
||||||
|
openai-codex,
|
||||||
|
kimi,
|
||||||
|
]
|
||||||
|
|
||||||
|
TOOL_USE = [
|
||||||
|
anthropic, # Tool use leader
|
||||||
|
openai-codex, # Good integration
|
||||||
|
gemini, # Multimodal
|
||||||
|
groq, # Fast chaining
|
||||||
|
kimi, # Long sessions
|
||||||
|
openrouter,
|
||||||
|
grok,
|
||||||
|
]
|
||||||
|
```
|
||||||
|
|
||||||
|
## Complexity Assessment
|
||||||
|
|
||||||
|
Complexity is determined by:
|
||||||
|
|
||||||
|
| Metric | Low | Medium | High |
|
||||||
|
|--------|-----|--------|------|
|
||||||
|
| Characters | ≤200 | 201-800 | >800 |
|
||||||
|
| Words | ≤35 | 36-150 | >150 |
|
||||||
|
| Lines | ≤3 | 4-15 | >15 |
|
||||||
|
| URLs | 0 | 1 | ≥2 |
|
||||||
|
| Code Blocks | 0 | 1 | ≥2 |
|
||||||
|
|
||||||
|
**Rules:**
|
||||||
|
- 2+ high metrics → **HIGH** complexity
|
||||||
|
- 2+ medium metrics or 1 high → **MEDIUM** complexity
|
||||||
|
- Otherwise → **LOW** complexity
|
||||||
|
|
||||||
|
### Complexity Adjustments
|
||||||
|
|
||||||
|
- **HIGH complexity + RESEARCH/CODE**: Boost Kimi and Gemini in rankings
|
||||||
|
- **LOW complexity + FAST_OPS**: Ensure Groq is first
|
||||||
|
- **Code blocks present**: Boost OpenAI Codex in any task type
|
||||||
|
|
||||||
|
## Keyword Dictionaries
|
||||||
|
|
||||||
|
The classifier uses curated keyword sets for each task type:
|
||||||
|
|
||||||
|
### Code Keywords (100+)
|
||||||
|
- Implementation: implement, code, function, class, module
|
||||||
|
- Debugging: debug, error, exception, traceback, bug, fix
|
||||||
|
- Testing: test, pytest, unittest, coverage
|
||||||
|
- Operations: deploy, docker, kubernetes, ci/cd, pipeline
|
||||||
|
- Concepts: api, endpoint, database, query, authentication
|
||||||
|
|
||||||
|
### Reasoning Keywords (50+)
|
||||||
|
- Analysis: analyze, evaluate, assess, critique, review
|
||||||
|
- Logic: reason, deduce, infer, logic, argument, evidence
|
||||||
|
- Process: compare, contrast, trade-off, strategy, plan
|
||||||
|
- Modifiers: step by step, chain of thought, think through
|
||||||
|
|
||||||
|
### Research Keywords (80+)
|
||||||
|
- Actions: research, find, search, explore, discover
|
||||||
|
- Sources: paper, publication, journal, arxiv, dataset
|
||||||
|
- Methods: study, survey, experiment, benchmark, evaluation
|
||||||
|
- Domains: machine learning, neural network, sota, literature
|
||||||
|
|
||||||
|
### Creative Keywords (100+)
|
||||||
|
- Visual: art, paint, draw, design, graphic, image
|
||||||
|
- Writing: write, story, novel, poem, essay, content
|
||||||
|
- Audio: music, song, compose, melody, sound
|
||||||
|
- Process: brainstorm, ideate, concept, imagine, inspire
|
||||||
|
|
||||||
|
### Fast Ops Keywords (60+)
|
||||||
|
- Simple: quick, fast, brief, simple, easy, status
|
||||||
|
- Actions: list, show, get, check, count, find
|
||||||
|
- Short queries: hi, hello, thanks, yes/no, what is
|
||||||
|
|
||||||
|
### Tool Use Keywords (70+)
|
||||||
|
- Actions: run, execute, call, use tool, invoke
|
||||||
|
- Systems: terminal, shell, docker, kubernetes, git
|
||||||
|
- Protocols: api, http, request, response, webhook
|
||||||
|
- Agents: delegate, subagent, spawn, mcp
|
||||||
|
|
||||||
|
## API
|
||||||
|
|
||||||
|
### Classify a Prompt
|
||||||
|
|
||||||
|
```python
|
||||||
|
from task_classifier import TaskClassifier, classify_prompt
|
||||||
|
|
||||||
|
# Method 1: Using the class
|
||||||
|
classifier = TaskClassifier()
|
||||||
|
result = classifier.classify("Implement a Python function")
|
||||||
|
|
||||||
|
print(result.task_type) # TaskType.CODE
|
||||||
|
print(result.preferred_backends) # ["openai-codex", "anthropic", ...]
|
||||||
|
print(result.complexity) # ComplexityLevel.LOW
|
||||||
|
print(result.reason) # "Task: code; Complexity: low; ..."
|
||||||
|
print(result.confidence) # 0.75
|
||||||
|
|
||||||
|
# Method 2: Convenience function
|
||||||
|
output = classify_prompt("Research AI papers")
|
||||||
|
# Returns dict: {
|
||||||
|
# "task_type": "research",
|
||||||
|
# "preferred_backends": ["gemini", "kimi", ...],
|
||||||
|
# "complexity": "low",
|
||||||
|
# "reason": "...",
|
||||||
|
# "confidence": 0.65,
|
||||||
|
# "features": {...}
|
||||||
|
# }
|
||||||
|
```
|
||||||
|
|
||||||
|
### ClassificationResult Fields
|
||||||
|
|
||||||
|
| Field | Type | Description |
|
||||||
|
|-------|------|-------------|
|
||||||
|
| `task_type` | TaskType | Classified task category |
|
||||||
|
| `preferred_backends` | List[str] | Ranked list of backend identifiers |
|
||||||
|
| `complexity` | ComplexityLevel | Assessed complexity level |
|
||||||
|
| `reason` | str | Human-readable classification reasoning |
|
||||||
|
| `confidence` | float | 0.0-1.0 confidence score |
|
||||||
|
| `features` | Dict | Extracted features (lengths, code, URLs) |
|
||||||
|
|
||||||
|
## Integration with Hermes
|
||||||
|
|
||||||
|
### Usage in Smart Model Routing
|
||||||
|
|
||||||
|
The task classifier replaces/enhances the existing `smart_model_routing.py`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
# In hermes-agent/agent/smart_model_routing.py
|
||||||
|
from uniwizard.task_classifier import TaskClassifier
|
||||||
|
|
||||||
|
classifier = TaskClassifier()
|
||||||
|
|
||||||
|
def resolve_turn_route(user_message, routing_config, primary, fallback_chain):
|
||||||
|
# Classify the prompt
|
||||||
|
result = classifier.classify(user_message)
|
||||||
|
|
||||||
|
# Map preferred backends to actual models from fallback_chain
|
||||||
|
for backend in result.preferred_backends:
|
||||||
|
model_config = fallback_chain.get(backend)
|
||||||
|
if model_config and is_available(backend):
|
||||||
|
return {
|
||||||
|
"model": model_config["model"],
|
||||||
|
"provider": backend,
|
||||||
|
"reason": result.reason,
|
||||||
|
"complexity": result.complexity.value,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Fallback to primary
|
||||||
|
return primary
|
||||||
|
```
|
||||||
|
|
||||||
|
### Configuration
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# config.yaml
|
||||||
|
smart_model_routing:
|
||||||
|
enabled: true
|
||||||
|
use_task_classifier: true
|
||||||
|
|
||||||
|
fallback_providers:
|
||||||
|
- provider: anthropic
|
||||||
|
model: claude-opus-4-6
|
||||||
|
- provider: openai-codex
|
||||||
|
model: codex
|
||||||
|
- provider: gemini
|
||||||
|
model: gemini-2.5-flash
|
||||||
|
- provider: groq
|
||||||
|
model: llama-3.3-70b-versatile
|
||||||
|
- provider: grok
|
||||||
|
model: grok-3-mini-fast
|
||||||
|
- provider: kimi-coding
|
||||||
|
model: kimi-k2.5
|
||||||
|
- provider: openrouter
|
||||||
|
model: openai/gpt-4.1-mini
|
||||||
|
```
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
Run the test suite:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd ~/.timmy/uniwizard
|
||||||
|
python -m pytest test_task_classifier.py -v
|
||||||
|
```
|
||||||
|
|
||||||
|
Coverage includes:
|
||||||
|
- Feature extraction (URLs, code blocks, length metrics)
|
||||||
|
- Complexity assessment (low/medium/high)
|
||||||
|
- Task type classification (all 6 primary types; `unknown` is the fallback)
|
||||||
|
- Backend selection (rankings by task type)
|
||||||
|
- Complexity adjustments (boosts for Kimi/Gemini)
|
||||||
|
- Edge cases (empty, whitespace, very long prompts)
|
||||||
|
- Integration scenarios (realistic use cases)
|
||||||
|
|
||||||
|
## Future Enhancements
|
||||||
|
|
||||||
|
1. **Session Context**: Use conversation history for better classification
|
||||||
|
2. **Performance Feedback**: Learn from actual backend performance
|
||||||
|
3. **User Preferences**: Allow user-defined backend preferences
|
||||||
|
4. **Cost Optimization**: Factor in backend costs for routing
|
||||||
|
5. **Streaming Detection**: Identify streaming-suitable tasks
|
||||||
|
6. **Multi-Modal**: Better handling of image/audio inputs
|
||||||
|
7. **Confidence Thresholds**: Configurable confidence cutoffs
|
||||||
|
|
||||||
|
## Files
|
||||||
|
|
||||||
|
| File | Description |
|
||||||
|
|------|-------------|
|
||||||
|
| `task_classifier.py` | Main implementation (600+ lines) |
|
||||||
|
| `test_task_classifier.py` | Unit tests (400+ lines) |
|
||||||
|
| `task_classifier_design.md` | This design document |
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
- Gitea Issue: timmy-home #88
|
||||||
|
- Existing: `~/.hermes/hermes-agent/agent/smart_model_routing.py`
|
||||||
|
- Config: `~/.hermes/config.yaml` (fallback_providers chain)
|
||||||
534
uniwizard/test_quality_scorer.py
Normal file
534
uniwizard/test_quality_scorer.py
Normal file
@@ -0,0 +1,534 @@
|
|||||||
|
"""
|
||||||
|
Tests for the Uniwizard Quality Scorer module.
|
||||||
|
|
||||||
|
Run with: python -m pytest ~/.timmy/uniwizard/test_quality_scorer.py -v
|
||||||
|
"""
|
||||||
|
|
||||||
|
import sqlite3
|
||||||
|
import tempfile
|
||||||
|
from pathlib import Path
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from quality_scorer import (
|
||||||
|
QualityScorer,
|
||||||
|
ResponseStatus,
|
||||||
|
TaskType,
|
||||||
|
BACKENDS,
|
||||||
|
BackendScore,
|
||||||
|
print_score_report,
|
||||||
|
print_full_report,
|
||||||
|
get_scorer,
|
||||||
|
record,
|
||||||
|
recommend,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class TestQualityScorer:
    """Tests for the QualityScorer class.

    Every test receives an isolated on-disk SQLite database through the
    ``temp_db`` fixture, so recorded responses never leak between tests.
    """

    @pytest.fixture
    def temp_db(self):
        """Create a temporary database for testing."""
        # delete=False keeps the file on disk after the context manager
        # closes the handle; the fixture removes it itself after the test.
        with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as f:
            db_path = Path(f.name)
        yield db_path
        db_path.unlink(missing_ok=True)

    @pytest.fixture
    def scorer(self, temp_db):
        """Create a fresh QualityScorer with temp database."""
        return QualityScorer(db_path=temp_db)

    def test_init_creates_database(self, temp_db):
        """Test that initialization creates the database and tables."""
        scorer = QualityScorer(db_path=temp_db)
        assert temp_db.exists()

        # Verify schema
        conn = sqlite3.connect(str(temp_db))
        cursor = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        )
        tables = {row[0] for row in cursor.fetchall()}
        assert "responses" in tables
        conn.close()

    def test_record_response_success(self, scorer):
        """Test recording a successful response."""
        scorer.record_response(
            backend="anthropic",
            task_type=TaskType.CODE,
            status=ResponseStatus.SUCCESS,
            latency_ms=1000.0,
            ttft_ms=150.0,
            metadata={"model": "claude-3-opus"}
        )

        score = scorer.get_backend_score("anthropic", TaskType.CODE.value)
        assert score.total_requests == 1
        assert score.success_count == 1
        assert score.error_count == 0

    def test_record_response_error(self, scorer):
        """Test recording an error response."""
        scorer.record_response(
            backend="groq",
            task_type=TaskType.FAST_OPS,
            status=ResponseStatus.ERROR,
            latency_ms=500.0,
            ttft_ms=50.0
        )

        score = scorer.get_backend_score("groq", TaskType.FAST_OPS.value)
        assert score.total_requests == 1
        assert score.success_count == 0
        assert score.error_count == 1

    def test_record_response_refusal(self, scorer):
        """Test recording a refusal response."""
        scorer.record_response(
            backend="gemini",
            task_type=TaskType.CREATIVE,
            status=ResponseStatus.REFUSAL,
            latency_ms=300.0,
            ttft_ms=100.0
        )

        score = scorer.get_backend_score("gemini", TaskType.CREATIVE.value)
        assert score.refusal_count == 1

    def test_record_response_timeout(self, scorer):
        """Test recording a timeout response."""
        # ttft_ms=0.0: a timed-out request never produced a first token.
        scorer.record_response(
            backend="openrouter",
            task_type=TaskType.RESEARCH,
            status=ResponseStatus.TIMEOUT,
            latency_ms=30000.0,
            ttft_ms=0.0
        )

        score = scorer.get_backend_score("openrouter", TaskType.RESEARCH.value)
        assert score.timeout_count == 1

    def test_record_invalid_backend(self, scorer):
        """Test that invalid backend raises ValueError."""
        with pytest.raises(ValueError, match="Unknown backend"):
            scorer.record_response(
                backend="invalid-backend",
                task_type=TaskType.CODE,
                status=ResponseStatus.SUCCESS,
                latency_ms=1000.0,
                ttft_ms=100.0
            )

    def test_rolling_window_pruning(self, scorer):
        """Test that old records are pruned beyond window size."""
        # Add more than ROLLING_WINDOW_SIZE records
        # (assumes ROLLING_WINDOW_SIZE is 100 — see assertion below).
        for i in range(110):
            scorer.record_response(
                backend="kimi-coding",
                task_type=TaskType.CODE,
                status=ResponseStatus.SUCCESS,
                latency_ms=float(i),
                ttft_ms=50.0
            )

        # Should only have 100 records
        stats = scorer.get_stats()
        assert stats["by_backend"]["kimi-coding"] == 100

    def test_recommend_backend_basic(self, scorer):
        """Test backend recommendation with sample data."""
        # Add some data for multiple backends: 8 successes + 2 errors each,
        # with anthropic twice as slow as the others.
        for backend in ["anthropic", "groq", "gemini"]:
            for i in range(10):
                scorer.record_response(
                    backend=backend,
                    task_type=TaskType.CODE,
                    status=ResponseStatus.SUCCESS if i < 8 else ResponseStatus.ERROR,
                    latency_ms=1000.0 if backend == "anthropic" else 500.0,
                    ttft_ms=200.0
                )

        recommendations = scorer.recommend_backend(TaskType.CODE.value)

        # Should return all 7 backends
        assert len(recommendations) == 7

        # Top 3 should have scores
        top_3 = [b for b, s in recommendations[:3]]
        assert "groq" in top_3  # Fastest latency should win

    def test_recommend_backend_insufficient_data(self, scorer):
        """Test recommendation with insufficient samples."""
        # Add only 2 samples for one backend
        for i in range(2):
            scorer.record_response(
                backend="anthropic",
                task_type=TaskType.CODE,
                status=ResponseStatus.SUCCESS,
                latency_ms=1000.0,
                ttft_ms=200.0
            )

        recommendations = scorer.recommend_backend(TaskType.CODE.value, min_samples=5)

        # Should penalize low-sample backend
        anthropic_score = next(s for b, s in recommendations if b == "anthropic")
        assert anthropic_score < 50  # Penalized for low samples

    def test_get_all_scores(self, scorer):
        """Test getting scores for all backends."""
        # Add data for some backends
        for backend in ["anthropic", "groq"]:
            scorer.record_response(
                backend=backend,
                task_type=TaskType.REASONING,
                status=ResponseStatus.SUCCESS,
                latency_ms=1000.0,
                ttft_ms=200.0
            )

        all_scores = scorer.get_all_scores(TaskType.REASONING.value)

        # Backends with no traffic still appear, with zeroed counters.
        assert len(all_scores) == 7
        assert all_scores["anthropic"].total_requests == 1
        assert all_scores["groq"].total_requests == 1
        assert all_scores["gemini"].total_requests == 0

    def test_get_task_breakdown(self, scorer):
        """Test getting per-task breakdown for a backend."""
        # Add data for different task types
        scorer.record_response(
            backend="anthropic",
            task_type=TaskType.CODE,
            status=ResponseStatus.SUCCESS,
            latency_ms=1000.0,
            ttft_ms=200.0
        )
        scorer.record_response(
            backend="anthropic",
            task_type=TaskType.REASONING,
            status=ResponseStatus.SUCCESS,
            latency_ms=2000.0,
            ttft_ms=300.0
        )

        breakdown = scorer.get_task_breakdown("anthropic")

        # NOTE(review): asserts quality_scorer's TaskType has exactly 5
        # members — confirm if task types are ever added.
        assert len(breakdown) == 5  # 5 task types
        assert breakdown["code"].total_requests == 1
        assert breakdown["reasoning"].total_requests == 1

    def test_score_calculation(self, scorer):
        """Test the composite score calculation."""
        # Add perfect responses
        for i in range(10):
            scorer.record_response(
                backend="anthropic",
                task_type=TaskType.CODE,
                status=ResponseStatus.SUCCESS,
                latency_ms=100.0,  # Very fast
                ttft_ms=50.0
            )

        score = scorer.get_backend_score("anthropic", TaskType.CODE.value)

        # Should have high score for perfect performance
        assert score.score > 90
        assert score.success_count == 10
        assert score.avg_latency_ms == 100.0

    def test_score_with_errors(self, scorer):
        """Test scoring with mixed success/error."""
        for i in range(5):
            scorer.record_response(
                backend="grok",
                task_type=TaskType.RESEARCH,
                status=ResponseStatus.SUCCESS,
                latency_ms=1000.0,
                ttft_ms=200.0
            )
        for i in range(5):
            scorer.record_response(
                backend="grok",
                task_type=TaskType.RESEARCH,
                status=ResponseStatus.ERROR,
                latency_ms=500.0,
                ttft_ms=100.0
            )

        score = scorer.get_backend_score("grok", TaskType.RESEARCH.value)

        assert score.total_requests == 10
        assert score.success_count == 5
        assert score.error_count == 5
        # Score: 50% success + low error penalty = ~71 with good latency
        assert 60 < score.score < 80

    def test_p95_calculation(self, scorer):
        """Test P95 latency calculation."""
        # Add latencies from 1ms to 100ms
        for i in range(1, 101):
            scorer.record_response(
                backend="anthropic",
                task_type=TaskType.CODE,
                status=ResponseStatus.SUCCESS,
                latency_ms=float(i),
                ttft_ms=50.0
            )

        score = scorer.get_backend_score("anthropic", TaskType.CODE.value)

        # P95 should be around 95
        assert 90 <= score.p95_latency_ms <= 100

    def test_clear_data(self, scorer):
        """Test clearing all data."""
        scorer.record_response(
            backend="anthropic",
            task_type=TaskType.CODE,
            status=ResponseStatus.SUCCESS,
            latency_ms=1000.0,
            ttft_ms=200.0
        )

        scorer.clear_data()

        stats = scorer.get_stats()
        assert stats["total_records"] == 0

    def test_string_task_type(self, scorer):
        """Test that string task types work alongside TaskType enum."""
        scorer.record_response(
            backend="openai-codex",
            task_type="code",  # String instead of enum
            status=ResponseStatus.SUCCESS,
            latency_ms=1000.0,
            ttft_ms=200.0
        )

        score = scorer.get_backend_score("openai-codex", "code")
        assert score.total_requests == 1
|
||||||
|
|
||||||
|
|
||||||
|
class TestConvenienceFunctions:
    """Tests for module-level convenience functions.

    These functions resolve their database via the module-level
    ``DEFAULT_DB_PATH``, so the fixture redirects that global to a
    temporary file for the duration of each test.
    """

    @pytest.fixture
    def temp_db(self, monkeypatch):
        """Point quality_scorer.DEFAULT_DB_PATH at a temporary database."""
        # delete=False keeps the file after the handle closes; the fixture
        # removes it itself in teardown.
        with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as f:
            db_path = Path(f.name)

        # monkeypatch restores the original DEFAULT_DB_PATH automatically,
        # even if the test raises — the previous manual save/restore could
        # leak the patched global if an error occurred before yield.
        import quality_scorer
        monkeypatch.setattr(quality_scorer, "DEFAULT_DB_PATH", db_path)

        yield db_path

        db_path.unlink(missing_ok=True)

    def test_get_scorer(self, temp_db):
        """Test get_scorer convenience function."""
        scorer = get_scorer()
        assert isinstance(scorer, QualityScorer)

    def test_record_convenience(self, temp_db):
        """Test record convenience function."""
        record(
            backend="anthropic",
            task_type="code",
            status="success",
            latency_ms=1000.0,
            ttft_ms=200.0
        )

        # NOTE(review): assumes get_scorer() reads DEFAULT_DB_PATH at call
        # time (no cached singleton) — confirm in quality_scorer.
        scorer = get_scorer()
        score = scorer.get_backend_score("anthropic", "code")
        assert score.total_requests == 1

    def test_recommend_convenience(self, temp_db):
        """Test recommend convenience function."""
        record(
            backend="anthropic",
            task_type="code",
            status="success",
            latency_ms=1000.0,
            ttft_ms=200.0
        )

        recs = recommend("code")
        assert len(recs) == 7
        assert recs[0][0] == "anthropic"  # Should rank first since it has data
|
||||||
|
|
||||||
|
|
||||||
|
class TestPrintFunctions:
    """Smoke tests for the report-printing helpers."""

    @pytest.fixture
    def populated_scorer(self):
        """Yield a QualityScorer pre-filled with seeded demo traffic."""
        with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as f:
            db_path = Path(f.name)

        scorer = QualityScorer(db_path=db_path)

        # Seeding the RNG keeps the demo data deterministic across runs.
        import random
        random.seed(42)

        # Loop-invariant outcome distribution, hoisted out of the loops.
        outcomes = [ResponseStatus.SUCCESS, ResponseStatus.ERROR,
                    ResponseStatus.REFUSAL, ResponseStatus.TIMEOUT]
        outcome_weights = [0.85, 0.08, 0.05, 0.02]

        for backend in BACKENDS:
            # The heavyweight backends get a higher mean latency.
            base_latency = 1000 if backend in ["anthropic", "openai-codex"] else 500
            for task in TaskType:
                for _ in range(20):
                    scorer.record_response(
                        backend=backend,
                        task_type=task.value,
                        status=random.choices(outcomes, weights=outcome_weights)[0],
                        latency_ms=random.gauss(base_latency, 200),
                        ttft_ms=random.gauss(150, 50),
                    )

        yield scorer
        db_path.unlink(missing_ok=True)

    def test_print_score_report(self, populated_scorer, capsys):
        """print_score_report runs cleanly and emits the headline."""
        print_score_report(populated_scorer)
        out = capsys.readouterr().out
        assert "UNIWIZARD BACKEND QUALITY SCORES" in out
        assert "anthropic" in out

    def test_print_full_report(self, populated_scorer, capsys):
        """print_full_report runs cleanly and emits both sections."""
        print_full_report(populated_scorer)
        out = capsys.readouterr().out
        assert "PER-TASK SPECIALIZATION" in out
        assert "RECOMMENDATIONS" in out
|
||||||
|
|
||||||
|
|
||||||
|
class TestEdgeCases:
    """Edge-case and error-handling coverage for the quality scorer."""

    @pytest.fixture
    def temp_db(self):
        """Yield a throwaway SQLite database path, removed on teardown."""
        with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as handle:
            path = Path(handle.name)
        yield path
        path.unlink(missing_ok=True)

    @pytest.fixture
    def scorer(self, temp_db):
        """A fresh QualityScorer backed by the temporary database."""
        return QualityScorer(db_path=temp_db)

    def test_empty_database(self, scorer):
        """With no recorded traffic every statistic reads as zero."""
        result = scorer.get_backend_score("anthropic", TaskType.CODE.value)

        assert result.total_requests == 0
        assert result.score == 0.0
        assert result.avg_latency_ms == 0.0

    def test_invalid_backend_in_get_score(self, scorer):
        """An unrecognised backend name is rejected by get_backend_score."""
        with pytest.raises(ValueError, match="Unknown backend"):
            scorer.get_backend_score("invalid")

    def test_invalid_backend_in_breakdown(self, scorer):
        """An unrecognised backend name is rejected by get_task_breakdown."""
        with pytest.raises(ValueError, match="Unknown backend"):
            scorer.get_task_breakdown("invalid")

    def test_zero_latency(self, scorer):
        """A zero-millisecond response is accepted and still scores well."""
        scorer.record_response(
            backend="groq",
            task_type=TaskType.FAST_OPS,
            status=ResponseStatus.SUCCESS,
            latency_ms=0.0,
            ttft_ms=0.0,
        )

        result = scorer.get_backend_score("groq", TaskType.FAST_OPS.value)
        assert result.avg_latency_ms == 0.0
        assert result.score > 50  # zero latency should not hurt the score

    def test_very_high_latency(self, scorer):
        """Extreme latency drags the score down despite full success."""
        scorer.record_response(
            backend="openrouter",
            task_type=TaskType.RESEARCH,
            status=ResponseStatus.SUCCESS,
            latency_ms=50000.0,  # 50 seconds
            ttft_ms=5000.0,
        )

        result = scorer.get_backend_score("openrouter", TaskType.RESEARCH.value)
        # Success rate is 100%, but the latency penalty brings it down.
        assert result.score < 85

    def test_all_error_responses(self, scorer):
        """Ten straight errors leave the backend with a low score."""
        for _ in range(10):
            scorer.record_response(
                backend="gemini",
                task_type=TaskType.CODE,
                status=ResponseStatus.ERROR,
                latency_ms=1000.0,
                ttft_ms=200.0,
            )

        result = scorer.get_backend_score("gemini", TaskType.CODE.value)
        # 0% success but perfect refusal/timeout rates land around ~35.
        assert result.score < 45

    def test_all_refusal_responses(self, scorer):
        """Ten straight refusals are counted and keep the score low."""
        for _ in range(10):
            scorer.record_response(
                backend="gemini",
                task_type=TaskType.CREATIVE,
                status=ResponseStatus.REFUSAL,
                latency_ms=500.0,
                ttft_ms=100.0,
            )

        result = scorer.get_backend_score("gemini", TaskType.CREATIVE.value)
        assert result.refusal_count == 10
        # 0% success, 0% error, 100% refusal with good latency is ~49.
        assert result.score < 55

    def test_metadata_storage(self, scorer):
        """Metadata passed with a response round-trips through SQLite."""
        import json

        scorer.record_response(
            backend="anthropic",
            task_type=TaskType.CODE,
            status=ResponseStatus.SUCCESS,
            latency_ms=1000.0,
            ttft_ms=200.0,
            metadata={"model": "claude-3-opus", "region": "us-east-1"},
        )

        # Read the raw row straight from the backing database.
        conn = sqlite3.connect(str(scorer.db_path))
        row = conn.execute("SELECT metadata FROM responses LIMIT 1").fetchone()
        conn.close()

        stored = json.loads(row[0])
        assert stored["model"] == "claude-3-opus"
||||||
|
# Allow running this test module directly: delegates to pytest's CLI runner.
if __name__ == "__main__":
    pytest.main([__file__, "-v"])
|
||||||
498
uniwizard/test_self_grader.py
Normal file
498
uniwizard/test_self_grader.py
Normal file
@@ -0,0 +1,498 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Tests for the Self-Grader Module
|
||||||
|
|
||||||
|
Run with: python -m pytest test_self_grader.py -v
|
||||||
|
"""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sqlite3
|
||||||
|
import tempfile
|
||||||
|
from pathlib import Path
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from self_grader import SelfGrader, SessionGrade, WeeklyReport
|
||||||
|
|
||||||
|
|
||||||
|
class TestSessionGrade:
    """Behaviour of the SessionGrade dataclass."""

    def test_session_grade_creation(self):
        """Every field handed to the constructor is readable back."""
        fields = dict(
            session_id="test-123",
            session_file="session_test.json",
            graded_at=datetime.now().isoformat(),
            task_completed=True,
            tool_calls_efficient=4,
            response_quality=5,
            errors_recovered=True,
            total_api_calls=10,
            model="claude-opus",
            platform="cli",
            session_start=datetime.now().isoformat(),
            duration_seconds=120.0,
            task_summary="Test task",
            total_errors=0,
            error_types="[]",
            tools_with_errors="[]",
            had_repeated_errors=False,
            had_infinite_loop_risk=False,
            had_user_clarification=False,
        )
        grade = SessionGrade(**fields)

        assert grade.session_id == "test-123"
        assert grade.task_completed is True
        assert grade.tool_calls_efficient == 4
        assert grade.response_quality == 5
||||||
|
class TestSelfGraderInit:
    """Construction-time behaviour of SelfGrader."""

    def test_init_creates_database(self, tmp_path):
        """Instantiating SelfGrader materialises the grades DB and tables."""
        db_path = tmp_path / "grades.db"
        sessions_dir = tmp_path / "sessions"
        sessions_dir.mkdir()

        SelfGrader(grades_db_path=db_path, sessions_dir=sessions_dir)

        assert db_path.exists()

        # The schema must contain both grading tables.
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type='table'"
            ).fetchall()
        table_names = {name for (name,) in rows}

        assert "session_grades" in table_names
        assert "weekly_reports" in table_names
||||||
|
class TestErrorDetection:
    """Error detection and classification heuristics."""

    @staticmethod
    def _grader(tmp_path):
        """Construct a SelfGrader rooted in the given temp directory."""
        return SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=tmp_path / "sessions",
        )

    def test_detect_exit_code_error(self, tmp_path):
        """Non-zero exit codes in tool output count as errors."""
        grader = self._grader(tmp_path)

        assert grader._detect_error('{"exit_code": 1, "output": ""}') is True
        assert grader._detect_error('{"exit_code": 0, "output": "success"}') is False
        assert grader._detect_error('') is False

    def test_detect_explicit_error(self, tmp_path):
        """Explicit error markers and tracebacks are recognised."""
        grader = self._grader(tmp_path)

        assert grader._detect_error('{"error": "file not found"}') is True
        assert grader._detect_error('Traceback (most recent call last):') is True
        assert grader._detect_error('Command failed with exception') is True

    def test_classify_file_not_found(self, tmp_path):
        """Missing-file messages classify as file_not_found."""
        grader = self._grader(tmp_path)

        message = "Error: file '/path/to/file' not found"
        assert grader._classify_error(message) == "file_not_found"

    def test_classify_timeout(self, tmp_path):
        """Timeout messages classify as timeout."""
        grader = self._grader(tmp_path)

        message = "Request timed out after 30 seconds"
        assert grader._classify_error(message) == "timeout"

    def test_classify_unknown(self, tmp_path):
        """Anything unrecognised falls through to the unknown bucket."""
        grader = self._grader(tmp_path)

        message = "Something weird happened"
        assert grader._classify_error(message) == "unknown"
||||||
|
class TestSessionAnalysis:
    """Message-transcript analysis."""

    @staticmethod
    def _grader(tmp_path):
        """Construct a SelfGrader rooted in the given temp directory."""
        return SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=tmp_path / "sessions",
        )

    def test_analyze_empty_messages(self, tmp_path):
        """An empty transcript produces all-zero analysis counters."""
        analysis = self._grader(tmp_path)._analyze_messages([])

        assert analysis['total_api_calls'] == 0
        assert analysis['total_errors'] == 0
        assert analysis['had_repeated_errors'] is False

    def test_analyze_simple_session(self, tmp_path):
        """A single exchange counts as one API call with no errors."""
        transcript = [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Hi there!"},
        ]

        analysis = self._grader(tmp_path)._analyze_messages(transcript)

        assert analysis['total_api_calls'] == 1
        assert analysis['total_errors'] == 0

    def test_analyze_session_with_errors(self, tmp_path):
        """A failed-then-retried tool call is tallied as one error."""
        transcript = [
            {"role": "user", "content": "Run command"},
            {"role": "assistant", "content": "", "tool_calls": [
                {"function": {"name": "terminal"}}
            ]},
            {"role": "tool", "name": "terminal", "content": '{"exit_code": 1, "error": "failed"}'},
            {"role": "assistant", "content": "Let me try again", "tool_calls": [
                {"function": {"name": "terminal"}}
            ]},
            {"role": "tool", "name": "terminal", "content": '{"exit_code": 0, "output": "success"}'},
        ]

        analysis = self._grader(tmp_path)._analyze_messages(transcript)

        assert analysis['total_api_calls'] == 2
        assert analysis['total_errors'] == 1
        assert analysis['tools_with_errors'] == {"terminal"}

    def test_detect_repeated_errors(self, tmp_path):
        """Five identical failures trip the repeat/loop-risk flags."""
        transcript = []
        for _ in range(5):
            transcript.append({"role": "assistant", "content": "", "tool_calls": [
                {"function": {"name": "terminal"}}
            ]})
            transcript.append({"role": "tool", "name": "terminal",
                               "content": '{"exit_code": 1, "error": "failed"}'})

        analysis = self._grader(tmp_path)._analyze_messages(transcript)

        assert analysis['had_repeated_errors'] is True
        assert analysis['had_infinite_loop_risk'] is True
||||||
|
class TestGradingLogic:
    """Individual grading heuristics."""

    @staticmethod
    def _grader(tmp_path):
        """Construct a SelfGrader rooted in the given temp directory."""
        return SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=tmp_path / "sessions",
        )

    def test_assess_task_completion_success(self, tmp_path):
        """A confirmed completion message is judged as task done."""
        grader = self._grader(tmp_path)
        transcript = [
            {"role": "user", "content": "Create a file"},
            {"role": "assistant", "content": "Done! Created the file successfully."},
        ]

        analysis = grader._analyze_messages(transcript)
        assert grader._assess_task_completion(transcript, analysis) is True

    def test_assess_tool_efficiency_perfect(self, tmp_path):
        """Zero errors over a handful of calls scores the maximum 5."""
        grader = self._grader(tmp_path)

        score = grader._assess_tool_efficiency({
            'total_api_calls': 5,
            'total_errors': 0,
        })
        assert score == 5

    def test_assess_tool_efficiency_poor(self, tmp_path):
        """A 50% error rate scores 2 or worse."""
        grader = self._grader(tmp_path)

        score = grader._assess_tool_efficiency({
            'total_api_calls': 10,
            'total_errors': 5,
        })
        assert score <= 2

    def test_assess_response_quality_high(self, tmp_path):
        """Substantial content with a code block rates at least 4."""
        grader = self._grader(tmp_path)

        transcript = [
            {"role": "assistant", "content": "Here's the solution:\n```python\nprint('hello')\n```\n" + "x" * 1000}
        ]
        analysis = {
            'final_assistant_msg': transcript[0],
            'total_errors': 0,
            'had_repeated_errors': False,
            'had_infinite_loop_risk': False,
        }

        assert grader._assess_response_quality(transcript, analysis) >= 4

    def test_error_recovery_success(self, tmp_path):
        """A single error followed by a success message counts as recovered."""
        grader = self._grader(tmp_path)

        analysis = {
            'total_errors': 1,
            'had_repeated_errors': False,
        }
        transcript = [
            {"role": "assistant", "content": "Success after retry!"}
        ]

        assert grader._assess_error_recovery(transcript, analysis) is True
||||||
|
class TestSessionGrading:
    """End-to-end grading of session files."""

    def test_grade_simple_session(self, tmp_path):
        """A minimal successful session file grades as completed."""
        sessions_dir = tmp_path / "sessions"
        sessions_dir.mkdir()

        # Write a two-message session transcript to disk.
        payload = {
            "session_id": "test-session-1",
            "model": "test-model",
            "platform": "cli",
            "session_start": datetime.now().isoformat(),
            "message_count": 2,
            "messages": [
                {"role": "user", "content": "Hello, create a test file"},
                {"role": "assistant", "content": "Done! Created test.txt successfully."}
            ],
        }
        session_file = sessions_dir / "session_test-session-1.json"
        session_file.write_text(json.dumps(payload))

        grader = SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=sessions_dir,
        )

        grade = grader.grade_session_file(session_file)

        assert grade is not None
        assert grade.session_id == "test-session-1"
        assert grade.task_completed is True
        assert grade.total_api_calls == 1

    def test_save_and_retrieve_grade(self, tmp_path):
        """A saved grade is persisted into the session_grades table."""
        sessions_dir = tmp_path / "sessions"
        sessions_dir.mkdir()

        grader = SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=sessions_dir,
        )

        grade = SessionGrade(**dict(
            session_id="test-save",
            session_file="test.json",
            graded_at=datetime.now().isoformat(),
            task_completed=True,
            tool_calls_efficient=4,
            response_quality=5,
            errors_recovered=True,
            total_api_calls=10,
            model="test-model",
            platform="cli",
            session_start=datetime.now().isoformat(),
            duration_seconds=60.0,
            task_summary="Test",
            total_errors=0,
            error_types="[]",
            tools_with_errors="[]",
            had_repeated_errors=False,
            had_infinite_loop_risk=False,
            had_user_clarification=False,
        ))

        assert grader.save_grade(grade) is True

        # Confirm the row landed in the backing database.
        with sqlite3.connect(tmp_path / "grades.db") as conn:
            rows = conn.execute(
                "SELECT session_id, task_completed FROM session_grades"
            ).fetchall()

        assert len(rows) == 1
        assert rows[0][0] == "test-save"
        assert rows[0][1] == 1
||||||
|
class TestPatternIdentification:
    """Pattern mining over stored grades."""

    def test_identify_patterns_empty(self, tmp_path):
        """With no stored grades the pattern summary is all zeros."""
        sessions_dir = tmp_path / "sessions"
        sessions_dir.mkdir()

        grader = SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=sessions_dir,
        )

        patterns = grader.identify_patterns(days=7)

        assert patterns['total_sessions'] == 0
        assert patterns['avg_tool_efficiency'] == 0

    def test_infer_task_type(self, tmp_path):
        """Prompts map onto the expected coarse task categories."""
        grader = SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=tmp_path / "sessions",
        )

        assert grader._infer_task_type("Please review this code") == "code_review"
        assert grader._infer_task_type("Fix the bug in login") == "debugging"
        assert grader._infer_task_type("Add a new feature") == "feature_impl"
        assert grader._infer_task_type("Do something random") == "general"
||||||
|
class TestWeeklyReport:
    """Weekly report generation."""

    def test_generate_weekly_report_empty(self, tmp_path):
        """An empty database still yields a report with suggestions."""
        sessions_dir = tmp_path / "sessions"
        sessions_dir.mkdir()

        grader = SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=sessions_dir,
        )

        report = grader.generate_weekly_report()

        assert report.total_sessions == 0
        assert report.avg_tool_efficiency == 0
        assert len(report.improvement_suggestions) > 0

    def test_generate_suggestions(self, tmp_path):
        """Weak metrics trigger at least one completion-rate suggestion."""
        grader = SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=tmp_path / "sessions",
        )

        weak_metrics = {
            'completion_rate': 50,
            'avg_tool_efficiency': 2,
            'error_recovery_rate': 70,
        }

        suggestions = grader._generate_suggestions(
            weak_metrics,
            [('code_review', 2.0)],
            [('terminal', 5)],
            [('file_not_found', 3)],
        )

        assert len(suggestions) > 0
        assert any('completion rate' in s.lower() for s in suggestions)
||||||
|
class TestGradeLatestSessions:
    """Batch grading of the most recent sessions."""

    def test_grade_latest_skips_graded(self, tmp_path):
        """A session already graded once is not graded again."""
        sessions_dir = tmp_path / "sessions"
        sessions_dir.mkdir()

        payload = {
            "session_id": "already-graded",
            "model": "test",
            "messages": [
                {"role": "user", "content": "Test"},
                {"role": "assistant", "content": "Done"}
            ],
        }
        session_file = sessions_dir / "session_already-graded.json"
        session_file.write_text(json.dumps(payload))

        grader = SelfGrader(
            grades_db_path=tmp_path / "grades.db",
            sessions_dir=sessions_dir,
        )

        # First pass grades the file; second pass finds nothing new.
        assert len(grader.grade_latest_sessions(n=10)) == 1
        assert len(grader.grade_latest_sessions(n=10)) == 0
||||||
|
def test_main_cli():
    """The self_grader module exposes a callable CLI entry point."""
    import self_grader
    assert callable(self_grader.main)
||||||
|
# Allow running this test module directly: delegates to pytest's CLI runner.
if __name__ == '__main__':
    pytest.main([__file__, '-v'])
|
||||||
501
uniwizard/test_task_classifier.py
Normal file
501
uniwizard/test_task_classifier.py
Normal file
@@ -0,0 +1,501 @@
|
|||||||
|
"""
|
||||||
|
Unit tests for the TaskClassifier module.
|
||||||
|
|
||||||
|
Run with: python -m pytest test_task_classifier.py -v
|
||||||
|
"""
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from typing import Dict, Any
|
||||||
|
|
||||||
|
from task_classifier import (
|
||||||
|
TaskClassifier,
|
||||||
|
TaskType,
|
||||||
|
ComplexityLevel,
|
||||||
|
ClassificationResult,
|
||||||
|
classify_prompt,
|
||||||
|
BACKEND_ANTHROPIC,
|
||||||
|
BACKEND_OPENAI_CODEX,
|
||||||
|
BACKEND_GEMINI,
|
||||||
|
BACKEND_GROQ,
|
||||||
|
BACKEND_GROK,
|
||||||
|
BACKEND_KIMI,
|
||||||
|
BACKEND_OPENROUTER,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class TestFeatureExtraction:
    """Feature extraction from raw prompt text."""

    def test_extract_basic_features(self):
        """A plain two-word prompt yields minimal counts and no code flags."""
        feats = TaskClassifier()._extract_features("Hello world")

        assert feats["char_count"] == 11
        assert feats["word_count"] == 2
        assert feats["line_count"] == 1
        assert feats["url_count"] == 0
        assert feats["code_block_count"] == 0
        assert feats["has_code"] is False

    def test_extract_url_features(self):
        """Both http and https links are detected and collected."""
        feats = TaskClassifier()._extract_features(
            "Check out https://example.com and http://test.org/path"
        )

        assert feats["url_count"] == 2
        assert len(feats["urls"]) == 2
        assert "https://example.com" in feats["urls"]

    def test_extract_code_block_features(self):
        """Fenced code blocks are counted and flip the has_code flag."""
        sample = """Here is some code:
```python
def hello():
    return "world"
```
And more:
```javascript
console.log("hi");
```
"""
        feats = TaskClassifier()._extract_features(sample)

        assert feats["code_block_count"] == 2  # two complete ``` pairs
        assert feats["has_code"] is True
        # Inline-code detection may fire on surrounding text, so only
        # require a non-negative count here.
        assert feats["inline_code_count"] >= 0

    def test_extract_inline_code_features(self):
        """Backtick-quoted spans count as inline code."""
        feats = TaskClassifier()._extract_features(
            "Use the `print()` function and `len()` method"
        )

        assert feats["inline_code_count"] == 2
        assert feats["has_code"] is True

    def test_extract_multiline_features(self):
        """Newlines split the prompt into the expected line count."""
        feats = TaskClassifier()._extract_features("Line 1\nLine 2\nLine 3")

        assert feats["line_count"] == 3
||||||
|
class TestComplexityAssessment:
    """Mapping of extracted features onto complexity levels."""

    @staticmethod
    def _features(chars, words, lines, urls, blocks):
        """Assemble the minimal feature dict _assess_complexity reads."""
        return {
            "char_count": chars,
            "word_count": words,
            "line_count": lines,
            "url_count": urls,
            "code_block_count": blocks,
        }

    def test_low_complexity_short_text(self):
        """Short, link-free prose is rated LOW."""
        level = TaskClassifier()._assess_complexity(
            self._features(100, 15, 2, 0, 0)
        )
        assert level == ComplexityLevel.LOW

    def test_medium_complexity_moderate_text(self):
        """Moderate length with one link is rated MEDIUM."""
        level = TaskClassifier()._assess_complexity(
            self._features(500, 80, 10, 1, 0)
        )
        assert level == ComplexityLevel.MEDIUM

    def test_high_complexity_long_text(self):
        """Long, link-heavy text is rated HIGH."""
        level = TaskClassifier()._assess_complexity(
            self._features(2000, 300, 50, 3, 0)
        )
        assert level == ComplexityLevel.HIGH

    def test_high_complexity_multiple_code_blocks(self):
        """Several fenced code blocks push the rating to HIGH."""
        level = TaskClassifier()._assess_complexity(
            self._features(500, 50, 20, 0, 4)
        )
        assert level == ComplexityLevel.HIGH
||||||
|
class TestTaskTypeClassification:
    """Routing of prompts onto task types."""

    @staticmethod
    def _classify(classifier, prompt):
        """Run the private type classifier over freshly extracted features."""
        return classifier._classify_task_type(
            prompt, classifier._extract_features(prompt)
        )

    def test_classify_code_task(self):
        """Implementation/debugging prompts map to CODE with confidence."""
        classifier = TaskClassifier()
        prompts = (
            "Implement a function to sort a list",
            "Debug this Python error",
            "Refactor the database query",
            "Write a test for the API endpoint",
            "Fix the bug in the authentication middleware",
        )

        for prompt in prompts:
            task_type, confidence, _reason = self._classify(classifier, prompt)
            assert task_type == TaskType.CODE, f"Failed for: {prompt}"
            assert confidence > 0, f"Zero confidence for: {prompt}"

    def test_classify_reasoning_task(self):
        """Analysis prompts land in REASONING or a closely related type."""
        classifier = TaskClassifier()
        prompts = (
            "Compare and evaluate different approaches",
            "Evaluate the security implications",
            "Think through the logical steps",
            "Step by step, deduce the cause",
            "Analyze the pros and cons",
        )

        for prompt in prompts:
            task_type, _confidence, _reason = self._classify(classifier, prompt)
            # REASONING is expected, but adjacent classifications are tolerated.
            assert task_type in (TaskType.REASONING, TaskType.CODE, TaskType.UNKNOWN), f"Failed for: {prompt}"

    def test_classify_research_task(self):
        """Research prompts land in RESEARCH or a tolerated neighbour."""
        classifier = TaskClassifier()
        prompts = (
            "Research the latest AI papers on arxiv",
            "Find studies about neural networks",
            "Search for benchmarks on https://example.com/benchmarks",
            "Survey existing literature on distributed systems",
            "Study the published papers on machine learning",
        )

        for prompt in prompts:
            task_type, _confidence, _reason = self._classify(classifier, prompt)
            assert task_type in (TaskType.RESEARCH, TaskType.FAST_OPS, TaskType.CODE), f"Got {task_type} for: {prompt}"

    def test_classify_creative_task(self):
        """Writing/design prompts map to CREATIVE."""
        classifier = TaskClassifier()
        prompts = (
            "Write a creative story about AI",
            "Design a logo concept",
            "Compose a poem about programming",
            "Brainstorm marketing slogans",
            "Create a character for a novel",
        )

        for prompt in prompts:
            task_type, _confidence, _reason = self._classify(classifier, prompt)
            assert task_type == TaskType.CREATIVE, f"Failed for: {prompt}"

    def test_classify_fast_ops_task(self):
        """Trivial greetings with no other indicators map to FAST_OPS."""
        classifier = TaskClassifier()
        prompts = (
            "Hi",
            "Hello",
            "Thanks",
            "Bye",
            "Yes",
            "No",
        )

        for prompt in prompts:
            task_type, _confidence, _reason = self._classify(classifier, prompt)
            assert task_type == TaskType.FAST_OPS, f"Failed for: {prompt}"

    def test_classify_tool_use_task(self):
        """Tool-invocation prompts land in TOOL_USE or an overlapping type."""
        classifier = TaskClassifier()
        prompts = (
            "Execute the shell command",
            "Use the browser to navigate to google.com",
            "Call the API endpoint",
            "Invoke the deployment tool",
            "Run this terminal command",
        )

        for prompt in prompts:
            task_type, _confidence, _reason = self._classify(classifier, prompt)
            # Tool use frequently overlaps with code or research (search).
            assert task_type in (TaskType.TOOL_USE, TaskType.CODE, TaskType.RESEARCH), f"Got {task_type} for: {prompt}"
||||||
|
class TestBackendSelection:
    """Preferred-backend ordering per task type."""

    def test_code_task_prefers_codex(self):
        """Code prompts rank OpenAI Codex first."""
        result = TaskClassifier().classify("Implement a Python class")

        assert result.task_type == TaskType.CODE
        assert result.preferred_backends[0] == BACKEND_OPENAI_CODEX

    def test_reasoning_task_prefers_anthropic(self):
        """Reasoning prompts rank Anthropic first."""
        result = TaskClassifier().classify("Analyze the architectural trade-offs")

        assert result.task_type == TaskType.REASONING
        assert result.preferred_backends[0] == BACKEND_ANTHROPIC

    def test_research_task_prefers_gemini(self):
        """Research prompts rank Gemini first."""
        result = TaskClassifier().classify("Research the latest papers on transformers")

        assert result.task_type == TaskType.RESEARCH
        assert result.preferred_backends[0] == BACKEND_GEMINI

    def test_creative_task_prefers_grok(self):
        """Creative prompts rank Grok first."""
        result = TaskClassifier().classify("Write a creative story")

        assert result.task_type == TaskType.CREATIVE
        assert result.preferred_backends[0] == BACKEND_GROK

    def test_fast_ops_task_prefers_groq(self):
        """Quick operational prompts rank Groq first."""
        result = TaskClassifier().classify("Quick status check")

        assert result.task_type == TaskType.FAST_OPS
        assert result.preferred_backends[0] == BACKEND_GROQ

    def test_tool_use_task_prefers_anthropic(self):
        """Tool-use prompts keep Anthropic near the top of the ranking."""
        result = TaskClassifier().classify("Execute the shell command and use tools")

        # Tool use may overlap with code, but Anthropic should rank top-two.
        assert result.task_type in (TaskType.TOOL_USE, TaskType.CODE)
        assert BACKEND_ANTHROPIC in result.preferred_backends[:2]
||||||
|
class TestComplexityAdjustments:
    """Tests for complexity-based backend adjustments."""

    def test_high_complexity_boosts_kimi_for_research(self):
        """High-complexity research prompts should promote Kimi into the top three."""
        # Long research prompt designed to trip the high-complexity heuristic.
        verbose_prompt = "Research " + "machine learning " * 200

        outcome = TaskClassifier().classify(verbose_prompt)

        # The boost rule only applies when the classifier actually landed on
        # high-complexity research; otherwise there is nothing to assert.
        if outcome.task_type != TaskType.RESEARCH or outcome.complexity != ComplexityLevel.HIGH:
            return
        # Kimi should be in top 3 for high complexity research
        assert BACKEND_KIMI in outcome.preferred_backends[:3]

    def test_code_blocks_boost_codex(self):
        """A fenced code block should lift Codex even in a non-code prompt."""
        prompt = """Tell me a story about:
```python
def hello():
    pass
```
"""
        outcome = TaskClassifier().classify(prompt)

        # Codex should be in top 3 due to code presence
        assert BACKEND_OPENAI_CODEX in outcome.preferred_backends[:3]
||||||
|
|
||||||
|
class TestEdgeCases:
    """Tests for edge cases."""

    def test_empty_prompt(self):
        """An empty prompt yields UNKNOWN / LOW with zero confidence."""
        outcome = TaskClassifier().classify("")

        assert outcome.task_type == TaskType.UNKNOWN
        assert outcome.complexity == ComplexityLevel.LOW
        assert outcome.confidence == 0.0

    def test_whitespace_only_prompt(self):
        """A whitespace-only prompt is treated like an empty one."""
        outcome = TaskClassifier().classify("   \n\t  ")

        assert outcome.task_type == TaskType.UNKNOWN

    def test_very_long_prompt(self):
        """A very long prompt registers as HIGH complexity with the full backend list."""
        outcome = TaskClassifier().classify("word " * 10000)

        assert outcome.complexity == ComplexityLevel.HIGH
        assert len(outcome.preferred_backends) == 7

    def test_mixed_task_indicators(self):
        """Prompts mixing several task signals still resolve to one type."""
        # This has both code and creative indicators
        outcome = TaskClassifier().classify(
            "Write a creative Python script that generates poetry"
        )

        # Should pick one task type with reasonable confidence
        assert outcome.confidence > 0
        assert outcome.task_type in (TaskType.CODE, TaskType.CREATIVE)
|
|
||||||
|
|
||||||
|
class TestDictionaryOutput:
    """Tests for dictionary output format."""

    def test_to_dict_output(self):
        """to_dict exposes every expected key with the right value types."""
        clf = TaskClassifier()
        payload = clf.to_dict(clf.classify("Implement a function"))

        expected_keys = (
            "task_type",
            "preferred_backends",
            "complexity",
            "reason",
            "confidence",
            "features",
        )
        for key in expected_keys:
            assert key in payload

        assert isinstance(payload["task_type"], str)
        assert isinstance(payload["preferred_backends"], list)
        assert isinstance(payload["complexity"], str)
        assert isinstance(payload["confidence"], float)

    def test_classify_prompt_convenience_function(self):
        """The module-level convenience wrapper returns a populated dict."""
        payload = classify_prompt("Debug this error")

        assert payload["task_type"] == "code"
        assert len(payload["preferred_backends"]) > 0
        assert payload["complexity"] in ("low", "medium", "high")
        assert "reason" in payload
|
|
||||||
|
|
||||||
|
class TestClassificationResult:
    """Tests for the ClassificationResult dataclass."""

    def test_result_creation(self):
        """A ClassificationResult retains the fields it was constructed with."""
        res = ClassificationResult(
            task_type=TaskType.CODE,
            preferred_backends=[BACKEND_OPENAI_CODEX, BACKEND_ANTHROPIC],
            complexity=ComplexityLevel.MEDIUM,
            reason="Contains code keywords",
            confidence=0.85,
            features={"word_count": 50},
        )

        assert res.task_type == TaskType.CODE
        assert res.preferred_backends[0] == BACKEND_OPENAI_CODEX
        assert res.complexity == ComplexityLevel.MEDIUM
        assert res.confidence == 0.85
||||||
|
|
||||||
|
# Integration tests
|
||||||
|
class TestIntegration:
    """Integration tests with realistic prompts."""

    def test_code_review_scenario(self):
        """A code-review request classifies as code/reasoning with non-trivial complexity."""
        prompt = """Please review this code for potential issues:

```python
def process_data(data):
    result = []
    for item in data:
        result.append(item * 2)
    return result
```

I'm concerned about memory usage with large datasets."""

        payload = classify_prompt(prompt)

        assert payload["task_type"] in ("code", "reasoning")
        assert payload["complexity"] in ("medium", "high")
        assert len(payload["preferred_backends"]) == 7
        assert payload["confidence"] > 0

    def test_research_with_urls_scenario(self):
        """A paper-survey request classifies as research and counts both URLs."""
        prompt = """Research the findings from these papers:
- https://arxiv.org/abs/2301.00001
- https://papers.nips.cc/paper/2022/hash/xxx

Summarize the key contributions and compare methodologies."""

        payload = classify_prompt(prompt)

        assert payload["task_type"] == "research"
        assert payload["features"]["url_count"] == 2
        assert payload["complexity"] in ("medium", "high")

    def test_simple_greeting_scenario(self):
        """A simple greeting routes to the fast/cheap backend."""
        payload = classify_prompt("Hello! How are you doing today?")

        assert payload["task_type"] == "fast_ops"
        assert payload["complexity"] == "low"
        assert payload["preferred_backends"][0] == BACKEND_GROQ
|
||||||
|
|
||||||
|
if __name__ == "__main__":
    # pytest.main() returns the exit status; raise it through SystemExit so
    # that running this file directly exits non-zero on test failures
    # (previously the return value was discarded and the script always
    # exited 0, masking failures from CI and shell callers).
    raise SystemExit(pytest.main([__file__, "-v"]))
|
||||||
Reference in New Issue
Block a user