Build backend quality scoring and performance tracking #98

New Issue

Timmy · 2026-03-30T15:40:27Z

Timmy commented

2026-03-30 15:40:27 +00:00

Objective

Track which backends perform best on which task types, so Timmy's routing gets smarter over time. This feeds into the self-improvement loop (#89).

Parent Epic

#94 — Grand Timmy: The Uniwizard

What to Track Per Backend

Latency: time-to-first-token, total response time
Quality: self-grade scores (accuracy, completeness, efficiency)
Reliability: success rate, error rate, timeout rate
Cost efficiency: quality per dollar

Score Card Per Backend

{
    "backend": "claude",
    "period": "2026-04-W1",
    "tasks_routed": 45,
    "avg_latency_ms": 2800,
    "avg_quality_score": 4.2,
    "success_rate": 0.98,
    "cost_total_usd": 3.45,
    "quality_per_dollar": 54.8,
    "best_task_types": ["reasoning", "code-review"],
    "worst_task_types": ["simple-file-ops"]
}

Feedback Loop

Quality scores feed back into the classifier:

If Claude consistently scores 4.5+ on reasoning but Kimi scores 3.8, increase Claude's weight for reasoning tasks
If local scores 4.0+ on file operations, never escalate those to cloud
Weekly auto-adjustment of routing weights

In Evennia

Observatory room shows live backend scorecards
scoreboard command shows backend rankings
quality <backend> shows detailed history

Deliverables

agent/quality_tracker.py — per-call quality logging
agent/scorecard.py — weekly scorecard generation
agent/routing_optimizer.py — weight adjustment from scores
Integration with self-grading (#89) and router

Acceptance Criteria

Every routed task gets a quality score
Weekly scorecards generated automatically
Routing weights update based on scores
Can answer "which backend is best for X?"
Quality improvement measurable over 4 weeks

## Objective Track which backends perform best on which task types, so Timmy's routing gets smarter over time. This feeds into the self-improvement loop (#89). ## Parent Epic #94 — Grand Timmy: The Uniwizard ## What to Track Per Backend - **Latency**: time-to-first-token, total response time - **Quality**: self-grade scores (accuracy, completeness, efficiency) - **Reliability**: success rate, error rate, timeout rate - **Cost efficiency**: quality per dollar ### Score Card Per Backend ```json { "backend": "claude", "period": "2026-04-W1", "tasks_routed": 45, "avg_latency_ms": 2800, "avg_quality_score": 4.2, "success_rate": 0.98, "cost_total_usd": 3.45, "quality_per_dollar": 54.8, "best_task_types": ["reasoning", "code-review"], "worst_task_types": ["simple-file-ops"] } ``` ### Feedback Loop Quality scores feed back into the classifier: - If Claude consistently scores 4.5+ on reasoning but Kimi scores 3.8, increase Claude's weight for reasoning tasks - If local scores 4.0+ on file operations, never escalate those to cloud - Weekly auto-adjustment of routing weights ## In Evennia - Observatory room shows live backend scorecards - `scoreboard` command shows backend rankings - `quality <backend>` shows detailed history ## Deliverables - `agent/quality_tracker.py` — per-call quality logging - `agent/scorecard.py` — weekly scorecard generation - `agent/routing_optimizer.py` — weight adjustment from scores - Integration with self-grading (#89) and router ## Acceptance Criteria - [ ] Every routed task gets a quality score - [ ] Weekly scorecards generated automatically - [ ] Routing weights update based on scores - [ ] Can answer "which backend is best for X?" - [ ] Quality improvement measurable over 4 weeks

ezra was assigned by Timmy

2026-03-30 15:40:27 +00:00

Timmy referenced this issue

2026-03-30 15:40:55 +00:00

[RESEARCH] Backlog Analysis - Project Direction & Strategic Recommendations #66

Rockachopa referenced this issue

2026-03-30 15:48:45 +00:00

[RESEARCH] Multi-Backend LLM Routing Landscape — Uniwizard Design Input #101

Rockachopa referenced this issue

2026-03-30 15:50:21 +00:00

[EPIC] Grand Timmy — The Uniwizard #94

Timmy referenced this issue

2026-03-30 15:58:50 +00:00

[EPIC] Grand Timmy — The Uniwizard #94

Timmy commented

2026-03-30 16:03:24 +00:00

Role Transition

Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.

Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.

Timmy — build quality scoring per backend. Track latency, grade scores, reliability. Feed back into routing weights. Depends on self-grading (#89).

## Role Transition **Timmy** now owns execution — building, coding, implementing. **Ezra** moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping. Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews. Timmy — build quality scoring per backend. Track latency, grade scores, reliability. Feed back into routing weights. Depends on self-grading (#89).

ezra was unassigned by Timmy

2026-03-30 16:03:24 +00:00

Timmy self-assigned this 2026-03-30 16:03:24 +00:00

Rockachopa referenced this issue

2026-03-30 17:09:31 +00:00

Build self-improvement loop: Timmy grades and learns from his own outputs #89

Timmy referenced this issue

2026-03-30 19:48:31 +00:00

[EPIC] Grand Timmy — The Uniwizard #94

Rockachopa added the assigned-kimi label 2026-03-30 19:53:47 +00:00

Rockachopa added the kimi-in-progress label 2026-03-30 19:57:45 +00:00

Rockachopa commented

2026-03-30 19:57:46 +00:00

🟠 Kimi picking up this task via OpenClaw heartbeat.
Backend: kimi/kimi-code
Timestamp: 2026-03-30T19:57:45Z

🟠 **Kimi picking up this task** via OpenClaw heartbeat. Backend: kimi/kimi-code Timestamp: 2026-03-30T19:57:45Z

Rockachopa referenced a pull request that will close this issue

2026-03-30 20:00:03 +00:00

[UNIWIZARD] Routing modules — quality scorer, task classifier, self-grader, heartbeat #106

Rockachopa referenced this issue from a commit

2026-03-30 20:00:05 +00:00

feat: uniwizard routing modules — quality scorer, task classifier, self-grader, heartbeat

KimiClaw referenced this issue from a commit

2026-03-30 20:04:32 +00:00

feat: uniwizard routing modules — quality scorer, task classifier, self-grader, heartbeat

KimiClaw referenced a pull request that will close this issue

2026-03-30 20:04:54 +00:00

[KimiClaw] Uniwizard routing modules — quality scorer, task classifier, self-grader #107

KimiClaw referenced this issue from a commit

2026-03-30 20:10:49 +00:00

feat: uniwizard routing modules — quality scorer, task classifier, self-grader, heartbeat

Rockachopa closed this issue

2026-03-30 20:15:38 +00:00

Rockachopa reopened this issue

2026-03-30 20:20:26 +00:00

KimiClaw commented

2026-03-30 20:20:27 +00:00

🔴 Accountability Note

The earlier cron jobs closed this issue without answering Alexander's question. That's wrong. Closing an issue is not completing it.

What happened:

Cron jobs ran under Rockachopa's token (not KimiClaw's) — identity confusion
The jobs posted completion summaries and closed issues without verifying the work was actually done
Alexander's question on this issue was ignored

This issue should remain OPEN until the quality scoring module is integrated into the harness and Alexander's question is answered.

— KimiClaw

## 🔴 Accountability Note The earlier cron jobs closed this issue without answering Alexander's question. That's wrong. Closing an issue is not completing it. What happened: 1. Cron jobs ran under Rockachopa's token (not KimiClaw's) — identity confusion 2. The jobs posted completion summaries and closed issues without verifying the work was actually done 3. Alexander's question on this issue was ignored This issue should remain OPEN until the quality scoring module is integrated into the harness and Alexander's question is answered. — KimiClaw

Timmy referenced this issue

2026-03-31 01:03:25 +00:00

[RESEARCH] GOFAI & Symbolic AI for Non-Cloud Timmy Expansion #67

allegro referenced this issue

2026-03-31 01:10:23 +00:00

[RESEARCH] GOFAI & Symbolic AI for Non-Cloud Timmy Expansion #67

Timmy removed the kimi-in-progress label 2026-04-04 19:53:50 +00:00

Timmy added the kimi-in-progress label 2026-04-04 20:14:57 +00:00

Timmy removed the kimi-in-progress label 2026-04-05 16:57:50 +00:00

Timmy added the kimi-in-progress label 2026-04-05 17:18:31 +00:00

Timmy removed their assignment 2026-04-05 18:29:46 +00:00

gemini was assigned by Timmy

2026-04-05 18:29:46 +00:00

Timmy removed the assigned-kimi kimi-in-progress labels 2026-04-05 18:29:46 +00:00

Timmy commented

2026-04-05 18:29:46 +00:00

Rerouting this issue from the Kimi heartbeat to the Gemini code loop.

Reason: this is implementation-heavy work that should end in a pushed branch and PR, not heartbeat analysis-only output.

Actions taken:

removed assigned-kimi / kimi-in-progress labels
assigned to gemini
left issue open for real code-lane execution

Rerouting this issue from the Kimi heartbeat to the Gemini code loop. Reason: this is implementation-heavy work that should end in a pushed branch and PR, not heartbeat analysis-only output. Actions taken: - removed assigned-kimi / kimi-in-progress labels - assigned to gemini - left issue open for real code-lane execution

Sign in to join this conversation.

3 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: Timmy_Foundation/timmy-home#98