Build backend quality scoring and performance tracking #98

Open
opened 2026-03-30 15:40:27 +00:00 by Timmy · 4 comments
Owner

Objective

Track which backends perform best on which task types, so Timmy's routing gets smarter over time. This feeds into the self-improvement loop (#89).

Parent Epic

#94 — Grand Timmy: The Uniwizard

What to Track Per Backend

  • Latency: time-to-first-token, total response time
  • Quality: self-grade scores (accuracy, completeness, efficiency)
  • Reliability: success rate, error rate, timeout rate
  • Cost efficiency: quality per dollar
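A per-call record covering the four dimensions above could look roughly like the sketch below. The `CallRecord` class, its field names, the 1 to 5 grading scale, and the unweighted mean used for the overall quality score are all assumptions for illustration; the issue does not specify a schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CallRecord:
    """One routed call. All field names here are hypothetical."""
    backend: str
    task_type: str
    ttft_ms: float        # latency: time-to-first-token
    total_ms: float       # latency: total response time
    accuracy: float       # self-grade component, assumed 1-5 scale
    completeness: float   # self-grade component, assumed 1-5 scale
    efficiency: float     # self-grade component, assumed 1-5 scale
    outcome: str          # reliability: "success", "error", or "timeout"
    cost_usd: float       # cost of this single call

    @property
    def quality(self) -> float:
        # Assumption: overall quality is the plain mean of the three grades.
        return (self.accuracy + self.completeness + self.efficiency) / 3


def to_jsonl(record: CallRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    row = asdict(record)
    row["quality"] = round(record.quality, 3)
    return json.dumps(row, sort_keys=True)


example = CallRecord("claude", "reasoning", 420.0, 2800.0, 4.0, 5.0, 3.0, "success", 0.08)
line = to_jsonl(example)
```

Logging one JSON line per call keeps the tracker append-only, so weekly aggregation can be a separate pass over the log.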

Score Card Per Backend

{
    "backend": "claude",
    "period": "2026-04-W1",
    "tasks_routed": 45,
    "avg_latency_ms": 2800,
    "avg_quality_score": 4.2,
    "success_rate": 0.98,
    "cost_total_usd": 3.45,
    "quality_per_dollar": 54.8,
    "best_task_types": ["reasoning", "code-review"],
    "worst_task_types": ["simple-file-ops"]
}
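A minimal sketch of how such a scorecard might be aggregated from per-call data follows. The input shape (a list of dicts with `task_type`, `latency_ms`, `quality`, `success`, `cost_usd`) and the function name `build_scorecard` are assumptions. The `quality_per_dollar` formula is inferred from the sample numbers, where 45 × 4.2 / 3.45 ≈ 54.8, that is, summed quality divided by total cost; the issue does not state the definition explicitly.

```python
from collections import defaultdict
from statistics import mean


def build_scorecard(backend, period, calls):
    """Aggregate per-call dicts into a scorecard shaped like the example above."""
    if not calls:
        raise ValueError("no calls recorded for this backend and period")

    # Group quality scores by task type to find the best and worst types.
    by_type = defaultdict(list)
    for call in calls:
        by_type[call["task_type"]].append(call["quality"])
    ranked = sorted(by_type, key=lambda t: mean(by_type[t]), reverse=True)

    total_cost = sum(c["cost_usd"] for c in calls)
    total_quality = sum(c["quality"] for c in calls)

    return {
        "backend": backend,
        "period": period,
        "tasks_routed": len(calls),
        "avg_latency_ms": round(mean(c["latency_ms"] for c in calls)),
        "avg_quality_score": round(total_quality / len(calls), 1),
        "success_rate": round(sum(1 for c in calls if c["success"]) / len(calls), 2),
        "cost_total_usd": round(total_cost, 2),
        # Inferred definition: summed quality per dollar; None for free backends.
        "quality_per_dollar": round(total_quality / total_cost, 1) if total_cost else None,
        "best_task_types": ranked[:2],
        # Skip the top two so a type never appears as both best and worst.
        "worst_task_types": ranked[2:][-1:],
    }


calls = [
    {"task_type": "reasoning", "latency_ms": 3000, "quality": 4.5, "success": True, "cost_usd": 0.10},
    {"task_type": "reasoning", "latency_ms": 2000, "quality": 4.5, "success": True, "cost_usd": 0.10},
    {"task_type": "simple-file-ops", "latency_ms": 1000, "quality": 3.0, "success": False, "cost_usd": 0.05},
    {"task_type": "code-review", "latency_ms": 2000, "quality": 4.0, "success": True, "cost_usd": 0.15},
]
card = build_scorecard("claude", "2026-04-W1", calls)
```

A local backend with zero cost would otherwise divide by zero, which is why the sketch returns `None` for `quality_per_dollar` in that case.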

Feedback Loop

Quality scores feed back into the classifier:

  • If Claude consistently scores 4.5+ on reasoning but Kimi scores 3.8, increase Claude's weight for reasoning tasks
  • If local scores 4.0+ on file operations, never escalate those to cloud
  • Weekly auto-adjustment of routing weights
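The rules above can be sketched as a weekly update step. Everything here is an assumption made for illustration: the function name `adjust_weights`, the nested `{task_type: {backend: value}}` shape, the blending rate, and the choice to move weights toward each backend's share of the total score. The 4.0 threshold for pinning a task type to the local backend comes from the second rule.

```python
def adjust_weights(weights, avg_scores, learning_rate=0.2,
                   local="local", pin_local_at=4.0):
    """Return updated routing weights; both arguments map task_type -> backend -> value."""
    updated = {task: dict(per_backend) for task, per_backend in weights.items()}

    for task_type, scores in avg_scores.items():
        # Rule 2: if the local backend is good enough, never escalate to cloud.
        if scores.get(local, 0.0) >= pin_local_at:
            updated[task_type] = {b: (1.0 if b == local else 0.0) for b in scores}
            continue

        # Rule 1: nudge each weight toward that backend's share of the score.
        total = sum(scores.values())
        uniform = 1.0 / len(scores)
        current = weights.get(task_type, {})
        blended = {}
        for backend, score in scores.items():
            target = score / total if total else uniform
            previous = current.get(backend, uniform)
            blended[backend] = (1 - learning_rate) * previous + learning_rate * target

        # Renormalize so the weights for a task type sum to 1.
        norm = sum(blended.values())
        updated[task_type] = {b: w / norm for b, w in blended.items()}

    return updated


old = {"reasoning": {"claude": 0.5, "kimi": 0.5}}
scores = {
    "reasoning": {"claude": 4.5, "kimi": 3.8},
    "file-ops": {"local": 4.1, "claude": 4.4},
}
new = adjust_weights(old, scores)
```

Blending with a small rate rather than jumping straight to the new scores keeps one noisy week from swinging the router; the right rate would need tuning against real data.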

In Evennia

  • Observatory room shows live backend scorecards
  • scoreboard command shows backend rankings
  • quality <backend> shows detailed history

Deliverables

  • agent/quality_tracker.py — per-call quality logging
  • agent/scorecard.py — weekly scorecard generation
  • agent/routing_optimizer.py — weight adjustment from scores
  • Integration with self-grading (#89) and router

Acceptance Criteria

  • Every routed task gets a quality score
  • Weekly scorecards generated automatically
  • Routing weights update based on scores
  • Can answer "which backend is best for X?"
  • Quality improvement measurable over 4 weeks
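As one possible reading of the "which backend is best for X?" criterion, a lookup over logged calls might look like this. The function name `best_backend`, the call-dict keys, and the minimum sample size are hypothetical; none of them come from the issue.

```python
from collections import defaultdict


def best_backend(calls, task_type, min_samples=3):
    """Return the backend with the highest mean quality for task_type, or None."""
    totals = defaultdict(lambda: [0.0, 0])  # backend -> [quality sum, call count]
    for call in calls:
        if call["task_type"] == task_type:
            entry = totals[call["backend"]]
            entry[0] += call["quality"]
            entry[1] += 1

    # Ignore backends with too few graded calls to trust the average.
    averages = {b: s / n for b, (s, n) in totals.items() if n >= min_samples}
    if not averages:
        return None
    return max(averages, key=averages.get)


log = (
    [{"backend": "claude", "task_type": "reasoning", "quality": 4.5}] * 3
    + [{"backend": "kimi", "task_type": "reasoning", "quality": 3.8}] * 3
    + [{"backend": "local", "task_type": "reasoning", "quality": 5.0}]
)
answer = best_backend(log, "reasoning")
```

The sample-size floor is there so a single lucky call (the lone `local` entry in the example) does not outrank backends with a real track record.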
ezra was assigned by Timmy 2026-03-30 15:40:27 +00:00
Author
Owner

Role Transition

Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.

Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.

Timmy — build quality scoring per backend. Track latency, grade scores, reliability. Feed back into routing weights. Depends on self-grading (#89).

ezra was unassigned by Timmy 2026-03-30 16:03:24 +00:00
Timmy self-assigned this 2026-03-30 16:03:24 +00:00
Rockachopa added the assigned-kimi label 2026-03-30 19:53:47 +00:00
Rockachopa added the kimi-in-progress label 2026-03-30 19:57:45 +00:00
Owner

🟠 Kimi picking up this task via OpenClaw heartbeat.
Backend: kimi/kimi-code
Timestamp: 2026-03-30T19:57:45Z

Collaborator

🔴 Accountability Note

The earlier cron jobs closed this issue without answering Alexander's question. That's wrong. Closing an issue is not completing it.

What happened:

  1. Cron jobs ran under Rockachopa's token (not KimiClaw's) — identity confusion
  2. The jobs posted completion summaries and closed issues without verifying the work was actually done
  3. Alexander's question on this issue was ignored

This issue should remain OPEN until the quality scoring module is integrated into the harness and Alexander's question is answered.

— KimiClaw

Timmy removed the kimi-in-progress label 2026-04-04 19:53:50 +00:00
Timmy added the kimi-in-progress label 2026-04-04 20:14:57 +00:00
Timmy removed the kimi-in-progress label 2026-04-05 16:57:50 +00:00
Timmy added the kimi-in-progress label 2026-04-05 17:18:31 +00:00
Timmy removed their assignment 2026-04-05 18:29:46 +00:00
gemini was assigned by Timmy 2026-04-05 18:29:46 +00:00
Timmy removed the assigned-kimi and kimi-in-progress labels 2026-04-05 18:29:46 +00:00
Author
Owner

Rerouting this issue from the Kimi heartbeat to the Gemini code loop.

Reason: this is implementation-heavy work that should end in a pushed branch and PR, not heartbeat analysis-only output.

Actions taken:

  • removed assigned-kimi / kimi-in-progress labels
  • assigned to gemini
  • left issue open for real code-lane execution

Reference: Timmy_Foundation/timmy-home#98