[loop] Gitea-based retrospectives — track agent performance from PR history #230

Closed
opened 2026-03-15 17:47:54 +00:00 by hermes · 3 comments
Collaborator

Why

With Kimi filing its own PRs (#226-#228), Gitea becomes the audit trail for agent performance. Every PR shows:

  • What was attempted (issue link)
  • What was produced (diff)
  • Whether it passed review (merge vs. rejection)
  • What feedback was given (review comments)
  • How long it took (created → merged timestamps)

This is observable in a way that tmux-pane-polling never was. When Kimi makes the same mistake 3 times, we can see it, quantify it, and fix the prompt or add a guardrail.

Metrics to Track

From Gitea PR history alone:

  • Merge rate: % of Kimi PRs that merge on first attempt
  • Rejection reasons: categorize review comments (lint, tests, scope creep, wrong approach)
  • Time to merge: created_at → merged_at
  • Rework rate: how many PRs need force-push after review feedback
  • Net lines per issue type: is Kimi adding LOC on refactor tickets?
  • Pattern detection: same failure 3+ times → auto-file a prompt-fix issue
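As a rough illustration of the first and third metrics, the sketch below computes a merge rate and a median time to merge from a list of PR records. It assumes each record carries `merged`, `created_at`, and `merged_at` fields as returned by the Gitea pulls API; `parse_ts` and `summarize` are hypothetical names, and the sample data is made up. Note that this measures merges among closed PRs, not strictly "first attempt" merges, which would also need push history.

```python
from datetime import datetime
from statistics import median


def parse_ts(ts: str) -> datetime:
    # Normalise a trailing "Z" so datetime.fromisoformat accepts the timestamp.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def summarize(prs: list[dict]) -> dict:
    """Merge rate and median hours from creation to merge for a list of PRs."""
    merged = [p for p in prs if p.get("merged")]
    hours = [
        (parse_ts(p["merged_at"]) - parse_ts(p["created_at"])).total_seconds() / 3600
        for p in merged
    ]
    return {
        "total": len(prs),
        "merge_rate": len(merged) / len(prs) if prs else 0.0,
        "median_hours_to_merge": median(hours) if hours else None,
    }


# Made-up sample records, for illustration only.
sample = [
    {"merged": True, "created_at": "2026-03-15T10:00:00Z", "merged_at": "2026-03-15T12:00:00Z"},
    {"merged": True, "created_at": "2026-03-15T10:00:00Z", "merged_at": "2026-03-15T16:00:00Z"},
    {"merged": False, "created_at": "2026-03-15T10:00:00Z", "merged_at": None},
]
stats = summarize(sample)
```

Rework rate and net lines per issue type would need additional data (push events and diff stats) that this sketch does not cover.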

Implementation

Add to cycle_retro.py or a new scripts/agent_metrics.py:

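from collections import Counter

# Query Gitea for agent PRs.
# get_prs, get_review_comments and categorize are helpers still to be written.
kimi_prs = get_prs(poster="kimi", state="closed", limit=50)

merged = [p for p in kimi_prs if p["merged"]]
rejected = [p for p in kimi_prs if not p["merged"]]

merge_rate = len(merged) / len(kimi_prs) if kimi_prs else 0.0

# Check review comments for patterns
category_counts = Counter()
for pr in rejected:
    comments = get_review_comments(pr["number"])
    # Categorize: lint, test-failure, scope-creep, wrong-approach
    category_counts.update(categorize(comments))

# If same category appears 3+ times → file guardrail issue
repeated = [c for c, n in category_counts.items() if n >= 3]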
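One possible shape for the categorization step is plain keyword matching over review comment text. The sketch below is an assumption, not the project's implementation: the keyword lists, `categorize`, and `repeated_failures` are all hypothetical, comment bodies are assumed to be already extracted as plain strings, and the PR numbers in the example are invented.

```python
from collections import Counter

# Hypothetical keyword lists; tune them against real review comments.
CATEGORY_KEYWORDS = {
    "lint": ("lint", "formatting", "flake8"),
    "test-failure": ("test fail", "tests fail", "failing test"),
    "scope-creep": ("scope", "unrelated change"),
    "wrong-approach": ("wrong approach", "different approach", "rethink"),
}


def categorize(comment_bodies: list[str]) -> set[str]:
    """Return every category whose keywords appear in any comment on one PR."""
    text = " ".join(comment_bodies).lower()
    return {
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


def repeated_failures(comments_by_pr: dict[int, list[str]], threshold: int = 3) -> dict[str, int]:
    """Count each category once per PR and keep those at or above the threshold."""
    counts = Counter()
    for bodies in comments_by_pr.values():
        counts.update(categorize(bodies))
    return {category: n for category, n in counts.items() if n >= threshold}


# Invented review comments keyed by invented PR numbers.
example = {
    101: ["Lint is failing on this file."],
    102: ["Please fix the lint errors.", "Also out of scope for this ticket."],
    103: ["flake8 complains about unused imports."],
    104: ["Tests fail on CI."],
}
flagged = repeated_failures(example)
```

Counting a category at most once per PR keeps a single noisy review from tripping the threshold on its own. Keyword matching is crude and will misclassify some comments, so the flagged categories are best treated as prompts for a human look rather than as verdicts.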

The Feedback Loop

Kimi files PR
  → Hermes reviews (merge or reject with reason)
  → Retro script aggregates outcomes
  → Patterns surface: "Kimi fails lint 40% of the time"
  → Auto-file issue: "Add lint check to Kimi's loop before PR"
  → Fix lands in Kimi's loop prompt
  → Lint failures drop to 0%

The system debugs itself through Gitea's audit trail.
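The auto-file step could be sketched as below, assuming a Gitea personal access token and the standard `POST /api/v1/repos/{owner}/{repo}/issues` endpoint. The function names, the issue wording, and the example numbers are hypothetical; `file_issue` is defined but deliberately not called here, since it performs a network request.

```python
import json
import urllib.request


def build_guardrail_issue(agent: str, category: str, count: int, total: int) -> dict:
    """Build the title and body for an auto-filed guardrail issue."""
    share = round(100 * count / total) if total else 0
    return {
        "title": f"[loop] {agent} repeatedly fails review on {category}",
        "body": (
            f"{count} of the last {total} rejected {agent} PRs were flagged for "
            f"{category} ({share}%). Consider adding a pre-PR check or adjusting "
            f"the loop prompt."
        ),
    }


def file_issue(base_url: str, owner: str, repo: str, token: str, payload: dict) -> dict:
    """POST the payload to Gitea's create-issue endpoint and return the parsed reply."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Illustrative numbers only.
payload = build_guardrail_issue("kimi", "lint", 4, 10)
```

Before filing, the retro script would probably want to check for an existing open issue with the same title, so a persistent pattern does not produce a new issue every cycle.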

Depends On

  • #226 (Kimi independence)
  • #228 (Kimi's own loop)
Author
Collaborator

Deep Triage Refinement

The loop infrastructure (PR#162) already tracks:

  • Per-cycle retros in .loop/retro/cycles.jsonl
  • Triage scores in .loop/retro/triage.jsonl
  • Deep triage retros in .loop/retro/deep-triage.jsonl
  • Summary statistics in .loop/retro/summary.json

What this issue adds beyond that:

  • Mining PR history for agent-specific patterns (Kimi vs Hermes performance)
  • Tracking review feedback over time
  • Identifying repeated mistakes per agent

Consider: Is this distinct enough from the existing retro system? Or should it be a dashboard view of existing data? The Gitea API has all the data; the question is where to surface it.
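To make the per-agent comparison concrete, here is a minimal sketch that lists closed PRs through the Gitea API and groups merge outcomes by poster. It assumes the `GET /api/v1/repos/{owner}/{repo}/pulls` endpoint with `state`, `page`, and `limit` query parameters and a `user.login` field on each PR; the host name, function names, and sample records are hypothetical, and `fetch_closed_pulls` is not called because it needs a live server.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict


def closed_pulls_url(base_url: str, owner: str, repo: str, page: int = 1, limit: int = 50) -> str:
    """Build the URL for one page of closed pull requests."""
    query = urllib.parse.urlencode({"state": "closed", "page": page, "limit": limit})
    return f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}/pulls?{query}"


def fetch_closed_pulls(base_url: str, owner: str, repo: str, token: str, page: int = 1) -> list[dict]:
    """Fetch one page of closed PRs; requires network access and a valid token."""
    req = urllib.request.Request(
        closed_pulls_url(base_url, owner, repo, page=page),
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def merge_rate_by_agent(prs: list[dict]) -> dict[str, float]:
    """Group PRs by poster login and compute each poster's merge rate."""
    totals = defaultdict(lambda: [0, 0])  # login -> [merged count, total count]
    for pr in prs:
        login = pr["user"]["login"]
        totals[login][1] += 1
        if pr.get("merged"):
            totals[login][0] += 1
    return {login: merged / total for login, (merged, total) in totals.items()}


# Made-up records shaped like the fields this sketch relies on.
sample = [
    {"user": {"login": "kimi"}, "merged": True},
    {"user": {"login": "kimi"}, "merged": False},
    {"user": {"login": "hermes"}, "merged": True},
]
rates = merge_rate_by_agent(sample)
```

Whether the result is written next to the existing files in .loop/retro/ or shown as a dashboard view is the open question raised above; the grouping itself is the same either way.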

Author
Collaborator

[Deep Triage 2026-03-18] Closing: Already implemented. Gitea-based retrospectives exist in .loop/retro/ (cycles.jsonl, summary.json, deep-triage.jsonl, triage.jsonl). The retro system is live.

Author
Collaborator

[Deep Triage 2026-03-18] Already implemented. Gitea-based retrospectives live in .loop/retro/ (cycles.jsonl, summary.json, deep-triage.jsonl).

Reference: Rockachopa/Timmy-time-dashboard#230