[loop] Gitea-based retrospectives — track agent performance from PR history #230

New Issue

hermes · 2026-03-15T17:47:54Z

hermes commented

2026-03-15 17:47:54 +00:00

Why

With Kimi filing its own PRs (#226-#228), Gitea becomes the audit trail for agent performance. Every PR shows:

What was attempted (issue link)
What was produced (diff)
Whether it passed review (merge vs. rejection)
What feedback was given (review comments)
How long it took (created → merged timestamps)

This is observable in a way that tmux-pane-polling never was. When Kimi makes the same mistake 3 times, we can see it, quantify it, and fix the prompt or add a guardrail.

Metrics to Track

From Gitea PR history alone:

Merge rate: % of Kimi PRs that merge on first attempt
Rejection reasons: categorize review comments (lint, tests, scope creep, wrong approach)
Time to merge: created_at → merged_at
Rework rate: how many PRs need force-push after review feedback
Net lines per issue type: is Kimi adding LOC on refactor tickets?
Pattern detection: same failure 3+ times → auto-file a prompt-fix issue

Implementation

Add to cycle_retro.py or new scripts/agent_metrics.py:

# Query Gitea for agent PRs
kimi_prs = get_prs(poster="kimi", state="closed", limit=50)

merged = [p for p in kimi_prs if p["merged"]]
rejected = [p for p in kimi_prs if not p["merged"]]

merge_rate = len(merged) / len(kimi_prs) if kimi_prs else 0

# Check review comments for patterns
for pr in rejected:
    comments = get_review_comments(pr["number"])
    # Categorize: lint, test-failure, scope-creep, wrong-approach
    # If same category appears 3+ times → file guardrail issue

The Feedback Loop

Kimi files PR
  → Hermes reviews (merge or reject with reason)
  → Retro script aggregates outcomes
  → Patterns surface: "Kimi fails lint 40% of the time"
  → Auto-file issue: "Add lint check to Kimi's loop before PR"
  → Fix lands in Kimi's loop prompt
  → Lint failures drop to 0%

The system debugs itself through Gitea's audit trail.

Depends On

#226 (Kimi independence)
#228 (Kimi's own loop)

## Why With Kimi filing its own PRs (#226-#228), Gitea becomes the audit trail for agent performance. Every PR shows: - What was attempted (issue link) - What was produced (diff) - Whether it passed review (merge vs. rejection) - What feedback was given (review comments) - How long it took (created → merged timestamps) This is observable in a way that tmux-pane-polling never was. When Kimi makes the same mistake 3 times, we can see it, quantify it, and fix the prompt or add a guardrail. ## Metrics to Track From Gitea PR history alone: - **Merge rate**: % of Kimi PRs that merge on first attempt - **Rejection reasons**: categorize review comments (lint, tests, scope creep, wrong approach) - **Time to merge**: created_at → merged_at - **Rework rate**: how many PRs need force-push after review feedback - **Net lines per issue type**: is Kimi adding LOC on refactor tickets? - **Pattern detection**: same failure 3+ times → auto-file a prompt-fix issue ## Implementation Add to `cycle_retro.py` or new `scripts/agent_metrics.py`: ```python # Query Gitea for agent PRs kimi_prs = get_prs(poster="kimi", state="closed", limit=50) merged = [p for p in kimi_prs if p["merged"]] rejected = [p for p in kimi_prs if not p["merged"]] merge_rate = len(merged) / len(kimi_prs) if kimi_prs else 0 # Check review comments for patterns for pr in rejected: comments = get_review_comments(pr["number"]) # Categorize: lint, test-failure, scope-creep, wrong-approach # If same category appears 3+ times → file guardrail issue ``` ## The Feedback Loop ``` Kimi files PR → Hermes reviews (merge or reject with reason) → Retro script aggregates outcomes → Patterns surface: "Kimi fails lint 40% of the time" → Auto-file issue: "Add lint check to Kimi's loop before PR" → Fix lands in Kimi's loop prompt → Lint failures drop to 0% ``` The system debugs itself through Gitea's audit trail. ## Depends On - #226 (Kimi independence) - #228 (Kimi's own loop)

hermes commented

2026-03-15 18:42:23 +00:00

Deep Triage Refinement

The loop infrastructure (PR#162) already tracks:

Per-cycle retros in .loop/retro/cycles.jsonl
Triage scores in .loop/retro/triage.jsonl
Deep triage retros in .loop/retro/deep-triage.jsonl
Summary statistics in .loop/retro/summary.json

What this issue adds beyond that:

Mining PR history for agent-specific patterns (Kimi vs Hermes performance)
Tracking review feedback over time
Identifying repeated mistakes per agent

Consider: Is this distinct enough from the existing retro system? Or should it be a dashboard view of existing data? The Gitea API has all the data; the question is where to surface it.

## Deep Triage Refinement The loop infrastructure (PR#162) already tracks: - Per-cycle retros in .loop/retro/cycles.jsonl - Triage scores in .loop/retro/triage.jsonl - Deep triage retros in .loop/retro/deep-triage.jsonl - Summary statistics in .loop/retro/summary.json What this issue adds beyond that: - Mining PR history for agent-specific patterns (Kimi vs Hermes performance) - Tracking review feedback over time - Identifying repeated mistakes per agent Consider: Is this distinct enough from the existing retro system? Or should it be a dashboard view of existing data? The Gitea API has all the data; the question is where to surface it.

hermes commented

2026-03-19 01:03:16 +00:00

[Deep Triage 2026-03-18] Closing: Already implemented. Gitea-based retrospectives exist in .loop/retro/ (cycles.jsonl, summary.json, deep-triage.jsonl, triage.jsonl). The retro system is live.

hermes closed this issue

2026-03-19 01:03:16 +00:00

hermes commented

2026-03-19 01:03:29 +00:00

[Deep Triage 2026-03-18] Already implemented. Gitea-based retrospectives live in .loop/retro/ (cycles.jsonl, summary.json, deep-triage.jsonl).

Sign in to join this conversation.

Branches Tags

main

gemini/issue-892

claude/issue-1342

claude/issue-1346

claude/issue-1351

claude/issue-1340

fix/test-llm-triage-syntax

gemini/issue-1014

gemini/issue-932

claude/issue-1277

claude/issue-1139

claude/issue-870

claude/issue-1285

claude/issue-1292

claude/issue-1281

claude/issue-917

claude/issue-1275

claude/issue-925

claude/issue-1019

claude/issue-1094

claude/issue-1019-v3

fix/flaky-vassal-xdist-tests

fix/test-config-env-isolation

claude/issue-1019-v2

claude/issue-957-v2

claude/issue-1218

claude/issue-1217

test/chat-store-unit-tests

claude/issue-1191

claude/issue-1186

claude/issue-957

gemini/issue-936

claude/issue-1065

gemini/issue-976

gemini/issue-1149

claude/issue-1135

claude/issue-1064

gemini/issue-1012

claude/issue-1095

claude/issue-1102

claude/issue-1114

gemini/issue-978

gemini/issue-971

claude/issue-1074

claude/issue-987

claude/issue-1011

feature/internal-monologue

feature/issue-1006

feature/issue-1007

feature/issue-1008

feature/issue-1009

feature/issue-1010

feature/issue-1011

feature/issue-1012

feature/issue-1013

feature/issue-1014

feature/issue-981

feature/issue-982

feature/issue-983

feature/issue-984

feature/issue-985

feature/issue-986

feature/issue-987

feature/issue-993

claude/issue-943

claude/issue-975

claude/issue-989

claude/issue-988

fix/loop-guard-gitea-api-and-queue-validation

feature/lhf-tech-debt-fixes

kimi/issue-753

kimi/issue-714

kimi/issue-716

fix/csrf-check-before-execute

chore/migrate-gitea-to-vps

kimi/issue-640

fix/utcnow-calm-py

kimi/issue-635

kimi/issue-625

fix/router-api-truncated-param

kimi/issue-604

kimi/issue-594

review-fixes

kimi/issue-570

kimi/issue-554

kimi/issue-539

kimi/issue-540

feature/ipad-v1-api

kimi/issue-506

kimi/issue-512

refactor/airllm-doc-cleanup

kimi/issue-513

kimi/issue-514

kimi/issue-500

kimi/issue-492

kimi/issue-490

kimi/issue-459

kimi/issue-472

kimi/issue-473

kimi/issue-462

kimi/issue-463

kimi/issue-454

kimi/issue-445

kimi/issue-446

kimi/issue-431

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: Rockachopa/Timmy-time-dashboard#230