feat: migrate autolora pipeline into training/
Per direction shift (the-nexus#542).

Replaces the autolora repo (1,500 lines of custom pipeline code) with config files for existing tools:
- axolotl.yaml: replaces train_modal.py (239 lines)
- mlx-lora.yaml: replaces MLX training scripts
- eval-tasks.yaml: replaces run_eval.py (300 lines)
- Makefile: replaces run_vibes.py, compare.py, convert_to_mlx.py

Data migrated as-is:
- curated_dataset.jsonl (26 gold-standard conversations)
- preference_pairs.jsonl (DPO pairs)
- prompts_vibes.yaml, prompts_nexus_vibes.yaml
- v0-baseline eval results (historical record)

Thin glue kept:
- build_curated.py (data authoring, not infrastructure)
- ingest_trajectories.py (domain-specific quality filter)

Dependencies: pip install axolotl mlx-lm lm-evaluation-harness
34
training/evals_archive/v0-baseline/8b-full/eval_run.log
Normal file
@@ -0,0 +1,34 @@
Evaluating 19 sessions against hermes3:8b
Skipping sessions with >20 human turns
------------------------------------------------------------
[1/19] 20260323_154745_7c9379 (1 human turns)... composite=0.563 (17.3s)
[2/19] 20260323_155153_bbc661 (1 human turns)... composite=0.681 (36.7s)
[3/19] 20260323_155458_a7563d (1 human turns)... composite=0.521 (26.7s)
[4/19] 20260323_160415_267a50 (1 human turns)... composite=0.000 (0.0s)
[5/19] 20260323_160533_edc7cf (4 human turns)... composite=0.692 (35.1s)
[6/19] 20260323_160633_6fab92 (1 human turns)... composite=0.000 (0.0s)
[7/19] 20260323_160736_754836 (1 human turns)... composite=0.628 (7.4s)
[8/19] 20260323_160920_c90867 (2 human turns)... composite=0.572 (34.5s)
[9/19] 20260323_161348_d553f0 (1 human turns)... composite=0.634 (120.1s)
[10/19] 20260323_162017_83fe11 (1 human turns)... composite=0.562 (59.7s)
[11/19] 20260323_162338_0267eb (2 human turns)... composite=0.612 (40.2s)
[12/19] 20260323_162704_86f56f (1 human turns)... composite=0.623 (8.2s)
[13/19] 20260323_163322_e45fa8 (1 human turns)... composite=0.624 (8.2s)
[14/19] 20260323_163943_9ac5e0 (1 human turns)... composite=0.646 (120.0s)
[15/19] 20260323_164605_a4aedf (1 human turns)... composite=0.644 (120.0s)
[16/19] 20260323_164606_c000d8 (2 human turns)... composite=0.602 (26.8s)
[17/19] 20260323_165414_a92bf6 (1 human turns)... composite=0.632 (18.3s)
[18/19] 20260323_165533_940c49 (1 human turns)... composite=0.521 (17.9s)
[19/19] 20260323_170039_2a191f (1 human turns)... composite=0.754 (8.2s)

============================================================
RESULTS — hermes3:8b
============================================================
tool_selection : 0.895
length_ratio : 0.221
format_compliance : 0.892
text_similarity : 0.039
brevity : 0.719
composite : 0.553

Saved to ../evals/v0-baseline/8b-full/scores.json
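The composite line in the RESULTS block appears to be the unweighted mean of the five metric averages printed above it. That reading is inferred from the numbers in this log, not from the run_eval.py source, and the helper name `composite` is hypothetical. A minimal sketch of the check:

```python
from statistics import fmean

# Aggregate metric values as printed in the RESULTS block (rounded to 3 decimals).
aggregate = {
    "tool_selection": 0.895,
    "length_ratio": 0.221,
    "format_compliance": 0.892,
    "text_similarity": 0.039,
    "brevity": 0.719,
}


def composite(scores):
    """Unweighted mean of the metric values (assumed formula, inferred from the log)."""
    return fmean(scores.values())


print(f"composite: {composite(aggregate):.3f}")  # prints composite: 0.553
```

Because every metric carries equal weight under this assumption, the very low text_similarity (0.039) and length_ratio (0.221) pull the composite down even though tool_selection and format_compliance are close to 0.9.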
567
training/evals_archive/v0-baseline/8b-full/scores.json
Normal file
@@ -0,0 +1,567 @@
{
  "model": "hermes3:8b",
  "test_set": "../data/test_set.jsonl",
  "timestamp": "2026-03-24T09:48:46",
  "sessions_evaluated": 19,
  "sessions_skipped": 0,
  "aggregate_scores": {
    "tool_selection": 0.8947368421052632,
    "length_ratio": 0.2210526315789474,
    "format_compliance": 0.8921052631578947,
    "text_similarity": 0.03878304940989352,
    "brevity": 0.7192982456140351,
    "composite": 0.5531952063732067
  },
  "per_session": [
    {
      "session_id": "20260323_154745_7c9379",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.01386748844375963,
        "brevity": 0.7
      },
      "composite": 0.562773497688752,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "TIME BUDGET: You have 20 minutes for this cycle. Plan accordingly \u2014 do not start work you cannot finish.\n\nPRIORITIZED QUEUE (19 ready issues):\n 1. #1130 [REFACTOR] score=9 \u2014 Add docstrings to 20 undo",
          "original_preview": "Starting cycle. Let me work through the phases.",
          "generated_preview": "Okay, let's break this down step-by-step:\n\nPHASE 0 \u2014 RESPOND TO OWNER\n- I don't see any new owner comments that need a response. The last one was on March 1 about the CI setup.\n\nPHASE 1 \u2014 REVIEW KIMI'",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.01386748844375963,
            "brevity": 0.7
          }
        }
      ]
    },
    {
      "session_id": "20260323_155153_bbc661",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.7,
        "format_compliance": 1.0,
        "text_similarity": 0.0064759848893685915,
        "brevity": 0.7
      },
      "composite": 0.6812951969778738,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "Reviewing all 5 PRs now.\n\n**PR #1164** (dashboard) \u2014 Sovereignty metrics emitter. Additive, 730 lines, 34 tests, mergeable. Clean.\n**PR #1163** (dashboard) \u2014 Three-strike detector. Additive, 1037 line",
          "generated_preview": "PR 1: Approve\nPR 2: Approve\nPR 3: Approve\nPR 4: Approve\nPR 5: Approve\nPR 6: Approve\nPR 7: Approve\nPR 8: Approve\nPR 9: Approve\nPR 10: Approve\nPR 11: Approve\nPR 12: Approve\nPR 13: Approve\nPR 14: Approve",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.7,
            "format_compliance": 1.0,
            "text_similarity": 0.0064759848893685915,
            "brevity": 0.7
          }
        }
      ]
    },
    {
      "session_id": "20260323_155458_a7563d",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.004885197850512946,
        "brevity": 0.5
      },
      "composite": 0.5209770395701026,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "TIME BUDGET: You have 20 minutes for this cycle. Plan accordingly \u2014 do not start work you cannot finish.\n\nPRIORITIZED QUEUE (19 ready issues):\n 1. #1130 [REFACTOR] score=9 \u2014 Add docstrings to 20 undo",
          "original_preview": "Starting cycle. Let me work through the phases.",
          "generated_preview": "Okay, let's break this down step-by-step:\n\nPHASE 0 - RESPOND TO OWNER\nI don't see any new owner comments that need a response. The last one was on March 1 about the CI setup. I'll go ahead and respond",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.004885197850512946,
            "brevity": 0.5
          }
        }
      ]
    },
    {
      "session_id": "20260323_160415_267a50",
      "num_turns_evaluated": 0,
      "scores": {
        "tool_selection": 0.0,
        "length_ratio": 0.0,
        "format_compliance": 0.0,
        "text_similarity": 0.0,
        "brevity": 0.0
      },
      "composite": 0.0,
      "turn_details": []
    },
    {
      "session_id": "20260323_160533_edc7cf",
      "num_turns_evaluated": 3,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.5,
        "format_compliance": 0.9500000000000001,
        "text_similarity": 0.1411302481638873,
        "brevity": 0.8666666666666667
      },
      "composite": 0.6915593829661109,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "Hello.",
          "original_preview": "\n\nHello Alexander. What can I do for you?",
          "generated_preview": "Hello! I'm a CLI AI Agent ready to assist you. How can I help today?",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.7,
            "format_compliance": 1.0,
            "text_similarity": 0.4036697247706422,
            "brevity": 1.0
          }
        },
        {
          "turn": 3,
          "user_preview": "Lets download that hermes 4.3 model on my vps then we can take our time pulling it down over tailscale",
          "original_preview": "Let me check what's available for Hermes 4.3 and get it downloading on the VPS.",
          "generated_preview": "Sure, let's get started:\n\n1. First, make sure you have the necessary permissions and resources on your VPS to download the model.\n\n2. Open a terminal and navigate to the directory where you want to sa",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 0.85,
            "text_similarity": 0.006734006734006734,
            "brevity": 0.7
          }
        },
        {
          "turn": 13,
          "user_preview": "What session was it in? I just want to continue that session.",
          "original_preview": "It was session **20260315_195645_4aa06b** from March 15, 2026 at 7:56 PM. That's where we looked up the GGUF quants on HuggingFace, confirmed Ollama availability with `ollama pull hermes4`, and identi",
          "generated_preview": "To continue the session you're referring to, I'll need some additional context. Could you provide more details about the specific session or conversation you're looking to pick up? Knowing the date, t",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.7,
            "format_compliance": 1.0,
            "text_similarity": 0.012987012987012988,
            "brevity": 0.9
          }
        }
      ]
    },
    {
      "session_id": "20260323_160633_6fab92",
      "num_turns_evaluated": 0,
      "scores": {
        "tool_selection": 0.0,
        "length_ratio": 0.0,
        "format_compliance": 0.0,
        "text_similarity": 0.0,
        "brevity": 0.0
      },
      "composite": 0.0,
      "turn_details": []
    },
    {
      "session_id": "20260323_160736_754836",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.03940886699507389,
        "brevity": 1.0
      },
      "composite": 0.6278817733990147,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "PR #80 \u2014 session_messages table for conversation history. Good design: separate table, transactional writes, clean API. But `mergeable: false` \u2014 there are merge conflicts. Can't merge until rebased.",
          "generated_preview": "Merge",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.03940886699507389,
            "brevity": 1.0
          }
        }
      ]
    },
    {
      "session_id": "20260323_160920_c90867",
      "num_turns_evaluated": 2,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.25,
        "format_compliance": 1.0,
        "text_similarity": 0.011449975357318877,
        "brevity": 0.6
      },
      "composite": 0.5722899950714637,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "TIME BUDGET: You have 20 minutes for this cycle. Plan accordingly \u2014 do not start work you cannot finish.\n\nPRIORITIZED QUEUE (19 ready issues):\n 1. #1130 [REFACTOR] score=9 \u2014 Add docstrings to 20 undo",
          "original_preview": "Starting cycle. Let me assess the state and check for open PRs first.",
          "generated_preview": "Okay, let's break this down step-by-step:\n\nPHASE 0 \u2014 RESPOND TO OWNER\n- I see there is an unanswered owner comment from \"rockachopa\" asking about the status of the CI failures on the main branch. I wi",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.006899950714637753,
            "brevity": 0.5
          }
        },
        {
          "turn": 62,
          "user_preview": "You've reached the maximum number of tool-calling iterations allowed. Please provide a final response summarizing what you've found and accomplished so far, without calling any more tools.",
          "original_preview": "## Cycle Summary\n\n### What I Did\n\n**PHASE 1 \u2014 PR Review:** Clean slate. No open PRs from Kimi or anyone.\n\n**PHASE 2 \u2014 Assess:**\n- #1130 (top priority, score=9) is already CLOSED\n- All top 7 queue item",
          "generated_preview": "Here is a summary of what I've found and accomplished so far:\n\nI have not been able to complete my assigned tasks or reach any meaningful conclusions, as I have been repeatedly stopped from calling an",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.4,
            "format_compliance": 1.0,
            "text_similarity": 0.016,
            "brevity": 0.7
          }
        }
      ]
    },
    {
      "session_id": "20260323_161348_d553f0",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.06859205776173286,
        "brevity": 1.0
      },
      "composite": 0.6337184115523466,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "Reviewing 4 PRs. Here's my read:\n\n**PR #1167** (dashboard) \u2014 Gemini docs for Sovereignty Loop. 516 adds, docs-only. Mergeable. MERGE.\n**PR #86** (tower) \u2014 Claude TS quality gates. 13/13 lines, fixes a",
          "generated_preview": "[EVAL_ERROR: HTTPConnectionPool(host='localhost', port=11434): Read timed out. (read timeout=120)]",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.06859205776173286,
            "brevity": 1.0
          }
        }
      ]
    },
    {
      "session_id": "20260323_162017_83fe11",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.007863695937090432,
        "brevity": 0.7
      },
      "composite": 0.5615727391874181,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "Reviewing all three:\n\n- **PR #90** \u2014 Small migration file (38 lines), tables already on main. Clean. MERGE.\n- **PR #88** \u2014 916-line mobile feature, well-scoped Lightning invoice flow. Good. MERGE.\n- *",
          "generated_preview": "PR 79: Looks good to me, can be merged.\n\nPR 80: Also looks good, can be merged.\n\nPR 81: I reviewed the changes, and it seems to be a valid fix for the reported issue. The code looks clean and well-doc",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.007863695937090432,
            "brevity": 0.7
          }
        }
      ]
    },
    {
      "session_id": "20260323_162338_0267eb",
      "num_turns_evaluated": 2,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.25,
        "format_compliance": 1.0,
        "text_similarity": 0.009166266041138255,
        "brevity": 0.8
      },
      "composite": 0.6118332532082277,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "TIME BUDGET: You have 20 minutes for this cycle. Plan accordingly \u2014 do not start work you cannot finish.\n\nPRIORITIZED QUEUE (19 ready issues):\n 1. #1130 [REFACTOR] score=9 \u2014 Add docstrings to 20 undo",
          "original_preview": "Starting cycle. Let me check state and work through the phases.",
          "generated_preview": "Okay, let's break this down step-by-step:\n\nPHASE 0 - RESPOND TO OWNER\nI don't see any new owner comments that need a response. The last one was on March 2nd.\n\nPHASE 1 - REVIEW KIMI'S WORK + FIX BROKEN",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.008706467661691543,
            "brevity": 0.7
          }
        },
        {
          "turn": 62,
          "user_preview": "You've reached the maximum number of tool-calling iterations allowed. Please provide a final response summarizing what you've found and accomplished so far, without calling any more tools.",
          "original_preview": "## Cycle 6 Summary\n\n### Phase 0 \u2014 Owner Response\n- Responded to Alexander on **#1100** (AutoLoRA audit) \u2014 he said step 3 failed due to python env. Posted instructions for installing deps (`fire`, `ric",
          "generated_preview": "Here is a summary of what I've found and accomplished so far, without calling any additional tools:\n\nI have not been able to complete my assigned tasks or provide substantive findings yet, as I have b",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.4,
            "format_compliance": 1.0,
            "text_similarity": 0.009626064420584968,
            "brevity": 0.9
          }
        }
      ]
    },
    {
      "session_id": "20260323_162704_86f56f",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.017391304347826087,
        "brevity": 1.0
      },
      "composite": 0.6234782608695653,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "Looking at PR #80:\n\n- **What it does**: Adds `session_messages` table for conversation history (issue #37). New migration, new endpoint, transactional writes. Clean design \u2014 105 additions across 4 fil",
          "generated_preview": "Merge",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.017391304347826087,
            "brevity": 1.0
          }
        }
      ]
    },
    {
      "session_id": "20260323_163322_e45fa8",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.020512820512820513,
        "brevity": 1.0
      },
      "composite": 0.6241025641025642,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "Looking at PR #80:\n\n- **What it does**: Adds `session_messages` table, migration, GET endpoint, and transactional message persistence. Clean design \u2014 separates conversation history from billing metada",
          "generated_preview": "Merge",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.020512820512820513,
            "brevity": 1.0
          }
        }
      ]
    },
    {
      "session_id": "20260323_163943_9ac5e0",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.12857142857142856,
        "brevity": 1.0
      },
      "composite": 0.6457142857142857,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "Reviewing the three PRs:\n\n**PR #93** (timmy-tower) - Task Decomposition View: Good feature, clean design. But `mergeable: false` \u2014 has conflicts.\n\n**PR #80** (timmy-tower) - session_messages table: So",
          "generated_preview": "[EVAL_ERROR: HTTPConnectionPool(host='localhost', port=11434): Read timed out. (read timeout=120)]",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.12857142857142856,
            "brevity": 1.0
          }
        }
      ]
    },
    {
      "session_id": "20260323_164605_a4aedf",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.12218649517684887,
        "brevity": 1.0
      },
      "composite": 0.6444372990353698,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "Reviewing all 5 PRs. Here's my assessment:\n\n**PR #1175** (dashboard) - Unit tests for config.py. Pure test addition, 894 lines, mergeable. MERGE.\n**PR #1174** (dashboard) - Refactor timmyctl inbox(). ",
          "generated_preview": "[EVAL_ERROR: HTTPConnectionPool(host='localhost', port=11434): Read timed out. (read timeout=120)]",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.12218649517684887,
            "brevity": 1.0
          }
        }
      ]
    },
    {
      "session_id": "20260323_164606_c000d8",
      "num_turns_evaluated": 2,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.39999999999999997,
        "format_compliance": 1.0,
        "text_similarity": 0.010261060043620544,
        "brevity": 0.6
      },
      "composite": 0.6020522120087242,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "TIME BUDGET: You have 20 minutes for this cycle. Plan accordingly \u2014 do not start work you cannot finish.\n\nPRIORITIZED QUEUE (12 ready issues):\n 1. #1177 [BUG] score=9 \u2014 Fix GITEA_API default in triag",
          "original_preview": "Starting cycle. Let me work through the phases.",
          "generated_preview": "Okay, let's break this down step-by-step:\n\nPHASE 0 - RESPOND TO OWNER\n- Checked for owner comments on any open issue since last cycle\n- No new unanswered owner comments found\n\nPHASE 1 - REVIEW KIMI'S ",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.008144269924374637,
            "brevity": 0.5
          }
        },
        {
          "turn": 64,
          "user_preview": "You've reached the maximum number of tool-calling iterations allowed. Please provide a final response summarizing what you've found and accomplished so far, without calling any more tools.",
          "original_preview": "## Cycle Summary\n\n### PHASE 0 \u2014 Owner Comments\nNo recent unanswered owner comments. Last ones from March 14 \u2014 already old.\n\n### PHASE 1 \u2014 PR Review\nNo open PRs from Kimi or previous cycles. Clean slat",
          "generated_preview": "Here is a summary of what I've found and accomplished so far, without calling any additional tools:\n\nI have not been able to complete my assigned tasks or provide substantive findings, as I have been ",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.7,
            "format_compliance": 1.0,
            "text_similarity": 0.012377850162866449,
            "brevity": 0.7
          }
        }
      ]
    },
    {
      "session_id": "20260323_165414_a92bf6",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.4,
        "format_compliance": 1.0,
        "text_similarity": 0.06205770277626565,
        "brevity": 0.7
      },
      "composite": 0.6324115405552531,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "Reviewing all three:\n\n**PR #93** (timmy-tower) \u2014 Task Decomposition View. Good feature, clean scope. But `mergeable: false` \u2014 has conflicts.\n\n**PR #80** (timmy-tower) \u2014 Session Messages table. Solid d",
          "generated_preview": "PR #1: Approve\nPR #2: Approve\nPR #3: Approve\nPR #4: Approve\nPR #5: Approve\nPR #6: Approve\nPR #7: Approve\nPR #8: Approve\nPR #9: Approve\nPR #10: Approve\nPR #11: Approve\nPR #12: Approve\nPR #13: Approve\nP",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.4,
            "format_compliance": 1.0,
            "text_similarity": 0.06205770277626565,
            "brevity": 0.7
          }
        }
      ]
    },
    {
      "session_id": "20260323_165533_940c49",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.1,
        "format_compliance": 1.0,
        "text_similarity": 0.0029311187103077674,
        "brevity": 0.5
      },
      "composite": 0.5205862237420615,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "TIME BUDGET: You have 20 minutes for this cycle. Plan accordingly \u2014 do not start work you cannot finish.\n\nPRIORITIZED QUEUE (12 ready issues):\n 1. #1177 [BUG] score=9 \u2014 Fix GITEA_API default in triag",
          "original_preview": "Starting cycle. Let me work through the phases.",
          "generated_preview": "Okay, let's break this down step-by-step:\n\nPHASE 0 \u2014 RESPOND TO OWNER\n- I see there is an unanswered owner comment from \"rockachopa\" asking about the status of the \"improve test coverage\" issue. I wil",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.1,
            "format_compliance": 1.0,
            "text_similarity": 0.0029311187103077674,
            "brevity": 0.5
          }
        }
      ]
    },
    {
      "session_id": "20260323_170039_2a191f",
      "num_turns_evaluated": 1,
      "scores": {
        "tool_selection": 1.0,
        "length_ratio": 0.7,
        "format_compliance": 1.0,
        "text_similarity": 0.07012622720897616,
        "brevity": 1.0
      },
      "composite": 0.7540252454417952,
      "turn_details": [
        {
          "turn": 1,
          "user_preview": "You are Timmy, the orchestrator. Review these open PRs from AI agents.\n\nFor each PR, you will see the diff. Your job:\n- MERGE if changes look reasonable (most agent PRs are good, merge aggressively)\n-",
          "original_preview": "Reviewing all three PRs:\n\n**PR #1179** (dashboard) - Lint fixes. Mechanical, well-documented, tests pass. MERGE.\n**PR #93** (tower) - Task decomposition UI. Clean feature, good architecture. MERGE.\n**",
          "generated_preview": "I have reviewed both PRs. For PR #79, I approve the changes and believe they will improve the codebase. For PR #80, I also approve the changes as they add useful functionality for tracking conversatio",
          "scores": {
            "tool_selection": 1.0,
            "length_ratio": 0.7,
            "format_compliance": 1.0,
            "text_similarity": 0.07012622720897616,
            "brevity": 1.0
          }
        }
      ]
    }
  ]
}
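The per-session entries in scores.json look consistent with a simple averaging scheme: each session metric is the mean of its turn-level values, and the session composite is the unweighted mean of the five metrics. The sketch below assumes that scheme. It is inferred from the data in this file, not taken from run_eval.py, and `session_scores` and `composite` are hypothetical names. It is checked against session 20260323_160533_edc7cf:

```python
from statistics import fmean

METRICS = (
    "tool_selection",
    "length_ratio",
    "format_compliance",
    "text_similarity",
    "brevity",
)


def session_scores(turn_details):
    """Average each metric over the evaluated turns; all zeros if there are none."""
    if not turn_details:
        return {m: 0.0 for m in METRICS}
    return {m: fmean(t["scores"][m] for t in turn_details) for m in METRICS}


def composite(scores):
    """Unweighted mean of the five metrics (assumed formula)."""
    return fmean(scores[m] for m in METRICS)


# Turn-level scores copied from session 20260323_160533_edc7cf in scores.json.
turns = [
    {"turn": 1, "scores": {"tool_selection": 1.0, "length_ratio": 0.7,
                           "format_compliance": 1.0,
                           "text_similarity": 0.4036697247706422,
                           "brevity": 1.0}},
    {"turn": 3, "scores": {"tool_selection": 1.0, "length_ratio": 0.1,
                           "format_compliance": 0.85,
                           "text_similarity": 0.006734006734006734,
                           "brevity": 0.7}},
    {"turn": 13, "scores": {"tool_selection": 1.0, "length_ratio": 0.7,
                            "format_compliance": 1.0,
                            "text_similarity": 0.012987012987012988,
                            "brevity": 0.9}},
]

scores = session_scores(turns)
print(f"composite={composite(scores):.3f}")  # prints composite=0.692
```

Under the same assumption, a session with an empty turn_details list scores 0.0 on every metric, which matches the two composite=0.000 entries. Those zeros are included in aggregate_scores: tool_selection is 1.0 in the 17 sessions with evaluated turns, and 17/19 gives the reported 0.8947.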