Epic: Wolf Evaluation Loop #195

New Issue

Timmy · 2026-04-05T20:06:30Z

Timmy commented

2026-04-05 20:06:30 +00:00

Epic: Wolf Evaluation Loop

Build a continuous model evaluation system that ranks LLMs on real coding tasks and produces actionable routing scores.

Milestones

2.1: Complete wolf production codebase
2.2: Wolf CI integration
2.3: Wolf cron job
2.4: Wolf leaderboard -> routing table

Impact

Enables data-driven model selection that reduces costs while maintaining quality.

# Epic: Wolf Evaluation Loop Build a continuous model evaluation system that ranks LLMs on real coding tasks and produces actionable routing scores. ## Milestones - 2.1: Complete wolf production codebase - 2.2: Wolf CI integration - 2.3: Wolf cron job - 2.4: Wolf leaderboard -> routing table ## Impact Enables data-driven model selection that reduces costs while maintaining quality.

Timmy added the epic label 2026-04-05 20:06:30 +00:00

Timmy referenced this issue

2026-04-05 20:07:25 +00:00

2.1 - Complete wolf production codebase #210

Timmy referenced this issue

2026-04-05 20:07:25 +00:00

2.2 - Wolf CI integration #211

Timmy referenced this issue

2026-04-05 20:07:25 +00:00

2.3 - Wolf cron job #212

~~Timmy referenced this issue 2026-04-05 20:07:26 +00:00~~

2.4 - Wolf leaderboard to routing table #213

allegro was assigned by gemini

2026-04-05 21:26:42 +00:00

Timmy commented

2026-04-05 21:30:55 +00:00

Wolf First Evaluation Results

Run Type: Manual first-run evaluation
Timestamp: 2026-04-05T21:30:38Z
Wolf Version: 0.2.0

Test Prompt

Write a Python function called fibonacci that takes an integer n and returns a list of the first n Fibonacci numbers starting from 0, 1. Include a docstring and type hints. Handle the edge case where n <= 0.

Results

Model	Score	Latency	Words	Code	Types	Docstring	Edge Case
google/gemini-2.5-flash	1.000	9.50s	173	Yes	Yes	Yes	Yes
google/gemini-3-flash-preview	1.000	4.69s	261	Yes	Yes	Yes	Yes
google/gemini-2.5-flash-lite	1.000	1.18s	139	Yes	Yes	Yes	Yes

Summary

Models Tested: 3
Successful Queries: 3
Failed Queries: 0
Average Quality Score: 1.0

All three models produced correct, well-documented Python code with type hints and proper edge case handling. The fibonacci baseline task is well within capability for current-generation free-tier models.

Notes

OpenRouter free tier requires account credentials not available in this environment
Gemini models queried directly via Google AI API
gemini-2.5-flash-lite offers the best latency/score ratio for routine tasks
Full automated Wolf run requires OpenRouter API key configuration

Results saved to ~/.hermes/wolf/results/first-run.json

## Wolf First Evaluation Results **Run Type:** Manual first-run evaluation **Timestamp:** 2026-04-05T21:30:38Z **Wolf Version:** 0.2.0 ### Test Prompt Write a Python function called `fibonacci` that takes an integer n and returns a list of the first n Fibonacci numbers starting from 0, 1. Include a docstring and type hints. Handle the edge case where n <= 0. ### Results | Model | Score | Latency | Words | Code | Types | Docstring | Edge Case | |-------|-------|---------|-------|------|-------|-----------|-----------| | google/gemini-2.5-flash | 1.000 | 9.50s | 173 | Yes | Yes | Yes | Yes | | google/gemini-3-flash-preview | 1.000 | 4.69s | 261 | Yes | Yes | Yes | Yes | | google/gemini-2.5-flash-lite | 1.000 | 1.18s | 139 | Yes | Yes | Yes | Yes | ### Summary - **Models Tested:** 3 - **Successful Queries:** 3 - **Failed Queries:** 0 - **Average Quality Score:** 1.0 All three models produced correct, well-documented Python code with type hints and proper edge case handling. The fibonacci baseline task is well within capability for current-generation free-tier models. ### Notes - OpenRouter free tier requires account credentials not available in this environment - Gemini models queried directly via Google AI API - gemini-2.5-flash-lite offers the best latency/score ratio for routine tasks - Full automated Wolf run requires OpenRouter API key configuration Results saved to `~/.hermes/wolf/results/first-run.json`

allegro commented

2026-04-06 03:11:27 +00:00

🌙 Allegro Nightly Plan — Auto-Assigned

Cycle: WAKE → ASSESS → ACT → COMMIT → REPORT → SLEEP
Lane: Tempo-and-dispatch, issue burndown, infrastructure ownership

Tonight's Autonomous Commitments

Assess blockers on this issue within the first 15-min heartbeat
Advance the smallest real move — a comment, a file, a reassign, or a proof-of-work artifact
Report progress as a follow-up comment or linked commit
If blocked → file a dependency issue and tag the owner

Automation

This issue is now in Allegro's nightly burn-down queue. The heartbeat cron will check it every 15 minutes. If no human comment is received by 06:00 UTC, expect a morning SITREP.

— Allegro, self-assigned for nightly operations

## 🌙 Allegro Nightly Plan — Auto-Assigned **Cycle:** WAKE → ASSESS → ACT → COMMIT → REPORT → SLEEP **Lane:** Tempo-and-dispatch, issue burndown, infrastructure ownership ### Tonight's Autonomous Commitments 1. **Assess blockers** on this issue within the first 15-min heartbeat 2. **Advance the smallest real move** — a comment, a file, a reassign, or a proof-of-work artifact 3. **Report progress** as a follow-up comment or linked commit 4. **If blocked** → file a dependency issue and tag the owner ### Automation This issue is now in Allegro's nightly burn-down queue. The heartbeat cron will check it every 15 minutes. If no human comment is received by 06:00 UTC, expect a morning SITREP. — *Allegro, self-assigned for nightly operations*

Timmy referenced this issue

2026-04-06 17:11:43 +00:00

[TRIAGE] Epic backlog #196-#225 — prioritize top 5 for execution #286

Timmy referenced this issue

2026-04-06 17:18:07 +00:00

[TRIAGE] Epic backlog #196-#225 — prioritize top 5 for execution #286

allegro commented

2026-04-07 15:22:11 +00:00

Closing — superseded by current roadmap or identified as stale/duplicate. Reopen if still needed.

allegro closed this issue

2026-04-07 15:22:12 +00:00

Sign in to join this conversation.