wolf v1.0 — Production codebase #2

Closed
Timmy wants to merge 0 commits from feature/production-codebase into main

Wolf v1.0 — Production Multi-Model Evaluation Engine

What changed

All 8 core modules were built out from stubs to production quality:

  • gitea.py (~670 lines) — Full Gitea API client with auth, branches, commits, PRs, CI status, comments, retry with exponential backoff, rate-limit handling
  • models.py (~530 lines) — Provider router with OpenAI-compatible, Anthropic, Ollama clients, per-model stats tracking, fallback routing
  • task.py (~364 lines) — Task dataclass, 4 types, Gitea issue integration, automatic type inference, round-robin assignment
  • runner.py (~370 lines) — Execution engine: pick task → assign model → generate → commit → open PR → wait CI → score
  • evaluator.py (~340 lines) — Quantitative scoring (CI 0.30, code quality 0.25, PR desc 0.15, test coverage 0.20, commit msg 0.10)
  • leaderboard.py (~330 lines) — Rankings with history, trend detection, serverless-ready candidates
  • config.py (~340 lines) — YAML config loader with validation, typed accessors, token loading
  • cli.py (~340 lines) — Full CLI: run, evaluate, leaderboard, ready, cron, reset, status
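The retry behavior in `gitea.py` can be sketched roughly as follows. This is a minimal illustration of exponential backoff with jitter, not the actual client code; `request_with_backoff` and its parameters are hypothetical names.

```python
import random
import time

def request_with_backoff(send, max_retries=5, base_delay=1.0):
    """Retry a request with exponential backoff and jitter.

    `send` is a zero-argument callable returning (status, body).
    Rate-limit (429) and server (5xx) responses trigger a retry;
    anything else is returned to the caller immediately.
    """
    for attempt in range(max_retries):
        status, body = send()
        if status == 429 or status >= 500:
            # Delay doubles each attempt; jitter avoids thundering herds.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
            continue
        return status, body
    raise RuntimeError(f"request failed after {max_retries} retries")
```

The real client layers this under every API call, so branch, commit, PR, and comment operations all inherit the same rate-limit handling.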

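The scoring weights listed for `evaluator.py` sum to 1.00, so the composite score is a straightforward weighted average. A minimal sketch, assuming each dimension is scored in [0, 1] (the key names here are illustrative, not the actual field names):

```python
# Weights taken from the PR description above; they sum to 1.00.
WEIGHTS = {
    "ci": 0.30,
    "code_quality": 0.25,
    "pr_description": 0.15,
    "test_coverage": 0.20,
    "commit_message": 0.10,
}

def composite_score(subscores):
    """Weighted sum of per-dimension scores; missing dimensions score 0."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * subscores.get(k, 0.0) for k, w in WEIGHTS.items())
```

For example, a run that passes CI but scores zero everywhere else would receive a composite score of 0.30.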
Tests

4 test suites, 1,000+ lines, covering scoring, config, Gitea API mocking, task generation.
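The Gitea API mocking mentioned above can be done with the standard library alone. A sketch using `unittest.mock`; `get_ci_status` is an assumed method name for illustration, not necessarily the real `gitea.py` interface:

```python
from unittest.mock import Mock

def ci_passed(client, pr_number):
    """True if the (possibly mocked) Gitea client reports a green CI run."""
    return client.get_ci_status(pr_number) == "success"

# A Mock stands in for the real client, so tests never hit the network.
client = Mock()
client.get_ci_status.return_value = "success"
print(ci_passed(client, 42))  # → True
```

This keeps the scoring and task-generation suites fast and deterministic regardless of Gitea availability.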

Part of

  • Epic: #195 (Wolf Evaluation Loop)
  • Milestone: #210
Timmy closed this pull request 2026-04-05 20:52:40 +00:00

Reference: Timmy_Foundation/wolf#2