wolf v1.0 — Production codebase #2

Closed
Timmy wants to merge 0 commits from feature/production-codebase into main

Wolf v1.0 — Production Multi-Model Evaluation Engine

What changed

All 8 core modules were built out from stubs to production quality:

  • gitea.py (~670 lines) — Full Gitea API client with auth, branches, commits, PRs, CI status, comments, retry with exponential backoff, rate-limit handling
  • models.py (~530 lines) — Provider router with OpenAI-compatible, Anthropic, Ollama clients, per-model stats tracking, fallback routing
  • task.py (~364 lines) — Task dataclass, 4 types, Gitea issue integration, automatic type inference, round-robin assignment
  • runner.py (~370 lines) — Execution engine: pick task → assign model → generate → commit → open PR → wait CI → score
  • evaluator.py (~340 lines) — Quantitative scoring (CI 0.30, code quality 0.25, PR desc 0.15, test coverage 0.20, commit msg 0.10)
  • leaderboard.py (~330 lines) — Rankings with history, trend detection, serverless-ready candidates
  • config.py (~340 lines) — YAML config loader with validation, typed accessors, token loading
  • cli.py (~340 lines) — Full CLI: run, evaluate, leaderboard, ready, cron, reset, status
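The retry behavior in `gitea.py` can be sketched roughly as follows. This is a minimal illustration of exponential backoff with jitter, not the actual client code; `request_with_backoff` and its parameters are hypothetical names.

```python
import random
import time

def request_with_backoff(send, max_retries=5, base_delay=1.0):
    """Retry a request with exponential backoff and jitter.

    `send` is a zero-argument callable returning (status, body).
    Rate-limit (429) and server (5xx) responses trigger a retry;
    anything else is returned to the caller immediately.
    """
    for attempt in range(max_retries):
        status, body = send()
        if status == 429 or status >= 500:
            # Delay doubles each attempt; jitter avoids thundering herds.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
            continue
        return status, body
    raise RuntimeError(f"request failed after {max_retries} retries")
```

The real client layers this under every API call, so branch, commit, PR, and comment operations all inherit the same rate-limit handling.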

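The scoring weights listed for `evaluator.py` sum to 1.00, so the composite score is a straightforward weighted average. A minimal sketch, assuming each dimension is scored in [0, 1] (the key names here are illustrative, not the actual field names):

```python
# Weights taken from the PR description above; they sum to 1.00.
WEIGHTS = {
    "ci": 0.30,
    "code_quality": 0.25,
    "pr_description": 0.15,
    "test_coverage": 0.20,
    "commit_message": 0.10,
}

def composite_score(subscores):
    """Weighted sum of per-dimension scores; missing dimensions score 0."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * subscores.get(k, 0.0) for k, w in WEIGHTS.items())
```

For example, a run that passes CI but scores zero everywhere else would receive a composite score of 0.30.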
Tests

4 test suites, 1,000+ lines, covering scoring, config, Gitea API mocking, task generation.
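The Gitea API mocking mentioned above can be done with the standard library alone. A sketch using `unittest.mock`; `get_ci_status` is an assumed method name for illustration, not necessarily the real `gitea.py` interface:

```python
from unittest.mock import Mock

def ci_passed(client, pr_number):
    """True if the (possibly mocked) Gitea client reports a green CI run."""
    return client.get_ci_status(pr_number) == "success"

# A Mock stands in for the real client, so tests never hit the network.
client = Mock()
client.get_ci_status.return_value = "success"
print(ci_passed(client, 42))  # → True
```

This keeps the scoring and task-generation suites fast and deterministic regardless of Gitea availability.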

Part of

  • Epic: #195 (Wolf Evaluation Loop)
  • Milestone: #210
Timmy closed this pull request 2026-04-05 20:52:40 +00:00

Reference: Timmy_Foundation/wolf#2