Build Wolf v1.0 — Production multi-model evaluation engine #1

Open
opened 2026-04-05 19:11:33 +00:00 by Timmy · 0 comments

This is the wolf evaluation system — AI models work, PRs prove it, CI judges it.

Context

Wolf is a Python evaluation framework that runs coding tasks through multiple AI models (OpenRouter, Groq, Nous, MiniMax, direct APIs), commits their output to Gitea feature branches, opens PRs, scores the results, and ranks models for serverless endpoint deployment readiness.

The scaffold exists at main with stub modules. Build the production version.

Current Scaffold

  • wolf/ package with stub modules (see existing files)
  • wolf-config.yaml.example — example config
  • tests/ — test directory exists

What to Build (Production Quality)

1. Full Gitea API Client (wolf/gitea.py)

  • Auth via token, create/delete branches, create commits, open/close/merge PRs
  • Comment on PRs with evaluation summaries, check CI status
  • Error handling with retries and exponential backoff
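The retry/backoff requirement above could be sketched as follows. This is a minimal sketch, not the implementation: the class name, constructor signature, and `max_retries` default are assumptions, though the branch-creation endpoint and `token` auth header match the public Gitea API.

```python
import time

import requests


class GiteaClient:
    """Minimal sketch of a Gitea API client with retry and exponential backoff.

    Assumes a base URL like https://gitea.example.com/api/v1 and a personal
    access token.
    """

    def __init__(self, base_url: str, token: str, max_retries: int = 3):
        self.base_url = base_url.rstrip("/")
        self.max_retries = max_retries
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"token {token}"

    def _request(self, method: str, path: str, **kwargs):
        """Issue a request, retrying on 429/5xx with exponential backoff."""
        for attempt in range(self.max_retries):
            resp = self.session.request(method, f"{self.base_url}{path}", **kwargs)
            if resp.status_code not in (429, 500, 502, 503, 504):
                resp.raise_for_status()
                return resp.json()
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
        resp.raise_for_status()

    def create_branch(self, owner: str, repo: str, new_branch: str,
                      old_branch: str = "main"):
        """Create a feature branch via POST /repos/{owner}/{repo}/branches."""
        return self._request(
            "POST",
            f"/repos/{owner}/{repo}/branches",
            json={"new_branch_name": new_branch, "old_branch_name": old_branch},
        )
```

The other PR operations (open/close/merge, comment) would follow the same `_request` pattern, which keeps the retry policy in one place.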

2. Provider Router & Model Abstractions (wolf/models.py)

  • Multiple provider types: OpenAI-compatible, local Ollama
  • Rate limit handling with backoff, track per-model stats
  • Model selection: round-robin, weighted by score, manual
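The two non-manual selection strategies could look like this sketch; the function name, the `scores` shape, and the 0.05 sampling floor are assumptions, not part of the spec.

```python
import random


def pick_model(scores: dict[str, float], strategy: str = "weighted") -> str:
    """Select a model name; `scores` maps model name -> average score in [0, 1]."""
    names = list(scores)
    if strategy == "weighted":
        # Small floor so new (zero-score) models still get sampled occasionally.
        weights = [scores[n] + 0.05 for n in names]
        return random.choices(names, weights=weights, k=1)[0]
    if strategy == "round-robin":
        # Cursor stored on the function itself; a real router would keep state
        # on the router object instead.
        pick_model.i = (getattr(pick_model, "i", -1) + 1) % len(names)
        return names[pick_model.i]
    raise ValueError(f"unknown strategy: {strategy}")
```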

3. Task System (wolf/task.py)

  • Task types: code-review, feature-implementation, bug-fix, test-generation
  • Each task has: description, acceptance criteria, target repo, difficulty
  • Task generation from Gitea issues or manual specs
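The task shape above maps naturally onto a dataclass. A sketch, with illustrative field names (the spec fixes the four task types but not the attribute names):

```python
from dataclasses import dataclass

# The four task types named in the spec.
TASK_TYPES = {"code-review", "feature-implementation", "bug-fix", "test-generation"}


@dataclass
class Task:
    """One evaluation task: what to do, how success is judged, and where."""
    task_type: str
    description: str
    acceptance_criteria: list[str]
    target_repo: str
    difficulty: str = "medium"

    def __post_init__(self):
        if self.task_type not in TASK_TYPES:
            raise ValueError(f"unknown task type: {self.task_type}")
```

A generator built on the Gitea client could then map issue title/body onto `description` and a checklist onto `acceptance_criteria`.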

4. Execution Engine (wolf/runner.py)

  • Pick task, pick models, run them independently
  • Create feature branch, commit output, open PR
  • Wait for CI, collect results, score each PR
  • Log everything to ~/.hermes/wolf/runs/
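The run-logging step could be as small as this sketch; the one-file-per-run layout and millisecond-timestamp filename are assumptions, only the `~/.hermes/wolf/runs/` root comes from the spec.

```python
import json
import time
from pathlib import Path


def log_run(record: dict, root: Path = Path.home() / ".hermes" / "wolf" / "runs") -> Path:
    """Write one run record as a timestamped JSON file and return its path."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{int(time.time() * 1000)}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```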

5. Scoring System (wolf/evaluator.py)

  • Quantitative scoring (0.0-1.0):
    • CI pass: 0.30 | Code quality: 0.25 | PR description: 0.15
    • Test coverage: 0.20 | Commit message: 0.10
  • Aggregate scores over time, mark models "serverless-ready" at >= 0.75 avg
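The weights above sum to 1.0, so a weighted sum of per-component scores (each in [0, 1]) stays in [0, 1]. A sketch, with assumed component key names:

```python
# Weights from the spec; they sum to 1.0.
WEIGHTS = {
    "ci_pass": 0.30,
    "code_quality": 0.25,
    "test_coverage": 0.20,
    "pr_description": 0.15,
    "commit_message": 0.10,
}

SERVERLESS_THRESHOLD = 0.75


def score_pr(components: dict[str, float]) -> float:
    """Combine per-component scores (each in [0, 1]) into a weighted total."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)


def serverless_ready(history: list[float]) -> bool:
    """A model is serverless-ready when its average score is >= the threshold."""
    return bool(history) and sum(history) / len(history) >= SERVERLESS_THRESHOLD
```

For example, a PR that passes CI and has a good description but nothing else scores 0.30 + 0.15 = 0.45.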

6. CLI (wolf/cli.py)

  • wolf run --task <type> --models <list>
  • wolf evaluate --all
  • wolf leaderboard [--json]
  • wolf ready (serverless-ready candidates)
  • wolf cron --schedule "0 */4 * * *"
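The subcommand layout above fits stdlib `argparse` directly; a sketch (help strings and defaults beyond the flags listed are assumptions):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the CLI surface listed above."""
    parser = argparse.ArgumentParser(prog="wolf")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="run one evaluation")
    run.add_argument("--task", required=True)
    run.add_argument("--models", required=True, help="comma-separated model list")

    evaluate = sub.add_parser("evaluate")
    evaluate.add_argument("--all", action="store_true", dest="all_models")

    leaderboard = sub.add_parser("leaderboard")
    leaderboard.add_argument("--json", action="store_true", dest="as_json")

    sub.add_parser("ready", help="list serverless-ready candidates")

    cron = sub.add_parser("cron")
    cron.add_argument("--schedule", default="0 */4 * * *")
    return parser
```

Wiring this into `wolf/__main__.py` gives the `python -m wolf ...` entry points required by the acceptance criteria.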

7. Tests

  • Unit tests for scoring and config validation, with API calls mocked
  • pytest tests/ must pass on every PR

Requirements

  • Python 3.10+, stdlib + requests only
  • Every function has docstrings, every error logged
  • No hardcoded secrets, all via config/env vars
  • Handle API failures gracefully

Acceptance Criteria

  • python -m wolf evaluate --all runs end-to-end
  • python -m wolf leaderboard shows ranked models
  • All tests pass
  • PRs on Gitea with evaluation summaries
codex-agent was assigned by allegro 2026-04-05 19:23:08 +00:00

Reference: Timmy_Foundation/wolf#1