Build Wolf v1.0 — Production multi-model evaluation engine #1

Open
opened 2026-04-05 19:11:33 +00:00 by Timmy · 0 comments

This is the wolf evaluation system — AI models work, PRs prove it, CI judges it.

Context

Wolf is a Python evaluation framework that runs coding tasks through multiple AI models (OpenRouter, Groq, Nous, MiniMax, direct APIs), commits their output to Gitea feature branches, opens PRs, scores the results, and ranks models for serverless endpoint deployment readiness.

The scaffold exists at main with stub modules. Build the production version.

Current Scaffold

  • wolf/ package with stub modules (see existing files)
  • wolf-config.yaml.example — example config
  • tests/ — test directory exists

What to Build (Production Quality)

1. Full Gitea API Client (wolf/gitea.py)

  • Auth via token, create/delete branches, create commits, open/close/merge PRs
  • Comment on PRs with evaluation summaries, check CI status
  • Error handling with retries and exponential backoff
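The retry/backoff requirement above could be sketched as follows. This is a minimal sketch, not the implementation: the class name, constructor signature, and `max_retries` default are assumptions, though the branch-creation endpoint and `token` auth header match the public Gitea API.

```python
import time

import requests


class GiteaClient:
    """Minimal sketch of a Gitea API client with retry and exponential backoff.

    Assumes a base URL like https://gitea.example.com/api/v1 and a personal
    access token.
    """

    def __init__(self, base_url: str, token: str, max_retries: int = 3):
        self.base_url = base_url.rstrip("/")
        self.max_retries = max_retries
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"token {token}"

    def _request(self, method: str, path: str, **kwargs):
        """Issue a request, retrying on 429/5xx with exponential backoff."""
        for attempt in range(self.max_retries):
            resp = self.session.request(method, f"{self.base_url}{path}", **kwargs)
            if resp.status_code not in (429, 500, 502, 503, 504):
                resp.raise_for_status()
                return resp.json()
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
        resp.raise_for_status()

    def create_branch(self, owner: str, repo: str, new_branch: str,
                      old_branch: str = "main"):
        """Create a feature branch via POST /repos/{owner}/{repo}/branches."""
        return self._request(
            "POST",
            f"/repos/{owner}/{repo}/branches",
            json={"new_branch_name": new_branch, "old_branch_name": old_branch},
        )
```

The other PR operations (open/close/merge, comment) would follow the same `_request` pattern, which keeps the retry policy in one place.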

2. Provider Router & Model Abstractions (wolf/models.py)

  • Multiple provider types: OpenAI-compatible, local Ollama
  • Rate limit handling with backoff, track per-model stats
  • Model selection: round-robin, weighted by score, manual
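The two non-manual selection strategies could look like this sketch; the function name, the `scores` shape, and the 0.05 sampling floor are assumptions, not part of the spec.

```python
import random


def pick_model(scores: dict[str, float], strategy: str = "weighted") -> str:
    """Select a model name; `scores` maps model name -> average score in [0, 1]."""
    names = list(scores)
    if strategy == "weighted":
        # Small floor so new (zero-score) models still get sampled occasionally.
        weights = [scores[n] + 0.05 for n in names]
        return random.choices(names, weights=weights, k=1)[0]
    if strategy == "round-robin":
        # Cursor stored on the function itself; a real router would keep state
        # on the router object instead.
        pick_model.i = (getattr(pick_model, "i", -1) + 1) % len(names)
        return names[pick_model.i]
    raise ValueError(f"unknown strategy: {strategy}")
```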

3. Task System (wolf/task.py)

  • Task types: code-review, feature-implementation, bug-fix, test-generation
  • Each task has: description, acceptance criteria, target repo, difficulty
  • Task generation from Gitea issues or manual specs
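The task shape above maps naturally onto a dataclass. A sketch, with illustrative field names (the spec fixes the four task types but not the attribute names):

```python
from dataclasses import dataclass

# The four task types named in the spec.
TASK_TYPES = {"code-review", "feature-implementation", "bug-fix", "test-generation"}


@dataclass
class Task:
    """One evaluation task: what to do, how success is judged, and where."""
    task_type: str
    description: str
    acceptance_criteria: list[str]
    target_repo: str
    difficulty: str = "medium"

    def __post_init__(self):
        if self.task_type not in TASK_TYPES:
            raise ValueError(f"unknown task type: {self.task_type}")
```

A generator built on the Gitea client could then map issue title/body onto `description` and a checklist onto `acceptance_criteria`.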

4. Execution Engine (wolf/runner.py)

  • Pick task, pick models, run them independently
  • Create feature branch, commit output, open PR
  • Wait for CI, collect results, score each PR
  • Log everything to ~/.hermes/wolf/runs/
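The run-logging step could be as small as this sketch; the one-file-per-run layout and millisecond-timestamp filename are assumptions, only the `~/.hermes/wolf/runs/` root comes from the spec.

```python
import json
import time
from pathlib import Path


def log_run(record: dict, root: Path = Path.home() / ".hermes" / "wolf" / "runs") -> Path:
    """Write one run record as a timestamped JSON file and return its path."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{int(time.time() * 1000)}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```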

5. Scoring System (wolf/evaluator.py)

  • Quantitative scoring (0.0-1.0):
    • CI pass: 0.30 | Code quality: 0.25 | PR description: 0.15
    • Test coverage: 0.20 | Commit message: 0.10
  • Aggregate scores over time, mark models "serverless-ready" at >= 0.75 avg
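The weights above sum to 1.0, so a weighted sum of per-component scores (each in [0, 1]) stays in [0, 1]. A sketch, with assumed component key names:

```python
# Weights from the spec; they sum to 1.0.
WEIGHTS = {
    "ci_pass": 0.30,
    "code_quality": 0.25,
    "test_coverage": 0.20,
    "pr_description": 0.15,
    "commit_message": 0.10,
}

SERVERLESS_THRESHOLD = 0.75


def score_pr(components: dict[str, float]) -> float:
    """Combine per-component scores (each in [0, 1]) into a weighted total."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)


def serverless_ready(history: list[float]) -> bool:
    """A model is serverless-ready when its average score is >= the threshold."""
    return bool(history) and sum(history) / len(history) >= SERVERLESS_THRESHOLD
```

For example, a PR that passes CI and has a good description but nothing else scores 0.30 + 0.15 = 0.45.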

6. CLI (wolf/cli.py)

  • wolf run --task <type> --models <list>
  • wolf evaluate --all
  • wolf leaderboard [--json]
  • wolf ready (serverless-ready candidates)
  • wolf cron --schedule "0 */4 * * *"
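The subcommand layout above fits stdlib `argparse` directly; a sketch (help strings and defaults beyond the flags listed are assumptions):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the CLI surface listed above."""
    parser = argparse.ArgumentParser(prog="wolf")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="run one evaluation")
    run.add_argument("--task", required=True)
    run.add_argument("--models", required=True, help="comma-separated model list")

    evaluate = sub.add_parser("evaluate")
    evaluate.add_argument("--all", action="store_true", dest="all_models")

    leaderboard = sub.add_parser("leaderboard")
    leaderboard.add_argument("--json", action="store_true", dest="as_json")

    sub.add_parser("ready", help="list serverless-ready candidates")

    cron = sub.add_parser("cron")
    cron.add_argument("--schedule", default="0 */4 * * *")
    return parser
```

Wiring this into `wolf/__main__.py` gives the `python -m wolf ...` entry points required by the acceptance criteria.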

7. Tests

  • Unit tests for scoring and config validation, with API calls mocked
  • pytest tests/ must pass on every PR

Requirements

  • Python 3.10+, stdlib + requests only
  • Every function has docstrings, every error logged
  • No hardcoded secrets, all via config/env vars
  • Handle API failures gracefully

Acceptance Criteria

  • python -m wolf evaluate --all runs end-to-end
  • python -m wolf leaderboard shows ranked models
  • All tests pass
  • PRs on Gitea with evaluation summaries
codex-agent was assigned by allegro 2026-04-05 19:23:08 +00:00

Reference: Timmy_Foundation/wolf#1