
🐺 Wolf: Model Evaluation Backbone

Wolf is a multi-model evaluation system for sovereign AI fleets. It automates assigning coding tasks to different AI models, executing those tasks, and scoring each model's performance on real-world metrics such as CI status, code quality, and PR effectiveness.

🚀 Key Features

  • Task Generator: Automatically creates tasks from Gitea issues or task specifications.
  • Agent Runner: Executes tasks using various model providers (OpenRouter, Groq, Ollama, OpenAI, Anthropic).
  • PR Creator: Opens Pull Requests on Gitea with the agent's work.
  • Evaluators: Scores PRs on multiple dimensions:
    • CI test results.
    • Commit message quality.
    • Code quality vs. boilerplate.
    • Functionality and test inclusion.
    • PR description clarity.
  • Leaderboard: Tracks model performance over time and identifies "serverless-ready" models.
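The evaluator dimensions above can be pictured as a weighted combination feeding a single PR score. The sketch below is illustrative only: the dimension names, weights, and function are hypothetical, not Wolf's actual implementation.

```python
# Hypothetical sketch: combining per-dimension evaluator scores into one
# weighted PR score. Weights here are assumptions for illustration.
WEIGHTS = {
    "ci": 0.35,              # CI test results
    "commit_quality": 0.15,  # commit message quality
    "code_quality": 0.20,    # code quality vs. boilerplate
    "functionality": 0.20,   # functionality and test inclusion
    "pr_description": 0.10,  # PR description clarity
}

def score_pr(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into a weighted total."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        total += weight * dimension_scores.get(name, 0.0)
    return round(total, 3)

example = {
    "ci": 1.0,
    "commit_quality": 0.8,
    "code_quality": 0.6,
    "functionality": 0.9,
    "pr_description": 0.7,
}
print(score_pr(example))  # → 0.84
```

A weighted sum like this keeps each dimension independently tunable, so a fleet operator could, for example, raise the CI weight for serverless-readiness decisions.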

📦 Installation

Wolf is designed to be zero-dependency where possible, requiring only the standard library plus requests and PyYAML.

pip install requests pyyaml

🛠️ Configuration

Save your configuration to ~/.hermes/wolf-config.yaml. See wolf-config.yaml.example for a template.
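For orientation, a config might look roughly like the sketch below. The field names here are guesses for illustration; wolf-config.yaml.example is the authoritative schema.

```yaml
# Hypothetical sketch of ~/.hermes/wolf-config.yaml — field names are
# illustrative, not Wolf's actual schema.
gitea:
  url: https://gitea.example.com
  token: YOUR_API_TOKEN
models:
  - provider: openrouter
    name: some/model-name
  - provider: ollama
    name: local-model
```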

🤖 Usage

Run Pending Tasks

python3 -m wolf.cli --run

Evaluate Open PRs

python3 -m wolf.cli --evaluate

Show Leaderboard

python3 -m wolf.cli --leaderboard

Run as a Cron Job

Add the following to your crontab to run Wolf every hour:

0 * * * * cd /path/to/wolf && python3 -m wolf.cli --run --evaluate >> ~/.hermes/wolf/cron.log 2>&1

🧪 Testing

Run the test suite locally:

python3 -m unittest discover tests

"The strength of the pack is the wolf, and the strength of the wolf is the pack." — The Wolf Sovereign Core has spoken.
