
🐺 Wolf: Model Evaluation Backbone

Wolf is a multi-model evaluation system for sovereign AI fleets. It automates assigning coding tasks to different AI models, executing those tasks, and scoring each model's performance on real-world metrics such as CI status, code quality, and PR effectiveness.

🚀 Key Features

  • Task Generator: Automatically creates tasks from Gitea issues or task specifications.
  • Agent Runner: Executes tasks using various model providers (OpenRouter, Groq, Ollama, OpenAI, Anthropic).
  • PR Creator: Opens Pull Requests on Gitea with the agent's work.
  • Evaluators: Scores PRs on multiple dimensions:
    • CI test results.
    • Commit message quality.
    • Code quality vs. boilerplate.
    • Functionality and test inclusion.
    • PR description clarity.
  • Leaderboard: Tracks model performance over time and identifies "serverless-ready" models.
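The evaluator dimensions above can be pictured as a weighted combination feeding a single PR score. The sketch below is illustrative only: the dimension names, weights, and function are hypothetical, not Wolf's actual implementation.

```python
# Hypothetical sketch: combining per-dimension evaluator scores into one
# weighted PR score. Weights here are assumptions for illustration.
WEIGHTS = {
    "ci": 0.35,              # CI test results
    "commit_quality": 0.15,  # commit message quality
    "code_quality": 0.20,    # code quality vs. boilerplate
    "functionality": 0.20,   # functionality and test inclusion
    "pr_description": 0.10,  # PR description clarity
}

def score_pr(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into a weighted total."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        total += weight * dimension_scores.get(name, 0.0)
    return round(total, 3)

example = {
    "ci": 1.0,
    "commit_quality": 0.8,
    "code_quality": 0.6,
    "functionality": 0.9,
    "pr_description": 0.7,
}
print(score_pr(example))  # → 0.84
```

A weighted sum like this keeps each dimension independently tunable, so a fleet operator could, for example, raise the CI weight for serverless-readiness decisions.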

📦 Installation

Wolf is designed to be zero-dependency where possible, requiring only the standard library plus requests and PyYAML.

pip install requests pyyaml

🛠️ Configuration

Save your configuration to ~/.hermes/wolf-config.yaml. See wolf-config.yaml.example for a template.
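For orientation, a config might look roughly like the sketch below. The field names here are guesses for illustration; wolf-config.yaml.example is the authoritative schema.

```yaml
# Hypothetical sketch of ~/.hermes/wolf-config.yaml — field names are
# illustrative, not Wolf's actual schema.
gitea:
  url: https://gitea.example.com
  token: YOUR_API_TOKEN
models:
  - provider: openrouter
    name: some/model-name
  - provider: ollama
    name: local-model
```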

🤖 Usage

Run Pending Tasks

python3 -m wolf.cli --run

Evaluate Open PRs

python3 -m wolf.cli --evaluate

Show Leaderboard

python3 -m wolf.cli --leaderboard

Run as a Cron Job

Add the following to your crontab to run Wolf every hour:

0 * * * * cd /path/to/wolf && python3 -m wolf.cli --run --evaluate >> ~/.hermes/wolf/cron.log 2>&1

🧪 Testing

Run the test suite locally:

python3 -m unittest discover tests

"The strength of the pack is the wolf, and the strength of the wolf is the pack." — The Wolf Sovereign Core has spoken.
