🐺 Wolf: Model Evaluation Backbone
Wolf is a multi-model evaluation system for sovereign AI fleets. It automates assigning coding tasks to different AI models, running them, and scoring the results on real-world metrics such as CI status, code quality, and PR effectiveness.
🚀 Key Features
- Task Generator: Automatically creates tasks from Gitea issues or task specifications.
- Agent Runner: Executes tasks using various model providers (OpenRouter, Groq, Ollama, OpenAI, Anthropic).
- PR Creator: Opens Pull Requests on Gitea with the agent's work.
- Evaluators: Scores PRs on multiple dimensions:
  - CI test results
  - Commit message quality
  - Code quality vs. boilerplate
  - Functionality and test inclusion
  - PR description clarity
- Leaderboard: Tracks model performance over time and identifies "serverless-ready" models.
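To illustrate how per-dimension evaluator scores might roll up into a single leaderboard number, here is a minimal sketch. The class, field names, and weights are illustrative assumptions, not Wolf's actual API.

```python
# Hypothetical sketch of combining Wolf-style evaluator dimensions
# into one composite score. Weights and names are assumptions.
from dataclasses import dataclass


@dataclass
class EvalResult:
    ci_passed: bool        # CI test results
    commit_quality: float  # 0.0-1.0, commit message quality
    code_quality: float    # 0.0-1.0, code quality vs. boilerplate
    has_tests: bool        # functionality and test inclusion
    pr_clarity: float      # 0.0-1.0, PR description clarity


def composite_score(r: EvalResult) -> float:
    """Weighted aggregate of the evaluator dimensions (weights sum to 1.0)."""
    score = (
        0.4 * (1.0 if r.ci_passed else 0.0)
        + 0.2 * r.commit_quality
        + 0.2 * r.code_quality
        + 0.1 * (1.0 if r.has_tests else 0.0)
        + 0.1 * r.pr_clarity
    )
    return round(score, 3)


print(composite_score(EvalResult(True, 0.8, 0.9, True, 0.7)))  # → 0.91
```

A leaderboard would then average such scores per model over time; weighting CI results most heavily reflects the feature list above, where passing tests is listed first.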
📦 Installation
Wolf is designed to be as close to zero-dependency as possible, requiring only the standard library plus two packages:

```shell
pip install requests pyyaml
```
🛠️ Configuration
Save your configuration to `~/.hermes/wolf-config.yaml`. See `wolf-config.yaml.example` for a template.
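As a rough sketch of what such a file might contain (the keys below are assumptions for illustration; `wolf-config.yaml.example` in the repository is the authoritative template):

```yaml
# Illustrative fragment only — key names are assumptions,
# not Wolf's confirmed schema.
gitea:
  url: https://gitea.example.com
  token: YOUR_API_TOKEN
models:
  - provider: openrouter
    name: some/model-name
```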
🤖 Usage
Run Pending Tasks

```shell
python3 -m wolf.cli --run
```

Evaluate Open PRs

```shell
python3 -m wolf.cli --evaluate
```

Show Leaderboard

```shell
python3 -m wolf.cli --leaderboard
```
Run as a Cron Job
Add the following to your crontab to run Wolf every hour:

```shell
0 * * * * cd /path/to/wolf && python3 -m wolf.cli --run --evaluate >> ~/.hermes/wolf/cron.log 2>&1
```
🧪 Testing
Run the test suite locally:

```shell
python3 -m unittest discover tests
```
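For reference, a Wolf-style unit test is a standard `unittest` case; a minimal self-contained sketch (the `slugify_branch` helper is hypothetical, invented here for illustration):

```python
import unittest


def slugify_branch(title: str) -> str:
    """Hypothetical helper: turn an issue title into a branch name."""
    return "wolf/" + "-".join(title.lower().split())


class TestSlugifyBranch(unittest.TestCase):
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify_branch("Fix CI flake"), "wolf/fix-ci-flake")
```

Placing files like this under `tests/` with a `test_*.py` name is what lets `unittest discover tests` pick them up automatically.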
"The strength of the pack is the wolf, and the strength of the wolf is the pack." — The Wolf Sovereign Core has spoken.