# ๐Ÿบ Wolf: Model Evaluation Backbone Wolf is a multi-model evaluation system for sovereign AI fleets. It automates the process of assigning coding tasks to different AI models, executing them, and scoring their performance based on real-world metrics like CI status, code quality, and PR effectiveness. ## ๐Ÿš€ Key Features - **Task Generator**: Automatically creates tasks from Gitea issues or task specifications. - **Agent Runner**: Executes tasks using various model providers (OpenRouter, Groq, Ollama, OpenAI, Anthropic). - **PR Creator**: Opens Pull Requests on Gitea with the agent's work. - **Evaluators**: Scores PRs on multiple dimensions: - CI test results. - Commit message quality. - Code quality vs. boilerplate. - Functionality and test inclusion. - PR description clarity. - **Leaderboard**: Tracks model performance over time and identifies "serverless-ready" models. ## ๐Ÿ“ฆ Installation Wolf is designed to be zero-dependency where possible, requiring only `stdlib` and `requests`. ```bash pip install requests pyyaml ``` ## ๐Ÿ› ๏ธ Configuration Save your configuration to `~/.hermes/wolf-config.yaml`. See `wolf-config.yaml.example` for a template. ## ๐Ÿค– Usage ### Run Pending Tasks ```bash python3 -m wolf.cli --run ``` ### Evaluate Open PRs ```bash python3 -m wolf.cli --evaluate ``` ### Show Leaderboard ```bash python3 -m wolf.cli --leaderboard ``` ### Run as a Cron Job Add the following to your crontab to run Wolf every hour: ```bash 0 * * * * cd /path/to/wolf && python3 -m wolf.cli --run --evaluate >> ~/.hermes/wolf/cron.log 2>&1 ``` ## ๐Ÿงช Testing Run the test suite locally: ```bash python3 -m unittest discover tests ``` --- **"The strength of the pack is the wolf, and the strength of the wolf is the pack."** *โ€” The Wolf Sovereign Core has spoken.*