---
name: Wolf Model Evaluation
description: Multi-model evaluation system for sovereign AI fleets.
---

# 🐺 Wolf Model Evaluation Skill

Use this skill to automate the evaluation of AI models on coding tasks. Wolf integrates with Gitea and multiple LLM providers to run, score, and rank models based on their actual output quality.

## 🚀 Key Capabilities

- **Task Assignment**: Assign coding tasks to specific models (OpenRouter, Groq, Ollama, OpenAI, Anthropic).
- **Agent Execution**: Run tasks through a model, parse responses, and commit changes to a feature branch.
- **PR Creation**: Open Pull Requests on Gitea with the model's work.
- **Evaluation & Scoring**: Score PRs based on CI status, commit messages, code quality, and functionality.
- **Leaderboard Tracking**: Maintain a leaderboard of model performance and identify "serverless-ready" models.

## 🛠️ Usage

### Run Pending Tasks

```bash
python3 -m wolf.cli --run
```

### Evaluate Open PRs

```bash
python3 -m wolf.cli --evaluate
```

### Show Leaderboard

```bash
python3 -m wolf.cli --leaderboard
```

## 🏗️ Configuration

Wolf reads its configuration from `~/.hermes/wolf-config.yaml`. Ensure your Gitea token and model API keys are correctly set.

### Example Config Structure

```yaml
gitea:
  base_url: "http://143.198.27.163:3000/api/v1"
  token: "YOUR_GITEA_TOKEN"
  owner: "Timmy_Foundation"
  repo: "wolf"

models:
  - model: "anthropic/claude-3.5-sonnet"
    provider: "openrouter"
  - model: "gemma4:latest"
    provider: "ollama"
```

## 🧪 Integration

Wolf is designed to be run as a cron job or manually for specific evaluations. It logs all activities to `~/.hermes/wolf/`.
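For unattended runs, the three CLI steps from the Usage section can be chained in a small wrapper and scheduled from cron. This is a sketch, not part of Wolf itself; the `cmd_prefix` parameter is an assumption added for testability.

```python
# Hypothetical wrapper around the wolf.cli steps documented above.
import subprocess
import sys

STEPS = ["--run", "--evaluate", "--leaderboard"]

def run_pipeline(cmd_prefix=(sys.executable, "-m", "wolf.cli")) -> int:
    """Run each Wolf CLI step in order, stopping at the first failure.

    Returns 0 on success, or the failing step's exit code.
    """
    for flag in STEPS:
        result = subprocess.run([*cmd_prefix, flag])
        if result.returncode != 0:
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(run_pipeline())
```

A crontab entry pointing at this script (for example, every six hours) would then run tasks, evaluate open PRs, and refresh the leaderboard in one pass; the schedule itself is up to you.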


> "The strength of the pack is the wolf, and the strength of the wolf is the pack." — The Wolf Sovereign Core has spoken.