---
name: Wolf Model Evaluation
description: Multi-model evaluation system for sovereign AI fleets.
---
# 🐺 Wolf Model Evaluation Skill
Use this skill to automate the evaluation of AI models on coding tasks. Wolf integrates with Gitea and multiple LLM providers to run, score, and rank models based on their actual output quality.
## 🚀 Key Capabilities
- **Task Assignment**: Assign coding tasks to specific models (OpenRouter, Groq, Ollama, OpenAI, Anthropic).
- **Agent Execution**: Run tasks through a model, parse responses, and commit changes to a feature branch.
- **PR Creation**: Open Pull Requests on Gitea with the model's work.
- **Evaluation & Scoring**: Score PRs based on CI status, commit messages, code quality, and functionality.
- **Leaderboard Tracking**: Maintain a leaderboard of model performance and identify "serverless-ready" models.
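As a concrete illustration of the evaluation step, the four scoring signals above (CI status, commit messages, code quality, functionality) can be combined into a single weighted score. The weights, value ranges, and names below are assumptions for illustration only; Wolf's actual scoring criteria may differ.

```python
# Hypothetical sketch of Wolf-style weighted PR scoring.
# Weights and signal names are illustrative assumptions, not Wolf's real values.
from dataclasses import dataclass

@dataclass
class PRSignals:
    ci_passed: bool          # CI status on the feature branch
    commit_msg_score: float  # 0.0-1.0, e.g. conventional-commit compliance
    code_quality: float      # 0.0-1.0, e.g. lint/review heuristics
    functionality: float     # 0.0-1.0, e.g. fraction of acceptance tests passing

def score_pr(s: PRSignals) -> float:
    """Combine the four signals into a single 0-100 score."""
    weights = {"ci": 0.4, "msg": 0.1, "quality": 0.2, "func": 0.3}
    return 100 * (
        weights["ci"] * (1.0 if s.ci_passed else 0.0)
        + weights["msg"] * s.commit_msg_score
        + weights["quality"] * s.code_quality
        + weights["func"] * s.functionality
    )

print(score_pr(PRSignals(True, 0.8, 0.9, 1.0)))
```

A scheme like this makes CI failures dominate the score while still rewarding clean history and working code, which is the usual trade-off in automated PR ranking.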
## 🛠️ Usage
### Run Pending Tasks
```bash
python3 -m wolf.cli --run
```
### Evaluate Open PRs
```bash
python3 -m wolf.cli --evaluate
```
### Show Leaderboard
```bash
python3 -m wolf.cli --leaderboard
```
## 🏗️ Configuration
Wolf reads its configuration from `~/.hermes/wolf-config.yaml`. Ensure your Gitea token and model API keys are correctly set.
### Example Config Structure
```yaml
gitea:
  base_url: "http://143.198.27.163:3000/api/v1"
  token: "YOUR_GITEA_TOKEN"
  owner: "Timmy_Foundation"
  repo: "wolf"
models:
  - model: "anthropic/claude-3.5-sonnet"
    provider: "openrouter"
  - model: "gemma4:latest"
    provider: "ollama"
```
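Assuming the config is standard YAML (and that PyYAML is available), a config like the one above parses into plain nested dicts and lists. This is only a sketch of the resulting structure, not Wolf's actual loader; the inline string stands in for `~/.hermes/wolf-config.yaml`.

```python
# Sketch: parse a Wolf-style config with PyYAML (pip install pyyaml).
# The inline string mirrors the example config above; in practice Wolf
# would read ~/.hermes/wolf-config.yaml instead.
import yaml

CONFIG = """\
gitea:
  base_url: "http://143.198.27.163:3000/api/v1"
  token: "YOUR_GITEA_TOKEN"
  owner: "Timmy_Foundation"
  repo: "wolf"
models:
  - model: "anthropic/claude-3.5-sonnet"
    provider: "openrouter"
  - model: "gemma4:latest"
    provider: "ollama"
"""

cfg = yaml.safe_load(CONFIG)
print(cfg["gitea"]["repo"])          # wolf
print(cfg["models"][0]["provider"])  # openrouter
```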
## 🧪 Integration
Wolf is designed to be run as a cron job or manually for specific evaluations. It logs all activities to `~/.hermes/wolf/`.
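For unattended operation, the run and evaluate commands can be scheduled as crontab entries. The schedule and working directory below are assumptions; adjust them to your deployment. Output is appended to the log directory mentioned above.

```bash
# Hypothetical crontab entries (paths and timing are examples only).
# Run pending tasks at the top of every hour, evaluate open PRs 30 minutes later.
0 * * * *  cd /opt/wolf && python3 -m wolf.cli --run      >> ~/.hermes/wolf/cron.log 2>&1
30 * * * * cd /opt/wolf && python3 -m wolf.cli --evaluate >> ~/.hermes/wolf/cron.log 2>&1
```

Staggering the two jobs gives each run window time to open its PRs before the evaluation pass scores them.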
---
**"The strength of the pack is the wolf, and the strength of the wolf is the pack."**
*— The Wolf Sovereign Core has spoken.*