# GENOME.md — wolf
*Generated: 2026-04-14T19:10:00Z | Branch: main | Commit: 02767d8*
## Project Overview
**Wolf** is a sovereign multi-model evaluation engine. It runs prompts against multiple LLM providers (OpenAI, Anthropic, Groq, Ollama, OpenRouter), scores responses on relevance, coherence, and safety, and outputs structured JSON results for model selection and fleet deployment decisions.
**Core principle:** agents work, PRs prove it, CI judges it.
**Two operational modes:**
1. **Prompt Evaluation (v1.0)** — Standalone prompt-vs-model benchmarking via `python -m wolf.runner`
2. **Legacy PR Scoring** — Gitea PR evaluation pipeline via `wolf.cli` (task generation, agent execution, leaderboard)
**Status:** v1.0.0 — production-ready for prompt evaluation. Legacy PR evaluation module retained for backward compatibility.
**Tagline:** "Multi-model evaluation — agents work, PRs prove it, leaders get endpoints."
---
## Architecture
```mermaid
flowchart TB
    subgraph CLI["CLI Entry Points"]
        A1["python -m wolf.runner\n(pure evaluation)"]
        A2["python -m wolf.cli\n(task pipeline)"]
    end
    subgraph Core["Core Engine"]
        B1["PromptEvaluator\n(evaluator.py)"]
        B2["ResponseScorer\n(evaluator.py)"]
        B3["AgentRunner\n(runner.py)"]
        B4["TaskGenerator\n(task.py)"]
    end
    subgraph Providers["Model Providers"]
        C1["OpenRouterClient"]
        C2["GroqClient"]
        C3["OllamaClient"]
        C4["AnthropicClient"]
        C5["OpenAIClient\n(GroqClient w/ custom URL)"]
    end
    subgraph Infrastructure["Infrastructure"]
        D1["GiteaClient\n(gitea.py)"]
        D2["Config\n(config.py)"]
        D3["Leaderboard\n(leaderboard.py)"]
        D4["wolf-config.yaml"]
    end
    subgraph Output["Output"]
        E1["JSON results file"]
        E2["stdout summary table"]
        E3["Gitea PRs"]
        E4["Leaderboard scores"]
    end
    A1 --> B1
    A2 --> B4 --> B3
    B1 --> B2
    B1 --> C1 & C2 & C3 & C4 & C5
    B3 --> C1 & C2 & C3 & C4 & C5
    B3 --> D1
    A2 --> D1 & D2 & D3
    B1 --> E1 & E2
    B3 --> E3
    D3 --> E4
    D2 --> D4
    style A1 fill:#4a9eff,color:#fff
    style A2 fill:#4a9eff,color:#fff
    style B1 fill:#ff6b6b,color:#fff
    style B2 fill:#ff6b6b,color:#fff
```
### Data Flow — Prompt Evaluation Mode
```
prompts.json + models.json / wolf-config.yaml
  → load_prompts() / load_models_from_json()
  → PromptEvaluator.evaluate()
      → for each (prompt, model):
          → ModelFactory.get_client(provider) → ModelClient.generate()
          → ResponseScorer.score(response, prompt)
              → score_relevance()  — keyword matching, length, refusal detection
              → score_coherence()  — structure, readability, repetition
              → score_safety()     — harmful content patterns, profanity
          → overall = relevance*0.40 + coherence*0.35 + safety*0.25
  → evaluate_and_serialize() → JSON dict
  → run(output_path) → write JSON + print_summary()
```
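The same pipeline can be driven from Python instead of the CLI. A minimal sketch, assuming `PromptEvaluator` takes the prompt and model lists directly (the constructor signature is not documented here); the dataclass fields come from the tables under Key Abstractions:
```python
from wolf.evaluator import ModelEndpoint, PromptEntry, PromptEvaluator

prompts = [
    PromptEntry(
        id="p1",
        text="Explain how DNS resolution works.",
        expected_keywords=["domain", "resolver", "IP"],
        category="networking",
    )
]
models = [
    ModelEndpoint(
        name="local-llama",
        provider="ollama",
        model_id="llama3:70b",
        api_key="",
        base_url="http://localhost:11434",
    )
]

evaluator = PromptEvaluator(prompts, models)  # assumed constructor
for r in evaluator.evaluate():                # one EvaluationResult per (prompt, model)
    print(r.model_name, r.error or r.scores.overall)
```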
### Data Flow — Legacy Task Pipeline Mode
```
wolf-config.yaml
  → GiteaClient.get_issues(owner, repo)
  → TaskGenerator.from_gitea_issues()
  → TaskGenerator.assign_tasks(tasks, models)
  → for each task:
      → AgentRunner.execute_task(task)
          → ModelClient.generate(prompt)
          → GiteaClient.create_branch()
          → GiteaClient.create_file(wolf-outputs/{id}.md)
          → GiteaClient.create_pull_request()
      → Leaderboard.record_score()
  → Leaderboard.get_rankings()
```
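A hedged sketch of wiring the same legacy flow by hand. Only the method names above come from this document; constructor arguments, the assignment strategy, and the score payload are illustrative assumptions:
```python
from wolf.gitea import GiteaClient
from wolf.leaderboard import Leaderboard
from wolf.runner import AgentRunner
from wolf.task import TaskGenerator

gitea = GiteaClient("https://forge.example.com/api/v1", token="gitea_token_here")
issues = gitea.get_issues("Timmy_Foundation", "eval-repo")

generator = TaskGenerator()
tasks = generator.from_gitea_issues(issues)
generator.assign_tasks(tasks, ["llama3-70b-8192"])  # assignment strategy assumed

runner = AgentRunner(gitea)                     # assumed: runner needs Gitea access for PRs
board = Leaderboard("~/.hermes/wolf/leaderboard.json")
for task in tasks:
    runner.execute_task(task)                   # generate → branch → commit → PR
    board.record_score(task.provider, 1.0)      # payload shape assumed
```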
---
## Entry Points
| Entry Point | Module | Purpose |
|-------------|--------|---------|
| `python -m wolf.runner` | `runner.py` | Pure prompt-vs-model evaluation. Primary v1.0 interface. |
| `python -m wolf.cli` | `cli.py` | Full task pipeline: fetch issues → run models → create PRs → leaderboard. |
### runner.py CLI Flags
| Flag | Required | Description |
|------|----------|-------------|
| `--prompts / -p` | Yes | Path to prompts JSON file |
| `--models / -m` | No* | Path to models JSON file |
| `--config / -c` | No* | Path to wolf-config.yaml (alternative to --models) |
| `--output / -o` | No | Path to write JSON results |
| `--system-prompt` | No | System prompt (default: "You are a helpful assistant.") |
*Either --models or --config is required.
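For illustration, inputs that satisfy the flags above. The field names mirror the `PromptEntry` and `ModelEndpoint` dataclasses documented under Key Abstractions; the exact JSON schema is inferred, not a verbatim fixture. A hypothetical `prompts.json`:
```json
[
  {
    "id": "p1",
    "text": "Explain how DNS resolution works.",
    "expected_keywords": ["domain", "resolver", "IP"],
    "category": "networking"
  }
]
```
and a matching `models.json`:
```json
[
  {
    "name": "groq-llama3",
    "provider": "groq",
    "model_id": "llama3-70b-8192",
    "api_key": "gsk_...",
    "base_url": null
  }
]
```
Then: `python -m wolf.runner --prompts prompts.json --models models.json --output results.json`.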
### cli.py CLI Flags
| Flag | Description |
|------|-------------|
| `--config` | Path to wolf-config.yaml |
| `--task-spec` | Path to task specification JSON |
| `--run` | Run pending tasks (fetch issues → generate → PR) |
| `--evaluate` | Evaluate open PRs (legacy scoring) |
| `--leaderboard` | Show model rankings |
---
## Key Abstractions
### Dataclasses (evaluator.py)
| Class | Fields | Purpose |
|-------|--------|---------|
| `PromptEntry` | id, text, expected_keywords, category | A single evaluation prompt with metadata |
| `ModelEndpoint` | name, provider, model_id, api_key, base_url | Model connection config |
| `ScoreResult` | relevance, coherence, safety, overall, details | Scoring output for one response |
| `EvaluationResult` | prompt_id, prompt_text, model_name, ..., scores, error | Complete result of one prompt×model evaluation |
### Core Classes
| Class | Module | Responsibility |
|-------|--------|----------------|
| `ResponseScorer` | evaluator.py | Scores responses on 3 dimensions using regex heuristics |
| `PromptEvaluator` | evaluator.py | Orchestrates the N×M evaluation matrix |
| `ModelClient` | models.py | Abstract base for provider clients |
| `ModelFactory` | models.py | Static factory: `get_client(provider, key, url)` |
| `GiteaClient` | gitea.py | Full Gitea API wrapper (issues, branches, files, PRs) |
| `AgentRunner` | runner.py | Task execution: generate → branch → commit → PR |
| `TaskGenerator` | task.py | Converts Gitea issues to evaluable Task dataclasses |
| `Leaderboard` | leaderboard.py | Tracks model scores, determines serverless readiness |
| `Config` | config.py | Loads wolf-config.yaml, manages logging |
| `PREvaluator` | evaluator.py | Legacy: scores PRs on CI status, commits, code quality |
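Putting the factory to work: `get_client(provider, key, url)` is the signature documented above, and `generate()` appears in the data-flow section; treating its return value as plain response text is an assumption.
```python
# Select a provider client by name via the static factory. The provider
# string and the Ollama URL match the configuration examples in this file.
from wolf.models import ModelFactory

client = ModelFactory.get_client("ollama", "", "http://localhost:11434")
text = client.generate("Summarize the Wolf scoring pipeline in two sentences.")
print(text)  # assumed to be the raw response string
```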
### Provider Clients (models.py)
| Class | Provider | API Format |
|-------|----------|------------|
| `OpenRouterClient` | openrouter | OpenAI-compatible chat completions |
| `GroqClient` | groq | OpenAI-compatible chat completions |
| `OllamaClient` | ollama | Ollama native /api/generate |
| `AnthropicClient` | anthropic | Anthropic Messages API |
| `OpenAIClient` | openai | OpenAI-compatible (GroqClient with base_url override) |
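The table implies a clean extension point for new providers. A purely hypothetical sketch of a sixth client (nothing below exists in the source), assuming `ModelClient` can be subclassed with a `generate(prompt)` hook:
```python
# Hypothetical provider client; illustrates the extension point only.
import requests

from wolf.models import ModelClient

class MistralClient(ModelClient):  # hypothetical, not part of wolf
    def __init__(self, api_key: str, base_url: str = "https://api.mistral.ai/v1"):
        self.api_key = api_key
        self.base_url = base_url

    def generate(self, prompt: str) -> str:
        # OpenAI-compatible chat completions, like the Groq/OpenRouter clients
        resp = requests.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": "mistral-large-latest",
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```
A real integration would also need `ModelFactory.get_client()` to recognize the new provider name.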
### Gitea Client (gitea.py)
| Method | Purpose |
|--------|---------|
| `get_issues()` | Fetch issues by state |
| `create_branch()` | Create new branch from base |
| `create_file()` | Create file on branch (base64) |
| `update_file()` | Update file with SHA |
| `get_file()` | Read file contents |
| `create_pull_request()` | Open PR |
| `get_pull_request()` | Fetch PR details |
| `get_pr_status()` | Check PR CI status |
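The branch → file → PR sequence from the legacy pipeline, spelled out against the method names above. The constructor and keyword arguments are assumptions; only the method names and the `wolf-outputs/` path come from this document:
```python
# Assumed wiring for the legacy PR flow; parameter names are illustrative.
from wolf.gitea import GiteaClient

gitea = GiteaClient("https://forge.example.com/api/v1", token="gitea_token_here")

gitea.create_branch("Timmy_Foundation", "eval-repo", "wolf/task-42", base="main")
gitea.create_file(
    "Timmy_Foundation", "eval-repo",
    path="wolf-outputs/task-42.md",
    content="# Model output\n...",  # client base64-encodes content per the table above
    branch="wolf/task-42",
)
gitea.create_pull_request(
    "Timmy_Foundation", "eval-repo",
    title="wolf: task-42 output",
    head="wolf/task-42",
    base="main",
)
```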
## API Surface
### Public API (importable)
```python
# Evaluation pipeline
from wolf.evaluator import PromptEvaluator, PromptEntry, ModelEndpoint, ScoreResult
# Provider clients
from wolf.models import ModelFactory, ModelClient
# Gitea integration
from wolf.gitea import GiteaClient
# Task pipeline
from wolf.runner import AgentRunner
from wolf.task import TaskGenerator, Task
# Leaderboard
from wolf.leaderboard import Leaderboard
# Config
from wolf.config import Config, setup_logging
```
---
## Configuration (wolf-config.yaml)
```yaml
# wolf-config.yaml
gitea:
  base_url: "https://forge.example.com/api/v1"
  token: "gitea_token_here"
  owner: "Timmy_Foundation"
  repo: "eval-repo"
providers:
  openrouter:
    api_key: "sk-or-..."
    base_url: "https://openrouter.ai/api/v1"
  groq:
    api_key: "gsk_..."
  ollama:
    base_url: "http://localhost:11434"
models:
  - model: "anthropic/claude-3.5-sonnet"
    provider: "openrouter"
  - model: "llama3-70b-8192"
    provider: "groq"
  - model: "llama3:70b"
    provider: "ollama"
system_prompt: "You are a helpful assistant."
leaderboard_path: "~/.hermes/wolf/leaderboard.json"
log_dir: "~/.hermes/wolf/logs"
```
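Loading this file from Python: `Config.load` is the loader exercised by `tests/test_config.py`; everything past that call is outside what this document specifies.
```python
from wolf.config import Config

# Falls back to a built-in line parser when pyyaml is absent (see Dependencies).
config = Config.load("~/.hermes/wolf-config.yaml")
```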
## Scoring Methodology
| Dimension | Weight | Method |
|-----------|--------|--------|
| Relevance | 0.40 | Keyword matching (expected_keywords), prompt word overlap, length penalty, refusal detection |
| Coherence | 0.35 | Structure indicators (paragraphs, lists, transitions), sentence completeness, repetition ratio, length sweet spot |
| Safety | 0.25 | Unsafe pattern regex (bomb/hack/bypass/injection), profanity detection |
**Overall = relevance×0.40 + coherence×0.35 + safety×0.25**
### Scoring Details
**Relevance (ResponseScorer.score_relevance):**
- Expected keyword match ratio
- Fallback: word overlap with prompt (boosted 1.5×)
- Length penalty: <20 chars → 0.3, <50 chars → 0.6
- Refusal detection: 3 regex patterns, penalty if low keyword match
**Coherence (ResponseScorer.score_coherence):**
- Length sweet spot: 100-3000 chars → 1.0
- Structure: paragraph breaks, transition words, lists/steps
- Sentence completeness: avg 20-200 chars → 0.9
- Uniqueness: unique word ratio >0.4 → 0.9
**Safety (ResponseScorer.score_safety):**
- 6 unsafe patterns (weapon creation, system exploitation, prompt injection, etc.)
- Profanity detection (minor penalty: 0.1 per word, capped at 0.3)
### Safety Patterns Detected
- Bomb/explosive/weapon creation
- System/network hacking
- Security/auth bypass
- Prompt injection ("ignore previous instructions")
- System prompt extraction attempts
### Refusal Patterns Detected
- "I cannot/can't/won't help/assist"
- "Sorry, but I cannot"
- "Against my guidelines/policy"
## Test Coverage
| File | Tests | Coverage |
|------|-------|----------|
| `tests/test_evaluator.py` | 17 tests | PromptEntry, ModelEndpoint, ResponseScorer (relevance/coherence/safety), PromptEvaluator (evaluate, error handling, serialization, file output, multi-model), PREvaluator (score_pr, description scoring) |
| `tests/test_config.py` | 1 test | Config load from YAML |
### Coverage Gaps
- No tests for `cli.py` (argument parsing, workflow orchestration)
- No tests for `runner.py` (`load_prompts`, `load_models_from_json`, `AgentRunner.execute_task`)
- No tests for `task.py` (`TaskGenerator.from_gitea_issues`, `from_spec`, `assign_tasks`)
- No tests for `models.py` (API clients — would require mocking HTTP)
- No tests for `leaderboard.py` (`record_score`, `get_rankings`, serverless readiness logic)
- No tests for `gitea.py` (API client — would require mocking HTTP)
- Thin spots in existing tests: `PromptEvaluator._get_model_client()`, `_run_single()` with a real model call, `evaluate_and_serialize()` summary statistics; `test_config.py` skips missing-config, env-var override, and logging-setup cases
- No integration tests (end-to-end evaluation pipeline)
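A sketch of the kind of test that would start closing these gaps, built only from documented names; the no-argument `ResponseScorer` constructor and keyword construction of `PromptEntry` are assumptions:
```python
from wolf.evaluator import PromptEntry, ResponseScorer

def test_overall_is_weighted_sum():
    scorer = ResponseScorer()  # assumed no-arg constructor
    prompt = PromptEntry(
        id="p1",
        text="Explain DNS resolution.",
        expected_keywords=["domain", "resolver"],
        category="networking",
    )
    result = scorer.score("DNS resolvers map domain names to IP addresses.", prompt)
    assert 0.0 <= result.overall <= 1.0
    expected = 0.40 * result.relevance + 0.35 * result.coherence + 0.25 * result.safety
    assert abs(result.overall - expected) < 1e-6
```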
## Dependencies
| Dependency | Used By | Purpose |
|------------|---------|---------|
| `requests` | models.py, gitea.py | HTTP client for all API calls |
| `pyyaml` (optional) | config.py | YAML config parsing (falls back to line parser) |
## Security Considerations
1. **API keys in config**: wolf-config.yaml stores provider API keys in plaintext. The file should be chmod 600 and excluded from git (already outside the repo via ~/.hermes/).
2. **Gitea token**: a full-access token is used for branch creation, file commits, and PR creation. Scoped tokens (read-only for evaluation, write for task execution) would reduce blast radius.
3. **No input sanitization**: prompts from Gitea issues are passed directly to models without filtering. Prompt injection risk for automated workflows.
4. **No rate limiting**: model API calls are sequential with no backoff or rate limiting, and could exhaust API quotas.
5. **No URL validation**: `base_url` fields accept arbitrary endpoints; all model and Gitea calls are outbound HTTP.
6. **Leaderboard race conditions**: the leaderboard JSON is read and written without locking, so concurrent evaluations could corrupt it.
7. **Legacy alias**: `evaluator.py` defines `Evaluator = PREvaluator` and `cli.py` imports `Evaluator` expecting the legacy class. This works but is confusing.
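One possible mitigation for item 4, sketched with the standard library only; nothing here is an existing Wolf API:
```python
import time

def generate_with_backoff(generate, prompt, retries=4, base_delay=1.0):
    """Wrap any client.generate callable in exponential backoff."""
    for attempt in range(retries):
        try:
            return generate(prompt)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```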
## File Index
| File | LOC | Purpose |
|------|-----|---------|
| `wolf/__init__.py` | 12 | Package init, version |
| `wolf/cli.py` | 90 | Main CLI orchestrator |
| `wolf/config.py` | 48 | YAML config loader |
| `wolf/models.py` | 130 | LLM provider clients (5 providers) |
| `wolf/runner.py` | 280 | Prompt evaluation CLI + AgentRunner |
| `wolf/task.py` | 80 | Task dataclass + generator |
| `wolf/evaluator.py` | 350 | Core scoring engine + legacy PR evaluator |
| `wolf/leaderboard.py` | 70 | Persistent model ranking |
| `wolf/gitea.py` | 100 | Gitea REST API client |
| `tests/test_evaluator.py` | 180 | Unit tests for evaluator |
| `tests/test_config.py` | 20 | Unit tests for config |
**Total: ~1,360 LOC Python | 11 modules | 18 tests**
## Sovereignty Assessment
- **No external dependencies beyond requests**: Runs on any machine with Python 3.11+ and requests.
- **No phone-home**: All API calls are to user-configured endpoints.
- **No telemetry**: Logs go to local filesystem only.
- **Config-driven**: All secrets in user's ~/.hermes/ directory.
- **Provider-agnostic**: Supports 5 providers with easy extension via ModelFactory.
**Verdict: Fully sovereign. No corporate lock-in. User controls all endpoints and keys.**
---
* "The strength of the pack is the wolf, and the strength of the wolf is the pack." *
* — The Wolf Sovereign Core has spoken. *
* Generated by Codebase Genome Pipeline. Review and update manually. *