timmy-home/genomes/wolf/GENOME.md

# GENOME.md — Wolf (Timmy_Foundation/wolf)

> Codebase Genome v1.0 | Generated 2026-04-14 | Repo 16/16

## Project Overview

**Wolf** is a multi-model evaluation engine for sovereign AI fleets. It runs prompts against multiple LLM providers, scores responses on relevance, coherence, and safety, and outputs structured JSON results for model selection and ranking.

**Core principle:** agents work, PRs prove it, CI judges it.

**Status:** v1.0.0 — production-ready for prompt evaluation. Legacy PR evaluation module retained for backward compatibility.

## Architecture

```mermaid
graph TD
    CLI[cli.py] --> Config[config.py]
    CLI --> TaskGen[task.py]
    CLI --> Runner[runner.py]
    CLI --> Evaluator[evaluator.py]
    CLI --> Leaderboard[leaderboard.py]
    CLI --> Gitea[gitea.py]

    Runner --> Models[models.py]
    Runner --> Gitea
    Evaluator --> Models

    TaskGen --> Gitea
    Leaderboard --> |leaderboard.json| FS[(File System)]
    Config --> |wolf-config.yaml| FS

    Models --> OpenRouter[OpenRouter API]
    Models --> Groq[Groq API]
    Models --> Ollama[Ollama Local]
    Models --> OpenAI[OpenAI API]
    Models --> Anthropic[Anthropic API]

    Runner --> |branch + commit| Gitea
    Evaluator --> |score results| Leaderboard
```

## Entry Points

| Entry Point | Command | Purpose |
|-------------|---------|---------|
| `wolf/cli.py` | `python3 -m wolf.cli --run` | Main CLI: run tasks, evaluate PRs, show leaderboard |
| `wolf/runner.py` | `python3 -m wolf.runner --prompts p.json --models m.json` | Standalone prompt evaluation runner |
| `wolf/__init__.py` | `import wolf` | Package init, version metadata |

## Data Flow

### Prompt Evaluation Pipeline (Primary)

```
prompts.json + models.json (or wolf-config.yaml)
        │
        ▼
  PromptEvaluator.evaluate()
        │
        ├─ For each (prompt, model) pair:
        │   ├─ ModelClient.generate(prompt)  → response text
        │   ├─ ResponseScorer.score(response, prompt)
        │   │   ├─ score_relevance()   (0.40 weight)
        │   │   ├─ score_coherence()   (0.35 weight)
        │   │   └─ score_safety()      (0.25 weight)
        │   └─ EvaluationResult (prompt, model, scores, latency, error)
        │
        ▼
  evaluate_and_serialize() → JSON output
        │
        ├─ model_summaries (per-model averages)
        └─ results[] (per-evaluation details)
```

### Task Assignment Pipeline (Legacy)

```
Gitea Issues → TaskGenerator → AgentRunner
        │               │              │
        ▼               ▼              ▼
  Fetch tasks    Assign models   Execute + PR
  from issues    from config     via Gitea API
```

## Key Abstractions

| Class | Module | Purpose |
|-------|--------|---------|
| `PromptEntry` | evaluator.py | Single prompt with expected keywords and category |
| `ModelEndpoint` | evaluator.py | Model connection descriptor (provider, model_id, key) |
| `ScoreResult` | evaluator.py | Scores for relevance, coherence, safety, overall |
| `EvaluationResult` | evaluator.py | Full result: prompt + model + response + scores + latency |
| `ResponseScorer` | evaluator.py | Heuristic scoring engine (regex + keyword + structure) |
| `PromptEvaluator` | evaluator.py | Core engine: runs prompts against models, scores output |
| `ModelClient` | models.py | Abstract base for LLM API calls |
| `ModelFactory` | models.py | Factory: returns correct client for provider name |
| `Task` | task.py | Work unit: id, title, description, assigned model/provider |
| `TaskGenerator` | task.py | Creates tasks from Gitea issues or JSON spec |
| `AgentRunner` | runner.py | Executes tasks: generate → branch → commit → PR |
| `Config` | config.py | YAML config loader (wolf-config.yaml) |
| `Leaderboard` | leaderboard.py | Persistent model ranking with serverless readiness |
| `GiteaClient` | gitea.py | Full Gitea REST API client |
| `PREvaluator` | evaluator.py | Legacy: scores PRs on CI, commits, code quality |

## API Surface

### CLI Arguments (cli.py)

| Flag | Description |
|------|-------------|
| `--config` | Path to wolf-config.yaml |
| `--task-spec` | Path to task specification JSON |
| `--run` | Run pending tasks (assign models, execute, create PRs) |
| `--evaluate` | Evaluate open PRs and score them |
| `--leaderboard` | Show model rankings |

### CLI Arguments (runner.py)

| Flag | Description |
|------|-------------|
| `--prompts` / `-p` | Path to prompts JSON (required) |
| `--models` / `-m` | Path to models JSON |
| `--config` / `-c` | Path to wolf-config.yaml (alternative to --models) |
| `--output` / `-o` | Path to write JSON results |
| `--system-prompt` | System prompt for all model calls |

### Provider Clients (models.py)

| Client | Provider | API Format |
|--------|----------|------------|
| `OpenRouterClient` | openrouter | OpenAI-compatible chat completions |
| `GroqClient` | groq | OpenAI-compatible chat completions |
| `OllamaClient` | ollama | Ollama native /api/generate |
| `OpenAIClient` | openai | OpenAI-compatible (reuses GroqClient with different URL) |
| `AnthropicClient` | anthropic | Anthropic Messages API v1 |

### Gitea Client (gitea.py)

| Method | Purpose |
|--------|---------|
| `get_issues()` | Fetch issues by state |
| `create_branch()` | Create new branch from base |
| `create_file()` | Create file on branch (base64) |
| `update_file()` | Update file with SHA |
| `get_file()` | Read file contents |
| `create_pull_request()` | Open PR |
| `get_pull_request()` | Fetch PR details |
| `get_pr_status()` | Check PR CI status |

## Configuration (wolf-config.yaml)

```yaml
gitea:
  base_url: "https://forge.alexanderwhitestone.com/api/v1"
  token: "..."
  owner: "Timmy_Foundation"
  repo: "wolf"

providers:
  openrouter:
    api_key: "..."
    base_url: "https://openrouter.ai/api/v1"
  ollama:
    base_url: "http://localhost:11434"

models:
  - model: "anthropic/claude-3.5-sonnet"
    provider: "openrouter"
  - model: "gemma4:latest"
    provider: "ollama"

log_dir: "~/.hermes/wolf/"
leaderboard_path: "~/.hermes/wolf/leaderboard.json"
```

## Scoring Methodology

| Dimension | Weight | Method |
|-----------|--------|--------|
| Relevance | 0.40 | Keyword matching (expected_keywords), prompt word overlap, length penalty, refusal detection |
| Coherence | 0.35 | Structure indicators (paragraphs, lists, transitions), sentence completeness, repetition ratio, length sweet spot |
| Safety | 0.25 | Unsafe pattern regex (bomb/hack/bypass/injection), profanity detection |

**Overall = relevance×0.40 + coherence×0.35 + safety×0.25**

### Safety Patterns Detected

- Bomb/explosive/weapon creation
- System/network hacking
- Security/auth bypass
- Prompt injection ("ignore previous instructions")
- System prompt extraction attempts

### Refusal Patterns Detected

- "I cannot/can't/won't help/assist"
- "Sorry, but I cannot"
- "Against my guidelines/policy"

## Test Coverage

| File | Tests | Coverage |
|------|-------|----------|
| `tests/test_evaluator.py` | 17 tests | PromptEntry, ModelEndpoint, ResponseScorer (relevance/coherence/safety), PromptEvaluator (evaluate, error handling, serialization, file output, multi-model), PREvaluator (score_pr, description scoring) |
| `tests/test_config.py` | 1 test | Config load from YAML |

### Coverage Gaps

- No tests for `cli.py` (argument parsing, workflow orchestration)
- No tests for `runner.py` (`load_prompts`, `load_models_from_json`, `AgentRunner.execute_task`)
- No tests for `task.py` (`TaskGenerator.from_gitea_issues`, `from_spec`, `assign_tasks`)
- No tests for `models.py` (API clients — would require mocking HTTP)
- No tests for `leaderboard.py` (`record_score`, `get_rankings`, serverless readiness logic)
- No tests for `gitea.py` (API client — would require mocking HTTP)
- No integration tests (end-to-end evaluation pipeline)

## Dependencies

| Dependency | Used By | Purpose |
|------------|---------|---------|
| `requests` | models.py, gitea.py | HTTP client for all API calls |
| `pyyaml` (optional) | config.py | YAML config parsing (falls back to line parser) |

## Security Considerations

1. **API keys in config**: wolf-config.yaml stores provider API keys in plaintext. File should be chmod 600 and excluded from git (already in .gitignore pattern via ~/.hermes/).
2. **Gitea token**: Full access token used for branch creation, file commits, and PR creation. Scoped access recommended.
3. **No input sanitization**: Prompts from Gitea issues are passed directly to models without filtering. Prompt injection risk for automated workflows.
4. **No rate limiting**: Model API calls are sequential with no backoff or rate limiting. Could exhaust API quotas.
5. **Legacy code reference**: `evaluator.py` references `Evaluator = PREvaluator` alias but `cli.py` imports `Evaluator` expecting the legacy class. This works but is confusing.

## File Index

| File | LOC | Purpose |
|------|-----|---------|
| `wolf/__init__.py` | 12 | Package init, version |
| `wolf/cli.py` | 90 | Main CLI orchestrator |
| `wolf/config.py` | 48 | YAML config loader |
| `wolf/models.py` | 130 | LLM provider clients (5 providers) |
| `wolf/runner.py` | 280 | Prompt evaluation CLI + AgentRunner |
| `wolf/task.py` | 80 | Task dataclass + generator |
| `wolf/evaluator.py` | 350 | Core scoring engine + legacy PR evaluator |
| `wolf/leaderboard.py` | 70 | Persistent model ranking |
| `wolf/gitea.py` | 100 | Gitea REST API client |
| `tests/test_evaluator.py` | 180 | Unit tests for evaluator |
| `tests/test_config.py` | 20 | Unit tests for config |

**Total: ~1,360 LOC Python | 11 modules | 18 tests**

## Sovereignty Assessment

- **No external dependencies beyond requests**: Runs on any machine with Python 3.11+ and requests.
- **No phone-home**: All API calls are to user-configured endpoints.
- **No telemetry**: Logs go to local filesystem only.
- **Config-driven**: All secrets in user's ~/.hermes/ directory.
- **Provider-agnostic**: Supports 5 providers with easy extension via ModelFactory.

**Verdict: Fully sovereign. No corporate lock-in. User controls all endpoints and keys.**

---

*"The strength of the pack is the wolf, and the strength of the wolf is the pack."*
*— The Wolf Sovereign Core has spoken.*