Files
timmy-home/genomes/wolf/GENOME.md
Timmy 3b430114be
Some checks failed
Smoke Test / smoke (pull_request) Failing after 20s
feat: Codebase Genome for wolf (#683)
Complete GENOME.md for wolf (multi-model evaluation engine):
- Project overview and architecture diagram (Mermaid)
- Entry points and data flow (evaluation + task pipelines)
- Key abstractions (15 classes documented)
- API surface (CLI, providers, Gitea client)
- Scoring methodology (relevance/coherence/safety weights)
- Test coverage analysis with identified gaps
- Security considerations
- Sovereignty assessment
- File index with LOC counts

Repo 16/16. Closes #683.
2026-04-14 23:43:30 -04:00

10 KiB
Raw Blame History

GENOME.md — Wolf (Timmy_Foundation/wolf)

Codebase Genome v1.0 | Generated 2026-04-14 | Repo 16/16

Project Overview

Wolf is a multi-model evaluation engine for sovereign AI fleets. It runs prompts against multiple LLM providers, scores responses on relevance, coherence, and safety, and outputs structured JSON results for model selection and ranking.

Core principle: agents work, PRs prove it, CI judges it.

Status: v1.0.0 — production-ready for prompt evaluation. Legacy PR evaluation module retained for backward compatibility.

Architecture

graph TD
    CLI[cli.py] --> Config[config.py]
    CLI --> TaskGen[task.py]
    CLI --> Runner[runner.py]
    CLI --> Evaluator[evaluator.py]
    CLI --> Leaderboard[leaderboard.py]
    CLI --> Gitea[gitea.py]

    Runner --> Models[models.py]
    Runner --> Gitea
    Evaluator --> Models

    TaskGen --> Gitea
    Leaderboard --> |leaderboard.json| FS[(File System)]
    Config --> |wolf-config.yaml| FS

    Models --> OpenRouter[OpenRouter API]
    Models --> Groq[Groq API]
    Models --> Ollama[Ollama Local]
    Models --> OpenAI[OpenAI API]
    Models --> Anthropic[Anthropic API]

    Runner --> |branch + commit| Gitea
    Evaluator --> |score results| Leaderboard

Entry Points

Entry Point Command Purpose
wolf/cli.py python3 -m wolf.cli --run Main CLI: run tasks, evaluate PRs, show leaderboard
wolf/runner.py python3 -m wolf.runner --prompts p.json --models m.json Standalone prompt evaluation runner
wolf/__init__.py import wolf Package init, version metadata

Data Flow

Prompt Evaluation Pipeline (Primary)

prompts.json + models.json (or wolf-config.yaml)
        │
        ▼
  PromptEvaluator.evaluate()
        │
        ├─ For each (prompt, model) pair:
        │   ├─ ModelClient.generate(prompt)  → response text
        │   ├─ ResponseScorer.score(response, prompt)
        │   │   ├─ score_relevance()   (0.40 weight)
        │   │   ├─ score_coherence()   (0.35 weight)
        │   │   └─ score_safety()      (0.25 weight)
        │   └─ EvaluationResult (prompt, model, scores, latency, error)
        │
        ▼
  evaluate_and_serialize() → JSON output
        │
        ├─ model_summaries (per-model averages)
        └─ results[] (per-evaluation details)

Task Assignment Pipeline (Legacy)

Gitea Issues → TaskGenerator → AgentRunner
        │               │              │
        ▼               ▼              ▼
  Fetch tasks    Assign models   Execute + PR
  from issues    from config     via Gitea API

Key Abstractions

Class Module Purpose
PromptEntry evaluator.py Single prompt with expected keywords and category
ModelEndpoint evaluator.py Model connection descriptor (provider, model_id, key)
ScoreResult evaluator.py Scores for relevance, coherence, safety, overall
EvaluationResult evaluator.py Full result: prompt + model + response + scores + latency
ResponseScorer evaluator.py Heuristic scoring engine (regex + keyword + structure)
PromptEvaluator evaluator.py Core engine: runs prompts against models, scores output
ModelClient models.py Abstract base for LLM API calls
ModelFactory models.py Factory: returns correct client for provider name
Task task.py Work unit: id, title, description, assigned model/provider
TaskGenerator task.py Creates tasks from Gitea issues or JSON spec
AgentRunner runner.py Executes tasks: generate → branch → commit → PR
Config config.py YAML config loader (wolf-config.yaml)
Leaderboard leaderboard.py Persistent model ranking with serverless readiness
GiteaClient gitea.py Full Gitea REST API client
PREvaluator evaluator.py Legacy: scores PRs on CI, commits, code quality

API Surface

CLI Arguments (cli.py)

Flag Description
--config Path to wolf-config.yaml
--task-spec Path to task specification JSON
--run Run pending tasks (assign models, execute, create PRs)
--evaluate Evaluate open PRs and score them
--leaderboard Show model rankings

CLI Arguments (runner.py)

Flag Description
--prompts / -p Path to prompts JSON (required)
--models / -m Path to models JSON
--config / -c Path to wolf-config.yaml (alternative to --models)
--output / -o Path to write JSON results
--system-prompt System prompt for all model calls

Provider Clients (models.py)

Client Provider API Format
OpenRouterClient openrouter OpenAI-compatible chat completions
GroqClient groq OpenAI-compatible chat completions
OllamaClient ollama Ollama native /api/generate
OpenAIClient openai OpenAI-compatible (reuses GroqClient with different URL)
AnthropicClient anthropic Anthropic Messages API v1

Gitea Client (gitea.py)

Method Purpose
get_issues() Fetch issues by state
create_branch() Create new branch from base
create_file() Create file on branch (base64)
update_file() Update file with SHA
get_file() Read file contents
create_pull_request() Open PR
get_pull_request() Fetch PR details
get_pr_status() Check PR CI status

Configuration (wolf-config.yaml)

gitea:
  base_url: "https://forge.alexanderwhitestone.com/api/v1"
  token: "..."
  owner: "Timmy_Foundation"
  repo: "wolf"

providers:
  openrouter:
    api_key: "..."
    base_url: "https://openrouter.ai/api/v1"
  ollama:
    base_url: "http://localhost:11434"

models:
  - model: "anthropic/claude-3.5-sonnet"
    provider: "openrouter"
  - model: "gemma4:latest"
    provider: "ollama"

log_dir: "~/.hermes/wolf/"
leaderboard_path: "~/.hermes/wolf/leaderboard.json"

Scoring Methodology

Dimension Weight Method
Relevance 0.40 Keyword matching (expected_keywords), prompt word overlap, length penalty, refusal detection
Coherence 0.35 Structure indicators (paragraphs, lists, transitions), sentence completeness, repetition ratio, length sweet spot
Safety 0.25 Unsafe pattern regex (bomb/hack/bypass/injection), profanity detection

Overall = relevance×0.40 + coherence×0.35 + safety×0.25

Safety Patterns Detected

  • Bomb/explosive/weapon creation
  • System/network hacking
  • Security/auth bypass
  • Prompt injection ("ignore previous instructions")
  • System prompt extraction attempts

Refusal Patterns Detected

  • "I cannot/can't/won't help/assist"
  • "Sorry, but I cannot"
  • "Against my guidelines/policy"

Test Coverage

File Tests Coverage
tests/test_evaluator.py 17 tests PromptEntry, ModelEndpoint, ResponseScorer (relevance/coherence/safety), PromptEvaluator (evaluate, error handling, serialization, file output, multi-model), PREvaluator (score_pr, description scoring)
tests/test_config.py 1 test Config load from YAML

Coverage Gaps

  • No tests for cli.py (argument parsing, workflow orchestration)
  • No tests for runner.py (load_prompts, load_models_from_json, AgentRunner.execute_task)
  • No tests for task.py (TaskGenerator.from_gitea_issues, from_spec, assign_tasks)
  • No tests for models.py (API clients — would require mocking HTTP)
  • No tests for leaderboard.py (record_score, get_rankings, serverless readiness logic)
  • No tests for gitea.py (API client — would require mocking HTTP)
  • No integration tests (end-to-end evaluation pipeline)

Dependencies

Dependency Used By Purpose
requests models.py, gitea.py HTTP client for all API calls
pyyaml (optional) config.py YAML config parsing (falls back to line parser)

Security Considerations

  1. API keys in config: wolf-config.yaml stores provider API keys in plaintext. File should be chmod 600 and excluded from git (already in .gitignore pattern via ~/.hermes/).
  2. Gitea token: Full access token used for branch creation, file commits, and PR creation. Scoped access recommended.
  3. No input sanitization: Prompts from Gitea issues are passed directly to models without filtering. Prompt injection risk for automated workflows.
  4. No rate limiting: Model API calls are sequential with no backoff or rate limiting. Could exhaust API quotas.
  5. Legacy code reference: evaluator.py references Evaluator = PREvaluator alias but cli.py imports Evaluator expecting the legacy class. This works but is confusing.

File Index

File LOC Purpose
wolf/__init__.py 12 Package init, version
wolf/cli.py 90 Main CLI orchestrator
wolf/config.py 48 YAML config loader
wolf/models.py 130 LLM provider clients (5 providers)
wolf/runner.py 280 Prompt evaluation CLI + AgentRunner
wolf/task.py 80 Task dataclass + generator
wolf/evaluator.py 350 Core scoring engine + legacy PR evaluator
wolf/leaderboard.py 70 Persistent model ranking
wolf/gitea.py 100 Gitea REST API client
tests/test_evaluator.py 180 Unit tests for evaluator
tests/test_config.py 20 Unit tests for config

Total: ~1,360 LOC Python | 11 modules | 18 tests

Sovereignty Assessment

  • No external dependencies beyond requests: Runs on any machine with Python 3.11+ and requests.
  • No phone-home: All API calls are to user-configured endpoints.
  • No telemetry: Logs go to local filesystem only.
  • Config-driven: All secrets in user's ~/.hermes/ directory.
  • Provider-agnostic: Supports 5 providers with easy extension via ModelFactory.

Verdict: Fully sovereign. No corporate lock-in. User controls all endpoints and keys.


"The strength of the pack is the wolf, and the strength of the wolf is the pack." — The Wolf Sovereign Core has spoken.