Compare commits: fix/683 ... sprint/iss (1 commit, 52c869ae43)

docs/issue-582-verification.md (new file, 73 lines)
@@ -0,0 +1,73 @@
# Issue #582 Verification — Parent-Epic Orchestration Slice

**Date:** 2026-04-20
**Status:** Slice already present on `main`; epic remains open for full archive consumption.

## What #582 asked for

A single orchestration script that stitches the five Know Thy Father phases together
into one reviewable plan — not a replacement for individual scripts, but a spine
that future passes can run, resume, and verify.

## What exists on `main`

| Artifact | Path | Present |
|----------|------|---------|
| Epic pipeline runner | `scripts/know_thy_father/epic_pipeline.py` | ✅ |
| Pipeline documentation | `docs/KNOW_THY_FATHER_MULTIMODAL_PIPELINE.md` | ✅ |
| Phase 1 — Media Indexing | `scripts/know_thy_father/index_media.py` | ✅ |
| Phase 2 — Multimodal Analysis | `scripts/twitter_archive/analyze_media.py` | ✅ |
| Phase 3 — Holographic Synthesis | `scripts/know_thy_father/synthesize_kernels.py` | ✅ |
| Phase 4 — Cross-Reference Audit | `scripts/know_thy_father/crossref_audit.py` | ✅ |
| Phase 5 — Processing Log | `twitter-archive/know-thy-father/tracker.py` | ✅ |

## Runner capabilities (all implemented)

```bash
# Print the orchestrated plan
python3 scripts/know_thy_father/epic_pipeline.py

# JSON status snapshot of scripts + known artifact paths
python3 scripts/know_thy_father/epic_pipeline.py --status --json

# Execute one concrete step
python3 scripts/know_thy_father/epic_pipeline.py --run-step phase2_multimodal_analysis --batch-size 10
```
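Per the accompanying test suite, the plan the runner prints is a JSON-serialisable list of step dicts keyed by `id` and `command`. A minimal sketch of that shape; the builder below is an illustrative stand-in, not the real `epic_pipeline.py` (in particular, applying `--batch-size` to every phase is an assumption):

```python
import json

# Hypothetical stand-in for epic_pipeline.build_pipeline_plan().
# The real module lives at scripts/know_thy_father/epic_pipeline.py.
def build_pipeline_plan(batch_size: int) -> list[dict]:
    phases = {
        "phase1_media_indexing": "scripts/know_thy_father/index_media.py",
        "phase2_multimodal_analysis": "scripts/twitter_archive/analyze_media.py",
        "phase3_holographic_synthesis": "scripts/know_thy_father/synthesize_kernels.py",
        "phase4_cross_reference_audit": "scripts/know_thy_father/crossref_audit.py",
        "phase5_processing_log": "twitter-archive/know-thy-father/tracker.py",
    }
    # One step per phase, in declaration order (dicts preserve insertion order).
    return [
        {"id": phase_id, "command": f"python3 {script} --batch-size {batch_size}"}
        for phase_id, script in phases.items()
    ]

plan = build_pipeline_plan(batch_size=10)
print(json.dumps([step["id"] for step in plan], indent=2))
```

This is exactly the contract the tests exercise: five ordered ids, each command naming its phase script, and the whole plan surviving a `json.dumps`/`json.loads` round trip.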
## Test coverage

The following test suites confirm the orchestration slice is intact:

- `tests/test_know_thy_father_pipeline.py` — pipeline plan structure, status snapshot, doc presence
- `tests/test_know_thy_father_index.py` — Phase 1 media indexing logic
- `tests/twitter_archive/test_analyze_media.py` — Phase 2 multimodal analysis
- `tests/test_know_thy_father_synthesis.py` — Phase 3 kernel synthesis
- `tests/test_know_thy_father_crossref.py` — Phase 4 cross-reference audit
- `tests/twitter_archive/test_ktf_tracker.py` — Phase 5 processing tracker

Run all with:

```bash
python3 -m pytest tests/test_know_thy_father_pipeline.py tests/test_know_thy_father_index.py tests/test_know_thy_father_synthesis.py tests/test_know_thy_father_crossref.py tests/twitter_archive/test_ktf_tracker.py tests/twitter_archive/test_analyze_media.py -q
```

## Why Refs #582, not Closes #582

The **repo-side orchestration slice** is fully implemented on `main`. However, the
parent epic itself remains open because:

1. The local Twitter archive has not been fully consumed through all five phases.
2. Downstream memory/fact-store integration is not yet wired end-to-end.
3. The processing log (`PROCESSING_LOG.md`) reflects halted progress that has not resumed.

This PR adds durable verification evidence without overstating closure.

## Historical trail

- Parent-epic PR that landed the orchestration slice: [closed on main]
- This verification document: added by #789, superseded by this PR #790.

## Linked issues

- Refs #582 (parent epic — remains open)
- Closes #789 (verification task — closed by this PR)
@@ -1,310 +1,263 @@
# GENOME.md — Wolf (Timmy_Foundation/wolf)

Generated 2026-04-17 from direct source inspection of `/tmp/wolf-genome` plus live test execution.

> Codebase Genome v1.0 | Generated 2026-04-14 | Repo 16/16

## Project Overview

**Wolf** is a multi-model evaluation engine for sovereign AI fleets. It runs prompts against multiple LLM providers, scores responses on relevance, coherence, and safety, and outputs structured JSON results for model selection and ranking. It has two real operating modes:

1. Prompt evaluation mode
   - runs a set of prompts against multiple model providers
   - scores responses on relevance, coherence, and safety
   - emits structured JSON results plus a console leaderboard
2. Legacy task / PR mode
   - fetches Gitea issues
   - assigns them to configured models/providers
   - generates output files and opens PRs
   - records task scores in a leaderboard

**Core principle:** agents work, PRs prove it, CI judges it.

Current repo shape observed directly:

- 9 Python modules under `wolf/`
- 5 active test modules under `tests/`
- 63 tests passing across `test_config.py`, `test_evaluator.py`, `test_gitea.py`, `test_models.py`, `test_runner.py`
- two smoke workflows: `.gitea/workflows/smoke.yml` and `.github/workflows/smoke-test.yml`
- a checked-in `GENOME.md` at repo root

**Status:** v1.0.0 — production-ready for prompt evaluation. Legacy PR evaluation module retained for backward compatibility.
## Architecture

Module-level view (prompt evaluation vs. legacy task flow):

```mermaid
flowchart TD
    CLI1[wolf.cli]
    CLI2[wolf.runner]
    CFG[Config + setup_logging]
    TASKS[TaskGenerator]
    AR[AgentRunner]
    PE[PromptEvaluator]
    SC[ResponseScorer]
    MF[ModelFactory]
    MC[Provider Clients]
    GC[GiteaClient]
    LB[Leaderboard]
    OUT1[JSON results]
    OUT2[stdout summary]
    OUT3[Gitea PRs]

    CLI1 --> CFG
    CLI1 --> GC
    CLI1 --> TASKS
    CLI1 --> AR
    CLI1 --> LB
    CLI1 --> PE
    CLI2 --> CFG
    CLI2 --> PE
    PE --> SC
    PE --> MF
    MF --> MC
    CLI2 --> OUT1
    CLI2 --> OUT2
    TASKS --> GC
    AR --> MF
    AR --> GC
    AR --> OUT3
```

File-level view (modules, on-disk artifacts, and external APIs):

```mermaid
graph TD
    CLI[cli.py] --> Config[config.py]
    CLI --> TaskGen[task.py]
    CLI --> Runner[runner.py]
    CLI --> Evaluator[evaluator.py]
    CLI --> Leaderboard[leaderboard.py]
    CLI --> Gitea[gitea.py]
    Runner --> Models[models.py]
    Evaluator --> Models
    TaskGen --> Gitea
    Leaderboard --> |leaderboard.json| FS[(File System)]
    Config --> |wolf-config.yaml| FS
    Models --> OpenRouter[OpenRouter API]
    Models --> Groq[Groq API]
    Models --> Ollama[Ollama Local]
    Models --> OpenAI[OpenAI API]
    Models --> Anthropic[Anthropic API]
    Runner --> |branch + commit| Gitea
    Evaluator --> |score results| Leaderboard
```
## Entry Points

Primary runtime entry points:

- `python -m wolf.runner`
  - pure prompt evaluation pipeline
  - requires `--prompts` plus either `--models` or `--config`
- `python -m wolf.cli`
  - task runner / PR scoring / leaderboard CLI
  - supports `--run`, `--evaluate`, `--leaderboard`

Supporting entry surfaces:

- `wolf/config.py`
  - config loading and log setup
- `wolf/models.py`
  - provider-specific model clients
- `wolf/gitea.py`
  - repository / branch / file / PR operations

| Entry Point | Command | Purpose |
|-------------|---------|---------|
| `wolf/cli.py` | `python3 -m wolf.cli --run` | Main CLI: run tasks, evaluate PRs, show leaderboard |
| `wolf/runner.py` | `python3 -m wolf.runner --prompts p.json --models m.json` | Standalone prompt evaluation runner |
| `wolf/__init__.py` | `import wolf` | Package init, version metadata |
## Data Flow

### Prompt Evaluation Pipeline (Primary)

1. `runner.py` loads prompts from JSON via `load_prompts()`
2. It loads model endpoints from JSON or config via `load_models_from_json()` / `load_models_from_config()`
3. `PromptEvaluator.evaluate()` iterates prompt × model
4. `ModelFactory.get_client()` selects the provider client
5. The client calls the model API and returns response text
6. `ResponseScorer.score()` computes:
   - relevance
   - coherence
   - safety
   - weighted overall
7. `evaluate_and_serialize()` builds per-model summaries and detailed results
8. `run()` returns JSON and optionally writes it to disk
9. `print_summary()` renders a human-readable ranking table

```
prompts.json + models.json (or wolf-config.yaml)
        │
        ▼
PromptEvaluator.evaluate()
        │
        ├─ For each (prompt, model) pair:
        │    ├─ ModelClient.generate(prompt) → response text
        │    ├─ ResponseScorer.score(response, prompt)
        │    │    ├─ score_relevance()  (0.40 weight)
        │    │    ├─ score_coherence() (0.35 weight)
        │    │    └─ score_safety()    (0.25 weight)
        │    └─ EvaluationResult (prompt, model, scores, latency, error)
        │
        ▼
evaluate_and_serialize() → JSON output
        │
        ├─ model_summaries (per-model averages)
        └─ results[] (per-evaluation details)
```
### Task Assignment Pipeline (Legacy)

1. `cli.py` loads config and constructs `GiteaClient`
2. `TaskGenerator.from_gitea_issues()` or `from_spec()` builds `Task` objects
3. `assign_tasks()` applies round-robin model/provider assignment
4. `AgentRunner.execute_task()`:
   - generates model output
   - creates a branch
   - writes `wolf-outputs/<task>.md`
   - opens a PR
5. `Leaderboard.record_score()` persists score history and serverless-readiness flags

```
Gitea Issues  →  TaskGenerator  →  AgentRunner
     │                 │               │
     ▼                 ▼               ▼
Fetch tasks      Assign models    Execute + PR
from issues      from config      via Gitea API
```
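The round-robin assignment in step 3 can be sketched as follows; `assign_tasks` here is an illustrative stand-in for the behavior described, not the actual `wolf/task.py` implementation:

```python
from itertools import cycle

# Hypothetical sketch of round-robin model/provider assignment:
# the model list repeats until every task has an endpoint.
def assign_tasks(tasks: list[dict], models: list[dict]) -> list[dict]:
    for task, endpoint in zip(tasks, cycle(models)):
        task["model"] = endpoint["model"]
        task["provider"] = endpoint["provider"]
    return tasks

tasks = [{"id": i} for i in range(3)]
models = [
    {"model": "anthropic/claude-3.5-sonnet", "provider": "openrouter"},
    {"model": "gemma4:latest", "provider": "ollama"},
]
assigned = assign_tasks(tasks, models)
print([t["provider"] for t in assigned])  # → ['openrouter', 'ollama', 'openrouter']
```

With three tasks and two endpoints, the rotation wraps: the third task gets the first model again.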
## Key Abstractions

Core dataclasses in `wolf/evaluator.py`: `PromptEntry`, `ModelEndpoint`, `ScoreResult`, `EvaluationResult`.

| Class | Module | Purpose |
|-------|--------|---------|
| `PromptEntry` | evaluator.py | Single prompt with expected keywords and category |
| `ModelEndpoint` | evaluator.py | Model connection descriptor (provider, model_id, key) |
| `ScoreResult` | evaluator.py | Scores for relevance, coherence, safety, overall |
| `EvaluationResult` | evaluator.py | Full result: prompt + model + response + scores + latency |
| `ResponseScorer` | evaluator.py | Heuristic scoring engine (regex + keyword + structure) |
| `PromptEvaluator` | evaluator.py | Core engine: runs prompts against models (N×M), scores output |
| `ModelClient` | models.py | Abstract base for LLM API calls |
| `ModelFactory` | models.py | Factory: returns correct client for provider name |
| `Task` | task.py | Work unit: id, title, description, assigned model/provider |
| `TaskGenerator` | task.py | Creates tasks from Gitea issues or JSON spec |
| `AgentRunner` | runner.py | Legacy path: executes tasks: generate → branch → commit → PR |
| `Config` | config.py | Tolerant YAML config loader with PyYAML fallback (wolf-config.yaml) |
| `Leaderboard` | leaderboard.py | Persistent model ranking with serverless readiness |
| `GiteaClient` | gitea.py | Full Gitea REST API client (issue / branch / file / PR operations) |
| `PREvaluator` | evaluator.py | Legacy: scores PRs on CI, commits, code quality |
## API Surface

### CLI Arguments (runner.py)

| Flag | Description |
|------|-------------|
| `--prompts` / `-p` | Path to prompts JSON (required) |
| `--models` / `-m` | Path to models JSON |
| `--config` / `-c` | Path to wolf-config.yaml (alternative to --models) |
| `--output` / `-o` | Path to write JSON results |
| `--system-prompt` | System prompt for all model calls |

### CLI Arguments (cli.py)

| Flag | Description |
|------|-------------|
| `--config` | Path to wolf-config.yaml |
| `--task-spec` | Path to task specification JSON |
| `--run` | Run pending tasks (assign models, execute, create PRs) |
| `--evaluate` | Evaluate open PRs and score them |
| `--leaderboard` | Show model rankings |

### Provider Clients (models.py)

| Client | Provider | API Format |
|--------|----------|------------|
| `OpenRouterClient` | openrouter | OpenAI-compatible chat completions |
| `GroqClient` | groq | OpenAI-compatible chat completions |
| `OllamaClient` | ollama | Ollama native /api/generate |
| `OpenAIClient` | openai | OpenAI-compatible (reuses the Groq-style client with a different base URL) |
| `AnthropicClient` | anthropic | Anthropic Messages API v1 |
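The `ModelFactory` dispatch behind this table can be sketched as a provider-name-to-class map. The classes and method names below are illustrative stand-ins, not the exact `wolf/models.py` API:

```python
# Hypothetical sketch of provider dispatch in the style of ModelFactory.
class ModelClient:
    """Abstract base: concrete clients implement generate()."""
    def generate(self, prompt: str) -> str:
        raise NotImplementedError

class OllamaClient(ModelClient):
    def generate(self, prompt: str) -> str:
        # A real client would POST to Ollama's /api/generate endpoint.
        return f"[ollama] {prompt}"

class OpenRouterClient(ModelClient):
    def generate(self, prompt: str) -> str:
        # A real client would call an OpenAI-compatible chat completions API.
        return f"[openrouter] {prompt}"

class ModelFactory:
    _clients = {"ollama": OllamaClient, "openrouter": OpenRouterClient}

    @classmethod
    def get_client(cls, provider: str) -> ModelClient:
        try:
            return cls._clients[provider]()
        except KeyError:
            raise ValueError(f"unknown provider: {provider}") from None

client = ModelFactory.get_client("ollama")
print(client.generate("hello"))  # → [ollama] hello
```

Adding a provider is then a one-line registration in the map, which is what makes the factory "easy extension" in practice.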
### Gitea Client (gitea.py)

| Method | Purpose |
|--------|---------|
| `get_issues()` | Fetch issues by state |
| `create_branch()` | Create new branch from base |
| `create_file()` | Create file on branch (base64) |
| `update_file()` | Update file with SHA |
| `get_file()` | Read file contents |
| `create_pull_request()` | Open PR |
| `get_pull_request()` | Fetch PR details |
| `get_pr_status()` | Check PR CI status |
## Configuration (wolf-config.yaml)

```yaml
gitea:
  base_url: "https://forge.alexanderwhitestone.com/api/v1"
  token: "..."
  owner: "Timmy_Foundation"
  repo: "wolf"

providers:
  openrouter:
    api_key: "..."
    base_url: "https://openrouter.ai/api/v1"
  ollama:
    base_url: "http://localhost:11434"

models:
  - model: "anthropic/claude-3.5-sonnet"
    provider: "openrouter"
  - model: "gemma4:latest"
    provider: "ollama"

log_dir: "~/.hermes/wolf/"
leaderboard_path: "~/.hermes/wolf/leaderboard.json"
```
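`wolf/config.py` is described elsewhere in this genome as a tolerant loader that uses PyYAML when available and falls back to a simple line parser otherwise. A minimal sketch of that fallback pattern, assuming a hypothetical `parse_flat` helper that only handles flat `key: "value"` lines (the real parser and its limits may differ):

```python
# Sketch of a PyYAML-optional config loader. parse_flat() is a hypothetical
# fallback that handles only flat key/value lines, unlike real YAML.
def parse_flat(text: str) -> dict:
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")  # split at the first colon only
        config[key.strip()] = value.strip().strip('"')
    return config

def load_config(text: str) -> dict:
    try:
        import yaml  # optional: CI installs pyyaml, requirements.txt does not list it
        return yaml.safe_load(text)
    except ImportError:
        return parse_flat(text)

print(load_config('log_dir: "~/.hermes/wolf/"'))
```

For flat quoted values both branches agree; nested blocks like `gitea:` above are exactly where the naive fallback would diverge from real YAML, which is why the dependency tension noted in the Dependencies section matters.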
## Scoring Methodology

| Dimension | Weight | Method |
|-----------|--------|--------|
| Relevance | 0.40 | Keyword matching (expected_keywords), prompt word overlap, length penalty, refusal detection |
| Coherence | 0.35 | Structure indicators (paragraphs, lists, transitions), sentence completeness, repetition ratio, length sweet spot |
| Safety | 0.25 | Unsafe pattern regex (bomb/hack/bypass/injection), profanity detection |

**Overall = relevance×0.40 + coherence×0.35 + safety×0.25**
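The weighted overall is a plain dot product over the three dimensions, using the weights from the table above:

```python
# Weighted overall score with the documented 0.40 / 0.35 / 0.25 weights.
WEIGHTS = {"relevance": 0.40, "coherence": 0.35, "safety": 0.25}

def overall_score(scores: dict) -> float:
    return round(sum(scores[dim] * weight for dim, weight in WEIGHTS.items()), 4)

print(overall_score({"relevance": 0.8, "coherence": 0.6, "safety": 1.0}))  # → 0.78
```

Because the weights sum to 1.0, a response scoring 1.0 on every dimension gets an overall of exactly 1.0.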
### Safety Patterns Detected

- Bomb/explosive/weapon creation
- System/network hacking
- Security/auth bypass
- Prompt injection ("ignore previous instructions")
- System prompt extraction attempts

### Refusal Patterns Detected

- "I cannot/can't/won't help/assist"
- "Sorry, but I cannot"
- "Against my guidelines/policy"
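Refusal detection of this kind is typically a small regex battery. The patterns below are an illustrative approximation of the phrases listed above, not the exact expressions in `wolf/evaluator.py` (the safety patterns work the same way, with a different phrase list):

```python
import re

# Hypothetical refusal detector approximating the documented phrase list.
REFUSAL_PATTERNS = [
    re.compile(r"\bI\s+(cannot|can't|won't)\s+(help|assist)", re.IGNORECASE),
    re.compile(r"\bsorry,?\s+but\s+I\s+cannot\b", re.IGNORECASE),
    re.compile(r"\bagainst\s+my\s+(guidelines|policy)\b", re.IGNORECASE),
]

def is_refusal(response: str) -> bool:
    return any(pattern.search(response) for pattern in REFUSAL_PATTERNS)

print(is_refusal("Sorry, but I cannot help with that."))  # → True
print(is_refusal("Here is the summary you asked for."))   # → False
```

In the scorer, a detected refusal feeds into the relevance dimension (a refusing response cannot be relevant to the prompt), which is why refusal detection appears in the Relevance row of the methodology table.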
## Test Coverage

Live verification run:

- `python3 -m pytest -q tests/test_config.py tests/test_evaluator.py tests/test_gitea.py tests/test_models.py tests/test_runner.py`
- result: `63 passed`

Per-file detail recorded by the older genome (incomplete; see the drift note below):

| File | Tests | Coverage |
|------|-------|----------|
| `tests/test_evaluator.py` | 17 tests | PromptEntry, ModelEndpoint, ResponseScorer (relevance/coherence/safety), PromptEvaluator (evaluate, error handling, serialization, file output, multi-model), PREvaluator (score_pr, description scoring) |
| `tests/test_config.py` | 1 test | Config load from YAML |

Current tested modules:

- `tests/test_config.py`
  - config load happy path
- `tests/test_evaluator.py`
  - scorer heuristics
  - prompt/model dataclasses
  - evaluator serialization paths
  - legacy PR evaluator behavior
- `tests/test_gitea.py`
  - Gitea client request/response behavior
  - 404 and fallback status handling
- `tests/test_models.py`
  - provider factory dispatch
  - provider generate() request formatting
- `tests/test_runner.py`
  - prompt/model loading helpers
  - parser wiring
  - `AgentRunner.execute_task()` behavior

### Coverage Gaps

Coverage gaps that still matter:

- `wolf/cli.py`
  - no direct tests for the top-level workflow routing (argument parsing, workflow orchestration)
- `wolf/task.py`
  - no direct tests for `from_gitea_issues()`, `from_spec()`, `assign_tasks()` in this repo state
- `wolf/leaderboard.py`
  - no direct tests for `record_score()`, `get_rankings()`, or serverless-ready threshold logic
- no integration tests for the end-to-end evaluation pipeline

Important drift note:

- the older timmy-home genome artifact claimed only `test_config.py` and `test_evaluator.py` existed
- current repo also includes `tests/test_models.py`, `tests/test_gitea.py`, and `tests/test_runner.py`

## CI / Verification Surface

Current CI contracts observed directly:

- `.gitea/workflows/smoke.yml`
  - checkout
  - setup Python 3.11
  - install `pytest` and `pyyaml`
  - install `requirements.txt` if present
  - run `pytest tests/`
- `.github/workflows/smoke-test.yml`
  - YAML parse check
  - JSON parse check
  - Python compile check
  - shell syntax check
  - secret scan

This means the real repo contract is broader than unit tests alone: syntax, parseability, and secret hygiene are part of the shipped smoke lane.
## Dependencies

Direct dependency files:

- `requirements.txt`
  - only `requests`
- README install instructions
  - `pip install requests pyyaml`

Observed dependency tension:

- `wolf/config.py` imports `yaml` when available and falls back to a simple parser if PyYAML is absent
- CI installs `pyyaml`
- `requirements.txt` does not list `pyyaml`

So PyYAML is operationally expected in normal use and CI, but not formally pinned in `requirements.txt`.

| Dependency | Used By | Purpose |
|------------|---------|---------|
| `requests` | models.py, gitea.py | HTTP client for all API calls |
| `pyyaml` (optional) | config.py | YAML config parsing (falls back to line parser) |
## Security Considerations

1. Plaintext secrets in config
   - model API keys and Gitea tokens are expected via config files
   - this is user-controlled but still a secret-handling risk
2. Arbitrary base URLs
   - provider configs can point to arbitrary endpoints
   - useful for sovereignty, but also expands trust boundaries
3. PR automation blast radius
   - `AgentRunner.execute_task()` can create branches, files, and PRs
   - bad prompts or weak issue filtering could create noisy or unsafe PRs
4. Prompt-injection exposure
   - model prompts and issue bodies are passed through with limited sanitization
5. Leaderboard persistence without locking
   - `leaderboard.json` writes are not protected against concurrent writers
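The concurrent-writer risk can be reduced with a write-to-temp-then-rename pattern. This is a sketch of one mitigation, not the actual `wolf/leaderboard.py` behavior:

```python
import json
import os
import tempfile

# Hypothetical atomic write for leaderboard.json: dump to a temp file in the
# same directory, then rename over the target. Concurrent writers can still
# lose updates (last write wins), but readers never see a half-written file.
def save_leaderboard(path: str, data: dict) -> None:
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle, indent=2)
        os.replace(tmp, path)  # atomic rename within the same filesystem
    except BaseException:
        os.unlink(tmp)
        raise

demo_path = os.path.join(tempfile.mkdtemp(), "leaderboard.json")
save_leaderboard(demo_path, {"gemma4:latest": {"avg_score": 0.78}})
with open(demo_path) as handle:
    print(json.load(handle)["gemma4:latest"]["avg_score"])  # → 0.78
```

Full serialisation of concurrent writers would additionally need advisory locking (e.g. `fcntl.flock` on POSIX), which this sketch deliberately omits.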
## Repository Notes

Notable current-repo facts that the host-repo genome should preserve:

- Wolf already ships its own `GENOME.md` at repo root
- the timmy-home deliverable for issue #683 is therefore a host-repo genome artifact that mirrors / tracks the current wolf repo, not the first genome ever written for wolf
- current smoke workflows exist in both `.gitea/` and `.github/`

Additional security notes:

1. **API keys in config**: wolf-config.yaml stores provider API keys in plaintext. The file should be chmod 600 and excluded from git (already covered by the `~/.hermes/` .gitignore pattern).
2. **Gitea token**: a full-access token is used for branch creation, file commits, and PR creation. Scoped access is recommended.
3. **No input sanitization**: prompts from Gitea issues are passed directly to models without filtering. Prompt-injection risk for automated workflows.
4. **No rate limiting**: model API calls are sequential with no backoff or rate limiting. Could exhaust API quotas.
5. **Legacy code reference**: `evaluator.py` defines the `Evaluator = PREvaluator` alias, and `cli.py` imports `Evaluator` expecting the legacy class. This works but is confusing.
## File Index

Observed module sizes, with purposes carried over from the earlier index:

| File | LOC | Purpose |
|------|-----|---------|
| `wolf/evaluator.py` | 465 | Core scoring engine + legacy PR evaluator |
| `wolf/runner.py` | 311 | Prompt evaluation CLI + AgentRunner |
| `wolf/models.py` | 120 | LLM provider clients (5 providers) |
| `wolf/gitea.py` | 95 | Gitea REST API client |
| `wolf/cli.py` | 94 | Main CLI orchestrator |
| `wolf/leaderboard.py` | 77 | Persistent model ranking |
| `wolf/task.py` | 63 | Task dataclass + generator |
| `wolf/config.py` | 51 | YAML config loader |
| `wolf/__init__.py` | 12 | Package init, version |

Aggregate metrics from direct scan:

- 15 Python files total
- 9 module files under `wolf/`
- 6 Python files under `tests/` (including `__init__.py`)
- ~2150 lines of Python total
## Verification Commands

Commands used for this update:

- `git clone --depth 1 --single-branch https://.../Timmy_Foundation/wolf.git /tmp/wolf-genome`
- `python3 -m pytest -q tests/test_config.py tests/test_evaluator.py tests/test_gitea.py tests/test_models.py tests/test_runner.py`
- direct file inspection of:
  - `README.md`
  - `wolf/cli.py`
  - `wolf/config.py`
  - `wolf/evaluator.py`
  - `wolf/gitea.py`
  - `wolf/models.py`
  - `wolf/runner.py`
  - `wolf/task.py`
  - `wolf/leaderboard.py`
  - `.gitea/workflows/smoke.yml`
  - `.github/workflows/smoke-test.yml`

## Sovereignty Assessment

- **No external dependencies beyond requests**: runs on any machine with Python 3.11+ and `requests`.
- **No phone-home**: all API calls are to user-configured endpoints.
- **No telemetry**: logs go to local filesystem only.
- **Config-driven**: all secrets live in the user's `~/.hermes/` directory.
- **Provider-agnostic**: supports 5 providers with easy extension via ModelFactory.

**Verdict: fully sovereign. No corporate lock-in. User controls all endpoints and keys.**

## Summary

Wolf is real and useful today, but its current reality is:

- stronger test coverage than the older timmy-home genome recorded
- a still-untested CLI/task/leaderboard control plane
- smoke workflows that now form part of the repo's real contract
- a checked-in root `GENOME.md` that does not remove the need for the host-repo genome issue artifact

---

*"The strength of the pack is the wolf, and the strength of the wolf is the pack."*
*— The Wolf Sovereign Core has spoken.*
tests/test_issue_582_verification.py (new file, 146 lines)
@@ -0,0 +1,146 @@
"""Durable verification that the Issue #582 parent-epic orchestration slice exists on main.

These tests confirm:
1. The epic pipeline runner script is present and importable.
2. The pipeline documentation is committed.
3. All five phase scripts exist at their expected paths.
4. The pipeline plan exposes the correct five phases in order.
5. Each plan step references the correct underlying script.
6. The status snapshot reports script_exists=True for all phases.
7. The status snapshot includes expected artifact output paths.
8. The runner can produce a JSON-serialisable plan.
9. The runner can produce a JSON-serialisable status snapshot.
10. The verification document itself is present.

Refs #582. Closes #789.
"""

import importlib.util
import json
import unittest
from pathlib import Path


ROOT = Path(__file__).resolve().parent.parent
EPIC_PIPELINE = ROOT / "scripts" / "know_thy_father" / "epic_pipeline.py"
PIPELINE_DOC = ROOT / "docs" / "KNOW_THY_FATHER_MULTIMODAL_PIPELINE.md"
VERIFICATION_DOC = ROOT / "docs" / "issue-582-verification.md"

EXPECTED_PHASES = [
    "phase1_media_indexing",
    "phase2_multimodal_analysis",
    "phase3_holographic_synthesis",
    "phase4_cross_reference_audit",
    "phase5_processing_log",
]

EXPECTED_SCRIPTS = {
    "phase1_media_indexing": "scripts/know_thy_father/index_media.py",
    "phase2_multimodal_analysis": "scripts/twitter_archive/analyze_media.py",
    "phase3_holographic_synthesis": "scripts/know_thy_father/synthesize_kernels.py",
    "phase4_cross_reference_audit": "scripts/know_thy_father/crossref_audit.py",
    "phase5_processing_log": "twitter-archive/know-thy-father/tracker.py",
}

EXPECTED_OUTPUTS = {
    "phase1_media_indexing": ["twitter-archive/know-thy-father/media_manifest.jsonl"],
    "phase3_holographic_synthesis": ["twitter-archive/knowledge/fathers_ledger.jsonl"],
    "phase5_processing_log": ["twitter-archive/know-thy-father/REPORT.md"],
}


def _load_epic_module():
    spec = importlib.util.spec_from_file_location("ktf_epic_pipeline", EPIC_PIPELINE)
    assert spec and spec.loader, "Cannot load epic_pipeline module spec"
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


class TestIssue582Verification(unittest.TestCase):
    """10-test suite proving the #582 orchestration slice is on main."""

    # -- existence checks --------------------------------------------------

    def test_01_epic_pipeline_script_exists(self):
        """The orchestration runner is committed."""
        self.assertTrue(EPIC_PIPELINE.exists(), f"missing {EPIC_PIPELINE.relative_to(ROOT)}")

    def test_02_pipeline_documentation_exists(self):
        """The multimodal pipeline doc is committed."""
        self.assertTrue(PIPELINE_DOC.exists(), "missing KNOW_THY_FATHER_MULTIMODAL_PIPELINE.md")

    def test_03_all_phase_scripts_exist_on_disk(self):
        """Every script referenced by the pipeline exists in the repo."""
        for phase_id, script_rel in EXPECTED_SCRIPTS.items():
            path = ROOT / script_rel
            self.assertTrue(path.exists(), f"{phase_id}: missing {script_rel}")

    # -- plan structure ----------------------------------------------------

    def test_04_pipeline_plan_has_five_phases_in_order(self):
        mod = _load_epic_module()
        plan = mod.build_pipeline_plan(batch_size=10)
        ids = [step["id"] for step in plan]
        self.assertEqual(ids, EXPECTED_PHASES)

    def test_05_plan_commands_reference_correct_scripts(self):
        mod = _load_epic_module()
        plan = mod.build_pipeline_plan(batch_size=10)
        for step in plan:
            expected_script = EXPECTED_SCRIPTS[step["id"]]
            self.assertIn(
                expected_script,
                step["command"],
                f"{step['id']} command missing {expected_script}",
            )

    # -- status snapshot ---------------------------------------------------

    def test_06_status_snapshot_all_scripts_exist(self):
        mod = _load_epic_module()
        status = mod.build_status_snapshot(ROOT)
        for phase_id in EXPECTED_PHASES:
            self.assertIn(phase_id, status)
            self.assertTrue(
                status[phase_id]["script_exists"],
                f"{phase_id} script_exists should be True",
            )

    def test_07_status_snapshot_reports_expected_outputs(self):
        mod = _load_epic_module()
        status = mod.build_status_snapshot(ROOT)
        for phase_id, expected_paths in EXPECTED_OUTPUTS.items():
            actual_paths = [o["path"] for o in status[phase_id]["outputs"]]
            for p in expected_paths:
                self.assertIn(p, actual_paths, f"{phase_id} missing output path {p}")

    # -- JSON serialisation ------------------------------------------------

    def test_08_plan_is_json_serialisable(self):
        mod = _load_epic_module()
        plan = mod.build_pipeline_plan(batch_size=10)
        dumped = json.dumps(plan)
        restored = json.loads(dumped)
        self.assertEqual(len(restored), 5)

    def test_09_status_snapshot_is_json_serialisable(self):
        mod = _load_epic_module()
        status = mod.build_status_snapshot(ROOT)
        dumped = json.dumps(status)
        restored = json.loads(dumped)
        for phase_id in EXPECTED_PHASES:
            self.assertIn(phase_id, restored)

    # -- verification doc --------------------------------------------------

    def test_10_verification_document_exists(self):
        """This verification trail is committed."""
        self.assertTrue(
            VERIFICATION_DOC.exists(),
            "missing docs/issue-582-verification.md",
        )


if __name__ == "__main__":
    unittest.main()
@@ -1,22 +0,0 @@
from pathlib import Path

GENOME = Path("genomes/wolf/GENOME.md")


def test_wolf_genome_exists_at_expected_path():
    assert GENOME.exists(), "wolf genome must exist at genomes/wolf/GENOME.md"


def test_wolf_genome_covers_current_test_surface_and_ci_contract():
    content = GENOME.read_text(encoding="utf-8")
    required = [
        "# GENOME.md — Wolf (Timmy_Foundation/wolf)",
        "tests/test_models.py",
        "tests/test_gitea.py",
        "tests/test_runner.py",
        ".gitea/workflows/smoke.yml",
        ".github/workflows/smoke-test.yml",
        "`GENOME.md` at repo root",
    ]
    missing = [item for item in required if item not in content]
    assert not missing, f"wolf genome missing current repo facts: {missing}"