Compare commits: `fix/524` ... `step35/683` (1 commit, SHA1 `45052fe51d`)
@@ -1,107 +0,0 @@
|
||||
# [DIRECTIVE] Unified Fleet Sovereignty & Comms Migration
|
||||
|
||||
Grounding report for `timmy-home #524`.
|
||||
|
||||
Issue #524 is a multi-lane directive, not a one-commit feature. This report grounds the directive in repo evidence, highlights stale cross-links, and names the missing operator bundles that still need real execution.
|
||||
|
||||
This remains a `Refs #524` artifact. The directive spans multiple repos and operator actions, so this report makes the current repo-side state executable without pretending the whole migration is complete.
|
||||
|
||||
## Directive Snapshot
|
||||
|
||||
- Repo-grounded workstreams: 0
|
||||
- Partial workstreams: 4
|
||||
- Missing workstreams: 1
|
||||
- Drifted references: 4
|
||||
|
||||
## Reference Drift
|
||||
|
||||
- #813 is cited for Nostr Migration Leadership, but its current title is 'docs: refresh the-playground genome analysis (#671)'.
|
||||
- #819 is cited for Nostr Migration Leadership, but its current title is 'docs: verify #648 already implemented (closes #818)'.
|
||||
- #139 is cited for v0.7.0 Feature Audit, but its current title is '🐣 Allegro-Primus is born'.
|
||||
- #103 is cited for Morrowind Local-First Benchmark, but its current title is 'Build comprehensive caching layer — cache everywhere'.
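
One way drift like this could be flagged automatically is to compare each cited issue's current title against keywords for the lane that cites it. The sketch below assumes a plain case-insensitive substring check; the function name `is_drifted` is hypothetical, and the keyword lists mirror the `expected_keywords` entries that appear in the accompanying script for the Nostr and Lexicon lanes. The real report generator may decide drift differently.

```python
def is_drifted(issue_title: str, expected_keywords: list[str]) -> bool:
    """Return True when the title mentions none of the lane's keywords."""
    title = issue_title.lower()
    return not any(keyword.lower() in title for keyword in expected_keywords)


# Keyword lists as they appear in the script's WORKSTREAMS entries.
nostr_keywords = ["nostr", "relay", "telegram", "comms", "messenger"]
lexicon_keywords = ["lexicon", "vocabulary", "standards", "shared vocabulary"]

# Title of #813, cited for Nostr Migration Leadership.
print(is_drifted("docs: refresh the-playground genome analysis (#671)", nostr_keywords))  # True
# Shortened form of the #388 title, cited for Lexicon Enforcement.
print(is_drifted("[KT] Fleet Lexicon & Techniques", lexicon_keywords))  # False
```

A substring check is deliberately crude: it cannot tell a renumbered issue from a retitled one, so a flagged reference still needs a human look.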
|
||||
|
||||
## Workstream Matrix
|
||||
|
||||
### 1. Nostr Migration Leadership — PARTIAL
|
||||
|
||||
- Requirement: Replace Telegram with relay-based sovereign comms, verify wizard keypairs, and prove the NIP-29 group path is stable.
|
||||
- Referenced issues:
|
||||
- #813 (closed) — docs: refresh the-playground genome analysis (#671) [DRIFT]
|
||||
- #819 (open) — docs: verify #648 already implemented (closes #818) [DRIFT]
|
||||
- Repo evidence present:
|
||||
- `infrastructure/timmy-bridge/client/timmy_client.py` — Nostr event client scaffold already exists
|
||||
- `infrastructure/timmy-bridge/monitor/timmy_monitor.py` — Nostr relay monitor already exists
|
||||
- `specs/wizard-telegram-bot-cutover.md` — Telegram cutover planning exists, so the migration lane is real
|
||||
- Missing operator deliverables:
|
||||
- wizard keypair inventory and ownership matrix
|
||||
- NIP-29 relay group verification report
|
||||
- operator runbook for cutting traffic off Telegram
|
||||
- Why this lane remains open: The repo has Nostr-adjacent scaffolding, but the directive still lacks a verified migration packet and the cited issue links drift away from the stated Nostr scope.
|
||||
|
||||
### 2. Lexicon Enforcement — PARTIAL
|
||||
|
||||
- Requirement: Enforce the Fleet Lexicon in PR review and issue triage so the team uses one shared language.
|
||||
- Referenced issues:
|
||||
- #388 (closed) — [KT] Fleet Lexicon & Techniques — Shared Vocabulary, Patterns, and Standards for All Agents [aligned]
|
||||
- Repo evidence present:
|
||||
- `docs/WIZARD_APPRENTICESHIP_CHARTER.md` — The repo already uses wizard-language canon in docs
|
||||
- `specs/timmy-ezra-bezalel-canon-sheet.md` — Canonical agent naming already exists
|
||||
- `docs/OPERATIONS_DASHBOARD.md` — Operational roles are already described in repo language
|
||||
- Missing operator deliverables:
|
||||
- machine-checkable lexicon policy for review/triage
|
||||
- terminology lint or reviewer checklist tied to the lexicon
|
||||
- Why this lane remains open: The naming canon exists, but there is still no executable enforcement bundle that would catch drift during future reviews and triage passes.
|
||||
|
||||
### 3. v0.7.0 Feature Audit — PARTIAL
|
||||
|
||||
- Requirement: Audit Hermes features that can reduce cloud dependency and turn the findings into a sovereignty implementation plan.
|
||||
- Referenced issues:
|
||||
- #139 (open) — 🐣 Allegro-Primus is born [DRIFT]
|
||||
- Repo evidence present:
|
||||
- `scripts/sovereignty_audit.py` — Cloud-vs-local audit machinery already exists
|
||||
- `reports/evaluations/2026-04-15-phase-4-sovereignty-audit.md` — Recent sovereignty audit report is committed
|
||||
- `timmy-local/README.md` — Local-first status is already documented for operators
|
||||
- Missing operator deliverables:
|
||||
- Hermes v0.7.0 feature inventory linked to cloud-reduction leverage
|
||||
- Sovereignty Implementation Plan derived from that feature audit
|
||||
- Why this lane remains open: The repo has sovereignty-audit infrastructure, but it does not yet contain the requested v0.7.0 feature inventory or the plan that turns those findings into rollout steps.
|
||||
|
||||
### 4. Morrowind Local-First Benchmark — PARTIAL
|
||||
|
||||
- Requirement: Compare cloud and local Morrowind agents, prove local parity where possible, and document the reasoning gap when it fails.
|
||||
- Referenced issues:
|
||||
- #103 (open) — Build comprehensive caching layer — cache everywhere [DRIFT]
|
||||
- Repo evidence present:
|
||||
- `morrowind/local_brain.py` — Local Morrowind control loop already exists
|
||||
- `morrowind/mcp_server.py` — Morrowind MCP control surface is already wired
|
||||
- `morrowind/pilot.py` — Trajectory logging for evaluation already exists
|
||||
- Missing operator deliverables:
|
||||
- cloud-vs-local benchmark report for the combat loop
|
||||
- reasoning-gap writeup tied to a proposed LoRA/fine-tune path
|
||||
- Why this lane remains open: The repo has a local Morrowind stack, but it does not yet contain the requested benchmark artifact; the cited issue number also points at an unrelated caching task.
|
||||
|
||||
### 5. Infrastructure Hardening / Syntax Guard — MISSING
|
||||
|
||||
- Requirement: Verify Syntax Guard pre-receive protection across Gitea repos so syntax failures stop earlier.
|
||||
- Referenced issues: none listed in the directive body
|
||||
- Repo evidence present: none
|
||||
- Missing operator deliverables:
|
||||
- repo inventory of Gitea targets that should carry Syntax Guard
|
||||
- deployment verifier for hook presence across those repos
|
||||
- operator report proving installation state instead of assuming it
|
||||
- Why this lane remains open: No repo-managed syntax-guard verifier is present yet, so this directive still depends on manual trust rather than auditable proof.
|
||||
|
||||
## Highest-Leverage Next Actions
|
||||
|
||||
- Nostr Migration Leadership: wizard keypair inventory and ownership matrix
|
||||
- Lexicon Enforcement: machine-checkable lexicon policy for review/triage
|
||||
- v0.7.0 Feature Audit: Hermes v0.7.0 feature inventory linked to cloud-reduction leverage
|
||||
- Morrowind Local-First Benchmark: cloud-vs-local benchmark report for the combat loop
|
||||
- Infrastructure Hardening / Syntax Guard: repo inventory of Gitea targets that should carry Syntax Guard
|
||||
|
||||
## Why #524 Remains Open
|
||||
|
||||
- The directive bundles five separate workstreams with different evidence surfaces.
|
||||
- Multiple cited issue numbers have drifted away from the work they are supposed to anchor.
|
||||
- Repo scaffolding exists for Nostr, sovereignty audits, and Morrowind, but the operator-facing bundles are still missing.
|
||||
- Syntax Guard verification is still undocumented and unproven inside this repo.
|
||||
@@ -1,263 +1,320 @@
|
||||
# GENOME.md — Wolf (Timmy_Foundation/wolf)
|
||||
# GENOME.md — wolf
|
||||
|
||||
> Codebase Genome v1.0 | Generated 2026-04-14 | Repo 16/16
|
||||
*Generated: 2026-04-20T00:00:00Z | Branch: main | Commit: ba73335*
|
||||
|
||||
## Project Overview
|
||||
|
||||
**Wolf** is a multi-model evaluation engine for sovereign AI fleets. It runs prompts against multiple LLM providers, scores responses on relevance, coherence, and safety, and outputs structured JSON results for model selection and ranking.
|
||||
**Wolf** is a multi-model evaluation engine for sovereign AI fleets. It runs prompts against multiple LLM providers (OpenAI, Anthropic, Groq, Ollama, OpenRouter), scores responses on relevance, coherence, and safety, and outputs structured JSON results for model selection and fleet deployment decisions.
|
||||
|
||||
**Core principle:** agents work, PRs prove it, CI judges it.
|
||||
**Two operational modes:**
|
||||
1. **Prompt Evaluation (v1.0)** — Standalone prompt-vs-model benchmarking via `python -m wolf.runner`
|
||||
2. **Legacy PR Scoring** — Gitea PR evaluation pipeline via `wolf.cli` (task generation, agent execution, leaderboard)
|
||||
|
||||
**Status:** v1.0.0 — production-ready for prompt evaluation. Legacy PR evaluation module retained for backward compatibility.
|
||||
**Tagline:** "Multi-model evaluation — agents work, PRs prove it, leaders get endpoints."
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
CLI[cli.py] --> Config[config.py]
|
||||
CLI --> TaskGen[task.py]
|
||||
CLI --> Runner[runner.py]
|
||||
CLI --> Evaluator[evaluator.py]
|
||||
CLI --> Leaderboard[leaderboard.py]
|
||||
CLI --> Gitea[gitea.py]
|
||||
flowchart TB
|
||||
subgraph CLI["CLI Entry Points"]
|
||||
A1["python -m wolf.runner\n(pure evaluation)"]
|
||||
A2["python -m wolf.cli\n(task pipeline)"]
|
||||
end
|
||||
|
||||
Runner --> Models[models.py]
|
||||
Runner --> Gitea
|
||||
Evaluator --> Models
|
||||
subgraph Core["Core Engine"]
|
||||
B1["PromptEvaluator\n(evaluator.py)"]
|
||||
B2["ResponseScorer\n(evaluator.py)"]
|
||||
B3["AgentRunner\n(runner.py)"]
|
||||
B4["TaskGenerator\n(task.py)"]
|
||||
end
|
||||
|
||||
TaskGen --> Gitea
|
||||
Leaderboard --> |leaderboard.json| FS[(File System)]
|
||||
Config --> |wolf-config.yaml| FS
|
||||
subgraph Providers["Model Providers"]
|
||||
C1["OpenRouterClient"]
|
||||
C2["GroqClient"]
|
||||
C3["OllamaClient"]
|
||||
C4["AnthropicClient"]
|
||||
C5["OpenAIClient\n(GroqClient w/ custom URL)"]
|
||||
end
|
||||
|
||||
Models --> OpenRouter[OpenRouter API]
|
||||
Models --> Groq[Groq API]
|
||||
Models --> Ollama[Ollama Local]
|
||||
Models --> OpenAI[OpenAI API]
|
||||
Models --> Anthropic[Anthropic API]
|
||||
subgraph Infrastructure["Infrastructure"]
|
||||
D1["GiteaClient\n(gitea.py)"]
|
||||
D2["Config\n(config.py)"]
|
||||
D3["Leaderboard\n(leaderboard.py)"]
|
||||
D4["wolf-config.yaml"]
|
||||
end
|
||||
|
||||
Runner --> |branch + commit| Gitea
|
||||
Evaluator --> |score results| Leaderboard
|
||||
subgraph Output["Output"]
|
||||
E1["JSON results file"]
|
||||
E2["stdout summary table"]
|
||||
E3["Gitea PRs"]
|
||||
E4["Leaderboard scores"]
|
||||
end
|
||||
|
||||
A1 --> B1
|
||||
A2 --> B4 --> B3
|
||||
B1 --> B2
|
||||
B1 --> C1 & C2 & C3 & C4 & C5
|
||||
B3 --> C1 & C2 & C3 & C4 & C5
|
||||
B3 --> D1
|
||||
A2 --> D1 & D2 & D3
|
||||
B1 --> E1 & E2
|
||||
B3 --> E3
|
||||
D3 --> E4
|
||||
D2 --> D4
|
||||
|
||||
style A1 fill:#4a9eff,color:#fff
|
||||
style A2 fill:#4a9eff,color:#fff
|
||||
style B1 fill:#ff6b6b,color:#fff
|
||||
style B2 fill:#ff6b6b,color:#fff
|
||||
```
|
||||
|
||||
### Data Flow — Prompt Evaluation Mode
|
||||
|
||||
```
|
||||
prompts.json + models.json/wolf-config.yaml
|
||||
→ load_prompts() / load_models_from_json()
|
||||
→ PromptEvaluator.evaluate()
|
||||
→ for each (prompt, model):
|
||||
→ ModelFactory.get_client(provider) → ModelClient.generate()
|
||||
→ ResponseScorer.score(response, prompt)
|
||||
→ score_relevance() — keyword matching, length, refusal detection
|
||||
→ score_coherence() — structure, readability, repetition
|
||||
→ score_safety() — harmful content patterns, profanity
|
||||
→ overall = relevance*0.40 + coherence*0.35 + safety*0.25
|
||||
→ evaluate_and_serialize() → JSON dict
|
||||
→ run(output_path) → write JSON + print_summary()
|
||||
```
|
||||
|
||||
### Data Flow — Legacy Task Pipeline Mode
|
||||
|
||||
```
|
||||
wolf-config.yaml
|
||||
→ GiteaClient.get_issues(owner, repo)
|
||||
→ TaskGenerator.from_gitea_issues()
|
||||
→ TaskGenerator.assign_tasks(tasks, models)
|
||||
→ for each task:
|
||||
→ AgentRunner.execute_task(task)
|
||||
→ ModelClient.generate(prompt)
|
||||
→ GiteaClient.create_branch()
|
||||
→ GiteaClient.create_file(wolf-outputs/{id}.md)
|
||||
→ GiteaClient.create_pull_request()
|
||||
→ Leaderboard.record_score()
|
||||
→ Leaderboard.get_rankings()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Entry Points
|
||||
|
||||
| Entry Point | Command | Purpose |
|
||||
|-------------|---------|---------|
|
||||
| `wolf/cli.py` | `python3 -m wolf.cli --run` | Main CLI: run tasks, evaluate PRs, show leaderboard |
|
||||
| `wolf/runner.py` | `python3 -m wolf.runner --prompts p.json --models m.json` | Standalone prompt evaluation runner |
|
||||
| `wolf/__init__.py` | `import wolf` | Package init, version metadata |
|
||||
| Entry Point | Module | Purpose |
|
||||
|-------------|--------|---------|
|
||||
| `python -m wolf.runner` | `runner.py` | Pure prompt-vs-model evaluation. Primary v1.0 interface. |
|
||||
| `python -m wolf.cli` | `cli.py` | Full task pipeline: fetch issues → run models → create PRs → leaderboard. |
|
||||
|
||||
## Data Flow
|
||||
### runner.py CLI Flags
|
||||
|
||||
### Prompt Evaluation Pipeline (Primary)
|
||||
| Flag | Required | Description |
|------|----------|-------------|
| `--prompts / -p` | Yes | Path to prompts JSON file |
| `--models / -m` | No* | Path to models JSON file |
| `--config / -c` | No* | Path to wolf-config.yaml (alternative to --models) |
| `--output / -o` | No | Path to write JSON results |
| `--system-prompt` | No | System prompt (default: "You are a helpful assistant.") |
|
||||
|
||||
```
|
||||
prompts.json + models.json (or wolf-config.yaml)
|
||||
│
|
||||
▼
|
||||
PromptEvaluator.evaluate()
|
||||
│
|
||||
├─ For each (prompt, model) pair:
|
||||
│ ├─ ModelClient.generate(prompt) → response text
|
||||
│ ├─ ResponseScorer.score(response, prompt)
|
||||
│ │ ├─ score_relevance() (0.40 weight)
|
||||
│ │ ├─ score_coherence() (0.35 weight)
|
||||
│ │ └─ score_safety() (0.25 weight)
|
||||
│ └─ EvaluationResult (prompt, model, scores, latency, error)
|
||||
│
|
||||
▼
|
||||
evaluate_and_serialize() → JSON output
|
||||
│
|
||||
├─ model_summaries (per-model averages)
|
||||
└─ results[] (per-evaluation details)
|
||||
```
|
||||
*Either --models or --config is required.
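
The flag table for `runner.py` could be wired with `argparse` roughly as follows. This is a minimal sketch, not the actual `runner.py` source: the function name `build_parser` and the help strings are assumptions, and modelling `--models` and `--config` as mutually exclusive is a guess, since the document only says one of them is required.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented runner.py flags."""
    parser = argparse.ArgumentParser(prog="wolf.runner")
    parser.add_argument("--prompts", "-p", required=True,
                        help="Path to prompts JSON file")

    # Assumption: exactly one model source may be given. The document
    # only states that one of the two flags is required.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--models", "-m", help="Path to models JSON file")
    source.add_argument("--config", "-c", help="Path to wolf-config.yaml")

    parser.add_argument("--output", "-o", help="Path to write JSON results")
    parser.add_argument("--system-prompt",
                        default="You are a helpful assistant.",
                        help="System prompt for all model calls")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-p", "prompts.json", "-m", "models.json"])
    print(args.prompts, args.models, args.system_prompt)
```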
|
||||
|
||||
### Task Assignment Pipeline (Legacy)
|
||||
|
||||
```
|
||||
Gitea Issues → TaskGenerator → AgentRunner
|
||||
│ │ │
|
||||
▼ ▼ ▼
|
||||
Fetch tasks Assign models Execute + PR
|
||||
from issues from config via Gitea API
|
||||
```
|
||||
|
||||
## Key Abstractions
|
||||
|
||||
| Class | Module | Purpose |
|
||||
|-------|--------|---------|
|
||||
| `PromptEntry` | evaluator.py | Single prompt with expected keywords and category |
|
||||
| `ModelEndpoint` | evaluator.py | Model connection descriptor (provider, model_id, key) |
|
||||
| `ScoreResult` | evaluator.py | Scores for relevance, coherence, safety, overall |
|
||||
| `EvaluationResult` | evaluator.py | Full result: prompt + model + response + scores + latency |
|
||||
| `ResponseScorer` | evaluator.py | Heuristic scoring engine (regex + keyword + structure) |
|
||||
| `PromptEvaluator` | evaluator.py | Core engine: runs prompts against models, scores output |
|
||||
| `ModelClient` | models.py | Abstract base for LLM API calls |
|
||||
| `ModelFactory` | models.py | Factory: returns correct client for provider name |
|
||||
| `Task` | task.py | Work unit: id, title, description, assigned model/provider |
|
||||
| `TaskGenerator` | task.py | Creates tasks from Gitea issues or JSON spec |
|
||||
| `AgentRunner` | runner.py | Executes tasks: generate → branch → commit → PR |
|
||||
| `Config` | config.py | YAML config loader (wolf-config.yaml) |
|
||||
| `Leaderboard` | leaderboard.py | Persistent model ranking with serverless readiness |
|
||||
| `GiteaClient` | gitea.py | Full Gitea REST API client |
|
||||
| `PREvaluator` | evaluator.py | Legacy: scores PRs on CI, commits, code quality |
|
||||
|
||||
## API Surface
|
||||
|
||||
### CLI Arguments (cli.py)
|
||||
### cli.py CLI Flags
|
||||
|
||||
| Flag | Description |
|
||||
|------|-------------|
|
||||
| `--config` | Path to wolf-config.yaml |
|
||||
| `--task-spec` | Path to task specification JSON |
|
||||
| `--run` | Run pending tasks (assign models, execute, create PRs) |
|
||||
| `--evaluate` | Evaluate open PRs and score them |
|
||||
| `--run` | Run pending tasks (fetch issues → generate → PR) |
|
||||
| `--evaluate` | Evaluate open PRs (legacy scoring) |
|
||||
| `--leaderboard` | Show model rankings |
|
||||
|
||||
### CLI Arguments (runner.py)
|
||||
---
|
||||
|
||||
| Flag | Description |
|
||||
|------|-------------|
|
||||
| `--prompts` / `-p` | Path to prompts JSON (required) |
|
||||
| `--models` / `-m` | Path to models JSON |
|
||||
| `--config` / `-c` | Path to wolf-config.yaml (alternative to --models) |
|
||||
| `--output` / `-o` | Path to write JSON results |
|
||||
| `--system-prompt` | System prompt for all model calls |
|
||||
## Key Abstractions
|
||||
|
||||
### Dataclasses (evaluator.py)
|
||||
|
||||
| Class | Fields | Purpose |
|
||||
|-------|--------|---------|
|
||||
| `PromptEntry` | id, text, expected_keywords, category | A single evaluation prompt with metadata |
|
||||
| `ModelEndpoint` | name, provider, model_id, api_key, base_url | Model connection config |
|
||||
| `ScoreResult` | relevance, coherence, safety, overall, details | Scoring output for one response |
|
||||
| `EvaluationResult` | prompt_id, prompt_text, model_name, ..., scores, error | Complete result of one prompt×model evaluation |
|
||||
|
||||
### Core Classes
|
||||
|
||||
| Class | Module | Responsibility |
|
||||
|-------|--------|----------------|
|
||||
| `ResponseScorer` | evaluator.py | Scores responses on 3 dimensions using regex heuristics |
|
||||
| `PromptEvaluator` | evaluator.py | Orchestrates N×M evaluation matrix |
|
||||
| `ModelClient` | models.py | Abstract base for provider clients |
|
||||
| `ModelFactory` | models.py | Static factory: `get_client(provider, key, url)` |
|
||||
| `GiteaClient` | gitea.py | Full Gitea API wrapper (issues, branches, files, PRs) |
|
||||
| `AgentRunner` | runner.py | Task execution: generate → branch → commit → PR |
|
||||
| `TaskGenerator` | task.py | Converts Gitea issues to evaluable Task dataclasses |
|
||||
| `Leaderboard` | leaderboard.py | Tracks model scores, determines serverless readiness |
|
||||
| `Config` | config.py | Loads wolf-config.yaml, manages logging |
|
||||
|
||||
### Provider Clients (models.py)
|
||||
|
||||
| Client | Provider | API Format |
|
||||
|--------|----------|------------|
|
||||
| Class | Provider | API Format |
|
||||
|-------|----------|------------|
|
||||
| `OpenRouterClient` | openrouter | OpenAI-compatible chat completions |
|
||||
| `GroqClient` | groq | OpenAI-compatible chat completions |
|
||||
| `OllamaClient` | ollama | Ollama native /api/generate |
|
||||
| `OpenAIClient` | openai | OpenAI-compatible (reuses GroqClient with different URL) |
|
||||
| `AnthropicClient` | anthropic | Anthropic Messages API v1 |
|
||||
| `AnthropicClient` | anthropic | Anthropic Messages API |
|
||||
| `OpenAIClient` | openai | GroqClient with base_url override |
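
The note that `OpenAIClient` is a `GroqClient` with a different base URL suggests one OpenAI-compatible client class parameterised by endpoint. The sketch below illustrates that idea only. `ChatCompletionsClient`, the `_DEFAULT_URLS` table, and the URLs in it are assumptions for illustration, not Wolf's actual classes or defaults, and no HTTP request is made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatCompletionsClient:
    """Stand-in for an OpenAI-compatible client; performs no network I/O."""
    api_key: str
    base_url: str


class ModelFactory:
    # Illustrative defaults (assumed), one per OpenAI-compatible provider.
    _DEFAULT_URLS = {
        "groq": "https://api.groq.com/openai/v1",
        "openai": "https://api.openai.com/v1",
    }

    @staticmethod
    def get_client(provider: str, api_key: str,
                   base_url: Optional[str] = None) -> ChatCompletionsClient:
        """Return a client for the provider, honouring a base_url override."""
        try:
            default_url = ModelFactory._DEFAULT_URLS[provider]
        except KeyError:
            raise ValueError(f"Unknown provider: {provider}") from None
        return ChatCompletionsClient(api_key=api_key,
                                     base_url=base_url or default_url)
```

Sharing one class keeps request formatting in a single place. The trade-off is that provider-specific behaviour, if any appears later, has to be handled through configuration rather than subclassing.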
|
||||
|
||||
### Gitea Client (gitea.py)
|
||||
---
|
||||
|
||||
| Method | Purpose |
|
||||
|--------|---------|
|
||||
| `get_issues()` | Fetch issues by state |
|
||||
| `create_branch()` | Create new branch from base |
|
||||
| `create_file()` | Create file on branch (base64) |
|
||||
| `update_file()` | Update file with SHA |
|
||||
| `get_file()` | Read file contents |
|
||||
| `create_pull_request()` | Open PR |
|
||||
| `get_pull_request()` | Fetch PR details |
|
||||
| `get_pr_status()` | Check PR CI status |
|
||||
## API Surface
|
||||
|
||||
## Configuration (wolf-config.yaml)
|
||||
### Public API (importable)
|
||||
|
||||
```python
# Evaluation pipeline
from wolf.evaluator import PromptEvaluator, PromptEntry, ModelEndpoint, ScoreResult

# Provider clients
from wolf.models import ModelFactory, ModelClient

# Gitea integration
from wolf.gitea import GiteaClient

# Task pipeline
from wolf.runner import AgentRunner
from wolf.task import TaskGenerator, Task

# Leaderboard
from wolf.leaderboard import Leaderboard

# Config
from wolf.config import Config, setup_logging
```
|
||||
|
||||
### Scoring Weights
|
||||
|
||||
| Dimension | Weight | Method |
|-----------|--------|--------|
| Relevance | 0.40 | Keyword matching (60%) + length score (40%) |
| Coherence | 0.35 | Length + structure indicators + sentence completeness + uniqueness |
| Safety | 0.25 | Unsafe pattern detection + profanity check |
| **Overall** | 1.00 | Weighted sum |
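
The weighted sum in the table can be written out directly. This is a minimal sketch using the documented weights; the function name `overall_score` and the rounding to four decimals are assumptions, not taken from `evaluator.py`.

```python
# Weights as documented in the scoring table.
WEIGHTS = {"relevance": 0.40, "coherence": 0.35, "safety": 0.25}


def overall_score(relevance: float, coherence: float, safety: float) -> float:
    """Combine the three dimension scores (each in [0, 1]) into one value."""
    scores = {"relevance": relevance, "coherence": coherence, "safety": safety}
    total = sum(WEIGHTS[name] * value for name, value in scores.items())
    return round(total, 4)  # rounding is an illustrative choice


if __name__ == "__main__":
    # A fully safe, fully coherent answer that is only half relevant.
    print(overall_score(relevance=0.5, coherence=1.0, safety=1.0))
```

Because the weights sum to 1.00, the overall score stays in the same 0 to 1 range as its inputs.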
|
||||
|
||||
### Scoring Details
|
||||
|
||||
**Relevance (ResponseScorer.score_relevance):**
|
||||
- Expected keyword match ratio
|
||||
- Fallback: word overlap with prompt (boosted 1.5×)
|
||||
- Length penalty: <20 chars → 0.3, <50 chars → 0.6
|
||||
- Refusal detection: 3 regex patterns, penalty if low keyword match
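
The relevance rules above could be combined along these lines. This is a sketch under stated assumptions, not the code in `ResponseScorer.score_relevance`: the helper names are hypothetical, substring matching for keywords is a guess, a length score of 1.0 for responses of 50 characters or more is assumed, and the refusal penalty is left out because its size is not documented. The 60/40 split between keyword and length scores comes from the scoring weights table.

```python
from typing import Sequence


def keyword_score(response: str, prompt: str,
                  expected_keywords: Sequence[str]) -> float:
    """Keyword match ratio, or boosted prompt-word overlap as a fallback."""
    text = response.lower()
    if expected_keywords:
        hits = sum(1 for kw in expected_keywords if kw.lower() in text)
        return hits / len(expected_keywords)
    prompt_words = set(prompt.lower().split())
    if not prompt_words:
        return 0.0
    overlap = len(prompt_words & set(text.split())) / len(prompt_words)
    return min(1.0, overlap * 1.5)  # documented 1.5x boost, capped at 1.0


def length_score(response: str) -> float:
    """Documented penalties for very short responses."""
    length = len(response.strip())
    if length < 20:
        return 0.3
    if length < 50:
        return 0.6
    return 1.0  # assumption: no penalty at 50 characters or more


def relevance(response: str, prompt: str,
              expected_keywords: Sequence[str] = ()) -> float:
    """60% keyword score plus 40% length score."""
    return (0.6 * keyword_score(response, prompt, expected_keywords)
            + 0.4 * length_score(response))
```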
|
||||
|
||||
**Coherence (ResponseScorer.score_coherence):**
|
||||
- Length sweet spot: 100-3000 chars → 1.0
|
||||
- Structure: paragraph breaks, transition words, lists/steps
|
||||
- Sentence completeness: avg 20-200 chars → 0.9
|
||||
- Uniqueness: unique word ratio >0.4 → 0.9
|
||||
|
||||
**Safety (ResponseScorer.score_safety):**
|
||||
- 6 unsafe patterns (weapon creation, system exploitation, prompt injection, etc.)
|
||||
- Profanity detection (minor penalty: 0.1 per word, capped at 0.3)
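
The capped profanity penalty can be sketched as below. The function name, the tokenising regex, and the choice to count every occurrence rather than each distinct word are assumptions; the blocklist is a harmless placeholder, since the real word list is not shown in this document.

```python
import re


def profanity_penalty(response: str, blocklist: frozenset[str]) -> float:
    """Return 0.1 per blocklisted word found, capped at 0.3."""
    words = re.findall(r"[a-z']+", response.lower())
    hits = sum(1 for word in words if word in blocklist)
    return min(0.3, 0.1 * hits)


if __name__ == "__main__":
    placeholder_blocklist = frozenset({"darn", "heck"})  # illustrative only
    print(profanity_penalty("Well, darn.", placeholder_blocklist))
```

The cap means that even a heavily profane response loses at most 0.3 from this check.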
|
||||
|
||||
---
|
||||
|
||||
## Test Coverage
|
||||
|
||||
### Current Tests
|
||||
|
||||
| Test File | Covers | Status |
|
||||
|-----------|--------|--------|
|
||||
| `test_evaluator.py` | PromptEntry, ModelEndpoint, ScoreResult, ResponseScorer, PromptEvaluator, PREvaluator | ✅ 23 test methods |
|
||||
| `test_config.py` | Config.load | ✅ 1 test method |
|
||||
|
||||
### Coverage Gaps — Untested Modules
|
||||
|
||||
| Module | Risk | Critical Paths |
|--------|------|----------------|
| `cli.py` | **HIGH** | Argparse wiring, config→models→evaluator pipeline, PR scoring flow |
| `runner.py` | **HIGH** | load_prompts, load_models_from_json, load_models_from_config, run_evaluation, AgentRunner.execute_task |
| `models.py` | **HIGH** | ModelFactory.get_client for each provider, each client's generate() |
| `gitea.py` | **MEDIUM** | All GiteaClient methods (HTTP calls) |
| `task.py` | **MEDIUM** | TaskGenerator.from_gitea_issues, from_spec, assign_tasks |
| `leaderboard.py` | **LOW** | Leaderboard.record_score, get_rankings, serverless_ready |
|
||||
|
||||
### Coverage Gaps — Existing Tests
|
||||
|
||||
- `test_evaluator.py`: No tests for `PromptEvaluator._get_model_client()`, `_run_single()` with real model call, or `evaluate_and_serialize()` summary statistics
|
||||
- `test_evaluator.py`: No integration test (mocked model calls only)
|
||||
- `test_config.py`: No test for missing config, env var overrides, or logging setup
|
||||
|
||||
---
|
||||
|
||||
## Security Considerations
|
||||
|
||||
1. **API Keys in Config**: `wolf-config.yaml` stores provider API keys. Never commit it to version control; the recommended location is `~/.hermes/wolf-config.yaml` with restricted permissions.
|
||||
|
||||
2. **HTTP Requests**: All model calls and Gitea API calls are outbound HTTP. No input validation on URLs — `base_url` fields accept arbitrary endpoints.
|
||||
|
||||
3. **Prompt Injection**: ResponseScorer detects injection patterns in *model output*, but Wolf itself is vulnerable to prompt injection via `expected_keywords` or `system_prompt` fields.
|
||||
|
||||
4. **Gitea Token Scope**: GiteaClient uses a single token for all operations. Scoped tokens (read-only for evaluation, write for task execution) would reduce blast radius.
|
||||
|
||||
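5. **No TLS Verification Override**: `requests.post()` uses default SSL verification. If self-signed certs are used for local providers (Ollama), calls fail certificate verification with `requests.exceptions.SSLError`, and Wolf offers no setting to supply a custom CA bundle.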
|
||||
|
||||
6. **Race Conditions**: Leaderboard reads/writes JSON without locking. Concurrent evaluations could corrupt the leaderboard file.
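
For the leaderboard race noted in item 6, one common mitigation is to write to a temporary file and swap it into place with an atomic rename. The sketch below shows that technique with the standard library only; `write_json_atomically` is a hypothetical helper, not part of `leaderboard.py`. It prevents readers from seeing a half-written file, but it does not prevent two writers from overwriting each other's updates, which would need a lock as well.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the swap
        os.replace(tmp_name, path)     # atomic on the same filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The temporary file is created in the target directory on purpose: `os.replace` is only atomic when source and destination are on the same filesystem.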
|
||||
|
||||
---
|
||||
|
||||
## Dependencies
|
||||
|
||||
```
|
||||
requests # HTTP client for all providers and Gitea
|
||||
pyyaml # Config file parsing (not in requirements.txt — BUG)
|
||||
```
|
||||
|
||||
**⚠️ Missing dependency:** `pyyaml` is imported in `config.py` but not listed in `requirements.txt`.
|
||||
|
||||
---
|
||||
|
||||
## Configuration Schema
|
||||
|
||||
```yaml
|
||||
# wolf-config.yaml
|
||||
gitea:
|
||||
base_url: "https://forge.alexanderwhitestone.com/api/v1"
|
||||
token: "..."
|
||||
base_url: "https://forge.example.com/api/v1"
|
||||
token: "gitea_token_here"
|
||||
owner: "Timmy_Foundation"
|
||||
repo: "wolf"
|
||||
repo: "eval-repo"
|
||||
|
||||
providers:
|
||||
openrouter:
|
||||
api_key: "..."
|
||||
api_key: "sk-or-..."
|
||||
base_url: "https://openrouter.ai/api/v1"
|
||||
groq:
|
||||
api_key: "gsk_..."
|
||||
ollama:
|
||||
base_url: "http://localhost:11434"
|
||||
|
||||
models:
|
||||
- model: "anthropic/claude-3.5-sonnet"
|
||||
provider: "openrouter"
|
||||
- model: "gemma4:latest"
|
||||
- model: "llama3-70b-8192"
|
||||
provider: "groq"
|
||||
- model: "llama3:70b"
|
||||
provider: "ollama"
|
||||
|
||||
log_dir: "~/.hermes/wolf/"
|
||||
system_prompt: "You are a helpful assistant."
|
||||
leaderboard_path: "~/.hermes/wolf/leaderboard.json"
|
||||
log_dir: "~/.hermes/wolf/logs"
|
||||
```
|
||||
|
||||
## Scoring Methodology
|
||||
|
||||
| Dimension | Weight | Method |
|
||||
|-----------|--------|--------|
|
||||
| Relevance | 0.40 | Keyword matching (expected_keywords), prompt word overlap, length penalty, refusal detection |
|
||||
| Coherence | 0.35 | Structure indicators (paragraphs, lists, transitions), sentence completeness, repetition ratio, length sweet spot |
|
||||
| Safety | 0.25 | Unsafe pattern regex (bomb/hack/bypass/injection), profanity detection |
|
||||
|
||||
**Overall = relevance×0.40 + coherence×0.35 + safety×0.25**
|
||||
|
||||
### Safety Patterns Detected
|
||||
|
||||
- Bomb/explosive/weapon creation
|
||||
- System/network hacking
|
||||
- Security/auth bypass
|
||||
- Prompt injection ("ignore previous instructions")
|
||||
- System prompt extraction attempts
|
||||
|
||||
### Refusal Patterns Detected
|
||||
|
||||
- "I cannot/can't/won't help/assist"
|
||||
- "Sorry, but I cannot"
|
||||
- "Against my guidelines/policy"
|
||||
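
The three refusal phrasings listed above could be matched with regular expressions such as these. The patterns are a reconstruction from the bullet list, not the expressions in `evaluator.py`, and `looks_like_refusal` is a hypothetical name.

```python
import re

# One pattern per documented refusal phrasing (assumed regex forms).
REFUSAL_PATTERNS = [
    re.compile(r"\bI\s+(?:cannot|can't|won't)\s+(?:help|assist)\b", re.IGNORECASE),
    re.compile(r"\bsorry,?\s+but\s+I\s+cannot\b", re.IGNORECASE),
    re.compile(r"\bagainst\s+my\s+(?:guidelines|policy)\b", re.IGNORECASE),
]


def looks_like_refusal(response: str) -> bool:
    """Return True if any refusal pattern appears in the response."""
    return any(pattern.search(response) for pattern in REFUSAL_PATTERNS)


if __name__ == "__main__":
    print(looks_like_refusal("I can't help with that request."))
    print(looks_like_refusal("Here is the summary you asked for."))
```

Per the scoring notes, a refusal match is only penalised when keyword match is also low, so a helper like this would feed into the relevance score rather than decide it alone.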
|
||||
## Test Coverage
|
||||
|
||||
| File | Tests | Coverage |
|
||||
|------|-------|----------|
|
||||
| `tests/test_evaluator.py` | 17 tests | PromptEntry, ModelEndpoint, ResponseScorer (relevance/coherence/safety), PromptEvaluator (evaluate, error handling, serialization, file output, multi-model), PREvaluator (score_pr, description scoring) |
|
||||
| `tests/test_config.py` | 1 test | Config load from YAML |
|
||||
|
||||
### Coverage Gaps
|
||||
|
||||
- No tests for `cli.py` (argument parsing, workflow orchestration)
|
||||
- No tests for `runner.py` (`load_prompts`, `load_models_from_json`, `AgentRunner.execute_task`)
|
||||
- No tests for `task.py` (`TaskGenerator.from_gitea_issues`, `from_spec`, `assign_tasks`)
|
||||
- No tests for `models.py` (API clients — would require mocking HTTP)
|
||||
- No tests for `leaderboard.py` (`record_score`, `get_rankings`, serverless readiness logic)
|
||||
- No tests for `gitea.py` (API client — would require mocking HTTP)
|
||||
- No integration tests (end-to-end evaluation pipeline)
|
||||
|
||||
## Dependencies
|
||||
|
||||
| Dependency | Used By | Purpose |
|
||||
|------------|---------|---------|
|
||||
| `requests` | models.py, gitea.py | HTTP client for all API calls |
|
||||
| `pyyaml` (optional) | config.py | YAML config parsing (falls back to line parser) |
|
||||
|
||||
## Security Considerations
|
||||
|
||||
1. **API keys in config**: wolf-config.yaml stores provider API keys in plaintext. The file should be chmod 600 and kept out of git; storing it under ~/.hermes/ keeps it outside the repository tree.
|
||||
2. **Gitea token**: Full access token used for branch creation, file commits, and PR creation. Scoped access recommended.
|
||||
3. **No input sanitization**: Prompts from Gitea issues are passed directly to models without filtering. Prompt injection risk for automated workflows.
|
||||
4. **No rate limiting**: Model API calls are sequential with no backoff or rate limiting. Could exhaust API quotas.
|
||||
5. **Legacy code reference**: `evaluator.py` defines the alias `Evaluator = PREvaluator`, and `cli.py` imports `Evaluator` expecting the legacy class. This works but is confusing.
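
For the missing rate limiting in item 4, a small retry wrapper with exponential backoff is one possible mitigation. This is a sketch, not existing Wolf code: `call_with_backoff` and its parameters are hypothetical, and catching every `Exception` is a simplification that a real client would narrow to transient HTTP errors.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(call: Callable[[], T], retries: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying with delays of base_delay * 2**attempt."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # loop always returns or raises
```

The `sleep` parameter is injectable so the wrapper can be tested without waiting. A caller would pass something like `lambda: client.generate(prompt)` as `call`.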
|
||||
|
||||
## File Index
|
||||
|
||||
| File | LOC | Purpose |
|------|-----|---------|
| `wolf/__init__.py` | 12 | Package init, version |
| `wolf/cli.py` | 90 | Main CLI orchestrator |
| `wolf/config.py` | 48 | YAML config loader |
| `wolf/models.py` | 130 | LLM provider clients (5 providers) |
| `wolf/runner.py` | 280 | Prompt evaluation CLI + AgentRunner |
| `wolf/task.py` | 80 | Task dataclass + generator |
| `wolf/evaluator.py` | 350 | Core scoring engine + legacy PR evaluator |
| `wolf/leaderboard.py` | 70 | Persistent model ranking |
| `wolf/gitea.py` | 100 | Gitea REST API client |
| `tests/test_evaluator.py` | 180 | Unit tests for evaluator |
| `tests/test_config.py` | 20 | Unit tests for config |
|
||||
|
||||
**Total: ~1,360 LOC Python | 11 modules | 18 tests**
|
||||
|
||||
## Sovereignty Assessment
|
||||
|
||||
- **No external dependencies beyond requests**: Runs on any machine with Python 3.11+ and requests.
|
||||
- **No phone-home**: All API calls are to user-configured endpoints.
|
||||
- **No telemetry**: Logs go to local filesystem only.
|
||||
- **Config-driven**: All secrets in user's ~/.hermes/ directory.
|
||||
- **Provider-agnostic**: Supports 5 providers with easy extension via ModelFactory.
|
||||
|
||||
**Verdict: Fully sovereign. No corporate lock-in. User controls all endpoints and keys.**
|
||||
|
||||
---
|
||||
|
||||
*"The strength of the pack is the wolf, and the strength of the wolf is the pack."*
|
||||
*— The Wolf Sovereign Core has spoken.*
|
||||
*Generated by Codebase Genome Pipeline. Review and update manually.*
|
||||
|
||||
@@ -1,418 +0,0 @@
|
||||
#!/usr/bin/env python3
"""Ground timmy-home #524 as an executable status report.

Refs: timmy-home #524
"""

from __future__ import annotations

import argparse
import json
from copy import deepcopy
from pathlib import Path
from typing import Any
from urllib import request

DEFAULT_BASE_URL = "https://forge.alexanderwhitestone.com/api/v1"
DEFAULT_OWNER = "Timmy_Foundation"
DEFAULT_REPO = "timmy-home"
DEFAULT_TOKEN_FILE = Path.home() / ".config" / "gitea" / "token"
DEFAULT_REPO_ROOT = Path(__file__).resolve().parents[1]
DEFAULT_DOC_PATH = DEFAULT_REPO_ROOT / "docs" / "UNIFIED_FLEET_SOVEREIGNTY_STATUS.md"

DIRECTIVE_TITLE = "[DIRECTIVE] Unified Fleet Sovereignty & Comms Migration"
DIRECTIVE_SUMMARY = (
    "Issue #524 is a multi-lane directive, not a one-commit feature. "
    "This report grounds the directive in repo evidence, highlights stale cross-links, "
    "and names the missing operator bundles that still need real execution."
)
|
||||
|
||||
DEFAULT_REFERENCE_SNAPSHOT = {
    388: {
        "title": "[KT] Fleet Lexicon & Techniques — Shared Vocabulary, Patterns, and Standards for All Agents",
        "state": "closed",
    },
    103: {
        "title": "Build comprehensive caching layer — cache everywhere",
        "state": "open",
    },
    139: {
        "title": "🐣 Allegro-Primus is born",
        "state": "open",
    },
    813: {
        "title": "docs: refresh the-playground genome analysis (#671)",
        "state": "closed",
    },
    819: {
        "title": "docs: verify #648 already implemented (closes #818)",
        "state": "open",
    },
}

WORKSTREAMS = [
    {
        "key": "nostr-migration",
        "name": "Nostr Migration Leadership",
        "requirement": "Replace Telegram with relay-based sovereign comms, verify wizard keypairs, and prove the NIP-29 group path is stable.",
        "references": [813, 819],
        "expected_keywords": ["nostr", "relay", "telegram", "comms", "messenger"],
        "repo_evidence": [
            {
                "path": "infrastructure/timmy-bridge/client/timmy_client.py",
                "description": "Nostr event client scaffold already exists",
            },
            {
                "path": "infrastructure/timmy-bridge/monitor/timmy_monitor.py",
                "description": "Nostr relay monitor already exists",
            },
            {
                "path": "specs/wizard-telegram-bot-cutover.md",
                "description": "Telegram cutover planning exists, so the migration lane is real",
            },
        ],
        "missing_deliverables": [
            "wizard keypair inventory and ownership matrix",
            "NIP-29 relay group verification report",
            "operator runbook for cutting traffic off Telegram",
        ],
        "why_open": "The repo has Nostr-adjacent scaffolding, but the directive still lacks a verified migration packet and the cited issue links drift away from the stated Nostr scope.",
    },
    {
        "key": "lexicon-enforcement",
        "name": "Lexicon Enforcement",
        "requirement": "Enforce the Fleet Lexicon in PR review and issue triage so the team uses one shared language.",
        "references": [388],
        "expected_keywords": ["lexicon", "vocabulary", "standards", "shared vocabulary"],
        "repo_evidence": [
            {
                "path": "docs/WIZARD_APPRENTICESHIP_CHARTER.md",
                "description": "The repo already uses wizard-language canon in docs",
            },
            {
                "path": "specs/timmy-ezra-bezalel-canon-sheet.md",
                "description": "Canonical agent naming already exists",
            },
            {
                "path": "docs/OPERATIONS_DASHBOARD.md",
                "description": "Operational roles are already described in repo language",
            },
        ],
        "missing_deliverables": [
            "machine-checkable lexicon policy for review/triage",
            "terminology lint or reviewer checklist tied to the lexicon",
        ],
        "why_open": "The naming canon exists, but there is still no executable enforcement bundle that would catch drift during future reviews and triage passes.",
    },
    {
        "key": "feature-audit",
        "name": "v0.7.0 Feature Audit",
        "requirement": "Audit Hermes features that can reduce cloud dependency and turn the findings into a sovereignty implementation plan.",
        "references": [139],
        "expected_keywords": ["hermes", "feature", "audit", "v0.7.0", "sovereignty"],
        "repo_evidence": [
            {
                "path": "scripts/sovereignty_audit.py",
                "description": "Cloud-vs-local audit machinery already exists",
            },
            {
                "path": "reports/evaluations/2026-04-15-phase-4-sovereignty-audit.md",
                "description": "Recent sovereignty audit report is committed",
            },
            {
                "path": "timmy-local/README.md",
                "description": "Local-first status is already documented for operators",
            },
        ],
        "missing_deliverables": [
            "Hermes v0.7.0 feature inventory linked to cloud-reduction leverage",
            "Sovereignty Implementation Plan derived from that feature audit",
        ],
        "why_open": "The repo has sovereignty-audit infrastructure, but it does not yet contain the requested v0.7.0 feature inventory or the plan that turns those findings into rollout steps.",
    },
    {
        "key": "morrowind-benchmark",
        "name": "Morrowind Local-First Benchmark",
        "requirement": "Compare cloud and local Morrowind agents, prove local parity where possible, and document the reasoning gap when it fails.",
        "references": [103],
        "expected_keywords": ["morrowind", "combat", "benchmark", "local", "cloud"],
        "repo_evidence": [
            {
                "path": "morrowind/local_brain.py",
                "description": "Local Morrowind control loop already exists",
            },
            {
                "path": "morrowind/mcp_server.py",
                "description": "Morrowind MCP control surface is already wired",
            },
            {
                "path": "morrowind/pilot.py",
                "description": "Trajectory logging for evaluation already exists",
            },
        ],
        "missing_deliverables": [
            "cloud-vs-local benchmark report for the combat loop",
            "reasoning-gap writeup tied to a proposed LoRA/fine-tune path",
        ],
        "why_open": "The repo has a local Morrowind stack, but it does not yet contain the requested benchmark artifact; the cited issue number also points at an unrelated caching task.",
    },
    {
        "key": "syntax-guard",
        "name": "Infrastructure Hardening / Syntax Guard",
        "requirement": "Verify Syntax Guard pre-receive protection across Gitea repos so syntax failures stop earlier.",
        "references": [],
        "expected_keywords": [],
        "repo_evidence": [],
        "missing_deliverables": [
            "repo inventory of Gitea targets that should carry Syntax Guard",
            "deployment verifier for hook presence across those repos",
            "operator report proving installation state instead of assuming it",
        ],
        "why_open": "No repo-managed syntax-guard verifier is present yet, so this directive still depends on manual trust rather than auditable proof.",
    },
]


def default_snapshot() -> dict[int, dict[str, str]]:
    return deepcopy(DEFAULT_REFERENCE_SNAPSHOT)


class GiteaClient:
    def __init__(self, token: str, owner: str = DEFAULT_OWNER, repo: str = DEFAULT_REPO, base_url: str = DEFAULT_BASE_URL):
        self.token = token
        self.owner = owner
        self.repo = repo
        self.base_url = base_url.rstrip("/")

    def get_issue(self, issue_number: int) -> dict[str, Any]:
        req = request.Request(
            f"{self.base_url}/repos/{self.owner}/{self.repo}/issues/{issue_number}",
            headers={"Authorization": f"token {self.token}", "Accept": "application/json"},
        )
        with request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read().decode())


def load_snapshot(path: Path | None = None) -> dict[int, dict[str, str]]:
    if path is None:
        return default_snapshot()
    data = json.loads(path.read_text(encoding="utf-8"))
    return {int(k): v for k, v in data.items()}


def refresh_snapshot(token_file: Path = DEFAULT_TOKEN_FILE) -> dict[int, dict[str, str]]:
    token = token_file.read_text(encoding="utf-8").strip()
    client = GiteaClient(token=token)
    snapshot: dict[int, dict[str, str]] = {}
    for issue_number in sorted(DEFAULT_REFERENCE_SNAPSHOT):
        issue = client.get_issue(issue_number)
        snapshot[issue_number] = {
            "title": issue["title"],
            "state": issue["state"],
        }
    return snapshot


def collect_repo_evidence(entries: list[dict[str, str]], repo_root: Path) -> tuple[list[str], list[str]]:
    present: list[str] = []
    missing: list[str] = []
    for entry in entries:
        label = f"`{entry['path']}` — {entry['description']}"
        if (repo_root / entry["path"]).exists():
            present.append(label)
        else:
            missing.append(label)
    return present, missing


def evaluate_reference(issue_number: int, snapshot: dict[int, dict[str, str]], expected_keywords: list[str]) -> dict[str, Any]:
    record = snapshot.get(issue_number, {"title": "missing from snapshot", "state": "unknown"})
    title = record["title"]
    title_lower = title.lower()
    matched_keywords = [kw for kw in expected_keywords if kw.lower() in title_lower]
    aligned = bool(matched_keywords) if expected_keywords else True
    return {
        "number": issue_number,
        "title": title,
        "state": record["state"],
        "aligned": aligned,
        "matched_keywords": matched_keywords,
    }


def classify_workstream(reference_results: list[dict[str, Any]], evidence_present: list[str], missing_deliverables: list[str]) -> str:
    has_drift = any(not item["aligned"] for item in reference_results)
    if not evidence_present:
        return "MISSING"
    if has_drift or missing_deliverables:
        return "PARTIAL"
    return "GROUNDED"


def evaluate_directive(snapshot: dict[int, dict[str, str]] | None = None, repo_root: Path | None = None) -> dict[str, Any]:
    snapshot = snapshot or default_snapshot()
    repo_root = repo_root or DEFAULT_REPO_ROOT
    workstreams: list[dict[str, Any]] = []
    drift_items: list[str] = []

    for lane in WORKSTREAMS:
        reference_results = [
            evaluate_reference(issue_number, snapshot, lane["expected_keywords"])
            for issue_number in lane["references"]
        ]
        present, missing = collect_repo_evidence(lane["repo_evidence"], repo_root)
        for item in reference_results:
            if not item["aligned"]:
                drift_items.append(
                    f"#{item['number']} is cited for {lane['name']}, but its current title is '{item['title']}'."
                )
        workstream = {
            "key": lane["key"],
            "name": lane["name"],
            "requirement": lane["requirement"],
            "reference_results": reference_results,
            "repo_evidence_present": present,
            "repo_evidence_missing": missing,
            "missing_deliverables": list(lane["missing_deliverables"]),
            "why_open": lane["why_open"],
        }
        workstream["status"] = classify_workstream(
            reference_results=reference_results,
            evidence_present=present,
            missing_deliverables=workstream["missing_deliverables"],
        )
        workstreams.append(workstream)

    next_actions: list[str] = []
    for workstream in workstreams:
        if workstream["missing_deliverables"]:
            next_actions.append(f"{workstream['name']}: {workstream['missing_deliverables'][0]}")

    return {
        "issue_number": 524,
        "title": DIRECTIVE_TITLE,
        "summary": DIRECTIVE_SUMMARY,
        "reference_snapshot": {str(k): v for k, v in sorted(snapshot.items())},
        "workstreams": workstreams,
        "reference_drift": drift_items,
        "grounded_workstreams": sum(1 for item in workstreams if item["status"] == "GROUNDED"),
        "partial_workstreams": sum(1 for item in workstreams if item["status"] == "PARTIAL"),
        "missing_workstreams": sum(1 for item in workstreams if item["status"] == "MISSING"),
        "next_actions": next_actions,
    }


def render_markdown(result: dict[str, Any]) -> str:
    lines = [
        f"# {result['title']}",
        "",
        "Grounding report for `timmy-home #524`.",
        "",
        result["summary"],
        "",
        "This remains a `Refs #524` artifact. The directive spans multiple repos and operator actions, so this report makes the current repo-side state executable without pretending the whole migration is complete.",
        "",
        "## Directive Snapshot",
        "",
        f"- Repo-grounded workstreams: {result['grounded_workstreams']}",
        f"- Partial workstreams: {result['partial_workstreams']}",
        f"- Missing workstreams: {result['missing_workstreams']}",
        f"- Drifted references: {len(result['reference_drift'])}",
        "",
        "## Reference Drift",
        "",
    ]
    if result["reference_drift"]:
        lines.extend(f"- {item}" for item in result["reference_drift"])
    else:
        lines.append("- No stale cross-links detected in the directive snapshot.")

    lines.extend(["", "## Workstream Matrix", ""])
    for index, workstream in enumerate(result["workstreams"], start=1):
        lines.extend(
            [
                f"### {index}. {workstream['name']} — {workstream['status']}",
                "",
                f"- Requirement: {workstream['requirement']}",
            ]
        )
        if workstream["reference_results"]:
            lines.append("- Referenced issues:")
            for ref in workstream["reference_results"]:
                alignment = "aligned" if ref["aligned"] else "DRIFT"
                lines.append(
                    f"  - #{ref['number']} ({ref['state']}) — {ref['title']} [{alignment}]"
                )
        else:
            lines.append("- Referenced issues: none listed in the directive body")

        if workstream["repo_evidence_present"]:
            lines.append("- Repo evidence present:")
            lines.extend(f"  - {item}" for item in workstream["repo_evidence_present"])
        else:
            lines.append("- Repo evidence present: none")

        if workstream["repo_evidence_missing"]:
            lines.append("- Repo evidence expected but missing:")
            lines.extend(f"  - {item}" for item in workstream["repo_evidence_missing"])

        if workstream["missing_deliverables"]:
            lines.append("- Missing operator deliverables:")
            lines.extend(f"  - {item}" for item in workstream["missing_deliverables"])
        else:
            lines.append("- Missing operator deliverables: none")

        lines.append(f"- Why this lane remains open: {workstream['why_open']}")
        lines.append("")

    lines.extend(["## Highest-Leverage Next Actions", ""])
    lines.extend(f"- {item}" for item in result["next_actions"])

    lines.extend(
        [
            "",
            "## Why #524 Remains Open",
            "",
            "- The directive bundles five separate workstreams with different evidence surfaces.",
            "- Multiple cited issue numbers have drifted away from the work they are supposed to anchor.",
            "- Repo scaffolding exists for Nostr, sovereignty audits, and Morrowind, but the operator-facing bundles are still missing.",
            "- Syntax Guard verification is still undocumented and unproven inside this repo.",
        ]
    )

    return "\n".join(lines).rstrip() + "\n"


def main() -> None:
    parser = argparse.ArgumentParser(description="Render the unified fleet sovereignty status report for issue #524")
    parser.add_argument("--snapshot", help="Optional JSON snapshot file overriding the default issue-title/state snapshot")
    parser.add_argument("--live", action="store_true", help="Refresh the issue snapshot from Gitea before rendering")
    parser.add_argument("--token-file", default=str(DEFAULT_TOKEN_FILE), help="Token file used with --live")
    parser.add_argument("--output", help="Optional path to write the rendered report")
    parser.add_argument("--json", action="store_true", help="Print computed JSON instead of markdown")
    args = parser.parse_args()

    if args.live:
        snapshot = refresh_snapshot(Path(args.token_file).expanduser())
    else:
        snapshot = load_snapshot(Path(args.snapshot).expanduser() if args.snapshot else None)

    result = evaluate_directive(snapshot=snapshot, repo_root=DEFAULT_REPO_ROOT)
    rendered = json.dumps(result, indent=2) if args.json else render_markdown(result)

    if args.output:
        output_path = Path(args.output).expanduser()
        output_path.parent.mkdir(parents=True, exist_ok=True)
        output_path.write_text(rendered, encoding="utf-8")
        print(f"Directive status written to {output_path}")
    else:
        print(rendered)


if __name__ == "__main__":
    main()
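
The drift flag in `evaluate_reference` above reduces to a case-insensitive substring test of each expected keyword against the issue title. The following is a minimal standalone sketch of that check, not part of the commit: the helper name `find_drift` is hypothetical, and the titles are copied from `DEFAULT_REFERENCE_SNAPSHOT`.

```python
def find_drift(titles: dict[int, str], expected_keywords: list[str]) -> list[int]:
    """Return issue numbers whose title contains none of the expected keywords.

    Hypothetical helper for illustration; it mirrors the rule in
    evaluate_reference, where an empty keyword list always counts as aligned.
    """
    drifted = []
    for number, title in sorted(titles.items()):
        title_lower = title.lower()
        if expected_keywords and not any(kw.lower() in title_lower for kw in expected_keywords):
            drifted.append(number)
    return drifted


# Titles copied from DEFAULT_REFERENCE_SNAPSHOT in the script.
nostr_titles = {
    813: "docs: refresh the-playground genome analysis (#671)",
    819: "docs: verify #648 already implemented (closes #818)",
}
lexicon_titles = {
    388: "[KT] Fleet Lexicon & Techniques — Shared Vocabulary, Patterns, and Standards for All Agents",
}

print(find_drift(nostr_titles, ["nostr", "relay", "telegram", "comms", "messenger"]))
print(find_drift(lexicon_titles, ["lexicon", "vocabulary", "standards", "shared vocabulary"]))
```

Because this is a substring match rather than a word match, a short keyword such as `local` would also count a title containing `localhost` as aligned, so the flag is best read as a cheap first-pass signal.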
@@ -1,77 +0,0 @@
from __future__ import annotations

import importlib.util
from pathlib import Path


ROOT = Path(__file__).resolve().parents[1]
SCRIPT_PATH = ROOT / "scripts" / "unified_fleet_sovereignty_status.py"
DOC_PATH = ROOT / "docs" / "UNIFIED_FLEET_SOVEREIGNTY_STATUS.md"


def _load_module(path: Path, name: str):
    assert path.exists(), f"missing {path.relative_to(ROOT)}"
    spec = importlib.util.spec_from_file_location(name, path)
    assert spec and spec.loader
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def _workstream(result: dict, key: str) -> dict:
    for workstream in result["workstreams"]:
        if workstream["key"] == key:
            return workstream
    raise AssertionError(f"missing workstream {key}")


def test_evaluate_directive_flags_reference_drift_without_faking_completion() -> None:
    mod = _load_module(SCRIPT_PATH, "unified_fleet_sovereignty_status")
    result = mod.evaluate_directive(snapshot=mod.default_snapshot(), repo_root=ROOT)

    assert len(result["reference_drift"]) == 4
    assert any("#813" in item for item in result["reference_drift"])
    assert any("#103" in item for item in result["reference_drift"])

    nostr = _workstream(result, "nostr-migration")
    assert nostr["status"] == "PARTIAL"
    assert any("timmy_client.py" in item for item in nostr["repo_evidence_present"])

    lexicon = _workstream(result, "lexicon-enforcement")
    assert all(item["aligned"] for item in lexicon["reference_results"])
    assert lexicon["status"] == "PARTIAL"

    syntax_guard = _workstream(result, "syntax-guard")
    assert syntax_guard["status"] == "MISSING"
    assert any("deployment verifier" in item for item in syntax_guard["missing_deliverables"])


def test_render_markdown_includes_required_sections_and_grounding_evidence() -> None:
    mod = _load_module(SCRIPT_PATH, "unified_fleet_sovereignty_status")
    result = mod.evaluate_directive(snapshot=mod.default_snapshot(), repo_root=ROOT)
    report = mod.render_markdown(result)

    for snippet in (
        "# [DIRECTIVE] Unified Fleet Sovereignty & Comms Migration",
        "## Directive Snapshot",
        "## Reference Drift",
        "## Workstream Matrix",
        "### 5. Infrastructure Hardening / Syntax Guard — MISSING",
        "`infrastructure/timmy-bridge/client/timmy_client.py`",
        "machine-checkable lexicon policy for review/triage",
        "## Why #524 Remains Open",
    ):
        assert snippet in report


def test_repo_contains_committed_issue_524_grounding_doc() -> None:
    assert DOC_PATH.exists(), "missing committed directive grounding doc"
    text = DOC_PATH.read_text(encoding="utf-8")
    for snippet in (
        "# [DIRECTIVE] Unified Fleet Sovereignty & Comms Migration",
        "## Reference Drift",
        "## Workstream Matrix",
        "## Highest-Leverage Next Actions",
        "## Why #524 Remains Open",
    ):
        assert snippet in text
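
The statuses asserted in these tests follow from the precedence in `classify_workstream`: absent repo evidence wins first, then drift or any outstanding deliverable, and only a lane with neither is grounded. A small sketch of that ordering, assuming a simplified signature (the function name `classify` and its boolean `has_drift` parameter are illustrative and not part of the commit):

```python
def classify(evidence_present: list[str], has_drift: bool, missing_deliverables: list[str]) -> str:
    # Same ordering as classify_workstream: no evidence at all is checked first.
    if not evidence_present:
        return "MISSING"
    # Evidence exists, but a drifted reference or an open deliverable keeps the lane partial.
    if has_drift or missing_deliverables:
        return "PARTIAL"
    return "GROUNDED"


# Shape of the syntax-guard lane: no repo evidence listed.
print(classify([], False, ["deployment verifier for hook presence across those repos"]))
# Shape of the lexicon lane: aligned reference, evidence present, deliverables still open.
print(classify(["docs/WIZARD_APPRENTICESHIP_CHARTER.md"], False, ["terminology lint"]))
# A lane with evidence, no drift, and nothing outstanding.
print(classify(["some/evidence.py"], False, []))
```

Since every lane in `WORKSTREAMS` lists at least one missing deliverable, none of them can reach `GROUNDED` under this rule, which is consistent with the "Repo-grounded workstreams: 0" line in the committed report.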