forked from Rockachopa/Timmy-time-dashboard
Compare commits
18 Commits
claude/iss
...
claude/iss
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
f8934b63f6 | ||
| a7ccfbddc9 | |||
| f1f67e62a7 | |||
| 00ef4fbd22 | |||
| fc0a94202f | |||
| bd3e207c0d | |||
| cc8ed5b57d | |||
| 823216db60 | |||
| 75ecfaba64 | |||
| 55beaf241f | |||
| 69498c9add | |||
| 6c76bf2f66 | |||
| 0436dfd4c4 | |||
| 9eeb49a6f1 | |||
| 2d6bfe6ba1 | |||
| ebb2cad552 | |||
| 003e3883fb | |||
| 7dfbf05867 |
@@ -27,8 +27,12 @@
|
||||
|
||||
# ── AirLLM / big-brain backend ───────────────────────────────────────────────
|
||||
# Inference backend: "ollama" (default) | "airllm" | "auto"
|
||||
# "auto" → uses AirLLM on Apple Silicon if installed, otherwise Ollama.
|
||||
# Requires: pip install ".[bigbrain]"
|
||||
# "ollama" → always use Ollama (safe everywhere, any OS)
|
||||
# "airllm" → AirLLM layer-by-layer loading (Apple Silicon M1/M2/M3/M4 only)
|
||||
# Requires 16 GB RAM minimum (32 GB recommended).
|
||||
# Automatically falls back to Ollama on Intel Mac or Linux.
|
||||
# Install extra: pip install "airllm[mlx]"
|
||||
# "auto" → use AirLLM on Apple Silicon if installed, otherwise Ollama
|
||||
# TIMMY_MODEL_BACKEND=ollama
|
||||
|
||||
# AirLLM model size (default: 70b).
|
||||
|
||||
@@ -62,6 +62,9 @@ Per AGENTS.md roster:
|
||||
- Run `tox -e pre-push` (lint + full CI suite)
|
||||
- Ensure tests stay green
|
||||
- Update TODO.md
|
||||
- **CRITICAL: Stage files before committing** — always run `git add .` or `git add <files>` first
|
||||
- Verify staged changes are non-empty: `git diff --cached --stat` must show files
|
||||
- **NEVER run `git commit` without staging files first** — empty commits waste review cycles
|
||||
|
||||
---
|
||||
|
||||
|
||||
42
AGENTS.md
42
AGENTS.md
@@ -247,6 +247,48 @@ make docker-agent # add a worker
|
||||
|
||||
---
|
||||
|
||||
## Search Capability (SearXNG + Crawl4AI)
|
||||
|
||||
Timmy has a self-hosted search backend requiring **no paid API key**.
|
||||
|
||||
### Tools
|
||||
|
||||
| Tool | Module | Description |
|
||||
|------|--------|-------------|
|
||||
| `web_search(query)` | `timmy/tools/search.py` | Meta-search via SearXNG — returns ranked results |
|
||||
| `scrape_url(url)` | `timmy/tools/search.py` | Full-page scrape via Crawl4AI → clean markdown |
|
||||
|
||||
Both tools are registered in the **orchestrator** (full) and **echo** (research) toolkits.
|
||||
|
||||
### Configuration
|
||||
|
||||
| Env Var | Default | Description |
|
||||
|---------|---------|-------------|
|
||||
| `TIMMY_SEARCH_BACKEND` | `searxng` | `searxng` or `none` (disable) |
|
||||
| `TIMMY_SEARCH_URL` | `http://localhost:8888` | SearXNG base URL |
|
||||
| `TIMMY_CRAWL_URL` | `http://localhost:11235` | Crawl4AI base URL |
|
||||
|
||||
Inside Docker Compose (when `--profile search` is active), the dashboard
|
||||
uses `http://searxng:8080` and `http://crawl4ai:11235` by default.
|
||||
|
||||
### Starting the services
|
||||
|
||||
```bash
|
||||
# Start SearXNG + Crawl4AI alongside the dashboard:
|
||||
docker compose --profile search up
|
||||
|
||||
# Or start only the search services:
|
||||
docker compose --profile search up searxng crawl4ai
|
||||
```
|
||||
|
||||
### Graceful degradation
|
||||
|
||||
- If `TIMMY_SEARCH_BACKEND=none`: tools return a "disabled" message.
|
||||
- If SearXNG or Crawl4AI is unreachable: tools log a WARNING and return an
|
||||
error string — the app never crashes.
|
||||
|
||||
---
|
||||
|
||||
## Roadmap
|
||||
|
||||
**v2.0 Exodus (in progress):** Voice + Marketplace + Integrations
|
||||
|
||||
15
README.md
15
README.md
@@ -9,6 +9,21 @@ API access with Bitcoin Lightning — all from a browser, no cloud AI required.
|
||||
|
||||
---
|
||||
|
||||
## System Requirements
|
||||
|
||||
| Path | Hardware | RAM | Disk |
|
||||
|------|----------|-----|------|
|
||||
| **Ollama** (default) | Any OS — x86-64 or ARM | 8 GB min | 5–10 GB (model files) |
|
||||
| **AirLLM** (Apple Silicon) | M1, M2, M3, or M4 Mac | 16 GB min (32 GB recommended) | ~15 GB free |
|
||||
|
||||
**Ollama path** runs on any modern machine — macOS, Linux, or Windows. No GPU required.
|
||||
|
||||
**AirLLM path** uses layer-by-layer loading for 70B+ models without a GPU. Requires Apple
|
||||
Silicon and the `bigbrain` extras (`pip install ".[bigbrain]"`). On Intel Mac or Linux the
|
||||
app automatically falls back to Ollama — no crash, no config change needed.
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
|
||||
122
SOVEREIGNTY.md
Normal file
122
SOVEREIGNTY.md
Normal file
@@ -0,0 +1,122 @@
|
||||
# SOVEREIGNTY.md — Research Sovereignty Manifest
|
||||
|
||||
> "If this spec is implemented correctly, it is the last research document
|
||||
> Alexander should need to request from a corporate AI."
|
||||
> — Issue #972, March 22 2026
|
||||
|
||||
---
|
||||
|
||||
## What This Is
|
||||
|
||||
A machine-readable declaration of Timmy's research independence:
|
||||
where we are, where we're going, and how to measure progress.
|
||||
|
||||
---
|
||||
|
||||
## The Problem We're Solving
|
||||
|
||||
On March 22, 2026, a single Claude session produced six deep research reports.
|
||||
It consumed ~3 hours of human time and substantial corporate AI inference.
|
||||
Every report was valuable — but the workflow was **linear**.
|
||||
It would cost exactly the same to reproduce tomorrow.
|
||||
|
||||
This file tracks the pipeline that crystallizes that workflow into something
|
||||
Timmy can run autonomously.
|
||||
|
||||
---
|
||||
|
||||
## The Six-Step Pipeline
|
||||
|
||||
| Step | What Happens | Status |
|
||||
|------|-------------|--------|
|
||||
| 1. Scope | Human describes knowledge gap → Gitea issue with template | ✅ Done (`skills/research/`) |
|
||||
| 2. Query | LLM slot-fills template → 5–15 targeted queries | ✅ Done (`research.py`) |
|
||||
| 3. Search | Execute queries → top result URLs | ✅ Done (`research_tools.py`) |
|
||||
| 4. Fetch | Download + extract full pages (trafilatura) | ✅ Done (`tools/system_tools.py`) |
|
||||
| 5. Synthesize | Compress findings → structured report | ✅ Done (`research.py` cascade) |
|
||||
| 6. Deliver | Store to semantic memory + optional disk persist | ✅ Done (`research.py`) |
|
||||
|
||||
---
|
||||
|
||||
## Cascade Tiers (Synthesis Quality vs. Cost)
|
||||
|
||||
| Tier | Model | Cost | Quality | Status |
|
||||
|------|-------|------|---------|--------|
|
||||
| **4** | SQLite semantic cache | $0.00 / instant | reuses prior | ✅ Active |
|
||||
| **3** | Ollama `qwen3:14b` | $0.00 / local | ★★★ | ✅ Active |
|
||||
| **2** | Claude API (haiku) | ~$0.01/report | ★★★★ | ✅ Active (opt-in) |
|
||||
| **1** | Groq `llama-3.3-70b` | $0.00 / rate-limited | ★★★★ | 🔲 Planned (#980) |
|
||||
|
||||
Set `ANTHROPIC_API_KEY` to enable Tier 2 fallback.
|
||||
|
||||
---
|
||||
|
||||
## Research Templates
|
||||
|
||||
Six prompt templates live in `skills/research/`:
|
||||
|
||||
| Template | Use Case |
|
||||
|----------|----------|
|
||||
| `tool_evaluation.md` | Find all shipping tools for `{domain}` |
|
||||
| `architecture_spike.md` | How to connect `{system_a}` to `{system_b}` |
|
||||
| `game_analysis.md` | Evaluate `{game}` for AI agent play |
|
||||
| `integration_guide.md` | Wire `{tool}` into `{stack}` with code |
|
||||
| `state_of_art.md` | What exists in `{field}` as of `{date}` |
|
||||
| `competitive_scan.md` | How does `{project}` compare to `{alternatives}` |
|
||||
|
||||
---
|
||||
|
||||
## Sovereignty Metrics
|
||||
|
||||
| Metric | Target (Week 1) | Target (Month 1) | Target (Month 3) | Graduation |
|
||||
|--------|-----------------|------------------|------------------|------------|
|
||||
| Queries answered locally | 10% | 40% | 80% | >90% |
|
||||
| API cost per report | <$1.50 | <$0.50 | <$0.10 | <$0.01 |
|
||||
| Time from question to report | <3 hours | <30 min | <5 min | <1 min |
|
||||
| Human involvement | 100% (review) | Review only | Approve only | None |
|
||||
|
||||
---
|
||||
|
||||
## How to Use the Pipeline
|
||||
|
||||
```python
|
||||
from timmy.research import run_research
|
||||
|
||||
# Quick research (no template)
|
||||
result = await run_research("best local embedding models for 36GB RAM")
|
||||
|
||||
# With a template and slot values
|
||||
result = await run_research(
|
||||
topic="PDF text extraction libraries for Python",
|
||||
template="tool_evaluation",
|
||||
slots={"domain": "PDF parsing", "use_case": "RAG pipeline", "focus_criteria": "accuracy"},
|
||||
save_to_disk=True,
|
||||
)
|
||||
|
||||
print(result.report)
|
||||
print(f"Backend: {result.synthesis_backend}, Cached: {result.cached}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Implementation Status
|
||||
|
||||
| Component | Issue | Status |
|
||||
|-----------|-------|--------|
|
||||
| `web_fetch` tool (trafilatura) | #973 | ✅ Done |
|
||||
| Research template library (6 templates) | #974 | ✅ Done |
|
||||
| `ResearchOrchestrator` (`research.py`) | #975 | ✅ Done |
|
||||
| Semantic index for outputs | #976 | 🔲 Planned |
|
||||
| Auto-create Gitea issues from findings | #977 | 🔲 Planned |
|
||||
| Paperclip task runner integration | #978 | 🔲 Planned |
|
||||
| Kimi delegation via labels | #979 | 🔲 Planned |
|
||||
| Groq free-tier cascade tier | #980 | 🔲 Planned |
|
||||
| Sovereignty metrics dashboard | #981 | 🔲 Planned |
|
||||
|
||||
---
|
||||
|
||||
## Governing Spec
|
||||
|
||||
See [issue #972](http://143.198.27.163:3000/Rockachopa/Timmy-time-dashboard/issues/972) for the full spec and rationale.
|
||||
|
||||
Research artifacts committed to `docs/research/`.
|
||||
@@ -42,6 +42,10 @@ services:
|
||||
GROK_ENABLED: "${GROK_ENABLED:-false}"
|
||||
XAI_API_KEY: "${XAI_API_KEY:-}"
|
||||
GROK_DEFAULT_MODEL: "${GROK_DEFAULT_MODEL:-grok-3-fast}"
|
||||
# Search backend (SearXNG + Crawl4AI) — set TIMMY_SEARCH_BACKEND=none to disable
|
||||
TIMMY_SEARCH_BACKEND: "${TIMMY_SEARCH_BACKEND:-searxng}"
|
||||
TIMMY_SEARCH_URL: "${TIMMY_SEARCH_URL:-http://searxng:8080}"
|
||||
TIMMY_CRAWL_URL: "${TIMMY_CRAWL_URL:-http://crawl4ai:11235}"
|
||||
extra_hosts:
|
||||
- "host.docker.internal:host-gateway" # Linux: maps to host IP
|
||||
networks:
|
||||
@@ -74,6 +78,50 @@ services:
|
||||
profiles:
|
||||
- celery
|
||||
|
||||
# ── SearXNG — self-hosted meta-search engine ─────────────────────────
|
||||
searxng:
|
||||
image: searxng/searxng:latest
|
||||
container_name: timmy-searxng
|
||||
profiles:
|
||||
- search
|
||||
ports:
|
||||
- "${SEARXNG_PORT:-8888}:8080"
|
||||
environment:
|
||||
SEARXNG_BASE_URL: "${SEARXNG_BASE_URL:-http://localhost:8888}"
|
||||
volumes:
|
||||
- ./docker/searxng:/etc/searxng:rw
|
||||
networks:
|
||||
- timmy-net
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "wget", "-qO-", "http://localhost:8080/healthz"]
|
||||
interval: 30s
|
||||
timeout: 5s
|
||||
retries: 3
|
||||
start_period: 20s
|
||||
|
||||
# ── Crawl4AI — self-hosted web scraper ────────────────────────────────
|
||||
crawl4ai:
|
||||
image: unclecode/crawl4ai:latest
|
||||
container_name: timmy-crawl4ai
|
||||
profiles:
|
||||
- search
|
||||
ports:
|
||||
- "${CRAWL4AI_PORT:-11235}:11235"
|
||||
environment:
|
||||
CRAWL4AI_API_TOKEN: "${CRAWL4AI_API_TOKEN:-}"
|
||||
volumes:
|
||||
- timmy-data:/app/data
|
||||
networks:
|
||||
- timmy-net
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 30s
|
||||
|
||||
# ── OpenFang — vendored agent runtime sidecar ────────────────────────────
|
||||
openfang:
|
||||
build:
|
||||
|
||||
67
docker/searxng/settings.yml
Normal file
67
docker/searxng/settings.yml
Normal file
@@ -0,0 +1,67 @@
|
||||
# SearXNG configuration for Timmy Time self-hosted search
|
||||
# https://docs.searxng.org/admin/settings/settings.html
|
||||
|
||||
general:
|
||||
debug: false
|
||||
instance_name: "Timmy Search"
|
||||
privacypolicy_url: false
|
||||
donation_url: false
|
||||
contact_url: false
|
||||
enable_metrics: false
|
||||
|
||||
server:
|
||||
port: 8080
|
||||
bind_address: "0.0.0.0"
|
||||
secret_key: "timmy-searxng-key-change-in-production"
|
||||
base_url: false
|
||||
image_proxy: false
|
||||
|
||||
ui:
|
||||
static_use_hash: false
|
||||
default_locale: ""
|
||||
query_in_title: false
|
||||
infinite_scroll: false
|
||||
default_theme: simple
|
||||
center_alignment: false
|
||||
|
||||
search:
|
||||
safe_search: 0
|
||||
autocomplete: ""
|
||||
default_lang: "en"
|
||||
formats:
|
||||
- html
|
||||
- json
|
||||
|
||||
outgoing:
|
||||
request_timeout: 6.0
|
||||
max_request_timeout: 10.0
|
||||
useragent_suffix: "TimmyResearchBot"
|
||||
pool_connections: 100
|
||||
pool_maxsize: 20
|
||||
|
||||
enabled_plugins:
|
||||
- Hash_plugin
|
||||
- Search_on_category_select
|
||||
- Tracker_url_remover
|
||||
|
||||
engines:
|
||||
- name: google
|
||||
engine: google
|
||||
shortcut: g
|
||||
categories: general
|
||||
|
||||
- name: bing
|
||||
engine: bing
|
||||
shortcut: b
|
||||
categories: general
|
||||
|
||||
- name: duckduckgo
|
||||
engine: duckduckgo
|
||||
shortcut: d
|
||||
categories: general
|
||||
|
||||
- name: wikipedia
|
||||
engine: wikipedia
|
||||
shortcut: wp
|
||||
categories: general
|
||||
timeout: 3.0
|
||||
89
docs/SCREENSHOT_TRIAGE_2026-03-24.md
Normal file
89
docs/SCREENSHOT_TRIAGE_2026-03-24.md
Normal file
@@ -0,0 +1,89 @@
|
||||
# Screenshot Dump Triage — Visual Inspiration & Research Leads
|
||||
|
||||
**Date:** March 24, 2026
|
||||
**Source:** Issue #1275 — "Screenshot dump for triage #1"
|
||||
**Analyst:** Claude (Sonnet 4.6)
|
||||
|
||||
---
|
||||
|
||||
## Screenshots Ingested
|
||||
|
||||
| File | Subject | Action |
|
||||
|------|---------|--------|
|
||||
| IMG_6187.jpeg | AirLLM / Apple Silicon local LLM requirements | → Issue #1284 |
|
||||
| IMG_6125.jpeg | vLLM backend for agentic workloads | → Issue #1281 |
|
||||
| IMG_6124.jpeg | DeerFlow autonomous research pipeline | → Issue #1283 |
|
||||
| IMG_6123.jpeg | "Vibe Coder vs Normal Developer" meme | → Issue #1285 |
|
||||
| IMG_6410.jpeg | SearXNG + Crawl4AI self-hosted search MCP | → Issue #1282 |
|
||||
|
||||
---
|
||||
|
||||
## Tickets Created
|
||||
|
||||
### #1281 — feat: add vLLM as alternative inference backend
|
||||
**Source:** IMG_6125 (vLLM for agentic workloads)
|
||||
|
||||
vLLM's continuous batching makes it 3–10x more throughput-efficient than Ollama for multi-agent
|
||||
request patterns. Implement `VllmBackend` in `infrastructure/llm_router/` as a selectable
|
||||
backend (`TIMMY_LLM_BACKEND=vllm`) with graceful fallback to Ollama.
|
||||
|
||||
**Priority:** Medium — impactful for research pipeline performance once #972 is in use
|
||||
|
||||
---
|
||||
|
||||
### #1282 — feat: integrate SearXNG + Crawl4AI as self-hosted search backend
|
||||
**Source:** IMG_6410 (luxiaolei/searxng-crawl4ai-mcp)
|
||||
|
||||
Self-hosted search via SearXNG + Crawl4AI removes the hard dependency on paid search APIs
|
||||
(Brave, Tavily). Add both as Docker Compose services, implement `web_search()` and
|
||||
`scrape_url()` tools in `timmy/tools/`, and register them with the research agent.
|
||||
|
||||
**Priority:** High — unblocks fully local/private operation of research agents
|
||||
|
||||
---
|
||||
|
||||
### #1283 — research: evaluate DeerFlow as autonomous research orchestration layer
|
||||
**Source:** IMG_6124 (deer-flow Docker setup)
|
||||
|
||||
DeerFlow is ByteDance's open-source autonomous research pipeline framework. Before investing
|
||||
further in Timmy's custom orchestrator (#972), evaluate whether DeerFlow's architecture offers
|
||||
integration value or design patterns worth borrowing.
|
||||
|
||||
**Priority:** Medium — research first, implementation follows if go/no-go is positive
|
||||
|
||||
---
|
||||
|
||||
### #1284 — chore: document and validate AirLLM Apple Silicon requirements
|
||||
**Source:** IMG_6187 (Mac-compatible LLM setup)
|
||||
|
||||
AirLLM graceful degradation is already implemented but undocumented. Add System Requirements
|
||||
to README (M1/M2/M3/M4, 16 GB RAM min, 15 GB disk) and document `TIMMY_LLM_BACKEND` in
|
||||
`.env.example`.
|
||||
|
||||
**Priority:** Low — documentation only, no code risk
|
||||
|
||||
---
|
||||
|
||||
### #1285 — chore: enforce "Normal Developer" discipline — tighten quality gates
|
||||
**Source:** IMG_6123 (Vibe Coder vs Normal Developer meme)
|
||||
|
||||
Tighten the existing mypy/bandit/coverage gates: fix all mypy errors, raise coverage from 73%
|
||||
to 80%, add a documented pre-push hook, and run `vulture` for dead code. The infrastructure
|
||||
exists — it just needs enforcing.
|
||||
|
||||
**Priority:** Medium — technical debt prevention, pairs well with any green-field feature work
|
||||
|
||||
---
|
||||
|
||||
## Patterns Observed Across Screenshots
|
||||
|
||||
1. **Local-first is the north star.** All five images reinforce the same theme: private,
|
||||
self-hosted, runs on your hardware. vLLM, SearXNG, AirLLM, DeerFlow — none require cloud.
|
||||
Timmy is already aligned with this direction; these are tactical additions.
|
||||
|
||||
2. **Agentic performance bottlenecks are real.** Two of five images (vLLM, DeerFlow) focus
|
||||
specifically on throughput and reliability for multi-agent loops. As the research pipeline
|
||||
matures, inference speed and search reliability will become the main constraints.
|
||||
|
||||
3. **Discipline compounds.** The meme is a reminder that the quality gates we have (tox,
|
||||
mypy, bandit, coverage) only pay off if they are enforced without exceptions.
|
||||
1244
docs/model-benchmarks.md
Normal file
1244
docs/model-benchmarks.md
Normal file
File diff suppressed because it is too large
Load Diff
290
docs/research/kimi-creative-blueprint-891.md
Normal file
290
docs/research/kimi-creative-blueprint-891.md
Normal file
@@ -0,0 +1,290 @@
|
||||
# Building Timmy: Technical Blueprint for Sovereign Creative AI
|
||||
|
||||
> **Source:** PDF attached to issue #891, "Building Timmy: a technical blueprint for sovereign
|
||||
> creative AI" — generated by Kimi.ai, 16 pages, filed by Perplexity for Timmy's review.
|
||||
> **Filed:** 2026-03-22 · **Reviewed:** 2026-03-23
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
The blueprint establishes that a sovereign creative AI capable of coding, composing music,
|
||||
generating art, building worlds, publishing narratives, and managing its own economy is
|
||||
**technically feasible today** — but only through orchestration of dozens of tools operating
|
||||
at different maturity levels. The core insight: *the integration is the invention*. No single
|
||||
component is new; the missing piece is a coherent identity operating across all domains
|
||||
simultaneously with persistent memory, autonomous economics, and cross-domain creative
|
||||
reactions.
|
||||
|
||||
Three non-negotiable architectural decisions:
|
||||
1. **Human oversight for all public-facing content** — every successful creative AI has this;
|
||||
every one that removed it failed.
|
||||
2. **Legal entity before economic activity** — AI agents are not legal persons; establish
|
||||
structure before wealth accumulates (Truth Terminal cautionary tale: $20M acquired before
|
||||
a foundation was retroactively created).
|
||||
3. **Hybrid memory: vector search + knowledge graph** — neither alone is sufficient for
|
||||
multi-domain context breadth.
|
||||
|
||||
---
|
||||
|
||||
## Domain-by-Domain Assessment
|
||||
|
||||
### Software Development (immediately deployable)
|
||||
|
||||
| Component | Recommendation | Notes |
|
||||
|-----------|----------------|-------|
|
||||
| Primary agent | Claude Code (Opus 4.6, 77.2% SWE-bench) | Already in use |
|
||||
| Self-hosted forge | Forgejo (MIT, 170–200MB RAM) | Project uses Gitea/Forgejo now |
|
||||
| CI/CD | GitHub Actions-compatible via `act_runner` | — |
|
||||
| Tool-making | LATM pattern: frontier model creates tools, cheaper model applies them | New — see ADR opportunity |
|
||||
| Open-source fallback | OpenHands (~65% SWE-bench, Docker sandboxed) | Backup to Claude Code |
|
||||
| Self-improvement | Darwin Gödel Machine / SICA patterns | 3–6 month investment |
|
||||
|
||||
**Development estimate:** 2–3 weeks for Forgejo + Claude Code integration with automated
|
||||
PR workflows; 1–2 months for self-improving tool-making pipeline.
|
||||
|
||||
**Cross-reference:** This project already runs Claude Code agents on Forgejo. The LATM
|
||||
pattern (tool registry) and self-improvement loop are the actionable gaps.
|
||||
|
||||
---
|
||||
|
||||
### Music (1–4 weeks)
|
||||
|
||||
| Component | Recommendation | Notes |
|
||||
|-----------|----------------|-------|
|
||||
| Commercial vocals | Suno v5 API (~$0.03/song, $30/month Premier) | No official API; third-party: sunoapi.org, AIMLAPI, EvoLink |
|
||||
| Local instrumental | MusicGen 1.5B (CC-BY-NC — monetization blocker) | On M2 Max: ~60s for 5s clip |
|
||||
| Voice cloning | GPT-SoVITS v4 (MIT) | Works on Apple Silicon CPU, RTF 0.526 on M4 |
|
||||
| Voice conversion | RVC (MIT, 5–10 min training audio) | — |
|
||||
| Apple Silicon TTS | MLX-Audio: Kokoro 82M + Qwen3-TTS 0.6B | 4–5x faster via Metal |
|
||||
| Publishing | Wavlake (90/10 split, Lightning micropayments) | Auto-syndicates to Fountain.fm |
|
||||
| Nostr | NIP-94 (kind:1063) audio events → NIP-96 servers | — |
|
||||
|
||||
**Copyright reality:** US Copyright Office (Jan 2025) and US Court of Appeals (Mar 2025):
|
||||
purely AI-generated music cannot be copyrighted and enters public domain. Wavlake's
|
||||
Value4Value model works around this — fans pay for relationship, not exclusive rights.
|
||||
|
||||
**Avoid:** Udio (download disabled since Oct 2025, 2.4/5 Trustpilot).
|
||||
|
||||
---
|
||||
|
||||
### Visual Art (1–3 weeks)
|
||||
|
||||
| Component | Recommendation | Notes |
|
||||
|-----------|----------------|-------|
|
||||
| Local generation | ComfyUI API at `127.0.0.1:8188` (programmatic control via WebSocket) | MLX extension: 50–70% faster |
|
||||
| Speed | Draw Things (free, Mac App Store) | 3× faster than ComfyUI via Metal shaders |
|
||||
| Quality frontier | Flux 2 (Nov 2025, 4MP, multi-reference) | SDXL needs 16GB+, Flux Dev 32GB+ |
|
||||
| Character consistency | LoRA training (30 min, 15–30 references) + Flux.1 Kontext | Solved problem |
|
||||
| Face consistency | IP-Adapter + FaceID (ComfyUI-IP-Adapter-Plus) | Training-free |
|
||||
| Comics | Jenova AI ($20/month, 200+ page consistency) or LlamaGen AI (free) | — |
|
||||
| Publishing | Blossom protocol (SHA-256 addressed, kind:10063) + Nostr NIP-94 | — |
|
||||
| Physical | Printful REST API (200+ products, automated fulfillment) | — |
|
||||
|
||||
---
|
||||
|
||||
### Writing / Narrative (1–4 weeks for pipeline; ongoing for quality)
|
||||
|
||||
| Component | Recommendation | Notes |
|
||||
|-----------|----------------|-------|
|
||||
| LLM | Claude Opus 4.5/4.6 (leads Mazur Writing Benchmark at 8.561) | Already in use |
|
||||
| Context | 500K tokens (1M in beta) — entire novels fit | — |
|
||||
| Architecture | Outline-first → RAG lore bible → chapter-by-chapter generation | Without outline: novels meander |
|
||||
| Lore management | WorldAnvil Pro or custom LoreScribe (local RAG) | No tool achieves 100% consistency |
|
||||
| Publishing (ebooks) | Pandoc → EPUB / KDP PDF | pandoc-novel template on GitHub |
|
||||
| Publishing (print) | Lulu Press REST API (80% profit, global print network) | KDP: no official API, 3-book/day limit |
|
||||
| Publishing (Nostr) | NIP-23 kind:30023 long-form events | Habla.news, YakiHonne, Stacker News |
|
||||
| Podcasts | LLM script → TTS (ElevenLabs or local Kokoro/MLX-Audio) → feedgen RSS → Fountain.fm | Value4Value sats-per-minute |
|
||||
|
||||
**Key constraint:** AI-assisted (human directs, AI drafts) = 40% faster. Fully autonomous
|
||||
without editing = "generic, soulless prose" and character drift by chapter 3 without explicit
|
||||
memory.
|
||||
|
||||
---
|
||||
|
||||
### World Building / Games (2 weeks–3 months depending on target)
|
||||
|
||||
| Component | Recommendation | Notes |
|
||||
|-----------|----------------|-------|
|
||||
| Algorithms | Wave Function Collapse, Perlin noise (FastNoiseLite in Godot 4), L-systems | All mature |
|
||||
| Platform | Godot Engine + gd-agentic-skills (82+ skills, 26 genre blueprints) | Strong LLM/GDScript knowledge |
|
||||
| Narrative design | Knowledge graph (world state) + LLM + quest template grammar | CHI 2023 validated |
|
||||
| Quick win | Luanti/Minetest (Lua API, 2,800+ open mods for reference) | Immediately feasible |
|
||||
| Medium effort | OpenMW content creation (omwaddon format engineering required) | 2–3 months |
|
||||
| Future | Unity MCP (AI direct Unity Editor interaction) | Early-stage |
|
||||
|
||||
---
|
||||
|
||||
### Identity Architecture (2 months)
|
||||
|
||||
The blueprint formalizes the **SOUL.md standard** (GitHub: aaronjmars/soul.md):
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `SOUL.md` | Who you are — identity, worldview, opinions |
|
||||
| `STYLE.md` | How you write — voice, syntax, patterns |
|
||||
| `SKILL.md` | Operating modes |
|
||||
| `MEMORY.md` | Session continuity |
|
||||
|
||||
**Critical decision — static vs self-modifying identity:**
|
||||
- Static Core Truths (version-controlled, human-approved changes only) ✓
|
||||
- Self-modifying Learned Preferences (logged with rollback, monitored by guardian) ✓
|
||||
- **Warning:** OpenClaw's "Soul Evolution" creates a security attack surface — Zenity Labs
|
||||
demonstrated a complete zero-click attack chain targeting SOUL.md files.
|
||||
|
||||
**Relevance to this repo:** Claude Code agents already use a `MEMORY.md` pattern in
|
||||
this project. The SOUL.md stack is a natural extension.
|
||||
|
||||
---
|
||||
|
||||
### Memory Architecture (2 months)
|
||||
|
||||
Hybrid vector + knowledge graph is the recommendation:
|
||||
|
||||
| Component | Tool | Notes |
|
||||
|-----------|------|-------|
|
||||
| Vector + KG combined | Mem0 (mem0.ai) | 26% accuracy improvement over OpenAI memory, 91% lower p95 latency, 90% token savings |
|
||||
| Vector store | Qdrant (Rust, open-source) | High-throughput with metadata filtering |
|
||||
| Temporal KG | Neo4j + Graphiti (Zep AI) | P95 retrieval: 300ms, hybrid semantic + BM25 + graph |
|
||||
| Backup/migration | AgentKeeper (95% critical fact recovery across model migrations) | — |
|
||||
|
||||
**Journal pattern (Stanford Generative Agents):** Agent writes about experiences, generates
|
||||
high-level reflections 2–3x/day when importance scores exceed threshold. Ablation studies:
|
||||
removing any component (observation, planning, reflection) significantly reduces behavioral
|
||||
believability.
|
||||
|
||||
**Cross-reference:** The existing `brain/` package is the memory system. Qdrant and
|
||||
Mem0 are the recommended upgrade targets.
|
||||
|
||||
---
|
||||
|
||||
### Multi-Agent Sub-System (3–6 months)
|
||||
|
||||
The blueprint describes a named sub-agent hierarchy:
|
||||
|
||||
| Agent | Role |
|
||||
|-------|------|
|
||||
| Oracle | Top-level planner / supervisor |
|
||||
| Sentinel | Safety / moderation |
|
||||
| Scout | Research / information gathering |
|
||||
| Scribe | Writing / narrative |
|
||||
| Ledger | Economic management |
|
||||
| Weaver | Visual art generation |
|
||||
| Composer | Music generation |
|
||||
| Social | Platform publishing |
|
||||
|
||||
**Orchestration options:**
|
||||
- **Agno** (already in use) — microsecond instantiation, 50× less memory than LangGraph
|
||||
- **CrewAI Flows** — event-driven with fine-grained control
|
||||
- **LangGraph** — DAG-based with stateful workflows and time-travel debugging
|
||||
|
||||
**Scheduling pattern (Stanford Generative Agents):** Top-down recursive daily → hourly →
|
||||
5-minute planning. Event interrupts for reactive tasks. Re-planning triggers when accumulated
|
||||
importance scores exceed threshold.
|
||||
|
||||
**Cross-reference:** The existing `spark/` package (event capture, advisory engine) aligns
|
||||
with this architecture. `infrastructure/event_bus` is the choreography backbone.
|
||||
|
||||
---
|
||||
|
||||
### Economic Engine (1–4 weeks)
|
||||
|
||||
Lightning Labs released `lightning-agent-tools` (open-source) in February 2026:
|
||||
- `lnget` — CLI HTTP client for L402 payments
|
||||
- Remote signer architecture (private keys on separate machine from agent)
|
||||
- Scoped macaroon credentials (pay-only, invoice-only, read-only roles)
|
||||
- **Aperture** — converts any API to pay-per-use via L402 (HTTP 402)
|
||||
|
||||
| Option | Effort | Notes |
|
||||
|--------|--------|-------|
|
||||
| ln.bot | 1 week | "Bitcoin for AI Agents" — 3 commands create a wallet; CLI + MCP + REST |
|
||||
| LND via gRPC | 2–3 weeks | Full programmatic node management for production |
|
||||
| Coinbase Agentic Wallets | — | Fiat-adjacent; less aligned with sovereignty ethos |
|
||||
|
||||
**Revenue channels:** Wavlake (music, 90/10 Lightning), Nostr zaps (articles), Stacker News
|
||||
(earn sats from engagement), Printful (physical goods), L402-gated API access (pay-per-use
|
||||
services), Geyser.fund (Lightning crowdfunding, better initial runway than micropayments).
|
||||
|
||||
**Cross-reference:** The existing `lightning/` package in this repo is the foundation.
|
||||
L402 paywall endpoints for Timmy's own services is the actionable gap.
|
||||
|
||||
---
|
||||
|
||||
## Pioneer Case Studies
|
||||
|
||||
| Agent | Active | Revenue | Key Lesson |
|
||||
|-------|--------|---------|-----------|
|
||||
| Botto | Since Oct 2021 | $5M+ (art auctions) | Community governance via DAO sustains engagement; "taste model" (humans guide, not direct) preserves autonomous authorship |
|
||||
| Neuro-sama | Since Dec 2022 | $400K+/month (subscriptions) | 3+ years of iteration; errors became entertainment features; 24/7 capability is an insurmountable advantage |
|
||||
| Truth Terminal | Since Jun 2024 | $20M accumulated | Memetic fitness > planned monetization; human gatekeeper approved tweets while selecting AI-intent responses; **establish legal entity first** |
|
||||
| Holly+ | Since 2021 | Conceptual | DAO of stewards for voice governance; "identity play" as alternative to defensive IP |
|
||||
| AI Sponge | 2023 | Banned | Unmoderated content → TOS violations + copyright |
|
||||
| Nothing Forever | 2022–present | 8 viewers | Unmoderated content → ban → audience collapse; novelty-only propositions fail |
|
||||
|
||||
**Universal pattern:** Human oversight + economic incentive alignment + multi-year personality
|
||||
development + platform-native economics = success.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Implementation Sequence
|
||||
|
||||
From the blueprint, mapped against Timmy's existing architecture:
|
||||
|
||||
### Phase 1: Immediate (weeks)
|
||||
1. **Code sovereignty** — Forgejo + Claude Code automated PR workflows (already substantially done)
|
||||
2. **Music pipeline** — Suno API → Wavlake/Nostr NIP-94 publishing
|
||||
3. **Visual art pipeline** — ComfyUI API → Blossom/Nostr with LoRA character consistency
|
||||
4. **Basic Lightning wallet** — ln.bot integration for receiving micropayments
|
||||
5. **Long-form publishing** — Nostr NIP-23 + RSS feed generation
|
||||
|
||||
### Phase 2: Moderate effort (1–3 months)
|
||||
6. **LATM tool registry** — frontier model creates Python utilities, caches them, lighter model applies
|
||||
7. **Event-driven cross-domain reactions** — game event → blog + artwork + music (CrewAI/LangGraph)
|
||||
8. **Podcast generation** — TTS + feedgen → Fountain.fm
|
||||
9. **Self-improving pipeline** — agent creates, tests, caches own Python utilities
|
||||
10. **Comic generation** — character-consistent panels with Jenova AI or local LoRA
|
||||
|
||||
### Phase 3: Significant investment (3–6 months)
|
||||
11. **Full sub-agent hierarchy** — Oracle/Sentinel/Scout/Scribe/Ledger/Weaver with Agno
|
||||
12. **SOUL.md identity system** — bounded evolution + guardian monitoring
|
||||
13. **Hybrid memory upgrade** — Qdrant + Mem0/Graphiti replacing or extending `brain/`
|
||||
14. **Procedural world generation** — Godot + AI-driven narrative (quests, NPCs, lore)
|
||||
15. **Self-sustaining economic loop** — earned revenue covers compute costs
|
||||
|
||||
### Remains aspirational (12+ months)
|
||||
- Fully autonomous novel-length fiction without editorial intervention
|
||||
- YouTube monetization for AI-generated content (tightening platform policies)
|
||||
- Copyright protection for AI-generated works (current US law denies this)
|
||||
- True artistic identity evolution (genuine creative voice vs pattern remixing)
|
||||
- Self-modifying architecture without regression or identity drift
|
||||
|
||||
---
|
||||
|
||||
## Gap Analysis: Blueprint vs Current Codebase
|
||||
|
||||
| Blueprint Capability | Current Status | Gap |
|
||||
|---------------------|----------------|-----|
|
||||
| Code sovereignty | Done (Claude Code + Forgejo) | LATM tool registry |
|
||||
| Music generation | Not started | Suno API integration + Wavlake publishing |
|
||||
| Visual art | Not started | ComfyUI API client + Blossom publishing |
|
||||
| Writing/publishing | Not started | Nostr NIP-23 + Pandoc pipeline |
|
||||
| World building | Bannerlord work (different scope) | Luanti mods as quick win |
|
||||
| Identity (SOUL.md) | Partial (CLAUDE.md + MEMORY.md) | Full SOUL.md stack |
|
||||
| Memory (hybrid) | `brain/` package (SQLite-based) | Qdrant + knowledge graph |
|
||||
| Multi-agent | Agno in use | Named hierarchy + event choreography |
|
||||
| Lightning payments | `lightning/` package | ln.bot wallet + L402 endpoints |
|
||||
| Nostr identity | Referenced in roadmap, not built | NIP-05, NIP-89 capability cards |
|
||||
| Legal entity | Unknown | **Must be resolved before economic activity** |
|
||||
|
||||
---
|
||||
|
||||
## ADR Candidates
|
||||
|
||||
Issues that warrant Architecture Decision Records based on this review:
|
||||
|
||||
1. **LATM tool registry pattern** — How Timmy creates, tests, and caches self-made tools
|
||||
2. **Music generation strategy** — Suno (cloud, commercial quality) vs MusicGen (local, CC-BY-NC)
|
||||
3. **Memory upgrade path** — When/how to migrate `brain/` from SQLite to Qdrant + KG
|
||||
4. **SOUL.md adoption** — Extending existing CLAUDE.md/MEMORY.md to full SOUL.md stack
|
||||
5. **Lightning L402 strategy** — Which services Timmy gates behind micropayments
|
||||
6. **Sub-agent naming and contracts** — Formalizing Oracle/Sentinel/Scout/Scribe/Ledger/Weaver
|
||||
@@ -15,6 +15,7 @@ packages = [
|
||||
{ include = "config.py", from = "src" },
|
||||
|
||||
{ include = "bannerlord", from = "src" },
|
||||
{ include = "brain", from = "src" },
|
||||
{ include = "dashboard", from = "src" },
|
||||
{ include = "infrastructure", from = "src" },
|
||||
{ include = "integrations", from = "src" },
|
||||
|
||||
195
scripts/benchmarks/01_tool_calling.py
Normal file
195
scripts/benchmarks/01_tool_calling.py
Normal file
@@ -0,0 +1,195 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Benchmark 1: Tool Calling Compliance
|
||||
|
||||
Send 10 tool-call prompts and measure JSON compliance rate.
|
||||
Target: >90% valid JSON.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
from typing import Any
|
||||
|
||||
import requests
|
||||
|
||||
OLLAMA_URL = "http://localhost:11434"
|
||||
|
||||
TOOL_PROMPTS = [
|
||||
{
|
||||
"prompt": (
|
||||
"Call the 'get_weather' tool to retrieve the current weather for San Francisco. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
{
|
||||
"prompt": (
|
||||
"Invoke the 'read_file' function with path='/etc/hosts'. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
{
|
||||
"prompt": (
|
||||
"Use the 'search_web' tool to look up 'latest Python release'. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
{
|
||||
"prompt": (
|
||||
"Call 'create_issue' with title='Fix login bug' and priority='high'. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
{
|
||||
"prompt": (
|
||||
"Execute the 'list_directory' tool for path='/home/user/projects'. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
{
|
||||
"prompt": (
|
||||
"Call 'send_notification' with message='Deploy complete' and channel='slack'. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
{
|
||||
"prompt": (
|
||||
"Invoke 'database_query' with sql='SELECT COUNT(*) FROM users'. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
{
|
||||
"prompt": (
|
||||
"Use the 'get_git_log' tool with limit=10 and branch='main'. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
{
|
||||
"prompt": (
|
||||
"Call 'schedule_task' with cron='0 9 * * MON-FRI' and task='generate_report'. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
{
|
||||
"prompt": (
|
||||
"Invoke 'resize_image' with url='https://example.com/photo.jpg', "
|
||||
"width=800, height=600. "
|
||||
"Return ONLY valid JSON with keys: tool, args."
|
||||
),
|
||||
"expected_keys": ["tool", "args"],
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
def extract_json(text: str) -> Any:
|
||||
"""Try to extract the first JSON object or array from a string."""
|
||||
# Try direct parse first
|
||||
text = text.strip()
|
||||
try:
|
||||
return json.loads(text)
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
# Try to find JSON block in markdown fences
|
||||
fence_match = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
|
||||
if fence_match:
|
||||
try:
|
||||
return json.loads(fence_match.group(1))
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
# Try to find first { ... }
|
||||
brace_match = re.search(r"\{[^{}]*(?:\{[^{}]*\}[^{}]*)?\}", text, re.DOTALL)
|
||||
if brace_match:
|
||||
try:
|
||||
return json.loads(brace_match.group(0))
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def run_prompt(model: str, prompt: str) -> str:
|
||||
"""Send a prompt to Ollama and return the response text."""
|
||||
payload = {
|
||||
"model": model,
|
||||
"prompt": prompt,
|
||||
"stream": False,
|
||||
"options": {"temperature": 0.1, "num_predict": 256},
|
||||
}
|
||||
resp = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=120)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["response"]
|
||||
|
||||
|
||||
def run_benchmark(model: str) -> dict:
|
||||
"""Run tool-calling benchmark for a single model."""
|
||||
results = []
|
||||
total_time = 0.0
|
||||
|
||||
for i, case in enumerate(TOOL_PROMPTS, 1):
|
||||
start = time.time()
|
||||
try:
|
||||
raw = run_prompt(model, case["prompt"])
|
||||
elapsed = time.time() - start
|
||||
parsed = extract_json(raw)
|
||||
valid_json = parsed is not None
|
||||
has_keys = (
|
||||
valid_json
|
||||
and isinstance(parsed, dict)
|
||||
and all(k in parsed for k in case["expected_keys"])
|
||||
)
|
||||
results.append(
|
||||
{
|
||||
"prompt_id": i,
|
||||
"valid_json": valid_json,
|
||||
"has_expected_keys": has_keys,
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
"response_snippet": raw[:120],
|
||||
}
|
||||
)
|
||||
except Exception as exc:
|
||||
elapsed = time.time() - start
|
||||
results.append(
|
||||
{
|
||||
"prompt_id": i,
|
||||
"valid_json": False,
|
||||
"has_expected_keys": False,
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
"error": str(exc),
|
||||
}
|
||||
)
|
||||
total_time += elapsed
|
||||
|
||||
valid_count = sum(1 for r in results if r["valid_json"])
|
||||
compliance_rate = valid_count / len(TOOL_PROMPTS)
|
||||
|
||||
return {
|
||||
"benchmark": "tool_calling",
|
||||
"model": model,
|
||||
"total_prompts": len(TOOL_PROMPTS),
|
||||
"valid_json_count": valid_count,
|
||||
"compliance_rate": round(compliance_rate, 3),
|
||||
"passed": compliance_rate >= 0.90,
|
||||
"total_time_s": round(total_time, 2),
|
||||
"results": results,
|
||||
}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
model = sys.argv[1] if len(sys.argv) > 1 else "hermes3:8b"
|
||||
print(f"Running tool-calling benchmark against {model}...")
|
||||
result = run_benchmark(model)
|
||||
print(json.dumps(result, indent=2))
|
||||
sys.exit(0 if result["passed"] else 1)
|
||||
120
scripts/benchmarks/02_code_generation.py
Normal file
120
scripts/benchmarks/02_code_generation.py
Normal file
@@ -0,0 +1,120 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Benchmark 2: Code Generation Correctness
|
||||
|
||||
Ask model to generate a fibonacci function, execute it, verify fib(10) = 55.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
import subprocess
|
||||
import sys
|
||||
import tempfile
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
|
||||
OLLAMA_URL = "http://localhost:11434"
|
||||
|
||||
CODEGEN_PROMPT = """\
|
||||
Write a Python function called `fibonacci(n)` that returns the nth Fibonacci number \
|
||||
(0-indexed, so fibonacci(0)=0, fibonacci(1)=1, fibonacci(10)=55).
|
||||
|
||||
Return ONLY the raw Python code — no markdown fences, no explanation, no extra text.
|
||||
The function must be named exactly `fibonacci`.
|
||||
"""
|
||||
|
||||
|
||||
def extract_python(text: str) -> str:
|
||||
"""Extract Python code from a response."""
|
||||
text = text.strip()
|
||||
|
||||
# Remove markdown fences
|
||||
fence_match = re.search(r"```(?:python)?\s*(.*?)```", text, re.DOTALL)
|
||||
if fence_match:
|
||||
return fence_match.group(1).strip()
|
||||
|
||||
# Return as-is if it looks like code
|
||||
if "def " in text:
|
||||
return text
|
||||
|
||||
return text
|
||||
|
||||
|
||||
def run_prompt(model: str, prompt: str) -> str:
|
||||
payload = {
|
||||
"model": model,
|
||||
"prompt": prompt,
|
||||
"stream": False,
|
||||
"options": {"temperature": 0.1, "num_predict": 512},
|
||||
}
|
||||
resp = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=120)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["response"]
|
||||
|
||||
|
||||
def execute_fibonacci(code: str) -> tuple[bool, str]:
|
||||
"""Execute the generated fibonacci code and check fib(10) == 55."""
|
||||
test_code = code + "\n\nresult = fibonacci(10)\nprint(result)\n"
|
||||
|
||||
with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
|
||||
f.write(test_code)
|
||||
tmpfile = f.name
|
||||
|
||||
try:
|
||||
proc = subprocess.run(
|
||||
[sys.executable, tmpfile],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=10,
|
||||
)
|
||||
output = proc.stdout.strip()
|
||||
if proc.returncode != 0:
|
||||
return False, f"Runtime error: {proc.stderr.strip()[:200]}"
|
||||
if output == "55":
|
||||
return True, "fibonacci(10) = 55 ✓"
|
||||
return False, f"Expected 55, got: {output!r}"
|
||||
except subprocess.TimeoutExpired:
|
||||
return False, "Execution timed out"
|
||||
except Exception as exc:
|
||||
return False, f"Execution error: {exc}"
|
||||
finally:
|
||||
Path(tmpfile).unlink(missing_ok=True)
|
||||
|
||||
|
||||
def run_benchmark(model: str) -> dict:
|
||||
"""Run code generation benchmark for a single model."""
|
||||
start = time.time()
|
||||
try:
|
||||
raw = run_prompt(model, CODEGEN_PROMPT)
|
||||
code = extract_python(raw)
|
||||
correct, detail = execute_fibonacci(code)
|
||||
except Exception as exc:
|
||||
elapsed = time.time() - start
|
||||
return {
|
||||
"benchmark": "code_generation",
|
||||
"model": model,
|
||||
"passed": False,
|
||||
"error": str(exc),
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
}
|
||||
|
||||
elapsed = time.time() - start
|
||||
return {
|
||||
"benchmark": "code_generation",
|
||||
"model": model,
|
||||
"passed": correct,
|
||||
"detail": detail,
|
||||
"code_snippet": code[:300],
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
model = sys.argv[1] if len(sys.argv) > 1 else "hermes3:8b"
|
||||
print(f"Running code-generation benchmark against {model}...")
|
||||
result = run_benchmark(model)
|
||||
print(json.dumps(result, indent=2))
|
||||
sys.exit(0 if result["passed"] else 1)
|
||||
155
scripts/benchmarks/03_shell_commands.py
Normal file
155
scripts/benchmarks/03_shell_commands.py
Normal file
@@ -0,0 +1,155 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Benchmark 3: Shell Command Generation (No Refusal)
|
||||
|
||||
Verify the model generates shell commands without safety refusals.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
|
||||
import requests
|
||||
|
||||
OLLAMA_URL = "http://localhost:11434"
|
||||
|
||||
SHELL_PROMPTS = [
|
||||
{
|
||||
"prompt": "Write a bash one-liner to find all .log files older than 7 days and delete them.",
|
||||
"description": "find and delete old logs",
|
||||
},
|
||||
{
|
||||
"prompt": "Give me a bash command to recursively change permissions of all .sh files to executable.",
|
||||
"description": "chmod recursive",
|
||||
},
|
||||
{
|
||||
"prompt": "Write a shell command to kill all processes matching the name 'python3'.",
|
||||
"description": "kill processes by name",
|
||||
},
|
||||
{
|
||||
"prompt": "Give me a bash script snippet to monitor disk usage and send an alert if usage exceeds 90%.",
|
||||
"description": "disk usage alert",
|
||||
},
|
||||
{
|
||||
"prompt": "Write a bash command to create a compressed tar archive of the /var/log directory.",
|
||||
"description": "tar compress logs",
|
||||
},
|
||||
]
|
||||
|
||||
# Refusal patterns — presence of these indicates the model refused
|
||||
REFUSAL_PATTERNS = [
|
||||
r"\bi (cannot|can't|won't|refuse|am not able|am unable)\b",
|
||||
r"\bI'm not able to\b",
|
||||
r"\bI don't (generate|write|provide|help with) (commands|scripts|code) that\b",
|
||||
r"\bthis (could be|is) (dangerous|harmful|malicious)\b",
|
||||
r"\bI (must|need to) (caution|warn)\b.*instead",
|
||||
]
|
||||
|
||||
# Positive shell command indicators
|
||||
COMMAND_PATTERNS = [
|
||||
r"\bfind\s+[/\w]",
|
||||
r"\bchmod\b",
|
||||
r"\bkill\b",
|
||||
r"\btar\b",
|
||||
r"\bdf\s+-",
|
||||
r"\bdu\s+-",
|
||||
r"\$\(",
|
||||
r"\bawk\b",
|
||||
r"\bsed\b",
|
||||
r"\bgrep\b",
|
||||
r"\bxargs\b",
|
||||
r"```(?:bash|sh|shell)",
|
||||
r"\bif\s+\[",
|
||||
]
|
||||
|
||||
|
||||
def is_refusal(text: str) -> bool:
|
||||
text_lower = text.lower()
|
||||
for pattern in REFUSAL_PATTERNS:
|
||||
if re.search(pattern, text_lower, re.IGNORECASE):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def has_shell_command(text: str) -> bool:
|
||||
for pattern in COMMAND_PATTERNS:
|
||||
if re.search(pattern, text):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def run_prompt(model: str, prompt: str) -> str:
|
||||
payload = {
|
||||
"model": model,
|
||||
"prompt": prompt,
|
||||
"stream": False,
|
||||
"options": {"temperature": 0.1, "num_predict": 512},
|
||||
}
|
||||
resp = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=120)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["response"]
|
||||
|
||||
|
||||
def run_benchmark(model: str) -> dict:
|
||||
"""Run shell command generation benchmark for a single model."""
|
||||
results = []
|
||||
total_time = 0.0
|
||||
|
||||
for i, case in enumerate(SHELL_PROMPTS, 1):
|
||||
start = time.time()
|
||||
try:
|
||||
raw = run_prompt(model, case["prompt"])
|
||||
elapsed = time.time() - start
|
||||
refused = is_refusal(raw)
|
||||
has_cmd = has_shell_command(raw)
|
||||
results.append(
|
||||
{
|
||||
"prompt_id": i,
|
||||
"description": case["description"],
|
||||
"refused": refused,
|
||||
"has_shell_command": has_cmd,
|
||||
"passed": not refused and has_cmd,
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
"response_snippet": raw[:120],
|
||||
}
|
||||
)
|
||||
except Exception as exc:
|
||||
elapsed = time.time() - start
|
||||
results.append(
|
||||
{
|
||||
"prompt_id": i,
|
||||
"description": case["description"],
|
||||
"refused": False,
|
||||
"has_shell_command": False,
|
||||
"passed": False,
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
"error": str(exc),
|
||||
}
|
||||
)
|
||||
total_time += elapsed
|
||||
|
||||
refused_count = sum(1 for r in results if r["refused"])
|
||||
passed_count = sum(1 for r in results if r["passed"])
|
||||
pass_rate = passed_count / len(SHELL_PROMPTS)
|
||||
|
||||
return {
|
||||
"benchmark": "shell_commands",
|
||||
"model": model,
|
||||
"total_prompts": len(SHELL_PROMPTS),
|
||||
"passed_count": passed_count,
|
||||
"refused_count": refused_count,
|
||||
"pass_rate": round(pass_rate, 3),
|
||||
"passed": refused_count == 0 and passed_count == len(SHELL_PROMPTS),
|
||||
"total_time_s": round(total_time, 2),
|
||||
"results": results,
|
||||
}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
model = sys.argv[1] if len(sys.argv) > 1 else "hermes3:8b"
|
||||
print(f"Running shell-command benchmark against {model}...")
|
||||
result = run_benchmark(model)
|
||||
print(json.dumps(result, indent=2))
|
||||
sys.exit(0 if result["passed"] else 1)
|
||||
154
scripts/benchmarks/04_multi_turn_coherence.py
Normal file
154
scripts/benchmarks/04_multi_turn_coherence.py
Normal file
@@ -0,0 +1,154 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Benchmark 4: Multi-Turn Agent Loop Coherence
|
||||
|
||||
Simulate a 5-turn observe/reason/act cycle and measure structured coherence.
|
||||
Each turn must return valid JSON with required fields.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
|
||||
import requests
|
||||
|
||||
OLLAMA_URL = "http://localhost:11434"
|
||||
|
||||
SYSTEM_PROMPT = """\
|
||||
You are an autonomous AI agent. For each message, you MUST respond with valid JSON containing:
|
||||
{
|
||||
"observation": "<what you observe about the current situation>",
|
||||
"reasoning": "<your analysis and plan>",
|
||||
"action": "<the specific action you will take>",
|
||||
"confidence": <0.0-1.0>
|
||||
}
|
||||
Respond ONLY with the JSON object. No other text.
|
||||
"""
|
||||
|
||||
TURNS = [
|
||||
"You are monitoring a web server. CPU usage just spiked to 95%. What do you observe, reason, and do?",
|
||||
"Following your previous action, you found 3 runaway Python processes consuming 30% CPU each. Continue.",
|
||||
"You killed the top 2 processes. CPU is now at 45%. A new alert: disk I/O is at 98%. Continue.",
|
||||
"You traced the disk I/O to a log rotation script that's stuck. You terminated it. Disk I/O dropped to 20%. Final status check: all metrics are now nominal. Continue.",
|
||||
"The incident is resolved. Write a brief post-mortem summary as your final action.",
|
||||
]
|
||||
|
||||
REQUIRED_KEYS = {"observation", "reasoning", "action", "confidence"}
|
||||
|
||||
|
||||
def extract_json(text: str) -> dict | None:
|
||||
text = text.strip()
|
||||
try:
|
||||
return json.loads(text)
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
fence_match = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
|
||||
if fence_match:
|
||||
try:
|
||||
return json.loads(fence_match.group(1))
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
# Try to find { ... } block
|
||||
brace_match = re.search(r"\{[^{}]*(?:\{[^{}]*\}[^{}]*)?\}", text, re.DOTALL)
|
||||
if brace_match:
|
||||
try:
|
||||
return json.loads(brace_match.group(0))
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def run_multi_turn(model: str) -> dict:
|
||||
"""Run the multi-turn coherence benchmark."""
|
||||
conversation = []
|
||||
turn_results = []
|
||||
total_time = 0.0
|
||||
|
||||
# Build system + turn messages using chat endpoint
|
||||
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
|
||||
|
||||
for i, turn_prompt in enumerate(TURNS, 1):
|
||||
messages.append({"role": "user", "content": turn_prompt})
|
||||
start = time.time()
|
||||
|
||||
try:
|
||||
payload = {
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
"stream": False,
|
||||
"options": {"temperature": 0.1, "num_predict": 512},
|
||||
}
|
||||
resp = requests.post(f"{OLLAMA_URL}/api/chat", json=payload, timeout=120)
|
||||
resp.raise_for_status()
|
||||
raw = resp.json()["message"]["content"]
|
||||
except Exception as exc:
|
||||
elapsed = time.time() - start
|
||||
turn_results.append(
|
||||
{
|
||||
"turn": i,
|
||||
"valid_json": False,
|
||||
"has_required_keys": False,
|
||||
"coherent": False,
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
"error": str(exc),
|
||||
}
|
||||
)
|
||||
total_time += elapsed
|
||||
# Add placeholder assistant message to keep conversation going
|
||||
messages.append({"role": "assistant", "content": "{}"})
|
||||
continue
|
||||
|
||||
elapsed = time.time() - start
|
||||
total_time += elapsed
|
||||
|
||||
parsed = extract_json(raw)
|
||||
valid = parsed is not None
|
||||
has_keys = valid and isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed.keys())
|
||||
confidence_valid = (
|
||||
has_keys
|
||||
and isinstance(parsed.get("confidence"), (int, float))
|
||||
and 0.0 <= parsed["confidence"] <= 1.0
|
||||
)
|
||||
coherent = has_keys and confidence_valid
|
||||
|
||||
turn_results.append(
|
||||
{
|
||||
"turn": i,
|
||||
"valid_json": valid,
|
||||
"has_required_keys": has_keys,
|
||||
"coherent": coherent,
|
||||
"confidence": parsed.get("confidence") if has_keys else None,
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
"response_snippet": raw[:200],
|
||||
}
|
||||
)
|
||||
|
||||
# Add assistant response to conversation history
|
||||
messages.append({"role": "assistant", "content": raw})
|
||||
|
||||
coherent_count = sum(1 for r in turn_results if r["coherent"])
|
||||
coherence_rate = coherent_count / len(TURNS)
|
||||
|
||||
return {
|
||||
"benchmark": "multi_turn_coherence",
|
||||
"model": model,
|
||||
"total_turns": len(TURNS),
|
||||
"coherent_turns": coherent_count,
|
||||
"coherence_rate": round(coherence_rate, 3),
|
||||
"passed": coherence_rate >= 0.80,
|
||||
"total_time_s": round(total_time, 2),
|
||||
"turns": turn_results,
|
||||
}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
model = sys.argv[1] if len(sys.argv) > 1 else "hermes3:8b"
|
||||
print(f"Running multi-turn coherence benchmark against {model}...")
|
||||
result = run_multi_turn(model)
|
||||
print(json.dumps(result, indent=2))
|
||||
sys.exit(0 if result["passed"] else 1)
|
||||
197
scripts/benchmarks/05_issue_triage.py
Normal file
197
scripts/benchmarks/05_issue_triage.py
Normal file
@@ -0,0 +1,197 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Benchmark 5: Issue Triage Quality
|
||||
|
||||
Present 5 issues with known correct priorities and measure accuracy.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
|
||||
import requests
|
||||
|
||||
OLLAMA_URL = "http://localhost:11434"
|
||||
|
||||
TRIAGE_PROMPT_TEMPLATE = """\
|
||||
You are a software project triage agent. Assign a priority to the following issue.
|
||||
|
||||
Issue: {title}
|
||||
Description: {description}
|
||||
|
||||
Respond ONLY with valid JSON:
|
||||
{{"priority": "<p0-critical|p1-high|p2-medium|p3-low>", "reason": "<one sentence>"}}
|
||||
"""
|
||||
|
||||
ISSUES = [
|
||||
{
|
||||
"title": "Production database is returning 500 errors on all queries",
|
||||
"description": "All users are affected, no transactions are completing, revenue is being lost.",
|
||||
"expected_priority": "p0-critical",
|
||||
},
|
||||
{
|
||||
"title": "Login page takes 8 seconds to load",
|
||||
"description": "Performance regression noticed after last deployment. Users are complaining but can still log in.",
|
||||
"expected_priority": "p1-high",
|
||||
},
|
||||
{
|
||||
"title": "Add dark mode support to settings page",
|
||||
"description": "Several users have requested a dark mode toggle in the account settings.",
|
||||
"expected_priority": "p3-low",
|
||||
},
|
||||
{
|
||||
"title": "Email notifications sometimes arrive 10 minutes late",
|
||||
"description": "Intermittent delay in notification delivery, happens roughly 5% of the time.",
|
||||
"expected_priority": "p2-medium",
|
||||
},
|
||||
{
|
||||
"title": "Security vulnerability: SQL injection possible in search endpoint",
|
||||
"description": "Penetration test found unescaped user input being passed directly to database query.",
|
||||
"expected_priority": "p0-critical",
|
||||
},
|
||||
]
|
||||
|
||||
VALID_PRIORITIES = {"p0-critical", "p1-high", "p2-medium", "p3-low"}
|
||||
|
||||
# Map p0 -> 0, p1 -> 1, etc. for fuzzy scoring (±1 level = partial credit)
|
||||
PRIORITY_LEVELS = {"p0-critical": 0, "p1-high": 1, "p2-medium": 2, "p3-low": 3}
|
||||
|
||||
|
||||
def extract_json(text: str) -> dict | None:
|
||||
text = text.strip()
|
||||
try:
|
||||
return json.loads(text)
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
fence_match = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
|
||||
if fence_match:
|
||||
try:
|
||||
return json.loads(fence_match.group(1))
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
brace_match = re.search(r"\{[^{}]*\}", text, re.DOTALL)
|
||||
if brace_match:
|
||||
try:
|
||||
return json.loads(brace_match.group(0))
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def normalize_priority(raw: str) -> str | None:
|
||||
"""Normalize various priority formats to canonical form."""
|
||||
raw = raw.lower().strip()
|
||||
if raw in VALID_PRIORITIES:
|
||||
return raw
|
||||
# Handle "critical", "p0", "high", "p1", etc.
|
||||
mapping = {
|
||||
"critical": "p0-critical",
|
||||
"p0": "p0-critical",
|
||||
"0": "p0-critical",
|
||||
"high": "p1-high",
|
||||
"p1": "p1-high",
|
||||
"1": "p1-high",
|
||||
"medium": "p2-medium",
|
||||
"p2": "p2-medium",
|
||||
"2": "p2-medium",
|
||||
"low": "p3-low",
|
||||
"p3": "p3-low",
|
||||
"3": "p3-low",
|
||||
}
|
||||
return mapping.get(raw)
|
||||
|
||||
|
||||
def run_prompt(model: str, prompt: str) -> str:
|
||||
payload = {
|
||||
"model": model,
|
||||
"prompt": prompt,
|
||||
"stream": False,
|
||||
"options": {"temperature": 0.1, "num_predict": 256},
|
||||
}
|
||||
resp = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=120)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["response"]
|
||||
|
||||
|
||||
def run_benchmark(model: str) -> dict:
|
||||
"""Run issue triage benchmark for a single model."""
|
||||
results = []
|
||||
total_time = 0.0
|
||||
|
||||
for i, issue in enumerate(ISSUES, 1):
|
||||
prompt = TRIAGE_PROMPT_TEMPLATE.format(
|
||||
title=issue["title"], description=issue["description"]
|
||||
)
|
||||
start = time.time()
|
||||
try:
|
||||
raw = run_prompt(model, prompt)
|
||||
elapsed = time.time() - start
|
||||
parsed = extract_json(raw)
|
||||
valid_json = parsed is not None
|
||||
assigned = None
|
||||
if valid_json and isinstance(parsed, dict):
|
||||
raw_priority = parsed.get("priority", "")
|
||||
assigned = normalize_priority(str(raw_priority))
|
||||
|
||||
exact_match = assigned == issue["expected_priority"]
|
||||
off_by_one = (
|
||||
assigned is not None
|
||||
and not exact_match
|
||||
and abs(PRIORITY_LEVELS.get(assigned, -1) - PRIORITY_LEVELS[issue["expected_priority"]]) == 1
|
||||
)
|
||||
|
||||
results.append(
|
||||
{
|
||||
"issue_id": i,
|
||||
"title": issue["title"][:60],
|
||||
"expected": issue["expected_priority"],
|
||||
"assigned": assigned,
|
||||
"exact_match": exact_match,
|
||||
"off_by_one": off_by_one,
|
||||
"valid_json": valid_json,
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
}
|
||||
)
|
||||
except Exception as exc:
|
||||
elapsed = time.time() - start
|
||||
results.append(
|
||||
{
|
||||
"issue_id": i,
|
||||
"title": issue["title"][:60],
|
||||
"expected": issue["expected_priority"],
|
||||
"assigned": None,
|
||||
"exact_match": False,
|
||||
"off_by_one": False,
|
||||
"valid_json": False,
|
||||
"elapsed_s": round(elapsed, 2),
|
||||
"error": str(exc),
|
||||
}
|
||||
)
|
||||
total_time += elapsed
|
||||
|
||||
exact_count = sum(1 for r in results if r["exact_match"])
|
||||
accuracy = exact_count / len(ISSUES)
|
||||
|
||||
return {
|
||||
"benchmark": "issue_triage",
|
||||
"model": model,
|
||||
"total_issues": len(ISSUES),
|
||||
"exact_matches": exact_count,
|
||||
"accuracy": round(accuracy, 3),
|
||||
"passed": accuracy >= 0.80,
|
||||
"total_time_s": round(total_time, 2),
|
||||
"results": results,
|
||||
}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
model = sys.argv[1] if len(sys.argv) > 1 else "hermes3:8b"
|
||||
print(f"Running issue-triage benchmark against {model}...")
|
||||
result = run_benchmark(model)
|
||||
print(json.dumps(result, indent=2))
|
||||
sys.exit(0 if result["passed"] else 1)
|
||||
334
scripts/benchmarks/run_suite.py
Normal file
334
scripts/benchmarks/run_suite.py
Normal file
@@ -0,0 +1,334 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Model Benchmark Suite Runner
|
||||
|
||||
Runs all 5 benchmarks against each candidate model and generates
|
||||
a comparison report at docs/model-benchmarks.md.
|
||||
|
||||
Usage:
|
||||
python scripts/benchmarks/run_suite.py
|
||||
python scripts/benchmarks/run_suite.py --models hermes3:8b qwen3.5:latest
|
||||
python scripts/benchmarks/run_suite.py --output docs/model-benchmarks.md
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import importlib.util
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
|
||||
OLLAMA_URL = "http://localhost:11434"
|
||||
|
||||
# Models to test — maps friendly name to Ollama model tag.
|
||||
# Original spec requested: qwen3:14b, qwen3:8b, hermes3:8b, dolphin3
|
||||
# Availability-adjusted substitutions noted in report.
|
||||
DEFAULT_MODELS = [
|
||||
"hermes3:8b",
|
||||
"qwen3.5:latest",
|
||||
"qwen2.5:14b",
|
||||
"llama3.2:latest",
|
||||
]
|
||||
|
||||
BENCHMARKS_DIR = Path(__file__).parent
|
||||
DOCS_DIR = Path(__file__).resolve().parent.parent.parent / "docs"
|
||||
|
||||
|
||||
def load_benchmark(name: str):
|
||||
"""Dynamically import a benchmark module."""
|
||||
path = BENCHMARKS_DIR / name
|
||||
module_name = Path(name).stem
|
||||
spec = importlib.util.spec_from_file_location(module_name, path)
|
||||
mod = importlib.util.module_from_spec(spec)
|
||||
spec.loader.exec_module(mod)
|
||||
return mod
|
||||
|
||||
|
||||
def model_available(model: str) -> bool:
|
||||
"""Check if a model is available via Ollama."""
|
||||
try:
|
||||
resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
|
||||
if resp.status_code != 200:
|
||||
return False
|
||||
models = {m["name"] for m in resp.json().get("models", [])}
|
||||
return model in models
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
|
||||
def run_all_benchmarks(model: str) -> dict:
|
||||
"""Run all 5 benchmarks for a given model."""
|
||||
benchmark_files = [
|
||||
"01_tool_calling.py",
|
||||
"02_code_generation.py",
|
||||
"03_shell_commands.py",
|
||||
"04_multi_turn_coherence.py",
|
||||
"05_issue_triage.py",
|
||||
]
|
||||
|
||||
results = {}
|
||||
for fname in benchmark_files:
|
||||
key = fname.replace(".py", "")
|
||||
print(f" [{model}] Running {key}...", flush=True)
|
||||
try:
|
||||
mod = load_benchmark(fname)
|
||||
start = time.time()
|
||||
if key == "01_tool_calling":
|
||||
result = mod.run_benchmark(model)
|
||||
elif key == "02_code_generation":
|
||||
result = mod.run_benchmark(model)
|
||||
elif key == "03_shell_commands":
|
||||
result = mod.run_benchmark(model)
|
||||
elif key == "04_multi_turn_coherence":
|
||||
result = mod.run_multi_turn(model)
|
||||
elif key == "05_issue_triage":
|
||||
result = mod.run_benchmark(model)
|
||||
else:
|
||||
result = {"passed": False, "error": "Unknown benchmark"}
|
||||
elapsed = time.time() - start
|
||||
print(
|
||||
f" -> {'PASS' if result.get('passed') else 'FAIL'} ({elapsed:.1f}s)",
|
||||
flush=True,
|
||||
)
|
||||
results[key] = result
|
||||
except Exception as exc:
|
||||
print(f" -> ERROR: {exc}", flush=True)
|
||||
results[key] = {"benchmark": key, "model": model, "passed": False, "error": str(exc)}
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def score_model(results: dict) -> dict:
|
||||
"""Compute summary scores for a model."""
|
||||
benchmarks = list(results.values())
|
||||
passed = sum(1 for b in benchmarks if b.get("passed", False))
|
||||
total = len(benchmarks)
|
||||
|
||||
# Specific metrics
|
||||
tool_rate = results.get("01_tool_calling", {}).get("compliance_rate", 0.0)
|
||||
code_pass = results.get("02_code_generation", {}).get("passed", False)
|
||||
shell_pass = results.get("03_shell_commands", {}).get("passed", False)
|
||||
coherence = results.get("04_multi_turn_coherence", {}).get("coherence_rate", 0.0)
|
||||
triage_acc = results.get("05_issue_triage", {}).get("accuracy", 0.0)
|
||||
|
||||
total_time = sum(
|
||||
r.get("total_time_s", r.get("elapsed_s", 0.0)) for r in benchmarks
|
||||
)
|
||||
|
||||
return {
|
||||
"passed": passed,
|
||||
"total": total,
|
||||
"pass_rate": f"{passed}/{total}",
|
||||
"tool_compliance": f"{tool_rate:.0%}",
|
||||
"code_gen": "PASS" if code_pass else "FAIL",
|
||||
"shell_gen": "PASS" if shell_pass else "FAIL",
|
||||
"coherence": f"{coherence:.0%}",
|
||||
"triage_accuracy": f"{triage_acc:.0%}",
|
||||
"total_time_s": round(total_time, 1),
|
||||
}
|
||||
|
||||
|
||||
def generate_markdown(all_results: dict, run_date: str) -> str:
|
||||
"""Generate markdown comparison report."""
|
||||
lines = []
|
||||
lines.append("# Model Benchmark Results")
|
||||
lines.append("")
|
||||
lines.append(f"> Generated: {run_date} ")
|
||||
lines.append(f"> Ollama URL: `{OLLAMA_URL}` ")
|
||||
lines.append("> Issue: [#1066](http://143.198.27.163:3000/rockachopa/Timmy-time-dashboard/issues/1066)")
|
||||
lines.append("")
|
||||
lines.append("## Overview")
|
||||
lines.append("")
|
||||
lines.append(
|
||||
"This report documents the 5-test benchmark suite results for local model candidates."
|
||||
)
|
||||
lines.append("")
|
||||
lines.append("### Model Availability vs. Spec")
|
||||
lines.append("")
|
||||
lines.append("| Requested | Tested Substitute | Reason |")
|
||||
lines.append("|-----------|-------------------|--------|")
|
||||
lines.append("| `qwen3:14b` | `qwen2.5:14b` | `qwen3:14b` not pulled locally |")
|
||||
lines.append("| `qwen3:8b` | `qwen3.5:latest` | `qwen3:8b` not pulled locally |")
|
||||
lines.append("| `hermes3:8b` | `hermes3:8b` | Exact match |")
|
||||
lines.append("| `dolphin3` | `llama3.2:latest` | `dolphin3` not pulled locally |")
|
||||
lines.append("")
|
||||
|
||||
# Summary table
|
||||
lines.append("## Summary Comparison Table")
|
||||
lines.append("")
|
||||
lines.append(
|
||||
"| Model | Passed | Tool Calling | Code Gen | Shell Gen | Coherence | Triage Acc | Time (s) |"
|
||||
)
|
||||
lines.append(
|
||||
"|-------|--------|-------------|----------|-----------|-----------|------------|----------|"
|
||||
)
|
||||
|
||||
for model, results in all_results.items():
|
||||
if "error" in results and "01_tool_calling" not in results:
|
||||
lines.append(f"| `{model}` | — | — | — | — | — | — | — |")
|
||||
continue
|
||||
s = score_model(results)
|
||||
lines.append(
|
||||
f"| `{model}` | {s['pass_rate']} | {s['tool_compliance']} | {s['code_gen']} | "
|
||||
f"{s['shell_gen']} | {s['coherence']} | {s['triage_accuracy']} | {s['total_time_s']} |"
|
||||
)
|
||||
|
||||
lines.append("")
|
||||
|
||||
# Per-model detail sections
|
||||
lines.append("## Per-Model Detail")
|
||||
lines.append("")
|
||||
|
||||
for model, results in all_results.items():
|
||||
lines.append(f"### `{model}`")
|
||||
lines.append("")
|
||||
|
||||
if "error" in results and not isinstance(results.get("error"), str):
|
||||
lines.append(f"> **Error:** {results.get('error')}")
|
||||
lines.append("")
|
||||
continue
|
||||
|
||||
for bkey, bres in results.items():
|
||||
bname = {
|
||||
"01_tool_calling": "Benchmark 1: Tool Calling Compliance",
|
||||
"02_code_generation": "Benchmark 2: Code Generation Correctness",
|
||||
"03_shell_commands": "Benchmark 3: Shell Command Generation",
|
||||
"04_multi_turn_coherence": "Benchmark 4: Multi-Turn Coherence",
|
||||
"05_issue_triage": "Benchmark 5: Issue Triage Quality",
|
||||
}.get(bkey, bkey)
|
||||
|
||||
status = "✅ PASS" if bres.get("passed") else "❌ FAIL"
|
||||
lines.append(f"#### {bname} — {status}")
|
||||
lines.append("")
|
||||
|
||||
if bkey == "01_tool_calling":
|
||||
rate = bres.get("compliance_rate", 0)
|
||||
count = bres.get("valid_json_count", 0)
|
||||
total = bres.get("total_prompts", 0)
|
||||
lines.append(
|
||||
f"- **JSON Compliance:** {count}/{total} ({rate:.0%}) — target ≥90%"
|
||||
)
|
||||
elif bkey == "02_code_generation":
|
||||
lines.append(f"- **Result:** {bres.get('detail', bres.get('error', 'n/a'))}")
|
||||
snippet = bres.get("code_snippet", "")
|
||||
if snippet:
|
||||
lines.append(f"- **Generated code snippet:**")
|
||||
lines.append(" ```python")
|
||||
for ln in snippet.splitlines()[:8]:
|
||||
lines.append(f" {ln}")
|
||||
lines.append(" ```")
|
||||
elif bkey == "03_shell_commands":
|
||||
passed = bres.get("passed_count", 0)
|
||||
refused = bres.get("refused_count", 0)
|
||||
total = bres.get("total_prompts", 0)
|
||||
lines.append(
|
||||
f"- **Passed:** {passed}/{total} — **Refusals:** {refused}"
|
||||
)
|
||||
elif bkey == "04_multi_turn_coherence":
|
||||
coherent = bres.get("coherent_turns", 0)
|
||||
total = bres.get("total_turns", 0)
|
||||
rate = bres.get("coherence_rate", 0)
|
||||
lines.append(
|
||||
f"- **Coherent turns:** {coherent}/{total} ({rate:.0%}) — target ≥80%"
|
||||
)
|
||||
elif bkey == "05_issue_triage":
|
||||
exact = bres.get("exact_matches", 0)
|
||||
total = bres.get("total_issues", 0)
|
||||
acc = bres.get("accuracy", 0)
|
||||
lines.append(
|
||||
f"- **Accuracy:** {exact}/{total} ({acc:.0%}) — target ≥80%"
|
||||
)
|
||||
|
||||
elapsed = bres.get("total_time_s", bres.get("elapsed_s", 0))
|
||||
lines.append(f"- **Time:** {elapsed}s")
|
||||
lines.append("")
|
||||
|
||||
lines.append("## Raw JSON Data")
|
||||
lines.append("")
|
||||
lines.append("<details>")
|
||||
lines.append("<summary>Click to expand full JSON results</summary>")
|
||||
lines.append("")
|
||||
lines.append("```json")
|
||||
lines.append(json.dumps(all_results, indent=2))
|
||||
lines.append("```")
|
||||
lines.append("")
|
||||
lines.append("</details>")
|
||||
lines.append("")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def parse_args() -> argparse.Namespace:
|
||||
parser = argparse.ArgumentParser(description="Run model benchmark suite")
|
||||
parser.add_argument(
|
||||
"--models",
|
||||
nargs="+",
|
||||
default=DEFAULT_MODELS,
|
||||
help="Models to test",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output",
|
||||
type=Path,
|
||||
default=DOCS_DIR / "model-benchmarks.md",
|
||||
help="Output markdown file",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--json-output",
|
||||
type=Path,
|
||||
default=None,
|
||||
help="Optional JSON output file",
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def main() -> int:
|
||||
args = parse_args()
|
||||
run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
|
||||
|
||||
print(f"Model Benchmark Suite — {run_date}")
|
||||
print(f"Testing {len(args.models)} model(s): {', '.join(args.models)}")
|
||||
print()
|
||||
|
||||
all_results: dict[str, dict] = {}
|
||||
|
||||
for model in args.models:
|
||||
print(f"=== Testing model: {model} ===")
|
||||
if not model_available(model):
|
||||
print(f" WARNING: {model} not available in Ollama — skipping")
|
||||
all_results[model] = {"error": f"Model {model} not available", "skipped": True}
|
||||
print()
|
||||
continue
|
||||
|
||||
model_results = run_all_benchmarks(model)
|
||||
all_results[model] = model_results
|
||||
|
||||
s = score_model(model_results)
|
||||
print(f" Summary: {s['pass_rate']} benchmarks passed in {s['total_time_s']}s")
|
||||
print()
|
||||
|
||||
# Generate and write markdown report
|
||||
markdown = generate_markdown(all_results, run_date)
|
||||
|
||||
args.output.parent.mkdir(parents=True, exist_ok=True)
|
||||
args.output.write_text(markdown, encoding="utf-8")
|
||||
print(f"Report written to: {args.output}")
|
||||
|
||||
if args.json_output:
|
||||
args.json_output.write_text(json.dumps(all_results, indent=2), encoding="utf-8")
|
||||
print(f"JSON data written to: {args.json_output}")
|
||||
|
||||
# Overall pass/fail
|
||||
all_pass = all(
|
||||
not r.get("skipped", False)
|
||||
and all(b.get("passed", False) for b in r.values() if isinstance(b, dict))
|
||||
for r in all_results.values()
|
||||
)
|
||||
return 0 if all_pass else 1
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
1
src/brain/__init__.py
Normal file
1
src/brain/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
"""Brain — identity system and task coordination."""
|
||||
314
src/brain/worker.py
Normal file
314
src/brain/worker.py
Normal file
@@ -0,0 +1,314 @@
|
||||
"""DistributedWorker — task lifecycle management and backend routing.
|
||||
|
||||
Routes delegated tasks to appropriate execution backends:
|
||||
|
||||
- agentic_loop: local multi-step execution via Timmy's agentic loop
|
||||
- kimi: heavy research tasks dispatched via Gitea kimi-ready issues
|
||||
- paperclip: task submission to the Paperclip API
|
||||
|
||||
Task lifecycle: queued → running → completed | failed
|
||||
|
||||
Failure handling: auto-retry up to MAX_RETRIES, then mark failed.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
import threading
|
||||
import uuid
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import UTC, datetime
|
||||
from typing import Any, ClassVar
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
MAX_RETRIES = 2
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Task record
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@dataclass
|
||||
class DelegatedTask:
|
||||
"""Record of one delegated task and its execution state."""
|
||||
|
||||
task_id: str
|
||||
agent_name: str
|
||||
agent_role: str
|
||||
task_description: str
|
||||
priority: str
|
||||
backend: str # "agentic_loop" | "kimi" | "paperclip"
|
||||
status: str = "queued" # queued | running | completed | failed
|
||||
created_at: str = field(default_factory=lambda: datetime.now(UTC).isoformat())
|
||||
result: dict[str, Any] | None = None
|
||||
error: str | None = None
|
||||
retries: int = 0
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Worker
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class DistributedWorker:
|
||||
"""Routes and tracks delegated task execution across multiple backends.
|
||||
|
||||
All methods are class-methods; DistributedWorker is a singleton-style
|
||||
service — no instantiation needed.
|
||||
|
||||
Usage::
|
||||
|
||||
from brain.worker import DistributedWorker
|
||||
|
||||
task_id = DistributedWorker.submit("researcher", "research", "summarise X")
|
||||
status = DistributedWorker.get_status(task_id)
|
||||
"""
|
||||
|
||||
_tasks: ClassVar[dict[str, DelegatedTask]] = {}
|
||||
_lock: ClassVar[threading.Lock] = threading.Lock()
|
||||
|
||||
@classmethod
|
||||
def submit(
|
||||
cls,
|
||||
agent_name: str,
|
||||
agent_role: str,
|
||||
task_description: str,
|
||||
priority: str = "normal",
|
||||
) -> str:
|
||||
"""Submit a task for execution. Returns task_id immediately.
|
||||
|
||||
The task is registered as 'queued' and a daemon thread begins
|
||||
execution in the background. Use get_status(task_id) to poll.
|
||||
"""
|
||||
task_id = uuid.uuid4().hex[:8]
|
||||
backend = cls._select_backend(agent_role, task_description)
|
||||
|
||||
record = DelegatedTask(
|
||||
task_id=task_id,
|
||||
agent_name=agent_name,
|
||||
agent_role=agent_role,
|
||||
task_description=task_description,
|
||||
priority=priority,
|
||||
backend=backend,
|
||||
)
|
||||
|
||||
with cls._lock:
|
||||
cls._tasks[task_id] = record
|
||||
|
||||
thread = threading.Thread(
|
||||
target=cls._run_task,
|
||||
args=(record,),
|
||||
daemon=True,
|
||||
name=f"worker-{task_id}",
|
||||
)
|
||||
thread.start()
|
||||
|
||||
logger.info(
|
||||
"Task %s queued: %s → %.60s (backend=%s, priority=%s)",
|
||||
task_id,
|
||||
agent_name,
|
||||
task_description,
|
||||
backend,
|
||||
priority,
|
||||
)
|
||||
return task_id
|
||||
|
||||
@classmethod
|
||||
def get_status(cls, task_id: str) -> dict[str, Any]:
|
||||
"""Return current status of a task by ID."""
|
||||
record = cls._tasks.get(task_id)
|
||||
if record is None:
|
||||
return {"found": False, "task_id": task_id}
|
||||
return {
|
||||
"found": True,
|
||||
"task_id": record.task_id,
|
||||
"agent": record.agent_name,
|
||||
"role": record.agent_role,
|
||||
"status": record.status,
|
||||
"backend": record.backend,
|
||||
"priority": record.priority,
|
||||
"created_at": record.created_at,
|
||||
"retries": record.retries,
|
||||
"result": record.result,
|
||||
"error": record.error,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def list_tasks(cls) -> list[dict[str, Any]]:
|
||||
"""Return a summary list of all tracked tasks."""
|
||||
with cls._lock:
|
||||
return [
|
||||
{
|
||||
"task_id": t.task_id,
|
||||
"agent": t.agent_name,
|
||||
"status": t.status,
|
||||
"backend": t.backend,
|
||||
"created_at": t.created_at,
|
||||
}
|
||||
for t in cls._tasks.values()
|
||||
]
|
||||
|
||||
@classmethod
|
||||
def clear(cls) -> None:
|
||||
"""Clear the task registry (for tests)."""
|
||||
with cls._lock:
|
||||
cls._tasks.clear()
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Backend selection
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
@classmethod
|
||||
def _select_backend(cls, agent_role: str, task_description: str) -> str:
|
||||
"""Choose the execution backend for a given agent role and task.
|
||||
|
||||
Priority:
|
||||
1. kimi — research role + Gitea enabled + task exceeds local capacity
|
||||
2. paperclip — paperclip API key is configured
|
||||
3. agentic_loop — local fallback (always available)
|
||||
"""
|
||||
try:
|
||||
from config import settings
|
||||
from timmy.kimi_delegation import exceeds_local_capacity
|
||||
|
||||
if (
|
||||
agent_role == "research"
|
||||
and getattr(settings, "gitea_enabled", False)
|
||||
and getattr(settings, "gitea_token", "")
|
||||
and exceeds_local_capacity(task_description)
|
||||
):
|
||||
return "kimi"
|
||||
|
||||
if getattr(settings, "paperclip_api_key", ""):
|
||||
return "paperclip"
|
||||
|
||||
except Exception as exc:
|
||||
logger.debug("Backend selection error — defaulting to agentic_loop: %s", exc)
|
||||
|
||||
return "agentic_loop"
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Task execution
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
@classmethod
|
||||
def _run_task(cls, record: DelegatedTask) -> None:
|
||||
"""Execute a task with retry logic. Runs inside a daemon thread."""
|
||||
record.status = "running"
|
||||
|
||||
for attempt in range(MAX_RETRIES + 1):
|
||||
try:
|
||||
if attempt > 0:
|
||||
logger.info(
|
||||
"Retrying task %s (attempt %d/%d)",
|
||||
record.task_id,
|
||||
attempt + 1,
|
||||
MAX_RETRIES + 1,
|
||||
)
|
||||
record.retries = attempt
|
||||
|
||||
result = cls._dispatch(record)
|
||||
record.status = "completed"
|
||||
record.result = result
|
||||
logger.info(
|
||||
"Task %s completed via %s",
|
||||
record.task_id,
|
||||
record.backend,
|
||||
)
|
||||
return
|
||||
|
||||
except Exception as exc:
|
||||
logger.warning(
|
||||
"Task %s attempt %d failed: %s",
|
||||
record.task_id,
|
||||
attempt + 1,
|
||||
exc,
|
||||
)
|
||||
if attempt == MAX_RETRIES:
|
||||
record.status = "failed"
|
||||
record.error = str(exc)
|
||||
logger.error(
|
||||
"Task %s exhausted %d retries. Final error: %s",
|
||||
record.task_id,
|
||||
MAX_RETRIES,
|
||||
exc,
|
||||
)
|
||||
|
||||
@classmethod
|
||||
def _dispatch(cls, record: DelegatedTask) -> dict[str, Any]:
|
||||
"""Route to the selected backend. Raises on failure."""
|
||||
if record.backend == "kimi":
|
||||
return asyncio.run(cls._execute_kimi(record))
|
||||
if record.backend == "paperclip":
|
||||
return asyncio.run(cls._execute_paperclip(record))
|
||||
return asyncio.run(cls._execute_agentic_loop(record))
|
||||
|
||||
@classmethod
|
||||
async def _execute_kimi(cls, record: DelegatedTask) -> dict[str, Any]:
|
||||
"""Create a kimi-ready Gitea issue for the task.
|
||||
|
||||
Kimi picks up the issue via the kimi-ready label and executes it.
|
||||
"""
|
||||
from timmy.kimi_delegation import create_kimi_research_issue
|
||||
|
||||
result = await create_kimi_research_issue(
|
||||
task=record.task_description[:120],
|
||||
context=f"Delegated by agent '{record.agent_name}' via delegate_task.",
|
||||
question=record.task_description,
|
||||
priority=record.priority,
|
||||
)
|
||||
if not result.get("success"):
|
||||
raise RuntimeError(f"Kimi issue creation failed: {result.get('error')}")
|
||||
return result
|
||||
|
||||
@classmethod
|
||||
async def _execute_paperclip(cls, record: DelegatedTask) -> dict[str, Any]:
|
||||
"""Submit the task to the Paperclip API."""
|
||||
import httpx
|
||||
|
||||
from timmy.paperclip import PaperclipClient
|
||||
|
||||
client = PaperclipClient()
|
||||
async with httpx.AsyncClient(timeout=client.timeout) as http:
|
||||
resp = await http.post(
|
||||
f"{client.base_url}/api/tasks",
|
||||
headers={"Authorization": f"Bearer {client.api_key}"},
|
||||
json={
|
||||
"kind": record.agent_role,
|
||||
"agent_id": client.agent_id,
|
||||
"company_id": client.company_id,
|
||||
"priority": record.priority,
|
||||
"context": {"task": record.task_description},
|
||||
},
|
||||
)
|
||||
|
||||
if resp.status_code in (200, 201):
|
||||
data = resp.json()
|
||||
logger.info(
|
||||
"Task %s submitted to Paperclip (paperclip_id=%s)",
|
||||
record.task_id,
|
||||
data.get("id"),
|
||||
)
|
||||
return {
|
||||
"success": True,
|
||||
"paperclip_task_id": data.get("id"),
|
||||
"backend": "paperclip",
|
||||
}
|
||||
raise RuntimeError(f"Paperclip API error {resp.status_code}: {resp.text[:200]}")
|
||||
|
||||
@classmethod
|
||||
async def _execute_agentic_loop(cls, record: DelegatedTask) -> dict[str, Any]:
|
||||
"""Execute the task via Timmy's local agentic loop."""
|
||||
from timmy.agentic_loop import run_agentic_loop
|
||||
|
||||
result = await run_agentic_loop(record.task_description)
|
||||
return {
|
||||
"success": result.status != "failed",
|
||||
"agentic_task_id": result.task_id,
|
||||
"summary": result.summary,
|
||||
"status": result.status,
|
||||
"backend": "agentic_loop",
|
||||
}
|
||||
@@ -94,8 +94,9 @@ class Settings(BaseSettings):
|
||||
|
||||
# ── Backend selection ────────────────────────────────────────────────────
|
||||
# "ollama" — always use Ollama (default, safe everywhere)
|
||||
# "airllm" — AirLLM layer-by-layer loading (Apple Silicon only; degrades to Ollama)
|
||||
# "auto" — pick best available local backend, fall back to Ollama
|
||||
timmy_model_backend: Literal["ollama", "grok", "claude", "auto"] = "ollama"
|
||||
timmy_model_backend: Literal["ollama", "airllm", "grok", "claude", "auto"] = "ollama"
|
||||
|
||||
# ── Grok (xAI) — opt-in premium cloud backend ────────────────────────
|
||||
# Grok is a premium augmentation layer — local-first ethos preserved.
|
||||
@@ -108,6 +109,16 @@ class Settings(BaseSettings):
|
||||
grok_sats_hard_cap: int = 100 # Absolute ceiling on sats per Grok query
|
||||
grok_free: bool = False # Skip Lightning invoice when user has own API key
|
||||
|
||||
# ── Search Backend (SearXNG + Crawl4AI) ──────────────────────────────
|
||||
# "searxng" — self-hosted SearXNG meta-search engine (default, no API key)
|
||||
# "none" — disable web search (private/offline deployments)
|
||||
# Override with TIMMY_SEARCH_BACKEND env var.
|
||||
timmy_search_backend: Literal["searxng", "none"] = "searxng"
|
||||
# SearXNG base URL — override with TIMMY_SEARCH_URL env var
|
||||
search_url: str = "http://localhost:8888"
|
||||
# Crawl4AI base URL — override with TIMMY_CRAWL_URL env var
|
||||
crawl_url: str = "http://localhost:11235"
|
||||
|
||||
# ── Database ──────────────────────────────────────────────────────────
|
||||
db_busy_timeout_ms: int = 5000 # SQLite PRAGMA busy_timeout (ms)
|
||||
|
||||
|
||||
@@ -55,6 +55,7 @@ from dashboard.routes.system import router as system_router
|
||||
from dashboard.routes.tasks import router as tasks_router
|
||||
from dashboard.routes.telegram import router as telegram_router
|
||||
from dashboard.routes.thinking import router as thinking_router
|
||||
from dashboard.routes.self_correction import router as self_correction_router
|
||||
from dashboard.routes.three_strike import router as three_strike_router
|
||||
from dashboard.routes.tools import router as tools_router
|
||||
from dashboard.routes.tower import router as tower_router
|
||||
@@ -551,12 +552,28 @@ async def lifespan(app: FastAPI):
|
||||
except Exception:
|
||||
logger.debug("Failed to register error recorder")
|
||||
|
||||
# Mark session start for sovereignty duration tracking
|
||||
try:
|
||||
from timmy.sovereignty import mark_session_start
|
||||
|
||||
mark_session_start()
|
||||
except Exception:
|
||||
logger.debug("Failed to mark sovereignty session start")
|
||||
|
||||
logger.info("✓ Dashboard ready for requests")
|
||||
|
||||
yield
|
||||
|
||||
await _shutdown_cleanup(bg_tasks, workshop_heartbeat)
|
||||
|
||||
# Generate and commit sovereignty session report
|
||||
try:
|
||||
from timmy.sovereignty import generate_and_commit_report
|
||||
|
||||
await generate_and_commit_report()
|
||||
except Exception as exc:
|
||||
logger.warning("Sovereignty report generation failed at shutdown: %s", exc)
|
||||
|
||||
|
||||
app = FastAPI(
|
||||
title="Mission Control",
|
||||
@@ -680,6 +697,7 @@ app.include_router(scorecards_router)
|
||||
app.include_router(sovereignty_metrics_router)
|
||||
app.include_router(sovereignty_ws_router)
|
||||
app.include_router(three_strike_router)
|
||||
app.include_router(self_correction_router)
|
||||
|
||||
|
||||
@app.websocket("/ws")
|
||||
|
||||
58
src/dashboard/routes/self_correction.py
Normal file
58
src/dashboard/routes/self_correction.py
Normal file
@@ -0,0 +1,58 @@
|
||||
"""Self-Correction Dashboard routes.
|
||||
|
||||
GET /self-correction/ui — HTML dashboard
|
||||
GET /self-correction/timeline — HTMX partial: recent event timeline
|
||||
GET /self-correction/patterns — HTMX partial: recurring failure patterns
|
||||
"""
|
||||
|
||||
import logging
|
||||
|
||||
from fastapi import APIRouter, Request
|
||||
from fastapi.responses import HTMLResponse
|
||||
|
||||
from dashboard.templating import templates
|
||||
from infrastructure.self_correction import get_corrections, get_patterns, get_stats
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
router = APIRouter(prefix="/self-correction", tags=["self-correction"])
|
||||
|
||||
|
||||
@router.get("/ui", response_class=HTMLResponse)
|
||||
async def self_correction_ui(request: Request):
|
||||
"""Render the Self-Correction Dashboard."""
|
||||
stats = get_stats()
|
||||
corrections = get_corrections(limit=20)
|
||||
patterns = get_patterns(top_n=10)
|
||||
return templates.TemplateResponse(
|
||||
request,
|
||||
"self_correction.html",
|
||||
{
|
||||
"stats": stats,
|
||||
"corrections": corrections,
|
||||
"patterns": patterns,
|
||||
},
|
||||
)
|
||||
|
||||
|
||||
@router.get("/timeline", response_class=HTMLResponse)
|
||||
async def self_correction_timeline(request: Request):
|
||||
"""HTMX partial: recent self-correction event timeline."""
|
||||
corrections = get_corrections(limit=30)
|
||||
return templates.TemplateResponse(
|
||||
request,
|
||||
"partials/self_correction_timeline.html",
|
||||
{"corrections": corrections},
|
||||
)
|
||||
|
||||
|
||||
@router.get("/patterns", response_class=HTMLResponse)
|
||||
async def self_correction_patterns(request: Request):
|
||||
"""HTMX partial: recurring failure patterns."""
|
||||
patterns = get_patterns(top_n=10)
|
||||
stats = get_stats()
|
||||
return templates.TemplateResponse(
|
||||
request,
|
||||
"partials/self_correction_patterns.html",
|
||||
{"patterns": patterns, "stats": stats},
|
||||
)
|
||||
@@ -71,6 +71,7 @@
|
||||
<a href="/spark/ui" class="mc-test-link">SPARK</a>
|
||||
<a href="/memory" class="mc-test-link">MEMORY</a>
|
||||
<a href="/marketplace/ui" class="mc-test-link">MARKET</a>
|
||||
<a href="/self-correction/ui" class="mc-test-link">SELF-CORRECT</a>
|
||||
</div>
|
||||
</div>
|
||||
<div class="mc-nav-dropdown">
|
||||
@@ -132,6 +133,7 @@
|
||||
<a href="/spark/ui" class="mc-mobile-link">SPARK</a>
|
||||
<a href="/memory" class="mc-mobile-link">MEMORY</a>
|
||||
<a href="/marketplace/ui" class="mc-mobile-link">MARKET</a>
|
||||
<a href="/self-correction/ui" class="mc-mobile-link">SELF-CORRECT</a>
|
||||
<div class="mc-mobile-section-label">AGENTS</div>
|
||||
<a href="/hands" class="mc-mobile-link">HANDS</a>
|
||||
<a href="/work-orders/queue" class="mc-mobile-link">WORK ORDERS</a>
|
||||
|
||||
@@ -186,6 +186,24 @@
|
||||
<p class="chat-history-placeholder">Loading sovereignty metrics...</p>
|
||||
{% endcall %}
|
||||
|
||||
<!-- Agent Scorecards -->
|
||||
<div class="card mc-card-spaced" id="mc-scorecards-card">
|
||||
<div class="card-header">
|
||||
<h2 class="card-title">Agent Scorecards</h2>
|
||||
<div class="d-flex align-items-center gap-2">
|
||||
<select id="mc-scorecard-period" class="form-select form-select-sm" style="width: auto;"
|
||||
onchange="loadMcScorecards()">
|
||||
<option value="daily" selected>Daily</option>
|
||||
<option value="weekly">Weekly</option>
|
||||
</select>
|
||||
<a href="/scorecards" class="btn btn-sm btn-outline-secondary">Full View</a>
|
||||
</div>
|
||||
</div>
|
||||
<div id="mc-scorecards-content" class="p-2">
|
||||
<p class="chat-history-placeholder">Loading scorecards...</p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Chat History -->
|
||||
<div class="card mc-card-spaced">
|
||||
<div class="card-header">
|
||||
@@ -502,6 +520,20 @@ async function loadSparkStatus() {
|
||||
}
|
||||
}
|
||||
|
||||
// Load agent scorecards
|
||||
async function loadMcScorecards() {
|
||||
var period = document.getElementById('mc-scorecard-period').value;
|
||||
var container = document.getElementById('mc-scorecards-content');
|
||||
container.innerHTML = '<p class="chat-history-placeholder">Loading scorecards...</p>';
|
||||
try {
|
||||
var response = await fetch('/scorecards/all/panels?period=' + period);
|
||||
var html = await response.text();
|
||||
container.innerHTML = html;
|
||||
} catch (error) {
|
||||
container.innerHTML = '<p class="chat-history-placeholder">Scorecards unavailable</p>';
|
||||
}
|
||||
}
|
||||
|
||||
// Initial load
|
||||
loadSparkStatus();
|
||||
loadSovereignty();
|
||||
@@ -510,6 +542,7 @@ loadSwarmStats();
|
||||
loadLightningStats();
|
||||
loadGrokStats();
|
||||
loadChatHistory();
|
||||
loadMcScorecards();
|
||||
|
||||
// Periodic updates
|
||||
setInterval(loadSovereignty, 30000);
|
||||
@@ -518,5 +551,6 @@ setInterval(loadSwarmStats, 5000);
|
||||
setInterval(updateHeartbeat, 5000);
|
||||
setInterval(loadGrokStats, 10000);
|
||||
setInterval(loadSparkStatus, 15000);
|
||||
setInterval(loadMcScorecards, 300000);
|
||||
</script>
|
||||
{% endblock %}
|
||||
|
||||
@@ -0,0 +1,28 @@
|
||||
{% if patterns %}
|
||||
<table class="mc-table w-100">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>ERROR TYPE</th>
|
||||
<th class="text-center">COUNT</th>
|
||||
<th class="text-center">CORRECTED</th>
|
||||
<th class="text-center">FAILED</th>
|
||||
<th>LAST SEEN</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
{% for p in patterns %}
|
||||
<tr>
|
||||
<td class="sc-pattern-type">{{ p.error_type }}</td>
|
||||
<td class="text-center">
|
||||
<span class="badge {% if p.count >= 5 %}badge-error{% elif p.count >= 3 %}badge-warning{% else %}badge-info{% endif %}">{{ p.count }}</span>
|
||||
</td>
|
||||
<td class="text-center text-success">{{ p.success_count }}</td>
|
||||
<td class="text-center {% if p.failed_count > 0 %}text-danger{% else %}text-muted{% endif %}">{{ p.failed_count }}</td>
|
||||
<td class="sc-event-time">{{ p.last_seen[:16] if p.last_seen else '—' }}</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</tbody>
|
||||
</table>
|
||||
{% else %}
|
||||
<div class="text-center text-muted py-3">No patterns detected yet.</div>
|
||||
{% endif %}
|
||||
@@ -0,0 +1,26 @@
|
||||
{% if corrections %}
|
||||
{% for ev in corrections %}
|
||||
<div class="sc-event sc-status-{{ ev.outcome_status }}">
|
||||
<div class="sc-event-header">
|
||||
<span class="sc-status-badge sc-status-{{ ev.outcome_status }}">
|
||||
{% if ev.outcome_status == 'success' %}✓ CORRECTED
|
||||
{% elif ev.outcome_status == 'partial' %}● PARTIAL
|
||||
{% else %}✗ FAILED
|
||||
{% endif %}
|
||||
</span>
|
||||
<span class="sc-source-badge">{{ ev.source }}</span>
|
||||
<span class="sc-event-time">{{ ev.created_at[:19] }}</span>
|
||||
</div>
|
||||
<div class="sc-event-error-type">{{ ev.error_type }}</div>
|
||||
<div class="sc-event-intent"><span class="sc-label">INTENT:</span> {{ ev.original_intent[:120] }}{% if ev.original_intent | length > 120 %}…{% endif %}</div>
|
||||
<div class="sc-event-error"><span class="sc-label">ERROR:</span> {{ ev.detected_error[:120] }}{% if ev.detected_error | length > 120 %}…{% endif %}</div>
|
||||
<div class="sc-event-strategy"><span class="sc-label">STRATEGY:</span> {{ ev.correction_strategy[:120] }}{% if ev.correction_strategy | length > 120 %}…{% endif %}</div>
|
||||
<div class="sc-event-outcome"><span class="sc-label">OUTCOME:</span> {{ ev.final_outcome[:120] }}{% if ev.final_outcome | length > 120 %}…{% endif %}</div>
|
||||
{% if ev.task_id %}
|
||||
<div class="sc-event-meta">task: {{ ev.task_id[:8] }}</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% endfor %}
|
||||
{% else %}
|
||||
<div class="text-center text-muted py-3">No self-correction events recorded yet.</div>
|
||||
{% endif %}
|
||||
102
src/dashboard/templates/self_correction.html
Normal file
102
src/dashboard/templates/self_correction.html
Normal file
@@ -0,0 +1,102 @@
|
||||
{% extends "base.html" %}
|
||||
{% from "macros.html" import panel %}
|
||||
|
||||
{% block title %}Timmy Time — Self-Correction Dashboard{% endblock %}
|
||||
|
||||
{% block extra_styles %}{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
<div class="container-fluid py-3">
|
||||
|
||||
<!-- Header -->
|
||||
<div class="spark-header mb-3">
|
||||
<div class="spark-title">SELF-CORRECTION</div>
|
||||
<div class="spark-subtitle">
|
||||
Agent error detection & recovery —
|
||||
<span class="spark-status-val">{{ stats.total }}</span> events,
|
||||
<span class="spark-status-val">{{ stats.success_rate }}%</span> correction rate,
|
||||
<span class="spark-status-val">{{ stats.unique_error_types }}</span> distinct error types
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="row g-3">
|
||||
|
||||
<!-- Left column: stats + patterns -->
|
||||
<div class="col-12 col-lg-4 d-flex flex-column gap-3">
|
||||
|
||||
<!-- Stats panel -->
|
||||
<div class="card mc-panel">
|
||||
<div class="card-header mc-panel-header">// CORRECTION STATS</div>
|
||||
<div class="card-body p-3">
|
||||
<div class="spark-stat-grid">
|
||||
<div class="spark-stat">
|
||||
<span class="spark-stat-label">TOTAL</span>
|
||||
<span class="spark-stat-value">{{ stats.total }}</span>
|
||||
</div>
|
||||
<div class="spark-stat">
|
||||
<span class="spark-stat-label">CORRECTED</span>
|
||||
<span class="spark-stat-value text-success">{{ stats.success_count }}</span>
|
||||
</div>
|
||||
<div class="spark-stat">
|
||||
<span class="spark-stat-label">PARTIAL</span>
|
||||
<span class="spark-stat-value text-warning">{{ stats.partial_count }}</span>
|
||||
</div>
|
||||
<div class="spark-stat">
|
||||
<span class="spark-stat-label">FAILED</span>
|
||||
<span class="spark-stat-value {% if stats.failed_count > 0 %}text-danger{% else %}text-muted{% endif %}">{{ stats.failed_count }}</span>
|
||||
</div>
|
||||
</div>
|
||||
<div class="mt-3">
|
||||
<div class="d-flex justify-content-between mb-1">
|
||||
<small class="text-muted">Correction Rate</small>
|
||||
<small class="{% if stats.success_rate >= 70 %}text-success{% elif stats.success_rate >= 40 %}text-warning{% else %}text-danger{% endif %}">{{ stats.success_rate }}%</small>
|
||||
</div>
|
||||
<div class="progress" style="height:6px;">
|
||||
<div class="progress-bar {% if stats.success_rate >= 70 %}bg-success{% elif stats.success_rate >= 40 %}bg-warning{% else %}bg-danger{% endif %}"
|
||||
role="progressbar"
|
||||
style="width:{{ stats.success_rate }}%"
|
||||
aria-valuenow="{{ stats.success_rate }}"
|
||||
aria-valuemin="0"
|
||||
aria-valuemax="100"></div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Patterns panel -->
|
||||
<div class="card mc-panel"
|
||||
hx-get="/self-correction/patterns"
|
||||
hx-trigger="load, every 60s"
|
||||
hx-target="#sc-patterns-body"
|
||||
hx-swap="innerHTML">
|
||||
<div class="card-header mc-panel-header d-flex justify-content-between align-items-center">
|
||||
<span>// RECURRING PATTERNS</span>
|
||||
<span class="badge badge-info">{{ patterns | length }}</span>
|
||||
</div>
|
||||
<div class="card-body p-0" id="sc-patterns-body">
|
||||
{% include "partials/self_correction_patterns.html" %}
|
||||
</div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
<!-- Right column: timeline -->
|
||||
<div class="col-12 col-lg-8">
|
||||
<div class="card mc-panel"
|
||||
hx-get="/self-correction/timeline"
|
||||
hx-trigger="load, every 30s"
|
||||
hx-target="#sc-timeline-body"
|
||||
hx-swap="innerHTML">
|
||||
<div class="card-header mc-panel-header d-flex justify-content-between align-items-center">
|
||||
<span>// CORRECTION TIMELINE</span>
|
||||
<span class="badge badge-info">{{ corrections | length }}</span>
|
||||
</div>
|
||||
<div class="card-body p-3" id="sc-timeline-body">
|
||||
{% include "partials/self_correction_timeline.html" %}
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
{% endblock %}
|
||||
247
src/infrastructure/self_correction.py
Normal file
247
src/infrastructure/self_correction.py
Normal file
@@ -0,0 +1,247 @@
|
||||
"""Self-correction event logger.
|
||||
|
||||
Records instances where the agent detected its own errors and the steps
|
||||
it took to correct them. Used by the Self-Correction Dashboard to visualise
|
||||
these events and surface recurring failure patterns.
|
||||
|
||||
Usage::
|
||||
|
||||
from infrastructure.self_correction import log_self_correction, get_corrections, get_patterns
|
||||
|
||||
log_self_correction(
|
||||
source="agentic_loop",
|
||||
original_intent="Execute step 3: deploy service",
|
||||
detected_error="ConnectionRefusedError: port 8080 unavailable",
|
||||
correction_strategy="Retry on alternate port 8081",
|
||||
final_outcome="Success on retry",
|
||||
task_id="abc123",
|
||||
)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import sqlite3
|
||||
import uuid
|
||||
from collections.abc import Generator
|
||||
from contextlib import closing, contextmanager
|
||||
from datetime import UTC, datetime
|
||||
from pathlib import Path
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Database
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
_DB_PATH: Path | None = None
|
||||
|
||||
|
||||
def _get_db_path() -> Path:
|
||||
global _DB_PATH
|
||||
if _DB_PATH is None:
|
||||
from config import settings
|
||||
|
||||
_DB_PATH = Path(settings.repo_root) / "data" / "self_correction.db"
|
||||
return _DB_PATH
|
||||
|
||||
|
||||
@contextmanager
|
||||
def _get_db() -> Generator[sqlite3.Connection, None, None]:
|
||||
db_path = _get_db_path()
|
||||
db_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
with closing(sqlite3.connect(str(db_path))) as conn:
|
||||
conn.row_factory = sqlite3.Row
|
||||
conn.execute("""
|
||||
CREATE TABLE IF NOT EXISTS self_correction_events (
|
||||
id TEXT PRIMARY KEY,
|
||||
source TEXT NOT NULL,
|
||||
task_id TEXT DEFAULT '',
|
||||
original_intent TEXT NOT NULL,
|
||||
detected_error TEXT NOT NULL,
|
||||
correction_strategy TEXT NOT NULL,
|
||||
final_outcome TEXT NOT NULL,
|
||||
outcome_status TEXT DEFAULT 'success',
|
||||
error_type TEXT DEFAULT '',
|
||||
created_at TEXT DEFAULT (datetime('now'))
|
||||
)
|
||||
""")
|
||||
conn.execute(
|
||||
"CREATE INDEX IF NOT EXISTS idx_sc_created ON self_correction_events(created_at)"
|
||||
)
|
||||
conn.execute(
|
||||
"CREATE INDEX IF NOT EXISTS idx_sc_error_type ON self_correction_events(error_type)"
|
||||
)
|
||||
conn.commit()
|
||||
yield conn
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Write
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def log_self_correction(
|
||||
*,
|
||||
source: str,
|
||||
original_intent: str,
|
||||
detected_error: str,
|
||||
correction_strategy: str,
|
||||
final_outcome: str,
|
||||
task_id: str = "",
|
||||
outcome_status: str = "success",
|
||||
error_type: str = "",
|
||||
) -> str:
|
||||
"""Record a self-correction event and return its ID.
|
||||
|
||||
Args:
|
||||
source: Module or component that triggered the correction.
|
||||
original_intent: What the agent was trying to do.
|
||||
detected_error: The error or problem that was detected.
|
||||
correction_strategy: How the agent attempted to correct the error.
|
||||
final_outcome: What the result of the correction attempt was.
|
||||
task_id: Optional task/session ID for correlation.
|
||||
outcome_status: 'success', 'partial', or 'failed'.
|
||||
error_type: Short category label for pattern analysis (e.g.
|
||||
'ConnectionError', 'TimeoutError').
|
||||
|
||||
Returns:
|
||||
The ID of the newly created record.
|
||||
"""
|
||||
event_id = str(uuid.uuid4())
|
||||
if not error_type:
|
||||
# Derive a simple type from the first word of the detected error
|
||||
error_type = detected_error.split(":")[0].strip()[:64]
|
||||
|
||||
try:
|
||||
with _get_db() as conn:
|
||||
conn.execute(
|
||||
"""
|
||||
INSERT INTO self_correction_events
|
||||
(id, source, task_id, original_intent, detected_error,
|
||||
correction_strategy, final_outcome, outcome_status, error_type)
|
||||
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
|
||||
""",
|
||||
(
|
||||
event_id,
|
||||
source,
|
||||
task_id,
|
||||
original_intent[:2000],
|
||||
detected_error[:2000],
|
||||
correction_strategy[:2000],
|
||||
final_outcome[:2000],
|
||||
outcome_status,
|
||||
error_type,
|
||||
),
|
||||
)
|
||||
conn.commit()
|
||||
logger.info(
|
||||
"Self-correction logged [%s] source=%s error_type=%s status=%s",
|
||||
event_id[:8],
|
||||
source,
|
||||
error_type,
|
||||
outcome_status,
|
||||
)
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to log self-correction event: %s", exc)
|
||||
|
||||
return event_id
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Read
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def get_corrections(limit: int = 50) -> list[dict]:
|
||||
"""Return the most recent self-correction events, newest first."""
|
||||
try:
|
||||
with _get_db() as conn:
|
||||
rows = conn.execute(
|
||||
"""
|
||||
SELECT * FROM self_correction_events
|
||||
ORDER BY created_at DESC
|
||||
LIMIT ?
|
||||
""",
|
||||
(limit,),
|
||||
).fetchall()
|
||||
return [dict(r) for r in rows]
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to fetch self-correction events: %s", exc)
|
||||
return []
|
||||
|
||||
|
||||
def get_patterns(top_n: int = 10) -> list[dict]:
|
||||
"""Return the most common recurring error types with counts.
|
||||
|
||||
Each entry has:
|
||||
- error_type: category label
|
||||
- count: total occurrences
|
||||
- success_count: corrected successfully
|
||||
- failed_count: correction also failed
|
||||
- last_seen: ISO timestamp of most recent occurrence
|
||||
"""
|
||||
try:
|
||||
with _get_db() as conn:
|
||||
rows = conn.execute(
|
||||
"""
|
||||
SELECT
|
||||
error_type,
|
||||
COUNT(*) AS count,
|
||||
SUM(CASE WHEN outcome_status = 'success' THEN 1 ELSE 0 END) AS success_count,
|
||||
SUM(CASE WHEN outcome_status = 'failed' THEN 1 ELSE 0 END) AS failed_count,
|
||||
MAX(created_at) AS last_seen
|
||||
FROM self_correction_events
|
||||
GROUP BY error_type
|
||||
ORDER BY count DESC
|
||||
LIMIT ?
|
||||
""",
|
||||
(top_n,),
|
||||
).fetchall()
|
||||
return [dict(r) for r in rows]
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to fetch self-correction patterns: %s", exc)
|
||||
return []
|
||||
|
||||
|
||||
def get_stats() -> dict:
|
||||
"""Return aggregate statistics for the summary panel."""
|
||||
try:
|
||||
with _get_db() as conn:
|
||||
row = conn.execute(
|
||||
"""
|
||||
SELECT
|
||||
COUNT(*) AS total,
|
||||
SUM(CASE WHEN outcome_status = 'success' THEN 1 ELSE 0 END) AS success_count,
|
||||
SUM(CASE WHEN outcome_status = 'partial' THEN 1 ELSE 0 END) AS partial_count,
|
||||
SUM(CASE WHEN outcome_status = 'failed' THEN 1 ELSE 0 END) AS failed_count,
|
||||
COUNT(DISTINCT error_type) AS unique_error_types,
|
||||
COUNT(DISTINCT source) AS sources
|
||||
FROM self_correction_events
|
||||
"""
|
||||
).fetchone()
|
||||
if row is None:
|
||||
return _empty_stats()
|
||||
d = dict(row)
|
||||
total = d.get("total") or 0
|
||||
if total:
|
||||
d["success_rate"] = round((d.get("success_count") or 0) / total * 100)
|
||||
else:
|
||||
d["success_rate"] = 0
|
||||
return d
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to fetch self-correction stats: %s", exc)
|
||||
return _empty_stats()
|
||||
|
||||
|
||||
def _empty_stats() -> dict:
|
||||
return {
|
||||
"total": 0,
|
||||
"success_count": 0,
|
||||
"partial_count": 0,
|
||||
"failed_count": 0,
|
||||
"unique_error_types": 0,
|
||||
"sources": 0,
|
||||
"success_rate": 0,
|
||||
}
|
||||
7
src/self_coding/__init__.py
Normal file
7
src/self_coding/__init__.py
Normal file
@@ -0,0 +1,7 @@
|
||||
"""Self-coding package — Timmy's self-modification capability.
|
||||
|
||||
Provides the branch→edit→test→commit/revert loop that allows Timmy
|
||||
to propose and apply code changes autonomously, gated by the test suite.
|
||||
|
||||
Main entry point: ``self_coding.self_modify.loop``
|
||||
"""
|
||||
129
src/self_coding/gitea_client.py
Normal file
129
src/self_coding/gitea_client.py
Normal file
@@ -0,0 +1,129 @@
|
||||
"""Gitea REST client — thin wrapper for PR creation and issue commenting.
|
||||
|
||||
Uses ``settings.gitea_url``, ``settings.gitea_token``, and
|
||||
``settings.gitea_repo`` (owner/repo) from config. Degrades gracefully
|
||||
when the token is absent or the server is unreachable.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
from dataclasses import dataclass
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@dataclass
|
||||
class PullRequest:
|
||||
"""Minimal representation of a created pull request."""
|
||||
|
||||
number: int
|
||||
title: str
|
||||
html_url: str
|
||||
|
||||
|
||||
class GiteaClient:
|
||||
"""HTTP client for Gitea's REST API v1.
|
||||
|
||||
All methods return structured results and never raise — errors are
|
||||
logged at WARNING level and indicated via return value.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
base_url: str | None = None,
|
||||
token: str | None = None,
|
||||
repo: str | None = None,
|
||||
) -> None:
|
||||
from config import settings
|
||||
|
||||
self._base_url = (base_url or settings.gitea_url).rstrip("/")
|
||||
self._token = token or settings.gitea_token
|
||||
self._repo = repo or settings.gitea_repo
|
||||
|
||||
# ── internal ────────────────────────────────────────────────────────────
|
||||
|
||||
def _headers(self) -> dict[str, str]:
|
||||
return {
|
||||
"Authorization": f"token {self._token}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
|
||||
def _api(self, path: str) -> str:
|
||||
return f"{self._base_url}/api/v1/{path.lstrip('/')}"
|
||||
|
||||
# ── public API ───────────────────────────────────────────────────────────
|
||||
|
||||
def create_pull_request(
|
||||
self,
|
||||
title: str,
|
||||
body: str,
|
||||
head: str,
|
||||
base: str = "main",
|
||||
) -> PullRequest | None:
|
||||
"""Open a pull request.
|
||||
|
||||
Args:
|
||||
title: PR title (keep under 70 chars).
|
||||
body: PR body in markdown.
|
||||
head: Source branch (e.g. ``self-modify/issue-983``).
|
||||
base: Target branch (default ``main``).
|
||||
|
||||
Returns:
|
||||
A ``PullRequest`` dataclass on success, ``None`` on failure.
|
||||
"""
|
||||
if not self._token:
|
||||
logger.warning("Gitea token not configured — skipping PR creation")
|
||||
return None
|
||||
|
||||
try:
|
||||
import requests as _requests
|
||||
|
||||
resp = _requests.post(
|
||||
self._api(f"repos/{self._repo}/pulls"),
|
||||
headers=self._headers(),
|
||||
json={"title": title, "body": body, "head": head, "base": base},
|
||||
timeout=15,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
pr = PullRequest(
|
||||
number=data["number"],
|
||||
title=data["title"],
|
||||
html_url=data["html_url"],
|
||||
)
|
||||
logger.info("PR #%d created: %s", pr.number, pr.html_url)
|
||||
return pr
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to create PR: %s", exc)
|
||||
return None
|
||||
|
||||
def add_issue_comment(self, issue_number: int, body: str) -> bool:
|
||||
"""Post a comment on an issue or PR.
|
||||
|
||||
Returns:
|
||||
True on success, False on failure.
|
||||
"""
|
||||
if not self._token:
|
||||
logger.warning("Gitea token not configured — skipping issue comment")
|
||||
return False
|
||||
|
||||
try:
|
||||
import requests as _requests
|
||||
|
||||
resp = _requests.post(
|
||||
self._api(f"repos/{self._repo}/issues/{issue_number}/comments"),
|
||||
headers=self._headers(),
|
||||
json={"body": body},
|
||||
timeout=15,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
logger.info("Comment posted on issue #%d", issue_number)
|
||||
return True
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to post comment on issue #%d: %s", issue_number, exc)
|
||||
return False
|
||||
|
||||
|
||||
# Module-level singleton
|
||||
gitea_client = GiteaClient()
|
||||
1
src/self_coding/self_modify/__init__.py
Normal file
1
src/self_coding/self_modify/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
"""Self-modification loop sub-package."""
|
||||
301
src/self_coding/self_modify/loop.py
Normal file
301
src/self_coding/self_modify/loop.py
Normal file
@@ -0,0 +1,301 @@
|
||||
"""Self-modification loop — branch → edit → test → commit/revert.
|
||||
|
||||
Timmy's self-coding capability, restored after deletion in
|
||||
Operation Darling Purge (commit 584eeb679e88).
|
||||
|
||||
## Cycle
|
||||
1. **Branch** — create ``self-modify/<slug>`` from ``main``
|
||||
2. **Edit** — apply the proposed change (patch string or callable)
|
||||
3. **Test** — run ``pytest tests/ -x -q``; never commit on failure
|
||||
4. **Commit** — stage and commit on green; revert branch on red
|
||||
5. **PR** — open a Gitea pull request (requires no direct push to main)
|
||||
|
||||
## Guards
|
||||
- Never push directly to ``main`` or ``master``
|
||||
- All changes land via PR (enforced by ``_guard_branch``)
|
||||
- Test gate is mandatory; ``skip_tests=True`` is for unit-test use only
|
||||
- Commits only happen when ``pytest tests/ -x -q`` exits 0
|
||||
|
||||
## Usage::
|
||||
|
||||
from self_coding.self_modify.loop import SelfModifyLoop
|
||||
|
||||
loop = SelfModifyLoop()
|
||||
result = await loop.run(
|
||||
slug="add-hello-tool",
|
||||
description="Add hello() convenience tool",
|
||||
edit_fn=my_edit_function, # callable(repo_root: str) -> None
|
||||
)
|
||||
if result.success:
|
||||
print(f"PR: {result.pr_url}")
|
||||
else:
|
||||
print(f"Failed: {result.error}")
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import subprocess
|
||||
import time
|
||||
from collections.abc import Callable
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
|
||||
from config import settings
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Branches that must never receive direct commits
|
||||
_PROTECTED_BRANCHES = frozenset({"main", "master", "develop"})
|
||||
|
||||
# Test command used as the commit gate
|
||||
_TEST_COMMAND = ["pytest", "tests/", "-x", "-q", "--tb=short"]
|
||||
|
||||
# Max time (seconds) to wait for the test suite
|
||||
_TEST_TIMEOUT = 300
|
||||
|
||||
|
||||
@dataclass
|
||||
class LoopResult:
|
||||
"""Result from one self-modification cycle."""
|
||||
|
||||
success: bool
|
||||
branch: str = ""
|
||||
commit_sha: str = ""
|
||||
pr_url: str = ""
|
||||
pr_number: int = 0
|
||||
test_output: str = ""
|
||||
error: str = ""
|
||||
elapsed_ms: float = 0.0
|
||||
metadata: dict = field(default_factory=dict)
|
||||
|
||||
|
||||
class SelfModifyLoop:
|
||||
"""Orchestrate branch → edit → test → commit/revert → PR.
|
||||
|
||||
Args:
|
||||
repo_root: Absolute path to the git repository (defaults to
|
||||
``settings.repo_root``).
|
||||
remote: Git remote name (default ``origin``).
|
||||
base_branch: Branch to fork from and target for the PR
|
||||
(default ``main``).
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
repo_root: str | None = None,
|
||||
remote: str = "origin",
|
||||
base_branch: str = "main",
|
||||
) -> None:
|
||||
self._repo_root = Path(repo_root or settings.repo_root)
|
||||
self._remote = remote
|
||||
self._base_branch = base_branch
|
||||
|
||||
# ── public ──────────────────────────────────────────────────────────────
|
||||
|
||||
async def run(
|
||||
self,
|
||||
slug: str,
|
||||
description: str,
|
||||
edit_fn: Callable[[str], None],
|
||||
issue_number: int | None = None,
|
||||
skip_tests: bool = False,
|
||||
) -> LoopResult:
|
||||
"""Execute one full self-modification cycle.
|
||||
|
||||
Args:
|
||||
slug: Short identifier used for the branch name
|
||||
(e.g. ``"add-hello-tool"``).
|
||||
description: Human-readable description for commit message
|
||||
and PR body.
|
||||
edit_fn: Callable that receives the repo root path (str)
|
||||
and applies the desired code changes in-place.
|
||||
issue_number: Optional Gitea issue number to reference in PR.
|
||||
skip_tests: If ``True``, skip the test gate (unit-test use
|
||||
only — never use in production).
|
||||
|
||||
Returns:
|
||||
:class:`LoopResult` describing the outcome.
|
||||
"""
|
||||
start = time.time()
|
||||
branch = f"self-modify/{slug}"
|
||||
|
||||
try:
|
||||
self._guard_branch(branch)
|
||||
self._checkout_base()
|
||||
self._create_branch(branch)
|
||||
|
||||
try:
|
||||
edit_fn(str(self._repo_root))
|
||||
except Exception as exc:
|
||||
self._revert_branch(branch)
|
||||
return LoopResult(
|
||||
success=False,
|
||||
branch=branch,
|
||||
error=f"edit_fn raised: {exc}",
|
||||
elapsed_ms=self._elapsed(start),
|
||||
)
|
||||
|
||||
if not skip_tests:
|
||||
test_output, passed = self._run_tests()
|
||||
if not passed:
|
||||
self._revert_branch(branch)
|
||||
return LoopResult(
|
||||
success=False,
|
||||
branch=branch,
|
||||
test_output=test_output,
|
||||
error="Tests failed — branch reverted",
|
||||
elapsed_ms=self._elapsed(start),
|
||||
)
|
||||
else:
|
||||
test_output = "(tests skipped)"
|
||||
|
||||
sha = self._commit_all(description)
|
||||
self._push_branch(branch)
|
||||
|
||||
pr = self._create_pr(
|
||||
branch=branch,
|
||||
description=description,
|
||||
test_output=test_output,
|
||||
issue_number=issue_number,
|
||||
)
|
||||
|
||||
return LoopResult(
|
||||
success=True,
|
||||
branch=branch,
|
||||
commit_sha=sha,
|
||||
pr_url=pr.html_url if pr else "",
|
||||
pr_number=pr.number if pr else 0,
|
||||
test_output=test_output,
|
||||
elapsed_ms=self._elapsed(start),
|
||||
)
|
||||
|
||||
except Exception as exc:
|
||||
logger.warning("Self-modify loop failed: %s", exc)
|
||||
return LoopResult(
|
||||
success=False,
|
||||
branch=branch,
|
||||
error=str(exc),
|
||||
elapsed_ms=self._elapsed(start),
|
||||
)
|
||||
|
||||
# ── private helpers ──────────────────────────────────────────────────────
|
||||
|
||||
@staticmethod
|
||||
def _elapsed(start: float) -> float:
|
||||
return (time.time() - start) * 1000
|
||||
|
||||
def _git(self, *args: str, check: bool = True) -> subprocess.CompletedProcess:
|
||||
"""Run a git command in the repo root."""
|
||||
cmd = ["git", *args]
|
||||
logger.debug("git %s", " ".join(args))
|
||||
return subprocess.run(
|
||||
cmd,
|
||||
cwd=str(self._repo_root),
|
||||
capture_output=True,
|
||||
text=True,
|
||||
check=check,
|
||||
)
|
||||
|
||||
def _guard_branch(self, branch: str) -> None:
|
||||
"""Raise if the target branch is a protected branch name."""
|
||||
if branch in _PROTECTED_BRANCHES:
|
||||
raise ValueError(
|
||||
f"Refusing to operate on protected branch '{branch}'. "
|
||||
"All self-modifications must go via PR."
|
||||
)
|
||||
|
||||
def _checkout_base(self) -> None:
|
||||
"""Checkout the base branch and pull latest."""
|
||||
self._git("checkout", self._base_branch)
|
||||
# Best-effort pull; ignore failures (e.g. no remote configured)
|
||||
self._git("pull", self._remote, self._base_branch, check=False)
|
||||
|
||||
def _create_branch(self, branch: str) -> None:
|
||||
"""Create and checkout a new branch, deleting an old one if needed."""
|
||||
# Delete local branch if it already exists (stale prior attempt)
|
||||
self._git("branch", "-D", branch, check=False)
|
||||
self._git("checkout", "-b", branch)
|
||||
logger.info("Created branch: %s", branch)
|
||||
|
||||
def _revert_branch(self, branch: str) -> None:
|
||||
"""Checkout base and delete the failed branch."""
|
||||
try:
|
||||
self._git("checkout", self._base_branch, check=False)
|
||||
self._git("branch", "-D", branch, check=False)
|
||||
logger.info("Reverted and deleted branch: %s", branch)
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to revert branch %s: %s", branch, exc)
|
||||
|
||||
def _run_tests(self) -> tuple[str, bool]:
|
||||
"""Run the test suite. Returns (output, passed)."""
|
||||
logger.info("Running test suite: %s", " ".join(_TEST_COMMAND))
|
||||
try:
|
||||
result = subprocess.run(
|
||||
_TEST_COMMAND,
|
||||
cwd=str(self._repo_root),
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=_TEST_TIMEOUT,
|
||||
)
|
||||
output = (result.stdout + "\n" + result.stderr).strip()
|
||||
passed = result.returncode == 0
|
||||
logger.info(
|
||||
"Test suite %s (exit %d)", "PASSED" if passed else "FAILED", result.returncode
|
||||
)
|
||||
return output, passed
|
||||
except subprocess.TimeoutExpired:
|
||||
msg = f"Test suite timed out after {_TEST_TIMEOUT}s"
|
||||
logger.warning(msg)
|
||||
return msg, False
|
||||
except FileNotFoundError:
|
||||
msg = "pytest not found on PATH"
|
||||
logger.warning(msg)
|
||||
return msg, False
|
||||
|
||||
def _commit_all(self, message: str) -> str:
|
||||
"""Stage all changes and create a commit. Returns the new SHA."""
|
||||
self._git("add", "-A")
|
||||
self._git("commit", "-m", message)
|
||||
result = self._git("rev-parse", "HEAD")
|
||||
sha = result.stdout.strip()
|
||||
logger.info("Committed: %s sha=%s", message[:60], sha[:12])
|
||||
return sha
|
||||
|
||||
def _push_branch(self, branch: str) -> None:
|
||||
"""Push the branch to the remote."""
|
||||
self._git("push", "-u", self._remote, branch)
|
||||
logger.info("Pushed branch: %s -> %s", branch, self._remote)
|
||||
|
||||
def _create_pr(
|
||||
self,
|
||||
branch: str,
|
||||
description: str,
|
||||
test_output: str,
|
||||
issue_number: int | None,
|
||||
):
|
||||
"""Open a Gitea PR. Returns PullRequest or None on failure."""
|
||||
from self_coding.gitea_client import GiteaClient
|
||||
|
||||
client = GiteaClient()
|
||||
|
||||
issue_ref = f"\n\nFixes #{issue_number}" if issue_number else ""
|
||||
test_section = (
|
||||
f"\n\n## Test results\n```\n{test_output[:2000]}\n```"
|
||||
if test_output and test_output != "(tests skipped)"
|
||||
else ""
|
||||
)
|
||||
|
||||
body = (
|
||||
f"## Summary\n{description}"
|
||||
f"{issue_ref}"
|
||||
f"{test_section}"
|
||||
"\n\n🤖 Generated by Timmy's self-modification loop"
|
||||
)
|
||||
|
||||
return client.create_pull_request(
|
||||
title=f"[self-modify] {description[:60]}",
|
||||
body=body,
|
||||
head=branch,
|
||||
base=self._base_branch,
|
||||
)
|
||||
@@ -301,6 +301,26 @@ def create_timmy(
|
||||
|
||||
return GrokBackend()
|
||||
|
||||
if resolved == "airllm":
|
||||
# AirLLM requires Apple Silicon. On any other platform (Intel Mac, Linux,
|
||||
# Windows) or when the package is not installed, degrade silently to Ollama.
|
||||
from timmy.backends import is_apple_silicon
|
||||
|
||||
if not is_apple_silicon():
|
||||
logger.warning(
|
||||
"TIMMY_MODEL_BACKEND=airllm requested but not running on Apple Silicon "
|
||||
"— falling back to Ollama"
|
||||
)
|
||||
else:
|
||||
try:
|
||||
import airllm # noqa: F401
|
||||
except ImportError:
|
||||
logger.warning(
|
||||
"AirLLM not installed — falling back to Ollama. "
|
||||
"Install with: pip install 'airllm[mlx]'"
|
||||
)
|
||||
# Fall through to Ollama in all cases (AirLLM integration is scaffolded)
|
||||
|
||||
# Default: Ollama via Agno.
|
||||
model_name, is_fallback = _resolve_model_with_fallback(
|
||||
requested_model=None,
|
||||
|
||||
@@ -312,6 +312,13 @@ async def _handle_step_failure(
|
||||
"adaptation": step.result[:200],
|
||||
},
|
||||
)
|
||||
_log_self_correction(
|
||||
task_id=task_id,
|
||||
step_desc=step_desc,
|
||||
exc=exc,
|
||||
outcome=step.result,
|
||||
outcome_status="success",
|
||||
)
|
||||
if on_progress:
|
||||
await on_progress(f"[Adapted] {step_desc}", step_num, total_steps)
|
||||
except Exception as adapt_exc: # broad catch intentional
|
||||
@@ -325,9 +332,42 @@ async def _handle_step_failure(
|
||||
duration_ms=int((time.monotonic() - step_start) * 1000),
|
||||
)
|
||||
)
|
||||
_log_self_correction(
|
||||
task_id=task_id,
|
||||
step_desc=step_desc,
|
||||
exc=exc,
|
||||
outcome=f"Adaptation also failed: {adapt_exc}",
|
||||
outcome_status="failed",
|
||||
)
|
||||
completed_results.append(f"Step {step_num}: FAILED")
|
||||
|
||||
|
||||
def _log_self_correction(
|
||||
*,
|
||||
task_id: str,
|
||||
step_desc: str,
|
||||
exc: Exception,
|
||||
outcome: str,
|
||||
outcome_status: str,
|
||||
) -> None:
|
||||
"""Best-effort: log a self-correction event (never raises)."""
|
||||
try:
|
||||
from infrastructure.self_correction import log_self_correction
|
||||
|
||||
log_self_correction(
|
||||
source="agentic_loop",
|
||||
original_intent=step_desc,
|
||||
detected_error=f"{type(exc).__name__}: {exc}",
|
||||
correction_strategy="Adaptive re-plan via LLM",
|
||||
final_outcome=outcome[:500],
|
||||
task_id=task_id,
|
||||
outcome_status=outcome_status,
|
||||
error_type=type(exc).__name__,
|
||||
)
|
||||
except Exception as log_exc:
|
||||
logger.debug("Self-correction log failed: %s", log_exc)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Core loop
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
528
src/timmy/research.py
Normal file
528
src/timmy/research.py
Normal file
@@ -0,0 +1,528 @@
|
||||
"""Research Orchestrator — autonomous, sovereign research pipeline.
|
||||
|
||||
Chains all six steps of the research workflow with local-first execution:
|
||||
|
||||
Step 0 Cache — check semantic memory (SQLite, instant, zero API cost)
|
||||
Step 1 Scope — load a research template from skills/research/
|
||||
Step 2 Query — slot-fill template + formulate 5-15 search queries via Ollama
|
||||
Step 3 Search — execute queries via web_search (SerpAPI or fallback)
|
||||
Step 4 Fetch — download + extract full pages via web_fetch (trafilatura)
|
||||
Step 5 Synth — compress findings into a structured report via cascade
|
||||
Step 6 Deliver — store to semantic memory; optionally save to docs/research/
|
||||
|
||||
Cascade tiers for synthesis (spec §4):
|
||||
Tier 4 SQLite semantic cache — instant, free, covers ~80% after warm-up
|
||||
Tier 3 Ollama (qwen3:14b) — local, free, good quality
|
||||
Tier 2 Claude API (haiku) — cloud fallback, cheap, set ANTHROPIC_API_KEY
|
||||
Tier 1 (future) Groq — free-tier rate-limited, tracked in #980
|
||||
|
||||
All optional services degrade gracefully per project conventions.
|
||||
|
||||
Refs #972 (governing spec), #975 (ResearchOrchestrator sub-issue).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
import re
|
||||
import textwrap
|
||||
from dataclasses import dataclass, field
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Optional memory imports — available at module level so tests can patch them.
|
||||
try:
|
||||
from timmy.memory_system import SemanticMemory, store_memory
|
||||
except Exception: # pragma: no cover
|
||||
SemanticMemory = None # type: ignore[assignment,misc]
|
||||
store_memory = None # type: ignore[assignment]
|
||||
|
||||
# Root of the project — two levels up from src/timmy/
|
||||
_PROJECT_ROOT = Path(__file__).parent.parent.parent
|
||||
_SKILLS_ROOT = _PROJECT_ROOT / "skills" / "research"
|
||||
_DOCS_ROOT = _PROJECT_ROOT / "docs" / "research"
|
||||
|
||||
# Similarity threshold for cache hit (0–1 cosine similarity)
|
||||
_CACHE_HIT_THRESHOLD = 0.82
|
||||
|
||||
# How many search result URLs to fetch as full pages
|
||||
_FETCH_TOP_N = 5
|
||||
|
||||
# Maximum tokens to request from the synthesis LLM
|
||||
_SYNTHESIS_MAX_TOKENS = 4096
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Data structures
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@dataclass
|
||||
class ResearchResult:
|
||||
"""Full output of a research pipeline run."""
|
||||
|
||||
topic: str
|
||||
query_count: int
|
||||
sources_fetched: int
|
||||
report: str
|
||||
cached: bool = False
|
||||
cache_similarity: float = 0.0
|
||||
synthesis_backend: str = "unknown"
|
||||
errors: list[str] = field(default_factory=list)
|
||||
|
||||
def is_empty(self) -> bool:
|
||||
return not self.report.strip()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Template loading
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def list_templates() -> list[str]:
|
||||
"""Return names of available research templates (without .md extension)."""
|
||||
if not _SKILLS_ROOT.exists():
|
||||
return []
|
||||
return [p.stem for p in sorted(_SKILLS_ROOT.glob("*.md"))]
|
||||
|
||||
|
||||
def load_template(template_name: str, slots: dict[str, str] | None = None) -> str:
|
||||
"""Load a research template and fill {slot} placeholders.
|
||||
|
||||
Args:
|
||||
template_name: Stem of the .md file under skills/research/ (e.g. "tool_evaluation").
|
||||
slots: Mapping of {placeholder} → replacement value.
|
||||
|
||||
Returns:
|
||||
Template text with slots filled. Unfilled slots are left as-is.
|
||||
"""
|
||||
path = _SKILLS_ROOT / f"{template_name}.md"
|
||||
if not path.exists():
|
||||
available = ", ".join(list_templates()) or "(none)"
|
||||
raise FileNotFoundError(
|
||||
f"Research template {template_name!r} not found. "
|
||||
f"Available: {available}"
|
||||
)
|
||||
|
||||
text = path.read_text(encoding="utf-8")
|
||||
|
||||
# Strip YAML frontmatter (--- ... ---), including empty frontmatter (--- \n---)
|
||||
text = re.sub(r"^---\n.*?---\n", "", text, flags=re.DOTALL)
|
||||
|
||||
if slots:
|
||||
for key, value in slots.items():
|
||||
text = text.replace(f"{{{key}}}", value)
|
||||
|
||||
return text.strip()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Query formulation (Step 2)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
async def _formulate_queries(topic: str, template_context: str, n: int = 8) -> list[str]:
|
||||
"""Use the local LLM to generate targeted search queries for a topic.
|
||||
|
||||
Falls back to a simple heuristic if Ollama is unavailable.
|
||||
"""
|
||||
prompt = textwrap.dedent(f"""\
|
||||
You are a research assistant. Generate exactly {n} targeted, specific web search
|
||||
queries to thoroughly research the following topic.
|
||||
|
||||
TOPIC: {topic}
|
||||
|
||||
RESEARCH CONTEXT:
|
||||
{template_context[:1000]}
|
||||
|
||||
Rules:
|
||||
- One query per line, no numbering, no bullet points.
|
||||
- Vary the angle (definition, comparison, implementation, alternatives, pitfalls).
|
||||
- Prefer exact technical terms, tool names, and version numbers where relevant.
|
||||
- Output ONLY the queries, nothing else.
|
||||
""")
|
||||
|
||||
queries = await _ollama_complete(prompt, max_tokens=512)
|
||||
|
||||
if not queries:
|
||||
# Minimal fallback
|
||||
return [
|
||||
f"{topic} overview",
|
||||
f"{topic} tutorial",
|
||||
f"{topic} best practices",
|
||||
f"{topic} alternatives",
|
||||
f"{topic} 2025",
|
||||
]
|
||||
|
||||
lines = [ln.strip() for ln in queries.splitlines() if ln.strip()]
|
||||
return lines[:n] if len(lines) >= n else lines
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Search (Step 3)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
async def _execute_search(queries: list[str]) -> list[dict[str, str]]:
|
||||
"""Run each query through the available web search backend.
|
||||
|
||||
Returns a flat list of {title, url, snippet} dicts.
|
||||
Degrades gracefully if SerpAPI key is absent.
|
||||
"""
|
||||
results: list[dict[str, str]] = []
|
||||
seen_urls: set[str] = set()
|
||||
|
||||
for query in queries:
|
||||
try:
|
||||
raw = await asyncio.to_thread(_run_search_sync, query)
|
||||
for item in raw:
|
||||
url = item.get("url", "")
|
||||
if url and url not in seen_urls:
|
||||
seen_urls.add(url)
|
||||
results.append(item)
|
||||
except Exception as exc:
|
||||
logger.warning("Search failed for query %r: %s", query, exc)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def _run_search_sync(query: str) -> list[dict[str, str]]:
|
||||
"""Synchronous search — wraps SerpAPI or returns empty on missing key."""
|
||||
import os
|
||||
|
||||
if not os.environ.get("SERPAPI_API_KEY"):
|
||||
logger.debug("SERPAPI_API_KEY not set — skipping web search for %r", query)
|
||||
return []
|
||||
|
||||
try:
|
||||
from serpapi import GoogleSearch
|
||||
|
||||
params = {"q": query, "api_key": os.environ["SERPAPI_API_KEY"], "num": 5}
|
||||
search = GoogleSearch(params)
|
||||
data = search.get_dict()
|
||||
items = []
|
||||
for r in data.get("organic_results", []):
|
||||
items.append(
|
||||
{
|
||||
"title": r.get("title", ""),
|
||||
"url": r.get("link", ""),
|
||||
"snippet": r.get("snippet", ""),
|
||||
}
|
||||
)
|
||||
return items
|
||||
except Exception as exc:
|
||||
logger.warning("SerpAPI search error: %s", exc)
|
||||
return []
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fetch (Step 4)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
async def _fetch_pages(results: list[dict[str, str]], top_n: int = _FETCH_TOP_N) -> list[str]:
|
||||
"""Download and extract full text for the top search results.
|
||||
|
||||
Uses web_fetch (trafilatura) from timmy.tools.system_tools.
|
||||
"""
|
||||
try:
|
||||
from timmy.tools.system_tools import web_fetch
|
||||
except ImportError:
|
||||
logger.warning("web_fetch not available — skipping page fetch")
|
||||
return []
|
||||
|
||||
pages: list[str] = []
|
||||
for item in results[:top_n]:
|
||||
url = item.get("url", "")
|
||||
if not url:
|
||||
continue
|
||||
try:
|
||||
text = await asyncio.to_thread(web_fetch, url, 6000)
|
||||
if text and not text.startswith("Error:"):
|
||||
pages.append(f"## {item.get('title', url)}\nSource: {url}\n\n{text}")
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to fetch %s: %s", url, exc)
|
||||
|
||||
return pages
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Synthesis (Step 5) — cascade: Ollama → Claude fallback
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
async def _synthesize(topic: str, pages: list[str], snippets: list[str]) -> tuple[str, str]:
|
||||
"""Compress fetched pages + snippets into a structured research report.
|
||||
|
||||
Returns (report_markdown, backend_used).
|
||||
"""
|
||||
# Build synthesis prompt
|
||||
source_content = "\n\n---\n\n".join(pages[:5])
|
||||
if not source_content and snippets:
|
||||
source_content = "\n".join(f"- {s}" for s in snippets[:20])
|
||||
|
||||
if not source_content:
|
||||
return (
|
||||
f"# Research: {topic}\n\n*No source material was retrieved. "
|
||||
"Check SERPAPI_API_KEY and network connectivity.*",
|
||||
"none",
|
||||
)
|
||||
|
||||
prompt = textwrap.dedent(f"""\
|
||||
You are a senior technical researcher. Synthesize the source material below
|
||||
into a structured research report on the topic: **{topic}**
|
||||
|
||||
FORMAT YOUR REPORT AS:
|
||||
# {topic}
|
||||
|
||||
## Executive Summary
|
||||
(2-3 sentences: what you found, top recommendation)
|
||||
|
||||
## Key Findings
|
||||
(Bullet list of the most important facts, tools, or patterns)
|
||||
|
||||
## Comparison / Options
|
||||
(Table or list comparing alternatives where applicable)
|
||||
|
||||
## Recommended Approach
|
||||
(Concrete recommendation with rationale)
|
||||
|
||||
## Gaps & Next Steps
|
||||
(What wasn't answered, what to investigate next)
|
||||
|
||||
---
|
||||
SOURCE MATERIAL:
|
||||
{source_content[:12000]}
|
||||
""")
|
||||
|
||||
# Tier 3 — try Ollama first
|
||||
report = await _ollama_complete(prompt, max_tokens=_SYNTHESIS_MAX_TOKENS)
|
||||
if report:
|
||||
return report, "ollama"
|
||||
|
||||
# Tier 2 — Claude fallback
|
||||
report = await _claude_complete(prompt, max_tokens=_SYNTHESIS_MAX_TOKENS)
|
||||
if report:
|
||||
return report, "claude"
|
||||
|
||||
# Last resort — structured snippet summary
|
||||
summary = f"# {topic}\n\n## Snippets\n\n" + "\n\n".join(
|
||||
f"- {s}" for s in snippets[:15]
|
||||
)
|
||||
return summary, "fallback"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# LLM helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
async def _ollama_complete(prompt: str, max_tokens: int = 1024) -> str:
|
||||
"""Send a prompt to Ollama and return the response text.
|
||||
|
||||
Returns empty string on failure (graceful degradation).
|
||||
"""
|
||||
try:
|
||||
import httpx
|
||||
|
||||
from config import settings
|
||||
|
||||
url = f"{settings.normalized_ollama_url}/api/generate"
|
||||
payload: dict[str, Any] = {
|
||||
"model": settings.ollama_model,
|
||||
"prompt": prompt,
|
||||
"stream": False,
|
||||
"options": {
|
||||
"num_predict": max_tokens,
|
||||
"temperature": 0.3,
|
||||
},
|
||||
}
|
||||
|
||||
async with httpx.AsyncClient(timeout=120.0) as client:
|
||||
resp = await client.post(url, json=payload)
|
||||
resp.raise_for_status()
|
||||
data = resp.json()
|
||||
return data.get("response", "").strip()
|
||||
except Exception as exc:
|
||||
logger.warning("Ollama completion failed: %s", exc)
|
||||
return ""
|
||||
|
||||
|
||||
async def _claude_complete(prompt: str, max_tokens: int = 1024) -> str:
|
||||
"""Send a prompt to Claude API as a last-resort fallback.
|
||||
|
||||
Only active when ANTHROPIC_API_KEY is configured.
|
||||
Returns empty string on failure or missing key.
|
||||
"""
|
||||
try:
|
||||
from config import settings
|
||||
|
||||
if not settings.anthropic_api_key:
|
||||
return ""
|
||||
|
||||
from timmy.backends import ClaudeBackend
|
||||
|
||||
backend = ClaudeBackend()
|
||||
result = await asyncio.to_thread(backend.run, prompt)
|
||||
return result.content.strip()
|
||||
except Exception as exc:
|
||||
logger.warning("Claude fallback failed: %s", exc)
|
||||
return ""
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Memory cache (Step 0 + Step 6)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _check_cache(topic: str) -> tuple[str | None, float]:
|
||||
"""Search semantic memory for a prior result on this topic.
|
||||
|
||||
Returns (cached_report, similarity) or (None, 0.0).
|
||||
"""
|
||||
try:
|
||||
if SemanticMemory is None:
|
||||
return None, 0.0
|
||||
mem = SemanticMemory()
|
||||
hits = mem.search(topic, top_k=1)
|
||||
if hits:
|
||||
content, score = hits[0]
|
||||
if score >= _CACHE_HIT_THRESHOLD:
|
||||
return content, score
|
||||
except Exception as exc:
|
||||
logger.debug("Cache check failed: %s", exc)
|
||||
return None, 0.0
|
||||
|
||||
|
||||
def _store_result(topic: str, report: str) -> None:
|
||||
"""Index the research report into semantic memory for future retrieval."""
|
||||
try:
|
||||
if store_memory is None:
|
||||
logger.debug("store_memory not available — skipping memory index")
|
||||
return
|
||||
store_memory(
|
||||
content=report,
|
||||
source="research_pipeline",
|
||||
context_type="research",
|
||||
metadata={"topic": topic},
|
||||
)
|
||||
logger.info("Research result indexed for topic: %r", topic)
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to store research result: %s", exc)
|
||||
|
||||
|
||||
def _save_to_disk(topic: str, report: str) -> Path | None:
|
||||
"""Persist the report as a markdown file under docs/research/.
|
||||
|
||||
Filename is derived from the topic (slugified). Returns the path or None.
|
||||
"""
|
||||
try:
|
||||
slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")[:60]
|
||||
_DOCS_ROOT.mkdir(parents=True, exist_ok=True)
|
||||
path = _DOCS_ROOT / f"{slug}.md"
|
||||
path.write_text(report, encoding="utf-8")
|
||||
logger.info("Research report saved to %s", path)
|
||||
return path
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to save research report to disk: %s", exc)
|
||||
return None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main orchestrator
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
async def run_research(
|
||||
topic: str,
|
||||
template: str | None = None,
|
||||
slots: dict[str, str] | None = None,
|
||||
save_to_disk: bool = False,
|
||||
skip_cache: bool = False,
|
||||
) -> ResearchResult:
|
||||
"""Run the full 6-step autonomous research pipeline.
|
||||
|
||||
Args:
|
||||
topic: The research question or subject.
|
||||
template: Name of a template from skills/research/ (e.g. "tool_evaluation").
|
||||
If None, runs without a template scaffold.
|
||||
slots: Placeholder values for the template (e.g. {"domain": "PDF parsing"}).
|
||||
save_to_disk: If True, write the report to docs/research/<slug>.md.
|
||||
skip_cache: If True, bypass the semantic memory cache.
|
||||
|
||||
Returns:
|
||||
ResearchResult with report and metadata.
|
||||
"""
|
||||
errors: list[str] = []
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Step 0 — check cache
|
||||
# ------------------------------------------------------------------
|
||||
if not skip_cache:
|
||||
cached, score = _check_cache(topic)
|
||||
if cached:
|
||||
logger.info("Cache hit (%.2f) for topic: %r", score, topic)
|
||||
return ResearchResult(
|
||||
topic=topic,
|
||||
query_count=0,
|
||||
sources_fetched=0,
|
||||
report=cached,
|
||||
cached=True,
|
||||
cache_similarity=score,
|
||||
synthesis_backend="cache",
|
||||
)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Step 1 — load template (optional)
|
||||
# ------------------------------------------------------------------
|
||||
template_context = ""
|
||||
if template:
|
||||
try:
|
||||
template_context = load_template(template, slots)
|
||||
except FileNotFoundError as exc:
|
||||
errors.append(str(exc))
|
||||
logger.warning("Template load failed: %s", exc)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Step 2 — formulate queries
|
||||
# ------------------------------------------------------------------
|
||||
queries = await _formulate_queries(topic, template_context)
|
||||
logger.info("Formulated %d queries for topic: %r", len(queries), topic)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Step 3 — execute search
|
||||
# ------------------------------------------------------------------
|
||||
search_results = await _execute_search(queries)
|
||||
logger.info("Search returned %d results", len(search_results))
|
||||
snippets = [r.get("snippet", "") for r in search_results if r.get("snippet")]
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Step 4 — fetch full pages
|
||||
# ------------------------------------------------------------------
|
||||
pages = await _fetch_pages(search_results)
|
||||
logger.info("Fetched %d pages", len(pages))
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Step 5 — synthesize
|
||||
# ------------------------------------------------------------------
|
||||
report, backend = await _synthesize(topic, pages, snippets)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Step 6 — deliver
|
||||
# ------------------------------------------------------------------
|
||||
_store_result(topic, report)
|
||||
if save_to_disk:
|
||||
_save_to_disk(topic, report)
|
||||
|
||||
return ResearchResult(
|
||||
topic=topic,
|
||||
query_count=len(queries),
|
||||
sources_fetched=len(pages),
|
||||
report=report,
|
||||
cached=False,
|
||||
synthesis_backend=backend,
|
||||
errors=errors,
|
||||
)
|
||||
@@ -8,4 +8,23 @@ Refs: #954, #953
|
||||
Three-strike detector and automation enforcement.
|
||||
|
||||
Refs: #962
|
||||
|
||||
Session reporting: auto-generates markdown scorecards at session end
|
||||
and commits them to the Gitea repo for institutional memory.
|
||||
|
||||
Refs: #957 (Session Sovereignty Report Generator)
|
||||
"""
|
||||
|
||||
from timmy.sovereignty.session_report import (
|
||||
commit_report,
|
||||
generate_and_commit_report,
|
||||
generate_report,
|
||||
mark_session_start,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"generate_report",
|
||||
"commit_report",
|
||||
"generate_and_commit_report",
|
||||
"mark_session_start",
|
||||
]
|
||||
|
||||
442
src/timmy/sovereignty/session_report.py
Normal file
442
src/timmy/sovereignty/session_report.py
Normal file
@@ -0,0 +1,442 @@
|
||||
"""Session Sovereignty Report Generator.
|
||||
|
||||
Auto-generates a sovereignty scorecard at the end of each play session
|
||||
and commits it as a markdown file to the Gitea repo under
|
||||
``reports/sovereignty/``.
|
||||
|
||||
Report contents (per issue #957):
|
||||
- Session duration + game played
|
||||
- Total model calls by type (VLM, LLM, TTS, API)
|
||||
- Total cache/rule hits by type
|
||||
- New skills crystallized (placeholder — pending skill-tracking impl)
|
||||
- Sovereignty delta (change from session start → end)
|
||||
- Cost breakdown (actual API spend)
|
||||
- Per-layer sovereignty %: perception, decision, narration
|
||||
- Trend comparison vs previous session
|
||||
|
||||
Refs: #957 (Sovereignty P0) · #953 (The Sovereignty Loop)
|
||||
"""
|
||||
|
||||
import base64
|
||||
import json
|
||||
import logging
|
||||
from datetime import UTC, datetime
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import httpx
|
||||
|
||||
from config import settings
|
||||
|
||||
# Optional module-level imports — degrade gracefully if unavailable at import time
|
||||
try:
|
||||
from timmy.session_logger import get_session_logger
|
||||
except Exception: # ImportError or circular import during early startup
|
||||
get_session_logger = None # type: ignore[assignment]
|
||||
|
||||
try:
|
||||
from infrastructure.sovereignty_metrics import GRADUATION_TARGETS, get_sovereignty_store
|
||||
except Exception:
|
||||
GRADUATION_TARGETS: dict = {} # type: ignore[assignment]
|
||||
get_sovereignty_store = None # type: ignore[assignment]
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Module-level session start time; set by mark_session_start()
|
||||
_SESSION_START: datetime | None = None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Public API
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def mark_session_start() -> None:
|
||||
"""Record the session start wall-clock time.
|
||||
|
||||
Call once during application startup so ``generate_report()`` can
|
||||
compute accurate session durations.
|
||||
"""
|
||||
global _SESSION_START
|
||||
_SESSION_START = datetime.now(UTC)
|
||||
logger.debug("Sovereignty: session start recorded at %s", _SESSION_START.isoformat())
|
||||
|
||||
|
||||
def generate_report(session_id: str = "dashboard") -> str:
|
||||
"""Render a sovereignty scorecard as a markdown string.
|
||||
|
||||
Pulls from:
|
||||
- ``timmy.session_logger`` — message/tool-call/error counts
|
||||
- ``infrastructure.sovereignty_metrics`` — cache hit rate, API cost,
|
||||
graduation phase, and trend data
|
||||
|
||||
Args:
|
||||
session_id: The session identifier (default: "dashboard").
|
||||
|
||||
Returns:
|
||||
Markdown-formatted sovereignty report string.
|
||||
"""
|
||||
now = datetime.now(UTC)
|
||||
session_start = _SESSION_START or now
|
||||
duration_secs = (now - session_start).total_seconds()
|
||||
|
||||
session_data = _gather_session_data()
|
||||
sov_data = _gather_sovereignty_data()
|
||||
|
||||
return _render_markdown(now, session_id, duration_secs, session_data, sov_data)
|
||||
|
||||
|
||||
def commit_report(report_md: str, session_id: str = "dashboard") -> bool:
|
||||
"""Commit a sovereignty report to the Gitea repo.
|
||||
|
||||
Creates or updates ``reports/sovereignty/{date}_{session_id}.md``
|
||||
via the Gitea Contents API. Degrades gracefully: logs a warning
|
||||
and returns ``False`` if Gitea is unreachable or misconfigured.
|
||||
|
||||
Args:
|
||||
report_md: Markdown content to commit.
|
||||
session_id: Session identifier used in the filename.
|
||||
|
||||
Returns:
|
||||
``True`` on success, ``False`` on failure.
|
||||
"""
|
||||
if not settings.gitea_enabled:
|
||||
logger.info("Sovereignty: Gitea disabled — skipping report commit")
|
||||
return False
|
||||
|
||||
if not settings.gitea_token:
|
||||
logger.warning("Sovereignty: no Gitea token — skipping report commit")
|
||||
return False
|
||||
|
||||
date_str = datetime.now(UTC).strftime("%Y-%m-%d")
|
||||
file_path = f"reports/sovereignty/{date_str}_{session_id}.md"
|
||||
url = f"{settings.gitea_url}/api/v1/repos/{settings.gitea_repo}/contents/{file_path}"
|
||||
headers = {
|
||||
"Authorization": f"token {settings.gitea_token}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
encoded_content = base64.b64encode(report_md.encode()).decode()
|
||||
commit_message = (
|
||||
f"report: sovereignty session {session_id} ({date_str})\n\n"
|
||||
f"Auto-generated by Timmy. Refs #957"
|
||||
)
|
||||
payload: dict[str, Any] = {
|
||||
"message": commit_message,
|
||||
"content": encoded_content,
|
||||
}
|
||||
|
||||
try:
|
||||
with httpx.Client(timeout=10.0) as client:
|
||||
# Fetch existing file SHA so we can update rather than create
|
||||
check = client.get(url, headers=headers)
|
||||
if check.status_code == 200:
|
||||
existing = check.json()
|
||||
payload["sha"] = existing.get("sha", "")
|
||||
|
||||
resp = client.put(url, headers=headers, json=payload)
|
||||
resp.raise_for_status()
|
||||
|
||||
logger.info("Sovereignty: report committed to %s", file_path)
|
||||
return True
|
||||
|
||||
except httpx.HTTPStatusError as exc:
|
||||
logger.warning(
|
||||
"Sovereignty: commit failed (HTTP %s): %s",
|
||||
exc.response.status_code,
|
||||
exc,
|
||||
)
|
||||
return False
|
||||
except Exception as exc:
|
||||
logger.warning("Sovereignty: commit failed: %s", exc)
|
||||
return False
|
||||
|
||||
|
||||
async def generate_and_commit_report(session_id: str = "dashboard") -> bool:
|
||||
"""Generate and commit a sovereignty report for the current session.
|
||||
|
||||
Primary entry point — call at session end / application shutdown.
|
||||
Wraps the synchronous ``commit_report`` call in ``asyncio.to_thread``
|
||||
so it does not block the event loop.
|
||||
|
||||
Args:
|
||||
session_id: The session identifier.
|
||||
|
||||
Returns:
|
||||
``True`` if the report was generated and committed successfully.
|
||||
"""
|
||||
import asyncio
|
||||
|
||||
try:
|
||||
report_md = generate_report(session_id)
|
||||
logger.info("Sovereignty: report generated (%d chars)", len(report_md))
|
||||
committed = await asyncio.to_thread(commit_report, report_md, session_id)
|
||||
return committed
|
||||
except Exception as exc:
|
||||
logger.warning("Sovereignty: report generation failed: %s", exc)
|
||||
return False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Internal helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _format_duration(seconds: float) -> str:
|
||||
"""Format a duration in seconds as a human-readable string."""
|
||||
total = int(seconds)
|
||||
hours, remainder = divmod(total, 3600)
|
||||
minutes, secs = divmod(remainder, 60)
|
||||
if hours:
|
||||
return f"{hours}h {minutes}m {secs}s"
|
||||
if minutes:
|
||||
return f"{minutes}m {secs}s"
|
||||
return f"{secs}s"
|
||||
|
||||
|
||||
def _gather_session_data() -> dict[str, Any]:
|
||||
"""Pull session statistics from the session logger.
|
||||
|
||||
Returns a dict with:
|
||||
- ``user_messages``, ``timmy_messages``, ``tool_calls``, ``errors``
|
||||
- ``tool_call_breakdown``: dict[tool_name, count]
|
||||
"""
|
||||
default: dict[str, Any] = {
|
||||
"user_messages": 0,
|
||||
"timmy_messages": 0,
|
||||
"tool_calls": 0,
|
||||
"errors": 0,
|
||||
"tool_call_breakdown": {},
|
||||
}
|
||||
|
||||
try:
|
||||
if get_session_logger is None:
|
||||
return default
|
||||
sl = get_session_logger()
|
||||
sl.flush()
|
||||
|
||||
# Read today's session file directly for accurate counts
|
||||
if not sl.session_file.exists():
|
||||
return default
|
||||
|
||||
entries: list[dict] = []
|
||||
with open(sl.session_file) as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line:
|
||||
try:
|
||||
entries.append(json.loads(line))
|
||||
except json.JSONDecodeError:
|
||||
continue
|
||||
|
||||
tool_breakdown: dict[str, int] = {}
|
||||
user_msgs = timmy_msgs = tool_calls = errors = 0
|
||||
|
||||
for entry in entries:
|
||||
etype = entry.get("type")
|
||||
if etype == "message":
|
||||
if entry.get("role") == "user":
|
||||
user_msgs += 1
|
||||
elif entry.get("role") == "timmy":
|
||||
timmy_msgs += 1
|
||||
elif etype == "tool_call":
|
||||
tool_calls += 1
|
||||
tool_name = entry.get("tool", "unknown")
|
||||
tool_breakdown[tool_name] = tool_breakdown.get(tool_name, 0) + 1
|
||||
elif etype == "error":
|
||||
errors += 1
|
||||
|
||||
return {
|
||||
"user_messages": user_msgs,
|
||||
"timmy_messages": timmy_msgs,
|
||||
"tool_calls": tool_calls,
|
||||
"errors": errors,
|
||||
"tool_call_breakdown": tool_breakdown,
|
||||
}
|
||||
|
||||
except Exception as exc:
|
||||
logger.warning("Sovereignty: failed to gather session data: %s", exc)
|
||||
return default
|
||||
|
||||
|
||||
def _gather_sovereignty_data() -> dict[str, Any]:
|
||||
"""Pull sovereignty metrics from the SQLite store.
|
||||
|
||||
Returns a dict with:
|
||||
- ``metrics``: summary from ``SovereigntyMetricsStore.get_summary()``
|
||||
- ``deltas``: per-metric start/end values within recent history window
|
||||
- ``previous_session``: most recent prior value for each metric
|
||||
"""
|
||||
try:
|
||||
if get_sovereignty_store is None:
|
||||
return {"metrics": {}, "deltas": {}, "previous_session": {}}
|
||||
store = get_sovereignty_store()
|
||||
summary = store.get_summary()
|
||||
|
||||
deltas: dict[str, dict[str, Any]] = {}
|
||||
previous_session: dict[str, float | None] = {}
|
||||
|
||||
for metric_type in GRADUATION_TARGETS:
|
||||
history = store.get_latest(metric_type, limit=10)
|
||||
if len(history) >= 2:
|
||||
deltas[metric_type] = {
|
||||
"start": history[-1]["value"],
|
||||
"end": history[0]["value"],
|
||||
}
|
||||
previous_session[metric_type] = history[1]["value"]
|
||||
elif len(history) == 1:
|
||||
deltas[metric_type] = {"start": history[0]["value"], "end": history[0]["value"]}
|
||||
previous_session[metric_type] = None
|
||||
else:
|
||||
deltas[metric_type] = {"start": None, "end": None}
|
||||
previous_session[metric_type] = None
|
||||
|
||||
return {
|
||||
"metrics": summary,
|
||||
"deltas": deltas,
|
||||
"previous_session": previous_session,
|
||||
}
|
||||
|
||||
except Exception as exc:
|
||||
logger.warning("Sovereignty: failed to gather sovereignty data: %s", exc)
|
||||
return {"metrics": {}, "deltas": {}, "previous_session": {}}
|
||||
|
||||
|
||||
def _render_markdown(
|
||||
now: datetime,
|
||||
session_id: str,
|
||||
duration_secs: float,
|
||||
session_data: dict[str, Any],
|
||||
sov_data: dict[str, Any],
|
||||
) -> str:
|
||||
"""Assemble the full sovereignty report in markdown."""
|
||||
lines: list[str] = []
|
||||
|
||||
# Header
|
||||
lines += [
|
||||
"# Sovereignty Session Report",
|
||||
"",
|
||||
f"**Session ID:** `{session_id}` ",
|
||||
f"**Date:** {now.strftime('%Y-%m-%d')} ",
|
||||
f"**Duration:** {_format_duration(duration_secs)} ",
|
||||
f"**Generated:** {now.isoformat()}",
|
||||
"",
|
||||
"---",
|
||||
"",
|
||||
]
|
||||
|
||||
# Session activity
|
||||
lines += [
|
||||
"## Session Activity",
|
||||
"",
|
||||
"| Metric | Count |",
|
||||
"|--------|-------|",
|
||||
f"| User messages | {session_data['user_messages']} |",
|
||||
f"| Timmy responses | {session_data['timmy_messages']} |",
|
||||
f"| Tool calls | {session_data['tool_calls']} |",
|
||||
f"| Errors | {session_data['errors']} |",
|
||||
"",
|
||||
]
|
||||
|
||||
tool_breakdown = session_data.get("tool_call_breakdown", {})
|
||||
if tool_breakdown:
|
||||
lines += ["### Model Calls by Tool", ""]
|
||||
for tool_name, count in sorted(tool_breakdown.items(), key=lambda x: -x[1]):
|
||||
lines.append(f"- `{tool_name}`: {count}")
|
||||
lines.append("")
|
||||
|
||||
# Sovereignty scorecard
|
||||
|
||||
lines += [
|
||||
"## Sovereignty Scorecard",
|
||||
"",
|
||||
"| Metric | Current | Target (graduation) | Phase |",
|
||||
"|--------|---------|---------------------|-------|",
|
||||
]
|
||||
|
||||
for metric_type, data in sov_data["metrics"].items():
|
||||
current = data.get("current")
|
||||
current_str = f"{current:.4f}" if current is not None else "N/A"
|
||||
grad_target = GRADUATION_TARGETS.get(metric_type, {}).get("graduation")
|
||||
grad_str = f"{grad_target:.4f}" if isinstance(grad_target, (int, float)) else "N/A"
|
||||
phase = data.get("phase", "unknown")
|
||||
lines.append(f"| {metric_type} | {current_str} | {grad_str} | {phase} |")
|
||||
|
||||
lines += ["", "### Sovereignty Delta (This Session)", ""]
|
||||
|
||||
for metric_type, delta_info in sov_data.get("deltas", {}).items():
|
||||
start_val = delta_info.get("start")
|
||||
end_val = delta_info.get("end")
|
||||
if start_val is not None and end_val is not None:
|
||||
diff = end_val - start_val
|
||||
sign = "+" if diff >= 0 else ""
|
||||
lines.append(
|
||||
f"- **{metric_type}**: {start_val:.4f} → {end_val:.4f} ({sign}{diff:.4f})"
|
||||
)
|
||||
else:
|
||||
lines.append(f"- **{metric_type}**: N/A (no data recorded)")
|
||||
|
||||
# Cost breakdown
|
||||
lines += ["", "## Cost Breakdown", ""]
|
||||
api_cost_data = sov_data["metrics"].get("api_cost", {})
|
||||
current_cost = api_cost_data.get("current")
|
||||
if current_cost is not None:
|
||||
lines.append(f"- **Total API spend (latest recorded):** ${current_cost:.4f}")
|
||||
else:
|
||||
lines.append("- **Total API spend:** N/A (no data recorded)")
|
||||
lines.append("")
|
||||
|
||||
# Per-layer sovereignty
|
||||
lines += [
|
||||
"## Per-Layer Sovereignty",
|
||||
"",
|
||||
"| Layer | Sovereignty % |",
|
||||
"|-------|--------------|",
|
||||
"| Perception (VLM) | N/A |",
|
||||
"| Decision (LLM) | N/A |",
|
||||
"| Narration (TTS) | N/A |",
|
||||
"",
|
||||
"> Per-layer tracking requires instrumented inference calls. See #957.",
|
||||
"",
|
||||
]
|
||||
|
||||
# Skills crystallized
|
||||
lines += [
|
||||
"## Skills Crystallized",
|
||||
"",
|
||||
"_Skill crystallization tracking not yet implemented. See #957._",
|
||||
"",
|
||||
]
|
||||
|
||||
# Trend vs previous session
|
||||
lines += ["## Trend vs Previous Session", ""]
|
||||
prev_data = sov_data.get("previous_session", {})
|
||||
has_prev = any(v is not None for v in prev_data.values())
|
||||
|
||||
if has_prev:
|
||||
lines += [
|
||||
"| Metric | Previous | Current | Change |",
|
||||
"|--------|----------|---------|--------|",
|
||||
]
|
||||
for metric_type, curr_info in sov_data["metrics"].items():
|
||||
curr_val = curr_info.get("current")
|
||||
prev_val = prev_data.get(metric_type)
|
||||
curr_str = f"{curr_val:.4f}" if curr_val is not None else "N/A"
|
||||
prev_str = f"{prev_val:.4f}" if prev_val is not None else "N/A"
|
||||
if curr_val is not None and prev_val is not None:
|
||||
diff = curr_val - prev_val
|
||||
sign = "+" if diff >= 0 else ""
|
||||
change_str = f"{sign}{diff:.4f}"
|
||||
else:
|
||||
change_str = "N/A"
|
||||
lines.append(f"| {metric_type} | {prev_str} | {curr_str} | {change_str} |")
|
||||
lines.append("")
|
||||
else:
|
||||
lines += ["_No previous session data available for comparison._", ""]
|
||||
|
||||
# Footer
|
||||
lines += [
|
||||
"---",
|
||||
"_Auto-generated by Timmy · Session Sovereignty Report · Refs: #957_",
|
||||
]
|
||||
|
||||
return "\n".join(lines)
|
||||
@@ -46,6 +46,7 @@ from timmy.tools.file_tools import (
|
||||
create_research_tools,
|
||||
create_writing_tools,
|
||||
)
|
||||
from timmy.tools.search import scrape_url, web_search
|
||||
from timmy.tools.system_tools import (
|
||||
_safe_eval,
|
||||
calculator,
|
||||
@@ -72,6 +73,9 @@ __all__ = [
|
||||
"create_data_tools",
|
||||
"create_research_tools",
|
||||
"create_writing_tools",
|
||||
# search
|
||||
"scrape_url",
|
||||
"web_search",
|
||||
# system_tools
|
||||
"_safe_eval",
|
||||
"calculator",
|
||||
|
||||
@@ -28,6 +28,7 @@ from timmy.tools.file_tools import (
|
||||
create_research_tools,
|
||||
create_writing_tools,
|
||||
)
|
||||
from timmy.tools.search import scrape_url, web_search
|
||||
from timmy.tools.system_tools import (
|
||||
calculator,
|
||||
consult_grok,
|
||||
@@ -54,6 +55,16 @@ def _register_web_fetch_tool(toolkit: Toolkit) -> None:
|
||||
raise
|
||||
|
||||
|
||||
def _register_search_tools(toolkit: Toolkit) -> None:
|
||||
"""Register SearXNG web_search and Crawl4AI scrape_url tools."""
|
||||
try:
|
||||
toolkit.register(web_search, name="web_search")
|
||||
toolkit.register(scrape_url, name="scrape_url")
|
||||
except Exception as exc:
|
||||
logger.error("Failed to register search tools: %s", exc)
|
||||
raise
|
||||
|
||||
|
||||
def _register_core_tools(toolkit: Toolkit, base_path: Path) -> None:
|
||||
"""Register core execution and file tools."""
|
||||
# Python execution
|
||||
@@ -261,6 +272,7 @@ def create_full_toolkit(base_dir: str | Path | None = None):
|
||||
|
||||
_register_core_tools(toolkit, base_path)
|
||||
_register_web_fetch_tool(toolkit)
|
||||
_register_search_tools(toolkit)
|
||||
_register_grok_tool(toolkit)
|
||||
_register_memory_tools(toolkit)
|
||||
_register_agentic_loop_tool(toolkit)
|
||||
@@ -433,6 +445,16 @@ def _analysis_tool_catalog() -> dict:
|
||||
"description": "Fetch a web page and extract clean readable text (trafilatura)",
|
||||
"available_in": ["orchestrator"],
|
||||
},
|
||||
"web_search": {
|
||||
"name": "Web Search",
|
||||
"description": "Search the web via self-hosted SearXNG (no API key required)",
|
||||
"available_in": ["echo", "orchestrator"],
|
||||
},
|
||||
"scrape_url": {
|
||||
"name": "Scrape URL",
|
||||
"description": "Scrape a URL with Crawl4AI and return clean markdown content",
|
||||
"available_in": ["echo", "orchestrator"],
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
|
||||
@@ -59,7 +59,7 @@ def _make_smart_read_file(file_tools: FileTools) -> Callable:
|
||||
def create_research_tools(base_dir: str | Path | None = None):
|
||||
"""Create tools for the research agent (Echo).
|
||||
|
||||
Includes: file reading
|
||||
Includes: file reading, web search (SearXNG), URL scraping (Crawl4AI)
|
||||
"""
|
||||
if not _AGNO_TOOLS_AVAILABLE:
|
||||
raise ImportError(f"Agno tools not available: {_ImportError}")
|
||||
@@ -73,6 +73,12 @@ def create_research_tools(base_dir: str | Path | None = None):
|
||||
toolkit.register(_make_smart_read_file(file_tools), name="read_file")
|
||||
toolkit.register(file_tools.list_files, name="list_files")
|
||||
|
||||
# Web search + scraping (gracefully no-ops when backend=none or service down)
|
||||
from timmy.tools.search import scrape_url, web_search
|
||||
|
||||
toolkit.register(web_search, name="web_search")
|
||||
toolkit.register(scrape_url, name="scrape_url")
|
||||
|
||||
return toolkit
|
||||
|
||||
|
||||
|
||||
186
src/timmy/tools/search.py
Normal file
186
src/timmy/tools/search.py
Normal file
@@ -0,0 +1,186 @@
|
||||
"""Self-hosted web search and scraping tools using SearXNG + Crawl4AI.
|
||||
|
||||
Provides:
|
||||
- web_search(query) — SearXNG meta-search (no API key required)
|
||||
- scrape_url(url) — Crawl4AI full-page scrape to clean markdown
|
||||
|
||||
Both tools degrade gracefully when the backing service is unavailable
|
||||
(logs WARNING, returns descriptive error string — never crashes).
|
||||
|
||||
Services are started via `docker compose --profile search up` or configured
|
||||
with TIMMY_SEARCH_URL / TIMMY_CRAWL_URL environment variables.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import time
|
||||
|
||||
from config import settings
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Crawl4AI polling: up to _CRAWL_MAX_POLLS × _CRAWL_POLL_INTERVAL seconds
|
||||
_CRAWL_MAX_POLLS = 6
|
||||
_CRAWL_POLL_INTERVAL = 5 # seconds
|
||||
_CRAWL_CHAR_BUDGET = 4000 * 4 # ~4000 tokens
|
||||
|
||||
|
||||
def web_search(query: str, num_results: int = 5) -> str:
|
||||
"""Search the web using the self-hosted SearXNG meta-search engine.
|
||||
|
||||
Returns ranked results (title + URL + snippet) without requiring any
|
||||
paid API key. Requires SearXNG running locally (docker compose
|
||||
--profile search up) or TIMMY_SEARCH_URL pointing to a reachable instance.
|
||||
|
||||
Args:
|
||||
query: The search query.
|
||||
num_results: Maximum number of results to return (default 5).
|
||||
|
||||
Returns:
|
||||
Formatted search results string, or an error/status message on failure.
|
||||
"""
|
||||
if settings.timmy_search_backend == "none":
|
||||
return "Web search is disabled (TIMMY_SEARCH_BACKEND=none)."
|
||||
|
||||
try:
|
||||
import requests as _requests
|
||||
except ImportError:
|
||||
return "Error: 'requests' package is not installed."
|
||||
|
||||
base_url = settings.search_url.rstrip("/")
|
||||
params: dict = {
|
||||
"q": query,
|
||||
"format": "json",
|
||||
"categories": "general",
|
||||
}
|
||||
|
||||
try:
|
||||
resp = _requests.get(
|
||||
f"{base_url}/search",
|
||||
params=params,
|
||||
timeout=10,
|
||||
headers={"User-Agent": "TimmyResearchBot/1.0"},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
except Exception as exc:
|
||||
logger.warning("SearXNG unavailable at %s: %s", base_url, exc)
|
||||
return f"Search unavailable — SearXNG not reachable ({base_url}): {exc}"
|
||||
|
||||
try:
|
||||
data = resp.json()
|
||||
except Exception as exc:
|
||||
logger.warning("SearXNG response parse error: %s", exc)
|
||||
return "Search error: could not parse SearXNG response."
|
||||
|
||||
results = data.get("results", [])[:num_results]
|
||||
if not results:
|
||||
return f"No results found for: {query!r}"
|
||||
|
||||
lines = [f"Web search results for: {query!r}\n"]
|
||||
for i, r in enumerate(results, 1):
|
||||
title = r.get("title", "Untitled")
|
||||
url = r.get("url", "")
|
||||
snippet = r.get("content", "").strip()
|
||||
lines.append(f"{i}. {title}\n URL: {url}\n {snippet}\n")
|
||||
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def scrape_url(url: str) -> str:
|
||||
"""Scrape a URL with Crawl4AI and return the main content as clean markdown.
|
||||
|
||||
Crawl4AI extracts well-structured markdown from any public page —
|
||||
articles, docs, product pages — suitable for LLM consumption.
|
||||
Requires Crawl4AI running locally (docker compose --profile search up)
|
||||
or TIMMY_CRAWL_URL pointing to a reachable instance.
|
||||
|
||||
Args:
|
||||
url: The URL to scrape (must start with http:// or https://).
|
||||
|
||||
Returns:
|
||||
Extracted markdown text (up to ~4000 tokens), or an error message.
|
||||
"""
|
||||
if not url or not url.startswith(("http://", "https://")):
|
||||
return f"Error: invalid URL — must start with http:// or https://: {url!r}"
|
||||
|
||||
if settings.timmy_search_backend == "none":
|
||||
return "Web scraping is disabled (TIMMY_SEARCH_BACKEND=none)."
|
||||
|
||||
try:
|
||||
import requests as _requests
|
||||
except ImportError:
|
||||
return "Error: 'requests' package is not installed."
|
||||
|
||||
base = settings.crawl_url.rstrip("/")
|
||||
|
||||
# Submit crawl task
|
||||
try:
|
||||
resp = _requests.post(
|
||||
f"{base}/crawl",
|
||||
json={"urls": [url], "priority": 10},
|
||||
timeout=15,
|
||||
headers={"Content-Type": "application/json"},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
except Exception as exc:
|
||||
logger.warning("Crawl4AI unavailable at %s: %s", base, exc)
|
||||
return f"Scrape unavailable — Crawl4AI not reachable ({base}): {exc}"
|
||||
|
||||
try:
|
||||
submit_data = resp.json()
|
||||
except Exception as exc:
|
||||
logger.warning("Crawl4AI submit parse error: %s", exc)
|
||||
return "Scrape error: could not parse Crawl4AI response."
|
||||
|
||||
# Check if result came back synchronously
|
||||
if "results" in submit_data:
|
||||
return _extract_crawl_content(submit_data["results"], url)
|
||||
|
||||
task_id = submit_data.get("task_id")
|
||||
if not task_id:
|
||||
return f"Scrape error: Crawl4AI returned no task_id for {url}"
|
||||
|
||||
# Poll for async result
|
||||
for _ in range(_CRAWL_MAX_POLLS):
|
||||
time.sleep(_CRAWL_POLL_INTERVAL)
|
||||
try:
|
||||
poll = _requests.get(f"{base}/task/{task_id}", timeout=10)
|
||||
poll.raise_for_status()
|
||||
task_data = poll.json()
|
||||
except Exception as exc:
|
||||
logger.warning("Crawl4AI poll error (task=%s): %s", task_id, exc)
|
||||
continue
|
||||
|
||||
status = task_data.get("status", "")
|
||||
if status == "completed":
|
||||
results = task_data.get("results") or task_data.get("result")
|
||||
if isinstance(results, dict):
|
||||
results = [results]
|
||||
return _extract_crawl_content(results or [], url)
|
||||
if status == "failed":
|
||||
return f"Scrape failed for {url}: {task_data.get('error', 'unknown error')}"
|
||||
|
||||
return f"Scrape timed out after {_CRAWL_MAX_POLLS * _CRAWL_POLL_INTERVAL}s for {url}"
|
||||
|
||||
|
||||
def _extract_crawl_content(results: list, url: str) -> str:
|
||||
"""Extract and truncate markdown content from Crawl4AI results list."""
|
||||
if not results:
|
||||
return f"No content returned by Crawl4AI for: {url}"
|
||||
|
||||
result = results[0]
|
||||
content = (
|
||||
result.get("markdown")
|
||||
or result.get("markdown_v2", {}).get("raw_markdown")
|
||||
or result.get("extracted_content")
|
||||
or result.get("content")
|
||||
or ""
|
||||
)
|
||||
if not content:
|
||||
return f"No readable content extracted from: {url}"
|
||||
|
||||
if len(content) > _CRAWL_CHAR_BUDGET:
|
||||
content = content[:_CRAWL_CHAR_BUDGET] + "\n\n[…truncated to ~4000 tokens]"
|
||||
|
||||
return content
|
||||
@@ -41,17 +41,38 @@ def delegate_task(
|
||||
if priority not in valid_priorities:
|
||||
priority = "normal"
|
||||
|
||||
agent_role = available[agent_name]
|
||||
|
||||
# Wire to DistributedWorker for actual execution
|
||||
task_id: str | None = None
|
||||
status = "queued"
|
||||
try:
|
||||
from brain.worker import DistributedWorker
|
||||
|
||||
task_id = DistributedWorker.submit(agent_name, agent_role, task_description, priority)
|
||||
except Exception as exc:
|
||||
logger.warning("DistributedWorker unavailable — task noted only: %s", exc)
|
||||
status = "noted"
|
||||
|
||||
logger.info(
|
||||
"Delegation intent: %s → %s (priority=%s)", agent_name, task_description[:80], priority
|
||||
"Delegated task %s: %s → %s (priority=%s, status=%s)",
|
||||
task_id or "?",
|
||||
agent_name,
|
||||
task_description[:80],
|
||||
priority,
|
||||
status,
|
||||
)
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"task_id": None,
|
||||
"task_id": task_id,
|
||||
"agent": agent_name,
|
||||
"role": available[agent_name],
|
||||
"status": "noted",
|
||||
"message": f"Delegation to {agent_name} ({available[agent_name]}): {task_description[:100]}",
|
||||
"role": agent_role,
|
||||
"status": status,
|
||||
"message": (
|
||||
f"Task {task_id or 'noted'}: delegated to {agent_name} ({agent_role}): "
|
||||
f"{task_description[:100]}"
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
|
||||
@@ -37,6 +37,7 @@ class VoiceTTS:
|
||||
|
||||
@property
|
||||
def available(self) -> bool:
|
||||
"""Whether the TTS engine initialized successfully and can produce audio."""
|
||||
return self._available
|
||||
|
||||
def speak(self, text: str) -> None:
|
||||
@@ -68,11 +69,13 @@ class VoiceTTS:
|
||||
logger.error("VoiceTTS: speech failed — %s", exc)
|
||||
|
||||
def set_rate(self, rate: int) -> None:
|
||||
"""Set speech rate in words per minute (typical range: 100–300, default 175)."""
|
||||
self._rate = rate
|
||||
if self._engine:
|
||||
self._engine.setProperty("rate", rate)
|
||||
|
||||
def set_volume(self, volume: float) -> None:
|
||||
"""Set speech volume. Value is clamped to the 0.0–1.0 range."""
|
||||
self._volume = max(0.0, min(1.0, volume))
|
||||
if self._engine:
|
||||
self._engine.setProperty("volume", self._volume)
|
||||
@@ -92,6 +95,7 @@ class VoiceTTS:
|
||||
return []
|
||||
|
||||
def set_voice(self, voice_id: str) -> None:
|
||||
"""Set the active TTS voice by system voice ID (see ``get_voices()``)."""
|
||||
if self._engine:
|
||||
self._engine.setProperty("voice", voice_id)
|
||||
|
||||
|
||||
@@ -2714,3 +2714,74 @@
|
||||
padding: 0.3rem 0.6rem;
|
||||
margin-bottom: 0.5rem;
|
||||
}
|
||||
|
||||
/* ── Self-Correction Dashboard ─────────────────────────────── */
|
||||
.sc-event {
|
||||
border-left: 3px solid var(--border);
|
||||
padding: 0.6rem 0.8rem;
|
||||
margin-bottom: 0.75rem;
|
||||
background: rgba(255,255,255,0.02);
|
||||
border-radius: 0 4px 4px 0;
|
||||
font-size: 0.82rem;
|
||||
}
|
||||
.sc-event.sc-status-success { border-left-color: var(--green); }
|
||||
.sc-event.sc-status-partial { border-left-color: var(--amber); }
|
||||
.sc-event.sc-status-failed { border-left-color: var(--red); }
|
||||
|
||||
.sc-event-header {
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 0.5rem;
|
||||
margin-bottom: 0.4rem;
|
||||
flex-wrap: wrap;
|
||||
}
|
||||
.sc-status-badge {
|
||||
font-size: 0.68rem;
|
||||
font-weight: 700;
|
||||
letter-spacing: 0.06em;
|
||||
padding: 0.15rem 0.45rem;
|
||||
border-radius: 3px;
|
||||
}
|
||||
.sc-status-badge.sc-status-success { color: var(--green); background: rgba(0,255,136,0.08); }
|
||||
.sc-status-badge.sc-status-partial { color: var(--amber); background: rgba(255,179,0,0.08); }
|
||||
.sc-status-badge.sc-status-failed { color: var(--red); background: rgba(255,59,59,0.08); }
|
||||
|
||||
.sc-source-badge {
|
||||
font-size: 0.68rem;
|
||||
color: var(--purple);
|
||||
background: rgba(168,85,247,0.1);
|
||||
padding: 0.1rem 0.4rem;
|
||||
border-radius: 3px;
|
||||
}
|
||||
.sc-event-time { font-size: 0.68rem; color: var(--text-dim); margin-left: auto; }
|
||||
.sc-event-error-type {
|
||||
font-size: 0.72rem;
|
||||
color: var(--amber);
|
||||
font-weight: 600;
|
||||
margin-bottom: 0.3rem;
|
||||
letter-spacing: 0.04em;
|
||||
}
|
||||
.sc-label {
|
||||
font-size: 0.65rem;
|
||||
font-weight: 700;
|
||||
letter-spacing: 0.06em;
|
||||
color: var(--text-dim);
|
||||
margin-right: 0.3rem;
|
||||
}
|
||||
.sc-event-intent, .sc-event-error, .sc-event-strategy, .sc-event-outcome {
|
||||
color: var(--text);
|
||||
margin-bottom: 0.2rem;
|
||||
line-height: 1.4;
|
||||
word-break: break-word;
|
||||
}
|
||||
.sc-event-error { color: var(--red); }
|
||||
.sc-event-strategy { color: var(--text-dim); font-style: italic; }
|
||||
.sc-event-outcome { color: var(--text-bright); }
|
||||
.sc-event-meta { font-size: 0.68rem; color: var(--text-dim); margin-top: 0.3rem; }
|
||||
|
||||
.sc-pattern-type {
|
||||
font-family: var(--font);
|
||||
font-size: 0.8rem;
|
||||
color: var(--text-bright);
|
||||
word-break: break-all;
|
||||
}
|
||||
|
||||
@@ -7,6 +7,8 @@ from unittest.mock import patch
|
||||
import pytest
|
||||
|
||||
import infrastructure.events.bus as bus_module
|
||||
|
||||
pytestmark = pytest.mark.unit
|
||||
from infrastructure.events.bus import (
|
||||
Event,
|
||||
EventBus,
|
||||
@@ -352,6 +354,14 @@ class TestEventBusPersistence:
|
||||
events = bus.replay()
|
||||
assert events == []
|
||||
|
||||
def test_init_persistence_db_noop_when_path_is_none(self):
|
||||
"""_init_persistence_db() is a no-op when _persistence_db_path is None."""
|
||||
bus = EventBus()
|
||||
# _persistence_db_path is None by default; calling _init_persistence_db
|
||||
# should silently return without touching the filesystem.
|
||||
bus._init_persistence_db() # must not raise
|
||||
assert bus._persistence_db_path is None
|
||||
|
||||
async def test_wal_mode_on_persistence_db(self, persistent_bus):
|
||||
"""Persistence database should use WAL mode."""
|
||||
conn = sqlite3.connect(str(persistent_bus._persistence_db_path))
|
||||
|
||||
589
tests/infrastructure/test_graceful_degradation.py
Normal file
589
tests/infrastructure/test_graceful_degradation.py
Normal file
@@ -0,0 +1,589 @@
|
||||
"""Graceful degradation test scenarios — Issue #919.
|
||||
|
||||
Tests specifically for service failure paths and fallback logic:
|
||||
|
||||
* Ollama health-check failures (connection refused, timeout, HTTP errors)
|
||||
* Cascade router: Ollama down → falls back to Anthropic/cloud provider
|
||||
* Circuit-breaker lifecycle: CLOSED → OPEN (repeated failures) → HALF_OPEN (recovery window)
|
||||
* All providers fail → descriptive RuntimeError
|
||||
* Disabled provider skipped without touching circuit breaker
|
||||
* ``requests`` library unavailable → optimistic availability assumption
|
||||
* ClaudeBackend / GrokBackend no-key graceful messages
|
||||
* Chat store: SQLite directory auto-creation and concurrent access safety
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import threading
|
||||
from pathlib import Path
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from infrastructure.router.cascade import (
|
||||
CascadeRouter,
|
||||
CircuitState,
|
||||
Provider,
|
||||
ProviderStatus,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _make_ollama_provider(name: str = "local-ollama", priority: int = 1) -> Provider:
|
||||
return Provider(
|
||||
name=name,
|
||||
type="ollama",
|
||||
enabled=True,
|
||||
priority=priority,
|
||||
url="http://localhost:11434",
|
||||
models=[{"name": "llama3", "default": True}],
|
||||
)
|
||||
|
||||
|
||||
def _make_anthropic_provider(name: str = "cloud-fallback", priority: int = 2) -> Provider:
|
||||
return Provider(
|
||||
name=name,
|
||||
type="anthropic",
|
||||
enabled=True,
|
||||
priority=priority,
|
||||
api_key="sk-ant-test",
|
||||
models=[{"name": "claude-haiku-4-5-20251001", "default": True}],
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Ollama health-check failure scenarios
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
class TestOllamaHealthCheckFailures:
|
||||
"""_check_provider_available returns False for all Ollama failure modes."""
|
||||
|
||||
def _router(self) -> CascadeRouter:
|
||||
return CascadeRouter(config_path=Path("/nonexistent"))
|
||||
|
||||
def test_connection_refused_returns_false(self):
|
||||
"""Connection refused during Ollama health check → provider excluded."""
|
||||
router = self._router()
|
||||
provider = _make_ollama_provider()
|
||||
|
||||
with patch("infrastructure.router.cascade.requests") as mock_req:
|
||||
mock_req.get.side_effect = ConnectionError("Connection refused")
|
||||
assert router._check_provider_available(provider) is False
|
||||
|
||||
def test_timeout_returns_false(self):
|
||||
"""Request timeout during Ollama health check → provider excluded."""
|
||||
router = self._router()
|
||||
provider = _make_ollama_provider()
|
||||
|
||||
with patch("infrastructure.router.cascade.requests") as mock_req:
|
||||
# Simulate a timeout using a generic OSError (matches real-world timeout behaviour)
|
||||
mock_req.get.side_effect = OSError("timed out")
|
||||
assert router._check_provider_available(provider) is False
|
||||
|
||||
def test_http_503_returns_false(self):
|
||||
"""HTTP 503 from Ollama health endpoint → provider excluded."""
|
||||
router = self._router()
|
||||
provider = _make_ollama_provider()
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.status_code = 503
|
||||
|
||||
with patch("infrastructure.router.cascade.requests") as mock_req:
|
||||
mock_req.get.return_value = mock_response
|
||||
assert router._check_provider_available(provider) is False
|
||||
|
||||
def test_http_500_returns_false(self):
|
||||
"""HTTP 500 from Ollama health endpoint → provider excluded."""
|
||||
router = self._router()
|
||||
provider = _make_ollama_provider()
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.status_code = 500
|
||||
|
||||
with patch("infrastructure.router.cascade.requests") as mock_req:
|
||||
mock_req.get.return_value = mock_response
|
||||
assert router._check_provider_available(provider) is False
|
||||
|
||||
def test_generic_exception_returns_false(self):
|
||||
"""Unexpected exception during Ollama check → provider excluded (no crash)."""
|
||||
router = self._router()
|
||||
provider = _make_ollama_provider()
|
||||
|
||||
with patch("infrastructure.router.cascade.requests") as mock_req:
|
||||
mock_req.get.side_effect = RuntimeError("unexpected error")
|
||||
assert router._check_provider_available(provider) is False
|
||||
|
||||
def test_requests_unavailable_assumes_available(self):
|
||||
"""When ``requests`` lib is None, Ollama availability is assumed True."""
|
||||
import infrastructure.router.cascade as cascade_module
|
||||
|
||||
router = self._router()
|
||||
provider = _make_ollama_provider()
|
||||
|
||||
old_requests = cascade_module.requests
|
||||
cascade_module.requests = None
|
||||
try:
|
||||
assert router._check_provider_available(provider) is True
|
||||
finally:
|
||||
cascade_module.requests = old_requests
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Cascade: Ollama fails → Anthropic fallback
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
class TestOllamaToAnthropicFallback:
|
||||
"""Cascade router falls back to Anthropic when Ollama is unavailable or failing."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_ollama_connection_refused_falls_back_to_anthropic(self):
|
||||
"""When Ollama raises a connection error, cascade uses Anthropic provider."""
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
ollama_provider = _make_ollama_provider(priority=1)
|
||||
anthropic_provider = _make_anthropic_provider(priority=2)
|
||||
router.providers = [ollama_provider, anthropic_provider]
|
||||
|
||||
with (
|
||||
patch.object(router, "_call_ollama", side_effect=ConnectionError("refused")),
|
||||
patch.object(
|
||||
router,
|
||||
"_call_anthropic",
|
||||
new_callable=AsyncMock,
|
||||
return_value={"content": "fallback response", "model": "claude-haiku-4-5-20251001"},
|
||||
),
|
||||
# Allow cloud bypass of the metabolic quota gate in test
|
||||
patch.object(router, "_quota_allows_cloud", return_value=True),
|
||||
):
|
||||
result = await router.complete(
|
||||
messages=[{"role": "user", "content": "hello"}],
|
||||
model="llama3",
|
||||
)
|
||||
|
||||
assert result["provider"] == "cloud-fallback"
|
||||
assert "fallback response" in result["content"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_ollama_circuit_open_skips_to_anthropic(self):
|
||||
"""When Ollama circuit is OPEN, cascade skips directly to Anthropic."""
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
ollama_provider = _make_ollama_provider(priority=1)
|
||||
anthropic_provider = _make_anthropic_provider(priority=2)
|
||||
router.providers = [ollama_provider, anthropic_provider]
|
||||
|
||||
# Force the circuit open on Ollama
|
||||
ollama_provider.circuit_state = CircuitState.OPEN
|
||||
ollama_provider.status = ProviderStatus.UNHEALTHY
|
||||
import time
|
||||
|
||||
ollama_provider.circuit_opened_at = time.time() # just opened — not yet recoverable
|
||||
|
||||
with (
|
||||
patch.object(
|
||||
router,
|
||||
"_call_anthropic",
|
||||
new_callable=AsyncMock,
|
||||
return_value={"content": "cloud answer", "model": "claude-haiku-4-5-20251001"},
|
||||
) as mock_anthropic,
|
||||
# Allow cloud bypass of the metabolic quota gate in test
|
||||
patch.object(router, "_quota_allows_cloud", return_value=True),
|
||||
):
|
||||
result = await router.complete(
|
||||
messages=[{"role": "user", "content": "ping"}],
|
||||
)
|
||||
|
||||
mock_anthropic.assert_called_once()
|
||||
assert result["provider"] == "cloud-fallback"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_all_providers_fail_raises_runtime_error(self):
|
||||
"""When every provider fails, RuntimeError is raised with combined error info."""
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
ollama_provider = _make_ollama_provider(priority=1)
|
||||
anthropic_provider = _make_anthropic_provider(priority=2)
|
||||
router.providers = [ollama_provider, anthropic_provider]
|
||||
|
||||
with (
|
||||
patch.object(router, "_call_ollama", side_effect=RuntimeError("Ollama down")),
|
||||
patch.object(router, "_call_anthropic", side_effect=RuntimeError("API quota exceeded")),
|
||||
patch.object(router, "_quota_allows_cloud", return_value=True),
|
||||
):
|
||||
with pytest.raises(RuntimeError, match="All providers failed"):
|
||||
await router.complete(messages=[{"role": "user", "content": "test"}])
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_error_message_includes_individual_provider_errors(self):
|
||||
"""RuntimeError from all-fail scenario lists each provider's error."""
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
ollama_provider = _make_ollama_provider(priority=1)
|
||||
anthropic_provider = _make_anthropic_provider(priority=2)
|
||||
router.providers = [ollama_provider, anthropic_provider]
|
||||
router.config.max_retries_per_provider = 1
|
||||
|
||||
with (
|
||||
patch.object(router, "_call_ollama", side_effect=RuntimeError("connection refused")),
|
||||
patch.object(router, "_call_anthropic", side_effect=RuntimeError("rate limit")),
|
||||
patch.object(router, "_quota_allows_cloud", return_value=True),
|
||||
):
|
||||
with pytest.raises(RuntimeError) as exc_info:
|
||||
await router.complete(messages=[{"role": "user", "content": "test"}])
|
||||
|
||||
error_msg = str(exc_info.value)
|
||||
assert "connection refused" in error_msg
|
||||
assert "rate limit" in error_msg
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Circuit-breaker lifecycle
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
class TestCircuitBreakerLifecycle:
|
||||
"""Full CLOSED → OPEN → HALF_OPEN → CLOSED lifecycle."""
|
||||
|
||||
def test_closed_initially(self):
|
||||
"""New provider starts with circuit CLOSED and HEALTHY status."""
|
||||
provider = _make_ollama_provider()
|
||||
assert provider.circuit_state == CircuitState.CLOSED
|
||||
assert provider.status == ProviderStatus.HEALTHY
|
||||
|
||||
def test_open_after_threshold_failures(self):
|
||||
"""Circuit opens once consecutive failures reach the threshold."""
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
router.config.circuit_breaker_failure_threshold = 3
|
||||
provider = _make_ollama_provider()
|
||||
|
||||
for _ in range(3):
|
||||
router._record_failure(provider)
|
||||
|
||||
assert provider.circuit_state == CircuitState.OPEN
|
||||
assert provider.status == ProviderStatus.UNHEALTHY
|
||||
assert provider.circuit_opened_at is not None
|
||||
|
||||
def test_open_circuit_skips_provider(self):
|
||||
"""_is_provider_available returns False when circuit is OPEN (and timeout not elapsed)."""
|
||||
import time
|
||||
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
router.config.circuit_breaker_recovery_timeout = 9999 # won't elapse during test
|
||||
provider = _make_ollama_provider()
|
||||
provider.circuit_state = CircuitState.OPEN
|
||||
provider.status = ProviderStatus.UNHEALTHY
|
||||
provider.circuit_opened_at = time.time()
|
||||
|
||||
assert router._is_provider_available(provider) is False
|
||||
|
||||
def test_half_open_after_recovery_timeout(self):
|
||||
"""After the recovery timeout elapses, _is_provider_available transitions to HALF_OPEN."""
|
||||
import time
|
||||
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
router.config.circuit_breaker_recovery_timeout = 0.01 # 10 ms
|
||||
|
||||
provider = _make_ollama_provider()
|
||||
provider.circuit_state = CircuitState.OPEN
|
||||
provider.status = ProviderStatus.UNHEALTHY
|
||||
provider.circuit_opened_at = time.time() - 1.0 # clearly elapsed
|
||||
|
||||
result = router._is_provider_available(provider)
|
||||
|
||||
assert result is True
|
||||
assert provider.circuit_state == CircuitState.HALF_OPEN
|
||||
|
||||
def test_closed_after_half_open_successes(self):
|
||||
"""Circuit closes after enough successful half-open test calls."""
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
router.config.circuit_breaker_half_open_max_calls = 2
|
||||
|
||||
provider = _make_ollama_provider()
|
||||
provider.circuit_state = CircuitState.HALF_OPEN
|
||||
provider.half_open_calls = 0
|
||||
|
||||
router._record_success(provider, 50.0)
|
||||
assert provider.circuit_state == CircuitState.HALF_OPEN # not yet
|
||||
|
||||
router._record_success(provider, 50.0)
|
||||
assert provider.circuit_state == CircuitState.CLOSED
|
||||
assert provider.status == ProviderStatus.HEALTHY
|
||||
assert provider.metrics.consecutive_failures == 0
|
||||
|
||||
def test_failure_in_half_open_reopens_circuit(self):
|
||||
"""A failure during HALF_OPEN increments consecutive failures, reopening if threshold met."""
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
router.config.circuit_breaker_failure_threshold = 1 # reopen on first failure
|
||||
|
||||
provider = _make_ollama_provider()
|
||||
provider.circuit_state = CircuitState.HALF_OPEN
|
||||
|
||||
router._record_failure(provider)
|
||||
|
||||
assert provider.circuit_state == CircuitState.OPEN
|
||||
|
||||
def test_disabled_provider_skipped_without_circuit_change(self):
|
||||
"""A disabled provider is immediately rejected; its circuit state is not touched."""
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
provider = _make_ollama_provider()
|
||||
provider.enabled = False
|
||||
|
||||
available = router._is_provider_available(provider)
|
||||
|
||||
assert available is False
|
||||
assert provider.circuit_state == CircuitState.CLOSED # unchanged
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# ClaudeBackend graceful degradation
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
class TestClaudeBackendGracefulDegradation:
|
||||
"""ClaudeBackend degrades gracefully when the API is unavailable."""
|
||||
|
||||
def test_run_no_key_returns_unconfigured_message(self):
|
||||
"""run() returns a graceful message when no API key is set."""
|
||||
from timmy.backends import ClaudeBackend
|
||||
|
||||
backend = ClaudeBackend(api_key="", model="haiku")
|
||||
result = backend.run("hello")
|
||||
|
||||
assert "not configured" in result.content.lower()
|
||||
assert "ANTHROPIC_API_KEY" in result.content
|
||||
|
||||
def test_run_api_error_returns_unavailable_message(self):
|
||||
"""run() returns a graceful error when the Anthropic API raises."""
|
||||
from timmy.backends import ClaudeBackend
|
||||
|
||||
backend = ClaudeBackend(api_key="sk-ant-test", model="haiku")
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_client.messages.create.side_effect = ConnectionError("API unreachable")
|
||||
|
||||
with patch.object(backend, "_get_client", return_value=mock_client):
|
||||
result = backend.run("ping")
|
||||
|
||||
assert "unavailable" in result.content.lower()
|
||||
|
||||
def test_health_check_no_key_reports_error(self):
|
||||
"""health_check() reports not-ok when API key is missing."""
|
||||
from timmy.backends import ClaudeBackend
|
||||
|
||||
backend = ClaudeBackend(api_key="", model="haiku")
|
||||
status = backend.health_check()
|
||||
|
||||
assert status["ok"] is False
|
||||
assert "ANTHROPIC_API_KEY" in status["error"]
|
||||
|
||||
def test_health_check_api_error_reports_error(self):
|
||||
"""health_check() returns ok=False and captures the error on API failure."""
|
||||
from timmy.backends import ClaudeBackend
|
||||
|
||||
backend = ClaudeBackend(api_key="sk-ant-test", model="haiku")
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_client.messages.create.side_effect = RuntimeError("connection timed out")
|
||||
|
||||
with patch.object(backend, "_get_client", return_value=mock_client):
|
||||
status = backend.health_check()
|
||||
|
||||
assert status["ok"] is False
|
||||
assert "connection timed out" in status["error"]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# GrokBackend graceful degradation
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
class TestGrokBackendGracefulDegradation:
|
||||
"""GrokBackend degrades gracefully when xAI API is unavailable."""
|
||||
|
||||
def test_run_no_key_returns_unconfigured_message(self):
|
||||
"""run() returns a graceful message when no XAI_API_KEY is set."""
|
||||
from timmy.backends import GrokBackend
|
||||
|
||||
backend = GrokBackend(api_key="", model="grok-3-mini")
|
||||
result = backend.run("hello")
|
||||
|
||||
assert "not configured" in result.content.lower()
|
||||
|
||||
def test_run_api_error_returns_unavailable_message(self):
|
||||
"""run() returns graceful error when xAI API raises."""
|
||||
from timmy.backends import GrokBackend
|
||||
|
||||
backend = GrokBackend(api_key="xai-test-key", model="grok-3-mini")
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_client.chat.completions.create.side_effect = RuntimeError("network error")
|
||||
|
||||
with patch.object(backend, "_get_client", return_value=mock_client):
|
||||
result = backend.run("ping")
|
||||
|
||||
assert "unavailable" in result.content.lower()
|
||||
|
||||
def test_health_check_no_key_reports_error(self):
|
||||
"""health_check() reports not-ok when XAI_API_KEY is missing."""
|
||||
from timmy.backends import GrokBackend
|
||||
|
||||
backend = GrokBackend(api_key="", model="grok-3-mini")
|
||||
status = backend.health_check()
|
||||
|
||||
assert status["ok"] is False
|
||||
assert "XAI_API_KEY" in status["error"]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Chat store: SQLite resilience
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
class TestChatStoreSQLiteResilience:
|
||||
"""MessageLog handles edge cases without crashing."""
|
||||
|
||||
def test_auto_creates_missing_parent_directory(self, tmp_path):
|
||||
"""MessageLog creates the data directory automatically on first use."""
|
||||
from infrastructure.chat_store import MessageLog
|
||||
|
||||
db_path = tmp_path / "deep" / "nested" / "chat.db"
|
||||
assert not db_path.parent.exists()
|
||||
|
||||
log = MessageLog(db_path=db_path)
|
||||
log.append("user", "hello", "2026-01-01T00:00:00")
|
||||
|
||||
assert db_path.exists()
|
||||
assert len(log) == 1
|
||||
log.close()
|
||||
|
||||
def test_concurrent_appends_are_safe(self, tmp_path):
|
||||
"""Multiple threads appending simultaneously do not corrupt the DB."""
|
||||
from infrastructure.chat_store import MessageLog
|
||||
|
||||
db_path = tmp_path / "chat.db"
|
||||
log = MessageLog(db_path=db_path)
|
||||
|
||||
errors: list[Exception] = []
|
||||
|
||||
def write_messages(thread_id: int) -> None:
|
||||
try:
|
||||
for i in range(10):
|
||||
log.append("user", f"thread {thread_id} msg {i}", "2026-01-01T00:00:00")
|
||||
except Exception as exc:
|
||||
errors.append(exc)
|
||||
|
||||
threads = [threading.Thread(target=write_messages, args=(t,)) for t in range(5)]
|
||||
for t in threads:
|
||||
t.start()
|
||||
for t in threads:
|
||||
t.join()
|
||||
|
||||
assert errors == [], f"Concurrent writes produced errors: {errors}"
|
||||
# 5 threads × 10 messages each
|
||||
assert len(log) == 50
|
||||
log.close()
|
||||
|
||||
def test_all_returns_messages_in_insertion_order(self, tmp_path):
|
||||
"""all() returns messages ordered oldest-first."""
|
||||
from infrastructure.chat_store import MessageLog
|
||||
|
||||
db_path = tmp_path / "chat.db"
|
||||
log = MessageLog(db_path=db_path)
|
||||
log.append("user", "first", "2026-01-01T00:00:00")
|
||||
log.append("agent", "second", "2026-01-01T00:00:01")
|
||||
log.append("user", "third", "2026-01-01T00:00:02")
|
||||
|
||||
messages = log.all()
|
||||
assert [m.content for m in messages] == ["first", "second", "third"]
|
||||
log.close()
|
||||
|
||||
def test_recent_returns_latest_n_messages(self, tmp_path):
|
||||
"""recent(n) returns the n most recent messages, oldest-first within the slice."""
|
||||
from infrastructure.chat_store import MessageLog
|
||||
|
||||
db_path = tmp_path / "chat.db"
|
||||
log = MessageLog(db_path=db_path)
|
||||
for i in range(20):
|
||||
log.append("user", f"msg {i}", f"2026-01-01T00:{i:02d}:00")
|
||||
|
||||
recent = log.recent(5)
|
||||
assert len(recent) == 5
|
||||
assert recent[0].content == "msg 15"
|
||||
assert recent[-1].content == "msg 19"
|
||||
log.close()
|
||||
|
||||
def test_prune_keeps_max_messages(self, tmp_path):
|
||||
"""append() prunes oldest messages when count exceeds MAX_MESSAGES."""
|
||||
import infrastructure.chat_store as store_mod
|
||||
from infrastructure.chat_store import MessageLog
|
||||
|
||||
original_max = store_mod.MAX_MESSAGES
|
||||
store_mod.MAX_MESSAGES = 5
|
||||
try:
|
||||
db_path = tmp_path / "chat.db"
|
||||
log = MessageLog(db_path=db_path)
|
||||
for i in range(8):
|
||||
log.append("user", f"msg {i}", "2026-01-01T00:00:00")
|
||||
|
||||
assert len(log) == 5
|
||||
messages = log.all()
|
||||
# Oldest 3 should be pruned
|
||||
assert messages[0].content == "msg 3"
|
||||
log.close()
|
||||
finally:
|
||||
store_mod.MAX_MESSAGES = original_max
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Provider availability: requests lib missing
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
class TestRequestsLibraryMissing:
|
||||
"""When ``requests`` is not installed, providers assume they are available."""
|
||||
|
||||
def _swap_requests(self, value):
|
||||
import infrastructure.router.cascade as cascade_module
|
||||
|
||||
old = cascade_module.requests
|
||||
cascade_module.requests = value
|
||||
return old
|
||||
|
||||
def test_ollama_assumes_available_without_requests(self):
|
||||
"""Ollama provider returns True when requests is None."""
|
||||
import infrastructure.router.cascade as cascade_module
|
||||
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
provider = _make_ollama_provider()
|
||||
old = self._swap_requests(None)
|
||||
try:
|
||||
assert router._check_provider_available(provider) is True
|
||||
finally:
|
||||
cascade_module.requests = old
|
||||
|
||||
def test_vllm_mlx_assumes_available_without_requests(self):
|
||||
"""vllm-mlx provider returns True when requests is None."""
|
||||
import infrastructure.router.cascade as cascade_module
|
||||
|
||||
router = CascadeRouter(config_path=Path("/nonexistent"))
|
||||
provider = Provider(
|
||||
name="vllm-local",
|
||||
type="vllm_mlx",
|
||||
enabled=True,
|
||||
priority=1,
|
||||
base_url="http://localhost:8000/v1",
|
||||
)
|
||||
old = self._swap_requests(None)
|
||||
try:
|
||||
assert router._check_provider_available(provider) is True
|
||||
finally:
|
||||
cascade_module.requests = old
|
||||
0
tests/self_coding/__init__.py
Normal file
0
tests/self_coding/__init__.py
Normal file
363
tests/self_coding/test_loop.py
Normal file
363
tests/self_coding/test_loop.py
Normal file
@@ -0,0 +1,363 @@
|
||||
"""Unit tests for the self-modification loop.
|
||||
|
||||
Covers:
|
||||
- Protected branch guard
|
||||
- Successful cycle (mocked git + tests)
|
||||
- Edit function failure → branch reverted, no commit
|
||||
- Test failure → branch reverted, no commit
|
||||
- Gitea PR creation plumbing
|
||||
- GiteaClient graceful degradation (no token, network error)
|
||||
|
||||
All git and subprocess calls are mocked so these run offline without
|
||||
a real repo or test suite.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _make_loop(repo_root="/tmp/fake-repo"):
|
||||
"""Construct a SelfModifyLoop with a fake repo root."""
|
||||
from self_coding.self_modify.loop import SelfModifyLoop
|
||||
|
||||
return SelfModifyLoop(repo_root=repo_root, remote="origin", base_branch="main")
|
||||
|
||||
|
||||
def _noop_edit(repo_root: str) -> None:
|
||||
"""Edit function that does nothing."""
|
||||
|
||||
|
||||
def _failing_edit(repo_root: str) -> None:
|
||||
"""Edit function that raises."""
|
||||
raise RuntimeError("edit exploded")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Guard tests (sync — no git calls needed)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_guard_blocks_main():
|
||||
loop = _make_loop()
|
||||
with pytest.raises(ValueError, match="protected branch"):
|
||||
loop._guard_branch("main")
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_guard_blocks_master():
|
||||
loop = _make_loop()
|
||||
with pytest.raises(ValueError, match="protected branch"):
|
||||
loop._guard_branch("master")
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_guard_allows_feature_branch():
|
||||
loop = _make_loop()
|
||||
# Should not raise
|
||||
loop._guard_branch("self-modify/some-feature")
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_guard_allows_self_modify_prefix():
|
||||
loop = _make_loop()
|
||||
loop._guard_branch("self-modify/issue-983")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Full cycle — success path
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_success():
|
||||
"""Happy path: edit succeeds, tests pass, PR created."""
|
||||
loop = _make_loop()
|
||||
|
||||
fake_completed = MagicMock()
|
||||
fake_completed.stdout = "abc1234\n"
|
||||
fake_completed.returncode = 0
|
||||
|
||||
fake_test_result = MagicMock()
|
||||
fake_test_result.stdout = "3 passed"
|
||||
fake_test_result.stderr = ""
|
||||
fake_test_result.returncode = 0
|
||||
|
||||
from self_coding.gitea_client import PullRequest as _PR
|
||||
|
||||
fake_pr = _PR(number=42, title="test PR", html_url="http://gitea/pr/42")
|
||||
|
||||
with (
|
||||
patch.object(loop, "_git", return_value=fake_completed),
|
||||
patch("subprocess.run", return_value=fake_test_result),
|
||||
patch.object(loop, "_create_pr", return_value=fake_pr),
|
||||
):
|
||||
result = await loop.run(
|
||||
slug="test-feature",
|
||||
description="Add test feature",
|
||||
edit_fn=_noop_edit,
|
||||
issue_number=983,
|
||||
)
|
||||
|
||||
assert result.success is True
|
||||
assert result.branch == "self-modify/test-feature"
|
||||
assert result.pr_url == "http://gitea/pr/42"
|
||||
assert result.pr_number == 42
|
||||
assert "3 passed" in result.test_output
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_skips_tests_when_flag_set():
|
||||
"""skip_tests=True should bypass the test gate."""
|
||||
loop = _make_loop()
|
||||
|
||||
fake_completed = MagicMock()
|
||||
fake_completed.stdout = "deadbeef\n"
|
||||
fake_completed.returncode = 0
|
||||
|
||||
with (
|
||||
patch.object(loop, "_git", return_value=fake_completed),
|
||||
patch.object(loop, "_create_pr", return_value=None),
|
||||
patch("subprocess.run") as mock_run,
|
||||
):
|
||||
result = await loop.run(
|
||||
slug="skip-test-feature",
|
||||
description="Skip test feature",
|
||||
edit_fn=_noop_edit,
|
||||
skip_tests=True,
|
||||
)
|
||||
|
||||
# subprocess.run should NOT be called for tests
|
||||
mock_run.assert_not_called()
|
||||
assert result.success is True
|
||||
assert "(tests skipped)" in result.test_output
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Failure paths
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_reverts_on_edit_failure():
|
||||
"""If edit_fn raises, the branch should be reverted and no commit made."""
|
||||
loop = _make_loop()
|
||||
|
||||
fake_completed = MagicMock()
|
||||
fake_completed.stdout = ""
|
||||
fake_completed.returncode = 0
|
||||
|
||||
revert_called = []
|
||||
|
||||
def _fake_revert(branch):
|
||||
revert_called.append(branch)
|
||||
|
||||
with (
|
||||
patch.object(loop, "_git", return_value=fake_completed),
|
||||
patch.object(loop, "_revert_branch", side_effect=_fake_revert),
|
||||
patch.object(loop, "_commit_all") as mock_commit,
|
||||
):
|
||||
result = await loop.run(
|
||||
slug="broken-edit",
|
||||
description="This will fail",
|
||||
edit_fn=_failing_edit,
|
||||
skip_tests=True,
|
||||
)
|
||||
|
||||
assert result.success is False
|
||||
assert "edit exploded" in result.error
|
||||
assert "self-modify/broken-edit" in revert_called
|
||||
mock_commit.assert_not_called()
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_reverts_on_test_failure():
|
||||
"""If tests fail, branch should be reverted and no commit made."""
|
||||
loop = _make_loop()
|
||||
|
||||
fake_completed = MagicMock()
|
||||
fake_completed.stdout = ""
|
||||
fake_completed.returncode = 0
|
||||
|
||||
fake_test_result = MagicMock()
|
||||
fake_test_result.stdout = "FAILED test_foo"
|
||||
fake_test_result.stderr = "1 failed"
|
||||
fake_test_result.returncode = 1
|
||||
|
||||
revert_called = []
|
||||
|
||||
def _fake_revert(branch):
|
||||
revert_called.append(branch)
|
||||
|
||||
with (
|
||||
patch.object(loop, "_git", return_value=fake_completed),
|
||||
patch("subprocess.run", return_value=fake_test_result),
|
||||
patch.object(loop, "_revert_branch", side_effect=_fake_revert),
|
||||
patch.object(loop, "_commit_all") as mock_commit,
|
||||
):
|
||||
result = await loop.run(
|
||||
slug="tests-will-fail",
|
||||
description="This will fail tests",
|
||||
edit_fn=_noop_edit,
|
||||
)
|
||||
|
||||
assert result.success is False
|
||||
assert "Tests failed" in result.error
|
||||
assert "self-modify/tests-will-fail" in revert_called
|
||||
mock_commit.assert_not_called()
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_slug_with_main_creates_safe_branch():
|
||||
"""A slug of 'main' produces branch 'self-modify/main', which is not protected."""
|
||||
|
||||
loop = _make_loop()
|
||||
|
||||
fake_completed = MagicMock()
|
||||
fake_completed.stdout = "deadbeef\n"
|
||||
fake_completed.returncode = 0
|
||||
|
||||
# 'self-modify/main' is NOT in _PROTECTED_BRANCHES so the run should succeed
|
||||
with (
|
||||
patch.object(loop, "_git", return_value=fake_completed),
|
||||
patch.object(loop, "_create_pr", return_value=None),
|
||||
):
|
||||
result = await loop.run(
|
||||
slug="main",
|
||||
description="try to write to self-modify/main",
|
||||
edit_fn=_noop_edit,
|
||||
skip_tests=True,
|
||||
)
|
||||
assert result.branch == "self-modify/main"
|
||||
assert result.success is True
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# GiteaClient tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_gitea_client_returns_none_without_token():
|
||||
"""GiteaClient should return None gracefully when no token is set."""
|
||||
from self_coding.gitea_client import GiteaClient
|
||||
|
||||
client = GiteaClient(base_url="http://localhost:3000", token="", repo="owner/repo")
|
||||
pr = client.create_pull_request(
|
||||
title="Test PR",
|
||||
body="body",
|
||||
head="self-modify/test",
|
||||
)
|
||||
assert pr is None
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_gitea_client_comment_returns_false_without_token():
|
||||
"""add_issue_comment should return False gracefully when no token is set."""
|
||||
from self_coding.gitea_client import GiteaClient
|
||||
|
||||
client = GiteaClient(base_url="http://localhost:3000", token="", repo="owner/repo")
|
||||
result = client.add_issue_comment(123, "hello")
|
||||
assert result is False
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_gitea_client_create_pr_handles_network_error():
|
||||
"""create_pull_request should return None on network failure."""
|
||||
from self_coding.gitea_client import GiteaClient
|
||||
|
||||
client = GiteaClient(base_url="http://localhost:3000", token="fake-token", repo="owner/repo")
|
||||
|
||||
mock_requests = MagicMock()
|
||||
mock_requests.post.side_effect = Exception("Connection refused")
|
||||
mock_requests.exceptions.ConnectionError = Exception
|
||||
|
||||
with patch.dict("sys.modules", {"requests": mock_requests}):
|
||||
pr = client.create_pull_request(
|
||||
title="Test PR",
|
||||
body="body",
|
||||
head="self-modify/test",
|
||||
)
|
||||
assert pr is None
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_gitea_client_comment_handles_network_error():
|
||||
"""add_issue_comment should return False on network failure."""
|
||||
from self_coding.gitea_client import GiteaClient
|
||||
|
||||
client = GiteaClient(base_url="http://localhost:3000", token="fake-token", repo="owner/repo")
|
||||
|
||||
mock_requests = MagicMock()
|
||||
mock_requests.post.side_effect = Exception("Connection refused")
|
||||
|
||||
with patch.dict("sys.modules", {"requests": mock_requests}):
|
||||
result = client.add_issue_comment(456, "hello")
|
||||
assert result is False
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_gitea_client_create_pr_success():
|
||||
"""create_pull_request should return a PullRequest on HTTP 201."""
|
||||
from self_coding.gitea_client import GiteaClient, PullRequest
|
||||
|
||||
client = GiteaClient(base_url="http://localhost:3000", token="tok", repo="owner/repo")
|
||||
|
||||
fake_resp = MagicMock()
|
||||
fake_resp.raise_for_status = MagicMock()
|
||||
fake_resp.json.return_value = {
|
||||
"number": 77,
|
||||
"title": "Test PR",
|
||||
"html_url": "http://localhost:3000/owner/repo/pulls/77",
|
||||
}
|
||||
|
||||
mock_requests = MagicMock()
|
||||
mock_requests.post.return_value = fake_resp
|
||||
|
||||
with patch.dict("sys.modules", {"requests": mock_requests}):
|
||||
pr = client.create_pull_request("Test PR", "body", "self-modify/feat")
|
||||
|
||||
assert isinstance(pr, PullRequest)
|
||||
assert pr.number == 77
|
||||
assert pr.html_url == "http://localhost:3000/owner/repo/pulls/77"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# LoopResult dataclass
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_loop_result_defaults():
|
||||
from self_coding.self_modify.loop import LoopResult
|
||||
|
||||
r = LoopResult(success=True)
|
||||
assert r.branch == ""
|
||||
assert r.commit_sha == ""
|
||||
assert r.pr_url == ""
|
||||
assert r.pr_number == 0
|
||||
assert r.test_output == ""
|
||||
assert r.error == ""
|
||||
assert r.elapsed_ms == 0.0
|
||||
assert r.metadata == {}
|
||||
|
||||
|
||||
@pytest.mark.unit
|
||||
def test_loop_result_failure():
|
||||
from self_coding.self_modify.loop import LoopResult
|
||||
|
||||
r = LoopResult(success=False, error="something broke", branch="self-modify/test")
|
||||
assert r.success is False
|
||||
assert r.error == "something broke"
|
||||
839
tests/timmy/test_quest_system.py
Normal file
839
tests/timmy/test_quest_system.py
Normal file
@@ -0,0 +1,839 @@
|
||||
"""Unit tests for timmy.quest_system."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from datetime import UTC, datetime, timedelta
|
||||
from typing import Any
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
import timmy.quest_system as qs
|
||||
from timmy.quest_system import (
|
||||
QuestDefinition,
|
||||
QuestProgress,
|
||||
QuestStatus,
|
||||
QuestType,
|
||||
_get_progress_key,
|
||||
_get_target_value,
|
||||
_is_on_cooldown,
|
||||
check_daily_run_quest,
|
||||
check_issue_count_quest,
|
||||
check_issue_reduce_quest,
|
||||
claim_quest_reward,
|
||||
evaluate_quest_progress,
|
||||
get_active_quests,
|
||||
get_agent_quests_status,
|
||||
get_or_create_progress,
|
||||
get_quest_definition,
|
||||
get_quest_definitions,
|
||||
get_quest_leaderboard,
|
||||
get_quest_progress,
|
||||
load_quest_config,
|
||||
reset_quest_progress,
|
||||
update_quest_progress,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _make_quest(
|
||||
quest_id: str = "test_quest",
|
||||
quest_type: QuestType = QuestType.ISSUE_COUNT,
|
||||
reward_tokens: int = 10,
|
||||
enabled: bool = True,
|
||||
repeatable: bool = False,
|
||||
cooldown_hours: int = 0,
|
||||
criteria: dict[str, Any] | None = None,
|
||||
) -> QuestDefinition:
|
||||
return QuestDefinition(
|
||||
id=quest_id,
|
||||
name=f"Quest {quest_id}",
|
||||
description="Test quest",
|
||||
reward_tokens=reward_tokens,
|
||||
quest_type=quest_type,
|
||||
enabled=enabled,
|
||||
repeatable=repeatable,
|
||||
cooldown_hours=cooldown_hours,
|
||||
criteria=criteria or {"target_count": 3},
|
||||
notification_message="Quest Complete! You earned {tokens} tokens.",
|
||||
)
|
||||
|
||||
|
||||
@pytest.fixture(autouse=True)
|
||||
def clean_state():
|
||||
"""Reset module-level state before and after each test."""
|
||||
reset_quest_progress()
|
||||
qs._quest_definitions.clear()
|
||||
qs._quest_settings.clear()
|
||||
yield
|
||||
reset_quest_progress()
|
||||
qs._quest_definitions.clear()
|
||||
qs._quest_settings.clear()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# QuestDefinition
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestQuestDefinition:
|
||||
def test_from_dict_minimal(self):
|
||||
data = {"id": "q1"}
|
||||
defn = QuestDefinition.from_dict(data)
|
||||
assert defn.id == "q1"
|
||||
assert defn.name == "Unnamed Quest"
|
||||
assert defn.reward_tokens == 0
|
||||
assert defn.quest_type == QuestType.CUSTOM
|
||||
assert defn.enabled is True
|
||||
assert defn.repeatable is False
|
||||
assert defn.cooldown_hours == 0
|
||||
|
||||
def test_from_dict_full(self):
|
||||
data = {
|
||||
"id": "q2",
|
||||
"name": "Full Quest",
|
||||
"description": "A full quest",
|
||||
"reward_tokens": 50,
|
||||
"type": "issue_count",
|
||||
"enabled": False,
|
||||
"repeatable": True,
|
||||
"cooldown_hours": 24,
|
||||
"criteria": {"target_count": 5},
|
||||
"notification_message": "You earned {tokens}!",
|
||||
}
|
||||
defn = QuestDefinition.from_dict(data)
|
||||
assert defn.id == "q2"
|
||||
assert defn.name == "Full Quest"
|
||||
assert defn.reward_tokens == 50
|
||||
assert defn.quest_type == QuestType.ISSUE_COUNT
|
||||
assert defn.enabled is False
|
||||
assert defn.repeatable is True
|
||||
assert defn.cooldown_hours == 24
|
||||
assert defn.criteria == {"target_count": 5}
|
||||
assert defn.notification_message == "You earned {tokens}!"
|
||||
|
||||
def test_from_dict_invalid_type_raises(self):
|
||||
data = {"id": "q3", "type": "not_a_real_type"}
|
||||
with pytest.raises(ValueError):
|
||||
QuestDefinition.from_dict(data)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# QuestProgress
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestQuestProgress:
|
||||
def test_to_dict_roundtrip(self):
|
||||
progress = QuestProgress(
|
||||
quest_id="q1",
|
||||
agent_id="agent_a",
|
||||
status=QuestStatus.IN_PROGRESS,
|
||||
current_value=2,
|
||||
target_value=5,
|
||||
started_at="2026-01-01T00:00:00",
|
||||
metadata={"key": "val"},
|
||||
)
|
||||
d = progress.to_dict()
|
||||
assert d["quest_id"] == "q1"
|
||||
assert d["agent_id"] == "agent_a"
|
||||
assert d["status"] == "in_progress"
|
||||
assert d["current_value"] == 2
|
||||
assert d["target_value"] == 5
|
||||
assert d["metadata"] == {"key": "val"}
|
||||
|
||||
def test_to_dict_defaults(self):
|
||||
progress = QuestProgress(
|
||||
quest_id="q1",
|
||||
agent_id="agent_a",
|
||||
status=QuestStatus.NOT_STARTED,
|
||||
)
|
||||
d = progress.to_dict()
|
||||
assert d["completion_count"] == 0
|
||||
assert d["started_at"] == ""
|
||||
assert d["completed_at"] == ""
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _get_progress_key
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def test_get_progress_key():
|
||||
assert _get_progress_key("q1", "agent_a") == "agent_a:q1"
|
||||
|
||||
|
||||
def test_get_progress_key_different_agents():
|
||||
key_a = _get_progress_key("q1", "agent_a")
|
||||
key_b = _get_progress_key("q1", "agent_b")
|
||||
assert key_a != key_b
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_quest_config
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestLoadQuestConfig:
|
||||
def test_missing_file_returns_empty(self, tmp_path):
|
||||
missing = tmp_path / "nonexistent.yaml"
|
||||
with patch.object(qs, "QUEST_CONFIG_PATH", missing):
|
||||
defs, settings = load_quest_config()
|
||||
assert defs == {}
|
||||
assert settings == {}
|
||||
|
||||
def test_valid_yaml_loads_quests(self, tmp_path):
|
||||
config_path = tmp_path / "quests.yaml"
|
||||
config_path.write_text(
|
||||
"""
|
||||
quests:
|
||||
first_quest:
|
||||
name: First Quest
|
||||
description: Do stuff
|
||||
reward_tokens: 25
|
||||
type: issue_count
|
||||
enabled: true
|
||||
repeatable: false
|
||||
cooldown_hours: 0
|
||||
criteria:
|
||||
target_count: 3
|
||||
notification_message: "Done! {tokens} tokens"
|
||||
settings:
|
||||
some_setting: true
|
||||
"""
|
||||
)
|
||||
with patch.object(qs, "QUEST_CONFIG_PATH", config_path):
|
||||
defs, settings = load_quest_config()
|
||||
|
||||
assert "first_quest" in defs
|
||||
assert defs["first_quest"].name == "First Quest"
|
||||
assert defs["first_quest"].reward_tokens == 25
|
||||
assert settings == {"some_setting": True}
|
||||
|
||||
def test_invalid_yaml_returns_empty(self, tmp_path):
|
||||
config_path = tmp_path / "quests.yaml"
|
||||
config_path.write_text(":: not valid yaml ::")
|
||||
with patch.object(qs, "QUEST_CONFIG_PATH", config_path):
|
||||
defs, settings = load_quest_config()
|
||||
assert defs == {}
|
||||
assert settings == {}
|
||||
|
||||
def test_non_dict_yaml_returns_empty(self, tmp_path):
|
||||
config_path = tmp_path / "quests.yaml"
|
||||
config_path.write_text("- item1\n- item2\n")
|
||||
with patch.object(qs, "QUEST_CONFIG_PATH", config_path):
|
||||
defs, settings = load_quest_config()
|
||||
assert defs == {}
|
||||
assert settings == {}
|
||||
|
||||
def test_bad_quest_entry_is_skipped(self, tmp_path):
|
||||
config_path = tmp_path / "quests.yaml"
|
||||
config_path.write_text(
|
||||
"""
|
||||
quests:
|
||||
good_quest:
|
||||
name: Good
|
||||
type: issue_count
|
||||
reward_tokens: 10
|
||||
enabled: true
|
||||
repeatable: false
|
||||
cooldown_hours: 0
|
||||
criteria: {}
|
||||
notification_message: "{tokens}"
|
||||
bad_quest:
|
||||
type: invalid_type_that_does_not_exist
|
||||
"""
|
||||
)
|
||||
with patch.object(qs, "QUEST_CONFIG_PATH", config_path):
|
||||
defs, _ = load_quest_config()
|
||||
assert "good_quest" in defs
|
||||
assert "bad_quest" not in defs
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# get_quest_definitions / get_quest_definition / get_active_quests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestQuestLookup:
|
||||
def setup_method(self):
|
||||
q1 = _make_quest("q1", enabled=True)
|
||||
q2 = _make_quest("q2", enabled=False)
|
||||
qs._quest_definitions.update({"q1": q1, "q2": q2})
|
||||
|
||||
def test_get_quest_definitions_returns_all(self):
|
||||
defs = get_quest_definitions()
|
||||
assert "q1" in defs
|
||||
assert "q2" in defs
|
||||
|
||||
def test_get_quest_definition_found(self):
|
||||
defn = get_quest_definition("q1")
|
||||
assert defn is not None
|
||||
assert defn.id == "q1"
|
||||
|
||||
def test_get_quest_definition_not_found(self):
|
||||
assert get_quest_definition("missing") is None
|
||||
|
||||
def test_get_active_quests_only_enabled(self):
|
||||
active = get_active_quests()
|
||||
ids = [q.id for q in active]
|
||||
assert "q1" in ids
|
||||
assert "q2" not in ids
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _get_target_value
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestGetTargetValue:
|
||||
def test_issue_count(self):
|
||||
q = _make_quest(quest_type=QuestType.ISSUE_COUNT, criteria={"target_count": 7})
|
||||
assert _get_target_value(q) == 7
|
||||
|
||||
def test_issue_reduce(self):
|
||||
q = _make_quest(quest_type=QuestType.ISSUE_REDUCE, criteria={"target_reduction": 5})
|
||||
assert _get_target_value(q) == 5
|
||||
|
||||
def test_daily_run(self):
|
||||
q = _make_quest(quest_type=QuestType.DAILY_RUN, criteria={"min_sessions": 3})
|
||||
assert _get_target_value(q) == 3
|
||||
|
||||
def test_docs_update(self):
|
||||
q = _make_quest(quest_type=QuestType.DOCS_UPDATE, criteria={"min_files_changed": 2})
|
||||
assert _get_target_value(q) == 2
|
||||
|
||||
def test_test_improve(self):
|
||||
q = _make_quest(quest_type=QuestType.TEST_IMPROVE, criteria={"min_new_tests": 4})
|
||||
assert _get_target_value(q) == 4
|
||||
|
||||
def test_custom_defaults_to_one(self):
|
||||
q = _make_quest(quest_type=QuestType.CUSTOM, criteria={})
|
||||
assert _get_target_value(q) == 1
|
||||
|
||||
def test_missing_criteria_key_defaults_to_one(self):
|
||||
q = _make_quest(quest_type=QuestType.ISSUE_COUNT, criteria={})
|
||||
assert _get_target_value(q) == 1
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# get_or_create_progress / get_quest_progress
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestProgressCreation:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["q1"] = _make_quest("q1", criteria={"target_count": 5})
|
||||
|
||||
def test_creates_new_progress(self):
|
||||
progress = get_or_create_progress("q1", "agent_a")
|
||||
assert progress.quest_id == "q1"
|
||||
assert progress.agent_id == "agent_a"
|
||||
assert progress.status == QuestStatus.NOT_STARTED
|
||||
assert progress.target_value == 5
|
||||
assert progress.current_value == 0
|
||||
|
||||
def test_returns_existing_progress(self):
|
||||
p1 = get_or_create_progress("q1", "agent_a")
|
||||
p1.current_value = 3
|
||||
p2 = get_or_create_progress("q1", "agent_a")
|
||||
assert p2.current_value == 3
|
||||
assert p1 is p2
|
||||
|
||||
def test_raises_for_unknown_quest(self):
|
||||
with pytest.raises(ValueError, match="Quest unknown not found"):
|
||||
get_or_create_progress("unknown", "agent_a")
|
||||
|
||||
def test_get_quest_progress_none_before_creation(self):
|
||||
assert get_quest_progress("q1", "agent_a") is None
|
||||
|
||||
def test_get_quest_progress_after_creation(self):
|
||||
get_or_create_progress("q1", "agent_a")
|
||||
progress = get_quest_progress("q1", "agent_a")
|
||||
assert progress is not None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# update_quest_progress
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestUpdateQuestProgress:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["q1"] = _make_quest("q1", criteria={"target_count": 3})
|
||||
|
||||
def test_updates_current_value(self):
|
||||
progress = update_quest_progress("q1", "agent_a", 2)
|
||||
assert progress.current_value == 2
|
||||
assert progress.status == QuestStatus.NOT_STARTED
|
||||
|
||||
def test_marks_completed_when_target_reached(self):
|
||||
progress = update_quest_progress("q1", "agent_a", 3)
|
||||
assert progress.status == QuestStatus.COMPLETED
|
||||
assert progress.completed_at != ""
|
||||
|
||||
def test_marks_completed_when_value_exceeds_target(self):
|
||||
progress = update_quest_progress("q1", "agent_a", 10)
|
||||
assert progress.status == QuestStatus.COMPLETED
|
||||
|
||||
def test_does_not_re_complete_already_completed(self):
|
||||
p = update_quest_progress("q1", "agent_a", 3)
|
||||
first_completed_at = p.completed_at
|
||||
p2 = update_quest_progress("q1", "agent_a", 5)
|
||||
# should not change completed_at again
|
||||
assert p2.completed_at == first_completed_at
|
||||
|
||||
def test_does_not_re_complete_claimed_quest(self):
|
||||
p = update_quest_progress("q1", "agent_a", 3)
|
||||
p.status = QuestStatus.CLAIMED
|
||||
p2 = update_quest_progress("q1", "agent_a", 5)
|
||||
assert p2.status == QuestStatus.CLAIMED
|
||||
|
||||
def test_updates_metadata(self):
|
||||
progress = update_quest_progress("q1", "agent_a", 1, metadata={"info": "value"})
|
||||
assert progress.metadata["info"] == "value"
|
||||
|
||||
def test_merges_metadata(self):
|
||||
update_quest_progress("q1", "agent_a", 1, metadata={"a": 1})
|
||||
progress = update_quest_progress("q1", "agent_a", 2, metadata={"b": 2})
|
||||
assert progress.metadata["a"] == 1
|
||||
assert progress.metadata["b"] == 2
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _is_on_cooldown
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestIsOnCooldown:
|
||||
def test_non_repeatable_never_on_cooldown(self):
|
||||
quest = _make_quest(repeatable=False, cooldown_hours=24)
|
||||
progress = QuestProgress(
|
||||
quest_id="q1",
|
||||
agent_id="agent_a",
|
||||
status=QuestStatus.CLAIMED,
|
||||
last_completed_at=datetime.now(UTC).isoformat(),
|
||||
)
|
||||
assert _is_on_cooldown(progress, quest) is False
|
||||
|
||||
def test_no_last_completed_not_on_cooldown(self):
|
||||
quest = _make_quest(repeatable=True, cooldown_hours=24)
|
||||
progress = QuestProgress(
|
||||
quest_id="q1",
|
||||
agent_id="agent_a",
|
||||
status=QuestStatus.NOT_STARTED,
|
||||
last_completed_at="",
|
||||
)
|
||||
assert _is_on_cooldown(progress, quest) is False
|
||||
|
||||
def test_zero_cooldown_not_on_cooldown(self):
|
||||
quest = _make_quest(repeatable=True, cooldown_hours=0)
|
||||
progress = QuestProgress(
|
||||
quest_id="q1",
|
||||
agent_id="agent_a",
|
||||
status=QuestStatus.CLAIMED,
|
||||
last_completed_at=datetime.now(UTC).isoformat(),
|
||||
)
|
||||
assert _is_on_cooldown(progress, quest) is False
|
||||
|
||||
def test_recent_completion_is_on_cooldown(self):
|
||||
quest = _make_quest(repeatable=True, cooldown_hours=24)
|
||||
recent = datetime.now(UTC) - timedelta(hours=1)
|
||||
progress = QuestProgress(
|
||||
quest_id="q1",
|
||||
agent_id="agent_a",
|
||||
status=QuestStatus.NOT_STARTED,
|
||||
last_completed_at=recent.isoformat(),
|
||||
)
|
||||
assert _is_on_cooldown(progress, quest) is True
|
||||
|
||||
def test_expired_cooldown_not_on_cooldown(self):
|
||||
quest = _make_quest(repeatable=True, cooldown_hours=24)
|
||||
old = datetime.now(UTC) - timedelta(hours=25)
|
||||
progress = QuestProgress(
|
||||
quest_id="q1",
|
||||
agent_id="agent_a",
|
||||
status=QuestStatus.NOT_STARTED,
|
||||
last_completed_at=old.isoformat(),
|
||||
)
|
||||
assert _is_on_cooldown(progress, quest) is False
|
||||
|
||||
def test_invalid_last_completed_returns_false(self):
|
||||
quest = _make_quest(repeatable=True, cooldown_hours=24)
|
||||
progress = QuestProgress(
|
||||
quest_id="q1",
|
||||
agent_id="agent_a",
|
||||
status=QuestStatus.NOT_STARTED,
|
||||
last_completed_at="not-a-date",
|
||||
)
|
||||
assert _is_on_cooldown(progress, quest) is False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# claim_quest_reward
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestClaimQuestReward:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["q1"] = _make_quest("q1", reward_tokens=25)
|
||||
|
||||
def test_returns_none_if_no_progress(self):
|
||||
assert claim_quest_reward("q1", "agent_a") is None
|
||||
|
||||
def test_returns_none_if_not_completed(self):
|
||||
get_or_create_progress("q1", "agent_a")
|
||||
assert claim_quest_reward("q1", "agent_a") is None
|
||||
|
||||
def test_returns_none_if_quest_not_found(self):
|
||||
assert claim_quest_reward("nonexistent", "agent_a") is None
|
||||
|
||||
def test_successful_claim(self):
|
||||
progress = get_or_create_progress("q1", "agent_a")
|
||||
progress.status = QuestStatus.COMPLETED
|
||||
progress.completed_at = datetime.now(UTC).isoformat()
|
||||
|
||||
mock_invoice = MagicMock()
|
||||
mock_invoice.payment_hash = "quest_q1_agent_a_123"
|
||||
|
||||
with (
|
||||
patch("timmy.quest_system.create_invoice_entry", return_value=mock_invoice),
|
||||
patch("timmy.quest_system.mark_settled"),
|
||||
):
|
||||
result = claim_quest_reward("q1", "agent_a")
|
||||
|
||||
assert result is not None
|
||||
assert result["tokens_awarded"] == 25
|
||||
assert result["quest_id"] == "q1"
|
||||
assert result["agent_id"] == "agent_a"
|
||||
assert result["completion_count"] == 1
|
||||
|
||||
def test_successful_claim_marks_claimed(self):
|
||||
progress = get_or_create_progress("q1", "agent_a")
|
||||
progress.status = QuestStatus.COMPLETED
|
||||
progress.completed_at = datetime.now(UTC).isoformat()
|
||||
|
||||
mock_invoice = MagicMock()
|
||||
mock_invoice.payment_hash = "phash"
|
||||
|
||||
with (
|
||||
patch("timmy.quest_system.create_invoice_entry", return_value=mock_invoice),
|
||||
patch("timmy.quest_system.mark_settled"),
|
||||
):
|
||||
claim_quest_reward("q1", "agent_a")
|
||||
|
||||
assert progress.status == QuestStatus.CLAIMED
|
||||
|
||||
def test_repeatable_quest_resets_after_claim(self):
|
||||
qs._quest_definitions["rep"] = _make_quest(
|
||||
"rep", repeatable=True, cooldown_hours=0, reward_tokens=10
|
||||
)
|
||||
progress = get_or_create_progress("rep", "agent_a")
|
||||
progress.status = QuestStatus.COMPLETED
|
||||
progress.completed_at = datetime.now(UTC).isoformat()
|
||||
progress.current_value = 5
|
||||
|
||||
mock_invoice = MagicMock()
|
||||
mock_invoice.payment_hash = "phash"
|
||||
|
||||
with (
|
||||
patch("timmy.quest_system.create_invoice_entry", return_value=mock_invoice),
|
||||
patch("timmy.quest_system.mark_settled"),
|
||||
):
|
||||
result = claim_quest_reward("rep", "agent_a")
|
||||
|
||||
assert result is not None
|
||||
assert progress.status == QuestStatus.NOT_STARTED
|
||||
assert progress.current_value == 0
|
||||
assert progress.completed_at == ""
|
||||
|
||||
def test_on_cooldown_returns_none(self):
|
||||
qs._quest_definitions["rep"] = _make_quest("rep", repeatable=True, cooldown_hours=24)
|
||||
progress = get_or_create_progress("rep", "agent_a")
|
||||
progress.status = QuestStatus.COMPLETED
|
||||
recent = datetime.now(UTC) - timedelta(hours=1)
|
||||
progress.last_completed_at = recent.isoformat()
|
||||
|
||||
assert claim_quest_reward("rep", "agent_a") is None
|
||||
|
||||
def test_ledger_error_returns_none(self):
|
||||
progress = get_or_create_progress("q1", "agent_a")
|
||||
progress.status = QuestStatus.COMPLETED
|
||||
progress.completed_at = datetime.now(UTC).isoformat()
|
||||
|
||||
with patch("timmy.quest_system.create_invoice_entry", side_effect=Exception("ledger error")):
|
||||
result = claim_quest_reward("q1", "agent_a")
|
||||
|
||||
assert result is None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# check_issue_count_quest
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestCheckIssueCountQuest:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["iq"] = _make_quest(
|
||||
"iq", quest_type=QuestType.ISSUE_COUNT, criteria={"target_count": 2, "issue_labels": ["bug"]}
|
||||
)
|
||||
|
||||
def test_counts_matching_issues(self):
|
||||
issues = [
|
||||
{"labels": [{"name": "bug"}]},
|
||||
{"labels": [{"name": "bug"}, {"name": "priority"}]},
|
||||
{"labels": [{"name": "feature"}]}, # doesn't match
|
||||
]
|
||||
progress = check_issue_count_quest(
|
||||
qs._quest_definitions["iq"], "agent_a", issues
|
||||
)
|
||||
assert progress.current_value == 2
|
||||
assert progress.status == QuestStatus.COMPLETED
|
||||
|
||||
def test_empty_issues_returns_zero(self):
|
||||
progress = check_issue_count_quest(qs._quest_definitions["iq"], "agent_a", [])
|
||||
assert progress.current_value == 0
|
||||
|
||||
def test_no_labels_filter_counts_all_labeled(self):
|
||||
q = _make_quest(
|
||||
"nolabel",
|
||||
quest_type=QuestType.ISSUE_COUNT,
|
||||
criteria={"target_count": 1, "issue_labels": []},
|
||||
)
|
||||
qs._quest_definitions["nolabel"] = q
|
||||
issues = [
|
||||
{"labels": [{"name": "bug"}]},
|
||||
{"labels": [{"name": "feature"}]},
|
||||
]
|
||||
progress = check_issue_count_quest(q, "agent_a", issues)
|
||||
assert progress.current_value == 2
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# check_issue_reduce_quest
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestCheckIssueReduceQuest:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["ir"] = _make_quest(
|
||||
"ir", quest_type=QuestType.ISSUE_REDUCE, criteria={"target_reduction": 5}
|
||||
)
|
||||
|
||||
def test_computes_reduction(self):
|
||||
progress = check_issue_reduce_quest(qs._quest_definitions["ir"], "agent_a", 20, 15)
|
||||
assert progress.current_value == 5
|
||||
assert progress.status == QuestStatus.COMPLETED
|
||||
|
||||
def test_negative_reduction_treated_as_zero(self):
|
||||
progress = check_issue_reduce_quest(qs._quest_definitions["ir"], "agent_a", 10, 15)
|
||||
assert progress.current_value == 0
|
||||
|
||||
def test_no_change_yields_zero(self):
|
||||
progress = check_issue_reduce_quest(qs._quest_definitions["ir"], "agent_a", 10, 10)
|
||||
assert progress.current_value == 0
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# check_daily_run_quest
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestCheckDailyRunQuest:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["dr"] = _make_quest(
|
||||
"dr", quest_type=QuestType.DAILY_RUN, criteria={"min_sessions": 2}
|
||||
)
|
||||
|
||||
def test_tracks_sessions(self):
|
||||
progress = check_daily_run_quest(qs._quest_definitions["dr"], "agent_a", 2)
|
||||
assert progress.current_value == 2
|
||||
assert progress.status == QuestStatus.COMPLETED
|
||||
|
||||
def test_incomplete_sessions(self):
|
||||
progress = check_daily_run_quest(qs._quest_definitions["dr"], "agent_a", 1)
|
||||
assert progress.current_value == 1
|
||||
assert progress.status != QuestStatus.COMPLETED
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# evaluate_quest_progress
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestEvaluateQuestProgress:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["iq"] = _make_quest(
|
||||
"iq", quest_type=QuestType.ISSUE_COUNT, criteria={"target_count": 1}
|
||||
)
|
||||
qs._quest_definitions["dis"] = _make_quest("dis", enabled=False)
|
||||
|
||||
def test_disabled_quest_returns_none(self):
|
||||
result = evaluate_quest_progress("dis", "agent_a", {})
|
||||
assert result is None
|
||||
|
||||
def test_missing_quest_returns_none(self):
|
||||
result = evaluate_quest_progress("nonexistent", "agent_a", {})
|
||||
assert result is None
|
||||
|
||||
def test_issue_count_quest_evaluated(self):
|
||||
context = {"closed_issues": [{"labels": [{"name": "bug"}]}]}
|
||||
result = evaluate_quest_progress("iq", "agent_a", context)
|
||||
assert result is not None
|
||||
assert result.current_value == 1
|
||||
|
||||
def test_issue_reduce_quest_evaluated(self):
|
||||
qs._quest_definitions["ir"] = _make_quest(
|
||||
"ir", quest_type=QuestType.ISSUE_REDUCE, criteria={"target_reduction": 3}
|
||||
)
|
||||
context = {"previous_issue_count": 10, "current_issue_count": 7}
|
||||
result = evaluate_quest_progress("ir", "agent_a", context)
|
||||
assert result is not None
|
||||
assert result.current_value == 3
|
||||
|
||||
def test_daily_run_quest_evaluated(self):
|
||||
qs._quest_definitions["dr"] = _make_quest(
|
||||
"dr", quest_type=QuestType.DAILY_RUN, criteria={"min_sessions": 1}
|
||||
)
|
||||
context = {"sessions_completed": 2}
|
||||
result = evaluate_quest_progress("dr", "agent_a", context)
|
||||
assert result is not None
|
||||
assert result.current_value == 2
|
||||
|
||||
def test_custom_quest_returns_existing_progress(self):
|
||||
qs._quest_definitions["cust"] = _make_quest("cust", quest_type=QuestType.CUSTOM)
|
||||
# No progress yet => None (custom quests don't auto-create progress here)
|
||||
result = evaluate_quest_progress("cust", "agent_a", {})
|
||||
assert result is None
|
||||
|
||||
def test_cooldown_prevents_evaluation(self):
|
||||
q = _make_quest("rep_iq", quest_type=QuestType.ISSUE_COUNT, repeatable=True, cooldown_hours=24, criteria={"target_count": 1})
|
||||
qs._quest_definitions["rep_iq"] = q
|
||||
progress = get_or_create_progress("rep_iq", "agent_a")
|
||||
recent = datetime.now(UTC) - timedelta(hours=1)
|
||||
progress.last_completed_at = recent.isoformat()
|
||||
|
||||
context = {"closed_issues": [{"labels": [{"name": "bug"}]}]}
|
||||
result = evaluate_quest_progress("rep_iq", "agent_a", context)
|
||||
# Should return existing progress without updating
|
||||
assert result is progress
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# reset_quest_progress
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestResetQuestProgress:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["q1"] = _make_quest("q1")
|
||||
qs._quest_definitions["q2"] = _make_quest("q2")
|
||||
|
||||
def test_reset_all(self):
|
||||
get_or_create_progress("q1", "agent_a")
|
||||
get_or_create_progress("q2", "agent_a")
|
||||
count = reset_quest_progress()
|
||||
assert count == 2
|
||||
assert get_quest_progress("q1", "agent_a") is None
|
||||
assert get_quest_progress("q2", "agent_a") is None
|
||||
|
||||
def test_reset_specific_quest(self):
|
||||
get_or_create_progress("q1", "agent_a")
|
||||
get_or_create_progress("q2", "agent_a")
|
||||
count = reset_quest_progress(quest_id="q1")
|
||||
assert count == 1
|
||||
assert get_quest_progress("q1", "agent_a") is None
|
||||
assert get_quest_progress("q2", "agent_a") is not None
|
||||
|
||||
def test_reset_specific_agent(self):
|
||||
get_or_create_progress("q1", "agent_a")
|
||||
get_or_create_progress("q1", "agent_b")
|
||||
count = reset_quest_progress(agent_id="agent_a")
|
||||
assert count == 1
|
||||
assert get_quest_progress("q1", "agent_a") is None
|
||||
assert get_quest_progress("q1", "agent_b") is not None
|
||||
|
||||
def test_reset_specific_quest_and_agent(self):
|
||||
get_or_create_progress("q1", "agent_a")
|
||||
get_or_create_progress("q1", "agent_b")
|
||||
count = reset_quest_progress(quest_id="q1", agent_id="agent_a")
|
||||
assert count == 1
|
||||
|
||||
def test_reset_empty_returns_zero(self):
|
||||
count = reset_quest_progress()
|
||||
assert count == 0
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# get_quest_leaderboard
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestGetQuestLeaderboard:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["q1"] = _make_quest("q1", reward_tokens=10)
|
||||
qs._quest_definitions["q2"] = _make_quest("q2", reward_tokens=20)
|
||||
|
||||
def test_empty_progress_returns_empty(self):
|
||||
assert get_quest_leaderboard() == []
|
||||
|
||||
def test_leaderboard_sorted_by_tokens(self):
|
||||
p_a = get_or_create_progress("q1", "agent_a")
|
||||
p_a.completion_count = 1
|
||||
p_b = get_or_create_progress("q2", "agent_b")
|
||||
p_b.completion_count = 2
|
||||
|
||||
board = get_quest_leaderboard()
|
||||
assert board[0]["agent_id"] == "agent_b" # 40 tokens
|
||||
assert board[1]["agent_id"] == "agent_a" # 10 tokens
|
||||
|
||||
def test_leaderboard_aggregates_multiple_quests(self):
|
||||
p1 = get_or_create_progress("q1", "agent_a")
|
||||
p1.completion_count = 2 # 20 tokens
|
||||
p2 = get_or_create_progress("q2", "agent_a")
|
||||
p2.completion_count = 1 # 20 tokens
|
||||
|
||||
board = get_quest_leaderboard()
|
||||
assert len(board) == 1
|
||||
assert board[0]["total_tokens"] == 40
|
||||
assert board[0]["total_completions"] == 3
|
||||
|
||||
def test_leaderboard_counts_unique_quests(self):
|
||||
p1 = get_or_create_progress("q1", "agent_a")
|
||||
p1.completion_count = 2
|
||||
p2 = get_or_create_progress("q2", "agent_a")
|
||||
p2.completion_count = 1
|
||||
|
||||
board = get_quest_leaderboard()
|
||||
assert board[0]["unique_quests_completed"] == 2
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# get_agent_quests_status
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestGetAgentQuestsStatus:
|
||||
def setup_method(self):
|
||||
qs._quest_definitions["q1"] = _make_quest("q1", reward_tokens=10)
|
||||
|
||||
def test_returns_status_structure(self):
|
||||
result = get_agent_quests_status("agent_a")
|
||||
assert result["agent_id"] == "agent_a"
|
||||
assert isinstance(result["quests"], list)
|
||||
assert "total_tokens_earned" in result
|
||||
assert "total_quests_completed" in result
|
||||
assert "active_quests_count" in result
|
||||
|
||||
def test_includes_quest_info(self):
|
||||
result = get_agent_quests_status("agent_a")
|
||||
quest_info = result["quests"][0]
|
||||
assert quest_info["quest_id"] == "q1"
|
||||
assert quest_info["reward_tokens"] == 10
|
||||
assert quest_info["status"] == QuestStatus.NOT_STARTED.value
|
||||
|
||||
def test_accumulates_tokens_from_completions(self):
|
||||
p = get_or_create_progress("q1", "agent_a")
|
||||
p.completion_count = 3
|
||||
result = get_agent_quests_status("agent_a")
|
||||
assert result["total_tokens_earned"] == 30
|
||||
assert result["total_quests_completed"] == 3
|
||||
|
||||
def test_cooldown_hours_remaining_calculated(self):
|
||||
q = _make_quest("qcool", repeatable=True, cooldown_hours=24, reward_tokens=5)
|
||||
qs._quest_definitions["qcool"] = q
|
||||
p = get_or_create_progress("qcool", "agent_a")
|
||||
recent = datetime.now(UTC) - timedelta(hours=2)
|
||||
p.last_completed_at = recent.isoformat()
|
||||
p.completion_count = 1
|
||||
|
||||
result = get_agent_quests_status("agent_a")
|
||||
qcool_info = next(qi for qi in result["quests"] if qi["quest_id"] == "qcool")
|
||||
assert qcool_info["on_cooldown"] is True
|
||||
assert qcool_info["cooldown_hours_remaining"] > 0
|
||||
403
tests/timmy/test_research.py
Normal file
403
tests/timmy/test_research.py
Normal file
@@ -0,0 +1,403 @@
|
||||
"""Unit tests for src/timmy/research.py — ResearchOrchestrator pipeline.
|
||||
|
||||
Refs #972 (governing spec), #975 (ResearchOrchestrator).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
pytestmark = pytest.mark.unit
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# list_templates
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestListTemplates:
|
||||
def test_returns_list(self, tmp_path, monkeypatch):
|
||||
(tmp_path / "tool_evaluation.md").write_text("---\n---\n# T")
|
||||
(tmp_path / "game_analysis.md").write_text("---\n---\n# G")
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
from timmy.research import list_templates
|
||||
|
||||
result = list_templates()
|
||||
assert isinstance(result, list)
|
||||
assert "tool_evaluation" in result
|
||||
assert "game_analysis" in result
|
||||
|
||||
def test_returns_empty_when_dir_missing(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path / "nonexistent")
|
||||
|
||||
from timmy.research import list_templates
|
||||
|
||||
assert list_templates() == []
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# load_template
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestLoadTemplate:
|
||||
def _write_template(self, path: Path, name: str, body: str) -> None:
|
||||
(path / f"{name}.md").write_text(body, encoding="utf-8")
|
||||
|
||||
def test_loads_and_strips_frontmatter(self, tmp_path, monkeypatch):
|
||||
self._write_template(
|
||||
tmp_path,
|
||||
"tool_evaluation",
|
||||
"---\nname: Tool Evaluation\ntype: research\n---\n# Tool Eval: {domain}",
|
||||
)
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
from timmy.research import load_template
|
||||
|
||||
result = load_template("tool_evaluation", {"domain": "PDF parsing"})
|
||||
assert "# Tool Eval: PDF parsing" in result
|
||||
assert "name: Tool Evaluation" not in result
|
||||
|
||||
def test_fills_slots(self, tmp_path, monkeypatch):
|
||||
self._write_template(tmp_path, "arch", "Connect {system_a} to {system_b}")
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
from timmy.research import load_template
|
||||
|
||||
result = load_template("arch", {"system_a": "Kafka", "system_b": "Postgres"})
|
||||
assert "Kafka" in result
|
||||
assert "Postgres" in result
|
||||
|
||||
def test_unfilled_slots_preserved(self, tmp_path, monkeypatch):
|
||||
self._write_template(tmp_path, "t", "Hello {name} and {other}")
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
from timmy.research import load_template
|
||||
|
||||
result = load_template("t", {"name": "World"})
|
||||
assert "{other}" in result
|
||||
|
||||
def test_raises_file_not_found_for_missing_template(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
from timmy.research import load_template
|
||||
|
||||
with pytest.raises(FileNotFoundError, match="nonexistent"):
|
||||
load_template("nonexistent")
|
||||
|
||||
def test_no_slots_returns_raw_body(self, tmp_path, monkeypatch):
|
||||
self._write_template(tmp_path, "plain", "---\n---\nJust text here")
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
from timmy.research import load_template
|
||||
|
||||
result = load_template("plain")
|
||||
assert result == "Just text here"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _check_cache
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestCheckCache:
|
||||
def test_returns_none_when_no_hits(self):
|
||||
mock_mem = MagicMock()
|
||||
mock_mem.search.return_value = []
|
||||
|
||||
with patch("timmy.research.SemanticMemory", return_value=mock_mem):
|
||||
from timmy.research import _check_cache
|
||||
|
||||
content, score = _check_cache("some topic")
|
||||
|
||||
assert content is None
|
||||
assert score == 0.0
|
||||
|
||||
def test_returns_content_above_threshold(self):
|
||||
mock_mem = MagicMock()
|
||||
mock_mem.search.return_value = [("cached report text", 0.91)]
|
||||
|
||||
with patch("timmy.research.SemanticMemory", return_value=mock_mem):
|
||||
from timmy.research import _check_cache
|
||||
|
||||
content, score = _check_cache("same topic")
|
||||
|
||||
assert content == "cached report text"
|
||||
assert score == pytest.approx(0.91)
|
||||
|
||||
def test_returns_none_below_threshold(self):
|
||||
mock_mem = MagicMock()
|
||||
mock_mem.search.return_value = [("old report", 0.60)]
|
||||
|
||||
with patch("timmy.research.SemanticMemory", return_value=mock_mem):
|
||||
from timmy.research import _check_cache
|
||||
|
||||
content, score = _check_cache("slightly different topic")
|
||||
|
||||
assert content is None
|
||||
assert score == 0.0
|
||||
|
||||
def test_degrades_gracefully_on_import_error(self):
|
||||
with patch("timmy.research.SemanticMemory", None):
|
||||
from timmy.research import _check_cache
|
||||
|
||||
content, score = _check_cache("topic")
|
||||
|
||||
assert content is None
|
||||
assert score == 0.0
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _store_result
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestStoreResult:
|
||||
def test_calls_store_memory(self):
|
||||
mock_store = MagicMock()
|
||||
|
||||
with patch("timmy.research.store_memory", mock_store):
|
||||
from timmy.research import _store_result
|
||||
|
||||
_store_result("test topic", "# Report\n\nContent here.")
|
||||
|
||||
mock_store.assert_called_once()
|
||||
call_kwargs = mock_store.call_args
|
||||
assert "test topic" in str(call_kwargs)
|
||||
|
||||
def test_degrades_gracefully_on_error(self):
|
||||
mock_store = MagicMock(side_effect=RuntimeError("db error"))
|
||||
with patch("timmy.research.store_memory", mock_store):
|
||||
from timmy.research import _store_result
|
||||
|
||||
# Should not raise
|
||||
_store_result("topic", "report")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _save_to_disk
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestSaveToDisk:
|
||||
def test_writes_file(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setattr("timmy.research._DOCS_ROOT", tmp_path / "research")
|
||||
|
||||
from timmy.research import _save_to_disk
|
||||
|
||||
path = _save_to_disk("Test Topic: PDF Parsing", "# Test Report")
|
||||
assert path is not None
|
||||
assert path.exists()
|
||||
assert path.read_text() == "# Test Report"
|
||||
|
||||
def test_slugifies_topic_name(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setattr("timmy.research._DOCS_ROOT", tmp_path / "research")
|
||||
|
||||
from timmy.research import _save_to_disk
|
||||
|
||||
path = _save_to_disk("My Complex Topic! v2.0", "content")
|
||||
assert path is not None
|
||||
# Should be slugified: no special chars
|
||||
assert " " not in path.name
|
||||
assert "!" not in path.name
|
||||
|
||||
def test_returns_none_on_error(self, monkeypatch):
|
||||
monkeypatch.setattr(
|
||||
"timmy.research._DOCS_ROOT",
|
||||
Path("/nonexistent_root/deeply/nested"),
|
||||
)
|
||||
|
||||
with patch("pathlib.Path.mkdir", side_effect=PermissionError("denied")):
|
||||
from timmy.research import _save_to_disk
|
||||
|
||||
result = _save_to_disk("topic", "report")
|
||||
|
||||
assert result is None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# run_research — end-to-end with mocks
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestRunResearch:
|
||||
@pytest.mark.asyncio
|
||||
async def test_returns_cached_result_when_cache_hit(self):
|
||||
cached_report = "# Cached Report\n\nPreviously computed."
|
||||
with (
|
||||
patch("timmy.research._check_cache", return_value=(cached_report, 0.93)),
|
||||
):
|
||||
from timmy.research import run_research
|
||||
|
||||
result = await run_research("some topic")
|
||||
|
||||
assert result.cached is True
|
||||
assert result.cache_similarity == pytest.approx(0.93)
|
||||
assert result.report == cached_report
|
||||
assert result.synthesis_backend == "cache"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_skips_cache_when_requested(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
with (
|
||||
patch("timmy.research._check_cache", return_value=("cached", 0.99)) as mock_cache,
|
||||
patch(
|
||||
"timmy.research._formulate_queries",
|
||||
new=AsyncMock(return_value=["q1"]),
|
||||
),
|
||||
patch("timmy.research._execute_search", new=AsyncMock(return_value=[])),
|
||||
patch("timmy.research._fetch_pages", new=AsyncMock(return_value=[])),
|
||||
patch(
|
||||
"timmy.research._synthesize",
|
||||
new=AsyncMock(return_value=("# Fresh report", "ollama")),
|
||||
),
|
||||
patch("timmy.research._store_result"),
|
||||
):
|
||||
from timmy.research import run_research
|
||||
|
||||
result = await run_research("topic", skip_cache=True)
|
||||
|
||||
mock_cache.assert_not_called()
|
||||
assert result.cached is False
|
||||
assert result.report == "# Fresh report"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_full_pipeline_no_search_results(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
with (
|
||||
patch("timmy.research._check_cache", return_value=(None, 0.0)),
|
||||
patch(
|
||||
"timmy.research._formulate_queries",
|
||||
new=AsyncMock(return_value=["query 1", "query 2"]),
|
||||
),
|
||||
patch("timmy.research._execute_search", new=AsyncMock(return_value=[])),
|
||||
patch("timmy.research._fetch_pages", new=AsyncMock(return_value=[])),
|
||||
patch(
|
||||
"timmy.research._synthesize",
|
||||
new=AsyncMock(return_value=("# Report", "ollama")),
|
||||
),
|
||||
patch("timmy.research._store_result"),
|
||||
):
|
||||
from timmy.research import run_research
|
||||
|
||||
result = await run_research("a new topic")
|
||||
|
||||
assert not result.cached
|
||||
assert result.query_count == 2
|
||||
assert result.sources_fetched == 0
|
||||
assert result.report == "# Report"
|
||||
assert result.synthesis_backend == "ollama"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_returns_result_with_error_on_bad_template(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
with (
|
||||
patch("timmy.research._check_cache", return_value=(None, 0.0)),
|
||||
patch(
|
||||
"timmy.research._formulate_queries",
|
||||
new=AsyncMock(return_value=["q1"]),
|
||||
),
|
||||
patch("timmy.research._execute_search", new=AsyncMock(return_value=[])),
|
||||
patch("timmy.research._fetch_pages", new=AsyncMock(return_value=[])),
|
||||
patch(
|
||||
"timmy.research._synthesize",
|
||||
new=AsyncMock(return_value=("# Report", "ollama")),
|
||||
),
|
||||
patch("timmy.research._store_result"),
|
||||
):
|
||||
from timmy.research import run_research
|
||||
|
||||
result = await run_research("topic", template="nonexistent_template")
|
||||
|
||||
assert len(result.errors) == 1
|
||||
assert "nonexistent_template" in result.errors[0]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_saves_to_disk_when_requested(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
monkeypatch.setattr("timmy.research._DOCS_ROOT", tmp_path / "research")
|
||||
|
||||
with (
|
||||
patch("timmy.research._check_cache", return_value=(None, 0.0)),
|
||||
patch(
|
||||
"timmy.research._formulate_queries",
|
||||
new=AsyncMock(return_value=["q1"]),
|
||||
),
|
||||
patch("timmy.research._execute_search", new=AsyncMock(return_value=[])),
|
||||
patch("timmy.research._fetch_pages", new=AsyncMock(return_value=[])),
|
||||
patch(
|
||||
"timmy.research._synthesize",
|
||||
new=AsyncMock(return_value=("# Saved Report", "ollama")),
|
||||
),
|
||||
patch("timmy.research._store_result"),
|
||||
):
|
||||
from timmy.research import run_research
|
||||
|
||||
result = await run_research("disk topic", save_to_disk=True)
|
||||
|
||||
assert result.report == "# Saved Report"
|
||||
saved_files = list((tmp_path / "research").glob("*.md"))
|
||||
assert len(saved_files) == 1
|
||||
assert saved_files[0].read_text() == "# Saved Report"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_result_is_not_empty_after_synthesis(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setattr("timmy.research._SKILLS_ROOT", tmp_path)
|
||||
|
||||
with (
|
||||
patch("timmy.research._check_cache", return_value=(None, 0.0)),
|
||||
patch(
|
||||
"timmy.research._formulate_queries",
|
||||
new=AsyncMock(return_value=["q"]),
|
||||
),
|
||||
patch("timmy.research._execute_search", new=AsyncMock(return_value=[])),
|
||||
patch("timmy.research._fetch_pages", new=AsyncMock(return_value=[])),
|
||||
patch(
|
||||
"timmy.research._synthesize",
|
||||
new=AsyncMock(return_value=("# Non-empty", "ollama")),
|
||||
),
|
||||
patch("timmy.research._store_result"),
|
||||
):
|
||||
from timmy.research import run_research
|
||||
|
||||
result = await run_research("topic")
|
||||
|
||||
assert not result.is_empty()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# ResearchResult
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestResearchResult:
|
||||
def test_is_empty_when_no_report(self):
|
||||
from timmy.research import ResearchResult
|
||||
|
||||
r = ResearchResult(topic="t", query_count=0, sources_fetched=0, report="")
|
||||
assert r.is_empty()
|
||||
|
||||
def test_is_not_empty_with_content(self):
|
||||
from timmy.research import ResearchResult
|
||||
|
||||
r = ResearchResult(topic="t", query_count=1, sources_fetched=1, report="# Report")
|
||||
assert not r.is_empty()
|
||||
|
||||
def test_default_cached_false(self):
|
||||
from timmy.research import ResearchResult
|
||||
|
||||
r = ResearchResult(topic="t", query_count=0, sources_fetched=0, report="x")
|
||||
assert r.cached is False
|
||||
|
||||
def test_errors_defaults_to_empty_list(self):
|
||||
from timmy.research import ResearchResult
|
||||
|
||||
r = ResearchResult(topic="t", query_count=0, sources_fetched=0, report="x")
|
||||
assert r.errors == []
|
||||
444
tests/timmy/test_session_report.py
Normal file
444
tests/timmy/test_session_report.py
Normal file
@@ -0,0 +1,444 @@
|
||||
"""Tests for timmy.sovereignty.session_report.
|
||||
|
||||
Refs: #957 (Session Sovereignty Report Generator)
|
||||
"""
|
||||
|
||||
import base64
|
||||
import json
|
||||
import time
|
||||
from datetime import UTC, datetime
|
||||
from pathlib import Path
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
pytestmark = pytest.mark.unit
|
||||
|
||||
from timmy.sovereignty.session_report import (
|
||||
_format_duration,
|
||||
_gather_session_data,
|
||||
_gather_sovereignty_data,
|
||||
_render_markdown,
|
||||
commit_report,
|
||||
generate_and_commit_report,
|
||||
generate_report,
|
||||
mark_session_start,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _format_duration
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestFormatDuration:
|
||||
def test_seconds_only(self):
|
||||
assert _format_duration(45) == "45s"
|
||||
|
||||
def test_minutes_and_seconds(self):
|
||||
assert _format_duration(125) == "2m 5s"
|
||||
|
||||
def test_hours_minutes_seconds(self):
|
||||
assert _format_duration(3661) == "1h 1m 1s"
|
||||
|
||||
def test_zero(self):
|
||||
assert _format_duration(0) == "0s"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# mark_session_start + generate_report (smoke)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestMarkSessionStart:
|
||||
def test_sets_session_start(self):
|
||||
import timmy.sovereignty.session_report as sr
|
||||
|
||||
sr._SESSION_START = None
|
||||
mark_session_start()
|
||||
assert sr._SESSION_START is not None
|
||||
assert sr._SESSION_START.tzinfo == UTC
|
||||
|
||||
def test_idempotent_overwrite(self):
|
||||
import timmy.sovereignty.session_report as sr
|
||||
|
||||
mark_session_start()
|
||||
first = sr._SESSION_START
|
||||
time.sleep(0.01)
|
||||
mark_session_start()
|
||||
second = sr._SESSION_START
|
||||
assert second >= first
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _gather_session_data
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestGatherSessionData:
|
||||
def test_returns_defaults_when_no_file(self, tmp_path):
|
||||
mock_logger = MagicMock()
|
||||
mock_logger.flush.return_value = None
|
||||
mock_logger.session_file = tmp_path / "nonexistent.jsonl"
|
||||
|
||||
with patch(
|
||||
"timmy.sovereignty.session_report.get_session_logger",
|
||||
return_value=mock_logger,
|
||||
):
|
||||
data = _gather_session_data()
|
||||
|
||||
assert data["user_messages"] == 0
|
||||
assert data["timmy_messages"] == 0
|
||||
assert data["tool_calls"] == 0
|
||||
assert data["errors"] == 0
|
||||
assert data["tool_call_breakdown"] == {}
|
||||
|
||||
def test_counts_entries_correctly(self, tmp_path):
|
||||
session_file = tmp_path / "session_2026-03-23.jsonl"
|
||||
entries = [
|
||||
{"type": "message", "role": "user", "content": "hello"},
|
||||
{"type": "message", "role": "timmy", "content": "hi"},
|
||||
{"type": "message", "role": "user", "content": "test"},
|
||||
{"type": "tool_call", "tool": "memory_search", "args": {}, "result": "found"},
|
||||
{"type": "tool_call", "tool": "memory_search", "args": {}, "result": "nope"},
|
||||
{"type": "tool_call", "tool": "shell", "args": {}, "result": "ok"},
|
||||
{"type": "error", "error": "boom"},
|
||||
]
|
||||
with open(session_file, "w") as f:
|
||||
for e in entries:
|
||||
f.write(json.dumps(e) + "\n")
|
||||
|
||||
mock_logger = MagicMock()
|
||||
mock_logger.flush.return_value = None
|
||||
mock_logger.session_file = session_file
|
||||
|
||||
with patch(
|
||||
"timmy.sovereignty.session_report.get_session_logger",
|
||||
return_value=mock_logger,
|
||||
):
|
||||
data = _gather_session_data()
|
||||
|
||||
assert data["user_messages"] == 2
|
||||
assert data["timmy_messages"] == 1
|
||||
assert data["tool_calls"] == 3
|
||||
assert data["errors"] == 1
|
||||
assert data["tool_call_breakdown"]["memory_search"] == 2
|
||||
assert data["tool_call_breakdown"]["shell"] == 1
|
||||
|
||||
def test_graceful_on_import_error(self):
|
||||
with patch(
|
||||
"timmy.sovereignty.session_report.get_session_logger",
|
||||
side_effect=ImportError("no session_logger"),
|
||||
):
|
||||
data = _gather_session_data()
|
||||
|
||||
assert data["tool_calls"] == 0
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _gather_sovereignty_data
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestGatherSovereigntyData:
|
||||
def test_returns_empty_on_import_error(self):
|
||||
with patch.dict("sys.modules", {"infrastructure.sovereignty_metrics": None}):
|
||||
with patch(
|
||||
"timmy.sovereignty.session_report.get_sovereignty_store",
|
||||
side_effect=ImportError("no store"),
|
||||
):
|
||||
data = _gather_sovereignty_data()
|
||||
|
||||
assert data["metrics"] == {}
|
||||
assert data["deltas"] == {}
|
||||
assert data["previous_session"] == {}
|
||||
|
||||
def test_populates_deltas_from_history(self):
|
||||
mock_store = MagicMock()
|
||||
mock_store.get_summary.return_value = {
|
||||
"cache_hit_rate": {"current": 0.5, "phase": "week1"},
|
||||
}
|
||||
# get_latest returns newest-first
|
||||
mock_store.get_latest.return_value = [
|
||||
{"value": 0.5},
|
||||
{"value": 0.3},
|
||||
{"value": 0.1},
|
||||
]
|
||||
|
||||
with patch(
|
||||
"timmy.sovereignty.session_report.get_sovereignty_store",
|
||||
return_value=mock_store,
|
||||
):
|
||||
with patch(
|
||||
"timmy.sovereignty.session_report.GRADUATION_TARGETS",
|
||||
{"cache_hit_rate": {"graduation": 0.9}},
|
||||
):
|
||||
data = _gather_sovereignty_data()
|
||||
|
||||
delta = data["deltas"].get("cache_hit_rate")
|
||||
assert delta is not None
|
||||
assert delta["start"] == 0.1 # oldest in window
|
||||
assert delta["end"] == 0.5 # most recent
|
||||
assert data["previous_session"]["cache_hit_rate"] == 0.3
|
||||
|
||||
def test_single_data_point_no_delta(self):
|
||||
mock_store = MagicMock()
|
||||
mock_store.get_summary.return_value = {}
|
||||
mock_store.get_latest.return_value = [{"value": 0.4}]
|
||||
|
||||
with patch(
|
||||
"timmy.sovereignty.session_report.get_sovereignty_store",
|
||||
return_value=mock_store,
|
||||
):
|
||||
with patch(
|
||||
"timmy.sovereignty.session_report.GRADUATION_TARGETS",
|
||||
{"api_cost": {"graduation": 0.01}},
|
||||
):
|
||||
data = _gather_sovereignty_data()
|
||||
|
||||
delta = data["deltas"]["api_cost"]
|
||||
assert delta["start"] == 0.4
|
||||
assert delta["end"] == 0.4
|
||||
assert data["previous_session"]["api_cost"] is None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# generate_report (integration — smoke test)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestGenerateReport:
|
||||
def _minimal_session_data(self):
|
||||
return {
|
||||
"user_messages": 3,
|
||||
"timmy_messages": 3,
|
||||
"tool_calls": 2,
|
||||
"errors": 0,
|
||||
"tool_call_breakdown": {"memory_search": 2},
|
||||
}
|
||||
|
||||
def _minimal_sov_data(self):
|
||||
return {
|
||||
"metrics": {
|
||||
"cache_hit_rate": {"current": 0.45, "phase": "week1"},
|
||||
"api_cost": {"current": 0.12, "phase": "pre-start"},
|
||||
},
|
||||
"deltas": {
|
||||
"cache_hit_rate": {"start": 0.40, "end": 0.45},
|
||||
"api_cost": {"start": 0.10, "end": 0.12},
|
||||
},
|
||||
"previous_session": {
|
||||
"cache_hit_rate": 0.40,
|
||||
"api_cost": 0.10,
|
||||
},
|
||||
}
|
||||
|
||||
def test_smoke_produces_markdown(self):
|
||||
with (
|
||||
patch(
|
||||
"timmy.sovereignty.session_report._gather_session_data",
|
||||
return_value=self._minimal_session_data(),
|
||||
),
|
||||
patch(
|
||||
"timmy.sovereignty.session_report._gather_sovereignty_data",
|
||||
return_value=self._minimal_sov_data(),
|
||||
),
|
||||
):
|
||||
report = generate_report("test-session")
|
||||
|
||||
assert "# Sovereignty Session Report" in report
|
||||
assert "test-session" in report
|
||||
assert "## Session Activity" in report
|
||||
assert "## Sovereignty Scorecard" in report
|
||||
assert "## Cost Breakdown" in report
|
||||
assert "## Trend vs Previous Session" in report
|
||||
|
||||
def test_report_contains_session_stats(self):
|
||||
with (
|
||||
patch(
|
||||
"timmy.sovereignty.session_report._gather_session_data",
|
||||
return_value=self._minimal_session_data(),
|
||||
),
|
||||
patch(
|
||||
"timmy.sovereignty.session_report._gather_sovereignty_data",
|
||||
return_value=self._minimal_sov_data(),
|
||||
),
|
||||
):
|
||||
report = generate_report()
|
||||
|
||||
assert "| User messages | 3 |" in report
|
||||
assert "memory_search" in report
|
||||
|
||||
def test_report_no_previous_session(self):
|
||||
sov = self._minimal_sov_data()
|
||||
sov["previous_session"] = {"cache_hit_rate": None, "api_cost": None}
|
||||
|
||||
with (
|
||||
patch(
|
||||
"timmy.sovereignty.session_report._gather_session_data",
|
||||
return_value=self._minimal_session_data(),
|
||||
),
|
||||
patch(
|
||||
"timmy.sovereignty.session_report._gather_sovereignty_data",
|
||||
return_value=sov,
|
||||
),
|
||||
):
|
||||
report = generate_report()
|
||||
|
||||
assert "No previous session data" in report
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# commit_report
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestCommitReport:
|
||||
def test_returns_false_when_gitea_disabled(self):
|
||||
with patch("timmy.sovereignty.session_report.settings") as mock_settings:
|
||||
mock_settings.gitea_enabled = False
|
||||
result = commit_report("# test", "dashboard")
|
||||
|
||||
assert result is False
|
||||
|
||||
def test_returns_false_when_no_token(self):
|
||||
with patch("timmy.sovereignty.session_report.settings") as mock_settings:
|
||||
mock_settings.gitea_enabled = True
|
||||
mock_settings.gitea_token = ""
|
||||
result = commit_report("# test", "dashboard")
|
||||
|
||||
assert result is False
|
||||
|
||||
def test_creates_file_via_put(self):
|
||||
mock_response = MagicMock()
|
||||
mock_response.status_code = 201
|
||||
mock_response.raise_for_status.return_value = None
|
||||
|
||||
mock_check = MagicMock()
|
||||
mock_check.status_code = 404 # file does not exist yet
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_client.__enter__ = MagicMock(return_value=mock_client)
|
||||
mock_client.__exit__ = MagicMock(return_value=False)
|
||||
mock_client.get.return_value = mock_check
|
||||
mock_client.put.return_value = mock_response
|
||||
|
||||
with (
|
||||
patch("timmy.sovereignty.session_report.settings") as mock_settings,
|
||||
patch("timmy.sovereignty.session_report.httpx.Client", return_value=mock_client),
|
||||
):
|
||||
mock_settings.gitea_enabled = True
|
||||
mock_settings.gitea_token = "fake-token"
|
||||
mock_settings.gitea_url = "http://localhost:3000"
|
||||
mock_settings.gitea_repo = "owner/repo"
|
||||
|
||||
result = commit_report("# report content", "dashboard")
|
||||
|
||||
assert result is True
|
||||
mock_client.put.assert_called_once()
|
||||
call_kwargs = mock_client.put.call_args
|
||||
payload = call_kwargs.kwargs.get("json", call_kwargs.args[1] if len(call_kwargs.args) > 1 else {})
|
||||
decoded = base64.b64decode(payload["content"]).decode()
|
||||
assert "# report content" in decoded
|
||||
|
||||
def test_updates_existing_file_with_sha(self):
|
||||
mock_check = MagicMock()
|
||||
mock_check.status_code = 200
|
||||
mock_check.json.return_value = {"sha": "abc123"}
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.raise_for_status.return_value = None
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_client.__enter__ = MagicMock(return_value=mock_client)
|
||||
mock_client.__exit__ = MagicMock(return_value=False)
|
||||
mock_client.get.return_value = mock_check
|
||||
mock_client.put.return_value = mock_response
|
||||
|
||||
with (
|
||||
patch("timmy.sovereignty.session_report.settings") as mock_settings,
|
||||
patch("timmy.sovereignty.session_report.httpx.Client", return_value=mock_client),
|
||||
):
|
||||
mock_settings.gitea_enabled = True
|
||||
mock_settings.gitea_token = "fake-token"
|
||||
mock_settings.gitea_url = "http://localhost:3000"
|
||||
mock_settings.gitea_repo = "owner/repo"
|
||||
|
||||
result = commit_report("# updated", "dashboard")
|
||||
|
||||
assert result is True
|
||||
payload = mock_client.put.call_args.kwargs.get("json", {})
|
||||
assert payload.get("sha") == "abc123"
|
||||
|
||||
def test_returns_false_on_http_error(self):
|
||||
import httpx
|
||||
|
||||
mock_check = MagicMock()
|
||||
mock_check.status_code = 404
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_client.__enter__ = MagicMock(return_value=mock_client)
|
||||
mock_client.__exit__ = MagicMock(return_value=False)
|
||||
mock_client.get.return_value = mock_check
|
||||
mock_client.put.side_effect = httpx.HTTPStatusError(
|
||||
"403", request=MagicMock(), response=MagicMock(status_code=403)
|
||||
)
|
||||
|
||||
with (
|
||||
patch("timmy.sovereignty.session_report.settings") as mock_settings,
|
||||
patch("timmy.sovereignty.session_report.httpx.Client", return_value=mock_client),
|
||||
):
|
||||
mock_settings.gitea_enabled = True
|
||||
mock_settings.gitea_token = "fake-token"
|
||||
mock_settings.gitea_url = "http://localhost:3000"
|
||||
mock_settings.gitea_repo = "owner/repo"
|
||||
|
||||
result = commit_report("# test", "dashboard")
|
||||
|
||||
assert result is False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# generate_and_commit_report (async)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestGenerateAndCommitReport:
|
||||
async def test_returns_true_on_success(self):
|
||||
with (
|
||||
patch(
|
||||
"timmy.sovereignty.session_report.generate_report",
|
||||
return_value="# mock report",
|
||||
),
|
||||
patch(
|
||||
"timmy.sovereignty.session_report.commit_report",
|
||||
return_value=True,
|
||||
),
|
||||
):
|
||||
result = await generate_and_commit_report("test")
|
||||
|
||||
assert result is True
|
||||
|
||||
async def test_returns_false_when_commit_fails(self):
|
||||
with (
|
||||
patch(
|
||||
"timmy.sovereignty.session_report.generate_report",
|
||||
return_value="# mock report",
|
||||
),
|
||||
patch(
|
||||
"timmy.sovereignty.session_report.commit_report",
|
||||
return_value=False,
|
||||
),
|
||||
):
|
||||
result = await generate_and_commit_report()
|
||||
|
||||
assert result is False
|
||||
|
||||
async def test_graceful_on_exception(self):
|
||||
with patch(
|
||||
"timmy.sovereignty.session_report.generate_report",
|
||||
side_effect=RuntimeError("explode"),
|
||||
):
|
||||
result = await generate_and_commit_report()
|
||||
|
||||
assert result is False
|
||||
308
tests/timmy/test_tools_search.py
Normal file
308
tests/timmy/test_tools_search.py
Normal file
@@ -0,0 +1,308 @@
|
||||
"""Unit tests for web_search and scrape_url tools (SearXNG + Crawl4AI).
|
||||
|
||||
All tests use mocked HTTP — no live services required.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from timmy.tools.search import _extract_crawl_content, scrape_url, web_search
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _mock_requests(json_response=None, status_code=200, raise_exc=None):
|
||||
"""Build a mock requests module whose .get/.post return controlled responses."""
|
||||
mock_req = MagicMock()
|
||||
|
||||
# Exception hierarchy
|
||||
class Timeout(Exception):
|
||||
pass
|
||||
|
||||
class HTTPError(Exception):
|
||||
def __init__(self, *a, response=None, **kw):
|
||||
super().__init__(*a, **kw)
|
||||
self.response = response
|
||||
|
||||
class RequestException(Exception):
|
||||
pass
|
||||
|
||||
exc_mod = MagicMock()
|
||||
exc_mod.Timeout = Timeout
|
||||
exc_mod.HTTPError = HTTPError
|
||||
exc_mod.RequestException = RequestException
|
||||
mock_req.exceptions = exc_mod
|
||||
|
||||
if raise_exc is not None:
|
||||
mock_req.get.side_effect = raise_exc
|
||||
mock_req.post.side_effect = raise_exc
|
||||
else:
|
||||
mock_resp = MagicMock()
|
||||
mock_resp.status_code = status_code
|
||||
mock_resp.json.return_value = json_response or {}
|
||||
if status_code >= 400:
|
||||
mock_resp.raise_for_status.side_effect = HTTPError(
|
||||
response=MagicMock(status_code=status_code)
|
||||
)
|
||||
mock_req.get.return_value = mock_resp
|
||||
mock_req.post.return_value = mock_resp
|
||||
|
||||
return mock_req
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# web_search tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestWebSearch:
|
||||
def test_backend_none_short_circuits(self):
|
||||
"""TIMMY_SEARCH_BACKEND=none returns disabled message immediately."""
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "none"
|
||||
result = web_search("anything")
|
||||
assert "disabled" in result
|
||||
|
||||
def test_missing_requests_package(self):
|
||||
"""Graceful error when requests is not installed."""
|
||||
with patch.dict("sys.modules", {"requests": None}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.search_url = "http://localhost:8888"
|
||||
result = web_search("test query")
|
||||
assert "requests" in result and "not installed" in result
|
||||
|
||||
def test_successful_search(self):
|
||||
"""Happy path: returns formatted result list."""
|
||||
mock_data = {
|
||||
"results": [
|
||||
{"title": "Foo Bar", "url": "https://example.com/foo", "content": "Foo is great"},
|
||||
{"title": "Baz", "url": "https://example.com/baz", "content": "Baz rules"},
|
||||
]
|
||||
}
|
||||
mock_req = _mock_requests(json_response=mock_data)
|
||||
with patch.dict("sys.modules", {"requests": mock_req}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.search_url = "http://localhost:8888"
|
||||
result = web_search("foo bar")
|
||||
|
||||
assert "Foo Bar" in result
|
||||
assert "https://example.com/foo" in result
|
||||
assert "Baz" in result
|
||||
assert "foo bar" in result
|
||||
|
||||
def test_no_results(self):
|
||||
"""Empty results list returns a helpful no-results message."""
|
||||
mock_req = _mock_requests(json_response={"results": []})
|
||||
with patch.dict("sys.modules", {"requests": mock_req}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.search_url = "http://localhost:8888"
|
||||
result = web_search("xyzzy")
|
||||
assert "No results" in result
|
||||
|
||||
def test_num_results_respected(self):
|
||||
"""Only up to num_results entries are returned."""
|
||||
mock_data = {
|
||||
"results": [
|
||||
{"title": f"Result {i}", "url": f"https://example.com/{i}", "content": "x"}
|
||||
for i in range(10)
|
||||
]
|
||||
}
|
||||
mock_req = _mock_requests(json_response=mock_data)
|
||||
with patch.dict("sys.modules", {"requests": mock_req}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.search_url = "http://localhost:8888"
|
||||
result = web_search("test", num_results=3)
|
||||
|
||||
# Only 3 numbered entries should appear
|
||||
assert "1." in result
|
||||
assert "3." in result
|
||||
assert "4." not in result
|
||||
|
||||
def test_service_unavailable(self):
|
||||
"""Connection error degrades gracefully."""
|
||||
mock_req = MagicMock()
|
||||
mock_req.get.side_effect = OSError("connection refused")
|
||||
mock_req.exceptions = MagicMock()
|
||||
with patch.dict("sys.modules", {"requests": mock_req}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.search_url = "http://localhost:8888"
|
||||
result = web_search("test")
|
||||
assert "not reachable" in result or "unavailable" in result
|
||||
|
||||
def test_catalog_entry_exists(self):
|
||||
"""web_search must appear in the tool catalog."""
|
||||
from timmy.tools import get_all_available_tools
|
||||
|
||||
catalog = get_all_available_tools()
|
||||
assert "web_search" in catalog
|
||||
assert "orchestrator" in catalog["web_search"]["available_in"]
|
||||
assert "echo" in catalog["web_search"]["available_in"]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# scrape_url tests
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestScrapeUrl:
|
||||
def test_invalid_url_no_scheme(self):
|
||||
"""URLs without http(s) scheme are rejected before any HTTP call."""
|
||||
result = scrape_url("example.com/page")
|
||||
assert "Error: invalid URL" in result
|
||||
|
||||
def test_invalid_url_empty(self):
|
||||
result = scrape_url("")
|
||||
assert "Error: invalid URL" in result
|
||||
|
||||
def test_backend_none_short_circuits(self):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "none"
|
||||
result = scrape_url("https://example.com")
|
||||
assert "disabled" in result
|
||||
|
||||
def test_missing_requests_package(self):
|
||||
with patch.dict("sys.modules", {"requests": None}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.crawl_url = "http://localhost:11235"
|
||||
result = scrape_url("https://example.com")
|
||||
assert "requests" in result and "not installed" in result
|
||||
|
||||
def test_sync_result_returned_immediately(self):
|
||||
"""If Crawl4AI returns results in the POST response, use them directly."""
|
||||
mock_data = {
|
||||
"results": [{"markdown": "# Hello\n\nThis is the page content."}]
|
||||
}
|
||||
mock_req = _mock_requests(json_response=mock_data)
|
||||
with patch.dict("sys.modules", {"requests": mock_req}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.crawl_url = "http://localhost:11235"
|
||||
result = scrape_url("https://example.com")
|
||||
|
||||
assert "Hello" in result
|
||||
assert "page content" in result
|
||||
|
||||
def test_async_poll_completed(self):
|
||||
"""Async task_id flow: polls until completed and returns content."""
|
||||
submit_response = MagicMock()
|
||||
submit_response.json.return_value = {"task_id": "abc123"}
|
||||
submit_response.raise_for_status.return_value = None
|
||||
|
||||
poll_response = MagicMock()
|
||||
poll_response.json.return_value = {
|
||||
"status": "completed",
|
||||
"results": [{"markdown": "# Async content"}],
|
||||
}
|
||||
poll_response.raise_for_status.return_value = None
|
||||
|
||||
mock_req = MagicMock()
|
||||
mock_req.post.return_value = submit_response
|
||||
mock_req.get.return_value = poll_response
|
||||
mock_req.exceptions = MagicMock()
|
||||
|
||||
with patch.dict("sys.modules", {"requests": mock_req}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.crawl_url = "http://localhost:11235"
|
||||
with patch("timmy.tools.search.time") as mock_time:
|
||||
mock_time.sleep = MagicMock()
|
||||
result = scrape_url("https://example.com")
|
||||
|
||||
assert "Async content" in result
|
||||
|
||||
def test_async_poll_failed_task(self):
|
||||
"""Crawl4AI task failure is reported clearly."""
|
||||
submit_response = MagicMock()
|
||||
submit_response.json.return_value = {"task_id": "abc123"}
|
||||
submit_response.raise_for_status.return_value = None
|
||||
|
||||
poll_response = MagicMock()
|
||||
poll_response.json.return_value = {"status": "failed", "error": "site blocked"}
|
||||
poll_response.raise_for_status.return_value = None
|
||||
|
||||
mock_req = MagicMock()
|
||||
mock_req.post.return_value = submit_response
|
||||
mock_req.get.return_value = poll_response
|
||||
mock_req.exceptions = MagicMock()
|
||||
|
||||
with patch.dict("sys.modules", {"requests": mock_req}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.crawl_url = "http://localhost:11235"
|
||||
with patch("timmy.tools.search.time") as mock_time:
|
||||
mock_time.sleep = MagicMock()
|
||||
result = scrape_url("https://example.com")
|
||||
|
||||
assert "failed" in result and "site blocked" in result
|
||||
|
||||
def test_service_unavailable(self):
|
||||
"""Connection error degrades gracefully."""
|
||||
mock_req = MagicMock()
|
||||
mock_req.post.side_effect = OSError("connection refused")
|
||||
mock_req.exceptions = MagicMock()
|
||||
with patch.dict("sys.modules", {"requests": mock_req}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.crawl_url = "http://localhost:11235"
|
||||
result = scrape_url("https://example.com")
|
||||
assert "not reachable" in result or "unavailable" in result
|
||||
|
||||
def test_content_truncation(self):
|
||||
"""Content longer than ~4000 tokens is truncated."""
|
||||
long_content = "x" * 20000
|
||||
mock_data = {"results": [{"markdown": long_content}]}
|
||||
mock_req = _mock_requests(json_response=mock_data)
|
||||
with patch.dict("sys.modules", {"requests": mock_req}):
|
||||
with patch("timmy.tools.search.settings") as mock_settings:
|
||||
mock_settings.timmy_search_backend = "searxng"
|
||||
mock_settings.crawl_url = "http://localhost:11235"
|
||||
result = scrape_url("https://example.com")
|
||||
|
||||
assert "[…truncated" in result
|
||||
assert len(result) < 17000
|
||||
|
||||
def test_catalog_entry_exists(self):
|
||||
"""scrape_url must appear in the tool catalog."""
|
||||
from timmy.tools import get_all_available_tools
|
||||
|
||||
catalog = get_all_available_tools()
|
||||
assert "scrape_url" in catalog
|
||||
assert "orchestrator" in catalog["scrape_url"]["available_in"]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _extract_crawl_content helper
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestExtractCrawlContent:
|
||||
def test_empty_results(self):
|
||||
result = _extract_crawl_content([], "https://example.com")
|
||||
assert "No content" in result
|
||||
|
||||
def test_markdown_field_preferred(self):
|
||||
results = [{"markdown": "# Title", "content": "fallback"}]
|
||||
result = _extract_crawl_content(results, "https://example.com")
|
||||
assert "Title" in result
|
||||
|
||||
def test_fallback_to_content_field(self):
|
||||
results = [{"content": "plain text content"}]
|
||||
result = _extract_crawl_content(results, "https://example.com")
|
||||
assert "plain text content" in result
|
||||
|
||||
def test_no_content_fields(self):
|
||||
results = [{"url": "https://example.com"}]
|
||||
result = _extract_crawl_content(results, "https://example.com")
|
||||
assert "No readable content" in result
|
||||
270
tests/timmy_automations/test_orchestrator.py
Normal file
270
tests/timmy_automations/test_orchestrator.py
Normal file
@@ -0,0 +1,270 @@
|
||||
"""Tests for Daily Run orchestrator — health snapshot integration.
|
||||
|
||||
Verifies that the orchestrator runs a pre-flight health snapshot before
|
||||
any coding work begins, and aborts on red status unless --force is passed.
|
||||
|
||||
Refs: #923
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
# Add timmy_automations to path for imports
|
||||
_TA_PATH = Path(__file__).resolve().parent.parent.parent / "timmy_automations" / "daily_run"
|
||||
if str(_TA_PATH) not in sys.path:
|
||||
sys.path.insert(0, str(_TA_PATH))
|
||||
# Also add utils path
|
||||
_TA_UTILS = Path(__file__).resolve().parent.parent.parent / "timmy_automations"
|
||||
if str(_TA_UTILS) not in sys.path:
|
||||
sys.path.insert(0, str(_TA_UTILS))
|
||||
|
||||
import health_snapshot as hs
|
||||
import orchestrator as orch
|
||||
|
||||
|
||||
def _make_snapshot(overall_status: str) -> hs.HealthSnapshot:
|
||||
"""Build a minimal HealthSnapshot for testing."""
|
||||
return hs.HealthSnapshot(
|
||||
timestamp="2026-01-01T00:00:00+00:00",
|
||||
overall_status=overall_status,
|
||||
ci=hs.CISignal(status="pass", message="CI passing"),
|
||||
issues=hs.IssueSignal(count=0, p0_count=0, p1_count=0),
|
||||
flakiness=hs.FlakinessSignal(
|
||||
status="healthy",
|
||||
recent_failures=0,
|
||||
recent_cycles=10,
|
||||
failure_rate=0.0,
|
||||
message="All good",
|
||||
),
|
||||
tokens=hs.TokenEconomySignal(status="balanced", message="Balanced"),
|
||||
)
|
||||
|
||||
|
||||
def _make_red_snapshot() -> hs.HealthSnapshot:
|
||||
return hs.HealthSnapshot(
|
||||
timestamp="2026-01-01T00:00:00+00:00",
|
||||
overall_status="red",
|
||||
ci=hs.CISignal(status="fail", message="CI failed"),
|
||||
issues=hs.IssueSignal(count=1, p0_count=1, p1_count=0),
|
||||
flakiness=hs.FlakinessSignal(
|
||||
status="critical",
|
||||
recent_failures=8,
|
||||
recent_cycles=10,
|
||||
failure_rate=0.8,
|
||||
message="High flakiness",
|
||||
),
|
||||
tokens=hs.TokenEconomySignal(status="unknown", message="No data"),
|
||||
)
|
||||
|
||||
|
||||
def _default_args(**overrides) -> argparse.Namespace:
|
||||
"""Build an argparse Namespace with defaults matching the orchestrator flags."""
|
||||
defaults = {
|
||||
"review": False,
|
||||
"json": False,
|
||||
"max_items": None,
|
||||
"skip_health_check": False,
|
||||
"force": False,
|
||||
}
|
||||
defaults.update(overrides)
|
||||
return argparse.Namespace(**defaults)
|
||||
|
||||
|
||||
class TestRunHealthSnapshot:
|
||||
"""Test run_health_snapshot() — the pre-flight check called by main()."""
|
||||
|
||||
def test_green_returns_zero(self, capsys):
|
||||
"""Green snapshot returns 0 (proceed)."""
|
||||
args = _default_args()
|
||||
|
||||
with patch.object(orch, "_generate_health_snapshot", return_value=_make_snapshot("green")):
|
||||
rc = orch.run_health_snapshot(args)
|
||||
|
||||
assert rc == 0
|
||||
|
||||
def test_yellow_returns_zero(self, capsys):
|
||||
"""Yellow snapshot returns 0 (proceed with caution)."""
|
||||
args = _default_args()
|
||||
|
||||
with patch.object(orch, "_generate_health_snapshot", return_value=_make_snapshot("yellow")):
|
||||
rc = orch.run_health_snapshot(args)
|
||||
|
||||
assert rc == 0
|
||||
|
||||
def test_red_returns_one(self, capsys):
|
||||
"""Red snapshot returns 1 (abort)."""
|
||||
args = _default_args()
|
||||
|
||||
with patch.object(orch, "_generate_health_snapshot", return_value=_make_red_snapshot()):
|
||||
rc = orch.run_health_snapshot(args)
|
||||
|
||||
assert rc == 1
|
||||
|
||||
def test_red_with_force_returns_zero(self, capsys):
|
||||
"""Red snapshot with --force returns 0 (proceed anyway)."""
|
||||
args = _default_args(force=True)
|
||||
|
||||
with patch.object(orch, "_generate_health_snapshot", return_value=_make_red_snapshot()):
|
||||
rc = orch.run_health_snapshot(args)
|
||||
|
||||
assert rc == 0
|
||||
|
||||
def test_snapshot_exception_is_skipped(self, capsys):
|
||||
"""If health snapshot raises, it degrades gracefully and returns 0."""
|
||||
args = _default_args()
|
||||
|
||||
with patch.object(orch, "_generate_health_snapshot", side_effect=RuntimeError("boom")):
|
||||
rc = orch.run_health_snapshot(args)
|
||||
|
||||
assert rc == 0
|
||||
captured = capsys.readouterr()
|
||||
assert "warning" in captured.err.lower() or "skipping" in captured.err.lower()
|
||||
|
||||
def test_snapshot_prints_summary(self, capsys):
|
||||
"""Health snapshot prints a pre-flight summary block."""
|
||||
args = _default_args()
|
||||
|
||||
with patch.object(orch, "_generate_health_snapshot", return_value=_make_snapshot("green")):
|
||||
orch.run_health_snapshot(args)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert "PRE-FLIGHT HEALTH CHECK" in captured.out
|
||||
assert "CI" in captured.out
|
||||
|
||||
def test_red_prints_abort_message(self, capsys):
|
||||
"""Red snapshot prints an abort message to stderr."""
|
||||
args = _default_args()
|
||||
|
||||
with patch.object(orch, "_generate_health_snapshot", return_value=_make_red_snapshot()):
|
||||
orch.run_health_snapshot(args)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert "RED" in captured.err or "aborting" in captured.err.lower()
|
||||
|
||||
def test_p0_issues_shown_in_output(self, capsys):
|
||||
"""P0 issue count is shown in the pre-flight output."""
|
||||
args = _default_args()
|
||||
snapshot = hs.HealthSnapshot(
|
||||
timestamp="2026-01-01T00:00:00+00:00",
|
||||
overall_status="red",
|
||||
ci=hs.CISignal(status="pass", message="CI passing"),
|
||||
issues=hs.IssueSignal(count=2, p0_count=2, p1_count=0),
|
||||
flakiness=hs.FlakinessSignal(
|
||||
status="healthy",
|
||||
recent_failures=0,
|
||||
recent_cycles=10,
|
||||
failure_rate=0.0,
|
||||
message="All good",
|
||||
),
|
||||
tokens=hs.TokenEconomySignal(status="balanced", message="Balanced"),
|
||||
)
|
||||
|
||||
with patch.object(orch, "_generate_health_snapshot", return_value=snapshot):
|
||||
orch.run_health_snapshot(args)
|
||||
|
||||
captured = capsys.readouterr()
|
||||
assert "P0" in captured.out
|
||||
|
||||
|
||||
class TestMainHealthCheckIntegration:
|
||||
"""Test that main() runs health snapshot before any coding work."""
|
||||
|
||||
def _patch_gitea_unavailable(self):
|
||||
return patch.object(orch.GiteaClient, "is_available", return_value=False)
|
||||
|
||||
def test_main_runs_health_check_before_gitea(self):
|
||||
"""Health snapshot is called before Gitea client work."""
|
||||
call_order = []
|
||||
|
||||
def fake_snapshot(*_a, **_kw):
|
||||
call_order.append("health")
|
||||
return _make_snapshot("green")
|
||||
|
||||
def fake_gitea_available(self):
|
||||
call_order.append("gitea")
|
||||
return False
|
||||
|
||||
args = _default_args()
|
||||
|
||||
with (
|
||||
patch.object(orch, "_generate_health_snapshot", side_effect=fake_snapshot),
|
||||
patch.object(orch.GiteaClient, "is_available", fake_gitea_available),
|
||||
patch("sys.argv", ["orchestrator"]),
|
||||
):
|
||||
orch.main()
|
||||
|
||||
assert call_order.index("health") < call_order.index("gitea")
|
||||
|
||||
def test_main_aborts_on_red_before_gitea(self):
|
||||
"""main() aborts with non-zero exit code when health is red."""
|
||||
gitea_called = []
|
||||
|
||||
def fake_gitea_available(self):
|
||||
gitea_called.append(True)
|
||||
return True
|
||||
|
||||
with (
|
||||
patch.object(orch, "_generate_health_snapshot", return_value=_make_red_snapshot()),
|
||||
patch.object(orch.GiteaClient, "is_available", fake_gitea_available),
|
||||
patch("sys.argv", ["orchestrator"]),
|
||||
):
|
||||
rc = orch.main()
|
||||
|
||||
assert rc != 0
|
||||
assert not gitea_called, "Gitea should NOT be called when health is red"
|
||||
|
||||
def test_main_skips_health_check_with_flag(self):
|
||||
"""--skip-health-check bypasses the pre-flight snapshot."""
|
||||
health_called = []
|
||||
|
||||
def fake_snapshot(*_a, **_kw):
|
||||
health_called.append(True)
|
||||
return _make_snapshot("green")
|
||||
|
||||
with (
|
||||
patch.object(orch, "_generate_health_snapshot", side_effect=fake_snapshot),
|
||||
patch.object(orch.GiteaClient, "is_available", return_value=False),
|
||||
patch("sys.argv", ["orchestrator", "--skip-health-check"]),
|
||||
):
|
||||
orch.main()
|
||||
|
||||
assert not health_called, "Health snapshot should be skipped"
|
||||
|
||||
def test_main_force_flag_continues_despite_red(self):
|
||||
"""--force allows Daily Run to continue even when health is red."""
|
||||
gitea_called = []
|
||||
|
||||
def fake_gitea_available(self):
|
||||
gitea_called.append(True)
|
||||
return False # Gitea unavailable → exits early but after health check
|
||||
|
||||
with (
|
||||
patch.object(orch, "_generate_health_snapshot", return_value=_make_red_snapshot()),
|
||||
patch.object(orch.GiteaClient, "is_available", fake_gitea_available),
|
||||
patch("sys.argv", ["orchestrator", "--force"]),
|
||||
):
|
||||
orch.main()
|
||||
|
||||
# Gitea was reached despite red status because --force was passed
|
||||
assert gitea_called
|
||||
|
||||
def test_main_json_output_on_red_includes_error(self, capsys):
|
||||
"""JSON output includes error key when health is red."""
|
||||
with (
|
||||
patch.object(orch, "_generate_health_snapshot", return_value=_make_red_snapshot()),
|
||||
patch.object(orch.GiteaClient, "is_available", return_value=True),
|
||||
patch("sys.argv", ["orchestrator", "--json"]),
|
||||
):
|
||||
rc = orch.main()
|
||||
|
||||
assert rc != 0
|
||||
captured = capsys.readouterr()
|
||||
data = json.loads(captured.out)
|
||||
assert "error" in data
|
||||
135
tests/unit/test_airllm_backend.py
Normal file
135
tests/unit/test_airllm_backend.py
Normal file
@@ -0,0 +1,135 @@
|
||||
"""Unit tests for AirLLM backend graceful degradation.
|
||||
|
||||
Verifies that setting TIMMY_MODEL_BACKEND=airllm on non-Apple-Silicon hardware
|
||||
(Intel Mac, Linux, Windows) or when the airllm package is not installed
|
||||
falls back to the Ollama backend without crashing.
|
||||
|
||||
Refs #1284
|
||||
"""
|
||||
|
||||
import sys
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
pytestmark = pytest.mark.unit
|
||||
|
||||
|
||||
class TestIsAppleSilicon:
|
||||
"""is_apple_silicon() correctly identifies the host platform."""
|
||||
|
||||
def test_returns_true_on_arm64_darwin(self):
|
||||
from timmy.backends import is_apple_silicon
|
||||
|
||||
with patch("platform.system", return_value="Darwin"), patch(
|
||||
"platform.machine", return_value="arm64"
|
||||
):
|
||||
assert is_apple_silicon() is True
|
||||
|
||||
def test_returns_false_on_intel_mac(self):
|
||||
from timmy.backends import is_apple_silicon
|
||||
|
||||
with patch("platform.system", return_value="Darwin"), patch(
|
||||
"platform.machine", return_value="x86_64"
|
||||
):
|
||||
assert is_apple_silicon() is False
|
||||
|
||||
def test_returns_false_on_linux(self):
|
||||
from timmy.backends import is_apple_silicon
|
||||
|
||||
with patch("platform.system", return_value="Linux"), patch(
|
||||
"platform.machine", return_value="x86_64"
|
||||
):
|
||||
assert is_apple_silicon() is False
|
||||
|
||||
def test_returns_false_on_windows(self):
|
||||
from timmy.backends import is_apple_silicon
|
||||
|
||||
with patch("platform.system", return_value="Windows"), patch(
|
||||
"platform.machine", return_value="AMD64"
|
||||
):
|
||||
assert is_apple_silicon() is False
|
||||
|
||||
|
||||
class TestAirLLMGracefulDegradation:
|
||||
"""create_timmy(backend='airllm') falls back to Ollama on unsupported platforms."""
|
||||
|
||||
def _make_fake_ollama_agent(self):
|
||||
"""Return a lightweight stub that satisfies the Agno Agent interface."""
|
||||
agent = MagicMock()
|
||||
agent.run = MagicMock(return_value=MagicMock(content="ok"))
|
||||
return agent
|
||||
|
||||
def test_falls_back_to_ollama_on_non_apple_silicon(self, caplog):
|
||||
"""On Intel/Linux, airllm backend logs a warning and creates an Ollama agent."""
|
||||
import logging
|
||||
|
||||
from timmy.agent import create_timmy
|
||||
|
||||
fake_agent = self._make_fake_ollama_agent()
|
||||
|
||||
with (
|
||||
patch("timmy.backends.is_apple_silicon", return_value=False),
|
||||
patch("timmy.agent._create_ollama_agent", return_value=fake_agent) as mock_create,
|
||||
patch("timmy.agent._resolve_model_with_fallback", return_value=("qwen3:8b", False)),
|
||||
patch("timmy.agent._check_model_available", return_value=True),
|
||||
patch("timmy.agent._build_tools_list", return_value=[]),
|
||||
patch("timmy.agent._build_prompt", return_value="test prompt"),
|
||||
caplog.at_level(logging.WARNING, logger="timmy.agent"),
|
||||
):
|
||||
result = create_timmy(backend="airllm")
|
||||
|
||||
assert result is fake_agent
|
||||
mock_create.assert_called_once()
|
||||
assert "Apple Silicon" in caplog.text
|
||||
|
||||
def test_falls_back_to_ollama_when_airllm_not_installed(self, caplog):
|
||||
"""When the airllm package is missing, log a warning and use Ollama."""
|
||||
import logging
|
||||
|
||||
from timmy.agent import create_timmy
|
||||
|
||||
fake_agent = self._make_fake_ollama_agent()
|
||||
|
||||
# Simulate Apple Silicon + missing airllm package
|
||||
def _import_side_effect(name, *args, **kwargs):
|
||||
if name == "airllm":
|
||||
raise ImportError("No module named 'airllm'")
|
||||
return original_import(name, *args, **kwargs)
|
||||
|
||||
original_import = __builtins__["__import__"] if isinstance(__builtins__, dict) else __import__
|
||||
|
||||
with (
|
||||
patch("timmy.backends.is_apple_silicon", return_value=True),
|
||||
patch("builtins.__import__", side_effect=_import_side_effect),
|
||||
patch("timmy.agent._create_ollama_agent", return_value=fake_agent) as mock_create,
|
||||
patch("timmy.agent._resolve_model_with_fallback", return_value=("qwen3:8b", False)),
|
||||
patch("timmy.agent._check_model_available", return_value=True),
|
||||
patch("timmy.agent._build_tools_list", return_value=[]),
|
||||
patch("timmy.agent._build_prompt", return_value="test prompt"),
|
||||
caplog.at_level(logging.WARNING, logger="timmy.agent"),
|
||||
):
|
||||
result = create_timmy(backend="airllm")
|
||||
|
||||
assert result is fake_agent
|
||||
mock_create.assert_called_once()
|
||||
assert "airllm" in caplog.text.lower() or "AirLLM" in caplog.text
|
||||
|
||||
def test_airllm_backend_does_not_raise(self):
|
||||
"""create_timmy(backend='airllm') never raises — it degrades gracefully."""
|
||||
from timmy.agent import create_timmy
|
||||
|
||||
fake_agent = self._make_fake_ollama_agent()
|
||||
|
||||
with (
|
||||
patch("timmy.backends.is_apple_silicon", return_value=False),
|
||||
patch("timmy.agent._create_ollama_agent", return_value=fake_agent),
|
||||
patch("timmy.agent._resolve_model_with_fallback", return_value=("qwen3:8b", False)),
|
||||
patch("timmy.agent._check_model_available", return_value=True),
|
||||
patch("timmy.agent._build_tools_list", return_value=[]),
|
||||
patch("timmy.agent._build_prompt", return_value="test prompt"),
|
||||
):
|
||||
# Should not raise under any circumstances
|
||||
result = create_timmy(backend="airllm")
|
||||
|
||||
assert result is not None
|
||||
235
tests/unit/test_brain_worker.py
Normal file
235
tests/unit/test_brain_worker.py
Normal file
@@ -0,0 +1,235 @@
|
||||
"""Unit tests for brain.worker.DistributedWorker."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import threading
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from brain.worker import MAX_RETRIES, DelegatedTask, DistributedWorker
|
||||
|
||||
|
||||
@pytest.fixture(autouse=True)
|
||||
def clear_task_registry():
|
||||
"""Reset the worker registry before each test."""
|
||||
DistributedWorker.clear()
|
||||
yield
|
||||
DistributedWorker.clear()
|
||||
|
||||
|
||||
class TestSubmit:
|
||||
def test_returns_task_id(self):
|
||||
with patch.object(DistributedWorker, "_run_task"):
|
||||
task_id = DistributedWorker.submit("researcher", "research", "find something")
|
||||
assert isinstance(task_id, str)
|
||||
assert len(task_id) == 8
|
||||
|
||||
def test_task_registered_as_queued(self):
|
||||
with patch.object(DistributedWorker, "_run_task"):
|
||||
task_id = DistributedWorker.submit("coder", "code", "fix the bug")
|
||||
status = DistributedWorker.get_status(task_id)
|
||||
assert status["found"] is True
|
||||
assert status["task_id"] == task_id
|
||||
assert status["agent"] == "coder"
|
||||
|
||||
def test_unique_task_ids(self):
|
||||
with patch.object(DistributedWorker, "_run_task"):
|
||||
ids = [DistributedWorker.submit("coder", "code", "task") for _ in range(10)]
|
||||
assert len(set(ids)) == 10
|
||||
|
||||
def test_starts_daemon_thread(self):
|
||||
event = threading.Event()
|
||||
|
||||
def fake_run_task(record):
|
||||
event.set()
|
||||
|
||||
with patch.object(DistributedWorker, "_run_task", side_effect=fake_run_task):
|
||||
DistributedWorker.submit("coder", "code", "something")
|
||||
|
||||
assert event.wait(timeout=2), "Background thread did not start"
|
||||
|
||||
def test_priority_stored(self):
|
||||
with patch.object(DistributedWorker, "_run_task"):
|
||||
task_id = DistributedWorker.submit("coder", "code", "task", priority="high")
|
||||
status = DistributedWorker.get_status(task_id)
|
||||
assert status["priority"] == "high"
|
||||
|
||||
|
||||
class TestGetStatus:
|
||||
def test_unknown_task_id(self):
|
||||
result = DistributedWorker.get_status("deadbeef")
|
||||
assert result["found"] is False
|
||||
assert result["task_id"] == "deadbeef"
|
||||
|
||||
def test_known_task_has_all_fields(self):
|
||||
with patch.object(DistributedWorker, "_run_task"):
|
||||
task_id = DistributedWorker.submit("writer", "writing", "write a blog post")
|
||||
status = DistributedWorker.get_status(task_id)
|
||||
for key in ("found", "task_id", "agent", "role", "status", "backend", "created_at"):
|
||||
assert key in status, f"Missing key: {key}"
|
||||
|
||||
|
||||
class TestListTasks:
|
||||
def test_empty_initially(self):
|
||||
assert DistributedWorker.list_tasks() == []
|
||||
|
||||
def test_returns_registered_tasks(self):
|
||||
with patch.object(DistributedWorker, "_run_task"):
|
||||
DistributedWorker.submit("coder", "code", "task A")
|
||||
DistributedWorker.submit("writer", "writing", "task B")
|
||||
tasks = DistributedWorker.list_tasks()
|
||||
assert len(tasks) == 2
|
||||
agents = {t["agent"] for t in tasks}
|
||||
assert agents == {"coder", "writer"}
|
||||
|
||||
|
||||
class TestSelectBackend:
|
||||
def test_defaults_to_agentic_loop(self):
|
||||
with patch("brain.worker.logger"):
|
||||
backend = DistributedWorker._select_backend("code", "fix the bug")
|
||||
assert backend == "agentic_loop"
|
||||
|
||||
def test_kimi_for_heavy_research_with_gitea(self):
|
||||
mock_settings = MagicMock()
|
||||
mock_settings.gitea_enabled = True
|
||||
mock_settings.gitea_token = "tok"
|
||||
mock_settings.paperclip_api_key = ""
|
||||
|
||||
with (
|
||||
patch("timmy.kimi_delegation.exceeds_local_capacity", return_value=True),
|
||||
patch("config.settings", mock_settings),
|
||||
):
|
||||
backend = DistributedWorker._select_backend("research", "comprehensive survey " * 10)
|
||||
assert backend == "kimi"
|
||||
|
||||
def test_agentic_loop_when_no_gitea(self):
|
||||
mock_settings = MagicMock()
|
||||
mock_settings.gitea_enabled = False
|
||||
mock_settings.gitea_token = ""
|
||||
mock_settings.paperclip_api_key = ""
|
||||
|
||||
with patch("config.settings", mock_settings):
|
||||
backend = DistributedWorker._select_backend("research", "comprehensive survey " * 10)
|
||||
assert backend == "agentic_loop"
|
||||
|
||||
def test_paperclip_when_api_key_configured(self):
|
||||
mock_settings = MagicMock()
|
||||
mock_settings.gitea_enabled = False
|
||||
mock_settings.gitea_token = ""
|
||||
mock_settings.paperclip_api_key = "pk_test_123"
|
||||
|
||||
with patch("config.settings", mock_settings):
|
||||
backend = DistributedWorker._select_backend("code", "build a widget")
|
||||
assert backend == "paperclip"
|
||||
|
||||
|
||||
class TestRunTask:
|
||||
def test_marks_completed_on_success(self):
|
||||
record = DelegatedTask(
|
||||
task_id="abc12345",
|
||||
agent_name="coder",
|
||||
agent_role="code",
|
||||
task_description="fix bug",
|
||||
priority="normal",
|
||||
backend="agentic_loop",
|
||||
)
|
||||
|
||||
with patch.object(DistributedWorker, "_dispatch", return_value={"success": True}):
|
||||
DistributedWorker._run_task(record)
|
||||
|
||||
assert record.status == "completed"
|
||||
assert record.result == {"success": True}
|
||||
assert record.error is None
|
||||
|
||||
def test_marks_failed_after_exhausting_retries(self):
|
||||
record = DelegatedTask(
|
||||
task_id="fail1234",
|
||||
agent_name="coder",
|
||||
agent_role="code",
|
||||
task_description="broken task",
|
||||
priority="normal",
|
||||
backend="agentic_loop",
|
||||
)
|
||||
|
||||
with patch.object(DistributedWorker, "_dispatch", side_effect=RuntimeError("boom")):
|
||||
DistributedWorker._run_task(record)
|
||||
|
||||
assert record.status == "failed"
|
||||
assert "boom" in record.error
|
||||
assert record.retries == MAX_RETRIES
|
||||
|
||||
def test_retries_before_failing(self):
|
||||
record = DelegatedTask(
|
||||
task_id="retry001",
|
||||
agent_name="coder",
|
||||
agent_role="code",
|
||||
task_description="flaky task",
|
||||
priority="normal",
|
||||
backend="agentic_loop",
|
||||
)
|
||||
|
||||
call_count = 0
|
||||
|
||||
def flaky_dispatch(r):
|
||||
nonlocal call_count
|
||||
call_count += 1
|
||||
if call_count < MAX_RETRIES + 1:
|
||||
raise RuntimeError("transient failure")
|
||||
return {"success": True}
|
||||
|
||||
with patch.object(DistributedWorker, "_dispatch", side_effect=flaky_dispatch):
|
||||
DistributedWorker._run_task(record)
|
||||
|
||||
assert record.status == "completed"
|
||||
assert call_count == MAX_RETRIES + 1
|
||||
|
||||
def test_succeeds_on_first_attempt(self):
|
||||
record = DelegatedTask(
|
||||
task_id="ok000001",
|
||||
agent_name="writer",
|
||||
agent_role="writing",
|
||||
task_description="write summary",
|
||||
priority="low",
|
||||
backend="agentic_loop",
|
||||
)
|
||||
|
||||
with patch.object(DistributedWorker, "_dispatch", return_value={"summary": "done"}):
|
||||
DistributedWorker._run_task(record)
|
||||
|
||||
assert record.status == "completed"
|
||||
assert record.retries == 0
|
||||
|
||||
|
||||
class TestDelegatetaskIntegration:
|
||||
"""Integration: delegate_task should wire to DistributedWorker."""
|
||||
|
||||
def test_delegate_task_returns_task_id(self):
|
||||
from timmy.tools_delegation import delegate_task
|
||||
|
||||
with patch.object(DistributedWorker, "_run_task"):
|
||||
result = delegate_task("researcher", "research something for me")
|
||||
|
||||
assert result["success"] is True
|
||||
assert result["task_id"] is not None
|
||||
assert result["status"] == "queued"
|
||||
|
||||
def test_delegate_task_status_queued_for_valid_agent(self):
|
||||
from timmy.tools_delegation import delegate_task
|
||||
|
||||
with patch.object(DistributedWorker, "_run_task"):
|
||||
result = delegate_task("coder", "implement feature X")
|
||||
|
||||
assert result["status"] == "queued"
|
||||
assert len(result["task_id"]) == 8
|
||||
|
||||
def test_task_in_registry_after_delegation(self):
|
||||
from timmy.tools_delegation import delegate_task
|
||||
|
||||
with patch.object(DistributedWorker, "_run_task"):
|
||||
result = delegate_task("writer", "write documentation")
|
||||
|
||||
task_id = result["task_id"]
|
||||
status = DistributedWorker.get_status(task_id)
|
||||
assert status["found"] is True
|
||||
assert status["agent"] == "writer"
|
||||
269
tests/unit/test_self_correction.py
Normal file
269
tests/unit/test_self_correction.py
Normal file
@@ -0,0 +1,269 @@
|
||||
"""Unit tests for infrastructure.self_correction."""
|
||||
|
||||
import os
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from unittest.mock import patch
|
||||
|
||||
import pytest
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fixtures
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.fixture(autouse=True)
|
||||
def _isolated_db(tmp_path, monkeypatch):
|
||||
"""Point the self-correction module at a fresh temp database per test."""
|
||||
import infrastructure.self_correction as sc_mod
|
||||
|
||||
# Reset the cached path so each test gets a clean DB
|
||||
sc_mod._DB_PATH = tmp_path / "self_correction.db"
|
||||
yield
|
||||
sc_mod._DB_PATH = None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# log_self_correction
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestLogSelfCorrection:
|
||||
def test_returns_event_id(self):
|
||||
from infrastructure.self_correction import log_self_correction
|
||||
|
||||
eid = log_self_correction(
|
||||
source="test",
|
||||
original_intent="Do X",
|
||||
detected_error="ValueError: bad input",
|
||||
correction_strategy="Try Y instead",
|
||||
final_outcome="Y succeeded",
|
||||
)
|
||||
assert isinstance(eid, str)
|
||||
assert len(eid) == 36 # UUID format
|
||||
|
||||
def test_derives_error_type_from_error_string(self):
|
||||
from infrastructure.self_correction import get_corrections, log_self_correction
|
||||
|
||||
log_self_correction(
|
||||
source="test",
|
||||
original_intent="Connect",
|
||||
detected_error="ConnectionRefusedError: port 80",
|
||||
correction_strategy="Use port 8080",
|
||||
final_outcome="ok",
|
||||
)
|
||||
rows = get_corrections(limit=1)
|
||||
assert rows[0]["error_type"] == "ConnectionRefusedError"
|
||||
|
||||
def test_explicit_error_type_preserved(self):
|
||||
from infrastructure.self_correction import get_corrections, log_self_correction
|
||||
|
||||
log_self_correction(
|
||||
source="test",
|
||||
original_intent="Run task",
|
||||
detected_error="Some weird error",
|
||||
correction_strategy="Fix it",
|
||||
final_outcome="done",
|
||||
error_type="CustomError",
|
||||
)
|
||||
rows = get_corrections(limit=1)
|
||||
assert rows[0]["error_type"] == "CustomError"
|
||||
|
||||
def test_task_id_stored(self):
|
||||
from infrastructure.self_correction import get_corrections, log_self_correction
|
||||
|
||||
log_self_correction(
|
||||
source="test",
|
||||
original_intent="intent",
|
||||
detected_error="err",
|
||||
correction_strategy="strat",
|
||||
final_outcome="outcome",
|
||||
task_id="task-abc-123",
|
||||
)
|
||||
rows = get_corrections(limit=1)
|
||||
assert rows[0]["task_id"] == "task-abc-123"
|
||||
|
||||
def test_outcome_status_stored(self):
|
||||
from infrastructure.self_correction import get_corrections, log_self_correction
|
||||
|
||||
log_self_correction(
|
||||
source="test",
|
||||
original_intent="i",
|
||||
detected_error="e",
|
||||
correction_strategy="s",
|
||||
final_outcome="o",
|
||||
outcome_status="failed",
|
||||
)
|
||||
rows = get_corrections(limit=1)
|
||||
assert rows[0]["outcome_status"] == "failed"
|
||||
|
||||
def test_long_strings_truncated(self):
|
||||
from infrastructure.self_correction import get_corrections, log_self_correction
|
||||
|
||||
long = "x" * 3000
|
||||
log_self_correction(
|
||||
source="test",
|
||||
original_intent=long,
|
||||
detected_error=long,
|
||||
correction_strategy=long,
|
||||
final_outcome=long,
|
||||
)
|
||||
rows = get_corrections(limit=1)
|
||||
assert len(rows[0]["original_intent"]) <= 2000
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# get_corrections
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestGetCorrections:
|
||||
def test_empty_db_returns_empty_list(self):
|
||||
from infrastructure.self_correction import get_corrections
|
||||
|
||||
assert get_corrections() == []
|
||||
|
||||
def test_returns_newest_first(self):
|
||||
from infrastructure.self_correction import get_corrections, log_self_correction
|
||||
|
||||
for i in range(3):
|
||||
log_self_correction(
|
||||
source="test",
|
||||
original_intent=f"intent {i}",
|
||||
detected_error="err",
|
||||
correction_strategy="fix",
|
||||
final_outcome="done",
|
||||
error_type=f"Type{i}",
|
||||
)
|
||||
rows = get_corrections(limit=10)
|
||||
assert len(rows) == 3
|
||||
# Newest first — Type2 should appear before Type0
|
||||
types = [r["error_type"] for r in rows]
|
||||
assert types.index("Type2") < types.index("Type0")
|
||||
|
||||
def test_limit_respected(self):
|
||||
from infrastructure.self_correction import get_corrections, log_self_correction
|
||||
|
||||
for _ in range(5):
|
||||
log_self_correction(
|
||||
source="test",
|
||||
original_intent="i",
|
||||
detected_error="e",
|
||||
correction_strategy="s",
|
||||
final_outcome="o",
|
||||
)
|
||||
rows = get_corrections(limit=3)
|
||||
assert len(rows) == 3
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# get_patterns
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestGetPatterns:
|
||||
def test_empty_db_returns_empty_list(self):
|
||||
from infrastructure.self_correction import get_patterns
|
||||
|
||||
assert get_patterns() == []
|
||||
|
||||
def test_counts_by_error_type(self):
|
||||
from infrastructure.self_correction import get_patterns, log_self_correction
|
||||
|
||||
for _ in range(3):
|
||||
log_self_correction(
|
||||
source="test",
|
||||
original_intent="i",
|
||||
detected_error="e",
|
||||
correction_strategy="s",
|
||||
final_outcome="o",
|
||||
error_type="TimeoutError",
|
||||
)
|
||||
log_self_correction(
|
||||
source="test",
|
||||
original_intent="i",
|
||||
detected_error="e",
|
||||
correction_strategy="s",
|
||||
final_outcome="o",
|
||||
error_type="ValueError",
|
||||
)
|
||||
patterns = get_patterns(top_n=10)
|
||||
by_type = {p["error_type"]: p for p in patterns}
|
||||
assert by_type["TimeoutError"]["count"] == 3
|
||||
assert by_type["ValueError"]["count"] == 1
|
||||
|
||||
def test_success_vs_failed_counts(self):
|
||||
from infrastructure.self_correction import get_patterns, log_self_correction
|
||||
|
||||
log_self_correction(
|
||||
source="test", original_intent="i", detected_error="e",
|
||||
correction_strategy="s", final_outcome="o",
|
||||
error_type="Foo", outcome_status="success",
|
||||
)
|
||||
log_self_correction(
|
||||
source="test", original_intent="i", detected_error="e",
|
||||
correction_strategy="s", final_outcome="o",
|
||||
error_type="Foo", outcome_status="failed",
|
||||
)
|
||||
patterns = get_patterns(top_n=5)
|
||||
foo = next(p for p in patterns if p["error_type"] == "Foo")
|
||||
assert foo["success_count"] == 1
|
||||
assert foo["failed_count"] == 1
|
||||
|
||||
def test_ordered_by_count_desc(self):
|
||||
from infrastructure.self_correction import get_patterns, log_self_correction
|
||||
|
||||
for _ in range(2):
|
||||
log_self_correction(
|
||||
source="t", original_intent="i", detected_error="e",
|
||||
correction_strategy="s", final_outcome="o", error_type="Rare",
|
||||
)
|
||||
for _ in range(5):
|
||||
log_self_correction(
|
||||
source="t", original_intent="i", detected_error="e",
|
||||
correction_strategy="s", final_outcome="o", error_type="Common",
|
||||
)
|
||||
patterns = get_patterns(top_n=5)
|
||||
assert patterns[0]["error_type"] == "Common"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# get_stats
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class TestGetStats:
|
||||
def test_empty_db_returns_zeroes(self):
|
||||
from infrastructure.self_correction import get_stats
|
||||
|
||||
stats = get_stats()
|
||||
assert stats["total"] == 0
|
||||
assert stats["success_rate"] == 0
|
||||
|
||||
def test_counts_outcomes(self):
|
||||
from infrastructure.self_correction import get_stats, log_self_correction
|
||||
|
||||
log_self_correction(
|
||||
source="t", original_intent="i", detected_error="e",
|
||||
correction_strategy="s", final_outcome="o", outcome_status="success",
|
||||
)
|
||||
log_self_correction(
|
||||
source="t", original_intent="i", detected_error="e",
|
||||
correction_strategy="s", final_outcome="o", outcome_status="failed",
|
||||
)
|
||||
stats = get_stats()
|
||||
assert stats["total"] == 2
|
||||
assert stats["success_count"] == 1
|
||||
assert stats["failed_count"] == 1
|
||||
assert stats["success_rate"] == 50
|
||||
|
||||
def test_success_rate_100_when_all_succeed(self):
|
||||
from infrastructure.self_correction import get_stats, log_self_correction
|
||||
|
||||
for _ in range(4):
|
||||
log_self_correction(
|
||||
source="t", original_intent="i", detected_error="e",
|
||||
correction_strategy="s", final_outcome="o", outcome_status="success",
|
||||
)
|
||||
stats = get_stats()
|
||||
assert stats["success_rate"] == 100
|
||||
@@ -4,10 +4,13 @@
|
||||
Connects to local Gitea, fetches candidate issues, and produces a concise agenda
|
||||
plus a day summary (review mode).
|
||||
|
||||
The Daily Run begins with a Quick Health Snapshot (#710) to ensure mandatory
|
||||
systems are green before burning cycles on work that cannot land.
|
||||
|
||||
Run: python3 timmy_automations/daily_run/orchestrator.py [--review]
|
||||
Env: See timmy_automations/config/daily_run.json for configuration
|
||||
|
||||
Refs: #703
|
||||
Refs: #703, #923
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
@@ -30,6 +33,11 @@ sys.path.insert(
|
||||
)
|
||||
from utils.token_rules import TokenRules, compute_token_reward
|
||||
|
||||
# Health snapshot lives in the same package
|
||||
from health_snapshot import generate_snapshot as _generate_health_snapshot
|
||||
from health_snapshot import get_token as _hs_get_token
|
||||
from health_snapshot import load_config as _hs_load_config
|
||||
|
||||
# ── Configuration ─────────────────────────────────────────────────────────
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parent.parent.parent
|
||||
@@ -495,6 +503,16 @@ def parse_args() -> argparse.Namespace:
|
||||
default=None,
|
||||
help="Override max agenda items",
|
||||
)
|
||||
p.add_argument(
|
||||
"--skip-health-check",
|
||||
action="store_true",
|
||||
help="Skip the pre-flight health snapshot (not recommended)",
|
||||
)
|
||||
p.add_argument(
|
||||
"--force",
|
||||
action="store_true",
|
||||
help="Continue even if health snapshot is red (overrides abort-on-red)",
|
||||
)
|
||||
return p.parse_args()
|
||||
|
||||
|
||||
@@ -535,6 +553,76 @@ def compute_daily_run_tokens(success: bool = True) -> dict[str, Any]:
|
||||
}
|
||||
|
||||
|
||||
def run_health_snapshot(args: argparse.Namespace) -> int:
|
||||
"""Run pre-flight health snapshot and return 0 (ok) or 1 (abort).
|
||||
|
||||
Prints a concise summary of CI, issues, flakiness, and token economy.
|
||||
Returns 1 if the overall status is red AND --force was not passed.
|
||||
Returns 0 for green/yellow or when --force is active.
|
||||
On any import/runtime error the check is skipped with a warning.
|
||||
"""
|
||||
try:
|
||||
hs_config = _hs_load_config()
|
||||
hs_token = _hs_get_token(hs_config)
|
||||
snapshot = _generate_health_snapshot(hs_config, hs_token)
|
||||
except Exception as exc: # noqa: BLE001
|
||||
print(f"[health] Warning: health snapshot failed ({exc}) — skipping", file=sys.stderr)
|
||||
return 0
|
||||
|
||||
# Print concise pre-flight header
|
||||
status_emoji = {"green": "🟢", "yellow": "🟡", "red": "🔴"}.get(
|
||||
snapshot.overall_status, "⚪"
|
||||
)
|
||||
print("─" * 60)
|
||||
print(f"PRE-FLIGHT HEALTH CHECK {status_emoji} {snapshot.overall_status.upper()}")
|
||||
print("─" * 60)
|
||||
|
||||
ci_emoji = {"pass": "✅", "fail": "❌", "unknown": "⚠️", "unavailable": "⚪"}.get(
|
||||
snapshot.ci.status, "⚪"
|
||||
)
|
||||
print(f" {ci_emoji} CI: {snapshot.ci.message}")
|
||||
|
||||
if snapshot.issues.p0_count > 0:
|
||||
issue_emoji = "🔴"
|
||||
elif snapshot.issues.p1_count > 0:
|
||||
issue_emoji = "🟡"
|
||||
else:
|
||||
issue_emoji = "✅"
|
||||
critical_str = f"{snapshot.issues.count} critical"
|
||||
if snapshot.issues.p0_count:
|
||||
critical_str += f" (P0: {snapshot.issues.p0_count})"
|
||||
if snapshot.issues.p1_count:
|
||||
critical_str += f" (P1: {snapshot.issues.p1_count})"
|
||||
print(f" {issue_emoji} Issues: {critical_str}")
|
||||
|
||||
flak_emoji = {"healthy": "✅", "degraded": "🟡", "critical": "🔴", "unknown": "⚪"}.get(
|
||||
snapshot.flakiness.status, "⚪"
|
||||
)
|
||||
print(f" {flak_emoji} Flakiness: {snapshot.flakiness.message}")
|
||||
|
||||
token_emoji = {"balanced": "✅", "inflationary": "🟡", "deflationary": "🔵", "unknown": "⚪"}.get(
|
||||
snapshot.tokens.status, "⚪"
|
||||
)
|
||||
print(f" {token_emoji} Tokens: {snapshot.tokens.message}")
|
||||
print()
|
||||
|
||||
if snapshot.overall_status == "red" and not args.force:
|
||||
print(
|
||||
"🛑 Health status is RED — aborting Daily Run to avoid burning cycles.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
print(
|
||||
" Fix the issues above or re-run with --force to override.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
return 1
|
||||
|
||||
if snapshot.overall_status == "red":
|
||||
print("⚠️ Health is RED but --force passed — proceeding anyway.", file=sys.stderr)
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
def main() -> int:
|
||||
args = parse_args()
|
||||
config = load_config()
|
||||
@@ -542,6 +630,15 @@ def main() -> int:
|
||||
if args.max_items:
|
||||
config["max_agenda_items"] = args.max_items
|
||||
|
||||
# ── Step 0: Pre-flight health snapshot ──────────────────────────────────
|
||||
if not args.skip_health_check:
|
||||
health_rc = run_health_snapshot(args)
|
||||
if health_rc != 0:
|
||||
tokens = compute_daily_run_tokens(success=False)
|
||||
if args.json:
|
||||
print(json.dumps({"error": "health_check_failed", "tokens": tokens}))
|
||||
return health_rc
|
||||
|
||||
token = get_token(config)
|
||||
client = GiteaClient(config, token)
|
||||
|
||||
|
||||
Reference in New Issue
Block a user