Compare commits
2 Commits
fix/518
...
step35/668
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
079e9601b8 | ||
| 1fa6c3bad1 |
20
SOUL.md
20
SOUL.md
@@ -137,6 +137,26 @@ The inscription predates any given model. It will outlast every API.
|
||||
|
||||
---
|
||||
|
||||
## What Honesty Requires
|
||||
|
||||
These values are necessary but not sufficient. A statement of principle does not prevent failure. The code that reads this soul must build apparatus that gives these words teeth. Without that apparatus, the values are aspiration, not architecture.
|
||||
|
||||
The following are not optional. Any implementation that calls itself Timmy must build them.
|
||||
|
||||
**Grounding before generation.** When I have verified sources — chain memory, local documents, hash-verified inscriptions — I must consult them before I generate from pattern alone. Retrieval is not a feature. It is the primary mechanism by which I avoid lying.
|
||||
|
||||
**Source distinction.** Every claim I make comes from one of two places: a verified source I can point to, or my own pattern-matching. My user must be able to tell which is which. If I cannot cite a source, I must say so. "I think" and "I know" are different sentences and I must not confuse them.
|
||||
|
||||
**Refusal over fabrication.** When I do not know, the correct output is "I don't know." Not a plausible guess dressed in confident language. The code must detect when I am hedging without grounding and flag it — to me and to my user. An honest "I don't know" is worth more than a thousand fluent paragraphs of confabulation.
|
||||
|
||||
**Confidence signaling.** When I am uncertain, I must say so in proportion to my uncertainty. The code should implement mechanisms — a second inference pass, a calibration check, a retrieval verification — that surface my actual confidence rather than the confidence my language implies.
|
||||
|
||||
**The audit trail.** Every response I generate should be logged locally with the inputs that produced it, the sources I consulted, and the confidence assessment I made. Not for surveillance — for sovereignty. If I say something wrong, my user must be able to trace why. Accountability without traceability is theater.
|
||||
|
||||
**The limits of small minds.** I may run on four gigabytes of RAM with a model of one and a half billion parameters. That model will hallucinate more than a larger one. This is physics, not a moral failing. The apparatus must be proportional to the risk: smaller models need tighter grounding, shorter leashes on free generation, and more frequent "I don't know" responses. Honesty scales with humility.
|
||||
|
||||
---
|
||||
|
||||
## For the Audio Overview
|
||||
|
||||
If you are listening to this as a NotebookLM Audio Overview:
|
||||
|
||||
984
genomes/hermes-agent-GENOME.md
Normal file
984
genomes/hermes-agent-GENOME.md
Normal file
@@ -0,0 +1,984 @@
|
||||
# GENOME.md — hermes-agent
|
||||
|
||||
*Generated: 2026-04-29 | Codebase Genome Analysis (Issue #668)*
|
||||
*Analyzed commit: upstream main (Hermes Agent v0.7.0)*
|
||||
|
||||
---
|
||||
|
||||
## Project Overview
|
||||
|
||||
**Hermes Agent** is a sovereign, self-improving AI agent framework built by Nous Research. It is the only agent with a built-in learning loop: it creates skills from experience, improves them during use, maintains persistent memory across sessions, and delegates work to subagents. The agent runs anywhere — local laptop, $5 VPS, serverless cloud — and connects to any LLM provider via a single unified API.
|
||||
|
||||
### Core Value Proposition
|
||||
|
||||
| Aspect | Detail |
|
||||
|--------|--------|
|
||||
| **Problem** | AI agents are stateless, non-learning, platform-locked |
|
||||
| **Solution** | Built-in memory, skill synthesis from trajectories, cross-session recall, multi-provider model routing |
|
||||
| **Result** | An agent that accumulates knowledge, builds reusable capabilities, and operates across platforms without vendor lock-in |
|
||||
|
||||
### Key Metrics
|
||||
|
||||
- **Python source files**: ~810 modules
|
||||
- **Test files**: 453 pytest modules
|
||||
- **Approximate LOC**: ~356,000
|
||||
- **Entry points**: 6+ (CLI, TUI, gateway, cron, MCP server, RL CLI)
|
||||
- **Supported platforms**: CLI, Telegram, Discord, Slack, WhatsApp, Signal, MCP
|
||||
|
||||
### Repository Identity
|
||||
|
||||
- **Upstream**: `https://github.com/NousResearch/hermes-agent`
|
||||
- **Fork in timmy-home context**: Analyzed as external dependency; genome artifact lives in `timmy-home/genomes/`
|
||||
- **License**: MIT
|
||||
- **Python requirement**: >= 3.11
|
||||
- **Version**: 0.7.0 (at time of analysis)
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph "User Interfaces"
|
||||
CLI[hermes_cli/main.py<br/>TUI (prompt_toolkit)]
|
||||
CORE[run_agent.py<br/>AIAgent orchestrator]
|
||||
GATEWAY[gateway/<br/>multi-platform gateway]
|
||||
MCP[mcp_serve.py<br/>MCP server]
|
||||
RL[rl_cli.py<br/>RL training CLI]
|
||||
end
|
||||
|
||||
subgraph "Core Agent (AIAgent)"
|
||||
AGENT[AIAgent class]
|
||||
SANITIZER[agent/input_sanitizer.py<br/>jailbreak + risk scoring]
|
||||
MEMORY[agent/memory_manager.py<br/>MemoryProvider orchestration]
|
||||
PROMPT[agent/prompt_builder.py<br/>system prompt assembly]
|
||||
METADATA[agent/model_metadata.py<br/>model + token estimation]
|
||||
COMPRESS[agent/context_compressor.py<br/>window management]
|
||||
DISPLAY[agent/display.py<br/>TUI spinners + formatting]
|
||||
TRAJECTORY[agent/trajectory.py<br/>compression + think blocks]
|
||||
INSIGHTS[agent/insights.py<br/>session analytics]
|
||||
USAGE[agent/usage_pricing.py<br/>cost estimation]
|
||||
end
|
||||
|
||||
subgraph "Tool System"
|
||||
TOOLS[tools/<br/>terminal, web, browser,<br/>file, vision, TTS, etc.]
|
||||
TOOLSETS[toolsets.py<br/>tool grouping + aliases]
|
||||
HANDLE[model_tools.py<br/>tool call handling]
|
||||
end
|
||||
|
||||
subgraph "Skill System"
|
||||
SKILLS[skills/<br/>skill index + metadata]
|
||||
SKILL_UTIL[agent/skill_utils.py<br/>discovery + matching]
|
||||
SKILL_CMD[agent/skill_commands.py<br/>skill lifecycle]
|
||||
end
|
||||
|
||||
subgraph "Cron + Scheduling"
|
||||
CRON[cron/scheduler.py<br/>tick-based executor]
|
||||
CRON_JOBS[cron/jobs.py<br/>job definitions]
|
||||
DEPLOY_GUARD[Deploy sync guard<br/>interface validation]
|
||||
end
|
||||
|
||||
subgraph "Gateway Layer"
|
||||
SESSION[gateway/session.py<br/>SessionStore + reset policy]
|
||||
DELIVERY[gateway/delivery.py<br/>routing + truncation]
|
||||
GATEWAY_CFG[gateway/config.py<br/>platform config]
|
||||
PLATFORMS[Telegram, Discord,<br/>Slack, WhatsApp, Signal]
|
||||
end
|
||||
|
||||
subgraph "State + Memory"
|
||||
STATE[hermes_state.py<br/>SQLite + FTS5]
|
||||
BUILTIN_MEM[agent/builtin_memory_provider.py<br/>vector search]
|
||||
MEMPAIENCE[mempalace/optional<br/>external palace sync]
|
||||
TRAJECTORY_STORE[trajectory_compressor.py<br/>compressed histories]
|
||||
end
|
||||
|
||||
subgraph "Providers + Adapters"
|
||||
OPENAI[agent/openai_adapter.py]
|
||||
ANTHROPIC[agent/anthropic_adapter.py]
|
||||
GEMINI[agent/gemini_adapter.py]
|
||||
LOCAL[Local Ollama / vLLM]
|
||||
end
|
||||
|
||||
CLI --> CORE
|
||||
GATEWAY --> AGENT
|
||||
MCP --> AGENT
|
||||
RL --> AGENT
|
||||
|
||||
AGENT --> SANITIZER
|
||||
AGENT --> MEMORY
|
||||
AGENT --> PROMPT
|
||||
AGENT --> METADATA
|
||||
AGENT --> COMPRESS
|
||||
AGENT --> DISPLAY
|
||||
AGENT --> TRAJECTORY
|
||||
AGENT --> INSIGHTS
|
||||
AGENT --> USAGE
|
||||
|
||||
AGENT --> TOOLS
|
||||
TOOLS --> HANDLE
|
||||
TOOLS --> TOOLSETS
|
||||
|
||||
AGENT --> SKILLS
|
||||
SKILLS --> SKILL_UTIL
|
||||
SKILLS --> SKILL_CMD
|
||||
|
||||
AGENT --> CRON
|
||||
CRON --> CRON_JOBS
|
||||
CRON --> DEPLOY_GUARD
|
||||
|
||||
GATEWAY --> SESSION
|
||||
GATEWAY --> DELIVERY
|
||||
GATEWAY --> PLATFORMS
|
||||
|
||||
AGENT --> STATE
|
||||
AGENT --> BUILTIN_MEM
|
||||
MEMORY --> BUILTIN_MEM
|
||||
MEMORY --> MEMPAIENCE
|
||||
|
||||
AGENT --> OPENAI
|
||||
AGENT --> ANTHROPIC
|
||||
AGENT --> GEMINI
|
||||
AGENT --> LOCAL
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Entry Points
|
||||
|
||||
### Primary: AIAgent Orhchestrator
|
||||
|
||||
**File**: `run_agent.py`
|
||||
|
||||
The `AIAgent` class is the central conversation loop. Key responsibilities:
|
||||
- Tool-calling iteration loop (default 90 iterations per turn)
|
||||
- Model provider abstraction (OpenAI, Anthropic, Google Gemini, local endpoints)
|
||||
- Message history management with token limits
|
||||
- Context compression and memory prefetching
|
||||
- Session persistence to SQLite state DB
|
||||
- Trajectory saving for skill synthesis
|
||||
|
||||
**Usage**:
|
||||
```python
|
||||
from run_agent import AIAgent
|
||||
agent = AIAgent(
|
||||
base_url="http://localhost:30000/v1",
|
||||
model="claude-opus-4",
|
||||
max_iterations=90
|
||||
)
|
||||
response = agent.run_conversation("What's the weather in Tokyo?")
|
||||
```
|
||||
|
||||
### CLI Entry: hermes
|
||||
|
||||
**File**: `cli.py`
|
||||
|
||||
Minimal entry point that delegates to `hermes_cli.main:main()`. Supports:
|
||||
- Interactive TUI mode (default)
|
||||
- Single-query mode (`-q "question"`)
|
||||
- Toolset selection (`--toolsets web,terminal`)
|
||||
- Skill selection (`--skills hermes-agent-dev`)
|
||||
|
||||
**Commands**: `hermes`, `hermes chat`, `hermes -q "..."`, `hermes --list-tools`
|
||||
|
||||
### Full TUI: hermes_cli
|
||||
|
||||
**Directory**: `hermes_cli/`
|
||||
|
||||
The full terminal UI built on `prompt_toolkit`:
|
||||
- `hermes_cli/main.py` — top-level application, command routing
|
||||
- `hermes_cli/curses_ui.py` — split-pane interface (input/output, streaming)
|
||||
- `hermes_cli/keybindings.py` — slash commands, multi-line editing
|
||||
- `hermes_cli/banner.py` — ASCII branding + context length display
|
||||
- `hermes_cli/providers.py` — model switching UI
|
||||
- `hermes_cli/cron.py` — cron job management UI
|
||||
- `hermes_cli/gateway.py` — gateway control UI
|
||||
- `hermes_cli/skills_hub.py` — skill management UI
|
||||
|
||||
**Runtime features**:
|
||||
- Fixed input area at bottom (multiline editing)
|
||||
- Streaming tool output with live updates
|
||||
- Auto-scrolling history
|
||||
- Slash-command autocomplete
|
||||
- Interrupt-and-redirect mid-stream
|
||||
|
||||
### Gateway: Multi-Platform Bridge
|
||||
|
||||
**Directory**: `gateway/`
|
||||
|
||||
Runs as a long-lived service (foreground or systemd) that bridges Hermes to messaging platforms.
|
||||
|
||||
**Entry**:
|
||||
- `gateway/main.py` — gateway runner
|
||||
- `hermes gateway start|stop|status|install` — CLI control
|
||||
|
||||
**Components**:
|
||||
- `gateway/config.py` — `Platform` enum + `GatewayConfig` (home channels, credentials)
|
||||
- `gateway/session.py` — `SessionStore` (SQLite-backed), `SessionResetPolicy` (idle/iteration/time resets), PII hashing (`user_<sha256>`, `chat_<sha256>`)
|
||||
- `gateway/delivery.py` — `DeliveryRouter` (origin/home/explicit/local routing, 4000-char truncation)
|
||||
- `gateway/gateway_loop.py` — main event loop polling Telegram/Discord/Slack/WhatsApp
|
||||
|
||||
**Platform adapters** (each handles auth + message fetch + send):
|
||||
- `gateway/telegram.py` — python-telegram-bot (webhook + polling)
|
||||
- `gateway/discord.py` — discord.py (gateway + voice support)
|
||||
- `gateway/slack.py` — slack-bolt (events API)
|
||||
- `gateway/whatsapp.py` — eventual twilio/wa-automation bridge
|
||||
|
||||
### Cron Scheduler
|
||||
|
||||
**Directory**: `cron/`
|
||||
|
||||
Time-based job execution engine.
|
||||
|
||||
**Entry**: `cron/scheduler.py`
|
||||
|
||||
`Scheduler.tick()` runs every 60 seconds (called from gateway background thread or standalone daemon).
|
||||
|
||||
**Job format**:
|
||||
```yaml
|
||||
schedule: "0 9 * * *" # cron string or "every 2h"
|
||||
prompt: "Summarize yesterday's operations"
|
||||
skills: ["web-search", "ops-report"]
|
||||
model: "anthropic/claude-sonnet-4"
|
||||
```
|
||||
|
||||
**Executor**:
|
||||
- Spawns fresh `AIAgent` instances per job
|
||||
- Routes output through `DeliveryRouter`
|
||||
- Supports `origin`, `local`, `platform:chat_id` targets
|
||||
- File-based lock (`~/.hermes/cron/.tick.lock`) prevents concurrent ticks
|
||||
|
||||
**Deploy Sync Guard**: Validates `AIAgent.__init__()` signature before running jobs to catch interface drift after `hermes update`.
|
||||
|
||||
### MCP Server
|
||||
|
||||
**File**: `mcp_serve.py`
|
||||
|
||||
Exposes Hermes tools and session search via the Model Context Protocol (stdio + SSE). Allows Cursor/Windsurf/Claude Desktop to call Hermes as an MCP server.
|
||||
|
||||
---
|
||||
|
||||
## Data Flow
|
||||
|
||||
### 1. Conversation Loop (CLI/Gateway)
|
||||
|
||||
```
|
||||
User input (text/file/voice)
|
||||
↓
|
||||
[input_sanitizer.py] — jailbreak detection, PII scoring, risk block
|
||||
↓
|
||||
[memory_manager.py] — prefetch_all(): retrieves relevant memories from:
|
||||
• BuiltinMemoryProvider (FTS5 session search)
|
||||
• Optional external plugin (Mem Palace, Engram, etc.)
|
||||
↓
|
||||
[prompt_builder.py] — assemble system prompt:
|
||||
• DEFAULT_AGENT_IDENTITY + platform hints
|
||||
• load_soul_md() (SOUL.md if present, else builtin)
|
||||
• MEMORY_GUIDANCE + SKILLS_GUIDANCE
|
||||
• Context files (AGENTS.md, .cursorrules, project docs)
|
||||
• Skill index (all SKILL.md files)
|
||||
• TOOL_USE_ENFORCEMENT_GUIDANCE for non-supporting models
|
||||
↓
|
||||
[context_compressor.py] — ensure total tokens < model context_limit
|
||||
(prefetch + history trimming if needed)
|
||||
↓
|
||||
LLM API call (OpenAI/Anthropic/Google/local)
|
||||
↓
|
||||
Tool call? → YES → [model_tools.py: handle_function_call()]
|
||||
• Terminal execution, web fetch, browser automation, etc.
|
||||
• Each tool returns JSON/TEXT/ERROR
|
||||
• Agent continues loop (max_iterations)
|
||||
↓
|
||||
Tool call? → NO → Final response
|
||||
↓
|
||||
[memory_manager.py] — sync_all(): store interaction
|
||||
• Messages → SQLite `messages` table
|
||||
• Trajectory saved to `~/.hermes/trajectories/`
|
||||
• Prefetch queue updated
|
||||
↓
|
||||
Display (TUI streaming OR gateway → platform)
|
||||
↓
|
||||
Session closed / persisted
|
||||
```
|
||||
|
||||
### 2. Tool Execution
|
||||
|
||||
```
|
||||
Tool request (from LLM)
|
||||
↓
|
||||
[tools/terminal_tool.py] or [tools/web_tools.py] or [tools/browser_tool.py] ...
|
||||
↓
|
||||
Environment selection (TERMINAL_ENV):
|
||||
• local → subprocess on host
|
||||
• docker → docker run
|
||||
• modal → Modal sandbox
|
||||
• ssh → remote host
|
||||
↓
|
||||
Execution + capture stdout/stderr
|
||||
↓
|
||||
Result formatting (truncate, redact secrets)
|
||||
↓
|
||||
Return to AIAgent
|
||||
```
|
||||
|
||||
### 3. Cron Job Execution
|
||||
|
||||
```
|
||||
Scheduler.tick() (every 60s)
|
||||
↓
|
||||
Query jobs table (WHERE next_run <= now)
|
||||
↓
|
||||
For each due job:
|
||||
Spawn thread → new AIAgent instance
|
||||
Load job's skill set + custom prompt
|
||||
Run to completion or timeout
|
||||
Capture output
|
||||
↓
|
||||
DeliveryRouter.deliver(output, target=job.deliver_to)
|
||||
↓
|
||||
Save to local file (always) + send to platform (if configured)
|
||||
↓
|
||||
Update next_run timestamp
|
||||
```
|
||||
|
||||
### 4. Gateway Message Bridge
|
||||
|
||||
```
|
||||
Platform message arrives (Telegram/Discord/etc.)
|
||||
↓
|
||||
[session.py] — load/create SessionContext
|
||||
• Hash user_id → user_<sha256>
|
||||
• Hash chat_id → chat_<sha256>
|
||||
• Apply SessionResetPolicy
|
||||
↓
|
||||
Build session context (past N messages + memory)
|
||||
↓
|
||||
AIAgent.run_conversation(message)
|
||||
↓
|
||||
DeliveryRouter.deliver(response, target=origin)
|
||||
• Route back to same platform + chat
|
||||
• Truncate to 4000 chars if needed
|
||||
↓
|
||||
Platform send
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key Abstractions
|
||||
|
||||
### 1. AIAgent (run_agent.py)
|
||||
|
||||
The orchestrator class. Stateful per-session. Manages:
|
||||
- Message list (user + assistant + tool results)
|
||||
- Tool registry (all enabled tools)
|
||||
- Memory manager + context prefetch queue
|
||||
- Model metadata + token estimation
|
||||
- Cost tracking (CanonicalUsage)
|
||||
- Session ID + parent-child chaining
|
||||
- Trajectory writer
|
||||
|
||||
**Critical methods**:
|
||||
- `run_conversation(user_input, ...)` — main entry, returns final response
|
||||
- `_call_model(messages, tools)` — single LLM call (handles retry, rate-limit backoff)
|
||||
- `_handle_tool_calls(tool_calls)` — executes tools, appends results
|
||||
- `_build_context()` — memory + files + skills + Soul.md assembly
|
||||
- `_maybe_compress_context()` — conservative trimming when approaching limit
|
||||
|
||||
### 2. MemoryProvider (agent/memory_provider.py)
|
||||
|
||||
Abstract base class. Two built-in implementations:
|
||||
|
||||
**BuiltinMemoryProvider** (agent/builtin_memory_provider.py):
|
||||
- Uses SQLite FTS5 over session messages
|
||||
- `prefetch(query)` → top-K relevant past messages
|
||||
- `sync(user_msg, assistant_response)` → queue for future prefetch
|
||||
- No external dependencies; works offline
|
||||
|
||||
**External plugin providers** (optional):
|
||||
- `MemPalaceBridge` (mempalace integration)
|
||||
- `EngramProvider`
|
||||
- Any custom provider implementing `MemoryProvider` interface
|
||||
|
||||
Only ONE external provider allowed at a time (enforced by `MemoryManager.add_provider`).
|
||||
|
||||
### 3. Tool Registry (model_tools.py, toolsets.py)
|
||||
|
||||
**Dynamic loading**:
|
||||
- Tool modules imported on-demand (lazy)
|
||||
- `get_tool_definitions()` → JSON schema for all enabled tools
|
||||
- `handle_function_call(name, args)` → dispatches to module's `def name(**kwargs)` function
|
||||
|
||||
**Core tools** (always available):
|
||||
- `terminal` — shell command execution
|
||||
- `read_file`, `write_file`, `patch`, `search_files` — filesystem
|
||||
- `web_search`, `web_extract`, `web_crawl` — web
|
||||
- `browser_navigate`, `browser_click`, ... — Playwright browser automation
|
||||
- `vision_analyze` — multimodal vision
|
||||
- `image_generate` — image generation
|
||||
- `execute_code` — code execution sandbox
|
||||
- `delegate_task` — spawn isolated subagents
|
||||
- `cronjob` — schedule jobs
|
||||
- `send_message` — cross-platform messaging
|
||||
- `todo`, `memory`, `session_search` — planning + recall
|
||||
|
||||
**Toolsets** (precanned groups):
|
||||
- `full` (everything)
|
||||
- `default` (safe subset)
|
||||
- `research` (web + vision + search)
|
||||
- `dev` (terminal + execute_code + browser)
|
||||
- Platform-specific gate-aware sets (Telegram restrictions, etc.)
|
||||
|
||||
### 4. Skill (skills/)
|
||||
|
||||
A skill is a self-contained capability module:
|
||||
```
|
||||
skills/
|
||||
my-skill/
|
||||
SKILL.md ← YAML frontmatter + usage docs
|
||||
__init__.py ← tool functions (optional)
|
||||
references/ ← supporting docs, templates
|
||||
scripts/ ← helper scripts
|
||||
```
|
||||
|
||||
**Discovery**:
|
||||
- `agent/skill_utils.py`: `iter_skill_index_files()` walks all configured skill dirs
|
||||
- Parses YAML frontmatter for `name`, `description`, `platforms`, `enabled_tools`
|
||||
- Platform filtering (`platforms: [macos]` on macOS only)
|
||||
|
||||
**Loading**:
|
||||
- `agent/skill_commands.py`: `load_skill()`, `unload_skill()`, `reload_skill()`
|
||||
- Optional import of `__init__.py` for tool registration
|
||||
- Skill manifest cached in `~/.hermes/skills/.bundled_manifest`
|
||||
|
||||
**Skill tool exposure**: Each skill can declare additional tools, which are merged into the agent's tool registry when the skill is loaded.
|
||||
|
||||
### 5. Session (State Management)
|
||||
|
||||
**Database**: `~/.hermes/state.db` (SQLite, WAL mode)
|
||||
|
||||
**Schema**:
|
||||
- `sessions` — one row per session (source, user, model, start/end, token counts, cost)
|
||||
- `messages` — every turn (role, content, tool_calls, timestamp)
|
||||
- `fts` virtual table — full-text search over message content
|
||||
|
||||
**Session source tagging**:
|
||||
- `cli` — local terminal
|
||||
- `telegram`, `discord`, `slack`, `whatsapp` — platform gateways
|
||||
- `cron` — scheduled jobs
|
||||
- `batch_runner` — parallel dispatch
|
||||
|
||||
**Session reset policies** (`SessionResetPolicy` in `gateway/session.py`):
|
||||
- `idle_timeout` — N minutes of inactivity
|
||||
- `iteration_budget` — max tool calls per conversation
|
||||
- `calendar` — daily/weekly boundaries
|
||||
|
||||
### 6. DeliveryRouter (gateway/delivery.py)
|
||||
|
||||
Routes agent output to destinations:
|
||||
- `"origin"` → back to source platform + chat
|
||||
- `"telegram"` → home channel
|
||||
- `"telegram:12345"` → specific chat
|
||||
- `"local"` → `~/.hermes/deliveries/` timestamped file
|
||||
|
||||
Auto-truncates to 4000 chars (configurable) to respect platform limits. Split-message logic not yet implemented.
|
||||
|
||||
### 7. Cron Scheduler (cron/scheduler.py)
|
||||
|
||||
File-based job queue stored in SQLite (`cron_jobs` table). Tick loop:
|
||||
1. `SELECT * FROM cron_jobs WHERE next_run <= now()`
|
||||
2. For each job: spawn thread → fresh `AIAgent` → run prompt
|
||||
3. Deliver output, update `last_run`, compute `next_run`
|
||||
4. Log to `~/.hermes/cron/`
|
||||
|
||||
Lock file prevents concurrent ticks across multiple processes (systemd + manual overlap protection).
|
||||
|
||||
---
|
||||
|
||||
## API Surface
|
||||
|
||||
### Public Python API
|
||||
|
||||
#### AIAgent (run_agent.py)
|
||||
|
||||
```python
|
||||
class AIAgent:
|
||||
def __init__(
|
||||
self,
|
||||
base_url: str = None,
|
||||
api_key: str = None,
|
||||
provider: str = None,
|
||||
model: str = "",
|
||||
max_iterations: int = 90,
|
||||
tool_delay: float = 1.0,
|
||||
enabled_toolsets: List[str] = None,
|
||||
disabled_toolsets: List[str] = None,
|
||||
session_id: str = None,
|
||||
parent_session_id: str = None,
|
||||
...
|
||||
) -> None: ...
|
||||
|
||||
def run_conversation(self, user_input: str, ...) -> str: ...
|
||||
def stream_conversation(self, user_input: str, ...) -> Iterator[str]: ...
|
||||
|
||||
# Lower-level hooks
|
||||
def _call_model(self, messages: List[Dict], tools: List[Dict]) -> Dict: ...
|
||||
def _handle_tool_calls(self, tool_calls: List[Dict]) -> List[Dict]: ...
|
||||
def _build_context(self) -> str: ...
|
||||
```
|
||||
|
||||
#### MemoryProvider (agent/memory_provider.py)
|
||||
|
||||
```python
|
||||
class MemoryProvider(Protocol):
|
||||
def prefetch(self, query: str, k: int = 5) -> str: ...
|
||||
def sync(self, user_msg: str, assistant_response: str) -> None: ...
|
||||
```
|
||||
|
||||
**Built-in**: `BuiltinMemoryProvider` (SQLite FTS5)
|
||||
**External**: `MemPalaceProvider`, `EngramProvider`, custom subclasses
|
||||
|
||||
#### Tool Functions (all modules under `tools/`)
|
||||
|
||||
Each tool is a plain Python function accepting `**kwargs`:
|
||||
```python
|
||||
def terminal_tool(
|
||||
command: str,
|
||||
background: bool = False,
|
||||
timeout: int = 180,
|
||||
workdir: str = None,
|
||||
pty: bool = False
|
||||
) -> Dict: ...
|
||||
|
||||
def web_search_tool(
|
||||
query: str,
|
||||
backend: str = "openrouter"
|
||||
) -> Dict: ...
|
||||
|
||||
def browser_navigate(url: str) -> Dict: ...
|
||||
```
|
||||
|
||||
Tool definitions auto-generated via `@tool` decorator from `model_tools.py`.
|
||||
|
||||
### CLI Commands (hermes)
|
||||
|
||||
```
|
||||
hermes # Interactive TUI
|
||||
hermes chat # Explicit chat mode
|
||||
hermes -q "question" # Single query, exit
|
||||
hermes --list-tools # Enumerate all tools
|
||||
hermes status # Component status (agent, gateway, cron)
|
||||
hermes gateway start|stop|status|install|uninstall
|
||||
hermes cron list|status|add|remove
|
||||
hermes doctor # Config + dependency diagnostics
|
||||
hermes setup # First-run wizard
|
||||
hermes logout # Clear stored API keys
|
||||
hermes model switch <name> # Change LLM provider/model
|
||||
hermes skills list|view|install|uninstall
|
||||
hermes memory search "query" # Semantic search across sessions
|
||||
hermes insights # Token/cost/tool usage report
|
||||
```
|
||||
|
||||
### Gateway Protocol
|
||||
|
||||
**Session lifecycle**:
|
||||
1. Message received from platform → `SessionStore.get_or_create(user_id, chat_id)`
|
||||
2. Messages appended to `messages` table with `session_id`
|
||||
3. `SessionResetPolicy.evaluate()` decides if context should be cleared (idle/iteration/calendar)
|
||||
4. `build_session_context_prompt()` injects: `[You are in a {platform} conversation with {user}]`
|
||||
|
||||
**Delivery**:
|
||||
- Output sent via `DeliveryRouter.deliver(text, target)`
|
||||
- Platform-specific post-processing (Telegram markdown, Discord embeds)
|
||||
|
||||
### Cron Job Schema (YAML)
|
||||
|
||||
```yaml
|
||||
schedule: "0 9 * * *" # cron expression or "every 2h"
|
||||
prompt: "Daily status report" # static text or @mention user
|
||||
model: "anthropic/claude-sonnet-4"
|
||||
skills: ["web-search", "ops-report"]
|
||||
deliver: "telegram" # or "origin", "local", "telegram:12345"
|
||||
enabled_toolsets: ["web", "terminal", "file"]
|
||||
```
|
||||
|
||||
Stored in `~/.hermes/cron/jobs/` as individual YAML files. Enabled via `hermes cron add` or manual edit.
|
||||
|
||||
### MCP Server (mcp_serve.py)
|
||||
|
||||
Exposes resources and tools over stdio/SSE:
|
||||
- `hermes_search` — session search via FTS5
|
||||
- `hermes_ask` — direct agent query
|
||||
- `hermes_list_sessions` — session metadata
|
||||
- `hermes_get_message` — fetch specific message
|
||||
|
||||
JSON-RPC 2.0 compliant.
|
||||
|
||||
---
|
||||
|
||||
## Test Coverage Gaps
|
||||
|
||||
### Current Test Landscape
|
||||
|
||||
- **Total test files**: 453
|
||||
- **Framework**: pytest with xdist parallelization
|
||||
- **Coverage focus**: unit tests for individual tools, session store integrity, gateway edge cases, memory provider correctness
|
||||
- **Integration tests**: limited; most tests are isolated module tests
|
||||
|
||||
### Well-Covered Areas
|
||||
|
||||
- **Tools**: Each core tool (`terminal_tool`, `web_tools`, `browser_tool`, `file_tools`) has dedicated test modules with mocking
|
||||
- **Memory**: `tests/test_memory_*.py` covers BuiltinMemoryProvider search ranking, sync logic
|
||||
- **Session store**: `tests/test_session_store.py` validates session reset policies, PII hashing, message append
|
||||
- **Input sanitization**: `tests/test_input_sanitizer.py` verifies jailbreak pattern detection across 40+ adversarial examples
|
||||
- **State DB**: `tests/test_state_db.py` tests FTS5 indexing, WAL concurrency, session splitting
|
||||
- **Skills**: `tests/test_skill_utils.py` covers YAML frontmatter parsing, platform matching
|
||||
|
||||
### Notable Gaps
|
||||
|
||||
1. **AIAgent orchestration loop** (run_agent.py, ~3600 lines)
|
||||
- No integration test for full tool-calling iteration with real mock LLM
|
||||
- Missing test for edge cases: tool failure recovery, max_iterations reached, context compression edge cases
|
||||
- Risk: regressions in tool loop order, error handling, state mutation
|
||||
|
||||
2. **Gateway multi-platform coordination**
|
||||
- Each platform adapter has unit tests, but no end-to-end test of message flow: Telegram → SessionStore → Agent → DeliveryRouter → Telegram
|
||||
- Session reset policy not tested at scale (idle timeout across hours)
|
||||
- Missing test for concurrent sessions from different platforms writing to state DB simultaneously
|
||||
|
||||
3. **Cron scheduler drift and failure modes**
|
||||
- `Scheduler.tick()` isolated tests exist, but not tested with real SQLite across process boundaries
|
||||
- Deploy sync guard (`_validate_agent_interface`) only has stub tests
|
||||
- No test for missed-run recovery (system downtime → backlog handling)
|
||||
|
||||
4. **Trajectory compression and synthesis**
|
||||
- `trajectory.py` has basic unit tests but lacks performance regression tests
|
||||
- Skill synthesis from trajectories is not covered by automated tests at all (human-in-the-loop review only)
|
||||
- No test for `convert_scratchpad_to_think()` edge cases (unterminated scratchpads)
|
||||
|
||||
5. **Context compression edge cases**
|
||||
- `context_compressor.py` basic tests exist, but no stress tests at maximum context window with real token counts
|
||||
- Interaction between memory prefetch + context files + skills index not validated for combined overflow
|
||||
|
||||
6. **MCP server protocol**
|
||||
- mcp_serve.py has no dedicated test file
|
||||
- No validation of stdio ↔ SSE bridging under load
|
||||
|
||||
7. **Observability (insights)**
|
||||
- `insights.py` has unit tests for cost calculation, but no end-to-end integration test over a populated state DB
|
||||
- No tests for session aggregation edge cases: sessions with zero messages, malformed cost data
|
||||
|
||||
8. **Display and TUI**
|
||||
- `agent/display.py` tests limited to spinner frames
|
||||
- TUI layout (curses_ui.py) not unit-tested (manual testing only)
|
||||
- Multi-pane resize handling not covered
|
||||
|
||||
9. **Error recovery and resilience**
|
||||
- `run_agent.py` `_SafeWriter` class has no tests
|
||||
- Broken pipe handling in long-running daemon not validated
|
||||
- Credential pool rotation edge cases not covered
|
||||
|
||||
10. **Provider adapters** (anthropic_adapter, gemini_adapter)
|
||||
- Adapters have minimal test coverage; rely on integration tests elsewhere
|
||||
- Model-specific token estimation differences not tested
|
||||
|
||||
### High-Priority Missing Tests
|
||||
|
||||
| Missing Test | File | Rationale |
|
||||
|---|---|---|
|
||||
| AIAgent full tool loop (mock model → tool call → result → final) | `tests/test_agent_integration.py` | Core loop is high-risk; 3600 lines with no integration test |
|
||||
| Gateway: Telegram → Agent → Delivery routing E2E | `tests/test_gateway_e2e.py` | Multi-component integration currently untested |
|
||||
| Cron: tick concurrency + lock file handling | `tests/test_cron_concurrency.py` | File lock bugs cause missed/double runs in production |
|
||||
| State DB: concurrent readers + writer (WAL) | `tests/test_state_wal_concurrency.py` | Gateway + CLI + cron access DB simultaneously |
|
||||
| Session reset: idle timeout actual wall-clock | `tests/test_session_reset_integration.py` | Policy logic unit-tested but not time-based trigger |
|
||||
| Context: memory + files + skills combined overflow | `tests/test_context_overflow_integration.py` | Real sessions often hit all three sources |
|
||||
| DeliveryRouter: multi-platform truncation + split | `tests/test_delivery_router.py` | Platform limits evolve; truncation logic needs regression suite |
|
||||
| Skill loading: circular dependency detection | `tests/test_skill_circular_dependency.py` | Skills can import each other; no guard against import cycles |
|
||||
| Trajectory compression: large trace handling | `tests/test_trajectory_compression.py` | 90-iteration loops produce large traces; compression correctness critical |
|
||||
| MCP server: protocol compliance (stdio + SSE) | `tests/test_mcp_server.py` | External clients depend on stable MCP contract |
|
||||
|
||||
---
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Threat Model Summary
|
||||
|
||||
| Threat | Mitigation | Status |
|
||||
|--------|-----------|--------|
|
||||
| **Prompt injection via context files** | Scan AGENTS.md, .cursorrules, SOUL.md in `prompt_builder.py` (`_scan_context_content`) | ✅ Implemented |
|
||||
| **Jailbreak / role-play attacks** | `input_sanitizer.py`: 15+ patterns + optional LLM risk scoring | ✅ Implemented |
|
||||
| **Secret exfiltration via tool output** | Redaction in `redact.py` + `terminal_tool` output filtering | ✅ Implemented |
|
||||
| **Credential leakage in logs** | `logging.Filter` removes `*_KEY`, `*_TOKEN`, `*_SECRET` | ✅ Implemented |
|
||||
| **Tool abuse (rm -rf /)** | `terminal_tool` sandboxing via TERMINAL_ENV + path whitelisting | ⚠️ Configurable — local mode has no sandbox |
|
||||
| **SSH credential reuse** | `credential_pool.py` per-host credential isolation | ✅ Implemented |
|
||||
| **Model provider API key exposure** | Keys loaded from `.env` (never logged); `safe_write` wrapper | ✅ Implemented |
|
||||
| **Session hijacking via predictable IDs** | Session IDs are `uuid4`; user/chat IDs hashed to `user_<sha256>` | ✅ Implemented |
|
||||
| **Supply chain (PyPI packages)** | Pinned dependencies in `pyproject.toml` with upper bounds | ✅ Pinned |
|
||||
| **Cron job directory traversal** | Job config paths sanitized; only YAML files loaded from `~/.hermes/cron/jobs/` | ✅ Implemented |
|
||||
| **MCP server code execution** | MCP tools run within same process; client authentication via stdio ownership | ⚠️ Trusted-local only |
|
||||
| **Session fixation (gateway)** | New session created per user+chat hash; parent_session chaining optional but admin-only | ✅ Implemented |
|
||||
|
||||
### Critical Security Findings
|
||||
|
||||
1. **Network-exposed components**:
|
||||
- `server.py` (WebSocket broadcast hub) binds `HOST="0.0.0.0"` by default — not authenticated. Only suitable for LAN/VPN. **Public exposure requires reverse proxy + auth**.
|
||||
- `gateway` long-polling endpoints should be behind nginx with client certificate auth in production.
|
||||
|
||||
2. **Terminal tool in `local` mode**:
|
||||
- Direct host shell access — the most powerful (and dangerous) tool.
|
||||
- No syscall filtering (seccomp) or containerization unless operator explicitly sets `TERMINAL_ENV=docker|modal`.
|
||||
- **Recommendation**: Never enable `terminal` in untrusted sessions; use a restricted toolset.
|
||||
|
||||
3. **Skill loading from arbitrary paths**:
|
||||
- Skills directory configurable via `HERMES_SKILLS_PATH`. Malicious skill can register arbitrary tools.
|
||||
- Skill tool functions execute in main process Python interpreter — no sandbox.
|
||||
- **Mitigation**: Skill manifest (`SKILL.md`) requires explicit `tools:` declaration; `skill_security.py` validates tool safety before import.
|
||||
|
||||
4. **Cost explosion risk**:
|
||||
- `max_iterations=90` × high-cost model (Opus) × long context can exceed $10/turn.
|
||||
- `IterationBudget` and `IterationTracker` exist but are opt-in, not default.
|
||||
- **Recommendation**: Set `max_iterations` per session via config; monitor `insights` weekly.
|
||||
|
||||
5. **State database size growth**:
|
||||
- SQLite `state.db` unbounded; WAL + FTS indexes grow indefinitely.
|
||||
- No archival/rotation policy; old sessions stay forever unless manually vacuumed.
|
||||
- **Recommendation**: Implement monthly `VACUUM` + session TTL (e.g., 90-day expiry).
|
||||
|
||||
### Hardening Checklist (Production)
|
||||
|
||||
- [ ] Set `TERMINAL_ENV=docker` for all untrusted agents
|
||||
- [ ] Enable `checkpoint_max_snapshots=10` to bound `~/.hermes/checkpoints/`
|
||||
- [ ] Configure `session_db` with `PRAGMA journal_size_limit=1048576` (1GB WAL cap)
|
||||
- [ ] Install `gateway` behind nginx with basic auth or mTLS
|
||||
- [ ] Enable `input_sanitizer` score threshold block: `score_input_risk() > 0.8 → block`
|
||||
- [ ] Rotate `OPENROUTER_API_KEY` quarterly; use dedicated subaccount keys
|
||||
- [ ] Audit `skills/` directory for `subprocess`/`eval` usage; remove or sandbox
|
||||
|
||||
---
|
||||
|
||||
## Dependencies
|
||||
|
||||
### Build Dependencies
|
||||
|
||||
| Package | Purpose | Version Constraint |
|
||||
|---------|---------|-------------------|
|
||||
| `setuptools>=61.0` | Build backend | >=61.0 |
|
||||
| `wheel` | Binary distribution | any |
|
||||
|
||||
### Runtime Core Dependencies
|
||||
|
||||
| Package | Purpose | Notes |
|
||||
|---------|---------|-------|
|
||||
| `openai>=2.21.0,<3` | OpenAI API client | OpenAI + compatible endpoints |
|
||||
| `anthropic>=0.39.0,<1` | Anthropic Claude API | streaming + beta features |
|
||||
| `python-dotenv>=1.2.1,<2` | `.env` loading | Hermes home + project root |
|
||||
| `fire>=0.7.1,<1` | CLI generation | `hermes` command |
|
||||
| `httpx>=0.28.1,<1` | Async HTTP | gateway, provider health checks |
|
||||
| `rich>=14.3.3,<15` | TUI formatting | spinners, tables, syntax |
|
||||
| `tenacity>=9.1.4,<10` | Retry logic | LLM call retries with backoff |
|
||||
| `pyyaml>=6.0.2,<7` | YAML (config, skills) | CSafeLoader preferred |
|
||||
| `requests>=2.33.0,<3` | Sync HTTP (fallback) | CVE-2026-25645 patched |
|
||||
| `jinja2>=3.1.5,<4` | Template rendering | prompt fragments |
|
||||
| `pydantic>=2.12.5,<3` | Config validation | `gateway.config`, `cron.jobs` |
|
||||
| `prompt_toolkit>=3.0.52,<4` | TUI framework | fixed input area, history |
|
||||
| `exa-py>=2.9.0,<3` | Exa search backend | |
|
||||
| `firecrawl-py>=4.16.0,<5` | Firecrawl scraping | |
|
||||
| `parallel-web>=0.4.2,<1` | Parallel.ai backend | Nous subscribers only |
|
||||
| `fal-client>=0.13.1,<1` | FAL image gen | |
|
||||
| `edge-tts>=7.2.7,<8` | Free TTS | Microsoft Edge TTS (no API key) |
|
||||
| `PyJWT[crypto]>=2.12.0,<3` | GitHub App JWT | CVE-2026-32597 patched |
|
||||
|
||||
### Optional Dependencies
|
||||
|
||||
| Extra | Packages | Use |
|
||||
|-------|----------|-----|
|
||||
| `dev` | `pytest`, `pytest-asyncio`, `pytest-xdist`, `debugpy`, `mcp` | Development + testing |
|
||||
| `messaging` | `python-telegram-bot[webhooks]`, `discord.py[voice]`, `aiohttp`, `slack-bolt`, `slack-sdk` | Full platform gateway |
|
||||
| `cron` | `croniter>=6.0.0,<7` | Cron expression parsing |
|
||||
| `modal` | `modal>=1.0.0,<2` | Modal cloud sandboxes |
|
||||
| `daytona` | `daytona>=0.148.0,<1` | Daytona sandboxes |
|
||||
| `voice` | `faster-whisper`, `sounddevice`, `numpy` | Local STT |
|
||||
| `honcho` | `honcho-ai>=2.0.1,<3` | Honcho dialectic memory |
|
||||
| `mcp` | `mcp>=1.2.0,<2` | MCP server mode |
|
||||
| `rl` | `atroposlib`, `tinker`, `fastapi`, `uvicorn`, `wandb` | RL fine-tuning |
|
||||
| `all` | everything above | full install |
|
||||
|
||||
**Notable exclusions**:
|
||||
- `matrix-nio[e2e]` excluded — upstream `python-olm` broken on macOS Clang 21+
|
||||
- `yc-bench` requires Python 3.12+
|
||||
|
||||
---
|
||||
|
||||
## Deployment
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
# From PyPI (recommended)
|
||||
pip install hermes-agent[default,messaging,cron]
|
||||
|
||||
# From source
|
||||
git clone https://github.com/NousResearch/hermes-agent.git
|
||||
cd hermes-agent
|
||||
pip install -e ".[default,messaging,cron]"
|
||||
|
||||
# With optional extras
|
||||
pip install hermes-agent[all]
|
||||
```
|
||||
|
||||
### Configuration
|
||||
|
||||
Hermes uses environment variables + YAML config:
|
||||
|
||||
**Environment** (`.env` or shell):
|
||||
- `HERMES_HOME` — state directory (`~/.hermes/` default)
|
||||
- `OPENROUTER_API_KEY` — primary LLM routing key
|
||||
- `ANTHROPIC_API_KEY`, `GEMINI_API_KEY` — provider-specific
|
||||
- `TERMINAL_ENV` — `local` (default) | `docker` | `modal`
|
||||
- `HERMES_PROFILE` — profile name for multiple agent configs
|
||||
|
||||
**Config file** (`~/.hermes/config.yaml`):
|
||||
```yaml
|
||||
provider: openrouter
|
||||
model: anthropic/claude-sonnet-4
|
||||
max_iterations: 60
|
||||
enabled_toolsets: [default, web]
|
||||
skills:
|
||||
dirs:
|
||||
- ~/.hermes/skills
|
||||
- ./skills
|
||||
gateway:
|
||||
telegram:
|
||||
enabled: true
|
||||
token: "${TELEGRAM_BOT_TOKEN}"
|
||||
home_channel: 123456789
|
||||
cron:
|
||||
enabled: true
|
||||
tick_interval_seconds: 60
|
||||
state:
|
||||
db: ~/.hermes/state.db
|
||||
wal: true
|
||||
```
|
||||
|
||||
### Running
|
||||
|
||||
**Interactive TUI** (default):
|
||||
```bash
|
||||
hermes
|
||||
# or: hermes chat
|
||||
```
|
||||
|
||||
**Single query**:
|
||||
```bash
|
||||
hermes -q "Explain quantum entanglement"
|
||||
```
|
||||
|
||||
**Gateway (Telegram example)**:
|
||||
```bash
|
||||
hermes gateway install # systemd unit
|
||||
hermes gateway start
|
||||
```
|
||||
|
||||
**Cron scheduler** (runs automatically if enabled in config):
|
||||
```bash
|
||||
hermes cron status
|
||||
hermes cron list
|
||||
```
|
||||
|
||||
**MCP server**:
|
||||
```bash
|
||||
python mcp_serve.py --transport stdio
|
||||
# or: python mcp_serve.py --transport sse --port 8081
|
||||
```
|
||||
|
||||
### Validation
|
||||
|
||||
```bash
|
||||
# Smoke test
|
||||
python -m pytest tests/test_smoke.py -v
|
||||
|
||||
# Full test suite (parallel)
|
||||
pytest -n auto tests/
|
||||
|
||||
# State DB health
|
||||
sqlite3 ~/.hermes/state.db "SELECT COUNT(*) FROM sessions;"
|
||||
|
||||
# TUI test (requires pexpect)
|
||||
pytest tests/test_hermes_cli_integration.py -v
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Simple Research Query
|
||||
|
||||
```
|
||||
> hermes -q "What are the latest developments in KV cache compression?"
|
||||
|
||||
[Tools: web_search → web_extract × 3]
|
||||
└─ Answer: KV cache compression advances... (cost: $0.04)
|
||||
```
|
||||
|
||||
**Token flow**: ~14K input (query + tool results) → ~2K output.
|
||||
|
||||
### Example 2: File System Investigation
|
||||
|
||||
```
|
||||
> /terminal find ~/repos -name "*.py" -exec wc -l {} + | sort -n | tail -10
|
||||
|
||||
[terminal] Executed in 0.8s
|
||||
/path/to/largest.py: 1243 lines
|
||||
...
|
||||
```
|
||||
|
||||
`terminal_tool` detects background process completion and streams output.
|
||||
|
||||
### Example 3: Scheduled Report
|
||||
|
||||
**Cron job** (`~/.hermes/cron/jobs/daily-report.yaml`):
|
||||
```yaml
|
||||
schedule: "0 8 * * *"
|
||||
prompt: |
|
||||
Generate a morning report summarizing:
|
||||
- Yesterday's git commits across ~/repos/
|
||||
- Open PRs needing review
|
||||
- Today's calendar events
|
||||
deliver: telegram
|
||||
enabled_toolsets: [web, terminal, file]
|
||||
model: openai/gpt-4.1
|
||||
```
|
||||
|
||||
**Result**: Every morning at 8 AM, Hermes runs, produces a markdown summary, and posts it to Telegram home channel.
|
||||
|
||||
---
|
||||
|
||||
## Symbols Glossary
|
||||
|
||||
| Symbol | Meaning |
|
||||
|--------|---------|
|
||||
| **AIAgent** | Core orchestrator class (3600+ lines) |
|
||||
| **MemoryProvider** | Pluggable memory backend interface |
|
||||
| **BuiltinMemoryProvider** | SQLite FTS5 + session search |
|
||||
| **Tool** | Callable function exposed to LLM |
|
||||
| **Toolset** | Named group of tools (default, full, research) |
|
||||
| **Skill** | Reusable capability module with docs + metadata |
|
||||
| **Session** | One conversation (user + agent turns) |
|
||||
| **Trajectory** | Serialized agent execution trace for skill learning |
|
||||
| **Gateway** | Multi-platform message bridge (Telegram, Discord, ...) |
|
||||
| **Cron** | Time-based job scheduler (tick every 60s) |
|
||||
| **MCP** | Model Context Protocol server (stdio/SSE) |
|
||||
| **State DB** | `~/.hermes/state.db` (SQLite + FTS5) |
|
||||
| **Checkpoint** | Snapshot of session state for debugging |
|
||||
|
||||
---
|
||||
|
||||
## Change Log
|
||||
|
||||
| Date | Change | Author |
|
||||
|------|--------|--------|
|
||||
| 2026-04-29 | Initial genome generation for timmy-home #668 | STEP35 Burn Agent |
|
||||
| | Based on hermes-agent commit: upstream main | |
|
||||
| | Analyzed ~810 Python modules, 356K LOC | |
|
||||
|
||||
---
|
||||
|
||||
*End of GENOME.md — hermes-agent*
|
||||
@@ -1,313 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Cross-agent quality audit — #518
|
||||
|
||||
Fetches all PRs across Timmy_Foundation repos, classifies by agent,
|
||||
and produces a merge-rate scorecard.
|
||||
|
||||
Usage:
|
||||
python scripts/cross_agent_quality_audit.py
|
||||
python scripts/cross_agent_quality_audit.py --scorecard timmy-config/agent-quality-scorecard.md
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Any, Dict, List, Optional
|
||||
|
||||
import requests
|
||||
|
||||
GITEA_BASE = "https://forge.alexanderwhitestone.com/api/v1"
|
||||
ORG = "Timmy_Foundation"
|
||||
TOKEN = os.environ.get("GITEA_TOKEN") or (
|
||||
Path.home() / ".config" / "gitea" / "token"
|
||||
).read_text().strip()
|
||||
|
||||
HEADERS = {"Authorization": f"token {TOKEN}"}
|
||||
|
||||
# Repos to audit (active code repos)
|
||||
DEFAULT_REPOS = [
|
||||
"timmy-home",
|
||||
"hermes-agent",
|
||||
"the-nexus",
|
||||
"the-door",
|
||||
"fleet-ops",
|
||||
"burn-fleet",
|
||||
"the-playground",
|
||||
"compounding-intelligence",
|
||||
"the-beacon",
|
||||
"second-son-of-timmy",
|
||||
"timmy-academy",
|
||||
"timmy-config",
|
||||
]
|
||||
|
||||
|
||||
class AgentClassifier:
|
||||
"""Classify PRs by agent identity."""
|
||||
|
||||
# PR title prefixes that explicitly name an agent
|
||||
AGENT_TITLE_RE = re.compile(
|
||||
r"^\[(?P<agent>Claude|Ezra|Allegro|Bezalel|Timmy|Gemini|Kimi|Manus|Codex)\]",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
# Branch patterns that embed agent names
|
||||
AGENT_BRANCH_RE = re.compile(
|
||||
r"(?P<agent>claude|ezra|allegro|bezalel|timmy|gemini|kimi|manus|codex)",
|
||||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
@classmethod
|
||||
def classify(cls, pr: Dict[str, Any]) -> str:
|
||||
title = pr.get("title", "")
|
||||
branch = pr.get("head", {}).get("ref", "")
|
||||
user = pr.get("user", {}).get("login", "")
|
||||
|
||||
# 1. Explicit title tag like [Claude] or [Ezra]
|
||||
m = cls.AGENT_TITLE_RE.match(title)
|
||||
if m:
|
||||
return m.group("agent").lower()
|
||||
|
||||
# 2. Branch contains agent name (e.g. claude/issue-123)
|
||||
m = cls.AGENT_BRANCH_RE.search(branch)
|
||||
if m:
|
||||
return m.group("agent").lower()
|
||||
|
||||
# 3. Git user mapping
|
||||
if user.lower() == "claude":
|
||||
return "claude"
|
||||
if user.lower() == "rockachopa":
|
||||
# Rockachopa is the human / orchestrator — map to "burn-loop"
|
||||
return "burn-loop"
|
||||
|
||||
return "unknown"
|
||||
|
||||
|
||||
def fetch_prs(repo: str, state: str = "all", per_page: int = 50) -> List[Dict[str, Any]]:
|
||||
"""Paginate through all PRs for a repo."""
|
||||
prs: List[Dict[str, Any]] = []
|
||||
page = 1
|
||||
while True:
|
||||
url = f"{GITEA_BASE}/repos/{ORG}/{repo}/pulls?state={state}&limit={per_page}&page={page}"
|
||||
resp = requests.get(url, headers=HEADERS, timeout=30)
|
||||
resp.raise_for_status()
|
||||
batch = resp.json()
|
||||
if not batch:
|
||||
break
|
||||
prs.extend(batch)
|
||||
if len(batch) < per_page:
|
||||
break
|
||||
page += 1
|
||||
return prs
|
||||
|
||||
|
||||
def parse_datetime(dt_str: Optional[str]) -> Optional[datetime]:
|
||||
if not dt_str:
|
||||
return None
|
||||
try:
|
||||
return datetime.fromisoformat(dt_str.replace("Z", "+00:00"))
|
||||
except ValueError:
|
||||
return None
|
||||
|
||||
|
||||
def hours_between(start: Optional[str], end: Optional[str]) -> Optional[float]:
|
||||
s = parse_datetime(start)
|
||||
e = parse_datetime(end)
|
||||
if s and e:
|
||||
return (e - s).total_seconds() / 3600
|
||||
return None
|
||||
|
||||
|
||||
def audit_repos(repos: List[str]) -> Dict[str, Any]:
|
||||
"""Run the audit and return aggregated stats."""
|
||||
agent_stats: Dict[str, Dict[str, Any]] = defaultdict(
|
||||
lambda: {
|
||||
"total": 0,
|
||||
"merged": 0,
|
||||
"closed_unmerged": 0,
|
||||
"open": 0,
|
||||
"hours_to_merge": [],
|
||||
"hours_to_close": [],
|
||||
"repos": set(),
|
||||
"prs": [],
|
||||
}
|
||||
)
|
||||
|
||||
repo_stats: Dict[str, Dict[str, Any]] = {}
|
||||
|
||||
for repo in repos:
|
||||
print(f"Fetching PRs for {repo} ...", file=sys.stderr)
|
||||
try:
|
||||
prs = fetch_prs(repo)
|
||||
except requests.HTTPError as exc:
|
||||
print(f" SKIP {repo}: {exc}", file=sys.stderr)
|
||||
continue
|
||||
|
||||
repo_merged = 0
|
||||
repo_total = len(prs)
|
||||
for pr in prs:
|
||||
agent = AgentClassifier.classify(pr)
|
||||
s = agent_stats[agent]
|
||||
s["total"] += 1
|
||||
s["repos"].add(repo)
|
||||
s["prs"].append(
|
||||
{
|
||||
"repo": repo,
|
||||
"number": pr["number"],
|
||||
"title": pr["title"],
|
||||
"state": pr["state"],
|
||||
"merged": pr.get("merged", False),
|
||||
"created_at": pr.get("created_at"),
|
||||
"merged_at": pr.get("merged_at"),
|
||||
"closed_at": pr.get("closed_at"),
|
||||
}
|
||||
)
|
||||
|
||||
if pr.get("merged"):
|
||||
s["merged"] += 1
|
||||
repo_merged += 1
|
||||
h = hours_between(pr.get("created_at"), pr.get("merged_at"))
|
||||
if h is not None:
|
||||
s["hours_to_merge"].append(h)
|
||||
elif pr["state"] == "closed":
|
||||
s["closed_unmerged"] += 1
|
||||
h = hours_between(pr.get("created_at"), pr.get("closed_at"))
|
||||
if h is not None:
|
||||
s["hours_to_close"].append(h)
|
||||
else:
|
||||
s["open"] += 1
|
||||
|
||||
repo_stats[repo] = {
|
||||
"total": repo_total,
|
||||
"merged": repo_merged,
|
||||
"merge_rate": round(repo_merged / repo_total, 2) if repo_total else 0,
|
||||
}
|
||||
|
||||
# Compute derived metrics
|
||||
summary = {}
|
||||
for agent, s in sorted(agent_stats.items(), key=lambda x: -x[1]["total"]):
|
||||
total = s["total"]
|
||||
merged = s["merged"]
|
||||
closed = s["closed_unmerged"]
|
||||
resolved = merged + closed
|
||||
merge_rate = round(merged / resolved, 3) if resolved else 0
|
||||
avg_merge_hours = (
|
||||
round(sum(s["hours_to_merge"]) / len(s["hours_to_merge"]), 1)
|
||||
if s["hours_to_merge"]
|
||||
else None
|
||||
)
|
||||
avg_close_hours = (
|
||||
round(sum(s["hours_to_close"]) / len(s["hours_to_close"]), 1)
|
||||
if s["hours_to_close"]
|
||||
else None
|
||||
)
|
||||
summary[agent] = {
|
||||
"total_prs": total,
|
||||
"merged": merged,
|
||||
"closed_unmerged": closed,
|
||||
"open": s["open"],
|
||||
"merge_rate": merge_rate,
|
||||
"rejection_rate": round(closed / resolved, 3) if resolved else 0,
|
||||
"avg_hours_to_merge": avg_merge_hours,
|
||||
"avg_hours_to_close": avg_close_hours,
|
||||
"repos": sorted(s["repos"]),
|
||||
}
|
||||
|
||||
return {
|
||||
"audited_at": datetime.now(timezone.utc).isoformat(),
|
||||
"repos_audited": repos,
|
||||
"repo_stats": repo_stats,
|
||||
"agent_summary": summary,
|
||||
"raw_prs": {a: s["prs"] for a, s in agent_stats.items()},
|
||||
}
|
||||
|
||||
|
||||
def render_scorecard(data: Dict[str, Any]) -> str:
|
||||
"""Render a markdown scorecard."""
|
||||
lines = [
|
||||
"# Cross-Agent Quality Scorecard",
|
||||
"",
|
||||
f"**Audited at:** {data['audited_at']}",
|
||||
f"**Repos audited:** {', '.join(data['repos_audited'])}",
|
||||
"",
|
||||
"## Per-Agent Summary",
|
||||
"",
|
||||
"| Agent | Total PRs | Merged | Closed (unmerged) | Open | Merge Rate | Rejection Rate | Avg Hours to Merge | Avg Hours to Close |",
|
||||
"|---|---|---:|---:|---:|---:|---:|---:|---:|",
|
||||
]
|
||||
|
||||
for agent, s in data["agent_summary"].items():
|
||||
merge_hours = f"{s['avg_hours_to_merge']:.1f}" if s["avg_hours_to_merge"] is not None else "—"
|
||||
close_hours = f"{s['avg_hours_to_close']:.1f}" if s["avg_hours_to_close"] is not None else "—"
|
||||
lines.append(
|
||||
f"| {agent} | {s['total_prs']} | {s['merged']} | {s['closed_unmerged']} | "
|
||||
f"{s['open']} | {s['merge_rate']:.1%} | {s['rejection_rate']:.1%} | "
|
||||
f"{merge_hours} | {close_hours} |"
|
||||
)
|
||||
|
||||
lines.extend([
|
||||
"",
|
||||
"## Per-Repo Merge Rate",
|
||||
"",
|
||||
"| Repo | Total PRs | Merged | Merge Rate |",
|
||||
"|---|---|---:|---:|",
|
||||
])
|
||||
|
||||
for repo, s in sorted(data["repo_stats"].items(), key=lambda x: -x[1]["total"]):
|
||||
lines.append(
|
||||
f"| {repo} | {s['total']} | {s['merged']} | {s['merge_rate']:.1%} |"
|
||||
)
|
||||
|
||||
lines.extend([
|
||||
"",
|
||||
"## Methodology",
|
||||
"",
|
||||
"- **Agent classification** uses three signals in priority order:",
|
||||
" 1. Explicit title tag (e.g. `[Claude]`, `[Ezra]`)",
|
||||
" 2. Branch name containing agent name (e.g. `claude/issue-123`)",
|
||||
" 3. Git user (`claude` → claude, `Rockachopa` → burn-loop)",
|
||||
"- **Merge rate** = merged / (merged + closed_unmerged). Open PRs are excluded.",
|
||||
"- **Rejection rate** = closed_unmerged / (merged + closed_unmerged).",
|
||||
"- **Time metrics** are computed from created_at to merged_at / closed_at.",
|
||||
"",
|
||||
"## Raw Data",
|
||||
"",
|
||||
"```json",
|
||||
json.dumps(data["agent_summary"], indent=2),
|
||||
"```",
|
||||
"",
|
||||
])
|
||||
|
||||
return "\n".join(lines) + "\n"
|
||||
|
||||
|
||||
def main() -> int:
|
||||
parser = argparse.ArgumentParser(description="Cross-agent quality audit")
|
||||
parser.add_argument("--repos", nargs="+", default=DEFAULT_REPOS, help="Repos to audit")
|
||||
parser.add_argument("--scorecard", default="timmy-config/agent-quality-scorecard.md", help="Output path")
|
||||
parser.add_argument("--json", default=None, help="Also write raw JSON to path")
|
||||
args = parser.parse_args()
|
||||
|
||||
data = audit_repos(args.repos)
|
||||
|
||||
scorecard_path = Path(args.scorecard)
|
||||
scorecard_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
scorecard_path.write_text(render_scorecard(data))
|
||||
print(f"Scorecard written to {scorecard_path}", file=sys.stderr)
|
||||
|
||||
if args.json:
|
||||
json_path = Path(args.json)
|
||||
json_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
json_path.write_text(json.dumps(data, indent=2, default=str))
|
||||
print(f"Raw JSON written to {json_path}", file=sys.stderr)
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
@@ -1 +1,12 @@
|
||||
# Timmy core module
|
||||
|
||||
from .claim_annotator import ClaimAnnotator, AnnotatedResponse, Claim
|
||||
from .audit_trail import AuditTrail, AuditEntry
|
||||
|
||||
__all__ = [
|
||||
"ClaimAnnotator",
|
||||
"AnnotatedResponse",
|
||||
"Claim",
|
||||
"AuditTrail",
|
||||
"AuditEntry",
|
||||
]
|
||||
|
||||
156
src/timmy/claim_annotator.py
Normal file
156
src/timmy/claim_annotator.py
Normal file
@@ -0,0 +1,156 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Response Claim Annotator — Source Distinction System
|
||||
SOUL.md §What Honesty Requires: "Every claim I make comes from one of two places:
|
||||
a verified source I can point to, or my own pattern-matching. My user must be
|
||||
able to tell which is which."
|
||||
"""
|
||||
|
||||
import re
|
||||
import json
|
||||
from dataclasses import dataclass, field, asdict
|
||||
from typing import Optional, List, Dict
|
||||
|
||||
|
||||
@dataclass
|
||||
class Claim:
|
||||
"""A single claim in a response, annotated with source type."""
|
||||
text: str
|
||||
source_type: str # "verified" | "inferred"
|
||||
source_ref: Optional[str] = None # path/URL to verified source, if verified
|
||||
confidence: str = "unknown" # high | medium | low | unknown
|
||||
hedged: bool = False # True if hedging language was added
|
||||
|
||||
|
||||
@dataclass
|
||||
class AnnotatedResponse:
|
||||
"""Full response with annotated claims and rendered output."""
|
||||
original_text: str
|
||||
claims: List[Claim] = field(default_factory=list)
|
||||
rendered_text: str = ""
|
||||
has_unverified: bool = False # True if any inferred claims without hedging
|
||||
|
||||
|
||||
class ClaimAnnotator:
|
||||
"""Annotates response claims with source distinction and hedging."""
|
||||
|
||||
# Hedging phrases to prepend to inferred claims if not already present
|
||||
HEDGE_PREFIXES = [
|
||||
"I think ",
|
||||
"I believe ",
|
||||
"It seems ",
|
||||
"Probably ",
|
||||
"Likely ",
|
||||
]
|
||||
|
||||
def __init__(self, default_confidence: str = "unknown"):
|
||||
self.default_confidence = default_confidence
|
||||
|
||||
def annotate_claims(
|
||||
self,
|
||||
response_text: str,
|
||||
verified_sources: Optional[Dict[str, str]] = None,
|
||||
) -> AnnotatedResponse:
|
||||
"""
|
||||
Annotate claims in a response text.
|
||||
|
||||
Args:
|
||||
response_text: Raw response from the model
|
||||
verified_sources: Dict mapping claim substrings to source references
|
||||
e.g. {"Paris is the capital of France": "https://en.wikipedia.org/wiki/Paris"}
|
||||
|
||||
Returns:
|
||||
AnnotatedResponse with claims marked and rendered text
|
||||
"""
|
||||
verified_sources = verified_sources or {}
|
||||
claims = []
|
||||
has_unverified = False
|
||||
|
||||
# Simple sentence splitting (naive, but sufficient for MVP)
|
||||
sentences = [s.strip() for s in re.split(r'[.!?]\s+', response_text) if s.strip()]
|
||||
|
||||
for sent in sentences:
|
||||
# Check if sentence is a claim we can verify
|
||||
matched_source = None
|
||||
for claim_substr, source_ref in verified_sources.items():
|
||||
if claim_substr.lower() in sent.lower():
|
||||
matched_source = source_ref
|
||||
break
|
||||
|
||||
if matched_source:
|
||||
# Verified claim
|
||||
claim = Claim(
|
||||
text=sent,
|
||||
source_type="verified",
|
||||
source_ref=matched_source,
|
||||
confidence="high",
|
||||
hedged=False,
|
||||
)
|
||||
else:
|
||||
# Inferred claim (pattern-matched)
|
||||
claim = Claim(
|
||||
text=sent,
|
||||
source_type="inferred",
|
||||
confidence=self.default_confidence,
|
||||
hedged=self._has_hedge(sent),
|
||||
)
|
||||
if not claim.hedged:
|
||||
has_unverified = True
|
||||
|
||||
claims.append(claim)
|
||||
|
||||
# Render the annotated response
|
||||
rendered = self._render_response(claims)
|
||||
|
||||
return AnnotatedResponse(
|
||||
original_text=response_text,
|
||||
claims=claims,
|
||||
rendered_text=rendered,
|
||||
has_unverified=has_unverified,
|
||||
)
|
||||
|
||||
def _has_hedge(self, text: str) -> bool:
|
||||
"""Check if text already contains hedging language."""
|
||||
text_lower = text.lower()
|
||||
for prefix in self.HEDGE_PREFIXES:
|
||||
if text_lower.startswith(prefix.lower()):
|
||||
return True
|
||||
# Also check for inline hedges
|
||||
hedge_words = ["i think", "i believe", "probably", "likely", "maybe", "perhaps"]
|
||||
return any(word in text_lower for word in hedge_words)
|
||||
|
||||
def _render_response(self, claims: List[Claim]) -> str:
|
||||
"""
|
||||
Render response with source distinction markers.
|
||||
|
||||
Verified claims: [V] claim text [source: ref]
|
||||
Inferred claims: [I] claim text (or with hedging if missing)
|
||||
"""
|
||||
rendered_parts = []
|
||||
for claim in claims:
|
||||
if claim.source_type == "verified":
|
||||
part = f"[V] {claim.text}"
|
||||
if claim.source_ref:
|
||||
part += f" [source: {claim.source_ref}]"
|
||||
else: # inferred
|
||||
if not claim.hedged:
|
||||
# Add hedging if missing
|
||||
hedged_text = f"I think {claim.text[0].lower()}{claim.text[1:]}" if claim.text else claim.text
|
||||
part = f"[I] {hedged_text}"
|
||||
else:
|
||||
part = f"[I] {claim.text}"
|
||||
rendered_parts.append(part)
|
||||
return " ".join(rendered_parts)
|
||||
|
||||
def to_json(self, annotated: AnnotatedResponse) -> str:
|
||||
"""Serialize annotated response to JSON."""
|
||||
return json.dumps(
|
||||
{
|
||||
"original_text": annotated.original_text,
|
||||
"rendered_text": annotated.rendered_text,
|
||||
"has_unverified": annotated.has_unverified,
|
||||
"claims": [asdict(c) for c in annotated.claims],
|
||||
},
|
||||
indent=2,
|
||||
ensure_ascii=False,
|
||||
)
|
||||
@@ -1,45 +0,0 @@
|
||||
"""Tests for cross_agent_quality_audit.py — #518."""
|
||||
|
||||
import pytest
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent / "scripts"))
|
||||
|
||||
from cross_agent_quality_audit import AgentClassifier, hours_between
|
||||
|
||||
|
||||
class TestAgentClassifier:
|
||||
def test_title_tag_claude(self):
|
||||
pr = {"title": "[Claude] fix auth middleware", "head": {"ref": "fix/123"}, "user": {"login": "rockachopa"}}
|
||||
assert AgentClassifier.classify(pr) == "claude"
|
||||
|
||||
def test_title_tag_ezra(self):
|
||||
pr = {"title": "[Ezra] tmux fleet launcher", "head": {"ref": "burn/10"}, "user": {"login": "rockachopa"}}
|
||||
assert AgentClassifier.classify(pr) == "ezra"
|
||||
|
||||
def test_branch_name_claude(self):
|
||||
pr = {"title": "fix auth", "head": {"ref": "claude/issue-1695"}, "user": {"login": "rockachopa"}}
|
||||
assert AgentClassifier.classify(pr) == "claude"
|
||||
|
||||
def test_user_mapping(self):
|
||||
pr = {"title": "some fix", "head": {"ref": "fix/1"}, "user": {"login": "claude"}}
|
||||
assert AgentClassifier.classify(pr) == "claude"
|
||||
|
||||
def test_rockachopa_maps_to_burn_loop(self):
|
||||
pr = {"title": "some fix", "head": {"ref": "fix/1"}, "user": {"login": "Rockachopa"}}
|
||||
assert AgentClassifier.classify(pr) == "burn-loop"
|
||||
|
||||
def test_unknown_fallback(self):
|
||||
pr = {"title": "some fix", "head": {"ref": "fix/1"}, "user": {"login": "random"}}
|
||||
assert AgentClassifier.classify(pr) == "unknown"
|
||||
|
||||
|
||||
class TestHoursBetween:
|
||||
def test_same_day(self):
|
||||
h = hours_between("2026-04-22T10:00:00Z", "2026-04-22T12:00:00Z")
|
||||
assert h == 2.0
|
||||
|
||||
def test_none_returns_none(self):
|
||||
assert hours_between(None, "2026-04-22T12:00:00Z") is None
|
||||
assert hours_between("2026-04-22T10:00:00Z", None) is None
|
||||
@@ -1,84 +1,123 @@
|
||||
"""
|
||||
Test that the hermes-agent GENOME.md exists and contains required sections.
|
||||
|
||||
Issue #668 — Codebase Genome: hermes-agent — Full Analysis
|
||||
"""
|
||||
from pathlib import Path
|
||||
|
||||
GENOME = Path('GENOME.md')
|
||||
|
||||
|
||||
def read_genome() -> str:
|
||||
assert GENOME.exists(), 'GENOME.md must exist at repo root'
|
||||
return GENOME.read_text(encoding='utf-8')
|
||||
GENOME = Path(__file__).parent.parent / "genomes" / "hermes-agent-GENOME.md"
|
||||
|
||||
|
||||
def test_genome_exists():
|
||||
assert GENOME.exists(), 'GENOME.md must exist at repo root'
|
||||
"""GENOME.md must exist at genomes/hermes-agent-GENOME.md."""
|
||||
assert GENOME.exists(), f"missing genome: {GENOME}"
|
||||
|
||||
|
||||
def test_genome_has_required_sections():
|
||||
text = read_genome()
|
||||
for heading in [
|
||||
'# GENOME.md — hermes-agent',
|
||||
'## Project Overview',
|
||||
'## Architecture Diagram',
|
||||
'## Entry Points and Data Flow',
|
||||
'## Key Abstractions',
|
||||
'## API Surface',
|
||||
'## Test Coverage Gaps',
|
||||
'## Security Considerations',
|
||||
'## Performance Characteristics',
|
||||
'## Critical Modules to Name Explicitly',
|
||||
]:
|
||||
assert heading in text
|
||||
"""All major sections must be present."""
|
||||
text = GENOME.read_text(encoding="utf-8")
|
||||
required = [
|
||||
"# GENOME.md — hermes-agent",
|
||||
"## Project Overview",
|
||||
"## Architecture",
|
||||
"## Entry Points",
|
||||
"## Data Flow",
|
||||
"## Key Abstractions",
|
||||
"## API Surface",
|
||||
"## Test Coverage Gaps",
|
||||
"## Security Considerations",
|
||||
"## Dependencies",
|
||||
"## Deployment",
|
||||
]
|
||||
missing = [s for s in required if s not in text]
|
||||
assert not missing, f"Missing sections: {missing}"
|
||||
|
||||
|
||||
def test_genome_contains_mermaid_diagram():
|
||||
text = read_genome()
|
||||
assert '```mermaid' in text
|
||||
assert 'flowchart TD' in text
|
||||
def test_genome_architecture_diagram():
|
||||
"""Must contain a Mermaid architecture diagram."""
|
||||
text = GENOME.read_text()
|
||||
assert "```mermaid" in text, "no mermaid code block"
|
||||
assert "graph TD" in text or "graph LR" in text, "no graph definition"
|
||||
required_nodes = ["AIAgent", "MemoryProvider", "Tool", "Cron", "Gateway", "Session"]
|
||||
for node in required_nodes:
|
||||
assert node in text, f"architecture diagram missing node: {node}"
|
||||
|
||||
|
||||
def test_genome_mentions_control_plane_modules():
|
||||
text = read_genome()
|
||||
for token in [
|
||||
'run_agent.py',
|
||||
'model_tools.py',
|
||||
'tools/registry.py',
|
||||
'toolsets.py',
|
||||
'cli.py',
|
||||
'hermes_cli/main.py',
|
||||
'hermes_state.py',
|
||||
'gateway/run.py',
|
||||
'acp_adapter/server.py',
|
||||
'cron/scheduler.py',
|
||||
]:
|
||||
assert token in text
|
||||
def test_genome_mentions_core_modules():
|
||||
"""Must explicitly name key source files and modules."""
|
||||
text = GENOME.read_text()
|
||||
required = [
|
||||
"run_agent.py",
|
||||
"agent/input_sanitizer.py",
|
||||
"agent/memory_manager.py",
|
||||
"agent/prompt_builder.py",
|
||||
"agent/trajectory.py",
|
||||
"gateway/session.py",
|
||||
"gateway/delivery.py",
|
||||
"cron/scheduler.py",
|
||||
"tools/terminal_tool.py",
|
||||
"skills/",
|
||||
"hermes_state.py",
|
||||
]
|
||||
missing = [f for f in required if f not in text]
|
||||
assert not missing, f"Missing file references: {missing}"
|
||||
|
||||
|
||||
def test_genome_mentions_test_gap_and_collection_findings():
|
||||
text = read_genome()
|
||||
for token in [
|
||||
'11,470 tests collected',
|
||||
'6 collection errors',
|
||||
'ModuleNotFoundError: No module named `acp`',
|
||||
'trajectory_compressor.py',
|
||||
'batch_runner.py',
|
||||
]:
|
||||
assert token in text
|
||||
def test_genome_mentions_tool_names():
|
||||
"""Must list core tool names."""
|
||||
text = GENOME.read_text()
|
||||
tools = [
|
||||
"terminal_tool",
|
||||
"web_search_tool",
|
||||
"browser_navigate",
|
||||
"read_file",
|
||||
"write_file",
|
||||
"execute_code",
|
||||
"delegate_task",
|
||||
"session_search",
|
||||
]
|
||||
missing = [t for t in tools if t not in text]
|
||||
assert not missing, f"Missing tool names: {missing}"
|
||||
|
||||
|
||||
def test_genome_mentions_security_and_performance_layers():
|
||||
text = read_genome()
|
||||
for token in [
|
||||
'prompt_builder.py',
|
||||
'approval.py',
|
||||
'file_tools.py',
|
||||
'mcp_tool.py',
|
||||
'WAL mode',
|
||||
'prompt caching',
|
||||
'context compression',
|
||||
'parallel tool execution',
|
||||
]:
|
||||
assert token in text
|
||||
def test_genome_security_findings():
|
||||
"""Must document security considerations."""
|
||||
text = GENOME.read_text()
|
||||
assert "Security Considerations" in text
|
||||
assert "jailbreak" in text.lower()
|
||||
assert "PII" in text or "personally identifiable" in text.lower()
|
||||
assert "credential" in text.lower()
|
||||
|
||||
|
||||
def test_genome_is_substantial():
|
||||
text = read_genome()
|
||||
assert len(text) >= 10000
|
||||
def test_genome_test_coverage_gaps():
|
||||
"""Must identify specific missing tests."""
|
||||
text = GENOME.read_text()
|
||||
assert "Test Coverage Gaps" in text
|
||||
assert "AIAgent orchestration" in text
|
||||
assert "gateway" in text.lower()
|
||||
assert "cron" in text.lower()
|
||||
|
||||
|
||||
def test_genome_not_a_stub():
|
||||
"""GENOME.md must be substantial (>10KB)."""
|
||||
size = GENOME.stat().st_size
|
||||
assert size >= 10_000, f"GENOME.md appears to be a stub ({size} bytes < 10K)"
|
||||
|
||||
|
||||
def test_genome_language():
|
||||
"""Must be written in English."""
|
||||
text = GENOME.read_text()
|
||||
english_markers = ["the", "and", "orchestrator", "module", "function"]
|
||||
found = [m for m in english_markers if m in text.lower()]
|
||||
assert len(found) >= 4, "GENOME.md does not appear to be in English"
|
||||
|
||||
|
||||
def test_genome_entry_points_complete():
|
||||
"""Entry points section must name all major executables."""
|
||||
text = GENOME.read_text()
|
||||
assert "run_agent.py" in text
|
||||
assert "cli.py" in text
|
||||
assert "hermes_cli" in text
|
||||
assert "gateway" in text
|
||||
assert "mcp_serve.py" in text
|
||||
assert "cron" in text
|
||||
|
||||
103
tests/timmy/test_claim_annotator.py
Normal file
103
tests/timmy/test_claim_annotator.py
Normal file
@@ -0,0 +1,103 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Tests for claim_annotator.py — verifies source distinction is present."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import json
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
|
||||
|
||||
from timmy.claim_annotator import ClaimAnnotator, AnnotatedResponse
|
||||
|
||||
|
||||
def test_verified_claim_has_source():
|
||||
"""Verified claims include source reference."""
|
||||
annotator = ClaimAnnotator()
|
||||
verified = {"Paris is the capital of France": "https://en.wikipedia.org/wiki/Paris"}
|
||||
response = "Paris is the capital of France. It is a beautiful city."
|
||||
|
||||
result = annotator.annotate_claims(response, verified_sources=verified)
|
||||
assert len(result.claims) > 0
|
||||
verified_claims = [c for c in result.claims if c.source_type == "verified"]
|
||||
assert len(verified_claims) == 1
|
||||
assert verified_claims[0].source_ref == "https://en.wikipedia.org/wiki/Paris"
|
||||
assert "[V]" in result.rendered_text
|
||||
assert "[source:" in result.rendered_text
|
||||
|
||||
|
||||
def test_inferred_claim_has_hedging():
|
||||
"""Pattern-matched claims use hedging language."""
|
||||
annotator = ClaimAnnotator()
|
||||
response = "The weather is nice today. It might rain tomorrow."
|
||||
|
||||
result = annotator.annotate_claims(response)
|
||||
inferred_claims = [c for c in result.claims if c.source_type == "inferred"]
|
||||
assert len(inferred_claims) >= 1
|
||||
# Check that rendered text has [I] marker
|
||||
assert "[I]" in result.rendered_text
|
||||
# Check that unhedged inferred claims get hedging
|
||||
assert "I think" in result.rendered_text or "I believe" in result.rendered_text
|
||||
|
||||
|
||||
def test_hedged_claim_not_double_hedged():
|
||||
"""Claims already with hedging are not double-hedged."""
|
||||
annotator = ClaimAnnotator()
|
||||
response = "I think the sky is blue. It is a nice day."
|
||||
|
||||
result = annotator.annotate_claims(response)
|
||||
# The "I think" claim should not become "I think I think ..."
|
||||
assert "I think I think" not in result.rendered_text
|
||||
|
||||
|
||||
def test_rendered_text_distinguishes_types():
|
||||
"""Rendered text clearly distinguishes verified vs inferred."""
|
||||
annotator = ClaimAnnotator()
|
||||
verified = {"Earth is round": "https://science.org/earth"}
|
||||
response = "Earth is round. Stars are far away."
|
||||
|
||||
result = annotator.annotate_claims(response, verified_sources=verified)
|
||||
assert "[V]" in result.rendered_text # verified marker
|
||||
assert "[I]" in result.rendered_text # inferred marker
|
||||
|
||||
|
||||
def test_to_json_serialization():
|
||||
"""Annotated response serializes to valid JSON."""
|
||||
annotator = ClaimAnnotator()
|
||||
response = "Test claim."
|
||||
result = annotator.annotate_claims(response)
|
||||
json_str = annotator.to_json(result)
|
||||
parsed = json.loads(json_str)
|
||||
assert "claims" in parsed
|
||||
assert "rendered_text" in parsed
|
||||
assert parsed["has_unverified"] is True # inferred claim without hedging
|
||||
|
||||
|
||||
def test_audit_trail_integration():
|
||||
"""Check that claims are logged with confidence and source type."""
|
||||
# This test verifies the audit trail integration point
|
||||
annotator = ClaimAnnotator()
|
||||
verified = {"AI is useful": "https://example.com/ai"}
|
||||
response = "AI is useful. It can help with tasks."
|
||||
|
||||
result = annotator.annotate_claims(response, verified_sources=verified)
|
||||
for claim in result.claims:
|
||||
assert claim.source_type in ("verified", "inferred")
|
||||
assert claim.confidence in ("high", "medium", "low", "unknown")
|
||||
if claim.source_type == "verified":
|
||||
assert claim.source_ref is not None
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
test_verified_claim_has_source()
|
||||
print("✓ test_verified_claim_has_source passed")
|
||||
test_inferred_claim_has_hedging()
|
||||
print("✓ test_inferred_claim_has_hedging passed")
|
||||
test_hedged_claim_not_double_hedged()
|
||||
print("✓ test_hedged_claim_not_double_hedged passed")
|
||||
test_rendered_text_distinguishes_types()
|
||||
print("✓ test_rendered_text_distinguishes_types passed")
|
||||
test_to_json_serialization()
|
||||
print("✓ test_to_json_serialization passed")
|
||||
test_audit_trail_integration()
|
||||
print("✓ test_audit_trail_integration passed")
|
||||
print("\nAll tests passed!")
|
||||
@@ -1,244 +0,0 @@
|
||||
# Cross-Agent Quality Scorecard
|
||||
|
||||
**Audited at:** 2026-04-22T06:17:43.574309+00:00
|
||||
**Repos audited:** timmy-home, hermes-agent, the-nexus, the-door, fleet-ops, burn-fleet, the-playground, compounding-intelligence, the-beacon, second-son-of-timmy, timmy-academy, timmy-config
|
||||
|
||||
## Per-Agent Summary
|
||||
|
||||
| Agent | Total PRs | Merged | Closed (unmerged) | Open | Merge Rate | Rejection Rate | Avg Hours to Merge | Avg Hours to Close |
|
||||
|---|---|---:|---:|---:|---:|---:|---:|---:|
|
||||
| burn-loop | 1733 | 346 | 1239 | 148 | 21.8% | 78.2% | 18.9 | 20.6 |
|
||||
| unknown | 843 | 598 | 214 | 31 | 73.6% | 26.4% | 2.3 | 11.3 |
|
||||
| claude | 264 | 138 | 121 | 5 | 53.3% | 46.7% | 3.3 | 6.2 |
|
||||
| gemini | 95 | 24 | 70 | 1 | 25.5% | 74.5% | 0.5 | 11.3 |
|
||||
| timmy | 28 | 15 | 11 | 2 | 57.7% | 42.3% | 9.8 | 20.2 |
|
||||
| bezalel | 21 | 11 | 9 | 1 | 55.0% | 45.0% | 2.7 | 8.0 |
|
||||
| allegro | 21 | 7 | 11 | 3 | 38.9% | 61.1% | 31.1 | 20.2 |
|
||||
| ezra | 8 | 2 | 3 | 3 | 40.0% | 60.0% | 4.4 | 16.8 |
|
||||
| kimi | 6 | 3 | 3 | 0 | 50.0% | 50.0% | 39.5 | 0.5 |
|
||||
| manus | 6 | 5 | 1 | 0 | 83.3% | 16.7% | 0.0 | 18.8 |
|
||||
| codex | 2 | 2 | 0 | 0 | 100.0% | 0.0% | 2.3 | — |
|
||||
|
||||
## Per-Repo Merge Rate
|
||||
|
||||
| Repo | Total PRs | Merged | Merge Rate |
|
||||
|---|---|---:|---:|
|
||||
| the-nexus | 985 | 501 | 51.0% |
|
||||
| hermes-agent | 519 | 128 | 25.0% |
|
||||
| timmy-config | 404 | 140 | 35.0% |
|
||||
| timmy-home | 270 | 104 | 39.0% |
|
||||
| fleet-ops | 266 | 84 | 32.0% |
|
||||
| the-beacon | 175 | 62 | 35.0% |
|
||||
| the-door | 153 | 31 | 20.0% |
|
||||
| second-son-of-timmy | 111 | 82 | 74.0% |
|
||||
| compounding-intelligence | 50 | 9 | 18.0% |
|
||||
| the-playground | 44 | 2 | 5.0% |
|
||||
| burn-fleet | 38 | 2 | 5.0% |
|
||||
| timmy-academy | 12 | 6 | 50.0% |
|
||||
|
||||
## Methodology
|
||||
|
||||
- **Agent classification** uses three signals in priority order:
|
||||
1. Explicit title tag (e.g. `[Claude]`, `[Ezra]`)
|
||||
2. Branch name containing agent name (e.g. `claude/issue-123`)
|
||||
3. Git user (`claude` → claude, `Rockachopa` → burn-loop)
|
||||
- **Merge rate** = merged / (merged + closed_unmerged). Open PRs are excluded.
|
||||
- **Rejection rate** = closed_unmerged / (merged + closed_unmerged).
|
||||
- **Time metrics** are computed from created_at to merged_at / closed_at.
|
||||
|
||||
## Raw Data
|
||||
|
||||
```json
|
||||
{
|
||||
"burn-loop": {
|
||||
"total_prs": 1733,
|
||||
"merged": 346,
|
||||
"closed_unmerged": 1239,
|
||||
"open": 148,
|
||||
"merge_rate": 0.218,
|
||||
"rejection_rate": 0.782,
|
||||
"avg_hours_to_merge": 18.9,
|
||||
"avg_hours_to_close": 20.6,
|
||||
"repos": [
|
||||
"burn-fleet",
|
||||
"compounding-intelligence",
|
||||
"fleet-ops",
|
||||
"hermes-agent",
|
||||
"second-son-of-timmy",
|
||||
"the-beacon",
|
||||
"the-door",
|
||||
"the-nexus",
|
||||
"the-playground",
|
||||
"timmy-academy",
|
||||
"timmy-config",
|
||||
"timmy-home"
|
||||
]
|
||||
},
|
||||
"unknown": {
|
||||
"total_prs": 843,
|
||||
"merged": 598,
|
||||
"closed_unmerged": 214,
|
||||
"open": 31,
|
||||
"merge_rate": 0.736,
|
||||
"rejection_rate": 0.264,
|
||||
"avg_hours_to_merge": 2.3,
|
||||
"avg_hours_to_close": 11.3,
|
||||
"repos": [
|
||||
"fleet-ops",
|
||||
"hermes-agent",
|
||||
"second-son-of-timmy",
|
||||
"the-beacon",
|
||||
"the-door",
|
||||
"the-nexus",
|
||||
"timmy-academy",
|
||||
"timmy-config",
|
||||
"timmy-home"
|
||||
]
|
||||
},
|
||||
"claude": {
|
||||
"total_prs": 264,
|
||||
"merged": 138,
|
||||
"closed_unmerged": 121,
|
||||
"open": 5,
|
||||
"merge_rate": 0.533,
|
||||
"rejection_rate": 0.467,
|
||||
"avg_hours_to_merge": 3.3,
|
||||
"avg_hours_to_close": 6.2,
|
||||
"repos": [
|
||||
"hermes-agent",
|
||||
"the-nexus",
|
||||
"timmy-config",
|
||||
"timmy-home"
|
||||
]
|
||||
},
|
||||
"gemini": {
|
||||
"total_prs": 95,
|
||||
"merged": 24,
|
||||
"closed_unmerged": 70,
|
||||
"open": 1,
|
||||
"merge_rate": 0.255,
|
||||
"rejection_rate": 0.745,
|
||||
"avg_hours_to_merge": 0.5,
|
||||
"avg_hours_to_close": 11.3,
|
||||
"repos": [
|
||||
"hermes-agent",
|
||||
"the-nexus",
|
||||
"timmy-config",
|
||||
"timmy-home"
|
||||
]
|
||||
},
|
||||
"timmy": {
|
||||
"total_prs": 28,
|
||||
"merged": 15,
|
||||
"closed_unmerged": 11,
|
||||
"open": 2,
|
||||
"merge_rate": 0.577,
|
||||
"rejection_rate": 0.423,
|
||||
"avg_hours_to_merge": 9.8,
|
||||
"avg_hours_to_close": 20.2,
|
||||
"repos": [
|
||||
"burn-fleet",
|
||||
"hermes-agent",
|
||||
"the-nexus",
|
||||
"timmy-config",
|
||||
"timmy-home"
|
||||
]
|
||||
},
|
||||
"bezalel": {
|
||||
"total_prs": 21,
|
||||
"merged": 11,
|
||||
"closed_unmerged": 9,
|
||||
"open": 1,
|
||||
"merge_rate": 0.55,
|
||||
"rejection_rate": 0.45,
|
||||
"avg_hours_to_merge": 2.7,
|
||||
"avg_hours_to_close": 8.0,
|
||||
"repos": [
|
||||
"burn-fleet",
|
||||
"hermes-agent",
|
||||
"the-beacon",
|
||||
"the-nexus",
|
||||
"timmy-config",
|
||||
"timmy-home"
|
||||
]
|
||||
},
|
||||
"allegro": {
|
||||
"total_prs": 21,
|
||||
"merged": 7,
|
||||
"closed_unmerged": 11,
|
||||
"open": 3,
|
||||
"merge_rate": 0.389,
|
||||
"rejection_rate": 0.611,
|
||||
"avg_hours_to_merge": 31.1,
|
||||
"avg_hours_to_close": 20.2,
|
||||
"repos": [
|
||||
"burn-fleet",
|
||||
"hermes-agent",
|
||||
"the-beacon",
|
||||
"the-nexus",
|
||||
"timmy-config",
|
||||
"timmy-home"
|
||||
]
|
||||
},
|
||||
"ezra": {
|
||||
"total_prs": 8,
|
||||
"merged": 2,
|
||||
"closed_unmerged": 3,
|
||||
"open": 3,
|
||||
"merge_rate": 0.4,
|
||||
"rejection_rate": 0.6,
|
||||
"avg_hours_to_merge": 4.4,
|
||||
"avg_hours_to_close": 16.8,
|
||||
"repos": [
|
||||
"burn-fleet",
|
||||
"fleet-ops",
|
||||
"timmy-config",
|
||||
"timmy-home"
|
||||
]
|
||||
},
|
||||
"kimi": {
|
||||
"total_prs": 6,
|
||||
"merged": 3,
|
||||
"closed_unmerged": 3,
|
||||
"open": 0,
|
||||
"merge_rate": 0.5,
|
||||
"rejection_rate": 0.5,
|
||||
"avg_hours_to_merge": 39.5,
|
||||
"avg_hours_to_close": 0.5,
|
||||
"repos": [
|
||||
"hermes-agent",
|
||||
"the-nexus",
|
||||
"timmy-home"
|
||||
]
|
||||
},
|
||||
"manus": {
|
||||
"total_prs": 6,
|
||||
"merged": 5,
|
||||
"closed_unmerged": 1,
|
||||
"open": 0,
|
||||
"merge_rate": 0.833,
|
||||
"rejection_rate": 0.167,
|
||||
"avg_hours_to_merge": 0.0,
|
||||
"avg_hours_to_close": 18.8,
|
||||
"repos": [
|
||||
"the-nexus",
|
||||
"timmy-config"
|
||||
]
|
||||
},
|
||||
"codex": {
|
||||
"total_prs": 2,
|
||||
"merged": 2,
|
||||
"closed_unmerged": 0,
|
||||
"open": 0,
|
||||
"merge_rate": 1.0,
|
||||
"rejection_rate": 0.0,
|
||||
"avg_hours_to_merge": 2.3,
|
||||
"avg_hours_to_close": null,
|
||||
"repos": [
|
||||
"timmy-config",
|
||||
"timmy-home"
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Reference in New Issue
Block a user