Merge branch 'main' into architecture-planning

This commit is contained in:
Teknium
2026-01-08 00:59:51 -08:00
committed by GitHub
30 changed files with 6097 additions and 2934 deletions

23
.cursorrules Normal file

@@ -0,0 +1,23 @@
Hermes-Agent is an agent harness for LLMs.
When building, the tool functionality lives in the tools/ directory, where each tool (or, in some cases, a group of tools built for the same execution category or API) gets its own script.
Each tool is then consolidated in the model_tools.py file in the repo root.
There is also a way to consolidate sets of tools in toolsets.py for the agent to use.
The primary agent runner code is in run_agent, but other runners could be developed using the tools and framework.
Always keep the tools, model_tools.py, and toolsets.py consistent when changing any of them; otherwise they can become desynced in a way that breaks functionality.
The expected pathway for API keys is to set them in a .env file in the repo root.
Test scripts will be placed in tests/
The run_agent loop is setup to:
- Process the enabled toolsets to provide to the model,
- Pipe in a prompt or problem from the input to the agent,
- Loop the LLM each time it calls a tool, until the model decides no more tools are needed and provides a natural language response,
- Return that response.
There are additional caveats for logging, where we restructure the "tools" as a system prompt so it can be stored in a format that can be handled properly later.
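The run_agent loop described above can be sketched roughly as follows. This is an illustrative outline only; the function names, message shapes, and signatures are assumptions, not the actual Hermes-Agent API:

```python
def run_conversation(llm, tools, prompt, max_turns=10):
    """Loop the LLM while it keeps calling tools, then return its final text.

    `llm` is any callable taking (messages, tools=...) and returning an
    assistant message dict; `tools` maps tool names to Python callables.
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = llm(messages, tools=tools)   # model sees the enabled toolsets
        messages.append(reply)
        if not reply.get("tool_calls"):      # no tool call => final NL answer
            return reply["content"]
        for call in reply["tool_calls"]:     # execute each requested tool
            result = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": result})
    return messages[-1].get("content", "")   # max_turns reached
```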

49
.env.example Normal file

@@ -0,0 +1,49 @@
# Hermes Agent Environment Configuration
# Copy this file to .env and fill in your API keys
# Get API keys from the URLs listed below
# =============================================================================
# REQUIRED API KEYS
# =============================================================================
# Anthropic API Key - Main agent model
# Get at: https://console.anthropic.com/
ANTHROPIC_API_KEY=
# Firecrawl API Key - Web search, extract, and crawl
# Get at: https://firecrawl.dev/
FIRECRAWL_API_KEY=
# Nous Research API Key - Vision analysis and multi-model reasoning
# Get at: https://inference-api.nousresearch.com/
NOUS_API_KEY=
# Morph API Key - Terminal/command execution tools
# Get at: https://morph.so/
MORPH_API_KEY=
# FAL.ai API Key - Image generation
# Get at: https://fal.ai/
FAL_KEY=
# =============================================================================
# OPTIONAL API KEYS
# =============================================================================
# OpenAI API Key - Optional, for enhanced Hecate features
# Get at: https://platform.openai.com/
OPENAI_API_KEY=
# =============================================================================
# OPTIONAL CONFIGURATION
# =============================================================================
# Terminal Tool Settings
HECATE_VM_LIFETIME_SECONDS=300
HECATE_DEFAULT_SNAPSHOT_ID=snapshot_p5294qxt
# Debug Logging (set to "true" to enable, logs saved to ./logs/)
WEB_TOOLS_DEBUG=false
VISION_TOOLS_DEBUG=false
MOA_TOOLS_DEBUG=false
IMAGE_TOOLS_DEBUG=false

15
.gitignore vendored

@@ -16,4 +16,17 @@ __pycache__/
export*
__pycache__/model_tools.cpython-310.pyc
__pycache__/web_tools.cpython-310.pyc
logs/
data/
.pytest_cache/
tmp/
temp_vision_images/
hermes-*/*
examples/
tests/quick_test_dataset.jsonl
tests/sample_dataset.jsonl
run_datagen_kimik2-thinking.sh
run_datagen_megascience_glm4-6.sh
run_datagen_sonnet.sh
source-data/*

123
README.md

@@ -10,15 +10,46 @@ An AI agent with advanced tool-calling capabilities, featuring a flexible toolse
- **Reasoning Tools**: Advanced multi-model reasoning (Mixture of Agents)
- **Creative Tools**: Generate images from text prompts
- **Toolsets System**: Organize tools into logical groups for different scenarios
- **Batch Processing**: Process datasets in parallel with checkpointing and statistics tracking
- **Ephemeral System Prompts**: Guide model behavior without polluting training datasets
## Setup
### 1. Install Dependencies
```bash
# Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Install required packages
pip install -r requirements.txt
# Install Hecate for terminal tools
git clone git@github.com:NousResearch/hecate.git
cd hecate
pip install -e .
cd ..
```
### 2. Configure Environment Variables
```bash
# Copy the example environment file
cp .env.example .env
# Edit .env and add your API keys
nano .env # or use your preferred editor
```
**Required API Keys:**
- `ANTHROPIC_API_KEY` - Main agent model (get at: https://console.anthropic.com/)
- `FIRECRAWL_API_KEY` - Web tools (get at: https://firecrawl.dev/)
- `NOUS_API_KEY` - Vision & reasoning tools (get at: https://inference-api.nousresearch.com/)
- `MORPH_API_KEY` - Terminal tools (get at: https://morph.so/)
- `FAL_KEY` - Image generation (get at: https://fal.ai/)
- `OPENAI_API_KEY` - Optional, for some Hecate features
See `.env.example` for all available configuration options including debug settings and terminal tool configuration.
## Toolsets System
The agent uses a toolsets system for organizing and managing tools. All tools must be part of a toolset to be accessible - individual tool selection is not supported. This ensures consistent and logical grouping of capabilities.
@@ -47,6 +78,9 @@ python run_agent.py --enabled_toolsets=research --query "Find latest AI papers"
# Combine multiple toolsets
python run_agent.py --enabled_toolsets=web,vision --query "Analyze this website"
# Enable all toolsets explicitly (same as omitting the flag)
python run_agent.py --enabled_toolsets=all --query "Do web research and run commands if helpful"
# Safe mode (no terminal access)
python run_agent.py --enabled_toolsets=safe --query "Help without running commands"
@@ -101,34 +135,109 @@ create_custom_toolset(
agent = AIAgent(enabled_toolsets=["my_tools"])
```
## Batch Processing
Process multiple prompts from a dataset in parallel with automatic checkpointing and statistics tracking:
```bash
# Basic batch processing
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--run_name=my_run
# With specific distribution
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=20 \
--run_name=image_run \
--distribution=image_gen \
--num_workers=4
```
**Key Features:**
- Parallel processing with configurable workers
- Toolset distributions for varied data generation
- Automatic checkpointing and resume capability
- Combined output in `data/<run_name>/trajectories.jsonl`
- Tool usage statistics and success rates
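The dataset file is JSON Lines: one JSON object per line with a required `prompt` field (entries missing it are skipped at load time; extra fields are carried along). A minimal sketch of producing such a file, with illustrative prompt text:

```python
import json

# Each line is a standalone JSON object with a required "prompt" field.
prompts = [
    {"prompt": "Summarize the latest AI safety papers."},
    {"prompt": "Generate an image of a sunset over mountains."},
]
with open("prompts.jsonl", "w", encoding="utf-8") as f:
    for entry in prompts:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```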
**Quick Start:** See [QUICKSTART_BATCH.md](QUICKSTART_BATCH.md) for a 5-minute getting started guide.
**Full Documentation:** See [BATCH_PROCESSING.md](BATCH_PROCESSING.md) for comprehensive documentation.
### Ephemeral System Prompts
The ephemeral system prompt feature allows you to guide the model's behavior during batch processing **without** saving that prompt to the training dataset trajectories. This is useful for:
- Guiding model behavior during data collection
- Adding task-specific instructions
- Keeping saved trajectories clean and focused on tool-calling format
**Example:**
```bash
python batch_runner.py \
--dataset_file=prompts.jsonl \
--batch_size=10 \
--run_name=my_run \
--ephemeral_system_prompt="You are a helpful assistant focused on image generation."
```
The ephemeral prompt will influence the model's behavior during execution, but **only the standard tool-calling system prompt** will be saved in the trajectory files.
**Documentation:** See [docs/ephemeral_system_prompt.md](docs/ephemeral_system_prompt.md) for complete details.
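The separation described above can be sketched as follows: the ephemeral prompt is included in the live message list, then stripped when converting to the saved from/value trajectory format. The function names are illustrative, not the actual implementation:

```python
def build_messages(user_prompt, ephemeral_system_prompt=None):
    """Messages used during execution; may include the ephemeral prompt."""
    messages = []
    if ephemeral_system_prompt:
        messages.append({"role": "system", "content": ephemeral_system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages

def to_saved_trajectory(messages, tool_system_prompt):
    """Only the standard tool-calling system prompt is persisted."""
    saved = [{"from": "system", "value": tool_system_prompt}]
    for m in messages:
        if m["role"] == "system":   # drop the ephemeral prompt
            continue
        saved.append({"from": m["role"], "value": m["content"]})
    return saved
```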
## Command Line Arguments
**Single Agent (`run_agent.py`):**
- `--query`: The question or task for the agent
- `--model`: Model to use (default: claude-opus-4-20250514)
- `--api_key`: API key for authentication
- `--base_url`: API endpoint URL
- `--max_turns`: Maximum number of tool-calling iterations
- `--enabled_toolsets`: Comma-separated list of toolsets to enable. Use `all` (or `*`) to enable everything. If omitted, all toolsets are enabled by default.
- `--disabled_toolsets`: Comma-separated list of toolsets to disable
- `--list_tools`: List all available toolsets and tools
- `--save_trajectories`: Save conversation trajectories to JSONL files
**Batch Processing (`batch_runner.py`):**
- `--dataset_file`: Path to JSONL file with prompts
- `--batch_size`: Number of prompts per batch
- `--run_name`: Name for this run (for output/checkpointing)
- `--distribution`: Toolset distribution to use (default: "default")
- `--num_workers`: Number of parallel workers (default: 4)
- `--resume`: Resume from checkpoint if interrupted
- `--ephemeral_system_prompt`: System prompt used during execution but NOT saved to trajectories
- `--list_distributions`: List available toolset distributions
## Environment Variables
All environment variables can be configured in the `.env` file (copy from `.env.example`).
**Core API Keys:**
- `ANTHROPIC_API_KEY`: Main agent model
- `FIRECRAWL_API_KEY`: Web tools (search, extract, crawl)
- `NOUS_API_KEY`: Vision and reasoning tools
- `MORPH_API_KEY`: Terminal tools
- `FAL_KEY`: Image generation tools
- `OPENAI_API_KEY`: Optional, for some Hecate features
**Configuration Options:**
- `HECATE_VM_LIFETIME_SECONDS`: VM lifetime (default: 300)
- `HECATE_DEFAULT_SNAPSHOT_ID`: Default snapshot (default: snapshot_p5294qxt)
- `WEB_TOOLS_DEBUG`, `VISION_TOOLS_DEBUG`, `MOA_TOOLS_DEBUG`, `IMAGE_TOOLS_DEBUG`: Enable debug logging
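A minimal sketch of validating the required keys at startup, assuming they have already been loaded into the process environment (e.g. from `.env` via python-dotenv's `load_dotenv()`, if the project uses it):

```python
import os

# Required keys from the list above; OPENAI_API_KEY is optional.
REQUIRED_KEYS = [
    "ANTHROPIC_API_KEY",
    "FIRECRAWL_API_KEY",
    "NOUS_API_KEY",
    "MORPH_API_KEY",
    "FAL_KEY",
]

def missing_keys(env=os.environ):
    """Return the required keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```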
## Documentation
**Single Agent Usage:**
- `TOOLSETS_README.md`: Comprehensive guide to the toolsets system
- `toolsets.py`: View and modify available toolsets
- `model_tools.py`: Core tool definitions and handlers
**Batch Processing:**
- `QUICKSTART_BATCH.md`: 5-minute quick start guide
- `BATCH_PROCESSING.md`: Complete batch processing documentation
- `toolset_distributions.py`: Toolset distributions for data generation
## Examples
See `TOOLSETS_README.md` for extensive examples of using different toolsets for various scenarios.

753
batch_runner.py Normal file

@@ -0,0 +1,753 @@
#!/usr/bin/env python3
"""
Batch Agent Runner
This module provides parallel batch processing capabilities for running the agent
across multiple prompts from a dataset. It includes:
- Dataset loading and batching
- Parallel batch processing with multiprocessing
- Checkpointing for fault tolerance and resumption
- Trajectory saving in the proper format (from/value pairs)
- Tool usage statistics aggregation across all batches
Usage:
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run
# Resume an interrupted run
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run --resume
# Use a specific toolset distribution
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run --distribution=image_gen
"""
import json
import logging
import os
import time
from pathlib import Path
from typing import List, Dict, Any, Optional, Tuple
from datetime import datetime
from multiprocessing import Pool, Manager, Lock
import traceback
import fire
from run_agent import AIAgent
from toolset_distributions import (
get_distribution,
list_distributions,
sample_toolsets_from_distribution,
validate_distribution
)
# Global configuration for worker processes
_WORKER_CONFIG = {}
def _extract_tool_stats(messages: List[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
"""
Extract tool usage statistics from message history.
Args:
messages (List[Dict]): Message history
Returns:
Dict: Tool statistics with counts and success/failure rates
"""
tool_stats = {}
# Track tool calls and their results
tool_calls_map = {} # Map tool_call_id to tool name
for msg in messages:
# Track tool calls from assistant messages
if msg["role"] == "assistant" and "tool_calls" in msg and msg["tool_calls"]:
for tool_call in msg["tool_calls"]:
tool_name = tool_call["function"]["name"]
tool_call_id = tool_call["id"]
# Initialize stats for this tool if not exists
if tool_name not in tool_stats:
tool_stats[tool_name] = {
"count": 0,
"success": 0,
"failure": 0
}
tool_stats[tool_name]["count"] += 1
tool_calls_map[tool_call_id] = tool_name
# Track tool responses
elif msg["role"] == "tool":
tool_call_id = msg.get("tool_call_id", "")
content = msg.get("content", "")
# Determine if tool call was successful
is_success = True
try:
# Try to parse as JSON and check for actual error values
content_json = json.loads(content) if isinstance(content, str) else content
if isinstance(content_json, dict):
# Check if error field exists AND has a non-null value
if "error" in content_json and content_json["error"] is not None:
is_success = False
# Special handling for terminal tool responses
# Terminal wraps its response in a "content" field
if "content" in content_json and isinstance(content_json["content"], dict):
inner_content = content_json["content"]
# Check for actual error (non-null error field)
# Note: non-zero exit codes are not failures - the model can self-correct
if inner_content.get("error") is not None:
is_success = False
# Check for "success": false pattern used by some tools
if content_json.get("success") is False:
is_success = False
except (json.JSONDecodeError, TypeError):
# If not JSON, check if content is empty or explicitly states an error
# Note: We avoid simple substring matching to prevent false positives
if not content:
is_success = False
# Only mark as failure if it explicitly starts with "Error:" or "ERROR:"
elif content.strip().lower().startswith("error:"):
is_success = False
# Update success/failure count
if tool_call_id in tool_calls_map:
tool_name = tool_calls_map[tool_call_id]
if is_success:
tool_stats[tool_name]["success"] += 1
else:
tool_stats[tool_name]["failure"] += 1
return tool_stats
def _process_single_prompt(
prompt_index: int,
prompt_data: Dict[str, Any],
batch_num: int,
config: Dict[str, Any]
) -> Dict[str, Any]:
"""
Process a single prompt with the agent.
Args:
prompt_index (int): Index of prompt in dataset
prompt_data (Dict): Prompt data containing 'prompt' field
batch_num (int): Batch number
config (Dict): Configuration dict with agent parameters
Returns:
Dict: Result containing trajectory, stats, and metadata
"""
prompt = prompt_data["prompt"]
try:
# Sample toolsets from distribution for this prompt
selected_toolsets = sample_toolsets_from_distribution(config["distribution"])
if config.get("verbose"):
print(f" Prompt {prompt_index}: Using toolsets {selected_toolsets}")
# Initialize agent with sampled toolsets
agent = AIAgent(
base_url=config.get("base_url"),
api_key=config.get("api_key"),
model=config["model"],
max_iterations=config["max_iterations"],
enabled_toolsets=selected_toolsets,
save_trajectories=False, # We handle saving ourselves
verbose_logging=config.get("verbose", False),
ephemeral_system_prompt=config.get("ephemeral_system_prompt"),
log_prefix_chars=config.get("log_prefix_chars", 100)
)
# Run the agent with task_id to ensure each task gets its own isolated VM
result = agent.run_conversation(prompt, task_id=f"task_{prompt_index}")
# Extract tool usage statistics
tool_stats = _extract_tool_stats(result["messages"])
# Convert to trajectory format (using existing method)
trajectory = agent._convert_to_trajectory_format(
result["messages"],
prompt,
result["completed"]
)
return {
"success": True,
"prompt_index": prompt_index,
"trajectory": trajectory,
"tool_stats": tool_stats,
"completed": result["completed"],
"api_calls": result["api_calls"],
"toolsets_used": selected_toolsets,
"metadata": {
"batch_num": batch_num,
"timestamp": datetime.now().isoformat(),
"model": config["model"]
}
}
except Exception as e:
print(f"❌ Error processing prompt {prompt_index}: {e}")
if config.get("verbose"):
traceback.print_exc()
return {
"success": False,
"prompt_index": prompt_index,
"error": str(e),
"trajectory": None,
"tool_stats": {},
"toolsets_used": [],
"metadata": {
"batch_num": batch_num,
"timestamp": datetime.now().isoformat()
}
}
def _process_batch_worker(args: Tuple) -> Dict[str, Any]:
"""
Worker function to process a single batch of prompts.
Args:
args (Tuple): (batch_num, batch_data, output_dir, completed_prompts, config)
Returns:
Dict: Batch results with statistics
"""
batch_num, batch_data, output_dir, completed_prompts_set, config = args
output_dir = Path(output_dir)
print(f"\n🔄 Batch {batch_num}: Starting ({len(batch_data)} prompts)")
# Output file for this batch
batch_output_file = output_dir / f"batch_{batch_num}.jsonl"
# Filter out already completed prompts
prompts_to_process = [
(idx, data) for idx, data in batch_data
if idx not in completed_prompts_set
]
if not prompts_to_process:
print(f"✅ Batch {batch_num}: Already completed (skipping)")
return {
"batch_num": batch_num,
"processed": 0,
"skipped": len(batch_data),
"tool_stats": {},
"completed_prompts": []
}
print(f" Processing {len(prompts_to_process)} prompts (skipping {len(batch_data) - len(prompts_to_process)} already completed)")
# Initialize aggregated stats for this batch
batch_tool_stats = {}
completed_in_batch = []
# Process each prompt sequentially in this batch
for prompt_index, prompt_data in prompts_to_process:
# Process the prompt
result = _process_single_prompt(
prompt_index,
prompt_data,
batch_num,
config
)
# Save trajectory if successful
if result["success"] and result["trajectory"]:
trajectory_entry = {
"prompt_index": prompt_index,
"conversations": result["trajectory"],
"metadata": result["metadata"],
"completed": result["completed"],
"api_calls": result["api_calls"],
"toolsets_used": result["toolsets_used"]
}
# Append to batch output file
with open(batch_output_file, 'a', encoding='utf-8') as f:
f.write(json.dumps(trajectory_entry, ensure_ascii=False) + "\n")
# Aggregate tool statistics
for tool_name, stats in result.get("tool_stats", {}).items():
if tool_name not in batch_tool_stats:
batch_tool_stats[tool_name] = {
"count": 0,
"success": 0,
"failure": 0
}
batch_tool_stats[tool_name]["count"] += stats["count"]
batch_tool_stats[tool_name]["success"] += stats["success"]
batch_tool_stats[tool_name]["failure"] += stats["failure"]
completed_in_batch.append(prompt_index)
print(f" ✅ Prompt {prompt_index} completed")
print(f"✅ Batch {batch_num}: Completed ({len(prompts_to_process)} prompts processed)")
return {
"batch_num": batch_num,
"processed": len(prompts_to_process),
"skipped": len(batch_data) - len(prompts_to_process),
"tool_stats": batch_tool_stats,
"completed_prompts": completed_in_batch
}
class BatchRunner:
"""
Manages batch processing of agent prompts with checkpointing and statistics.
"""
def __init__(
self,
dataset_file: str,
batch_size: int,
run_name: str,
distribution: str = "default",
max_iterations: int = 10,
base_url: str = None,
api_key: str = None,
model: str = "claude-opus-4-20250514",
num_workers: int = 4,
verbose: bool = False,
ephemeral_system_prompt: str = None,
log_prefix_chars: int = 100,
):
"""
Initialize the batch runner.
Args:
dataset_file (str): Path to the dataset JSONL file with 'prompt' field
batch_size (int): Number of prompts per batch
run_name (str): Name for this run (used for checkpointing and output)
distribution (str): Toolset distribution to use (default: "default")
max_iterations (int): Max iterations per agent run
base_url (str): Base URL for model API
api_key (str): API key for model
model (str): Model name to use
num_workers (int): Number of parallel workers
verbose (bool): Enable verbose logging
ephemeral_system_prompt (str): System prompt used during agent execution but NOT saved to trajectories (optional)
log_prefix_chars (int): Number of characters to show in log previews for tool calls/responses (default: 100)
"""
self.dataset_file = Path(dataset_file)
self.batch_size = batch_size
self.run_name = run_name
self.distribution = distribution
self.max_iterations = max_iterations
self.base_url = base_url
self.api_key = api_key
self.model = model
self.num_workers = num_workers
self.verbose = verbose
self.ephemeral_system_prompt = ephemeral_system_prompt
self.log_prefix_chars = log_prefix_chars
# Validate distribution
if not validate_distribution(distribution):
raise ValueError(f"Unknown distribution: {distribution}. Available: {list(list_distributions().keys())}")
# Setup output directory
self.output_dir = Path("data") / run_name
self.output_dir.mkdir(parents=True, exist_ok=True)
# Checkpoint file
self.checkpoint_file = self.output_dir / "checkpoint.json"
# Statistics file
self.stats_file = self.output_dir / "statistics.json"
# Load dataset
self.dataset = self._load_dataset()
# Create batches
self.batches = self._create_batches()
print(f"📊 Batch Runner Initialized")
print(f" Dataset: {self.dataset_file} ({len(self.dataset)} prompts)")
print(f" Batch size: {self.batch_size}")
print(f" Total batches: {len(self.batches)}")
print(f" Run name: {self.run_name}")
print(f" Distribution: {self.distribution}")
print(f" Output directory: {self.output_dir}")
print(f" Workers: {self.num_workers}")
if self.ephemeral_system_prompt:
prompt_preview = self.ephemeral_system_prompt[:60] + "..." if len(self.ephemeral_system_prompt) > 60 else self.ephemeral_system_prompt
print(f" 🔒 Ephemeral system prompt: '{prompt_preview}'")
def _load_dataset(self) -> List[Dict[str, Any]]:
"""
Load dataset from JSONL file.
Returns:
List[Dict]: List of dataset entries
"""
if not self.dataset_file.exists():
raise FileNotFoundError(f"Dataset file not found: {self.dataset_file}")
dataset = []
with open(self.dataset_file, 'r', encoding='utf-8') as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if not line:
continue
try:
entry = json.loads(line)
if 'prompt' not in entry:
print(f"⚠️ Warning: Line {line_num} missing 'prompt' field, skipping")
continue
dataset.append(entry)
except json.JSONDecodeError as e:
print(f"⚠️ Warning: Invalid JSON on line {line_num}: {e}")
continue
if not dataset:
raise ValueError(f"No valid entries found in dataset file: {self.dataset_file}")
return dataset
def _create_batches(self) -> List[List[Tuple[int, Dict[str, Any]]]]:
"""
Split dataset into batches with indices.
Returns:
List of batches, where each batch is a list of (index, entry) tuples
"""
batches = []
for i in range(0, len(self.dataset), self.batch_size):
batch = [(idx, entry) for idx, entry in enumerate(self.dataset[i:i + self.batch_size], start=i)]
batches.append(batch)
return batches
def _load_checkpoint(self) -> Dict[str, Any]:
"""
Load checkpoint data if it exists.
Returns:
Dict: Checkpoint data with completed prompt indices
"""
if not self.checkpoint_file.exists():
return {
"run_name": self.run_name,
"completed_prompts": [],
"batch_stats": {},
"last_updated": None
}
try:
with open(self.checkpoint_file, 'r', encoding='utf-8') as f:
return json.load(f)
except Exception as e:
print(f"⚠️ Warning: Failed to load checkpoint: {e}")
return {
"run_name": self.run_name,
"completed_prompts": [],
"batch_stats": {},
"last_updated": None
}
def _save_checkpoint(self, checkpoint_data: Dict[str, Any], lock: Optional[Lock] = None):
"""
Save checkpoint data.
Args:
checkpoint_data (Dict): Checkpoint data to save
lock (Lock): Optional lock for thread-safe access
"""
checkpoint_data["last_updated"] = datetime.now().isoformat()
if lock:
with lock:
with open(self.checkpoint_file, 'w', encoding='utf-8') as f:
json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)
else:
with open(self.checkpoint_file, 'w', encoding='utf-8') as f:
json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)
def run(self, resume: bool = False):
"""
Run the batch processing pipeline.
Args:
resume (bool): Whether to resume from checkpoint
"""
print("\n" + "=" * 70)
print("🚀 Starting Batch Processing")
print("=" * 70)
# Load checkpoint
checkpoint_data = self._load_checkpoint() if resume else {
"run_name": self.run_name,
"completed_prompts": [],
"batch_stats": {},
"last_updated": None
}
if resume and checkpoint_data.get("completed_prompts"):
print(f"📂 Resuming from checkpoint ({len(checkpoint_data['completed_prompts'])} prompts already completed)")
# Prepare configuration for workers
config = {
"distribution": self.distribution,
"model": self.model,
"max_iterations": self.max_iterations,
"base_url": self.base_url,
"api_key": self.api_key,
"verbose": self.verbose,
"ephemeral_system_prompt": self.ephemeral_system_prompt,
"log_prefix_chars": self.log_prefix_chars
}
# Get completed prompts set
completed_prompts_set = set(checkpoint_data.get("completed_prompts", []))
# Aggregate statistics across all batches
total_tool_stats = {}
start_time = time.time()
# Process batches in parallel
with Pool(processes=self.num_workers) as pool:
# Create tasks for each batch
tasks = [
(
batch_num,
batch_data,
str(self.output_dir), # Convert Path to string for pickling
completed_prompts_set,
config
)
for batch_num, batch_data in enumerate(self.batches)
]
# Use map to process batches in parallel
results = pool.map(_process_batch_worker, tasks)
# Aggregate all batch statistics and update checkpoint
all_completed_prompts = list(completed_prompts_set)
for batch_result in results:
# Add newly completed prompts
all_completed_prompts.extend(batch_result.get("completed_prompts", []))
# Aggregate tool stats
for tool_name, stats in batch_result.get("tool_stats", {}).items():
if tool_name not in total_tool_stats:
total_tool_stats[tool_name] = {
"count": 0,
"success": 0,
"failure": 0
}
total_tool_stats[tool_name]["count"] += stats["count"]
total_tool_stats[tool_name]["success"] += stats["success"]
total_tool_stats[tool_name]["failure"] += stats["failure"]
# Save final checkpoint
checkpoint_data["completed_prompts"] = all_completed_prompts
self._save_checkpoint(checkpoint_data)
# Calculate success rates
for tool_name in total_tool_stats:
stats = total_tool_stats[tool_name]
total_calls = stats["success"] + stats["failure"]
if total_calls > 0:
stats["success_rate"] = round(stats["success"] / total_calls * 100, 2)
stats["failure_rate"] = round(stats["failure"] / total_calls * 100, 2)
else:
stats["success_rate"] = 0.0
stats["failure_rate"] = 0.0
# Combine all batch files into a single trajectories.jsonl file
combined_file = self.output_dir / "trajectories.jsonl"
print(f"\n📦 Combining batch files into {combined_file.name}...")
with open(combined_file, 'w', encoding='utf-8') as outfile:
for batch_num in range(len(self.batches)):
batch_file = self.output_dir / f"batch_{batch_num}.jsonl"
if batch_file.exists():
with open(batch_file, 'r', encoding='utf-8') as infile:
for line in infile:
outfile.write(line)
print(f"✅ Combined {len(self.batches)} batch files into trajectories.jsonl")
# Save final statistics
final_stats = {
"run_name": self.run_name,
"distribution": self.distribution,
"total_prompts": len(self.dataset),
"total_batches": len(self.batches),
"batch_size": self.batch_size,
"model": self.model,
"completed_at": datetime.now().isoformat(),
"duration_seconds": round(time.time() - start_time, 2),
"tool_statistics": total_tool_stats
}
with open(self.stats_file, 'w', encoding='utf-8') as f:
json.dump(final_stats, f, indent=2, ensure_ascii=False)
# Print summary
print("\n" + "=" * 70)
print("📊 BATCH PROCESSING COMPLETE")
print("=" * 70)
print(f"✅ Total prompts processed: {len(self.dataset)}")
print(f"✅ Total batches: {len(self.batches)}")
print(f"⏱️ Total duration: {round(time.time() - start_time, 2)}s")
print(f"\n📈 Tool Usage Statistics:")
print("-" * 70)
if total_tool_stats:
# Sort by count descending
sorted_tools = sorted(
total_tool_stats.items(),
key=lambda x: x[1]["count"],
reverse=True
)
print(f"{'Tool Name':<25} {'Count':<10} {'Success':<10} {'Failure':<10} {'Success Rate':<12}")
print("-" * 70)
for tool_name, stats in sorted_tools:
print(
f"{tool_name:<25} "
f"{stats['count']:<10} "
f"{stats['success']:<10} "
f"{stats['failure']:<10} "
f"{stats['success_rate']:.1f}%"
)
else:
print("No tool calls were made during this run.")
print(f"\n💾 Results saved to: {self.output_dir}")
print(f" - Trajectories: trajectories.jsonl (combined)")
print(f" - Individual batches: batch_*.jsonl (for debugging)")
print(f" - Statistics: {self.stats_file.name}")
print(f" - Checkpoint: {self.checkpoint_file.name}")
def main(
dataset_file: str = None,
batch_size: int = None,
run_name: str = None,
distribution: str = "default",
model: str = "claude-opus-4-20250514",
api_key: str = None,
base_url: str = "https://api.anthropic.com/v1/",
max_turns: int = 10,
num_workers: int = 4,
resume: bool = False,
verbose: bool = False,
list_distributions: bool = False,
ephemeral_system_prompt: str = None,
log_prefix_chars: int = 100,
):
"""
Run batch processing of agent prompts from a dataset.
Args:
dataset_file (str): Path to JSONL file with 'prompt' field in each entry
batch_size (int): Number of prompts per batch
run_name (str): Name for this run (used for output and checkpointing)
distribution (str): Toolset distribution to use (default: "default")
model (str): Model name to use (default: "claude-opus-4-20250514")
api_key (str): API key for model authentication
base_url (str): Base URL for model API
max_turns (int): Maximum number of tool calling iterations per prompt (default: 10)
num_workers (int): Number of parallel worker processes (default: 4)
resume (bool): Resume from checkpoint if run was interrupted (default: False)
verbose (bool): Enable verbose logging (default: False)
list_distributions (bool): List available toolset distributions and exit
ephemeral_system_prompt (str): System prompt used during agent execution but NOT saved to trajectories (optional)
log_prefix_chars (int): Number of characters to show in log previews for tool calls/responses (default: 100)
Examples:
# Basic usage
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run
# Resume interrupted run
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run --resume
# Use specific distribution
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=image_test --distribution=image_gen
# With ephemeral system prompt (not saved to dataset)
python batch_runner.py --dataset_file=data.jsonl --batch_size=10 --run_name=my_run \\
--ephemeral_system_prompt="You are a helpful assistant focused on image generation."
# List available distributions
python batch_runner.py --list_distributions
"""
# Handle list distributions
if list_distributions:
from toolset_distributions import list_distributions as get_all_dists, print_distribution_info
print("📊 Available Toolset Distributions")
print("=" * 70)
all_dists = get_all_dists()
for dist_name in sorted(all_dists.keys()):
print_distribution_info(dist_name)
print("\n💡 Usage:")
print(" python batch_runner.py --dataset_file=data.jsonl --batch_size=10 \\")
print(" --run_name=my_run --distribution=<name>")
return
# Validate required arguments
if not dataset_file:
print("❌ Error: --dataset_file is required")
return
if not batch_size or batch_size < 1:
print("❌ Error: --batch_size must be a positive integer")
return
if not run_name:
print("❌ Error: --run_name is required")
return
# Initialize and run batch runner
try:
runner = BatchRunner(
dataset_file=dataset_file,
batch_size=batch_size,
run_name=run_name,
distribution=distribution,
max_iterations=max_turns,
base_url=base_url,
api_key=api_key,
model=model,
num_workers=num_workers,
verbose=verbose,
ephemeral_system_prompt=ephemeral_system_prompt,
log_prefix_chars=log_prefix_chars
)
runner.run(resume=resume)
except Exception as e:
print(f"\n❌ Fatal error: {e}")
if verbose:
traceback.print_exc()
return 1
if __name__ == "__main__":
fire.Fire(main)
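The docstring above expects a JSONL dataset with a `prompt` field in each entry. A minimal sketch of producing and consuming such a file (the example prompts and the temp-file location are illustrative, not from the repo):

```python
import json
import tempfile
from pathlib import Path

# Each line of the dataset file is a standalone JSON object with a
# 'prompt' field, matching what --dataset_file expects.
entries = [
    {"prompt": "Summarize the top result for 'python 3.13 release notes'."},
    {"prompt": "Generate a landscape_16_9 image of a lighthouse at dusk."},
]

dataset_path = Path(tempfile.gettempdir()) / "data.jsonl"
with dataset_path.open("w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Reading it back the way a batch runner might:
with dataset_path.open(encoding="utf-8") as f:
    prompts = [json.loads(line)["prompt"] for line in f if line.strip()]

print(len(prompts))  # prints 2
```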


@@ -28,13 +28,15 @@ Usage:
 import json
 import asyncio
-from typing import Dict, Any, List
-from web_tools import web_search_tool, web_extract_tool, web_crawl_tool, check_firecrawl_api_key
-from terminal_tool import terminal_tool, check_hecate_requirements, TERMINAL_TOOL_DESCRIPTION
-from vision_tools import vision_analyze_tool, check_vision_requirements
-from mixture_of_agents_tool import mixture_of_agents_tool, check_moa_requirements
-from image_generation_tool import image_generate_tool, check_image_generation_requirements
+from typing import Dict, Any, List, Optional
+from tools.web_tools import web_search_tool, web_extract_tool, web_crawl_tool, check_firecrawl_api_key
+from tools.simple_terminal_tool import simple_terminal_tool, check_requirements as check_simple_terminal_requirements, SIMPLE_TERMINAL_TOOL_DESCRIPTION
+# Keep old terminal tool for backwards compatibility if needed
+# from tools.terminal_tool import terminal_tool, check_hecate_requirements, TERMINAL_TOOL_DESCRIPTION
+from tools.vision_tools import vision_analyze_tool, check_vision_requirements
+from tools.mixture_of_agents_tool import mixture_of_agents_tool, check_moa_requirements
+from tools.image_generation_tool import image_generate_tool, check_image_generation_requirements
 from toolsets import (
     get_toolset, resolve_toolset, resolve_multiple_toolsets,
     get_all_toolsets, get_toolset_names, validate_toolset,
@@ -111,7 +113,7 @@ def get_web_tool_definitions() -> List[Dict[str, Any]]:
 def get_terminal_tool_definitions() -> List[Dict[str, Any]]:
     """
     Get tool definitions for terminal tools in OpenAI's expected format.

     Returns:
         List[Dict]: List of terminal tool definitions compatible with OpenAI API
     """
@@ -120,7 +122,7 @@ def get_terminal_tool_definitions() -> List[Dict[str, Any]]:
             "type": "function",
             "function": {
                 "name": "terminal",
-                "description": TERMINAL_TOOL_DESCRIPTION,
+                "description": SIMPLE_TERMINAL_TOOL_DESCRIPTION,
                 "parameters": {
                     "type": "object",
                     "properties": {
@@ -128,28 +130,18 @@ def get_terminal_tool_definitions() -> List[Dict[str, Any]]:
                             "type": "string",
                             "description": "The command to execute on the VM"
                         },
-                        "input_keys": {
-                            "type": "string",
-                            "description": "Keystrokes to send to the most recent interactive session (e.g., 'hello\\n' for typing hello + Enter). If no active session exists, this will be ignored."
-                        },
                         "background": {
                             "type": "boolean",
                             "description": "Whether to run the command in the background (default: false)",
                             "default": False
                         },
-                        "idle_threshold": {
-                            "type": "number",
-                            "description": "Seconds to wait for output before considering session idle (default: 5.0)",
-                            "default": 5.0,
-                            "minimum": 0.1
-                        },
                         "timeout": {
                             "type": "integer",
                             "description": "Command timeout in seconds (optional)",
                             "minimum": 1
                         }
                     },
-                    "required": []
+                    "required": ["command"]
                 }
             }
         }
@@ -262,11 +254,11 @@ def get_all_tool_names() -> List[str]:
     # Web tools
     if check_firecrawl_api_key():
         tool_names.extend(["web_search", "web_extract", "web_crawl"])

     # Terminal tools
-    if check_hecate_requirements():
+    if check_simple_terminal_requirements():
         tool_names.extend(["terminal"])

     # Vision tools
     if check_vision_requirements():
         tool_names.extend(["vision_analyze"])
@@ -346,11 +338,11 @@ def get_tool_definitions(
     if check_firecrawl_api_key():
         for tool in get_web_tool_definitions():
             all_available_tools_map[tool["function"]["name"]] = tool

-    if check_hecate_requirements():
+    if check_simple_terminal_requirements():
         for tool in get_terminal_tool_definitions():
             all_available_tools_map[tool["function"]["name"]] = tool

     if check_vision_requirements():
         for tool in get_vision_tool_definitions():
             all_available_tools_map[tool["function"]["name"]] = tool
@@ -478,30 +470,29 @@ def handle_web_function_call(function_name: str, function_args: Dict[str, Any])
         return asyncio.run(web_crawl_tool(url, instructions, "basic"))
     else:
-        return json.dumps({"error": f"Unknown web function: {function_name}"})
+        return json.dumps({"error": f"Unknown web function: {function_name}"}, ensure_ascii=False)


-def handle_terminal_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
+def handle_terminal_function_call(function_name: str, function_args: Dict[str, Any], task_id: Optional[str] = None) -> str:
     """
     Handle function calls for terminal tools.

     Args:
         function_name (str): Name of the terminal function to call
         function_args (Dict): Arguments for the function
+        task_id (str): Unique identifier for this task to isolate VMs between concurrent tasks (optional)

     Returns:
         str: Function result as JSON string
     """
     if function_name == "terminal":
         command = function_args.get("command")
-        input_keys = function_args.get("input_keys")
         background = function_args.get("background", False)
-        idle_threshold = function_args.get("idle_threshold", 5.0)
         timeout = function_args.get("timeout")
-        return terminal_tool(command, input_keys, None, background, idle_threshold, timeout)
+        return simple_terminal_tool(command=command, background=background, timeout=timeout, task_id=task_id)
     else:
-        return json.dumps({"error": f"Unknown terminal function: {function_name}"})
+        return json.dumps({"error": f"Unknown terminal function: {function_name}"}, ensure_ascii=False)


 def handle_vision_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
@@ -525,7 +516,7 @@ def handle_vision_function_call(function_name: str, function_args: Dict[str, Any
         return asyncio.run(vision_analyze_tool(image_url, full_prompt, "gemini-2.5-flash"))
     else:
-        return json.dumps({"error": f"Unknown vision function: {function_name}"})
+        return json.dumps({"error": f"Unknown vision function: {function_name}"}, ensure_ascii=False)


 def handle_moa_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
@@ -543,13 +534,13 @@ def handle_moa_function_call(function_name: str, function_args: Dict[str, Any])
         user_prompt = function_args.get("user_prompt", "")
         if not user_prompt:
-            return json.dumps({"error": "user_prompt is required for MoA processing"})
+            return json.dumps({"error": "user_prompt is required for MoA processing"}, ensure_ascii=False)

         # Run async function in event loop
         return asyncio.run(mixture_of_agents_tool(user_prompt=user_prompt))
     else:
-        return json.dumps({"error": f"Unknown MoA function: {function_name}"})
+        return json.dumps({"error": f"Unknown MoA function: {function_name}"}, ensure_ascii=False)


 def handle_image_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
@@ -567,7 +558,7 @@ def handle_image_function_call(function_name: str, function_args: Dict[str, Any]
         prompt = function_args.get("prompt", "")
         if not prompt:
-            return json.dumps({"success": False, "image": None})
+            return json.dumps({"success": False, "image": None}, ensure_ascii=False)

         image_size = function_args.get("image_size", "landscape_16_9")
@@ -581,8 +572,21 @@ def handle_image_function_call(function_name: str, function_args: Dict[str, Any]
         allow_nsfw_images = True
         seed = None

-        # Run async function in event loop
-        return asyncio.run(image_generate_tool(
+        # Run async function in event loop with proper handling for multiprocessing
+        try:
+            # Try to get existing event loop
+            loop = asyncio.get_event_loop()
+            if loop.is_closed():
+                # If closed, create a new one
+                loop = asyncio.new_event_loop()
+                asyncio.set_event_loop(loop)
+        except RuntimeError:
+            # No event loop in current thread, create one
+            loop = asyncio.new_event_loop()
+            asyncio.set_event_loop(loop)
+
+        # Run the coroutine in the event loop
+        result = loop.run_until_complete(image_generate_tool(
             prompt=prompt,
             image_size=image_size,
             num_inference_steps=num_inference_steps,
@@ -594,26 +598,29 @@ def handle_image_function_call(function_name: str, function_args: Dict[str, Any]
             allow_nsfw_images=allow_nsfw_images,
             seed=seed
         ))
+        return result
     else:
-        return json.dumps({"error": f"Unknown image generation function: {function_name}"})
+        return json.dumps({"error": f"Unknown image generation function: {function_name}"}, ensure_ascii=False)


-def handle_function_call(function_name: str, function_args: Dict[str, Any]) -> str:
+def handle_function_call(function_name: str, function_args: Dict[str, Any], task_id: Optional[str] = None) -> str:
     """
     Main function call dispatcher that routes calls to appropriate toolsets.

     This function determines which toolset a function belongs to and dispatches
     the call to the appropriate handler. This makes it easy to add new toolsets
     without changing the main calling interface.

     Args:
         function_name (str): Name of the function to call
         function_args (Dict): Arguments for the function
+        task_id (str): Unique identifier for this task to isolate VMs between concurrent tasks (optional)

     Returns:
         str: Function result as JSON string

     Raises:
         None: Returns error as JSON string instead of raising exceptions
     """
@@ -621,32 +628,33 @@ def handle_function_call(function_name: str, function_args: Dict[str, Any]) -> s
         # Route web tools
         if function_name in ["web_search", "web_extract", "web_crawl"]:
             return handle_web_function_call(function_name, function_args)

         # Route terminal tools
         elif function_name in ["terminal"]:
-            return handle_terminal_function_call(function_name, function_args)
+            return handle_terminal_function_call(function_name, function_args, task_id)

         # Route vision tools
         elif function_name in ["vision_analyze"]:
             return handle_vision_function_call(function_name, function_args)

         # Route MoA tools
         elif function_name in ["mixture_of_agents"]:
             return handle_moa_function_call(function_name, function_args)

         # Route image generation tools
         elif function_name in ["image_generate"]:
             return handle_image_function_call(function_name, function_args)

         else:
             error_msg = f"Unknown function: {function_name}"
             print(f"{error_msg}")
-            return json.dumps({"error": error_msg})
+            return json.dumps({"error": error_msg}, ensure_ascii=False)

     except Exception as e:
         error_msg = f"Error executing {function_name}: {str(e)}"
         print(f"{error_msg}")
-        return json.dumps({"error": error_msg})
+        return json.dumps({"error": error_msg}, ensure_ascii=False)


 def get_available_toolsets() -> Dict[str, Dict[str, Any]]:
     """
@@ -663,10 +671,10 @@ def get_available_toolsets() -> Dict[str, Dict[str, Any]]:
             "requirements": ["FIRECRAWL_API_KEY environment variable"]
         },
         "terminal_tools": {
-            "available": check_hecate_requirements(),
-            "tools": ["terminal_tool"],
-            "description": "Execute commands with optional interactive session support on Linux VMs",
-            "requirements": ["MORPH_API_KEY environment variable", "hecate package"]
+            "available": check_simple_terminal_requirements(),
+            "tools": ["simple_terminal_tool"],
+            "description": "Execute commands on secure Linux VMs without session persistence",
+            "requirements": ["MORPH_API_KEY environment variable"]
         },
         "vision_tools": {
             "available": check_vision_requirements(),
@@ -693,13 +701,13 @@ def get_available_toolsets() -> Dict[str, Dict[str, Any]]:
 def check_toolset_requirements() -> Dict[str, bool]:
     """
     Check if all requirements for available toolsets are met.

     Returns:
         Dict: Status of each toolset's requirements
     """
     return {
         "web_tools": check_firecrawl_api_key(),
-        "terminal_tools": check_hecate_requirements(),
+        "terminal_tools": check_simple_terminal_requirements(),
         "vision_tools": check_vision_requirements(),
         "moa_tools": check_moa_requirements(),
         "image_tools": check_image_generation_requirements()

pyproject.toml (new file, 28 lines)

@@ -0,0 +1,28 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "hermes-agent"
version = "0.1.0"
description = "AI agent with advanced tool-calling and toolsets"
readme = "README.md"
requires-python = ">=3.10"
authors = [{ name = "Hermes Agent" }]
license = { text = "MIT" }
dependencies = [
"firecrawl-py",
"openai",
"fal-client",
"python-dotenv",
"fire"
]
[project.scripts]
hermes-agent = "run_agent:main"
[tool.setuptools]
py-modules = ["run_agent", "model_tools", "toolsets"]
[tool.setuptools.packages.find]
include = ["tools"]


@@ -3,4 +3,7 @@ openai
 fal-client
 fire
-git@github.com:NousResearch/hecate.git
 tenacity
+python-dotenv
+fire
+httpx


@@ -28,9 +28,22 @@ from typing import List, Dict, Any, Optional
 from openai import OpenAI
 import fire
 from datetime import datetime
+from pathlib import Path
+
+# Load environment variables from .env file
+from dotenv import load_dotenv
+
+# Load .env file if it exists
+env_path = Path(__file__).parent / '.env'
+if env_path.exists():
+    load_dotenv(dotenv_path=env_path)
+    print(f"✅ Loaded environment variables from {env_path}")
+else:
+    print(f" No .env file found at {env_path}. Using system environment variables.")

 # Import our tool system
 from model_tools import get_tool_definitions, handle_function_call, check_toolset_requirements
+from tools.terminal_tool import cleanup_vm


 class AIAgent:
@@ -42,20 +55,22 @@ class AIAgent:
     """

     def __init__(
         self,
         base_url: str = None,
         api_key: str = None,
         model: str = "gpt-4",
         max_iterations: int = 10,
         tool_delay: float = 1.0,
         enabled_toolsets: List[str] = None,
         disabled_toolsets: List[str] = None,
         save_trajectories: bool = False,
-        verbose_logging: bool = False
+        verbose_logging: bool = False,
+        ephemeral_system_prompt: str = None,
+        log_prefix_chars: int = 100,
     ):
         """
         Initialize the AI Agent.

         Args:
             base_url (str): Base URL for the model API (optional)
             api_key (str): API key for authentication (optional, uses env var if not provided)
@@ -66,13 +81,17 @@ class AIAgent:
             disabled_toolsets (List[str]): Disable tools from these toolsets (optional)
             save_trajectories (bool): Whether to save conversation trajectories to JSONL files (default: False)
             verbose_logging (bool): Enable verbose logging for debugging (default: False)
+            ephemeral_system_prompt (str): System prompt used during agent execution but NOT saved to trajectories (optional)
+            log_prefix_chars (int): Number of characters to show in log previews for tool calls/responses (default: 100)
         """
         self.model = model
         self.max_iterations = max_iterations
         self.tool_delay = tool_delay
         self.save_trajectories = save_trajectories
         self.verbose_logging = verbose_logging
+        self.ephemeral_system_prompt = ephemeral_system_prompt
+        self.log_prefix_chars = log_prefix_chars

         # Store toolset filtering options
         self.enabled_toolsets = enabled_toolsets
         self.disabled_toolsets = disabled_toolsets
@@ -84,10 +103,11 @@ class AIAgent:
                 format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                 datefmt='%H:%M:%S'
             )
-            # Also set OpenAI client logging to debug
-            logging.getLogger('openai').setLevel(logging.DEBUG)
-            logging.getLogger('httpx').setLevel(logging.DEBUG)
-            print("🔍 Verbose logging enabled")
+            # Keep OpenAI and httpx at INFO level to avoid massive base64 logs
+            # Even in verbose mode, we don't want to see full request/response bodies
+            logging.getLogger('openai').setLevel(logging.INFO)
+            logging.getLogger('httpx').setLevel(logging.WARNING)
+            print("🔍 Verbose logging enabled (OpenAI/httpx request bodies suppressed)")
         else:
             # Set logging to INFO level for important messages only
             logging.basicConfig(
@@ -145,6 +165,11 @@ class AIAgent:
         # Show trajectory saving status
         if self.save_trajectories:
             print("📝 Trajectory saving enabled")

+        # Show ephemeral system prompt status
+        if self.ephemeral_system_prompt:
+            prompt_preview = self.ephemeral_system_prompt[:60] + "..." if len(self.ephemeral_system_prompt) > 60 else self.ephemeral_system_prompt
+            print(f"🔒 Ephemeral system prompt: '{prompt_preview}' (not saved to trajectories)")

     def _format_tools_for_system_message(self) -> str:
         """
@@ -168,7 +193,7 @@ class AIAgent:
             }
             formatted_tools.append(formatted_tool)

-        return json.dumps(formatted_tools)
+        return json.dumps(formatted_tools, ensure_ascii=False)

     def _convert_to_trajectory_format(self, messages: List[Dict[str, Any]], user_query: str, completed: bool) -> List[Dict[str, Any]]:
         """
@@ -229,7 +254,7 @@ class AIAgent:
                         "name": tool_call["function"]["name"],
                         "arguments": json.loads(tool_call["function"]["arguments"]) if isinstance(tool_call["function"]["arguments"], str) else tool_call["function"]["arguments"]
                     }
-                    content += f"<tool_call>\n{json.dumps(tool_call_json)}\n</tool_call>\n"
+                    content += f"<tool_call>\n{json.dumps(tool_call_json, ensure_ascii=False)}\n</tool_call>\n"

                 trajectory.append({
                     "from": "gpt",
@@ -256,7 +281,7 @@ class AIAgent:
                         "tool_call_id": tool_msg.get("tool_call_id", ""),
                         "name": msg["tool_calls"][len(tool_responses)]["function"]["name"] if len(tool_responses) < len(msg["tool_calls"]) else "unknown",
                         "content": tool_content
-                    })
+                    }, ensure_ascii=False)
                     tool_response += "\n</tool_response>"
                     tool_responses.append(tool_response)
                     j += 1
@@ -321,22 +346,27 @@ class AIAgent:
             print(f"⚠️ Failed to save trajectory: {e}")

     def run_conversation(
         self,
         user_message: str,
         system_message: str = None,
-        conversation_history: List[Dict[str, Any]] = None
+        conversation_history: List[Dict[str, Any]] = None,
+        task_id: str = None
     ) -> Dict[str, Any]:
         """
         Run a complete conversation with tool calling until completion.

         Args:
             user_message (str): The user's message/question
-            system_message (str): Custom system message (optional)
+            system_message (str): Custom system message (optional, overrides ephemeral_system_prompt if provided)
             conversation_history (List[Dict]): Previous conversation messages (optional)
+            task_id (str): Unique identifier for this task to isolate VMs between concurrent tasks (optional, auto-generated if not provided)

         Returns:
             Dict: Complete conversation result with final response and message history
         """
+        # Generate unique task_id if not provided to isolate VMs between concurrent tasks
+        import uuid
+        effective_task_id = task_id or str(uuid.uuid4())

         # Initialize conversation
         messages = conversation_history or []
@@ -348,13 +378,17 @@ class AIAgent:
         print(f"💬 Starting conversation: '{user_message[:60]}{'...' if len(user_message) > 60 else ''}'")

+        # Determine which system prompt to use for API calls (ephemeral)
+        # Priority: explicit system_message > ephemeral_system_prompt > None
+        active_system_prompt = system_message if system_message is not None else self.ephemeral_system_prompt

         # Main conversation loop
         api_call_count = 0
         final_response = None

         while api_call_count < self.max_iterations:
             api_call_count += 1
-            print(f"\n🔄 Making API call #{api_call_count}...")
+            print(f"\n🔄 Making OpenAI-compatible API call #{api_call_count}...")

             # Log request details if verbose
             if self.verbose_logging:
@@ -363,33 +397,40 @@ class AIAgent:
             api_start_time = time.time()
             retry_count = 0
-            max_retries = 3
+            max_retries = 6  # Increased to allow longer backoff periods

             while retry_count <= max_retries:
                 try:
+                    # Prepare messages for API call
+                    # If we have an ephemeral system prompt, prepend it to the messages
+                    api_messages = messages.copy()
+                    if active_system_prompt:
+                        # Insert system message at the beginning
+                        api_messages = [{"role": "system", "content": active_system_prompt}] + api_messages
+
                     # Make API call with tools
                     response = self.client.chat.completions.create(
                         model=self.model,
-                        messages=messages,
+                        messages=api_messages,
                         tools=self.tools if self.tools else None,
-                        timeout=60.0  # Add explicit timeout
+                        timeout=300.0  # 5 minute timeout for long-running agent tasks
                     )
                     api_duration = time.time() - api_start_time
-                    print(f"⏱️ API call completed in {api_duration:.2f}s")
+                    print(f"⏱️ OpenAI-compatible API call completed in {api_duration:.2f}s")

                     if self.verbose_logging:
                         logging.debug(f"API Response received - Usage: {response.usage if hasattr(response, 'usage') else 'N/A'}")

                     break  # Success, exit retry loop

                 except Exception as api_error:
                     retry_count += 1
                     if retry_count > max_retries:
                         raise api_error
-                    wait_time = min(2 ** retry_count, 10)  # Exponential backoff, max 10s
-                    print(f"⚠️ API call failed (attempt {retry_count}/{max_retries}): {str(api_error)[:100]}")
+                    wait_time = min(2 ** retry_count, 60)  # Exponential backoff: 2s, 4s, 8s, 16s, 32s, 60s
+                    print(f"⚠️ OpenAI-compatible API call failed (attempt {retry_count}/{max_retries}): {str(api_error)[:100]}")
                     print(f"⏳ Retrying in {wait_time}s...")
                     logging.warning(f"API retry {retry_count}/{max_retries} after error: {api_error}")
                     time.sleep(wait_time)
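The capped exponential backoff in the retry loop above is easy to check in isolation; a sketch of the wait schedule it produces (the helper name is illustrative, not from the repo):

```python
def backoff_schedule(max_retries: int = 6, cap: int = 60) -> list[int]:
    """Wait time before each retry attempt: doubles from 2s, capped at `cap`."""
    return [min(2 ** attempt, cap) for attempt in range(1, max_retries + 1)]

print(backoff_schedule())  # prints [2, 4, 8, 16, 32, 60]
```

With six retries the total worst-case wait is about two minutes, which pairs sensibly with the 300-second request timeout.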
@@ -436,28 +477,33 @@ class AIAgent:
                         print(f"❌ Invalid JSON in tool call arguments: {e}")
                         function_args = {}

-                    print(f"  📞 Tool {i}: {function_name}({list(function_args.keys())})")
+                    # Preview tool call arguments
+                    args_str = json.dumps(function_args, ensure_ascii=False)
+                    args_preview = args_str[:self.log_prefix_chars] + "..." if len(args_str) > self.log_prefix_chars else args_str
+                    print(f"  📞 Tool {i}: {function_name}({list(function_args.keys())}) - {args_preview}")

                     tool_start_time = time.time()

-                    # Execute the tool
-                    function_result = handle_function_call(function_name, function_args)
+                    # Execute the tool with task_id to isolate VMs between concurrent tasks
+                    function_result = handle_function_call(function_name, function_args, effective_task_id)

                     tool_duration = time.time() - tool_start_time
                     result_preview = function_result[:200] if len(function_result) > 200 else function_result

                     if self.verbose_logging:
                         logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s")
                         logging.debug(f"Tool result preview: {result_preview}...")

                     # Add tool result to conversation
                     messages.append({
                         "role": "tool",
                         "content": function_result,
                         "tool_call_id": tool_call.id
                     })

-                    print(f"  ✅ Tool {i} completed in {tool_duration:.2f}s")
+                    # Preview tool response
+                    response_preview = function_result[:self.log_prefix_chars] + "..." if len(function_result) > self.log_prefix_chars else function_result
+                    print(f"  ✅ Tool {i} completed in {tool_duration:.2f}s - {response_preview}")

                     # Delay between tool calls
                     if self.tool_delay > 0 and i < len(assistant_message.tool_calls):
@@ -476,11 +522,11 @@ class AIAgent:
                         "content": final_response
                     })

-                    print(f"🎉 Conversation completed after {api_call_count} API call(s)")
+                    print(f"🎉 Conversation completed after {api_call_count} OpenAI-compatible API call(s)")
                     break

             except Exception as e:
-                error_msg = f"Error during API call #{api_call_count}: {str(e)}"
+                error_msg = f"Error during OpenAI-compatible API call #{api_call_count}: {str(e)}"
                 print(f"{error_msg}")

                 if self.verbose_logging:
@@ -505,10 +551,17 @@ class AIAgent:
# Determine if conversation completed successfully
completed = final_response is not None and api_call_count < self.max_iterations
# Save trajectory if enabled
self._save_trajectory(messages, user_message, completed)
# Clean up VM for this task after conversation completes
try:
cleanup_vm(effective_task_id)
except Exception as e:
if self.verbose_logging:
logging.warning(f"Failed to cleanup VM for task {effective_task_id}: {e}")
return {
"final_response": final_response,
"messages": messages,
@@ -532,7 +585,7 @@ class AIAgent:
def main(
query: str = None,
model: str = "claude-opus-4-20250514",
api_key: str = None,
base_url: str = "https://api.anthropic.com/v1/",
max_turns: int = 10,
@@ -540,25 +593,27 @@ def main(
disabled_toolsets: str = None,
list_tools: bool = False,
save_trajectories: bool = False,
verbose: bool = False,
log_prefix_chars: int = 20
):
"""
Main function for running the agent directly.
Args:
query (str): Natural language query for the agent. Defaults to Python 3.13 example.
model (str): Model name to use. Defaults to claude-opus-4-20250514.
api_key (str): API key for authentication. Uses ANTHROPIC_API_KEY env var if not provided.
base_url (str): Base URL for the model API. Defaults to https://api.anthropic.com/v1/
max_turns (int): Maximum number of API call iterations. Defaults to 10.
enabled_toolsets (str): Comma-separated list of toolsets to enable. Supports predefined
toolsets (e.g., "research", "development", "safe").
Multiple toolsets can be combined: "web,vision"
disabled_toolsets (str): Comma-separated list of toolsets to disable (e.g., "terminal")
list_tools (bool): Just list available tools and exit
save_trajectories (bool): Save conversation trajectories to JSONL files. Defaults to False.
verbose (bool): Enable verbose logging for debugging. Defaults to False.
log_prefix_chars (int): Number of characters to show in log previews for tool calls/responses. Defaults to 20.
Toolset Examples:
- "research": Web search, extract, crawl + vision tools
"""
@@ -675,7 +730,8 @@ def main(
enabled_toolsets=enabled_toolsets_list,
disabled_toolsets=disabled_toolsets_list,
save_trajectories=save_trajectories,
verbose_logging=verbose,
log_prefix_chars=log_prefix_chars
)
except RuntimeError as e:
print(f"❌ Failed to initialize agent: {e}")

run_datagen_images.sh Normal file
@@ -0,0 +1,12 @@
python batch_runner.py \
--dataset_file="hermes-agent-imagen-data/hermes_agent_imagen_eval.jsonl" \
--batch_size=10 \
--run_name="imagen_eval_gpt5" \
--distribution="image_gen" \
--model="gpt-5" \
--base_url="https://api.openai.com/v1" \
--api_key="${OPENAI_API_KEY}" \
--num_workers=4 \
--max_turns=5 \
--verbose \
    --ephemeral_system_prompt="When generating an image for the user, view the image using the vision_analyze tool to ensure it is what the user wanted. If it isn't, feel free to retry a few times. If none are perfect, choose the best option that is the closest match, and explain its imperfections. If the image generation tool fails, try again a few times. If the vision analyze tool fails, provide the image to the user and explain it is your best effort attempt."

run_datagen_megascience.sh Executable file
@@ -0,0 +1,12 @@
python batch_runner.py \
--dataset_file="hermes-agent-megascience-data/hermes_agent_megascience_eval.jsonl" \
--batch_size=10 \
--run_name="megascience_eval_gpt5_2" \
--distribution="science" \
--model="gpt-5" \
--base_url="https://api.openai.com/v1" \
--api_key="${OPENAI_API_KEY}" \
--num_workers=5 \
--max_turns=30 \
--verbose \
    --ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build on the results of prior tools you've used. Always use a tool if it can provide additional context, verify formulas, double-check concepts against recent studies and understanding, do all calculations, etc. You should not be confident in your own reasoning, knowledge, or calculations without using a tool to verify or validate your work."


@@ -0,0 +1,12 @@
python batch_runner.py \
--dataset_file="hermes-agent-megascience-data/hermes_agent_megascience_eval.jsonl" \
--batch_size=10 \
--run_name="megascience_eval_glm4-6-fixedterminal-2" \
--distribution="science" \
--model="z-ai/glm-4.6" \
--base_url="https://openrouter.ai/api/v1" \
--api_key="${OPENROUTER_API_KEY}" \
--num_workers=5 \
--max_turns=30 \
--verbose \
    --ephemeral_system_prompt="You have access to a variety of tools to help you solve scientific, math, and technology problems presented to you. You can use them in sequence and build on the results of prior tools you've used. Always use a tool if it can provide additional context, verify formulas, double-check concepts against recent studies and understanding, do all calculations, etc. You should only be confident in your own reasoning, knowledge, or calculations if you've exhaustively used all tools available to you that can help verify or validate your work. Always pip install any packages you need to use the python scripts you want to run."


@@ -1,234 +0,0 @@
#!/usr/bin/env python3
"""
Terminal Tool Module
This module provides a single terminal tool using Hecate's VM infrastructure.
It wraps Hecate's functionality to provide a simple interface for executing commands
on Morph VMs with automatic lifecycle management.
Available tool:
- terminal_tool: Execute commands with optional interactive session support
Usage:
from terminal_tool import terminal_tool
# Execute a single command
result = terminal_tool("ls -la")
# Execute in an interactive session
result = terminal_tool("python", input_keys="print('hello')\\nexit()\\n")
"""
import json
import os
from typing import Optional, Dict, Any
from hecate import run_tool_with_lifecycle_management
from morphcloud._llm import ToolCall
# Detailed description for the terminal tool based on Hermes Terminal system prompt
TERMINAL_TOOL_DESCRIPTION = """Execute commands on a secure, persistent Linux VM environment with full interactive application support.
**Environment:**
- Minimal Debian-based OS with internet access
- Automatic VM lifecycle management (creates on-demand, reuses, cleans up)
- **Full state persistence across tool calls**: current directory (pwd), environment variables, activated virtual environments (conda/venv), running processes, and command history all persist between consecutive tool calls
- Session state managed automatically via tmux
**Command Execution:**
- Simple commands: Just provide the 'command' parameter
- Background processes: Set 'background': True for servers/long-running tasks
- Interactive applications automatically detected and handled
**Interactive Applications (TUIs/Pagers/Prompts):**
When commands enter interactive mode (vim, nano, less, git prompts, package managers, etc.), you'll receive screen content with "frozen" status. This is NORMAL - the session is still active and waiting for input.
**To interact with frozen sessions:**
1. Use 'input_keys' parameter with keystrokes to send
2. System auto-detects and uses the active session
3. Session stays active until application exits
**Special Key Syntax for input_keys:**
- `<ESC>`: Escape key
- `<ENTER>`: Enter/Return
- `<CTRL+C>`, `<CTRL+D>`, `<CTRL+Z>`: Control combinations
- `<UP>`, `<DOWN>`, `<LEFT>`, `<RIGHT>`: Arrow keys
- `<TAB>`, `<BACKSPACE>`: Tab and Backspace
- `<F1>` through `<F12>`: Function keys
- `<SHIFT+TAB>`: Shift+Tab
- Uppercase letters for Shift+letter (e.g., 'V' for Shift+V)
- Symbols for Shift+number (e.g., '!' for Shift+1, ':' for Shift+;)
**Examples:**
- Start vim: `{"command": "vim file.txt"}`
- Type in vim: `{"input_keys": "iHello World<ESC>"}`
- Save and quit: `{"input_keys": ":wq<ENTER>"}`
- Navigate in less: `{"input_keys": "j"}`
- Quit less: `{"input_keys": "q"}`
**Best Practices:**
- Run servers/long processes in background with separate tool calls
- Chain multiple foreground commands in single call if needed
- Monitor disk usage for large tasks, clean up to free space
- Test components incrementally with mock inputs
- Install whatever tools needed - full system access provided"""
def terminal_tool(
command: Optional[str] = None,
input_keys: Optional[str] = None,
session_id: Optional[str] = None,
background: bool = False,
idle_threshold: float = 5.0,
timeout: Optional[int] = None
) -> str:
"""
Execute a command on a Morph VM with optional interactive session support.
This tool uses Hecate's VM lifecycle management to automatically create
and manage VMs. VMs are reused within the configured lifetime window
and automatically cleaned up after inactivity.
Args:
command: The command to execute (optional if continuing existing session)
input_keys: Keystrokes to send to interactive session (e.g., "hello\\n")
session_id: ID of existing session to continue (optional)
background: Whether to run the command in the background (default: False)
idle_threshold: Seconds to wait for output before considering session idle (default: 5.0)
timeout: Command timeout in seconds (optional)
Returns:
str: JSON string containing command output, session info, exit code, and any errors
Examples:
# Execute a simple command
>>> result = terminal_tool(command="ls -la /tmp")
# Start an interactive Python session
>>> result = terminal_tool(command="python3")
>>> session_data = json.loads(result)
>>> session_id = session_data["session_id"]
# Send input to the session
>>> result = terminal_tool(input_keys="print('Hello')\\n", session_id=session_id)
# Run a background task
>>> result = terminal_tool(command="sleep 60", background=True)
"""
try:
# Build tool input based on provided parameters
tool_input = {}
if command:
tool_input["command"] = command
if input_keys:
tool_input["input_keys"] = input_keys
if session_id:
tool_input["session_id"] = session_id
if background:
tool_input["background"] = background
if idle_threshold != 5.0:
tool_input["idle_threshold"] = idle_threshold
if timeout is not None:
tool_input["timeout"] = timeout
tool_call = ToolCall(
name="run_command",
input=tool_input
)
# Execute with lifecycle management
result = run_tool_with_lifecycle_management(tool_call)
# Format the result with all possible fields
# Map hecate's "stdout" to "output" for compatibility
formatted_result = {
"output": result.get("stdout", result.get("output", "")),
"screen": result.get("screen", ""),
"session_id": result.get("session_id"),
"exit_code": result.get("returncode", result.get("exit_code", -1)),
"error": result.get("error"),
"status": "active" if result.get("session_id") else "ended"
}
return json.dumps(formatted_result)
except Exception as e:
return json.dumps({
"output": "",
"screen": "",
"session_id": None,
"exit_code": -1,
"error": f"Failed to execute terminal command: {str(e)}",
"status": "error"
})
def check_hecate_requirements() -> bool:
"""
Check if all requirements for terminal tools are met.
Returns:
bool: True if all requirements are met, False otherwise
"""
# Check for required environment variables
required_vars = ["MORPH_API_KEY"]
optional_vars = ["OPENAI_API_KEY"] # Needed for Hecate's LLM features
missing_required = [var for var in required_vars if not os.getenv(var)]
missing_optional = [var for var in optional_vars if not os.getenv(var)]
if missing_required:
print(f"Missing required environment variables: {', '.join(missing_required)}")
return False
if missing_optional:
print(f"Warning: Missing optional environment variables: {', '.join(missing_optional)}")
print(" (Some Hecate features may be limited)")
# Check if Hecate is importable
try:
import hecate
return True
except ImportError:
print("Hecate is not installed. Please install it with: pip install hecate")
return False
# Module-level initialization check
_requirements_met = check_hecate_requirements()
if __name__ == "__main__":
"""
Simple test/demo when run directly
"""
print("Terminal Tool Module")
print("=" * 40)
if not _requirements_met:
print("Requirements not met. Please check the messages above.")
exit(1)
print("All requirements met!")
print("\nAvailable Tool:")
print(" - terminal_tool: Execute commands with optional interactive session support")
print("\nUsage Examples:")
print(" # Execute a command")
print(" result = terminal_tool(command='ls -la')")
print(" ")
print(" # Start an interactive session")
print(" result = terminal_tool(command='python3')")
print(" session_data = json.loads(result)")
print(" session_id = session_data['session_id']")
print(" ")
print(" # Send input to the session")
print(" result = terminal_tool(")
print(" input_keys='print(\"Hello\")\\\\n',")
print(" session_id=session_id")
print(" )")
print(" ")
print(" # Run a background task")
print(" result = terminal_tool(command='sleep 60', background=True)")
print("\nEnvironment Variables:")
print(f" MORPH_API_KEY: {'Set' if os.getenv('MORPH_API_KEY') else 'Not set'}")
print(f" OPENAI_API_KEY: {'Set' if os.getenv('OPENAI_API_KEY') else 'Not set (optional)'}")
print(f" HECATE_VM_LIFETIME_SECONDS: {os.getenv('HECATE_VM_LIFETIME_SECONDS', '300')} (default: 300)")
print(f" HECATE_DEFAULT_SNAPSHOT_ID: {os.getenv('HECATE_DEFAULT_SNAPSHOT_ID', 'snapshot_p5294qxt')} (default: snapshot_p5294qxt)")

test_run.sh Normal file → Executable file
@@ -17,15 +17,7 @@ export WEB_TOOLS_DEBUG=true
python run_agent.py \
--query "$PROMPT" \
--max_turns 30 \
--model claude-sonnet-4-5-20250929 \
--base_url https://api.anthropic.com/v1/ \
--api_key $ANTHROPIC_API_KEY \
--save_trajectories
--enabled_toolsets=web
# --model claude-sonnet-4-20250514 \
#
#Possible Toolsets:
#web_tools
#vision_tools
#terminal_tools

tests/__init__.py Normal file
tests/test_batch_runner.py Normal file
@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""
Test script for batch runner
This script tests the batch runner with a small sample dataset
to verify functionality before running large batches.
"""
import json
import shutil
from pathlib import Path
def create_test_dataset():
"""Create a small test dataset."""
test_file = Path("tests/test_dataset.jsonl")
test_file.parent.mkdir(exist_ok=True)
prompts = [
{"prompt": "What is 2 + 2?"},
{"prompt": "What is the capital of France?"},
{"prompt": "Explain what Python is in one sentence."},
]
with open(test_file, 'w') as f:
for prompt in prompts:
f.write(json.dumps(prompt, ensure_ascii=False) + "\n")
print(f"✅ Created test dataset: {test_file}")
return test_file
def cleanup_test_run(run_name):
"""Clean up test run output."""
output_dir = Path("data") / run_name
if output_dir.exists():
shutil.rmtree(output_dir)
print(f"🗑️ Cleaned up test output: {output_dir}")
def verify_output(run_name):
"""Verify that output files were created correctly."""
output_dir = Path("data") / run_name
# Check directory exists
if not output_dir.exists():
print(f"❌ Output directory not found: {output_dir}")
return False
# Check for checkpoint
checkpoint_file = output_dir / "checkpoint.json"
if not checkpoint_file.exists():
print(f"❌ Checkpoint file not found: {checkpoint_file}")
return False
# Check for statistics
stats_file = output_dir / "statistics.json"
if not stats_file.exists():
print(f"❌ Statistics file not found: {stats_file}")
return False
# Check for batch files
batch_files = list(output_dir.glob("batch_*.jsonl"))
if not batch_files:
print(f"❌ No batch files found in: {output_dir}")
return False
print(f"✅ Output verification passed:")
print(f" - Checkpoint: {checkpoint_file}")
print(f" - Statistics: {stats_file}")
print(f" - Batch files: {len(batch_files)}")
# Load and display statistics
with open(stats_file) as f:
stats = json.load(f)
print(f"\n📊 Statistics Summary:")
print(f" - Total prompts: {stats['total_prompts']}")
print(f" - Total batches: {stats['total_batches']}")
print(f" - Duration: {stats['duration_seconds']}s")
if stats.get('tool_statistics'):
print(f" - Tool calls:")
for tool, tool_stats in stats['tool_statistics'].items():
print(f"{tool}: {tool_stats['count']} calls, {tool_stats['success_rate']:.1f}% success")
return True
def main():
"""Run the test."""
print("🧪 Batch Runner Test")
print("=" * 60)
run_name = "test_run"
# Clean up any previous test run
cleanup_test_run(run_name)
# Create test dataset
test_file = create_test_dataset()
print(f"\n📝 To run the test manually:")
print(f" python batch_runner.py \\")
print(f" --dataset_file={test_file} \\")
print(f" --batch_size=2 \\")
print(f" --run_name={run_name} \\")
print(f" --distribution=minimal \\")
print(f" --num_workers=2")
print(f"\n💡 Or test with different distributions:")
print(f" python batch_runner.py --list_distributions")
print(f"\n🔍 After running, you can verify output with:")
print(f" python tests/test_batch_runner.py --verify")
# Note: We don't actually run the batch runner here to avoid API calls during testing
# Users should run it manually with their API keys configured
if __name__ == "__main__":
import sys
if "--verify" in sys.argv:
run_name = "test_run"
verify_output(run_name)
else:
main()

tests/test_checkpoint_resumption.py

@@ -0,0 +1,424 @@
#!/usr/bin/env python3
"""
Test script to verify checkpoint behavior in batch_runner.py
This script simulates batch processing with intentional failures to test:
1. Whether checkpoints are saved incrementally during processing
2. Whether resume functionality works correctly after interruption
3. Whether data integrity is maintained across checkpoint cycles
Usage:
# Test current implementation
python tests/test_checkpoint_resumption.py --test_current
# Test after fix is applied
python tests/test_checkpoint_resumption.py --test_fixed
# Run full comparison
python tests/test_checkpoint_resumption.py --compare
"""
import json
import os
import shutil
import sys
import time
import signal
from pathlib import Path
from typing import List, Dict, Any
import traceback
# Add parent directory to path to import batch_runner
sys.path.insert(0, str(Path(__file__).parent.parent))
def create_test_dataset(num_prompts: int = 20) -> Path:
"""Create a small test dataset for checkpoint testing."""
test_data_dir = Path("tests/test_data")
test_data_dir.mkdir(parents=True, exist_ok=True)
dataset_file = test_data_dir / "checkpoint_test_dataset.jsonl"
with open(dataset_file, 'w', encoding='utf-8') as f:
for i in range(num_prompts):
entry = {
"prompt": f"Test prompt {i}: What is 2+2? Just answer briefly.",
"test_id": i
}
f.write(json.dumps(entry, ensure_ascii=False) + "\n")
print(f"✅ Created test dataset: {dataset_file} ({num_prompts} prompts)")
return dataset_file
def monitor_checkpoint_during_run(checkpoint_file: Path, duration: int = 30) -> List[Dict[str, Any]]:
"""
Monitor checkpoint file during a batch run to see when it gets updated.
Args:
checkpoint_file: Path to checkpoint file to monitor
duration: How long to monitor (seconds)
Returns:
List of checkpoint snapshots with timestamps
"""
snapshots = []
start_time = time.time()
last_mtime = None
print(f"\n🔍 Monitoring checkpoint file: {checkpoint_file}")
print(f" Duration: {duration}s")
print("-" * 70)
while time.time() - start_time < duration:
if checkpoint_file.exists():
current_mtime = checkpoint_file.stat().st_mtime
# Check if file was modified
if last_mtime is None or current_mtime != last_mtime:
elapsed = time.time() - start_time
try:
with open(checkpoint_file, 'r') as f:
checkpoint_data = json.load(f)
snapshot = {
"elapsed_seconds": round(elapsed, 2),
"completed_count": len(checkpoint_data.get("completed_prompts", [])),
"completed_prompts": checkpoint_data.get("completed_prompts", [])[:5], # First 5 for display
"timestamp": checkpoint_data.get("last_updated")
}
snapshots.append(snapshot)
print(f"[{elapsed:6.2f}s] Checkpoint updated: {snapshot['completed_count']} prompts completed")
except Exception as e:
print(f"[{elapsed:6.2f}s] Error reading checkpoint: {e}")
last_mtime = current_mtime
else:
if len(snapshots) == 0:
print(f"[{time.time() - start_time:6.2f}s] Checkpoint file not yet created...")
time.sleep(0.5) # Check every 0.5 seconds
return snapshots
def test_current_implementation():
"""Test the current checkpoint implementation."""
print("\n" + "=" * 70)
print("TEST 1: Current Implementation - Checkpoint Timing")
print("=" * 70)
print("\n📝 Testing whether checkpoints are saved incrementally during run...")
# Setup
dataset_file = create_test_dataset(num_prompts=12)
run_name = "checkpoint_test_current"
output_dir = Path("data") / run_name
# Clean up any existing test data
if output_dir.exists():
shutil.rmtree(output_dir)
# Import here to avoid issues if module changes
from batch_runner import BatchRunner
checkpoint_file = output_dir / "checkpoint.json"
# Start monitoring in a separate process would be ideal, but for simplicity
# we'll just check before and after
print(f"\n▶️ Starting batch run...")
print(f" Dataset: {dataset_file}")
print(f" Batch size: 3 (4 batches total)")
print(f" Workers: 2")
print(f" Expected behavior: If incremental, checkpoint should update during run")
start_time = time.time()
try:
runner = BatchRunner(
dataset_file=str(dataset_file),
batch_size=3,
run_name=run_name,
distribution="default",
max_iterations=3, # Keep it short
model="claude-opus-4-20250514",
num_workers=2,
verbose=False
)
# Run with monitoring
import threading
snapshots = []
def monitor():
nonlocal snapshots
snapshots = monitor_checkpoint_during_run(checkpoint_file, duration=60)
monitor_thread = threading.Thread(target=monitor, daemon=True)
monitor_thread.start()
runner.run(resume=False)
monitor_thread.join(timeout=2)
except Exception as e:
print(f"❌ Error during run: {e}")
traceback.print_exc()
return False
elapsed = time.time() - start_time
# Analyze results
print("\n" + "=" * 70)
print("📊 TEST RESULTS")
print("=" * 70)
print(f"Total run time: {elapsed:.2f}s")
print(f"Checkpoint updates observed: {len(snapshots)}")
if len(snapshots) == 0:
print("\n❌ ISSUE: No checkpoint updates observed during run")
print(" This suggests checkpoints are only saved at the end")
return False
elif len(snapshots) == 1:
print("\n⚠️ WARNING: Only 1 checkpoint update (likely at the end)")
print(" This confirms the bug - no incremental checkpointing")
return False
else:
print(f"\n✅ GOOD: Multiple checkpoint updates ({len(snapshots)}) observed")
print(" Checkpointing appears to be incremental")
# Show timeline
print("\n📈 Checkpoint Timeline:")
for i, snapshot in enumerate(snapshots, 1):
print(f" {i}. [{snapshot['elapsed_seconds']:6.2f}s] "
f"{snapshot['completed_count']} prompts completed")
return True
def test_interruption_and_resume():
"""Test that resume actually works after interruption."""
print("\n" + "=" * 70)
print("TEST 2: Interruption and Resume")
print("=" * 70)
print("\n📝 Testing whether resume works after manual interruption...")
# Setup
dataset_file = create_test_dataset(num_prompts=15)
run_name = "checkpoint_test_resume"
output_dir = Path("data") / run_name
# Clean up any existing test data
if output_dir.exists():
shutil.rmtree(output_dir)
from batch_runner import BatchRunner
checkpoint_file = output_dir / "checkpoint.json"
print(f"\n▶️ Starting first run (will process 5 prompts, then simulate interruption)...")
try:
# Create a modified dataset with only first 5 prompts for initial run
temp_dataset = Path("tests/test_data/checkpoint_test_resume_partial.jsonl")
with open(dataset_file, 'r') as f:
lines = f.readlines()[:5]
with open(temp_dataset, 'w') as f:
f.writelines(lines)
runner = BatchRunner(
dataset_file=str(temp_dataset),
batch_size=2,
run_name=run_name,
distribution="default",
max_iterations=3,
model="claude-opus-4-20250514",
num_workers=1,
verbose=False
)
runner.run(resume=False)
# Check checkpoint after first run
if not checkpoint_file.exists():
print("❌ ERROR: Checkpoint file not created after first run")
return False
with open(checkpoint_file, 'r') as f:
checkpoint_data = json.load(f)
initial_completed = len(checkpoint_data.get("completed_prompts", []))
print(f"✅ First run completed: {initial_completed} prompts saved to checkpoint")
# Now try to resume with full dataset
print(f"\n▶️ Starting resume run with full dataset (15 prompts)...")
runner2 = BatchRunner(
dataset_file=str(dataset_file),
batch_size=2,
run_name=run_name,
distribution="default",
max_iterations=3,
model="claude-opus-4-20250514",
num_workers=1,
verbose=False
)
runner2.run(resume=True)
# Check final checkpoint
with open(checkpoint_file, 'r') as f:
final_checkpoint = json.load(f)
final_completed = len(final_checkpoint.get("completed_prompts", []))
print("\n" + "=" * 70)
print("📊 TEST RESULTS")
print("=" * 70)
print(f"Initial completed: {initial_completed}")
print(f"Final completed: {final_completed}")
print(f"Expected: 15")
if final_completed == 15:
print("\n✅ PASS: Resume successfully completed all prompts")
return True
else:
print(f"\n❌ FAIL: Expected 15 completed, got {final_completed}")
return False
except Exception as e:
print(f"❌ Error during test: {e}")
traceback.print_exc()
return False
def test_simulated_crash():
"""Test behavior when process crashes mid-execution."""
print("\n" + "=" * 70)
print("TEST 3: Simulated Crash During Execution")
print("=" * 70)
print("\n📝 This test would require running in a subprocess and killing it...")
print(" Skipping for safety - manual testing recommended")
return None
def print_test_plan():
"""Print the detailed test and fix plan."""
print("\n" + "=" * 70)
print("CHECKPOINT FIX - DETAILED PLAN")
print("=" * 70)
print("""
📋 PROBLEM SUMMARY
------------------
Current implementation uses pool.map() which blocks until ALL batches complete.
Checkpoint is only saved after all batches finish (line 558-559).
If process crashes during batch processing:
- All progress is lost
- Resume does nothing (no incremental checkpoint was saved)
📋 PROPOSED SOLUTION
--------------------
Replace pool.map() with pool.imap_unordered() to get results as they complete.
Save checkpoint after EACH batch completes using a multiprocessing Lock.
Key changes:
1. Use Manager().Lock() for thread-safe checkpoint writes
2. Replace pool.map() with pool.imap_unordered()
3. Update checkpoint after each batch result
4. Maintain backward compatibility with existing checkpoints
📋 IMPLEMENTATION STEPS
-----------------------
1. Add Manager and Lock initialization before Pool creation
2. Pass shared checkpoint data and lock to workers (via Manager)
3. Replace pool.map() with pool.imap_unordered()
4. In result loop: save checkpoint after each batch
5. Add error handling for checkpoint write failures
📋 RISKS & MITIGATIONS
----------------------
Risk: Checkpoint file corruption if two processes write simultaneously
→ Mitigation: Use multiprocessing.Lock() for exclusive access
Risk: Performance impact from frequent checkpoint writes
→ Mitigation: Checkpoint writes are fast (small JSON), negligible impact
Risk: Breaking existing runs that are already checkpointed
→ Mitigation: Maintain checkpoint format, only change timing
Risk: Bugs in multiprocessing lock/manager code
→ Mitigation: Thorough testing with this test script
📋 TESTING STRATEGY
-------------------
1. Run test_current_implementation() - Confirm bug exists
2. Apply fix to batch_runner.py
3. Run test_current_implementation() again - Should see incremental updates
4. Run test_interruption_and_resume() - Verify resume works
5. Manual test: Start run, kill process mid-batch, resume
📋 ROLLBACK PLAN
----------------
If issues arise:
1. Git revert the changes
2. Original code is working (just missing incremental checkpoint)
3. No data corruption risk - checkpoints are write-only
""")
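The imap_unordered approach described in the plan can be sketched as follows. This is an illustrative stand-in, not the actual batch_runner.py code: `process_batch`, `run_batches`, and the checkpoint layout are simplified hypotheticals.

```python
import json
import multiprocessing as mp
from pathlib import Path


def process_batch(batch):
    # Stand-in for the real per-batch agent work.
    return {"batch_id": batch["id"], "completed": batch["prompts"]}


def run_batches(batches, checkpoint_file: Path, num_workers: int = 2):
    checkpoint_file.parent.mkdir(parents=True, exist_ok=True)
    completed = []
    with mp.Pool(num_workers) as pool:
        # imap_unordered yields each result as soon as its batch finishes,
        # unlike pool.map(), which blocks until every batch is done.
        for result in pool.imap_unordered(process_batch, batches):
            completed.extend(result["completed"])
            # Save the checkpoint after EACH batch so a crash loses at most
            # the in-flight batches. The plan above also holds a
            # multiprocessing Lock around this write; with a single
            # parent-side writer it is not strictly required.
            checkpoint_file.write_text(
                json.dumps({"completed_prompts": completed})
            )
    return completed
```

Because only the parent process consumes `imap_unordered` results and writes the file, the checkpoint stays consistent even if a worker dies mid-batch.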
def main(
test_current: bool = False,
test_resume: bool = False,
test_crash: bool = False,
compare: bool = False,
show_plan: bool = False
):
"""
Run checkpoint behavior tests.
Args:
test_current: Test current implementation checkpoint timing
test_resume: Test interruption and resume functionality
test_crash: Test simulated crash scenario (manual)
compare: Run all tests and compare
show_plan: Show detailed fix plan
"""
if show_plan or (not any([test_current, test_resume, test_crash, compare])):
print_test_plan()
return
results = {}
if test_current or compare:
results['current'] = test_current_implementation()
if test_resume or compare:
results['resume'] = test_interruption_and_resume()
if test_crash or compare:
results['crash'] = test_simulated_crash()
# Summary
if results:
print("\n" + "=" * 70)
print("OVERALL TEST SUMMARY")
print("=" * 70)
for test_name, result in results.items():
if result is None:
status = "⏭️ SKIPPED"
elif result:
status = "✅ PASS"
else:
status = "❌ FAIL"
print(f"{status} - {test_name}")
if __name__ == "__main__":
import fire
fire.Fire(main)

tests/test_nous_api_limits.py Executable file
@@ -0,0 +1,176 @@
#!/usr/bin/env python3
"""
Test script to diagnose Nous API 400 errors with gemini-2.5-flash model.
This tests various content lengths and parameters to identify what causes failures.
"""
import asyncio
import os
from openai import AsyncOpenAI
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Initialize the Nous API client
nous_client = AsyncOpenAI(
api_key=os.getenv("NOUS_API_KEY"),
base_url="https://inference-api.nousresearch.com/v1"
)
MODEL = "gemini-2.5-flash"
async def test_api_call(test_name: str, content_length: int, **kwargs):
"""Test an API call with specific parameters."""
print(f"\n{'='*60}")
print(f"Test: {test_name}")
print(f"Content length: {content_length:,} characters")
print(f"Additional params: {kwargs}")
print(f"{'='*60}")
# Generate test content
content = "A" * content_length
system_prompt = """You are an expert content analyst. Your job is to process web content and create a comprehensive yet concise summary that preserves all important information while dramatically reducing bulk.
Create a well-structured markdown summary that includes:
1. Key excerpts (quotes, code snippets, important facts) in their original format
2. Comprehensive summary of all other important information
3. Proper markdown formatting with headers, bullets, and emphasis
Your goal is to preserve ALL important information while reducing length. Never lose key facts, figures, insights, or actionable information. Make it scannable and well-organized."""
user_prompt = f"""Please process this web content and create a comprehensive markdown summary:
CONTENT TO PROCESS:
{content}
Create a markdown summary that captures all key information in a well-organized, scannable format. Include important quotes and code snippets in their original formatting. Focus on actionable information, specific details, and unique insights."""
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
**kwargs
)
result = response.choices[0].message.content
print(f"✅ SUCCESS")
print(f" Response length: {len(result)} characters")
print(f" Model used: {response.model}")
print(f" Usage: {response.usage}")
return True
except Exception as e:
print(f"❌ FAILED: {str(e)}")
return False
async def main():
"""Run all tests."""
print("Testing Nous API with gemini-2.5-flash model")
print(f"API Key present: {'Yes' if os.getenv('NOUS_API_KEY') else 'No'}")
results = {}
# Test 1: Small content (should always work)
results['small'] = await test_api_call(
"Small content (5,000 chars)",
5000,
temperature=0.1,
max_tokens=4000
)
await asyncio.sleep(1)
# Test 2: Medium content (around what was failing)
results['medium'] = await test_api_call(
"Medium content (20,000 chars)",
20000,
temperature=0.1,
max_tokens=4000
)
await asyncio.sleep(1)
# Test 3: Large content (79,625 chars like the error)
results['large'] = await test_api_call(
"Large content (79,625 chars)",
79625,
temperature=0.1,
max_tokens=4000
)
await asyncio.sleep(1)
# Test 4: Very large content (100k chars)
results['very_large'] = await test_api_call(
"Very large content (100,000 chars)",
100000,
temperature=0.1,
max_tokens=4000
)
await asyncio.sleep(1)
# Test 5: Same as working case but different max_tokens
results['diff_max_tokens'] = await test_api_call(
"Medium content with higher max_tokens",
20000,
temperature=0.1,
max_tokens=8000
)
await asyncio.sleep(1)
# Test 6: No max_tokens specified
results['no_max_tokens'] = await test_api_call(
"Medium content without max_tokens",
20000,
temperature=0.1
)
await asyncio.sleep(1)
# Test 7: With actual web content (mixed characters)
mixed_content = """
This is a test of web content with various characters:
- Unicode: 你好世界 🌍
- Special chars: <>&"'
- Numbers: 123456789
- Markdown: **bold** _italic_ `code`
- URLs: https://example.com
""" * 1000 # Repeat to make it ~79k chars
print(f"\n{'='*60}")
print(f"Test: Mixed content (real-world scenario)")
print(f"Content length: {len(mixed_content):,} characters")
print(f"{'='*60}")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "Summarize this content."},
{"role": "user", "content": mixed_content}
],
temperature=0.1,
max_tokens=4000
)
print(f"✅ SUCCESS")
results['mixed_content'] = True
except Exception as e:
print(f"❌ FAILED: {str(e)}")
results['mixed_content'] = False
# Summary
print(f"\n{'='*60}")
print("SUMMARY OF RESULTS:")
print(f"{'='*60}")
for test, passed in results.items():
status = "✅ PASS" if passed else "❌ FAIL"
print(f"{test:20s}: {status}")
passed = sum(results.values())
total = len(results)
print(f"\nTotal: {passed}/{total} tests passed")
if __name__ == "__main__":
asyncio.run(main())

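If content length really were the trigger, the coarse length sweep above could be tightened into a bisection over lengths; a minimal sketch, where `probe` stands in for a synchronous wrapper around `test_api_call`:

```python
def bisect_failing_length(probe, lo: int, hi: int) -> int:
    """Return the smallest length in (lo, hi] at which probe(length) fails.

    Assumes probe(lo) passes, probe(hi) fails, and the pass/fail
    boundary is monotone in length.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid  # still passing below the boundary
        else:
            hi = mid  # failing at or above the boundary
    return hi
```

Each probe costs one API call, so this locates a threshold in O(log(hi - lo)) calls instead of a fixed grid.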

@@ -0,0 +1,131 @@
#!/usr/bin/env python3
"""
Test to understand the pattern of failures - it's not about content length!
"""
import asyncio
import os
from openai import AsyncOpenAI
from dotenv import load_dotenv
load_dotenv()
nous_client = AsyncOpenAI(
api_key=os.getenv("NOUS_API_KEY"),
base_url="https://inference-api.nousresearch.com/v1"
)
MODEL = "gemini-2.5-flash"
async def quick_test(description: str, content: str, **kwargs):
"""Quick API test."""
print(f"\n{description} ({len(content):,} chars)...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "Summarize this."},
{"role": "user", "content": content}
],
**kwargs
)
print(f"✅ SUCCESS")
return True
except Exception as e:
print(f"❌ FAILED: {str(e)[:80]}")
return False
async def main():
print("Testing different content types and parameters...")
# Theory 1: Repeated characters trigger validation
print("\n" + "="*60)
print("THEORY 1: Repeated characters")
print("="*60)
await quick_test("Repeated 'A's (5k)", "A" * 5000, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("Repeated 'A's (79k)", "A" * 79625, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("Varied text (5k)", "Test content. " * 400, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("Varied text (79k)", "Test content with variety. " * 3000, temperature=0.1, max_tokens=4000)
# Theory 2: max_tokens parameter
print("\n" + "="*60)
print("THEORY 2: max_tokens parameter")
print("="*60)
content = "Test " * 4000 # 20k chars
await quick_test("max_tokens=4000", content, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("max_tokens=8000", content, temperature=0.1, max_tokens=8000)
await asyncio.sleep(0.5)
await quick_test("max_tokens=2000", content, temperature=0.1, max_tokens=2000)
await asyncio.sleep(0.5)
await quick_test("No max_tokens", content, temperature=0.1)
# Theory 3: Temperature parameter
print("\n" + "="*60)
print("THEORY 3: Temperature parameter")
print("="*60)
content = "Test " * 4000
await quick_test("temperature=0.1", content, temperature=0.1, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("temperature=0.0", content, temperature=0.0, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("temperature=0.5", content, temperature=0.5, max_tokens=4000)
await asyncio.sleep(0.5)
await quick_test("No temperature", content, max_tokens=4000)
# Theory 4: System prompt impact
print("\n" + "="*60)
print("THEORY 4: System prompt length")
print("="*60)
short_system = "Summarize this."
long_system = """You are an expert content analyst. Your job is to process web content and create a comprehensive yet concise summary that preserves all important information while dramatically reducing bulk.
Create a well-structured markdown summary that includes:
1. Key excerpts (quotes, code snippets, important facts) in their original format
2. Comprehensive summary of all other important information
3. Proper markdown formatting with headers, bullets, and emphasis
Your goal is to preserve ALL important information while reducing length."""
content = "A" * 5000
print(f"\nShort system prompt...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": short_system},
{"role": "user", "content": content}
],
temperature=0.1,
max_tokens=4000
)
print(f"✅ SUCCESS")
except Exception as e:
print(f"❌ FAILED")
await asyncio.sleep(0.5)
print(f"Long system prompt...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": long_system},
{"role": "user", "content": content}
],
temperature=0.1,
max_tokens=4000
)
print(f"✅ SUCCESS")
except Exception as e:
print(f"❌ FAILED")
if __name__ == "__main__":
asyncio.run(main())

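The four theories above are each probed with a hand-written series of calls; the same sweeps can be expressed as a small parameter grid (a sketch; the axis names mirror the kwargs passed to `quick_test`):

```python
from itertools import product

def param_grid(**axes):
    """Yield one kwargs dict per combination of the given axes."""
    keys = list(axes)
    for values in product(*(axes[k] for k in keys)):
        yield dict(zip(keys, values))

# e.g. every temperature/max_tokens pairing from Theories 2 and 3
cases = list(param_grid(temperature=[0.0, 0.1, 0.5], max_tokens=[2000, 4000, 8000]))
```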

@@ -0,0 +1,109 @@
#!/usr/bin/env python3
"""
Test to confirm: temperature < 0.3 causes failures on Nous API
"""
import asyncio
import os
from openai import AsyncOpenAI
from dotenv import load_dotenv
load_dotenv()
nous_client = AsyncOpenAI(
api_key=os.getenv("NOUS_API_KEY"),
base_url="https://inference-api.nousresearch.com/v1"
)
MODEL = "gemini-2.5-flash"
async def test_temp(temp_value):
"""Test a specific temperature value."""
content = "Test content. " * 1000 # 14k chars
print(f"Testing temperature={temp_value}...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "Summarize this content."},
{"role": "user", "content": content}
],
temperature=temp_value,
max_tokens=4000
)
print(f"✅ SUCCESS")
return True
except Exception as e:
print(f"❌ FAILED")
return False
async def main():
print("Testing temperature threshold for Nous API...")
print("="*60)
temps = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.0]
for temp in temps:
await test_temp(temp)
await asyncio.sleep(0.5)
print("="*60)
print("\nNow testing with ACTUAL web_tools.py content and parameters:")
print("="*60)
# Simulate the actual web_tools.py call
system_prompt = """You are an expert content analyst. Your job is to process web content and create a comprehensive yet concise summary that preserves all important information while dramatically reducing bulk.
Create a well-structured markdown summary that includes:
1. Key excerpts (quotes, code snippets, important facts) in their original format
2. Comprehensive summary of all other important information
3. Proper markdown formatting with headers, bullets, and emphasis
Your goal is to preserve ALL important information while reducing length. Never lose key facts, figures, insights, or actionable information. Make it scannable and well-organized."""
content = "Sample web page content. " * 3000 # ~75k chars like the real failures
user_prompt = f"""Please process this web content and create a comprehensive markdown summary:
CONTENT TO PROCESS:
{content}
Create a markdown summary that captures all key information in a well-organized, scannable format. Include important quotes and code snippets in their original formatting. Focus on actionable information, specific details, and unique insights."""
print(f"\nActual web_tools call (temp=0.1, {len(content):,} chars)...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.1,
max_tokens=4000
)
print(f"✅ SUCCESS")
    except Exception:
print(f"❌ FAILED")
await asyncio.sleep(0.5)
print(f"Same call but with temp=0.3...", end=" ")
try:
response = await nous_client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=0.3,
max_tokens=4000
)
print(f"✅ SUCCESS")
    except Exception:
print(f"❌ FAILED")
if __name__ == "__main__":
asyncio.run(main())

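If the temperature < 0.3 threshold holds up, the cheapest mitigation in the calling code is to clamp the parameter before the request; a hedged sketch (the 0.3 floor is the empirical value this script probes, not a documented API limit):

```python
MIN_SAFE_TEMPERATURE = 0.3  # empirical floor observed against the Nous API

def clamp_temperature(requested: float, floor: float = MIN_SAFE_TEMPERATURE) -> float:
    """Raise sub-threshold temperatures to the floor; leave others unchanged."""
    return max(requested, floor)
```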
File diff suppressed because it is too large

67
tools/__init__.py Normal file

@@ -0,0 +1,67 @@
#!/usr/bin/env python3
"""
Tools Package
This package contains all the specific tool implementations for the Hermes Agent.
Each module provides specialized functionality for different capabilities:
- web_tools: Web search, content extraction, and crawling
- terminal_tool: Command execution on virtual machines
- vision_tools: Image analysis and understanding
- mixture_of_agents_tool: Multi-model collaborative reasoning
- image_generation_tool: Text-to-image generation with upscaling
The tools are imported into model_tools.py which provides a unified interface
for the AI agent to access all capabilities.
"""
# Export all tools for easy importing
from .web_tools import (
web_search_tool,
web_extract_tool,
web_crawl_tool,
check_firecrawl_api_key
)
from .terminal_tool import (
terminal_tool,
check_hecate_requirements,
TERMINAL_TOOL_DESCRIPTION
)
from .vision_tools import (
vision_analyze_tool,
check_vision_requirements
)
from .mixture_of_agents_tool import (
mixture_of_agents_tool,
check_moa_requirements
)
from .image_generation_tool import (
image_generate_tool,
check_image_generation_requirements
)
__all__ = [
# Web tools
'web_search_tool',
'web_extract_tool',
'web_crawl_tool',
'check_firecrawl_api_key',
# Terminal tools
'terminal_tool',
'check_hecate_requirements',
'TERMINAL_TOOL_DESCRIPTION',
# Vision tools
'vision_analyze_tool',
'check_vision_requirements',
# MoA tools
'mixture_of_agents_tool',
'check_moa_requirements',
# Image generation tools
'image_generate_tool',
'check_image_generation_requirements',
]

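Each toolset exports a `check_*` requirements probe alongside its tool; a runner can use those to gate what gets exposed to the model. A self-contained sketch of that pattern (the stub checks stand in for the real `check_firecrawl_api_key`-style functions):

```python
# Stand-ins for the requirement probes exported from tools/__init__.py.
def check_web() -> bool:
    return False  # e.g. FIRECRAWL_API_KEY missing

def check_vision() -> bool:
    return True

TOOLSET_CHECKS = {"web": check_web, "vision": check_vision}

def enabled_toolsets() -> list[str]:
    """Names of toolsets whose requirements are currently satisfied."""
    return [name for name, ok in TOOLSET_CHECKS.items() if ok()]
```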

@@ -319,9 +319,6 @@ async def image_generate_tool(
     if not prompt or not isinstance(prompt, str) or len(prompt.strip()) == 0:
         raise ValueError("Prompt is required and must be a non-empty string")
-    if len(prompt) > 1000:
-        raise ValueError("Prompt must be 1000 characters or less")
     # Check API key availability
     if not os.getenv("FAL_KEY"):
         raise ValueError("FAL_KEY environment variable not set")
@@ -417,7 +414,7 @@ async def image_generate_tool(
         _log_debug_call("image_generate_tool", debug_call_data)
         _save_debug_log()
-        return json.dumps(response_data, indent=2)
+        return json.dumps(response_data, indent=2, ensure_ascii=False)
     except Exception as e:
         generation_time = (datetime.datetime.now() - start_time).total_seconds()
@@ -435,7 +432,7 @@ async def image_generate_tool(
         _log_debug_call("image_generate_tool", debug_call_data)
         _save_debug_log()
-        return json.dumps(response_data, indent=2)
+        return json.dumps(response_data, indent=2, ensure_ascii=False)
 def check_fal_api_key() -> bool:

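The `ensure_ascii=False` change above affects how non-ASCII characters appear in the returned JSON; the difference is easy to see with the stdlib directly:

```python
import json

payload = {"caption": "café 🌍"}
print(json.dumps(payload))                      # non-ASCII escaped as \uXXXX sequences
print(json.dumps(payload, ensure_ascii=False))  # characters kept verbatim: "café 🌍"
```

Both forms round-trip through `json.loads`; the second is simply shorter and human-readable.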
File diff suppressed because it is too large


@@ -0,0 +1,395 @@
#!/usr/bin/env python3
"""
Simple Terminal Tool Module
A simplified terminal tool that executes commands on MorphCloud VMs without tmux.
No session persistence, no interactive app support - just simple command execution.
Features:
- Direct SSH command execution
- Background task support
- VM lifecycle management with TTL
- Automatic cleanup after inactivity
Usage:
from simple_terminal_tool import simple_terminal_tool
# Execute a simple command
result = simple_terminal_tool("ls -la")
# Execute in background
result = simple_terminal_tool("python server.py", background=True)
"""
import json
import os
import time
import threading
import atexit
from typing import Optional, Dict, Any
# Tool description for LLM
SIMPLE_TERMINAL_TOOL_DESCRIPTION = """Execute commands on a secure Linux VM environment.
**Environment:**
- Minimal Debian-based OS with internet access
- Automatic VM lifecycle management (creates on-demand, reuses, cleans up)
- Filesystem is persisted between tool calls but environment variables, venvs, etc are reset.
**Command Execution:**
- Simple commands: Just provide the 'command' parameter
- Background processes: Set 'background': True for servers/long-running tasks
- Command timeout: Optional 'timeout' parameter in seconds
**Examples:**
- Run command: `{"command": "ls -la"}`
- Background task: `{"command": "source path/to/my/venv/bin/activate && python server.py", "background": True}`
- With timeout: `{"command": "long_task.sh", "timeout": 300}`
**Best Practices:**
- Run servers/long processes in background
- Monitor disk usage for large tasks
- Install whatever tools you need with sudo apt-get
- Do not be afraid to run pip with --break-system-packages
**Things to avoid**
- Do NOT use interactive tools such as tmux, vim, nano, python repl - you will get stuck. Even git sometimes becomes interactive if the output is large. If you're not sure pipe to cat.
"""
# Global state for VM lifecycle management
_active_instances: Dict[str, Any] = {}
_last_activity: Dict[str, float] = {}
_instance_lock = threading.Lock()
_cleanup_thread = None
_cleanup_running = False
def _cleanup_inactive_vms(vm_lifetime_seconds: int = 300):
"""Clean up VMs that have been inactive for longer than vm_lifetime_seconds."""
global _active_instances, _last_activity
current_time = time.time()
tasks_to_cleanup = []
with _instance_lock:
for task_id, last_time in list(_last_activity.items()):
if current_time - last_time > vm_lifetime_seconds:
tasks_to_cleanup.append(task_id)
for task_id in tasks_to_cleanup:
try:
if task_id in _active_instances:
instance = _active_instances[task_id]
if hasattr(instance, 'terminate'):
instance.terminate()
elif hasattr(instance, 'stop'):
instance.stop()
elif hasattr(instance, 'delete'):
instance.delete()
del _active_instances[task_id]
print(f"[VM Cleanup] Terminated inactive VM for task: {task_id}")
if task_id in _last_activity:
del _last_activity[task_id]
except Exception as e:
# 404 errors are benign - VM already cleaned up by TTL
error_str = str(e)
if "404" in error_str or "InstanceNotFoundError" in error_str or "not found" in error_str.lower():
print(f"[VM Cleanup] VM for task {task_id} already cleaned up (likely TTL expiration)")
else:
print(f"[VM Cleanup] Error cleaning up VM for task {task_id}: {e}")
def _cleanup_thread_worker():
"""Background thread worker that periodically cleans up inactive VMs."""
global _cleanup_running
while _cleanup_running:
try:
vm_lifetime = int(os.getenv("HECATE_VM_LIFETIME_SECONDS", "300"))
_cleanup_inactive_vms(vm_lifetime)
except Exception as e:
print(f"[VM Cleanup] Error in cleanup thread: {e}")
for _ in range(60):
if not _cleanup_running:
break
time.sleep(1)
def _start_cleanup_thread():
"""Start the background cleanup thread if not already running."""
global _cleanup_thread, _cleanup_running
with _instance_lock:
if _cleanup_thread is None or not _cleanup_thread.is_alive():
_cleanup_running = True
_cleanup_thread = threading.Thread(target=_cleanup_thread_worker, daemon=True)
_cleanup_thread.start()
def _stop_cleanup_thread():
"""Stop the background cleanup thread."""
global _cleanup_running
_cleanup_running = False
if _cleanup_thread is not None:
_cleanup_thread.join(timeout=5)
def cleanup_vm(task_id: str):
"""Manually clean up a specific VM by task_id."""
global _active_instances, _last_activity
with _instance_lock:
try:
if task_id in _active_instances:
instance = _active_instances[task_id]
if hasattr(instance, 'terminate'):
instance.terminate()
elif hasattr(instance, 'stop'):
instance.stop()
elif hasattr(instance, 'delete'):
instance.delete()
del _active_instances[task_id]
print(f"[VM Cleanup] Manually terminated VM for task: {task_id}")
if task_id in _last_activity:
del _last_activity[task_id]
except Exception as e:
# 404 errors are benign - VM already cleaned up by TTL
error_str = str(e)
if "404" in error_str or "InstanceNotFoundError" in error_str or "not found" in error_str.lower():
print(f"[VM Cleanup] VM for task {task_id} already cleaned up (likely TTL expiration)")
else:
print(f"[VM Cleanup] Error manually cleaning up VM for task {task_id}: {e}")
atexit.register(_stop_cleanup_thread)
def _execute_ssh_command(instance, command: str, timeout: Optional[int] = None) -> Dict[str, Any]:
"""
Execute a command via SSH on the VM instance.
Args:
instance: MorphVM instance
command: Command to execute
timeout: Optional timeout in seconds
Returns:
dict with stdout, stderr, returncode
"""
ssh_context_manager = None
try:
# Use the instance's SSH context manager
ssh_context_manager = instance.ssh()
ssh_context = ssh_context_manager.__enter__()
# Execute the command
result = ssh_context.run(command, get_pty=False, timeout=timeout or 120)
# Close the SSH connection
if ssh_context_manager:
try:
ssh_context_manager.__exit__(None, None, None)
            except Exception:
pass
return {
"stdout": result.stdout or "",
"stderr": result.stderr or "",
"returncode": result.returncode
}
except Exception as e:
# Close connection on error
if ssh_context_manager:
try:
ssh_context_manager.__exit__(None, None, None)
            except Exception:
pass
# Check if it's a timeout
error_str = str(e).lower()
if "timeout" in error_str:
return {
"stdout": "",
"stderr": f"Command timed out after {timeout or 120} seconds",
"returncode": 124
}
return {
"stdout": "",
"stderr": f"SSH execution failed: {str(e)}",
"returncode": -1
}
def simple_terminal_tool(
command: str,
background: bool = False,
timeout: Optional[int] = None,
task_id: Optional[str] = None
) -> str:
"""
Execute a command on a MorphCloud VM without session persistence.
Args:
command: The command to execute
background: Whether to run in background (default: False)
timeout: Command timeout in seconds (default: 120)
task_id: Unique identifier for VM isolation (optional)
Returns:
str: JSON string with output, exit_code, and error fields
Examples:
# Execute a simple command
>>> result = simple_terminal_tool(command="ls -la /tmp")
# Run a background task
>>> result = simple_terminal_tool(command="python server.py", background=True)
# With custom timeout
>>> result = simple_terminal_tool(command="long_task.sh", timeout=300)
"""
global _active_instances, _last_activity
try:
# Import required modules
try:
from morphcloud.api import MorphCloudClient
except ImportError as import_error:
return json.dumps({
"output": "",
"exit_code": -1,
"error": f"Terminal tool disabled: {import_error}",
"status": "disabled"
}, ensure_ascii=False)
# Get configuration
vm_ttl_seconds = int(os.getenv("HECATE_VM_TTL_SECONDS", "1200"))
snapshot_id = os.getenv("HECATE_DEFAULT_SNAPSHOT_ID", "snapshot_defv9tjg")
# Check API key
morph_api_key = os.getenv("MORPH_API_KEY")
if not morph_api_key:
return json.dumps({
"output": "",
"exit_code": -1,
"error": "MORPH_API_KEY environment variable not set",
"status": "disabled"
}, ensure_ascii=False)
# Use task_id for VM isolation
effective_task_id = task_id or "default"
# Start cleanup thread
_start_cleanup_thread()
# Get or create VM instance
with _instance_lock:
if effective_task_id not in _active_instances:
morph_client = MorphCloudClient(api_key=morph_api_key)
_active_instances[effective_task_id] = morph_client.instances.start(
snapshot_id=snapshot_id,
ttl_seconds=vm_ttl_seconds,
ttl_action="stop"
)
# Update last activity time
_last_activity[effective_task_id] = time.time()
instance = _active_instances[effective_task_id]
# Wait for instance to be ready
instance.wait_until_ready()
# Prepare command for execution
if background:
# Run in background with nohup and redirect output
exec_command = f"nohup {command} > /tmp/bg_output.log 2>&1 &"
result = _execute_ssh_command(instance, exec_command, timeout=10)
# For background tasks, return immediately with info
if result["returncode"] == 0:
return json.dumps({
"output": "Background task started successfully",
"exit_code": 0,
"error": None
}, ensure_ascii=False)
else:
return json.dumps({
"output": result["stdout"],
"exit_code": result["returncode"],
"error": result["stderr"]
}, ensure_ascii=False)
else:
# Run foreground command
result = _execute_ssh_command(instance, command, timeout=timeout)
# Combine stdout and stderr for output
output = result["stdout"]
if result["stderr"] and result["returncode"] != 0:
output = f"{output}\n{result['stderr']}" if output else result["stderr"]
return json.dumps({
"output": output.strip(),
"exit_code": result["returncode"],
"error": result["stderr"] if result["returncode"] != 0 else None
}, ensure_ascii=False)
except Exception as e:
return json.dumps({
"output": "",
"exit_code": -1,
"error": f"Failed to execute command: {str(e)}",
"status": "error"
}, ensure_ascii=False)
def check_requirements() -> bool:
"""Check if all requirements for the simple terminal tool are met."""
required_vars = ["MORPH_API_KEY"]
missing_required = [var for var in required_vars if not os.getenv(var)]
if missing_required:
print(f"Missing required environment variables: {', '.join(missing_required)}")
return False
try:
from morphcloud.api import MorphCloudClient
return True
except Exception as e:
print(f"MorphCloud not available: {e}")
return False
if __name__ == "__main__":
"""Simple test when run directly."""
print("Simple Terminal Tool Module")
print("=" * 40)
if not check_requirements():
print("Requirements not met. Please check the messages above.")
exit(1)
print("All requirements met!")
print("\nAvailable Tool:")
print(" - simple_terminal_tool: Execute commands without session persistence")
print("\nUsage Examples:")
print(" # Execute a command")
print(" result = simple_terminal_tool(command='ls -la')")
print(" ")
print(" # Run a background task")
print(" result = simple_terminal_tool(command='python server.py', background=True)")
print("\nEnvironment Variables:")
print(f" MORPH_API_KEY: {'Set' if os.getenv('MORPH_API_KEY') else 'Not set'}")
print(f" HECATE_VM_TTL_SECONDS: {os.getenv('HECATE_VM_TTL_SECONDS', '1200')} (default: 1200 / 20 minutes)")
print(f" HECATE_VM_LIFETIME_SECONDS: {os.getenv('HECATE_VM_LIFETIME_SECONDS', '300')} (default: 300 / 5 minutes)")
print(f" HECATE_DEFAULT_SNAPSHOT_ID: {os.getenv('HECATE_DEFAULT_SNAPSHOT_ID', 'snapshot_defv9tjg')}")

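Both cleanup paths in this file treat 404/not-found errors as benign (the VM's TTL already reclaimed it); that classification can be factored into a single predicate, sketched here with the same substring checks the file uses:

```python
def vm_already_gone(exc: Exception) -> bool:
    """True when an error means the VM was already reclaimed (e.g. TTL expiry)."""
    s = str(exc)
    return "404" in s or "InstanceNotFoundError" in s or "not found" in s.lower()
```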
456
tools/terminal_tool.py Normal file

@@ -0,0 +1,456 @@
#!/usr/bin/env python3
"""
Terminal Tool Module
This module provides a single terminal tool using Hecate's VM infrastructure.
It wraps Hecate's functionality to provide a simple interface for executing commands
on Morph VMs with automatic lifecycle management.
VM Lifecycle:
- VMs have a TTL (time to live) set at creation (default: 20 minutes)
- VMs are also cleaned up locally after 5 minutes of inactivity
- Timer resets with each use
Available tool:
- terminal_tool: Execute commands with optional interactive session support
Usage:
from terminal_tool import terminal_tool
# Execute a single command
result = terminal_tool("ls -la")
# Execute in an interactive session
result = terminal_tool("python", input_keys="print('hello')\\nexit()\\n")
"""
import json
import os
import uuid
import threading
import time
import atexit
from typing import Optional, Dict, Any
# Detailed description for the terminal tool based on Hermes Terminal system prompt
TERMINAL_TOOL_DESCRIPTION = """Execute commands on a secure, persistent Linux VM environment with full interactive application support.
**Environment:**
- Minimal Debian-based OS with internet access
- Automatic VM lifecycle management (creates on-demand, reuses, cleans up)
- **Full state persistence across tool calls**: current directory (pwd), environment variables, activated virtual environments (conda/venv), running processes, and command history all persist between consecutive tool calls
- Session state managed automatically via tmux
**Command Execution:**
- Simple commands: Just provide the 'command' parameter
- Background processes: Set 'background': True for servers/long-running tasks
- Interactive applications automatically detected and handled
**Interactive Applications (TUIs/Pagers/Prompts):**
When commands enter interactive mode (vim, nano, less, git prompts, package managers, etc.), you'll receive screen content with "frozen" status. This is NORMAL - the session is still active and waiting for input.
**To interact with frozen sessions:**
1. Use 'input_keys' parameter with keystrokes to send
2. System auto-detects and uses the active session
3. Session stays active until application exits
**Special Key Syntax for input_keys:**
- `<ESC>`: Escape key
- `<ENTER>`: Enter/Return
- `<CTRL+C>`, `<CTRL+D>`, `<CTRL+Z>`: Control combinations
- `<UP>`, `<DOWN>`, `<LEFT>`, `<RIGHT>`: Arrow keys
- `<TAB>`, `<BACKSPACE>`: Tab and Backspace
- `<F1>` through `<F12>`: Function keys
- `<SHIFT+TAB>`: Shift+Tab
- Uppercase letters for Shift+letter (e.g., 'V' for Shift+V)
- Symbols for Shift+number (e.g., '!' for Shift+1, ':' for Shift+;)
**Examples:**
- Start vim: `{"command": "vim file.txt"}`
- Type in vim: `{"input_keys": "iHello World<ESC>"}`
- Save and quit: `{"input_keys": ":wq<ENTER>"}`
- Navigate in less: `{"input_keys": "j"}`
- Quit less: `{"input_keys": "q"}`
**Best Practices:**
- Run servers/long processes in background with separate tool calls
- Chain multiple foreground commands in single call if needed
- Monitor disk usage for large tasks, clean up to free space
- Test components incrementally with mock inputs
- Install whatever tools needed - full system access provided"""
# Global state for VM lifecycle management
# These persist across tool calls to enable session continuity
# Changed to dictionaries keyed by task_id to prevent leakage between concurrent tasks
_active_instances: Dict[str, Any] = {}
_active_contexts: Dict[str, Any] = {}
_last_activity: Dict[str, float] = {} # Track last activity time for each VM
_instance_lock = threading.Lock()
_cleanup_thread = None
_cleanup_running = False
def _cleanup_inactive_vms(vm_lifetime_seconds: int = 300):
"""
Clean up VMs that have been inactive for longer than vm_lifetime_seconds.
This function should be called periodically by a background thread.
Args:
vm_lifetime_seconds: Maximum lifetime in seconds for inactive VMs (default: 300)
"""
global _active_instances, _active_contexts, _last_activity
current_time = time.time()
tasks_to_cleanup = []
with _instance_lock:
# Find all VMs that have been inactive for too long
for task_id, last_time in list(_last_activity.items()):
if current_time - last_time > vm_lifetime_seconds:
tasks_to_cleanup.append(task_id)
# Clean up the inactive VMs
for task_id in tasks_to_cleanup:
try:
if task_id in _active_instances:
instance = _active_instances[task_id]
# Terminate the VM instance
if hasattr(instance, 'terminate'):
instance.terminate()
elif hasattr(instance, 'stop'):
instance.stop()
elif hasattr(instance, 'delete'):
instance.delete()
# Remove from tracking dictionaries
del _active_instances[task_id]
print(f"[VM Cleanup] Terminated inactive VM for task: {task_id}")
if task_id in _active_contexts:
del _active_contexts[task_id]
if task_id in _last_activity:
del _last_activity[task_id]
except Exception as e:
print(f"[VM Cleanup] Error cleaning up VM for task {task_id}: {e}")
def _cleanup_thread_worker():
"""
Background thread worker that periodically cleans up inactive VMs.
Runs every 60 seconds.
"""
global _cleanup_running
while _cleanup_running:
try:
vm_lifetime = int(os.getenv("HECATE_VM_LIFETIME_SECONDS", "300"))
_cleanup_inactive_vms(vm_lifetime)
except Exception as e:
print(f"[VM Cleanup] Error in cleanup thread: {e}")
# Sleep for 60 seconds, but check every second if we should stop
for _ in range(60):
if not _cleanup_running:
break
time.sleep(1)
def _start_cleanup_thread():
"""
Start the background cleanup thread if it's not already running.
"""
global _cleanup_thread, _cleanup_running
with _instance_lock:
if _cleanup_thread is None or not _cleanup_thread.is_alive():
_cleanup_running = True
_cleanup_thread = threading.Thread(target=_cleanup_thread_worker, daemon=True)
_cleanup_thread.start()
def _stop_cleanup_thread():
"""
Stop the background cleanup thread.
"""
global _cleanup_running
_cleanup_running = False
if _cleanup_thread is not None:
_cleanup_thread.join(timeout=5)
def cleanup_vm(task_id: str):
"""
Manually clean up a specific VM by task_id.
This should be called when a task is completed.
Args:
task_id: The task ID of the VM to clean up
"""
global _active_instances, _active_contexts, _last_activity
with _instance_lock:
try:
if task_id in _active_instances:
instance = _active_instances[task_id]
# Terminate the VM instance
if hasattr(instance, 'terminate'):
instance.terminate()
elif hasattr(instance, 'stop'):
instance.stop()
elif hasattr(instance, 'delete'):
instance.delete()
# Remove from tracking dictionaries
del _active_instances[task_id]
print(f"[VM Cleanup] Manually terminated VM for task: {task_id}")
if task_id in _active_contexts:
del _active_contexts[task_id]
if task_id in _last_activity:
del _last_activity[task_id]
except Exception as e:
print(f"[VM Cleanup] Error manually cleaning up VM for task {task_id}: {e}")
# Register cleanup on program exit
atexit.register(_stop_cleanup_thread)
def terminal_tool(
command: Optional[str] = None,
input_keys: Optional[str] = None,
session_id: Optional[str] = None,
background: bool = False,
idle_threshold: float = 5.0,
timeout: Optional[int] = None,
task_id: Optional[str] = None
) -> str:
"""
Execute a command on a Morph VM with optional interactive session support.
This tool uses Hecate's VM lifecycle management to automatically create
and manage VMs. VMs are reused within the configured lifetime window
and automatically cleaned up after inactivity.
Args:
command: The command to execute (optional if continuing existing session)
input_keys: Keystrokes to send to interactive session (e.g., "hello\\n")
session_id: ID of existing session to continue (optional)
background: Whether to run the command in the background (default: False)
idle_threshold: Seconds to wait for output before considering session idle (default: 5.0)
timeout: Command timeout in seconds (optional)
task_id: Unique identifier for this task to isolate VMs between concurrent tasks (optional)
Returns:
str: JSON string containing command output, session info, exit code, and any errors
Examples:
# Execute a simple command
>>> result = terminal_tool(command="ls -la /tmp")
# Start an interactive Python session
>>> result = terminal_tool(command="python3")
>>> session_data = json.loads(result)
>>> session_id = session_data["session_id"]
# Send input to the session
>>> result = terminal_tool(input_keys="print('Hello')\\n", session_id=session_id)
# Run a background task
>>> result = terminal_tool(command="sleep 60", background=True)
"""
global _active_instances, _active_contexts
try:
# Import required modules lazily so this module can be imported
# even when hecate is not installed
try:
from morphcloud._llm import ToolCall
from morphcloud.api import MorphCloudClient
from hecate.cli import run_tool, ExecutionContext
from rich.console import Console
import io
except ImportError as import_error:
return json.dumps({
"output": "",
"screen": "",
"exit_code": -1,
"error": f"Terminal tool is disabled due to import error: {import_error}",
"status": "disabled"
}, ensure_ascii=False)
# Get configuration from environment
vm_lifetime_seconds = int(os.getenv("HECATE_VM_LIFETIME_SECONDS", "300"))
vm_ttl_seconds = int(os.getenv("HECATE_VM_TTL_SECONDS", "1200")) # 20 minutes default
snapshot_id = os.getenv("HECATE_DEFAULT_SNAPSHOT_ID", "snapshot_defv9tjg")
# Check API key
morph_api_key = os.getenv("MORPH_API_KEY")
if not morph_api_key:
return json.dumps({
"output": "",
"screen": "",
"exit_code": -1,
"error": "MORPH_API_KEY environment variable not set",
"status": "disabled"
}, ensure_ascii=False)
# Use task_id to isolate VMs between concurrent tasks
# If no task_id provided, use "default" for backward compatibility
effective_task_id = task_id or "default"
# Start the cleanup thread if not already running
_start_cleanup_thread()
# Get or create VM instance and execution context per task
# This is critical for interactive session support - the context must persist!
with _instance_lock:
if effective_task_id not in _active_instances:
morph_client = MorphCloudClient(api_key=morph_api_key)
_active_instances[effective_task_id] = morph_client.instances.start(
snapshot_id=snapshot_id,
ttl_seconds=vm_ttl_seconds,
ttl_action="stop"
)
# Get or create persistent execution context per task
if effective_task_id not in _active_contexts:
_active_contexts[effective_task_id] = ExecutionContext()
# Update last activity time for this VM (resets the inactivity timer)
_last_activity[effective_task_id] = time.time()
instance = _active_instances[effective_task_id]
ctx = _active_contexts[effective_task_id]
# Build tool input based on provided parameters
tool_input = {}
if command:
tool_input["command"] = command
if input_keys:
tool_input["input_keys"] = input_keys
if session_id:
tool_input["session_id"] = session_id
if background:
tool_input["background"] = background
if idle_threshold != 5.0:
tool_input["idle_threshold"] = idle_threshold
if timeout is not None:
tool_input["timeout"] = timeout
tool_call = ToolCall(
name="run_command",
input=tool_input
)
# Create a console for output (redirect to string buffer to avoid printing)
console_output = io.StringIO()
console = Console(file=console_output, force_terminal=False, legacy_windows=False)
# Generate unique tool block ID
tool_block_id = f"tool_{uuid.uuid4().hex[:8]}"
# Execute the tool with hecate
result = run_tool(
tool_call=tool_call,
instance=instance,
console=console,
tool_block_id=tool_block_id,
ctx=ctx
)
# Format the result with only essential fields for the LLM
# Map hecate's "stdout" to "output" for compatibility
formatted_result = {
"output": result.get("stdout", result.get("output", "")),
"screen": result.get("screen", ""),
"exit_code": result.get("returncode", result.get("exit_code", -1)),
"error": result.get("error")
}
return json.dumps(formatted_result, ensure_ascii=False)
except Exception as e:
return json.dumps({
"output": "",
"screen": "",
"exit_code": -1,
"error": f"Failed to execute terminal command: {str(e)}",
"status": "error"
}, ensure_ascii=False)
def check_hecate_requirements() -> bool:
"""
Check if all requirements for terminal tools are met.
Returns:
bool: True if all requirements are met, False otherwise
"""
# Check for required environment variables
required_vars = ["MORPH_API_KEY"]
optional_vars = ["OPENAI_API_KEY"] # Needed for Hecate's LLM features
missing_required = [var for var in required_vars if not os.getenv(var)]
missing_optional = [var for var in optional_vars if not os.getenv(var)]
if missing_required:
print(f"Missing required environment variables: {', '.join(missing_required)}")
return False
if missing_optional:
print(f"Warning: Missing optional environment variables: {', '.join(missing_optional)}")
print(" (Some Hecate features may be limited)")
# Check if Hecate and required modules are importable
try:
from morphcloud._llm import ToolCall
from morphcloud.api import MorphCloudClient
from hecate.cli import run_tool, ExecutionContext
from rich.console import Console
return True
except Exception as e:
print(f"Hecate not available: {e}")
print("Make sure hecate is installed and MORPH_API_KEY is set.")
return False
# Module-level initialization check
_requirements_met = check_hecate_requirements()
if __name__ == "__main__":
"""
Simple test/demo when run directly
"""
print("Terminal Tool Module")
print("=" * 40)
if not _requirements_met:
print("Requirements not met. Please check the messages above.")
exit(1)
print("All requirements met!")
print("\nAvailable Tool:")
print(" - terminal_tool: Execute commands with optional interactive session support")
print("\nUsage Examples:")
print(" # Execute a command")
print(" result = terminal_tool(command='ls -la')")
print(" ")
print(" # Start an interactive session")
print(" result = terminal_tool(command='python3')")
print(" session_data = json.loads(result)")
print(" session_id = session_data['session_id']")
print(" ")
print(" # Send input to the session")
print(" result = terminal_tool(")
print(" input_keys='print(\"Hello\")\\\\n',")
print(" session_id=session_id")
print(" )")
print(" ")
print(" # Run a background task")
print(" result = terminal_tool(command='sleep 60', background=True)")
print("\nEnvironment Variables:")
print(f" MORPH_API_KEY: {'Set' if os.getenv('MORPH_API_KEY') else 'Not set'}")
print(f" OPENAI_API_KEY: {'Set' if os.getenv('OPENAI_API_KEY') else 'Not set (optional)'}")
print(f" HECATE_VM_TTL_SECONDS: {os.getenv('HECATE_VM_TTL_SECONDS', '1200')} (default: 1200 / 20 minutes)")
print(f" HECATE_VM_LIFETIME_SECONDS: {os.getenv('HECATE_VM_LIFETIME_SECONDS', '300')} (default: 300 / 5 minutes)")
print(f" HECATE_DEFAULT_SNAPSHOT_ID: {os.getenv('HECATE_DEFAULT_SNAPSHOT_ID', 'snapshot_defv9tjg')} (default: snapshot_defv9tjg)")
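The result-normalization step near the end of `terminal_tool` (mapping hecate's `stdout`/`returncode` keys onto the `output`/`exit_code` fields the agent expects) can be shown in isolation. This is a minimal standalone sketch; `normalize_result` is a hypothetical helper for illustration, not a function in the module:

```python
import json

def normalize_result(raw: dict) -> str:
    """Map a hecate-style result dict onto the fields terminal_tool returns."""
    formatted = {
        # Prefer hecate's "stdout"/"returncode", fall back to generic names
        "output": raw.get("stdout", raw.get("output", "")),
        "screen": raw.get("screen", ""),
        "exit_code": raw.get("returncode", raw.get("exit_code", -1)),
        "error": raw.get("error"),
    }
    return json.dumps(formatted, ensure_ascii=False)

# A hecate-style payload using "stdout"/"returncode" keys
print(normalize_result({"stdout": "hello\n", "returncode": 0, "error": None}))
```

The double `get` fallback is what keeps the tool compatible with results that already use the generic `output`/`exit_code` names.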

View File

@@ -1,346 +1,471 @@
#!/usr/bin/env python3
"""
Vision Tools Module
This module provides vision analysis tools that work with image URLs.
Uses Gemini Flash via Nous Research API for intelligent image understanding.
Available tools:
- vision_analyze_tool: Analyze images from URLs with custom prompts
Features:
- Downloads images from URLs and converts to base64 for API compatibility
- Comprehensive image description
- Context-aware analysis based on user queries
- Automatic temporary file cleanup
- Proper error handling and validation
- Debug logging support
Usage:
from vision_tools import vision_analyze_tool
import asyncio
# Analyze an image
result = await vision_analyze_tool(
image_url="https://example.com/image.jpg",
user_prompt="What architectural style is this building?"
)
"""
import json
import os
import asyncio
import uuid
import datetime
import base64
from pathlib import Path
from typing import Dict, Any, Optional
from openai import AsyncOpenAI
import httpx # Use httpx for async HTTP requests
# Initialize Nous Research API client for vision processing
nous_client = AsyncOpenAI(
api_key=os.getenv("NOUS_API_KEY"),
base_url="https://inference-api.nousresearch.com/v1"
)
# Configuration for vision processing
DEFAULT_VISION_MODEL = "gemini-2.5-flash"
# Debug mode configuration
DEBUG_MODE = os.getenv("VISION_TOOLS_DEBUG", "false").lower() == "true"
DEBUG_SESSION_ID = str(uuid.uuid4())
DEBUG_LOG_PATH = Path("./logs")
DEBUG_DATA = {
"session_id": DEBUG_SESSION_ID,
"start_time": datetime.datetime.now().isoformat(),
"debug_enabled": DEBUG_MODE,
"tool_calls": []
} if DEBUG_MODE else None
# Create logs directory if debug mode is enabled
if DEBUG_MODE:
DEBUG_LOG_PATH.mkdir(exist_ok=True)
print(f"🐛 Vision debug mode enabled - Session ID: {DEBUG_SESSION_ID}")
def _log_debug_call(tool_name: str, call_data: Dict[str, Any]) -> None:
"""
Log a debug call entry to the global debug data structure.
Args:
tool_name (str): Name of the tool being called
call_data (Dict[str, Any]): Data about the call including parameters and results
"""
if not DEBUG_MODE or not DEBUG_DATA:
return
call_entry = {
"timestamp": datetime.datetime.now().isoformat(),
"tool_name": tool_name,
**call_data
}
DEBUG_DATA["tool_calls"].append(call_entry)
def _save_debug_log() -> None:
"""
Save the current debug data to a JSON file in the logs directory.
"""
if not DEBUG_MODE or not DEBUG_DATA:
return
try:
debug_filename = f"vision_tools_debug_{DEBUG_SESSION_ID}.json"
debug_filepath = DEBUG_LOG_PATH / debug_filename
# Update end time
DEBUG_DATA["end_time"] = datetime.datetime.now().isoformat()
DEBUG_DATA["total_calls"] = len(DEBUG_DATA["tool_calls"])
with open(debug_filepath, 'w', encoding='utf-8') as f:
json.dump(DEBUG_DATA, f, indent=2, ensure_ascii=False)
print(f"🐛 Vision debug log saved: {debug_filepath}")
except Exception as e:
print(f"❌ Error saving vision debug log: {str(e)}")
def _validate_image_url(url: str) -> bool:
"""
Basic validation of image URL format.
Args:
url (str): The URL to validate
Returns:
bool: True if URL appears to be valid, False otherwise
"""
if not url or not isinstance(url, str):
return False
# Check if it's a valid URL format
if not (url.startswith('http://') or url.startswith('https://')):
return False
# Check for common image extensions (optional, as URLs may not have extensions)
image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.webp', '.svg']
return True # Allow all HTTP/HTTPS URLs for flexibility
async def _download_image(image_url: str, destination: Path) -> Path:
"""
Download an image from a URL to a local destination (async).
Args:
image_url (str): The URL of the image to download
destination (Path): The path where the image should be saved
Returns:
Path: The path to the downloaded image
Raises:
Exception: If download fails or response is invalid
"""
# Create parent directories if they don't exist
destination.parent.mkdir(parents=True, exist_ok=True)
# Download the image with appropriate headers using async httpx
async with httpx.AsyncClient(timeout=30.0) as client:
response = await client.get(
image_url,
headers={"User-Agent": "hermes-agent-vision/1.0"},
)
response.raise_for_status()
# Save the image content
destination.write_bytes(response.content)
return destination
def _determine_mime_type(image_path: Path) -> str:
"""
Determine the MIME type of an image based on its file extension.
Args:
image_path (Path): Path to the image file
Returns:
str: The MIME type (defaults to image/jpeg if unknown)
"""
extension = image_path.suffix.lower()
mime_types = {
'.jpg': 'image/jpeg',
'.jpeg': 'image/jpeg',
'.png': 'image/png',
'.gif': 'image/gif',
'.bmp': 'image/bmp',
'.webp': 'image/webp',
'.svg': 'image/svg+xml'
}
return mime_types.get(extension, 'image/jpeg')
def _image_to_base64_data_url(image_path: Path, mime_type: Optional[str] = None) -> str:
"""
Convert an image file to a base64-encoded data URL.
Args:
image_path (Path): Path to the image file
mime_type (Optional[str]): MIME type of the image (auto-detected if None)
Returns:
str: Base64-encoded data URL (e.g., "data:image/jpeg;base64,...")
"""
# Read the image as bytes
data = image_path.read_bytes()
# Encode to base64
encoded = base64.b64encode(data).decode("ascii")
# Determine MIME type
mime = mime_type or _determine_mime_type(image_path)
# Create data URL
data_url = f"data:{mime};base64,{encoded}"
return data_url
async def vision_analyze_tool(
image_url: str,
user_prompt: str,
model: str = DEFAULT_VISION_MODEL
) -> str:
"""
Analyze an image from a URL using vision AI.
This tool downloads images from URLs, converts them to base64, and processes
them using Gemini Flash via Nous Research API. The image is downloaded to a
temporary location and automatically cleaned up after processing.
The user_prompt parameter is expected to be pre-formatted by the calling
function (typically model_tools.py) to include both full description
requests and specific questions.
Args:
image_url (str): The URL of the image to analyze (must be http:// or https://)
user_prompt (str): The pre-formatted prompt for the vision model
model (str): The vision model to use (default: gemini-2.5-flash)
Returns:
str: JSON string containing the analysis results with the following structure:
{
"success": bool,
"analysis": str (defaults to error message if None)
}
Raises:
Exception: If download fails, analysis fails, or API key is not set
Note:
- Temporary images are stored in ./temp_vision_images/
- Images are automatically deleted after processing
- Supports common image formats (JPEG, PNG, GIF, WebP, etc.)
"""
debug_call_data = {
"parameters": {
"image_url": image_url,
"user_prompt": user_prompt[:200] + "..." if len(user_prompt) > 200 else user_prompt,
"model": model
},
"error": None,
"success": False,
"analysis_length": 0,
"model_used": model,
"image_size_bytes": 0
}
temp_image_path = None
try:
print(f"🔍 Analyzing image from URL: {image_url[:60]}{'...' if len(image_url) > 60 else ''}", flush=True)
print(f"📝 User prompt: {user_prompt[:100]}{'...' if len(user_prompt) > 100 else ''}", flush=True)
# Validate image URL
if not _validate_image_url(image_url):
raise ValueError("Invalid image URL format. Must start with http:// or https://")
# Check API key availability
if not os.getenv("NOUS_API_KEY"):
raise ValueError("NOUS_API_KEY environment variable not set")
# Download the image to a temporary location
print(f"⬇️ Downloading image from URL...", flush=True)
temp_dir = Path("./temp_vision_images")
temp_image_path = temp_dir / f"temp_image_{uuid.uuid4()}.jpg"
await _download_image(image_url, temp_image_path)
# Get image file size for logging
image_size_bytes = temp_image_path.stat().st_size
image_size_kb = image_size_bytes / 1024
print(f"✅ Image downloaded successfully ({image_size_kb:.1f} KB)", flush=True)
# Convert image to base64 data URL
print(f"🔄 Converting image to base64...", flush=True)
image_data_url = _image_to_base64_data_url(temp_image_path)
# Calculate size in KB for better readability
data_size_kb = len(image_data_url) / 1024
print(f"✅ Image converted to base64 ({data_size_kb:.1f} KB)", flush=True)
debug_call_data["image_size_bytes"] = image_size_bytes
# Use the prompt as provided (model_tools.py now handles full description formatting)
comprehensive_prompt = user_prompt
# Prepare the message with base64-encoded image
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": comprehensive_prompt
},
{
"type": "image_url",
"image_url": {
"url": image_data_url
}
}
]
}
]
print(f"🧠 Processing image with {model}...", flush=True)
# Call the vision API
response = await nous_client.chat.completions.create(
model=model,
messages=messages,
temperature=0.1, # Low temperature for consistent analysis
max_tokens=2000 # Generous limit for detailed analysis
)
# Extract the analysis
analysis = response.choices[0].message.content.strip()
analysis_length = len(analysis)
print(f"✅ Image analysis completed ({analysis_length} characters)", flush=True)
# Prepare successful response
result = {
"success": True,
"analysis": analysis or "There was a problem with the request and the image could not be analyzed."
}
debug_call_data["success"] = True
debug_call_data["analysis_length"] = analysis_length
# Log debug information
_log_debug_call("vision_analyze_tool", debug_call_data)
_save_debug_log()
return json.dumps(result, indent=2, ensure_ascii=False)
except Exception as e:
error_msg = f"Error analyzing image: {str(e)}"
print(f"{error_msg}", flush=True)
# Prepare error response
result = {
"success": False,
"analysis": "There was a problem with the request and the image could not be analyzed."
}
debug_call_data["error"] = error_msg
_log_debug_call("vision_analyze_tool", debug_call_data)
_save_debug_log()
return json.dumps(result, indent=2, ensure_ascii=False)
finally:
# Clean up temporary image file
if temp_image_path and temp_image_path.exists():
try:
temp_image_path.unlink()
print(f"🧹 Cleaned up temporary image file", flush=True)
except Exception as cleanup_error:
print(f"⚠️ Warning: Could not delete temporary file: {cleanup_error}", flush=True)
def check_nous_api_key() -> bool:
"""
Check if the Nous Research API key is available in environment variables.
Returns:
bool: True if API key is set, False otherwise
"""
return bool(os.getenv("NOUS_API_KEY"))
def check_vision_requirements() -> bool:
"""
Check if all requirements for vision tools are met.
Returns:
bool: True if requirements are met, False otherwise
"""
return check_nous_api_key()
def get_debug_session_info() -> Dict[str, Any]:
"""
Get information about the current debug session.
Returns:
Dict[str, Any]: Dictionary containing debug session information
"""
if not DEBUG_MODE or not DEBUG_DATA:
return {
"enabled": False,
"session_id": None,
"log_path": None,
"total_calls": 0
}
return {
"enabled": True,
"session_id": DEBUG_SESSION_ID,
"log_path": str(DEBUG_LOG_PATH / f"vision_tools_debug_{DEBUG_SESSION_ID}.json"),
"total_calls": len(DEBUG_DATA["tool_calls"])
}
if __name__ == "__main__":
"""
Simple test/demo when run directly
"""
print("👁️ Vision Tools Module")
print("=" * 40)
# Check if API key is available
api_available = check_nous_api_key()
if not api_available:
print("❌ NOUS_API_KEY environment variable not set")
print("Please set your API key: export NOUS_API_KEY='your-key-here'")
print("Get API key at: https://inference-api.nousresearch.com/")
exit(1)
else:
print("✅ Nous Research API key found")
print("🛠️ Vision tools ready for use!")
print(f"🧠 Using model: {DEFAULT_VISION_MODEL}")
# Show debug mode status
if DEBUG_MODE:
print(f"🐛 Debug mode ENABLED - Session ID: {DEBUG_SESSION_ID}")
print(f" Debug logs will be saved to: ./logs/vision_tools_debug_{DEBUG_SESSION_ID}.json")
else:
print("🐛 Debug mode disabled (set VISION_TOOLS_DEBUG=true to enable)")
print("\nBasic usage:")
print(" from vision_tools import vision_analyze_tool")
print(" import asyncio")
print("")
print(" async def main():")
print(" result = await vision_analyze_tool(")
print(" image_url='https://example.com/image.jpg',")
print(" user_prompt='What do you see in this image?'")
print(" )")
print(" print(result)")
print(" asyncio.run(main())")
print("\nExample prompts:")
print(" - 'What architectural style is this building?'")
print(" - 'Describe the emotions and mood in this image'")
print(" - 'What text can you read in this image?'")
print(" - 'Identify any safety hazards visible'")
print(" - 'What products or brands are shown?'")
print("\nDebug mode:")
print(" # Enable debug logging")
print(" export VISION_TOOLS_DEBUG=true")
print(" # Debug logs capture all vision analysis calls and results")
print(" # Logs saved to: ./logs/vision_tools_debug_UUID.json")
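The data-URL construction in `_image_to_base64_data_url` above (read bytes, base64-encode, prepend the MIME header) can be reproduced as a self-contained sketch. `to_data_url` and the demo file name are illustrative stand-ins, not part of the module:

```python
import base64
from pathlib import Path

def to_data_url(image_path: Path, mime: str = "image/png") -> str:
    """Encode an image file as a data URL the chat API accepts inline."""
    encoded = base64.b64encode(image_path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Tiny demo: a 1-byte "image" just to show the shape of the result
p = Path("demo.bin")
p.write_bytes(b"\x89")
print(to_data_url(p))  # data:image/png;base64,iQ==
p.unlink()
```

Sending the image as a base64 data URL instead of the original URL is what lets the API analyze images behind auth walls or flaky hosts, at the cost of a roughly 4/3 payload-size increase from base64.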

File diff suppressed because it is too large

282
toolset_distributions.py Normal file
View File

@@ -0,0 +1,282 @@
#!/usr/bin/env python3
"""
Toolset Distributions Module
This module defines distributions of toolsets for data generation runs.
Each distribution specifies which toolsets should be used and their probability
of being selected for any given prompt during the batch processing.
A distribution is a dictionary mapping toolset names to their selection probability (%).
Each toolset is sampled independently, so the probabilities do not need to sum to 100.
Usage:
from toolset_distributions import get_distribution, list_distributions
# Get a specific distribution
dist = get_distribution("image_gen")
# List all available distributions
all_dists = list_distributions()
"""
from typing import Dict, List, Optional
import random
from toolsets import validate_toolset
# Distribution definitions
# Each key is a distribution name, and the value is a dict of toolset_name: probability_percentage
DISTRIBUTIONS = {
# Default: All tools available 100% of the time
"default": {
"description": "All available tools, all the time",
"toolsets": {
"web": 100,
"vision": 100,
"image_gen": 100,
"terminal": 100,
"moa": 100
}
},
# Image generation focused distribution
"image_gen": {
"description": "Heavy focus on image generation with vision and web support",
"toolsets": {
"image_gen": 90, # 90% chance of image generation tools
"vision": 90, # 90% chance of vision tools
"web": 55, # 55% chance of web tools
"terminal": 45, # 45% chance of terminal tools
"moa": 10 # 10% chance of reasoning tools
}
},
# Research-focused distribution
"research": {
"description": "Web research with vision analysis and reasoning",
"toolsets": {
"web": 90, # 90% chance of web tools
"vision": 50, # 50% chance of vision tools
"moa": 40, # 40% chance of reasoning tools
"terminal": 10 # 10% chance of terminal tools
}
},
# Scientific problem solving focused distribution
"science": {
"description": "Scientific problem solving with web research, terminal, and vision",
"toolsets": {
"web": 94, # 94% chance of web tools
"vision": 65, # 65% chance of vision tools
"moa": 10, # 10% chance of reasoning tools
"terminal": 94, # 94% chance of terminal tools
"image_gen": 15 # 15% chance of image generation tools
}
}
},
# Development-focused distribution
"development": {
"description": "Terminal and reasoning with occasional web lookup",
"toolsets": {
"terminal": 80, # 80% chance of terminal tools
"moa": 60, # 60% chance of reasoning tools
"web": 30, # 30% chance of web tools
"vision": 10 # 10% chance of vision tools
}
},
# Safe mode (no terminal)
"safe": {
"description": "All tools except terminal for safety",
"toolsets": {
"web": 80,
"vision": 60,
"image_gen": 60,
"moa": 50
}
},
# Balanced distribution
"balanced": {
"description": "Equal probability of all toolsets",
"toolsets": {
"web": 50,
"vision": 50,
"image_gen": 50,
"terminal": 50,
"moa": 50
}
},
# Minimal (web only)
"minimal": {
"description": "Only web tools for basic research",
"toolsets": {
"web": 100
}
},
# Creative (vision + image generation)
"creative": {
"description": "Image generation and vision analysis focus",
"toolsets": {
"image_gen": 90,
"vision": 90,
"web": 30
}
},
# Reasoning heavy
"reasoning": {
"description": "Heavy mixture of agents usage with minimal other tools",
"toolsets": {
"moa": 90,
"web": 30,
"terminal": 20
}
}
}
def get_distribution(name: str) -> Optional[Dict[str, Any]]:
    """
    Get a toolset distribution by name.

    Args:
        name (str): Name of the distribution

    Returns:
        Dict: Distribution definition with description and toolsets
        None: If the distribution is not found
    """
    return DISTRIBUTIONS.get(name)
def list_distributions() -> Dict[str, Dict]:
    """
    List all available distributions.

    Returns:
        Dict: All distribution definitions
    """
    return DISTRIBUTIONS.copy()
def sample_toolsets_from_distribution(distribution_name: str) -> List[str]:
    """
    Sample toolsets based on a distribution's probabilities.

    Each toolset in the distribution has an independent percentage chance of
    being included, so multiple toolsets can be active simultaneously.

    Args:
        distribution_name (str): Name of the distribution to sample from

    Returns:
        List[str]: List of sampled toolset names

    Raises:
        ValueError: If the distribution name is not found
    """
    dist = get_distribution(distribution_name)
    if not dist:
        raise ValueError(f"Unknown distribution: {distribution_name}")

    # Sample each toolset independently based on its probability
    selected_toolsets = []
    for toolset_name, probability in dist["toolsets"].items():
        # Validate that the toolset exists
        if not validate_toolset(toolset_name):
            print(f"⚠️ Warning: Toolset '{toolset_name}' in distribution '{distribution_name}' is not valid")
            continue
        # Roll the dice: include this toolset if the random value falls below its probability
        if random.random() * 100 < probability:
            selected_toolsets.append(toolset_name)

    # If no toolsets were selected (possible with low probabilities),
    # ensure at least one is selected by picking the highest-probability one
    if not selected_toolsets and dist["toolsets"]:
        highest_prob_toolset = max(dist["toolsets"].items(), key=lambda x: x[1])[0]
        if validate_toolset(highest_prob_toolset):
            selected_toolsets.append(highest_prob_toolset)

    return selected_toolsets
def validate_distribution(distribution_name: str) -> bool:
    """
    Check if a distribution name is valid.

    Args:
        distribution_name (str): Distribution name to validate

    Returns:
        bool: True if valid, False otherwise
    """
    return distribution_name in DISTRIBUTIONS
def print_distribution_info(distribution_name: str) -> None:
    """
    Print detailed information about a distribution.

    Args:
        distribution_name (str): Distribution name
    """
    dist = get_distribution(distribution_name)
    if not dist:
        print(f"❌ Unknown distribution: {distribution_name}")
        return

    print(f"\n📊 Distribution: {distribution_name}")
    print(f"   Description: {dist['description']}")
    print(f"   Toolsets:")
    for toolset, prob in sorted(dist["toolsets"].items(), key=lambda x: x[1], reverse=True):
        print(f"      {toolset:15} : {prob:3}% chance")
if __name__ == "__main__":
    # Demo and testing of the distributions system
    print("📊 Toolset Distributions Demo")
    print("=" * 60)

    # List all distributions
    print("\n📋 Available Distributions:")
    print("-" * 40)
    for name, dist in list_distributions().items():
        print(f"\n  {name}:")
        print(f"    {dist['description']}")
        toolset_list = ", ".join([f"{ts}({p}%)" for ts, p in dist["toolsets"].items()])
        print(f"    Toolsets: {toolset_list}")

    # Demo sampling
    print("\n\n🎲 Sampling Examples:")
    print("-" * 40)
    test_distributions = ["image_gen", "research", "balanced", "default"]
    for dist_name in test_distributions:
        print(f"\n{dist_name}:")
        # Sample 5 times to show variability
        for i in range(1, 6):
            sampled = sorted(sample_toolsets_from_distribution(dist_name))
            print(f"  Sample {i}: {sampled}")

    # Show detailed info
    print("\n\n📊 Detailed Distribution Info:")
    print("-" * 40)
    print_distribution_info("image_gen")
    print_distribution_info("research")
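The independent-roll sampling above, including its fallback to the highest-probability toolset, can be reduced to a self-contained sketch. Here `sample` is a hypothetical stand-in for `sample_toolsets_from_distribution` (no validation step) that takes an explicit RNG so results are reproducible in tests:

```python
import random
from typing import Dict, List


def sample(toolsets: Dict[str, int], rng: random.Random) -> List[str]:
    # Each toolset is included independently with probability p/100.
    selected = [name for name, p in toolsets.items() if rng.random() * 100 < p]
    # Fallback: if nothing was rolled, keep the single most likely toolset.
    if not selected and toolsets:
        selected.append(max(toolsets.items(), key=lambda kv: kv[1])[0])
    return selected


print(sample({"web": 100, "vision": 0}, random.Random(0)))  # ['web']
print(sample({"moa": 0, "terminal": 0}, random.Random(0)))  # ['moa'] via fallback
```

A probability of 100 always passes (`rng.random()` is in `[0, 1)`), and 0 never does, which is why the first call deterministically returns only `web` and the second exercises the fallback path.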
@@ -110,6 +110,16 @@ def resolve_toolset(name: str, visited: Set[str] = None) -> List[str]:
     if visited is None:
         visited = set()
+
+    # Special aliases that represent all tools across every toolset.
+    # This ensures future toolsets are automatically included without changes.
+    if name in {"all", "*"}:
+        all_tools: Set[str] = set()
+        for toolset_name in get_toolset_names():
+            # Use a fresh visited set per branch to avoid cross-branch contamination
+            resolved = resolve_toolset(toolset_name, visited.copy())
+            all_tools.update(resolved)
+        return list(all_tools)
     # Check for cycles
     if name in visited:
         print(f"⚠️ Circular dependency detected in toolset '{name}'")
@@ -184,6 +194,9 @@ def validate_toolset(name: str) -> bool:
     Returns:
         bool: True if valid, False otherwise
     """
+    # Accept special alias names for convenience
+    if name in {"all", "*"}:
+        return True
     return name in TOOLSETS
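The alias expansion in the patch above unions every toolset's tools. A minimal standalone sketch of that behavior, using a hypothetical `TOOLSETS` table in place of the real one and a sorted return for determinism (the patched code returns the union unsorted):

```python
from typing import Dict, List, Set

# Hypothetical toolset table, for illustration only.
TOOLSETS: Dict[str, List[str]] = {
    "web": ["web_search", "fetch_url"],
    "vision": ["describe_image"],
}


def resolve_toolset(name: str, visited: Set[str] = None) -> List[str]:
    if visited is None:
        visited = set()
    # "all" / "*" expand to the union of every toolset's tools.
    if name in {"all", "*"}:
        all_tools: Set[str] = set()
        for toolset_name in TOOLSETS:
            # Fresh visited set per branch, as in the patch above
            all_tools.update(resolve_toolset(toolset_name, visited.copy()))
        return sorted(all_tools)
    if name in visited:
        return []  # cycle guard
    visited.add(name)
    return TOOLSETS.get(name, [])


print(resolve_toolset("*"))  # ['describe_image', 'fetch_url', 'web_search']
```

Because the alias branch copies `visited` for each sub-toolset, a tool shared by two toolsets is still collected from both branches rather than being skipped as a false cycle.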