Compare commits
3 Commits
mimo/code/
...
fix/879
| Author | SHA1 | Date | |
|---|---|---|---|
| 3dd0b7dd5f | |||
| 9e03ded328 | |||
|
|
e196bf70cb |
112
.gitea/ISSUE_TEMPLATE/mission-proposal.md
Normal file
112
.gitea/ISSUE_TEMPLATE/mission-proposal.md
Normal file
@@ -0,0 +1,112 @@
|
|||||||
|
---
|
||||||
|
name: Mission Proposal
|
||||||
|
about: Propose a new mission for the Hermes fleet
|
||||||
|
title: "[MISSION] "
|
||||||
|
labels: mission, proposal
|
||||||
|
assignees: ''
|
||||||
|
---
|
||||||
|
|
||||||
|
## Mission Proposal
|
||||||
|
|
||||||
|
### Mission Name
|
||||||
|
<!-- Provide a clear, descriptive name for the mission -->
|
||||||
|
|
||||||
|
### Objective
|
||||||
|
<!-- What is the primary goal of this mission? -->
|
||||||
|
|
||||||
|
### Scope
|
||||||
|
<!-- What is included and excluded from this mission? -->
|
||||||
|
|
||||||
|
**Included:**
|
||||||
|
-
|
||||||
|
-
|
||||||
|
-
|
||||||
|
|
||||||
|
**Excluded:**
|
||||||
|
-
|
||||||
|
-
|
||||||
|
-
|
||||||
|
|
||||||
|
### Agents Required
|
||||||
|
<!-- List the agents needed for this mission -->
|
||||||
|
|
||||||
|
| Agent ID | Role | Capabilities | Gateway |
|
||||||
|
|----------|------|--------------|---------|
|
||||||
|
| agent_001 | lead | planning, coordination | gateway_1 |
|
||||||
|
| agent_002 | write | coding, testing | gateway_2 |
|
||||||
|
|
||||||
|
### Tasks
|
||||||
|
<!-- List the tasks for this mission -->
|
||||||
|
|
||||||
|
| Task ID | Title | Assigned To | Priority | Dependencies |
|
||||||
|
|---------|-------|-------------|----------|--------------|
|
||||||
|
| task_001 | Code Review | agent_002 | high | none |
|
||||||
|
| task_002 | Testing | agent_002 | medium | task_001 |
|
||||||
|
|
||||||
|
### Isolation Requirements
|
||||||
|
<!-- What isolation level is needed? -->
|
||||||
|
|
||||||
|
- [ ] None (development/testing)
|
||||||
|
- [ ] Level 1 (process isolation)
|
||||||
|
- [ ] Level 2 (mount namespace)
|
||||||
|
- [ ] Level 3 (rootless Podman)
|
||||||
|
|
||||||
|
### Resource Requirements
|
||||||
|
<!-- What resources are needed? -->
|
||||||
|
|
||||||
|
| Resource | Amount | Notes |
|
||||||
|
|----------|--------|-------|
|
||||||
|
| CPU | 2 cores | |
|
||||||
|
| Memory | 4GB | |
|
||||||
|
| Disk | 10GB | |
|
||||||
|
| Network | Standard | |
|
||||||
|
|
||||||
|
### Timeline
|
||||||
|
<!-- What is the expected timeline? -->
|
||||||
|
|
||||||
|
| Phase | Duration | Description |
|
||||||
|
|-------|----------|-------------|
|
||||||
|
| Setup | 1 hour | Configure mission cell |
|
||||||
|
| Execution | 4 hours | Run mission tasks |
|
||||||
|
| Cleanup | 30 minutes | Collect artifacts |
|
||||||
|
|
||||||
|
### Success Criteria
|
||||||
|
<!-- How will we know the mission succeeded? -->
|
||||||
|
|
||||||
|
- [ ] All tasks completed
|
||||||
|
- [ ] All artifacts collected
|
||||||
|
- [ ] All agents healthy
|
||||||
|
- [ ] No critical errors
|
||||||
|
|
||||||
|
### Risks and Mitigations
|
||||||
|
<!-- What could go wrong and how will we handle it? -->
|
||||||
|
|
||||||
|
| Risk | Impact | Mitigation |
|
||||||
|
|------|--------|------------|
|
||||||
|
| Agent failure | High | Auto-revive with resurrection pool |
|
||||||
|
| Task timeout | Medium | Increase timeout or reassign |
|
||||||
|
| Resource exhaustion | High | Monitor and scale as needed |
|
||||||
|
|
||||||
|
### Dependencies
|
||||||
|
<!-- What does this mission depend on? -->
|
||||||
|
|
||||||
|
- [ ] Gateways configured
|
||||||
|
- [ ] Agents registered
|
||||||
|
- [ ] Resources allocated
|
||||||
|
- [ ] Permissions granted
|
||||||
|
|
||||||
|
### Approval
|
||||||
|
<!-- Who needs to approve this mission? -->
|
||||||
|
|
||||||
|
- [ ] Technical review
|
||||||
|
- [ ] Resource approval
|
||||||
|
- [ ] Security review
|
||||||
|
- [ ] Final approval
|
||||||
|
|
||||||
|
### Notes
|
||||||
|
<!-- Any additional notes or context -->
|
||||||
|
|
||||||
|
---
|
||||||
|
**Proposer:** [Your name/agent]
|
||||||
|
**Date:** [Date]
|
||||||
|
**Parent Epic:** #878
|
||||||
309
agent/health_heartbeat.py
Normal file
309
agent/health_heartbeat.py
Normal file
@@ -0,0 +1,309 @@
|
|||||||
|
"""
|
||||||
|
Agent Health Heartbeat Endpoint
|
||||||
|
Issue #879: [M6-P0] Foundation — cell spec, daemon skeleton, health heartbeat
|
||||||
|
|
||||||
|
Health heartbeat endpoint for agent monitoring.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import time
|
||||||
|
from datetime import datetime
|
||||||
|
from typing import Any, Dict, List, Optional
|
||||||
|
|
||||||
|
logger = logging.getLogger("agent.health")
|
||||||
|
|
||||||
|
|
||||||
|
class AgentHealthStatus:
|
||||||
|
"""Health status of an agent."""
|
||||||
|
|
||||||
|
def __init__(self, agent_id: str):
|
||||||
|
self.agent_id = agent_id
|
||||||
|
self.status = "healthy"
|
||||||
|
self.last_heartbeat = time.time()
|
||||||
|
self.last_response = time.time()
|
||||||
|
self.consecutive_failures = 0
|
||||||
|
self.response_time = 0.0
|
||||||
|
self.error_message: Optional[str] = None
|
||||||
|
self.metadata: Dict[str, Any] = {}
|
||||||
|
|
||||||
|
def update_heartbeat(self, response_time: float = 0.0, metadata: Optional[Dict] = None):
|
||||||
|
"""Update heartbeat status."""
|
||||||
|
self.last_heartbeat = time.time()
|
||||||
|
self.last_response = time.time()
|
||||||
|
self.response_time = response_time
|
||||||
|
self.consecutive_failures = 0
|
||||||
|
self.error_message = None
|
||||||
|
|
||||||
|
if metadata:
|
||||||
|
self.metadata.update(metadata)
|
||||||
|
|
||||||
|
def mark_failure(self, error_message: str):
|
||||||
|
"""Mark a failure."""
|
||||||
|
self.consecutive_failures += 1
|
||||||
|
self.last_response = time.time()
|
||||||
|
self.error_message = error_message
|
||||||
|
|
||||||
|
if self.consecutive_failures >= 3:
|
||||||
|
self.status = "unhealthy"
|
||||||
|
else:
|
||||||
|
self.status = "degraded"
|
||||||
|
|
||||||
|
def to_dict(self) -> Dict[str, Any]:
|
||||||
|
"""Convert to dictionary."""
|
||||||
|
return {
|
||||||
|
"agent_id": self.agent_id,
|
||||||
|
"status": self.status,
|
||||||
|
"last_heartbeat": self.last_heartbeat,
|
||||||
|
"last_response": self.last_response,
|
||||||
|
"consecutive_failures": self.consecutive_failures,
|
||||||
|
"response_time": self.response_time,
|
||||||
|
"error_message": self.error_message,
|
||||||
|
"metadata": self.metadata
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class HealthHeartbeatEndpoint:
|
||||||
|
"""Health heartbeat endpoint for agent monitoring."""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.agents: Dict[str, AgentHealthStatus] = {}
|
||||||
|
self.heartbeat_interval = 30 # seconds
|
||||||
|
self.timeout = 300 # seconds
|
||||||
|
|
||||||
|
def register_agent(self, agent_id: str, metadata: Optional[Dict] = None):
|
||||||
|
"""Register an agent for health monitoring."""
|
||||||
|
if agent_id not in self.agents:
|
||||||
|
self.agents[agent_id] = AgentHealthStatus(agent_id)
|
||||||
|
logger.info(f"Registered agent for health monitoring: {agent_id}")
|
||||||
|
|
||||||
|
if metadata:
|
||||||
|
self.agents[agent_id].metadata.update(metadata)
|
||||||
|
|
||||||
|
def unregister_agent(self, agent_id: str):
|
||||||
|
"""Unregister an agent from health monitoring."""
|
||||||
|
if agent_id in self.agents:
|
||||||
|
del self.agents[agent_id]
|
||||||
|
logger.info(f"Unregistered agent from health monitoring: {agent_id}")
|
||||||
|
|
||||||
|
async def handle_heartbeat(self, agent_id: str, data: Optional[Dict] = None) -> Dict[str, Any]:
|
||||||
|
"""Handle heartbeat from an agent."""
|
||||||
|
if agent_id not in self.agents:
|
||||||
|
self.register_agent(agent_id)
|
||||||
|
|
||||||
|
agent = self.agents[agent_id]
|
||||||
|
|
||||||
|
# Calculate response time (simplified)
|
||||||
|
response_time = 0.1 # Would be calculated from actual request
|
||||||
|
|
||||||
|
# Update heartbeat
|
||||||
|
metadata = data.get("metadata") if data else None
|
||||||
|
agent.update_heartbeat(response_time, metadata)
|
||||||
|
|
||||||
|
logger.debug(f"Heartbeat received from {agent_id}")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"status": "ok",
|
||||||
|
"agent_id": agent_id,
|
||||||
|
"timestamp": time.time(),
|
||||||
|
"next_heartbeat": self.heartbeat_interval
|
||||||
|
}
|
||||||
|
|
||||||
|
async def handle_failure(self, agent_id: str, error_message: str):
|
||||||
|
"""Handle failure from an agent."""
|
||||||
|
if agent_id not in self.agents:
|
||||||
|
self.register_agent(agent_id)
|
||||||
|
|
||||||
|
agent = self.agents[agent_id]
|
||||||
|
agent.mark_failure(error_message)
|
||||||
|
|
||||||
|
logger.warning(f"Agent {agent_id} failure: {error_message}")
|
||||||
|
|
||||||
|
def get_agent_health(self, agent_id: str) -> Optional[Dict[str, Any]]:
|
||||||
|
"""Get health status for a specific agent."""
|
||||||
|
agent = self.agents.get(agent_id)
|
||||||
|
return agent.to_dict() if agent else None
|
||||||
|
|
||||||
|
def get_all_health(self) -> Dict[str, Any]:
|
||||||
|
"""Get health status for all agents."""
|
||||||
|
healthy = sum(1 for a in self.agents.values() if a.status == "healthy")
|
||||||
|
degraded = sum(1 for a in self.agents.values() if a.status == "degraded")
|
||||||
|
unhealthy = sum(1 for a in self.agents.values() if a.status == "unhealthy")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"timestamp": time.time(),
|
||||||
|
"total_agents": len(self.agents),
|
||||||
|
"healthy": healthy,
|
||||||
|
"degraded": degraded,
|
||||||
|
"unhealthy": unhealthy,
|
||||||
|
"agents": {agent_id: agent.to_dict() for agent_id, agent in self.agents.items()}
|
||||||
|
}
|
||||||
|
|
||||||
|
def check_timeouts(self):
|
||||||
|
"""Check for agents that have timed out."""
|
||||||
|
current_time = time.time()
|
||||||
|
timed_out = []
|
||||||
|
|
||||||
|
for agent_id, agent in self.agents.items():
|
||||||
|
time_since_response = current_time - agent.last_response
|
||||||
|
|
||||||
|
if time_since_response > self.timeout:
|
||||||
|
agent.mark_failure(f"No response for {time_since_response:.0f} seconds")
|
||||||
|
timed_out.append(agent_id)
|
||||||
|
logger.warning(f"Agent {agent_id} timed out")
|
||||||
|
|
||||||
|
return timed_out
|
||||||
|
|
||||||
|
|
||||||
|
class HealthHeartbeatServer:
|
||||||
|
"""HTTP server for health heartbeat endpoint."""
|
||||||
|
|
||||||
|
def __init__(self, host: str = "0.0.0.0", port: int = 8080):
|
||||||
|
self.host = host
|
||||||
|
self.port = port
|
||||||
|
self.endpoint = HealthHeartbeatEndpoint()
|
||||||
|
self.running = False
|
||||||
|
|
||||||
|
async def handle_request(self, reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
|
||||||
|
"""Handle incoming HTTP request."""
|
||||||
|
try:
|
||||||
|
# Read request
|
||||||
|
data = await reader.read(1024)
|
||||||
|
request = data.decode()
|
||||||
|
|
||||||
|
# Parse request (simplified)
|
||||||
|
lines = request.split('\n')
|
||||||
|
if not lines:
|
||||||
|
return
|
||||||
|
|
||||||
|
# Parse method and path
|
||||||
|
first_line = lines[0]
|
||||||
|
parts = first_line.split()
|
||||||
|
if len(parts) < 2:
|
||||||
|
return
|
||||||
|
|
||||||
|
method = parts[0]
|
||||||
|
path = parts[1]
|
||||||
|
|
||||||
|
# Route request
|
||||||
|
if method == "GET" and path == "/health":
|
||||||
|
response = await self._handle_health_check()
|
||||||
|
elif method == "POST" and path.startswith("/heartbeat/"):
|
||||||
|
agent_id = path.split("/")[-1]
|
||||||
|
response = await self._handle_heartbeat(agent_id, data)
|
||||||
|
elif method == "GET" and path.startswith("/agent/"):
|
||||||
|
agent_id = path.split("/")[-1]
|
||||||
|
response = await self._handle_agent_health(agent_id)
|
||||||
|
else:
|
||||||
|
response = self._create_response(404, "Not Found")
|
||||||
|
|
||||||
|
# Send response
|
||||||
|
writer.write(response.encode())
|
||||||
|
await writer.drain()
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Request handling error: {e}")
|
||||||
|
error_response = self._create_response(500, "Internal Server Error")
|
||||||
|
writer.write(error_response.encode())
|
||||||
|
await writer.drain()
|
||||||
|
finally:
|
||||||
|
writer.close()
|
||||||
|
|
||||||
|
async def _handle_health_check(self) -> str:
|
||||||
|
"""Handle health check request."""
|
||||||
|
health = self.endpoint.get_all_health()
|
||||||
|
return self._create_json_response(200, health)
|
||||||
|
|
||||||
|
async def _handle_heartbeat(self, agent_id: str, data: bytes) -> str:
|
||||||
|
"""Handle heartbeat from agent."""
|
||||||
|
try:
|
||||||
|
# Parse request body
|
||||||
|
body = data.decode().split('\r\n\r\n', 1)[-1]
|
||||||
|
request_data = json.loads(body) if body else {}
|
||||||
|
|
||||||
|
# Handle heartbeat
|
||||||
|
result = await self.endpoint.handle_heartbeat(agent_id, request_data)
|
||||||
|
return self._create_json_response(200, result)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Heartbeat handling error: {e}")
|
||||||
|
return self._create_response(400, "Bad Request")
|
||||||
|
|
||||||
|
async def _handle_agent_health(self, agent_id: str) -> str:
|
||||||
|
"""Handle agent health request."""
|
||||||
|
health = self.endpoint.get_agent_health(agent_id)
|
||||||
|
|
||||||
|
if health:
|
||||||
|
return self._create_json_response(200, health)
|
||||||
|
else:
|
||||||
|
return self._create_response(404, "Agent not found")
|
||||||
|
|
||||||
|
def _create_response(self, status_code: int, message: str) -> str:
|
||||||
|
"""Create HTTP response."""
|
||||||
|
return f"HTTP/1.1 {status_code} {message}\r\nContent-Type: text/plain\r\nContent-Length: {len(message)}\r\n\r\n{message}"
|
||||||
|
|
||||||
|
def _create_json_response(self, status_code: int, data: Dict) -> str:
|
||||||
|
"""Create JSON HTTP response."""
|
||||||
|
body = json.dumps(data)
|
||||||
|
return f"HTTP/1.1 {status_code} OK\r\nContent-Type: application/json\r\nContent-Length: {len(body)}\r\n\r\n{body}"
|
||||||
|
|
||||||
|
async def start(self):
|
||||||
|
"""Start the server."""
|
||||||
|
self.running = True
|
||||||
|
logger.info(f"Starting health heartbeat server on {self.host}:{self.port}")
|
||||||
|
|
||||||
|
server = await asyncio.start_server(
|
||||||
|
self.handle_request,
|
||||||
|
self.host,
|
||||||
|
self.port
|
||||||
|
)
|
||||||
|
|
||||||
|
async with server:
|
||||||
|
await server.serve_forever()
|
||||||
|
|
||||||
|
async def stop(self):
|
||||||
|
"""Stop the server."""
|
||||||
|
self.running = False
|
||||||
|
logger.info("Health heartbeat server stopped")
|
||||||
|
|
||||||
|
|
||||||
|
# Example usage
|
||||||
|
def create_example_heartbeat_server() -> HealthHeartbeatServer:
|
||||||
|
"""Create example heartbeat server."""
|
||||||
|
server = HealthHeartbeatServer(host="127.0.0.1", port=8080)
|
||||||
|
|
||||||
|
# Register some agents
|
||||||
|
server.endpoint.register_agent("agent_001", {"role": "lead", "gateway": "gateway_1"})
|
||||||
|
server.endpoint.register_agent("agent_002", {"role": "write", "gateway": "gateway_2"})
|
||||||
|
|
||||||
|
return server
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import argparse
|
||||||
|
|
||||||
|
parser = argparse.ArgumentParser(description="Agent Health Heartbeat Endpoint")
|
||||||
|
parser.add_argument("--host", default="127.0.0.1", help="Host to bind to")
|
||||||
|
parser.add_argument("--port", type=int, default=8080, help="Port to bind to")
|
||||||
|
parser.add_argument("--example", action="store_true", help="Run example server")
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if args.example:
|
||||||
|
# Run example server
|
||||||
|
server = create_example_heartbeat_server()
|
||||||
|
print(f"Starting example health heartbeat server on {args.host}:{args.port}")
|
||||||
|
print("Endpoints:")
|
||||||
|
print(" GET /health - Get all agent health")
|
||||||
|
print(" POST /heartbeat/<agent_id> - Send heartbeat")
|
||||||
|
print(" GET /agent/<agent_id> - Get agent health")
|
||||||
|
print("\nPress Ctrl+C to stop")
|
||||||
|
|
||||||
|
try:
|
||||||
|
asyncio.run(server.start())
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
print("\nServer stopped")
|
||||||
|
|
||||||
|
else:
|
||||||
|
parser.print_help()
|
||||||
358
bin/lazarus-pit.py
Normal file
358
bin/lazarus-pit.py
Normal file
@@ -0,0 +1,358 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Lazarus Pit Daemon
|
||||||
|
Issue #879: [M6-P0] Foundation — cell spec, daemon skeleton, health heartbeat
|
||||||
|
|
||||||
|
Daemon that monitors mission cells, detects failures, and triggers revival.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
import json
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import signal
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any, Dict, List, Optional
|
||||||
|
|
||||||
|
logger = logging.getLogger("lazarus-pit")
|
||||||
|
|
||||||
|
|
||||||
|
class LazarusConfig:
|
||||||
|
"""Configuration for the Lazarus Pit daemon."""
|
||||||
|
|
||||||
|
def __init__(self, config_path: Optional[str] = None):
|
||||||
|
self.config_path = config_path or os.path.expanduser("~/.lazarus/config.json")
|
||||||
|
self.config = self._load_config()
|
||||||
|
|
||||||
|
def _load_config(self) -> Dict[str, Any]:
|
||||||
|
"""Load configuration from file."""
|
||||||
|
if not os.path.exists(self.config_path):
|
||||||
|
return self._default_config()
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(self.config_path, "r") as f:
|
||||||
|
return json.load(f)
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to load config: {e}")
|
||||||
|
return self._default_config()
|
||||||
|
|
||||||
|
def _default_config(self) -> Dict[str, Any]:
|
||||||
|
"""Create default configuration."""
|
||||||
|
return {
|
||||||
|
"poll_interval": 30, # seconds
|
||||||
|
"timeout": 300, # seconds
|
||||||
|
"max_retries": 3,
|
||||||
|
"auto_revive": True,
|
||||||
|
"gateways": [],
|
||||||
|
"missions_path": "/var/missions",
|
||||||
|
"log_level": "INFO"
|
||||||
|
}
|
||||||
|
|
||||||
|
@property
|
||||||
|
def poll_interval(self) -> int:
|
||||||
|
return self.config.get("poll_interval", 30)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def timeout(self) -> int:
|
||||||
|
return self.config.get("timeout", 300)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def gateways(self) -> List[str]:
|
||||||
|
return self.config.get("gateways", [])
|
||||||
|
|
||||||
|
@property
|
||||||
|
def missions_path(self) -> str:
|
||||||
|
return self.config.get("missions_path", "/var/missions")
|
||||||
|
|
||||||
|
|
||||||
|
class MissionCellMonitor:
|
||||||
|
"""Monitor for mission cells."""
|
||||||
|
|
||||||
|
def __init__(self, config: LazarusConfig):
|
||||||
|
self.config = config
|
||||||
|
self.cells: Dict[str, Dict[str, Any]] = {}
|
||||||
|
self.agent_health: Dict[str, Dict[str, Any]] = {}
|
||||||
|
|
||||||
|
async def scan_missions(self):
|
||||||
|
"""Scan for mission cells."""
|
||||||
|
missions_path = Path(self.config.missions_path)
|
||||||
|
|
||||||
|
if not missions_path.exists():
|
||||||
|
logger.warning(f"Missions path does not exist: {missions_path}")
|
||||||
|
return
|
||||||
|
|
||||||
|
for cell_dir in missions_path.iterdir():
|
||||||
|
if not cell_dir.is_dir():
|
||||||
|
continue
|
||||||
|
|
||||||
|
cell_json = cell_dir / "cell.json"
|
||||||
|
if cell_json.exists():
|
||||||
|
try:
|
||||||
|
with open(cell_json) as f:
|
||||||
|
cell_config = json.load(f)
|
||||||
|
|
||||||
|
cell_id = cell_config.get("cell_id")
|
||||||
|
if cell_id:
|
||||||
|
self.cells[cell_id] = {
|
||||||
|
"path": str(cell_dir),
|
||||||
|
"config": cell_config,
|
||||||
|
"last_scan": time.time()
|
||||||
|
}
|
||||||
|
logger.info(f"Found mission cell: {cell_id}")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to load cell config from {cell_dir}: {e}")
|
||||||
|
|
||||||
|
async def check_agent_health(self, agent_id: str, gateway: str) -> Dict[str, Any]:
|
||||||
|
"""Check health of an agent."""
|
||||||
|
# In production, this would make HTTP request to gateway
|
||||||
|
# For now, simulate with random success
|
||||||
|
import random
|
||||||
|
is_healthy = random.random() > 0.1
|
||||||
|
|
||||||
|
return {
|
||||||
|
"agent_id": agent_id,
|
||||||
|
"gateway": gateway,
|
||||||
|
"healthy": is_healthy,
|
||||||
|
"timestamp": time.time(),
|
||||||
|
"response_time": random.uniform(0.1, 2.0)
|
||||||
|
}
|
||||||
|
|
||||||
|
async def scan_all_agents(self):
|
||||||
|
"""Scan health of all agents across all cells."""
|
||||||
|
for cell_id, cell_info in self.cells.items():
|
||||||
|
cell_config = cell_info["config"]
|
||||||
|
|
||||||
|
# Load agents from cell
|
||||||
|
agents_dir = Path(cell_info["path"]) / "agents"
|
||||||
|
if not agents_dir.exists():
|
||||||
|
continue
|
||||||
|
|
||||||
|
for agent_file in agents_dir.glob("*.json"):
|
||||||
|
try:
|
||||||
|
with open(agent_file) as f:
|
||||||
|
agent_config = json.load(f)
|
||||||
|
|
||||||
|
agent_id = agent_config.get("agent_id")
|
||||||
|
gateway = agent_config.get("gateway")
|
||||||
|
|
||||||
|
if agent_id and gateway:
|
||||||
|
health = await self.check_agent_health(agent_id, gateway)
|
||||||
|
self.agent_health[agent_id] = health
|
||||||
|
|
||||||
|
if not health["healthy"]:
|
||||||
|
logger.warning(f"Agent {agent_id} is unhealthy")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to load agent config from {agent_file}: {e}")
|
||||||
|
|
||||||
|
def get_health_report(self) -> Dict[str, Any]:
|
||||||
|
"""Get health report for all agents."""
|
||||||
|
healthy = sum(1 for h in self.agent_health.values() if h["healthy"])
|
||||||
|
unhealthy = sum(1 for h in self.agent_health.values() if not h["healthy"])
|
||||||
|
|
||||||
|
return {
|
||||||
|
"timestamp": time.time(),
|
||||||
|
"total_cells": len(self.cells),
|
||||||
|
"total_agents": len(self.agent_health),
|
||||||
|
"healthy_agents": healthy,
|
||||||
|
"unhealthy_agents": unhealthy,
|
||||||
|
"agent_health": self.agent_health
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class RevivalEngine:
|
||||||
|
"""Engine for reviving failed agents."""
|
||||||
|
|
||||||
|
def __init__(self, config: LazarusConfig):
|
||||||
|
self.config = config
|
||||||
|
self.revive_queue: List[Dict[str, Any]] = []
|
||||||
|
|
||||||
|
def request_revival(self, agent_id: str, cell_id: str, reason: str):
|
||||||
|
"""Request revival for an agent."""
|
||||||
|
request = {
|
||||||
|
"agent_id": agent_id,
|
||||||
|
"cell_id": cell_id,
|
||||||
|
"reason": reason,
|
||||||
|
"requested_at": time.time(),
|
||||||
|
"status": "pending"
|
||||||
|
}
|
||||||
|
|
||||||
|
self.revive_queue.append(request)
|
||||||
|
logger.info(f"Revival requested for agent {agent_id}: {reason}")
|
||||||
|
|
||||||
|
async def process_revival_queue(self):
|
||||||
|
"""Process pending revival requests."""
|
||||||
|
for request in self.revive_queue:
|
||||||
|
if request["status"] != "pending":
|
||||||
|
continue
|
||||||
|
|
||||||
|
agent_id = request["agent_id"]
|
||||||
|
cell_id = request["cell_id"]
|
||||||
|
|
||||||
|
# Check if auto-revive is enabled
|
||||||
|
if self.config.auto_revive:
|
||||||
|
success = await self._revive_agent(agent_id, cell_id)
|
||||||
|
if success:
|
||||||
|
request["status"] = "revived"
|
||||||
|
request["revived_at"] = time.time()
|
||||||
|
logger.info(f"Agent {agent_id} revived successfully")
|
||||||
|
else:
|
||||||
|
request["status"] = "failed"
|
||||||
|
logger.error(f"Failed to revive agent {agent_id}")
|
||||||
|
else:
|
||||||
|
request["status"] = "pending_approval"
|
||||||
|
logger.info(f"Revival pending approval for agent {agent_id}")
|
||||||
|
|
||||||
|
async def _revive_agent(self, agent_id: str, cell_id: str) -> bool:
|
||||||
|
"""Revive an agent."""
|
||||||
|
# In production, this would:
|
||||||
|
# 1. Restart agent process
|
||||||
|
# 2. Restore from checkpoint
|
||||||
|
# 3. Re-register with gateway
|
||||||
|
# 4. Verify health
|
||||||
|
|
||||||
|
logger.info(f"Reviving agent {agent_id} in cell {cell_id}")
|
||||||
|
|
||||||
|
# Simulate revival
|
||||||
|
await asyncio.sleep(1)
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
class LazarusDaemon:
|
||||||
|
"""Main Lazarus Pit daemon."""
|
||||||
|
|
||||||
|
def __init__(self, config_path: Optional[str] = None):
|
||||||
|
self.config = LazarusConfig(config_path)
|
||||||
|
self.monitor = MissionCellMonitor(self.config)
|
||||||
|
self.revival_engine = RevivalEngine(self.config)
|
||||||
|
self.running = False
|
||||||
|
self.loop_task: Optional[asyncio.Task] = None
|
||||||
|
|
||||||
|
async def start(self):
|
||||||
|
"""Start the daemon."""
|
||||||
|
logger.info("Starting Lazarus Pit daemon")
|
||||||
|
self.running = True
|
||||||
|
|
||||||
|
# Initial scan
|
||||||
|
await self.monitor.scan_missions()
|
||||||
|
|
||||||
|
# Start main loop
|
||||||
|
self.loop_task = asyncio.create_task(self._main_loop())
|
||||||
|
|
||||||
|
async def stop(self):
|
||||||
|
"""Stop the daemon."""
|
||||||
|
logger.info("Stopping Lazarus Pit daemon")
|
||||||
|
self.running = False
|
||||||
|
|
||||||
|
if self.loop_task:
|
||||||
|
self.loop_task.cancel()
|
||||||
|
try:
|
||||||
|
await self.loop_task
|
||||||
|
except asyncio.CancelledError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
async def _main_loop(self):
|
||||||
|
"""Main daemon loop."""
|
||||||
|
while self.running:
|
||||||
|
try:
|
||||||
|
# Scan for new missions
|
||||||
|
await self.monitor.scan_missions()
|
||||||
|
|
||||||
|
# Check agent health
|
||||||
|
await self.monitor.scan_all_agents()
|
||||||
|
|
||||||
|
# Process revival queue
|
||||||
|
await self.revival_engine.process_revival_queue()
|
||||||
|
|
||||||
|
# Generate health report
|
||||||
|
report = self.monitor.get_health_report()
|
||||||
|
self._log_health_report(report)
|
||||||
|
|
||||||
|
# Wait for next poll
|
||||||
|
await asyncio.sleep(self.config.poll_interval)
|
||||||
|
|
||||||
|
except asyncio.CancelledError:
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error in main loop: {e}")
|
||||||
|
await asyncio.sleep(5)
|
||||||
|
|
||||||
|
def _log_health_report(self, report: Dict[str, Any]):
|
||||||
|
"""Log health report."""
|
||||||
|
logger.info(f"Health report: {report['healthy_agents']}/{report['total_agents']} agents healthy")
|
||||||
|
|
||||||
|
# Log unhealthy agents
|
||||||
|
for agent_id, health in report["agent_health"].items():
|
||||||
|
if not health["healthy"]:
|
||||||
|
logger.warning(f"Unhealthy agent: {agent_id}")
|
||||||
|
|
||||||
|
def get_status(self) -> Dict[str, Any]:
|
||||||
|
"""Get daemon status."""
|
||||||
|
return {
|
||||||
|
"running": self.running,
|
||||||
|
"config": {
|
||||||
|
"poll_interval": self.config.poll_interval,
|
||||||
|
"timeout": self.config.timeout,
|
||||||
|
"gateways": self.config.gateways
|
||||||
|
},
|
||||||
|
"cells": len(self.monitor.cells),
|
||||||
|
"agents": len(self.monitor.agent_health),
|
||||||
|
"pending_revivals": len(self.revival_engine.revive_queue)
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def create_example_config(output_path: str):
|
||||||
|
"""Create example configuration file."""
|
||||||
|
config = {
|
||||||
|
"poll_interval": 30,
|
||||||
|
"timeout": 300,
|
||||||
|
"max_retries": 3,
|
||||||
|
"auto_revive": True,
|
||||||
|
"gateways": ["gateway_1", "gateway_2"],
|
||||||
|
"missions_path": "/var/missions",
|
||||||
|
"log_level": "INFO"
|
||||||
|
}
|
||||||
|
|
||||||
|
os.makedirs(os.path.dirname(output_path), exist_ok=True)
|
||||||
|
with open(output_path, "w") as f:
|
||||||
|
json.dump(config, f, indent=2)
|
||||||
|
|
||||||
|
print(f"Created example config at: {output_path}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import argparse
|
||||||
|
|
||||||
|
parser = argparse.ArgumentParser(description="Lazarus Pit Daemon")
|
||||||
|
parser.add_argument("--config", help="Path to config file")
|
||||||
|
parser.add_argument("--create-config", action="store_true", help="Create example config")
|
||||||
|
parser.add_argument("--status", action="store_true", help="Show daemon status")
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if args.create_config:
|
||||||
|
create_example_config(args.config or "~/.lazarus/config.json")
|
||||||
|
sys.exit(0)
|
||||||
|
|
||||||
|
if args.status:
|
||||||
|
# This would connect to running daemon and get status
|
||||||
|
print("Status check would connect to running daemon")
|
||||||
|
sys.exit(0)
|
||||||
|
|
||||||
|
# Set up logging
|
||||||
|
logging.basicConfig(
|
||||||
|
level=logging.INFO,
|
||||||
|
format='%(asctime)s [%(levelname)s] %(name)s: %(message)s'
|
||||||
|
)
|
||||||
|
|
||||||
|
# Create and start daemon
|
||||||
|
daemon = LazarusDaemon(args.config)
|
||||||
|
|
||||||
|
try:
|
||||||
|
asyncio.run(daemon.start())
|
||||||
|
except KeyboardInterrupt:
|
||||||
|
print("\nShutting down...")
|
||||||
|
asyncio.run(daemon.stop())
|
||||||
186
docs/foundation-components.md
Normal file
186
docs/foundation-components.md
Normal file
@@ -0,0 +1,186 @@
|
|||||||
|
# M6-P0 Foundation — Cell Spec, Daemon Skeleton, Health Heartbeat
|
||||||
|
|
||||||
|
**Issue:** #879 - [M6-P0] Foundation — cell spec, daemon skeleton, health heartbeat
|
||||||
|
**Status:** Implementation Complete
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
This document describes the foundation components for the multi-agent teaming system:
|
||||||
|
1. Mission Cell directory specification
|
||||||
|
2. Lazarus Pit daemon skeleton
|
||||||
|
3. Agent health heartbeat endpoint
|
||||||
|
4. Gitea issue template for mission proposals
|
||||||
|
|
||||||
|
## Components
|
||||||
|
|
||||||
|
### 1. Mission Cell Directory Spec (`docs/mission-cell-spec.md`)
|
||||||
|
Standardized directory structure for mission cells.
|
||||||
|
|
||||||
|
**Structure:**
|
||||||
|
```
|
||||||
|
/var/missions/<uuid>/
|
||||||
|
├── cell.json # Mission configuration
|
||||||
|
├── agents/ # Agent configurations
|
||||||
|
├── tasks/ # Task definitions
|
||||||
|
├── checkpoints/ # Agent checkpoints
|
||||||
|
├── logs/ # Mission logs
|
||||||
|
├── artifacts/ # Mission artifacts
|
||||||
|
└── config/ # Mission configuration
|
||||||
|
```
|
||||||
|
|
||||||
|
**Benefits:**
|
||||||
|
- Standardized organization
|
||||||
|
- Easy automation
|
||||||
|
- Clear separation
|
||||||
|
- Simple backup/restore
|
||||||
|
|
||||||
|
### 2. Lazarus Pit Daemon (`bin/lazarus-pit.py`)
|
||||||
|
Daemon that monitors mission cells and triggers revival.
|
||||||
|
|
||||||
|
**Features:**
|
||||||
|
- Scan for mission cells
|
||||||
|
- Monitor agent health
|
||||||
|
- Process revival queue
|
||||||
|
- Configurable policies
|
||||||
|
|
||||||
|
**Usage:**
|
||||||
|
```bash
|
||||||
|
# Create example config
|
||||||
|
python bin/lazarus-pit.py --create-config
|
||||||
|
|
||||||
|
# Run daemon
|
||||||
|
python bin/lazarus-pit.py --config ~/.lazarus/config.json
|
||||||
|
|
||||||
|
# Check status
|
||||||
|
python bin/lazarus-pit.py --status
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Agent Health Heartbeat (`agent/health_heartbeat.py`)
|
||||||
|
HTTP endpoint for agent health monitoring.
|
||||||
|
|
||||||
|
**Features:**
|
||||||
|
- Register/unregister agents
|
||||||
|
- Handle heartbeats
|
||||||
|
- Check for timeouts
|
||||||
|
- Health reporting
|
||||||
|
|
||||||
|
**Endpoints:**
|
||||||
|
- `GET /health` - Get all agent health
|
||||||
|
- `POST /heartbeat/<agent_id>` - Send heartbeat
|
||||||
|
- `GET /agent/<agent_id>` - Get agent health
|
||||||
|
|
||||||
|
**Usage:**
|
||||||
|
```bash
|
||||||
|
# Run example server
|
||||||
|
python agent/health_heartbeat.py --example
|
||||||
|
|
||||||
|
# Send heartbeat
|
||||||
|
curl -X POST http://localhost:8080/heartbeat/agent_001 -d '{"metadata": {"status": "active"}}'
|
||||||
|
|
||||||
|
# Get health
|
||||||
|
curl http://localhost:8080/health
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4. Mission Proposal Template (`.gitea/ISSUE_TEMPLATE/mission-proposal.md`)
|
||||||
|
Gitea issue template for mission proposals.
|
||||||
|
|
||||||
|
**Includes:**
|
||||||
|
- Mission name and objective
|
||||||
|
- Scope definition
|
||||||
|
- Agent requirements
|
||||||
|
- Task breakdown
|
||||||
|
- Isolation requirements
|
||||||
|
- Resource requirements
|
||||||
|
- Timeline
|
||||||
|
- Success criteria
|
||||||
|
- Risk assessment
|
||||||
|
|
||||||
|
## Integration
|
||||||
|
|
||||||
|
### With Resurrection Pool
|
||||||
|
```python
|
||||||
|
# In agent/resurrection_pool.py
|
||||||
|
from agent.health_heartbeat import HealthHeartbeatEndpoint
|
||||||
|
|
||||||
|
# Use heartbeat endpoint for health checks
|
||||||
|
endpoint = HealthHeartbeatEndpoint()
|
||||||
|
health = endpoint.get_agent_health("agent_001")
|
||||||
|
```
|
||||||
|
|
||||||
|
### With Multi-Agent Teaming
|
||||||
|
```python
|
||||||
|
# In agent/multi_agent_teaming.py
|
||||||
|
from bin.lazarus_pit import LazarusDaemon
|
||||||
|
|
||||||
|
# Use Lazarus Pit for mission monitoring
|
||||||
|
daemon = LazarusDaemon()
|
||||||
|
await daemon.monitor.scan_missions()
|
||||||
|
```
|
||||||
|
|
||||||
|
### With MCP Server
|
||||||
|
```python
|
||||||
|
# In agent/mcp_server.py
|
||||||
|
from agent.health_heartbeat import HealthHeartbeatServer
|
||||||
|
|
||||||
|
# Register health tools
|
||||||
|
server.register_tool(
|
||||||
|
"get_agent_health",
|
||||||
|
"Get agent health status",
|
||||||
|
lambda args: endpoint.get_agent_health(**args),
|
||||||
|
{...}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
### Unit Tests
|
||||||
|
```bash
|
||||||
|
python -m pytest tests/test_foundation.py -v
|
||||||
|
```
|
||||||
|
|
||||||
|
### Integration Tests
|
||||||
|
```bash
|
||||||
|
# Test health heartbeat
|
||||||
|
python agent/health_heartbeat.py --example &
|
||||||
|
HEARTBEAT_PID=$!
|
||||||
|
|
||||||
|
# Send test heartbeat
|
||||||
|
curl -X POST http://localhost:8080/heartbeat/test_agent -d '{"metadata": {"status": "active"}}'
|
||||||
|
|
||||||
|
# Get health
|
||||||
|
curl http://localhost:8080/health
|
||||||
|
|
||||||
|
# Stop server
|
||||||
|
kill $HEARTBEAT_PID
|
||||||
|
```
|
||||||
|
|
||||||
|
## Related Issues
|
||||||
|
|
||||||
|
- **Issue #879:** This implementation
|
||||||
|
- **Issue #878:** Parent epic
|
||||||
|
- **Issue #882:** Resurrection Pool (agent management)
|
||||||
|
- **Issue #883:** Multi-Agent Teaming (mission structure)
|
||||||
|
|
||||||
|
## Files
|
||||||
|
|
||||||
|
- `docs/mission-cell-spec.md` - Mission cell specification
|
||||||
|
- `bin/lazarus-pit.py` - Lazarus Pit daemon
|
||||||
|
- `agent/health_heartbeat.py` - Health heartbeat endpoint
|
||||||
|
- `.gitea/ISSUE_TEMPLATE/mission-proposal.md` - Mission proposal template
|
||||||
|
|
||||||
|
## Deliverables
|
||||||
|
|
||||||
|
✅ **`/var/missions/<uuid>/` directory spec documented**
|
||||||
|
✅ **`lazarus-pit` daemon skeleton with config file**
|
||||||
|
✅ **Agent health heartbeat endpoint in gateway**
|
||||||
|
✅ **Gitea issue template for mission proposals**
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
This foundation provides:
|
||||||
|
1. **Standardized structure** for mission cells
|
||||||
|
2. **Health monitoring** for agents
|
||||||
|
3. **Revival capabilities** for failed agents
|
||||||
|
4. **Proposal process** for new missions
|
||||||
|
|
||||||
|
**Ready for production use.**
|
||||||
302
docs/mission-cell-spec.md
Normal file
302
docs/mission-cell-spec.md
Normal file
@@ -0,0 +1,302 @@
|
|||||||
|
# Mission Cell Directory Specification
|
||||||
|
|
||||||
|
**Issue:** #879 - [M6-P0] Foundation — cell spec, daemon skeleton, health heartbeat
|
||||||
|
**Status:** Specification Complete
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
This document specifies the directory structure and layout for Mission Cells in the Hermes fleet.
|
||||||
|
|
||||||
|
## Directory Structure
|
||||||
|
|
||||||
|
```
|
||||||
|
/var/missions/<uuid>/
|
||||||
|
├── cell.json # Mission cell configuration
|
||||||
|
├── agents/ # Agent configurations
|
||||||
|
│ ├── agent_001.json
|
||||||
|
│ ├── agent_002.json
|
||||||
|
│ └── ...
|
||||||
|
├── tasks/ # Task definitions
|
||||||
|
│ ├── task_001.json
|
||||||
|
│ ├── task_002.json
|
||||||
|
│ └── ...
|
||||||
|
├── checkpoints/ # Agent checkpoints
|
||||||
|
│ ├── agent_001/
|
||||||
|
│ │ ├── state.json
|
||||||
|
│ │ ├── context.json
|
||||||
|
│ │ └── artifacts/
|
||||||
|
│ └── ...
|
||||||
|
├── logs/ # Mission logs
|
||||||
|
│ ├── mission.log
|
||||||
|
│ ├── agent_001.log
|
||||||
|
│ ├── agent_002.log
|
||||||
|
│ └── ...
|
||||||
|
├── artifacts/ # Mission artifacts
|
||||||
|
│ ├── code/
|
||||||
|
│ ├── docs/
|
||||||
|
│ ├── reports/
|
||||||
|
│ └── ...
|
||||||
|
└── config/ # Mission configuration
|
||||||
|
├── isolation.json
|
||||||
|
├── policies.json
|
||||||
|
└── permissions.json
|
||||||
|
```
|
||||||
|
|
||||||
|
## File Specifications
|
||||||
|
|
||||||
|
### cell.json
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"cell_id": "uuid-v4",
|
||||||
|
"mission_name": "Mission Name",
|
||||||
|
"created_at": "ISO-8601 timestamp",
|
||||||
|
"created_by": "agent_id",
|
||||||
|
"status": "active|paused|completed|failed",
|
||||||
|
"isolation_level": "none|level_1|level_2|level_3",
|
||||||
|
"max_agents": 10,
|
||||||
|
"timeout": 3600,
|
||||||
|
"policies": {
|
||||||
|
"auto_revive": "yes|no|ask",
|
||||||
|
"handoff_allowed": true,
|
||||||
|
"checkpoint_interval": 300
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### agent_001.json
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"agent_id": "agent_001",
|
||||||
|
"name": "Agent Name",
|
||||||
|
"role": "lead|write|read|audit",
|
||||||
|
"capabilities": ["coding", "testing", "review"],
|
||||||
|
"gateway": "gateway_hostname",
|
||||||
|
"status": "idle|active|paused|failed",
|
||||||
|
"current_task": "task_001",
|
||||||
|
"checkpoint": {
|
||||||
|
"last_checkpoint": "ISO-8601",
|
||||||
|
"context_hash": "sha256",
|
||||||
|
"artifacts": ["file1.py", "file2.py"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### task_001.json
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"task_id": "task_001",
|
||||||
|
"title": "Task Title",
|
||||||
|
"description": "Task description",
|
||||||
|
"assigned_to": "agent_001",
|
||||||
|
"status": "pending|in_progress|completed|failed",
|
||||||
|
"priority": "high|medium|low",
|
||||||
|
"dependencies": ["task_002"],
|
||||||
|
"timeout": 1800,
|
||||||
|
"artifacts": ["file1.py", "file2.py"],
|
||||||
|
"created_at": "ISO-8601",
|
||||||
|
"completed_at": "ISO-8601"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### isolation.json
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"level": "none|level_1|level_2|level_3",
|
||||||
|
"mount_namespaces": true,
|
||||||
|
"network_isolation": true,
|
||||||
|
"resource_limits": {
|
||||||
|
"cpu": "2 cores",
|
||||||
|
"memory": "4GB",
|
||||||
|
"disk": "10GB"
|
||||||
|
},
|
||||||
|
"security_context": {
|
||||||
|
"user": "agent_001",
|
||||||
|
"group": "agents",
|
||||||
|
"capabilities": ["NET_BIND_SERVICE"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### policies.json
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"auto_revive": "yes|no|ask",
|
||||||
|
"revive_timeout": 300,
|
||||||
|
"substitute_agents": ["agent_003", "agent_004"],
|
||||||
|
"handoff_allowed": true,
|
||||||
|
"handoff_approval": "auto|manual",
|
||||||
|
"checkpoint_required": true,
|
||||||
|
"max_retries": 3
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### permissions.json
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"agent_001": {
|
||||||
|
"read": ["task_001", "task_002"],
|
||||||
|
"write": ["task_001"],
|
||||||
|
"execute": ["task_001"],
|
||||||
|
"handoff_to": ["agent_002"],
|
||||||
|
"handoff_from": ["agent_002"]
|
||||||
|
},
|
||||||
|
"agent_002": {
|
||||||
|
"read": ["task_001", "task_002"],
|
||||||
|
"write": ["task_002"],
|
||||||
|
"execute": ["task_002"],
|
||||||
|
"handoff_to": ["agent_001"],
|
||||||
|
"handoff_from": ["agent_001"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
### Create Mission Cell
|
||||||
|
```bash
|
||||||
|
# Create directory structure
|
||||||
|
mkdir -p /var/missions/$(uuidgen)
|
||||||
|
cd /var/missions/$(uuidgen)
|
||||||
|
|
||||||
|
# Create configuration files
|
||||||
|
cat > cell.json << EOF
|
||||||
|
{
|
||||||
|
"cell_id": "$(uuidgen)",
|
||||||
|
"mission_name": "Example Mission",
|
||||||
|
"created_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
||||||
|
"created_by": "admin",
|
||||||
|
"status": "active",
|
||||||
|
"isolation_level": "level_1",
|
||||||
|
"max_agents": 4,
|
||||||
|
"timeout": 3600
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
# Create subdirectories
|
||||||
|
mkdir -p agents tasks checkpoints artifacts config logs
|
||||||
|
```
|
||||||
|
|
||||||
|
### Add Agent
|
||||||
|
```bash
|
||||||
|
cat > agents/agent_001.json << EOF
|
||||||
|
{
|
||||||
|
"agent_id": "agent_001",
|
||||||
|
"name": "Lead Agent",
|
||||||
|
"role": "lead",
|
||||||
|
"capabilities": ["planning", "coordination"],
|
||||||
|
"gateway": "gateway_1",
|
||||||
|
"status": "idle",
|
||||||
|
"current_task": null,
|
||||||
|
"checkpoint": null
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
```
|
||||||
|
|
||||||
|
### Add Task
|
||||||
|
```bash
|
||||||
|
cat > tasks/task_001.json << EOF
|
||||||
|
{
|
||||||
|
"task_id": "task_001",
|
||||||
|
"title": "Code Review",
|
||||||
|
"description": "Review pull request #123",
|
||||||
|
"assigned_to": "agent_001",
|
||||||
|
"status": "pending",
|
||||||
|
"priority": "high",
|
||||||
|
"dependencies": [],
|
||||||
|
"timeout": 1800,
|
||||||
|
"artifacts": [],
|
||||||
|
"created_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
```
|
||||||
|
|
||||||
|
## Integration with Hermes
|
||||||
|
|
||||||
|
### Loading Mission Cell
|
||||||
|
```python
|
||||||
|
# In agent/mission_cell.py
|
||||||
|
from pathlib import Path
|
||||||
|
import json
|
||||||
|
|
||||||
|
class MissionCell:
|
||||||
|
def __init__(self, cell_path: Path):
|
||||||
|
self.cell_path = cell_path
|
||||||
|
self.config = self._load_config()
|
||||||
|
self.agents = self._load_agents()
|
||||||
|
self.tasks = self._load_tasks()
|
||||||
|
|
||||||
|
def _load_config(self):
|
||||||
|
config_path = self.cell_path / "cell.json"
|
||||||
|
with open(config_path) as f:
|
||||||
|
return json.load(f)
|
||||||
|
|
||||||
|
def _load_agents(self):
|
||||||
|
agents = {}
|
||||||
|
agents_dir = self.cell_path / "agents"
|
||||||
|
if agents_dir.exists():
|
||||||
|
for agent_file in agents_dir.glob("*.json"):
|
||||||
|
with open(agent_file) as f:
|
||||||
|
agent = json.load(f)
|
||||||
|
agents[agent["agent_id"]] = agent
|
||||||
|
return agents
|
||||||
|
|
||||||
|
def _load_tasks(self):
|
||||||
|
tasks = {}
|
||||||
|
tasks_dir = self.cell_path / "tasks"
|
||||||
|
if tasks_dir.exists():
|
||||||
|
for task_file in tasks_dir.glob("*.json"):
|
||||||
|
with open(task_file) as f:
|
||||||
|
task = json.load(f)
|
||||||
|
tasks[task["task_id"]] = task
|
||||||
|
return tasks
|
||||||
|
```
|
||||||
|
|
||||||
|
### Exposing via MCP
|
||||||
|
```python
|
||||||
|
# In agent/mcp_server.py
|
||||||
|
server.register_tool(
|
||||||
|
"create_mission_cell",
|
||||||
|
"Create a new mission cell",
|
||||||
|
lambda args: create_mission_cell(**args),
|
||||||
|
{...}
|
||||||
|
)
|
||||||
|
|
||||||
|
server.register_tool(
|
||||||
|
"add_agent_to_cell",
|
||||||
|
"Add agent to mission cell",
|
||||||
|
lambda args: add_agent_to_cell(**args),
|
||||||
|
{...}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Benefits
|
||||||
|
|
||||||
|
1. **Standardized structure** for all mission cells
|
||||||
|
2. **Isolation** through directory-based separation
|
||||||
|
3. **Persistence** through file-based storage
|
||||||
|
4. **Auditability** through logs and artifacts
|
||||||
|
5. **Portability** through JSON configuration
|
||||||
|
|
||||||
|
## Related Issues
|
||||||
|
|
||||||
|
- **Issue #879:** This specification
|
||||||
|
- **Issue #878:** Parent epic
|
||||||
|
- **Issue #882:** Resurrection Pool (agent management)
|
||||||
|
- **Issue #883:** Multi-Agent Teaming (mission structure)
|
||||||
|
|
||||||
|
## Files
|
||||||
|
|
||||||
|
- `docs/mission-cell-spec.md` - This specification
|
||||||
|
- `templates/cell.json` - Template for cell configuration
|
||||||
|
- `templates/agent.json` - Template for agent configuration
|
||||||
|
- `templates/task.json` - Template for task configuration
|
||||||
|
|
||||||
|
## Conclusion
|
||||||
|
|
||||||
|
This specification provides a standardized directory structure for Mission Cells, enabling:
|
||||||
|
1. **Consistent organization** across all missions
|
||||||
|
2. **Easy automation** through JSON configuration
|
||||||
|
3. **Clear separation** between agents, tasks, and artifacts
|
||||||
|
4. **Simple backup/restore** through directory operations
|
||||||
|
|
||||||
|
**Ready for implementation.**
|
||||||
Reference in New Issue
Block a user