# Hermes Agent - Testing Infrastructure Deep Analysis

## Executive Summary

The hermes-agent project has a **comprehensive test suite** with **373 test files** containing more than **4,300 test functions**, organized into 10 subdirectories covering all major components.

---

## 1. Test Suite Structure & Statistics

### 1.1 Directory Breakdown

| Directory | Test Files | Focus Area |
|-----------|------------|------------|
| `tests/tools/` | 86 | Tool implementations, file operations, environments |
| `tests/gateway/` | 96 | Platform integrations (Discord, Telegram, Slack, etc.) |
| `tests/hermes_cli/` | 48 | CLI commands, configuration, setup flows |
| `tests/agent/` | 16 | Core agent logic, prompt building, model adapters |
| `tests/integration/` | 8 | End-to-end integration tests |
| `tests/acp/` | 8 | Agent Communication Protocol |
| `tests/cron/` | 3 | Cron job scheduling |
| `tests/skills/` | 5 | Skill management |
| `tests/honcho_integration/` | 5 | Honcho memory integration |
| `tests/fakes/` | 2 | Test fixtures and fake servers |
| **Total** | **373** | **~4,311 test functions** |

### 1.2 Test Classification

- **Unit Tests:** ~95% (3,600+)
- **Integration Tests:** ~5% (marked with `@pytest.mark.integration`)
- **Async Tests:** ~679 tests use `@pytest.mark.asyncio`

### 1.3 Largest Test Files (by line count)

1. `tests/test_run_agent.py` - 3,329 lines (212 tests) - Core agent logic
2. `tests/tools/test_mcp_tool.py` - 2,902 lines (147 tests) - MCP protocol
3. `tests/gateway/test_voice_command.py` - 2,632 lines - Voice features
4. `tests/gateway/test_feishu.py` - 2,580 lines - Feishu platform
5. `tests/gateway/test_api_server.py` - 1,503 lines - API server

---

## 2. Coverage Heat Map - Critical Gaps Identified

### 2.1 NO TEST COVERAGE (Red Zone)

#### Agent Module Gaps:

- `agent/copilot_acp_client.py` - Copilot integration (0 tests)
- `agent/gemini_adapter.py` - Google Gemini model support (0 tests)
- `agent/knowledge_ingester.py` - Knowledge ingestion (0 tests)
- `agent/meta_reasoning.py` - Meta-reasoning capabilities (0 tests)
- `agent/skill_utils.py` - Skill utilities (0 tests)
- `agent/trajectory.py` - Trajectory management (0 tests)

#### Tools Module Gaps:

- `tools/browser_tool.py` - Browser automation (0 tests)
- `tools/code_execution_tool.py` - Code execution (0 tests)
- `tools/gitea_client.py` - Gitea integration (0 tests)
- `tools/image_generation_tool.py` - Image generation (0 tests)
- `tools/neutts_synth.py` - Neural TTS (0 tests)
- `tools/openrouter_client.py` - OpenRouter API (0 tests)
- `tools/session_search_tool.py` - Session search (0 tests)
- `tools/terminal_tool.py` - Terminal operations (0 tests)
- `tools/tts_tool.py` - Text-to-speech (0 tests)
- `tools/web_tools.py` - Web tools core (0 tests)

#### Gateway Module Gaps:

- `gateway/run.py` - Gateway runner (0 tests)
- `gateway/stream_consumer.py` - Stream consumption (0 tests)

#### Root-Level Gaps:

- `hermes_constants.py` - Constants (0 tests)
- `hermes_time.py` - Time utilities (0 tests)
- `mini_swe_runner.py` - SWE runner (0 tests)
- `rl_cli.py` - RL CLI (0 tests)
- `utils.py` - Utilities (0 tests)

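A gap list like the one above can also be kept up to date mechanically. A minimal sketch, assuming the convention that a module `foo/bar.py` is covered by some `test_bar*.py` anywhere under `tests/` (the real layout may map modules to tests differently):

```python
from pathlib import Path


def untested_modules(src_root: str, tests_root: str) -> list[str]:
    """Return source modules that have no test_<name>*.py counterpart."""
    test_stems = {p.stem for p in Path(tests_root).rglob("test_*.py")}
    gaps = []
    for module in Path(src_root).rglob("*.py"):
        if module.name.startswith("_"):
            continue  # skip __init__.py and private modules
        if not any(stem.startswith(f"test_{module.stem}") for stem in test_stems):
            gaps.append(str(module))
    return sorted(gaps)
```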
### 2.2 LIMITED COVERAGE (Yellow Zone)

- `agent/models_dev.py` - Only 19 tests for complex model routing
- `agent/smart_model_routing.py` - Only 6 tests
- `tools/approval.py` - 2 test files but complex logic
- `tools/skills_guard.py` - Security-critical, needs more coverage

### 2.3 GOOD COVERAGE (Green Zone)

- `agent/anthropic_adapter.py` - 97 tests (comprehensive)
- `agent/prompt_builder.py` - 108 tests (excellent)
- `tools/mcp_tool.py` - 147 tests (very comprehensive)
- `tools/file_tools.py` - Multiple test files
- `gateway/discord.py` - 11 test files covering various aspects
- `gateway/telegram.py` - 10 test files
- `gateway/session.py` - 15 test files

---

## 3. Test Patterns Analysis

### 3.1 Fixtures Architecture

**Global Fixtures (`conftest.py`):**

- `_isolate_hermes_home` - Isolates HERMES_HOME to a temp directory (autouse)
- `_ensure_current_event_loop` - Event loop management for sync tests (autouse)
- `_enforce_test_timeout` - 30-second timeout per test (autouse)
- `tmp_dir` - Temporary directory fixture
- `mock_config` - Minimal hermes config for unit tests

**Common Patterns:**

```python
import pytest
from unittest.mock import patch


# Isolation pattern
@pytest.fixture(autouse=True)
def isolate_env(tmp_path, monkeypatch):
    monkeypatch.setenv("HERMES_HOME", str(tmp_path))


# Mock client pattern
@pytest.fixture
def mock_agent():
    with patch("run_agent.OpenAI") as mock:
        yield mock
```

### 3.2 Mock Usage Statistics

- **~12,468 mock/patch usages** across the test suite
- Heavy use of `unittest.mock.patch` and `MagicMock`
- `AsyncMock` used for async function mocking
- `SimpleNamespace` for creating mock API response objects

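The `AsyncMock` + `SimpleNamespace` combination typically looks like the following sketch. The `fetch_completion` name and the response shape are illustrative stand-ins, not actual hermes-agent APIs:

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

# Fake API response object built with SimpleNamespace instead of a real class
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="hello"))]
)

# AsyncMock lets the mocked coroutine be awaited like the real client call
client = SimpleNamespace(fetch_completion=AsyncMock(return_value=fake_response))


async def main():
    response = await client.fetch_completion(prompt="hi")
    return response.choices[0].message.content


print(asyncio.run(main()))  # hello
```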
### 3.3 Test Organization Patterns

**Class-Based Organization:**

- 1,532 test classes identified
- Grouped by functionality: `Test<Feature><Scenario>`
- Example: `TestSanitizeApiMessages`, `TestContextPressureFlags`

**Function-Based Organization:**

- Used for simpler test files
- Naming: `test_<feature>_<scenario>`

### 3.4 Async Test Patterns

```python
@pytest.mark.asyncio
async def test_async_function():
    result = await async_function()
    assert result == expected
```

---

## 4. 20 New Test Recommendations (Priority Order)

### Critical Priority (Security/Risk)

1. **Browser Tool Security Tests** (`tools/browser_tool.py`)
   - Test sandbox escape prevention
   - Test malicious script blocking
   - Test content security policy enforcement

2. **Code Execution Sandbox Tests** (`tools/code_execution_tool.py`)
   - Test resource limits (CPU, memory)
   - Test dangerous import blocking
   - Test timeout enforcement
   - Test filesystem access restrictions

3. **Terminal Tool Safety Tests** (`tools/terminal_tool.py`)
   - Test dangerous command blocking
   - Test command injection prevention
   - Test environment variable sanitization

4. **OpenRouter Client Tests** (`tools/openrouter_client.py`)
   - Test API key handling
   - Test rate limit handling
   - Test error response parsing

### High Priority (Core Functionality)

5. **Gemini Adapter Tests** (`agent/gemini_adapter.py`)
   - Test message format conversion
   - Test tool call normalization
   - Test streaming response handling

6. **Copilot ACP Client Tests** (`agent/copilot_acp_client.py`)
   - Test authentication flow
   - Test session management
   - Test message passing

7. **Knowledge Ingester Tests** (`agent/knowledge_ingester.py`)
   - Test document parsing
   - Test embedding generation
   - Test knowledge retrieval

8. **Stream Consumer Tests** (`gateway/stream_consumer.py`)
   - Test backpressure handling
   - Test reconnection logic
   - Test message ordering guarantees

### Medium Priority (Integration/Features)

9. **Web Tools Core Tests** (`tools/web_tools.py`)
   - Test search result parsing
   - Test content extraction
   - Test error handling for unavailable services

10. **Image Generation Tool Tests** (`tools/image_generation_tool.py`)
    - Test prompt filtering
    - Test image format handling
    - Test provider failover

11. **Gitea Client Tests** (`tools/gitea_client.py`)
    - Test repository operations
    - Test webhook handling
    - Test authentication

12. **Session Search Tool Tests** (`tools/session_search_tool.py`)
    - Test query parsing
    - Test result ranking
    - Test pagination

13. **Meta Reasoning Tests** (`agent/meta_reasoning.py`)
    - Test strategy selection
    - Test reflection generation
    - Test learning from failures

14. **TTS Tool Tests** (`tools/tts_tool.py`)
    - Test voice selection
    - Test audio format conversion
    - Test streaming playback

15. **Neural TTS Tests** (`tools/neutts_synth.py`)
    - Test voice cloning safety
    - Test audio quality validation
    - Test resource cleanup

### Lower Priority (Utilities)

16. **Hermes Constants Tests** (`hermes_constants.py`)
    - Test constant values
    - Test environment-specific overrides

17. **Time Utilities Tests** (`hermes_time.py`)
    - Test timezone handling
    - Test formatting functions

18. **Utils Module Tests** (`utils.py`)
    - Test helper functions
    - Test validation utilities

19. **Mini SWE Runner Tests** (`mini_swe_runner.py`)
    - Test repository setup
    - Test test execution
    - Test result parsing

20. **RL CLI Tests** (`rl_cli.py`)
    - Test training command parsing
    - Test configuration validation
    - Test checkpoint handling

---

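As an illustration of recommendation 3, an injection-prevention test could be sketched like this. `run_command` is a hypothetical stand-in for the real terminal tool API; the key property under test is that list-based execution with `shell=False` passes shell metacharacters through as literal arguments:

```python
import shlex
import subprocess


def run_command(command: str) -> str:
    # Hypothetical safe runner: shlex.split + shell=False means shell
    # metacharacters reach the program as literal arguments and are
    # never interpreted by a shell.
    result = subprocess.run(
        shlex.split(command), shell=False, capture_output=True, text=True
    )
    return result.stdout


def test_command_injection_is_not_interpreted():
    # Under shell=True the `; rm -rf ...` suffix would execute; with a
    # list argv it is just text echoed back.
    out = run_command("echo safe; rm -rf /tmp/victim")
    assert "safe; rm -rf /tmp/victim" in out
```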
## 5. Test Optimization Opportunities

### 5.1 Performance Issues Identified

**Large Test Files (Split Recommended):**

- `tests/test_run_agent.py` (3,329 lines) → Split into multiple files
- `tests/tools/test_mcp_tool.py` (2,902 lines) → Split by MCP feature
- `tests/test_anthropic_adapter.py` (1,219 lines) → Consider splitting

**Potential Slow Tests:**

- Integration tests with real API calls
- Tests with file I/O operations
- Tests with subprocess spawning

### 5.2 Optimization Recommendations

1. **Parallel Execution Already Configured**
   - `pytest-xdist` with `-n auto` in CI
   - Maintains isolation through fixtures

2. **Fixture Scope Optimization**
   - Review `autouse=True` fixtures for necessity
   - Consider session-scoped fixtures for expensive setup

3. **Mock External Services**
   - Some integration tests still hit real APIs
   - Create more fakes like `fake_ha_server.py`

4. **Test Data Management**
   - Use factory pattern for test data generation
   - Share test fixtures across related tests

### 5.3 CI/CD Optimizations

Current CI (`.github/workflows/tests.yml`):

- Uses `uv` for fast dependency installation
- Runs with `-n auto` for parallelization
- Ignores integration tests by default
- 10-minute timeout

**Recommended Improvements:**

1. Add test duration reporting (`--durations=10`)
2. Add coverage reporting
3. Separate fast unit tests from slower integration tests
4. Add flaky test retry mechanism

---

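Improvement 3 (separating fast tests from slow ones) is commonly wired up with the standard `--runslow` pattern from the pytest documentation, reusing a `slow` marker like the one proposed in Section 8.2. A sketch:

```python
# conftest.py - skip tests marked `slow` unless --runslow is given
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False,
        help="run tests marked as slow",
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # run everything, including slow tests
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```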
## 6. Missing Integration Test Scenarios

### 6.1 Cross-Component Integration

1. **End-to-End Agent Flow**
   - User message → Gateway → Agent → Tools → Response
   - Test with real (mocked) LLM responses

2. **Multi-Platform Gateway**
   - Message routing between platforms
   - Session persistence across platforms

3. **Tool + Environment Integration**
   - Terminal tool with different backends (local, docker, modal)
   - File operations with permission checks

4. **Skill Lifecycle Integration**
   - Skill installation → Registration → Execution → Update → Removal

5. **Memory + Honcho Integration**
   - Memory storage → Retrieval → Context injection

### 6.2 Failure Scenario Integration Tests

1. **LLM Provider Failover**
   - Primary provider down → Fallback provider
   - Rate limiting handling

2. **Gateway Reconnection**
   - Platform disconnect → Reconnect → Resume session

3. **Tool Execution Failures**
   - Tool timeout → Retry → Fallback
   - Tool error → Error handling → User notification

4. **Checkpoint Recovery**
   - Crash during batch → Resume from checkpoint
   - Corrupted checkpoint handling

### 6.3 Security Integration Tests

1. **Prompt Injection Across Stack**
   - Gateway input → Agent processing → Tool execution

2. **Permission Escalation Prevention**
   - User permissions → Tool allowlist → Execution

3. **Data Leak Prevention**
   - Memory storage → Context building → Response generation

---

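A provider-failover test like 6.2(1) might be sketched as follows. `complete_with_failover`, the provider interface, and `ProviderError` are hypothetical stand-ins for the real routing code; the shape of the test is the point:

```python
import asyncio
from unittest.mock import AsyncMock


class ProviderError(Exception):
    """Stand-in for a provider outage or rate-limit error."""


async def complete_with_failover(providers, prompt):
    # Try each provider in order; fall through to the next on failure.
    last_error = None
    for provider in providers:
        try:
            return await provider.complete(prompt)
        except ProviderError as exc:
            last_error = exc
    raise last_error


def test_falls_back_to_secondary_provider():
    primary = AsyncMock()
    primary.complete.side_effect = ProviderError("primary down")
    secondary = AsyncMock()
    secondary.complete.return_value = "ok"
    result = asyncio.run(complete_with_failover([primary, secondary], "hi"))
    assert result == "ok"
    primary.complete.assert_awaited_once()
```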
## 7. Performance Test Strategy

### 7.1 Load Testing Requirements

1. **Gateway Load Tests**
   - Concurrent session handling
   - Message throughput per platform
   - Memory usage under load

2. **Agent Response Time Tests**
   - End-to-end latency benchmarks
   - Tool execution time budgets
   - Context building performance

3. **Resource Utilization Tests**
   - Memory leaks in long-running sessions
   - File descriptor limits
   - CPU usage patterns

### 7.2 Benchmark Framework

```python
# Proposed performance test structure
class TestGatewayPerformance:
    @pytest.mark.benchmark
    def test_message_throughput(self, benchmark):
        # Measure messages processed per second
        pass

    @pytest.mark.benchmark
    def test_session_creation_latency(self, benchmark):
        # Measure session setup time
        pass
```

### 7.3 Performance Regression Detection

1. **Baseline Establishment**
   - Record baseline metrics for critical paths
   - Store in version control

2. **Automated Comparison**
   - Compare PR performance against baseline
   - Fail if degradation > 10%

3. **Metrics to Track**
   - Test suite execution time
   - Memory peak usage
   - Individual test durations

---

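The compare-against-baseline step in 7.3 reduces to a small check. A sketch using the 10% threshold from the text; the JSON baseline format and metric names are assumptions for illustration:

```python
import json
from pathlib import Path

THRESHOLD = 0.10  # fail if a metric regresses by more than 10%


def check_regressions(baseline_path: str, current: dict) -> list:
    """Return the names of metrics that regressed past the threshold.

    Metrics are "lower is better" (seconds, megabytes), so a regression
    means the current value exceeds baseline * (1 + THRESHOLD).
    """
    baseline = json.loads(Path(baseline_path).read_text())
    failures = []
    for name, base_value in baseline.items():
        new_value = current.get(name)
        if new_value is not None and new_value > base_value * (1 + THRESHOLD):
            failures.append(name)
    return failures
```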
## 8. Test Infrastructure Improvements

### 8.1 Coverage Tooling

**Missing:** Code coverage reporting
**Recommendation:** Add `pytest-cov` to dev dependencies

```toml
[project.optional-dependencies]
dev = [
    "pytest>=9.0.2,<10",
    "pytest-asyncio>=1.3.0,<2",
    "pytest-xdist>=3.0,<4",
    "pytest-cov>=5.0,<6",  # Add this
    "mcp>=1.2.0,<2"
]
```

### 8.2 Test Categories

Add more pytest markers for selective test running:

```toml
# In pyproject.toml (or the equivalent `markers` key in pytest.ini)
markers = [
    "integration: marks tests requiring external services",
    "slow: marks slow tests (>5s)",
    "security: marks security-focused tests",
    "benchmark: marks performance benchmark tests",
    "flaky: marks tests that may be unstable",
]
```

### 8.3 Test Data Factory

Create centralized test data factories:

```python
# tests/factories.py
from types import SimpleNamespace
from unittest.mock import MagicMock


class AgentFactory:
    @staticmethod
    def create_mock_agent(tools=None):
        # Return a mock agent pre-configured with the given tool list
        agent = MagicMock()
        agent.tools = tools or []
        return agent


class MessageFactory:
    @staticmethod
    def create_user_message(content):
        # Return a user message in a chat-completions-style format
        return {"role": "user", "content": content}
```

---

## 9. Summary & Action Items

### Immediate Actions (High Impact)

1. **Add coverage reporting** to the CI pipeline
2. **Create tests for uncovered security-critical modules:**
   - `tools/code_execution_tool.py`
   - `tools/browser_tool.py`
   - `tools/terminal_tool.py`
3. **Split oversized test files** for better maintainability
4. **Add Gemini adapter tests** (an increasingly important provider)

### Short-term (1-2 Sprints)

5. Create integration tests for cross-component flows
6. Add performance benchmarks for critical paths
7. Expand OpenRouter client test coverage
8. Add knowledge ingester tests

### Long-term (Quarter)

9. Achieve 80% code coverage across all modules
10. Implement performance regression testing
11. Create a comprehensive security test suite
12. Document testing patterns and best practices

---

## Appendix: Test File Size Distribution

| Lines | Count | Category |
|-------|-------|----------|
| 0-100 | ~50 | Simple unit tests |
| 100-500 | ~200 | Standard test files |
| 500-1000 | ~80 | Complex feature tests |
| 1000-2000 | ~30 | Large test suites |
| 2000+ | ~13 | Monolithic test files (needs splitting) |

---

*Analysis generated: March 30, 2026*
*Total test files analyzed: 373*
*Estimated test functions: ~4,311*