# Hermes Agent - Testing Infrastructure Deep Analysis

## Executive Summary

The hermes-agent project has a comprehensive test suite of 373 test files containing approximately 4,311 test functions, organized into 10 subdirectories covering all major components.
## 1. Test Suite Structure & Statistics

### 1.1 Directory Breakdown

| Directory | Test Files | Focus Area |
|---|---|---|
| tests/tools/ | 86 | Tool implementations, file operations, environments |
| tests/gateway/ | 96 | Platform integrations (Discord, Telegram, Slack, etc.) |
| tests/hermes_cli/ | 48 | CLI commands, configuration, setup flows |
| tests/agent/ | 16 | Core agent logic, prompt building, model adapters |
| tests/integration/ | 8 | End-to-end integration tests |
| tests/acp/ | 8 | Agent Communication Protocol |
| tests/cron/ | 3 | Cron job scheduling |
| tests/skills/ | 5 | Skill management |
| tests/honcho_integration/ | 5 | Honcho memory integration |
| tests/fakes/ | 2 | Test fixtures and fake servers |
| **Total** | **373** | ~4,311 test functions |
### 1.2 Test Classification

- Unit Tests: ~95% (3,600+)
- Integration Tests: ~5% (marked with `@pytest.mark.integration`)
- Async Tests: ~679 tests use `@pytest.mark.asyncio`
### 1.3 Largest Test Files (by line count)

- `tests/test_run_agent.py` - 3,329 lines (212 tests) - Core agent logic
- `tests/tools/test_mcp_tool.py` - 2,902 lines (147 tests) - MCP protocol
- `tests/gateway/test_voice_command.py` - 2,632 lines - Voice features
- `tests/gateway/test_feishu.py` - 2,580 lines - Feishu platform
- `tests/gateway/test_api_server.py` - 1,503 lines - API server
## 2. Coverage Heat Map - Critical Gaps Identified

### 2.1 NO TEST COVERAGE (Red Zone)

Agent Module Gaps:

- `agent/copilot_acp_client.py` - Copilot integration (0 tests)
- `agent/gemini_adapter.py` - Google Gemini model support (0 tests)
- `agent/knowledge_ingester.py` - Knowledge ingestion (0 tests)
- `agent/meta_reasoning.py` - Meta-reasoning capabilities (0 tests)
- `agent/skill_utils.py` - Skill utilities (0 tests)
- `agent/trajectory.py` - Trajectory management (0 tests)

Tools Module Gaps:

- `tools/browser_tool.py` - Browser automation (0 tests)
- `tools/code_execution_tool.py` - Code execution (0 tests)
- `tools/gitea_client.py` - Gitea integration (0 tests)
- `tools/image_generation_tool.py` - Image generation (0 tests)
- `tools/neutts_synth.py` - Neural TTS (0 tests)
- `tools/openrouter_client.py` - OpenRouter API (0 tests)
- `tools/session_search_tool.py` - Session search (0 tests)
- `tools/terminal_tool.py` - Terminal operations (0 tests)
- `tools/tts_tool.py` - Text-to-speech (0 tests)
- `tools/web_tools.py` - Web tools core (0 tests)

Gateway Module Gaps:

- `gateway/run.py` - Gateway runner (0 tests)
- `gateway/stream_consumer.py` - Stream consumption (0 tests)

Root-Level Gaps:

- `hermes_constants.py` - Constants (0 tests)
- `hermes_time.py` - Time utilities (0 tests)
- `mini_swe_runner.py` - SWE runner (0 tests)
- `rl_cli.py` - RL CLI (0 tests)
- `utils.py` - Utilities (0 tests)
### 2.2 LIMITED COVERAGE (Yellow Zone)

- `agent/models_dev.py` - Only 19 tests for complex model routing
- `agent/smart_model_routing.py` - Only 6 tests
- `tools/approval.py` - 2 test files but complex logic
- `tools/skills_guard.py` - Security-critical, needs more coverage

### 2.3 GOOD COVERAGE (Green Zone)

- `agent/anthropic_adapter.py` - 97 tests (comprehensive)
- `agent/prompt_builder.py` - 108 tests (excellent)
- `tools/mcp_tool.py` - 147 tests (very comprehensive)
- `tools/file_tools.py` - Multiple test files
- `gateway/discord.py` - 11 test files covering various aspects
- `gateway/telegram.py` - 10 test files
- `gateway/session.py` - 15 test files
## 3. Test Patterns Analysis

### 3.1 Fixtures Architecture

Global Fixtures (conftest.py):

- `_isolate_hermes_home` - Isolates HERMES_HOME to a temp directory (autouse)
- `_ensure_current_event_loop` - Event loop management for sync tests (autouse)
- `_enforce_test_timeout` - 30-second timeout per test (autouse)
- `tmp_dir` - Temporary directory fixture
- `mock_config` - Minimal hermes config for unit tests

Common Patterns:

```python
# Isolation pattern
@pytest.fixture(autouse=True)
def isolate_env(tmp_path, monkeypatch):
    monkeypatch.setenv("HERMES_HOME", str(tmp_path))

# Mock client pattern
@pytest.fixture
def mock_agent():
    with patch("run_agent.OpenAI") as mock:
        yield mock
```
### 3.2 Mock Usage Statistics

- ~12,468 mock/patch usages across the test suite
- Heavy use of `unittest.mock.patch` and `MagicMock`
- `AsyncMock` used for async function mocking
- `SimpleNamespace` for creating mock API response objects
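The `AsyncMock` and `SimpleNamespace` patterns above combine naturally when faking LLM API responses. A minimal sketch, assuming an OpenAI-style async client (the response shape and names here are illustrative, not the project's actual code):

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

# SimpleNamespace builds a fake API response object with attribute access,
# mirroring the shape of an OpenAI-style chat completion.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="hello"))]
)

# AsyncMock stands in for an async method such as chat.completions.create.
mock_create = AsyncMock(return_value=fake_response)

async def call_model():
    response = await mock_create(model="test-model", messages=[])
    return response.choices[0].message.content

result = asyncio.run(call_model())
assert result == "hello"
mock_create.assert_awaited_once()
```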
### 3.3 Test Organization Patterns

Class-Based Organization:

- 1,532 test classes identified
- Grouped by functionality: `Test<Feature><Scenario>` - e.g. `TestSanitizeApiMessages`, `TestContextPressureFlags`

Function-Based Organization:

- Used for simpler test files
- Naming: `test_<feature>_<scenario>`
### 3.4 Async Test Patterns

```python
@pytest.mark.asyncio
async def test_async_function():
    result = await async_function()
    assert result == expected
```
## 4. 20 New Test Recommendations (Priority Order)

### Critical Priority (Security/Risk)

1. Browser Tool Security Tests (`tools/browser_tool.py`)
   - Test sandbox escape prevention
   - Test malicious script blocking
   - Test content security policy enforcement
2. Code Execution Sandbox Tests (`tools/code_execution_tool.py`)
   - Test resource limits (CPU, memory)
   - Test dangerous import blocking
   - Test timeout enforcement
   - Test filesystem access restrictions
3. Terminal Tool Safety Tests (`tools/terminal_tool.py`)
   - Test dangerous command blocking
   - Test command injection prevention
   - Test environment variable sanitization
4. OpenRouter Client Tests (`tools/openrouter_client.py`)
   - Test API key handling
   - Test rate limit handling
   - Test error response parsing
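The terminal-tool injection tests above could be table-driven. A minimal sketch of the shape such tests might take; `is_command_safe` is a hypothetical validator standing in for whatever `tools/terminal_tool.py` actually exposes:

```python
import shlex

# Hypothetical policy: a binary blocklist plus shell metacharacters
# that enable chaining or substitution.
BLOCKED_COMMANDS = {"rm", "mkfs", "dd", "shutdown"}
INJECTION_TOKENS = (";", "&&", "||", "|", "`", "$(")

def is_command_safe(command: str) -> bool:
    # Reject metacharacters first, then tokenize with shlex and
    # check the target binary against the blocklist.
    if any(tok in command for tok in INJECTION_TOKENS):
        return False
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] not in BLOCKED_COMMANDS

# Cases the terminal-tool tests would assert on:
for cmd in [
    "ls; rm -rf /",          # chained command injection
    "cat file && shutdown",  # conditional chaining
    "echo `whoami`",         # command substitution
    "rm -rf /tmp/x",         # blocked binary
]:
    assert not is_command_safe(cmd)
assert is_command_safe("ls -la /tmp")
```

In the real suite each case would be a `@pytest.mark.parametrize` entry against the tool's actual validation entry point.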
### High Priority (Core Functionality)

5. Gemini Adapter Tests (`agent/gemini_adapter.py`)
   - Test message format conversion
   - Test tool call normalization
   - Test streaming response handling
6. Copilot ACP Client Tests (`agent/copilot_acp_client.py`)
   - Test authentication flow
   - Test session management
   - Test message passing
7. Knowledge Ingester Tests (`agent/knowledge_ingester.py`)
   - Test document parsing
   - Test embedding generation
   - Test knowledge retrieval
8. Stream Consumer Tests (`gateway/stream_consumer.py`)
   - Test backpressure handling
   - Test reconnection logic
   - Test message ordering guarantees
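For the stream-consumer items, backpressure and ordering can be exercised without the real gateway by using a bounded `asyncio.Queue` as a stand-in. This is only a sketch of the test shape; the actual `gateway/stream_consumer.py` API is unknown here:

```python
import asyncio

async def producer(queue: asyncio.Queue, n: int) -> None:
    # put() blocks when the bounded queue is full: that is the backpressure.
    for i in range(n):
        await queue.put(i)

async def consumer(queue: asyncio.Queue, n: int) -> list:
    received = []
    for _ in range(n):
        received.append(await queue.get())
        queue.task_done()
    return received

async def run_stream(n: int = 100, maxsize: int = 5) -> list:
    # A tiny maxsize forces the producer to wait on the consumer repeatedly.
    queue = asyncio.Queue(maxsize=maxsize)
    _, received = await asyncio.gather(producer(queue, n), consumer(queue, n))
    return received

received = asyncio.run(run_stream())
# Ordering guarantee: every message arrives exactly once, in order.
assert received == list(range(100))
```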
### Medium Priority (Integration/Features)

9. Web Tools Core Tests (`tools/web_tools.py`)
   - Test search result parsing
   - Test content extraction
   - Test error handling for unavailable services
10. Image Generation Tool Tests (`tools/image_generation_tool.py`)
    - Test prompt filtering
    - Test image format handling
    - Test provider failover
11. Gitea Client Tests (`tools/gitea_client.py`)
    - Test repository operations
    - Test webhook handling
    - Test authentication
12. Session Search Tool Tests (`tools/session_search_tool.py`)
    - Test query parsing
    - Test result ranking
    - Test pagination
13. Meta Reasoning Tests (`agent/meta_reasoning.py`)
    - Test strategy selection
    - Test reflection generation
    - Test learning from failures
14. TTS Tool Tests (`tools/tts_tool.py`)
    - Test voice selection
    - Test audio format conversion
    - Test streaming playback
15. Neural TTS Tests (`tools/neutts_synth.py`)
    - Test voice cloning safety
    - Test audio quality validation
    - Test resource cleanup
### Lower Priority (Utilities)

16. Hermes Constants Tests (`hermes_constants.py`)
    - Test constant values
    - Test environment-specific overrides
17. Time Utilities Tests (`hermes_time.py`)
    - Test timezone handling
    - Test formatting functions
18. Utils Module Tests (`utils.py`)
    - Test helper functions
    - Test validation utilities
19. Mini SWE Runner Tests (`mini_swe_runner.py`)
    - Test repository setup
    - Test test execution
    - Test result parsing
20. RL CLI Tests (`rl_cli.py`)
    - Test training command parsing
    - Test configuration validation
    - Test checkpoint handling
## 5. Test Optimization Opportunities

### 5.1 Performance Issues Identified

Large Test Files (Split Recommended):

- `tests/test_run_agent.py` (3,329 lines) → Split into multiple files
- `tests/tools/test_mcp_tool.py` (2,902 lines) → Split by MCP feature
- `tests/test_anthropic_adapter.py` (1,219 lines) → Consider splitting

Potential Slow Tests:

- Integration tests with real API calls
- Tests with file I/O operations
- Tests with subprocess spawning
### 5.2 Optimization Recommendations

1. Parallel Execution Already Configured
   - `pytest-xdist` with `-n auto` in CI
   - Maintains isolation through fixtures
2. Fixture Scope Optimization
   - Review `autouse=True` fixtures for necessity
   - Consider session-scoped fixtures for expensive setup
3. Mock External Services
   - Some integration tests still hit real APIs
   - Create more fakes like `fake_ha_server.py`
4. Test Data Management
   - Use factory pattern for test data generation
   - Share test fixtures across related tests
### 5.3 CI/CD Optimizations

Current CI (.github/workflows/tests.yml):

- Uses `uv` for fast dependency installation
- Runs with `-n auto` for parallelization
- Ignores integration tests by default
- 10-minute timeout

Recommended Improvements:

- Add test duration reporting (`--durations=10`)
- Add coverage reporting
- Separate fast unit tests from slower integration tests
- Add flaky test retry mechanism
## 6. Missing Integration Test Scenarios

### 6.1 Cross-Component Integration

1. End-to-End Agent Flow
   - User message → Gateway → Agent → Tools → Response
   - Test with real (mocked) LLM responses
2. Multi-Platform Gateway
   - Message routing between platforms
   - Session persistence across platforms
3. Tool + Environment Integration
   - Terminal tool with different backends (local, docker, modal)
   - File operations with permission checks
4. Skill Lifecycle Integration
   - Skill installation → Registration → Execution → Update → Removal
5. Memory + Honcho Integration
   - Memory storage → Retrieval → Context injection
### 6.2 Failure Scenario Integration Tests

1. LLM Provider Failover
   - Primary provider down → Fallback provider
   - Rate limiting handling
2. Gateway Reconnection
   - Platform disconnect → Reconnect → Resume session
3. Tool Execution Failures
   - Tool timeout → Retry → Fallback
   - Tool error → Error handling → User notification
4. Checkpoint Recovery
   - Crash during batch → Resume from checkpoint
   - Corrupted checkpoint handling
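The checkpoint-recovery scenario can be sketched with a toy processor that persists progress after each item. The JSON checkpoint format here is an assumption for illustration, not the project's actual format:

```python
import json
import tempfile
from pathlib import Path

def process_batch(items, checkpoint_path: Path):
    """Process items, persisting progress so a crash can be resumed.
    Hypothetical checkpoint format: a JSON list of completed items."""
    done = []
    if checkpoint_path.exists():
        try:
            done = json.loads(checkpoint_path.read_text())
        except json.JSONDecodeError:
            done = []  # corrupted checkpoint: start over rather than crash
    for item in items:
        if item in done:
            continue  # already processed before the crash
        done.append(item)
        checkpoint_path.write_text(json.dumps(done))
    return done

ckpt = Path(tempfile.mkdtemp()) / "checkpoint.json"

# Simulate a crash after two items, then resume the same batch.
ckpt.write_text(json.dumps(["a", "b"]))
result = process_batch(["a", "b", "c", "d"], ckpt)
assert result == ["a", "b", "c", "d"]

# Corrupted checkpoint is tolerated instead of raising.
ckpt.write_text("{not json")
assert process_batch(["x"], ckpt) == ["x"]
```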
### 6.3 Security Integration Tests

1. Prompt Injection Across Stack
   - Gateway input → Agent processing → Tool execution
2. Permission Escalation Prevention
   - User permissions → Tool allowlist → Execution
3. Data Leak Prevention
   - Memory storage → Context building → Response generation
## 7. Performance Test Strategy

### 7.1 Load Testing Requirements

1. Gateway Load Tests
   - Concurrent session handling
   - Message throughput per platform
   - Memory usage under load
2. Agent Response Time Tests
   - End-to-end latency benchmarks
   - Tool execution time budgets
   - Context building performance
3. Resource Utilization Tests
   - Memory leaks in long-running sessions
   - File descriptor limits
   - CPU usage patterns
### 7.2 Benchmark Framework

```python
# Proposed performance test structure
class TestGatewayPerformance:
    @pytest.mark.benchmark
    def test_message_throughput(self, benchmark):
        # Measure messages processed per second
        pass

    @pytest.mark.benchmark
    def test_session_creation_latency(self, benchmark):
        # Measure session setup time
        pass
```
### 7.3 Performance Regression Detection

1. Baseline Establishment
   - Record baseline metrics for critical paths
   - Store in version control
2. Automated Comparison
   - Compare PR performance against baseline
   - Fail if degradation > 10%
3. Metrics to Track
   - Test suite execution time
   - Memory peak usage
   - Individual test durations
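The comparison step above reduces to a small function. A minimal sketch, assuming a hypothetical metrics format (flat JSON of metric name → value, where higher is worse, e.g. durations and memory):

```python
import json

REGRESSION_THRESHOLD = 0.10  # fail if a metric degrades by more than 10%

def check_regressions(baseline: dict, current: dict) -> list:
    """Return (name, baseline, current) for every metric whose current
    value exceeds baseline by more than the threshold."""
    failures = []
    for name, base_value in baseline.items():
        cur_value = current.get(name)
        if cur_value is None:
            continue  # metric not measured in this run
        if cur_value > base_value * (1 + REGRESSION_THRESHOLD):
            failures.append((name, base_value, cur_value))
    return failures

# Baseline as it might be stored in version control.
baseline = json.loads('{"suite_seconds": 120.0, "peak_mem_mb": 800.0}')
current = {"suite_seconds": 150.0, "peak_mem_mb": 805.0}

failures = check_regressions(baseline, current)
# 150s is a 25% regression over 120s; 805 MB is within 10% of 800 MB.
assert failures == [("suite_seconds", 120.0, 150.0)]
```

In CI this would run after the benchmark job and exit non-zero when `failures` is non-empty.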
## 8. Test Infrastructure Improvements

### 8.1 Coverage Tooling

Missing: code coverage reporting.

Recommendation: add `pytest-cov` to dev dependencies:

```toml
[project.optional-dependencies]
dev = [
    "pytest>=9.0.2,<10",
    "pytest-asyncio>=1.3.0,<2",
    "pytest-xdist>=3.0,<4",
    "pytest-cov>=5.0,<6",  # Add this
    "mcp>=1.2.0,<2",
]
```
### 8.2 Test Categories

Add more pytest markers for selective test running:

```toml
# In pytest.ini or pyproject.toml
markers = [
    "integration: marks tests requiring external services",
    "slow: marks slow tests (>5s)",
    "security: marks security-focused tests",
    "benchmark: marks performance benchmark tests",
    "flaky: marks tests that may be unstable",
]
```
### 8.3 Test Data Factory

Create centralized test data factories:

```python
# tests/factories.py
class AgentFactory:
    @staticmethod
    def create_mock_agent(tools=None):
        # Return configured mock agent
        pass

class MessageFactory:
    @staticmethod
    def create_user_message(content):
        # Return formatted user message
        pass
```
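A minimal concrete version of these stubs might look like the following; the message schema and agent shape are assumptions for illustration, not the project's actual formats:

```python
from unittest.mock import MagicMock

class AgentFactory:
    @staticmethod
    def create_mock_agent(tools=None):
        # MagicMock agent preloaded with a tool list; shape is hypothetical.
        agent = MagicMock()
        agent.tools = tools or []
        return agent

class MessageFactory:
    @staticmethod
    def create_user_message(content):
        # OpenAI-style role/content dict; the real schema may differ.
        return {"role": "user", "content": content}

msg = MessageFactory.create_user_message("hello")
assert msg == {"role": "user", "content": "hello"}

agent = AgentFactory.create_mock_agent(tools=["terminal"])
assert agent.tools == ["terminal"]
```

Centralizing construction like this keeps test data consistent and makes schema changes a one-file edit.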
## 9. Summary & Action Items

### Immediate Actions (High Impact)

- Add coverage reporting to the CI pipeline
- Create tests for uncovered security-critical modules:
  - `tools/code_execution_tool.py`
  - `tools/browser_tool.py`
  - `tools/terminal_tool.py`
- Split oversized test files for better maintainability
- Add Gemini adapter tests (increasingly important provider)
### Short-term (1-2 Sprints)

- Create integration tests for cross-component flows
- Add performance benchmarks for critical paths
- Expand OpenRouter client test coverage
- Add knowledge ingester tests

### Long-term (Quarter)

- Achieve 80% code coverage across all modules
- Implement performance regression testing
- Create a comprehensive security test suite
- Document testing patterns and best practices
## Appendix: Test File Size Distribution
| Lines | Count | Category |
|---|---|---|
| 0-100 | ~50 | Simple unit tests |
| 100-500 | ~200 | Standard test files |
| 500-1000 | ~80 | Complex feature tests |
| 1000-2000 | ~30 | Large test suites |
| 2000+ | ~13 | Monolithic test files (need splitting) |
---

Analysis generated: March 30, 2026
Total test files analyzed: 373
Estimated test functions: ~4,311