# Hermes Agent - Testing Infrastructure Deep Analysis

## Executive Summary

The hermes-agent project has a **comprehensive test suite**: **373 test files** containing roughly **4,300 test functions**. The tests are organized into 10 subdirectories covering all major components.

---

## 1. Test Suite Structure & Statistics

### 1.1 Directory Breakdown

| Directory | Test Files | Focus Area |
|-----------|------------|------------|
| `tests/tools/` | 86 | Tool implementations, file operations, environments |
| `tests/gateway/` | 96 | Platform integrations (Discord, Telegram, Slack, etc.) |
| `tests/hermes_cli/` | 48 | CLI commands, configuration, setup flows |
| `tests/agent/` | 16 | Core agent logic, prompt building, model adapters |
| `tests/integration/` | 8 | End-to-end integration tests |
| `tests/acp/` | 8 | Agent Communication Protocol |
| `tests/cron/` | 3 | Cron job scheduling |
| `tests/skills/` | 5 | Skill management |
| `tests/honcho_integration/` | 5 | Honcho memory integration |
| `tests/fakes/` | 2 | Test fixtures and fake servers |
| **Total** | **373** | **~4,311 test functions** |

### 1.2 Test Classification

- **Unit Tests:** ~95% (3,600+)
- **Integration Tests:** ~5% (marked with `@pytest.mark.integration`)
- **Async Tests:** ~679 tests use `@pytest.mark.asyncio`

### 1.3 Largest Test Files (by line count)

1. `tests/test_run_agent.py` - 3,329 lines (212 tests) - Core agent logic
2. `tests/tools/test_mcp_tool.py` - 2,902 lines (147 tests) - MCP protocol
3. `tests/gateway/test_voice_command.py` - 2,632 lines - Voice features
4. `tests/gateway/test_feishu.py` - 2,580 lines - Feishu platform
5. `tests/gateway/test_api_server.py` - 1,503 lines - API server

---

## 2. Coverage Heat Map - Critical Gaps Identified

### 2.1 NO TEST COVERAGE (Red Zone)

#### Agent Module Gaps:

- `agent/copilot_acp_client.py` - Copilot integration (0 tests)
- `agent/gemini_adapter.py` - Google Gemini model support (0 tests)
- `agent/knowledge_ingester.py` - Knowledge ingestion (0 tests)
- `agent/meta_reasoning.py` - Meta-reasoning capabilities (0 tests)
- `agent/skill_utils.py` - Skill utilities (0 tests)
- `agent/trajectory.py` - Trajectory management (0 tests)

#### Tools Module Gaps:

- `tools/browser_tool.py` - Browser automation (0 tests)
- `tools/code_execution_tool.py` - Code execution (0 tests)
- `tools/gitea_client.py` - Gitea integration (0 tests)
- `tools/image_generation_tool.py` - Image generation (0 tests)
- `tools/neutts_synth.py` - Neural TTS (0 tests)
- `tools/openrouter_client.py` - OpenRouter API (0 tests)
- `tools/session_search_tool.py` - Session search (0 tests)
- `tools/terminal_tool.py` - Terminal operations (0 tests)
- `tools/tts_tool.py` - Text-to-speech (0 tests)
- `tools/web_tools.py` - Web tools core (0 tests)

#### Gateway Module Gaps:

- `gateway/run.py` - Gateway runner (0 tests)
- `gateway/stream_consumer.py` - Stream consumption (0 tests)

#### Root-Level Gaps:

- `hermes_constants.py` - Constants (0 tests)
- `hermes_time.py` - Time utilities (0 tests)
- `mini_swe_runner.py` - SWE runner (0 tests)
- `rl_cli.py` - RL CLI (0 tests)
- `utils.py` - Utilities (0 tests)

### 2.2 LIMITED COVERAGE (Yellow Zone)

- `agent/models_dev.py` - Only 19 tests for complex model routing
- `agent/smart_model_routing.py` - Only 6 tests
- `tools/approval.py` - 2 test files, but the logic is complex
- `tools/skills_guard.py` - Security-critical; needs more coverage

### 2.3 GOOD COVERAGE (Green Zone)

- `agent/anthropic_adapter.py` - 97 tests (comprehensive)
- `agent/prompt_builder.py` - 108 tests (excellent)
- `tools/mcp_tool.py` - 147 tests (very comprehensive)
- `tools/file_tools.py` - Multiple test files
- `gateway/discord.py` - 11 test files covering
various aspects
- `gateway/telegram.py` - 10 test files
- `gateway/session.py` - 15 test files

---

## 3. Test Patterns Analysis

### 3.1 Fixtures Architecture

**Global Fixtures (`conftest.py`):**

- `_isolate_hermes_home` - Isolates HERMES_HOME to a temp directory (autouse)
- `_ensure_current_event_loop` - Event loop management for sync tests (autouse)
- `_enforce_test_timeout` - 30-second timeout per test (autouse)
- `tmp_dir` - Temporary directory fixture
- `mock_config` - Minimal hermes config for unit tests

**Common Patterns:**

```python
import pytest
from unittest.mock import patch

# Isolation pattern: point HERMES_HOME at a per-test temp directory
@pytest.fixture(autouse=True)
def isolate_env(tmp_path, monkeypatch):
    monkeypatch.setenv("HERMES_HOME", str(tmp_path))

# Mock client pattern: patch the OpenAI client used by the agent
@pytest.fixture
def mock_agent():
    with patch("run_agent.OpenAI") as mock:
        yield mock
```

### 3.2 Mock Usage Statistics

- **~12,468 mock/patch usages** across the test suite
- Heavy use of `unittest.mock.patch` and `MagicMock`
- `AsyncMock` used for async function mocking
- `SimpleNamespace` for creating mock API response objects

### 3.3 Test Organization Patterns

**Class-Based Organization:**

- 1,532 test classes identified
- Grouped by functionality into `Test<Feature>`-style classes
- Examples: `TestSanitizeApiMessages`, `TestContextPressureFlags`

**Function-Based Organization:**

- Used for simpler test files
- Naming follows `test_<feature>_<behavior>`

### 3.4 Async Test Patterns

```python
@pytest.mark.asyncio
async def test_async_function():
    result = await async_function()
    assert result == expected
```

---

## 4. 20 New Test Recommendations (Priority Order)

### Critical Priority (Security/Risk)

1. **Browser Tool Security Tests** (`tools/browser_tool.py`)
   - Test sandbox escape prevention
   - Test malicious script blocking
   - Test content security policy enforcement
2. **Code Execution Sandbox Tests** (`tools/code_execution_tool.py`)
   - Test resource limits (CPU, memory)
   - Test dangerous import blocking
   - Test timeout enforcement
   - Test filesystem access restrictions
3. **Terminal Tool Safety Tests** (`tools/terminal_tool.py`)
   - Test dangerous command blocking
   - Test command injection prevention
   - Test environment variable sanitization
4. **OpenRouter Client Tests** (`tools/openrouter_client.py`)
   - Test API key handling
   - Test rate limit handling
   - Test error response parsing

### High Priority (Core Functionality)

5. **Gemini Adapter Tests** (`agent/gemini_adapter.py`)
   - Test message format conversion
   - Test tool call normalization
   - Test streaming response handling
6. **Copilot ACP Client Tests** (`agent/copilot_acp_client.py`)
   - Test authentication flow
   - Test session management
   - Test message passing
7. **Knowledge Ingester Tests** (`agent/knowledge_ingester.py`)
   - Test document parsing
   - Test embedding generation
   - Test knowledge retrieval
8. **Stream Consumer Tests** (`gateway/stream_consumer.py`)
   - Test backpressure handling
   - Test reconnection logic
   - Test message ordering guarantees

### Medium Priority (Integration/Features)

9. **Web Tools Core Tests** (`tools/web_tools.py`)
   - Test search result parsing
   - Test content extraction
   - Test error handling for unavailable services
10. **Image Generation Tool Tests** (`tools/image_generation_tool.py`)
    - Test prompt filtering
    - Test image format handling
    - Test provider failover
11. **Gitea Client Tests** (`tools/gitea_client.py`)
    - Test repository operations
    - Test webhook handling
    - Test authentication
12. **Session Search Tool Tests** (`tools/session_search_tool.py`)
    - Test query parsing
    - Test result ranking
    - Test pagination
13. **Meta Reasoning Tests** (`agent/meta_reasoning.py`)
    - Test strategy selection
    - Test reflection generation
    - Test learning from failures
14. **TTS Tool Tests** (`tools/tts_tool.py`)
    - Test voice selection
    - Test audio format conversion
    - Test streaming playback
15. **Neural TTS Tests** (`tools/neutts_synth.py`)
    - Test voice cloning safety
    - Test audio quality validation
    - Test resource cleanup

### Lower Priority (Utilities)

16. **Hermes Constants Tests** (`hermes_constants.py`)
    - Test constant values
    - Test environment-specific overrides
17. **Time Utilities Tests** (`hermes_time.py`)
    - Test timezone handling
    - Test formatting functions
18. **Utils Module Tests** (`utils.py`)
    - Test helper functions
    - Test validation utilities
19. **Mini SWE Runner Tests** (`mini_swe_runner.py`)
    - Test repository setup
    - Test test execution
    - Test result parsing
20. **RL CLI Tests** (`rl_cli.py`)
    - Test training command parsing
    - Test configuration validation
    - Test checkpoint handling

---

## 5. Test Optimization Opportunities

### 5.1 Performance Issues Identified

**Large Test Files (Split Recommended):**

- `tests/test_run_agent.py` (3,329 lines) → split into multiple files
- `tests/tools/test_mcp_tool.py` (2,902 lines) → split by MCP feature
- `tests/test_anthropic_adapter.py` (1,219 lines) → consider splitting

**Potentially Slow Tests:**

- Integration tests with real API calls
- Tests with file I/O operations
- Tests with subprocess spawning

### 5.2 Optimization Recommendations

1. **Parallel Execution Already Configured**
   - `pytest-xdist` with `-n auto` in CI
   - Isolation is maintained through fixtures
2. **Fixture Scope Optimization**
   - Review `autouse=True` fixtures for necessity
   - Consider session-scoped fixtures for expensive setup
3. **Mock External Services**
   - Some integration tests still hit real APIs
   - Create more fakes like `fake_ha_server.py`
4. **Test Data Management**
   - Use the factory pattern for test data generation
   - Share test fixtures across related tests

### 5.3 CI/CD Optimizations

Current CI (`.github/workflows/tests.yml`):

- Uses `uv` for fast dependency installation
- Runs with `-n auto` for parallelization
- Ignores integration tests by default
- 10-minute timeout

**Recommended Improvements:**

1. Add test duration reporting (`--durations=10`)
2. Add coverage reporting
3. Separate fast unit tests from slower integration tests
4. Add a flaky-test retry mechanism

---

## 6. Missing Integration Test Scenarios

### 6.1 Cross-Component Integration

1. **End-to-End Agent Flow**
   - User message → Gateway → Agent → Tools → Response
   - Test with realistic mocked LLM responses
2. **Multi-Platform Gateway**
   - Message routing between platforms
   - Session persistence across platforms
3. **Tool + Environment Integration**
   - Terminal tool with different backends (local, docker, modal)
   - File operations with permission checks
4. **Skill Lifecycle Integration**
   - Skill installation → Registration → Execution → Update → Removal
5. **Memory + Honcho Integration**
   - Memory storage → Retrieval → Context injection

### 6.2 Failure Scenario Integration Tests

1. **LLM Provider Failover**
   - Primary provider down → Fallback provider
   - Rate limiting handling
2. **Gateway Reconnection**
   - Platform disconnect → Reconnect → Resume session
3. **Tool Execution Failures**
   - Tool timeout → Retry → Fallback
   - Tool error → Error handling → User notification
4. **Checkpoint Recovery**
   - Crash during batch → Resume from checkpoint
   - Corrupted checkpoint handling

### 6.3 Security Integration Tests

1. **Prompt Injection Across Stack**
   - Gateway input → Agent processing → Tool execution
2. **Permission Escalation Prevention**
   - User permissions → Tool allowlist → Execution
3. **Data Leak Prevention**
   - Memory storage → Context building → Response generation

---

## 7. Performance Test Strategy

### 7.1 Load Testing Requirements

1. **Gateway Load Tests**
   - Concurrent session handling
   - Message throughput per platform
   - Memory usage under load
2. **Agent Response Time Tests**
   - End-to-end latency benchmarks
   - Tool execution time budgets
   - Context building performance
3. **Resource Utilization Tests**
   - Memory leaks in long-running sessions
   - File descriptor limits
   - CPU usage patterns

### 7.2 Benchmark Framework

```python
# Proposed performance test structure (pytest-benchmark's `benchmark` fixture)
class TestGatewayPerformance:
    @pytest.mark.benchmark
    def test_message_throughput(self, benchmark):
        # Measure messages processed per second
        pass

    @pytest.mark.benchmark
    def test_session_creation_latency(self, benchmark):
        # Measure session setup time
        pass
```

### 7.3 Performance Regression Detection

1. **Baseline Establishment**
   - Record baseline metrics for critical paths
   - Store them in version control
2. **Automated Comparison**
   - Compare PR performance against the baseline
   - Fail if degradation exceeds 10%
3. **Metrics to Track**
   - Test suite execution time
   - Peak memory usage
   - Individual test durations

---

## 8. Test Infrastructure Improvements

### 8.1 Coverage Tooling

**Missing:** code coverage reporting.

**Recommendation:** add `pytest-cov` to the dev dependencies:

```toml
[project.optional-dependencies]
dev = [
    "pytest>=9.0.2,<10",
    "pytest-asyncio>=1.3.0,<2",
    "pytest-xdist>=3.0,<4",
    "pytest-cov>=5.0,<6",  # Add this
    "mcp>=1.2.0,<2",
]
```

### 8.2 Test Categories

Add more pytest markers for selective test running:

```toml
# pyproject.toml (or the equivalent keys in pytest.ini)
[tool.pytest.ini_options]
markers = [
    "integration: marks tests requiring external services",
    "slow: marks slow tests (>5s)",
    "security: marks security-focused tests",
    "benchmark: marks performance benchmark tests",
    "flaky: marks tests that may be unstable",
]
```

### 8.3 Test Data Factory

Create centralized test data factories:

```python
# tests/factories.py
from unittest.mock import MagicMock

class AgentFactory:
    @staticmethod
    def create_mock_agent(tools=None):
        # Return a configured mock agent
        agent = MagicMock()
        agent.tools = tools or []
        return agent

class MessageFactory:
    @staticmethod
    def create_user_message(content):
        # Return a formatted user message
        return {"role": "user", "content": content}
```

---

## 9. Summary & Action Items

### Immediate Actions (High Impact)

1. **Add coverage reporting** to the CI pipeline
2. **Create tests for uncovered security-critical modules:**
   - `tools/code_execution_tool.py`
   - `tools/browser_tool.py`
   - `tools/terminal_tool.py`
3. **Split oversized test files** for better maintainability
4. **Add Gemini adapter tests** (an increasingly important provider)

### Short-Term (1-2 Sprints)

5. Create integration tests for cross-component flows
6. Add performance benchmarks for critical paths
7. Expand OpenRouter client test coverage
8. Add knowledge ingester tests

### Long-Term (Quarter)

9. Achieve 80% code coverage across all modules
10. Implement performance regression testing
11. Create a comprehensive security test suite
12. Document testing patterns and best practices

---

## Appendix: Test File Size Distribution

| Lines | Count | Category |
|-------|-------|----------|
| 0-100 | ~50 | Simple unit tests |
| 100-500 | ~200 | Standard test files |
| 500-1000 | ~80 | Complex feature tests |
| 1000-2000 | ~30 | Large test suites |
| 2000+ | ~13 | Monolithic test files (need splitting) |

---

*Analysis generated: March 30, 2026*
*Total test files analyzed: 373*
*Estimated test functions: ~4,311*
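As a closing illustration of the `SimpleNamespace` response-object pattern mentioned in Section 3.2, here is a minimal sketch. The `choices[0].message.content` shape mimics a generic OpenAI-style chat completion and is an assumption for illustration, not code lifted from the hermes-agent suite:

```python
from types import SimpleNamespace

def fake_chat_response(text: str) -> SimpleNamespace:
    """Build an object that quacks like a chat-completion response.

    The attribute layout here is a generic OpenAI-style shape, assumed
    for illustration only.
    """
    message = SimpleNamespace(role="assistant", content=text)
    choice = SimpleNamespace(index=0, message=message, finish_reason="stop")
    return SimpleNamespace(model="fake-model", choices=[choice])

# A test reads attributes exactly as production code would on a real
# client response, with no patch() wiring required:
response = fake_chat_response("hello")
assert response.choices[0].message.content == "hello"
```

One advantage over a bare `MagicMock`: `SimpleNamespace` raises `AttributeError` on unknown attributes, so these fakes fail loudly when production code drifts away from the expected response shape.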
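The LLM provider failover scenario from Section 6.2 can be exercised without touching real APIs. The sketch below uses a hypothetical `call_with_failover` helper standing in for the project's actual routing logic, with plain callables as fake providers:

```python
def call_with_failover(providers, prompt):
    """Return the first successful provider reply, trying each in order.

    `providers` is a list of callables; this helper is hypothetical and
    only stands in for the project's real provider-routing code.
    """
    errors = []
    for provider in providers:
        try:
            return provider(prompt)
        except ConnectionError as exc:
            errors.append(exc)
    raise RuntimeError(f"all providers failed: {errors!r}")

# In a test, the primary is forced down and the fallback answers:
def primary(_prompt):
    raise ConnectionError("primary unavailable")

def fallback(prompt):
    return f"echo: {prompt}"

assert call_with_failover([primary, fallback], "hi") == "echo: hi"
```

The same harness covers the all-providers-down path by asserting that `call_with_failover([primary], ...)` raises, which is the kind of failure-scenario coverage Section 6.2 calls for.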
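The baseline-comparison gate proposed in Section 7.3 can be prototyped in a few lines. The JSON baseline format and the 10% tolerance below are illustrative choices, not an existing CI artifact of this repository:

```python
import json
from pathlib import Path

TOLERANCE = 0.10  # fail the gate when a metric degrades by more than 10%

def find_regressions(baseline_path, current_metrics):
    """Return {metric: (baseline, current)} for every regressed metric.

    The baseline file maps metric names to numbers, e.g.
    {"suite_seconds": 100.0, "peak_mb": 512.0}; this format is assumed
    for illustration.
    """
    baseline = json.loads(Path(baseline_path).read_text())
    regressions = {}
    for name, base in baseline.items():
        current = current_metrics.get(name)
        if current is not None and current > base * (1 + TOLERANCE):
            regressions[name] = (base, current)
    return regressions
```

A CI step would collect current metrics, call `find_regressions`, and fail the build if the returned dict is non-empty, printing the baseline/current pairs for the offending metrics.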