Hermes Agent - Testing Infrastructure Deep Analysis

Executive Summary

The hermes-agent project has a comprehensive test suite of 373 test files containing roughly 4,300 test functions. The tests are organized into 10 subdirectories covering all major components.


1. Test Suite Structure & Statistics

1.1 Directory Breakdown

Directory                  Test Files  Focus Area
tests/tools/                86         Tool implementations, file operations, environments
tests/gateway/              96         Platform integrations (Discord, Telegram, Slack, etc.)
tests/hermes_cli/           48         CLI commands, configuration, setup flows
tests/agent/                16         Core agent logic, prompt building, model adapters
tests/integration/           8         End-to-end integration tests
tests/acp/                   8         Agent Communication Protocol
tests/cron/                  3         Cron job scheduling
tests/skills/                5         Skill management
tests/honcho_integration/    5         Honcho memory integration
tests/fakes/                 2         Test fixtures and fake servers
Total                      373         ~4,311 test functions

(The total exceeds the subdirectory sum because it also counts test files at the tests/ root, e.g. tests/test_run_agent.py.)

1.2 Test Classification

Unit Tests: ~95% (3,600+)
Integration Tests: ~5% (marked with @pytest.mark.integration)
Async Tests: ~679 tests use @pytest.mark.asyncio

1.3 Largest Test Files (by line count)

  1. tests/test_run_agent.py - 3,329 lines (212 tests) - Core agent logic
  2. tests/tools/test_mcp_tool.py - 2,902 lines (147 tests) - MCP protocol
  3. tests/gateway/test_voice_command.py - 2,632 lines - Voice features
  4. tests/gateway/test_feishu.py - 2,580 lines - Feishu platform
  5. tests/gateway/test_api_server.py - 1,503 lines - API server

2. Coverage Heat Map - Critical Gaps Identified

2.1 NO TEST COVERAGE (Red Zone)

Agent Module Gaps:

  • agent/copilot_acp_client.py - Copilot integration (0 tests)
  • agent/gemini_adapter.py - Google Gemini model support (0 tests)
  • agent/knowledge_ingester.py - Knowledge ingestion (0 tests)
  • agent/meta_reasoning.py - Meta-reasoning capabilities (0 tests)
  • agent/skill_utils.py - Skill utilities (0 tests)
  • agent/trajectory.py - Trajectory management (0 tests)

Tools Module Gaps:

  • tools/browser_tool.py - Browser automation (0 tests)
  • tools/code_execution_tool.py - Code execution (0 tests)
  • tools/gitea_client.py - Gitea integration (0 tests)
  • tools/image_generation_tool.py - Image generation (0 tests)
  • tools/neutts_synth.py - Neural TTS (0 tests)
  • tools/openrouter_client.py - OpenRouter API (0 tests)
  • tools/session_search_tool.py - Session search (0 tests)
  • tools/terminal_tool.py - Terminal operations (0 tests)
  • tools/tts_tool.py - Text-to-speech (0 tests)
  • tools/web_tools.py - Web tools core (0 tests)

Gateway Module Gaps:

  • gateway/run.py - Gateway runner (0 tests)
  • gateway/stream_consumer.py - Stream consumption (0 tests)

Root-Level Gaps:

  • hermes_constants.py - Constants (0 tests)
  • hermes_time.py - Time utilities (0 tests)
  • mini_swe_runner.py - SWE runner (0 tests)
  • rl_cli.py - RL CLI (0 tests)
  • utils.py - Utilities (0 tests)

2.2 LIMITED COVERAGE (Yellow Zone)

  • agent/models_dev.py - Only 19 tests for complex model routing
  • agent/smart_model_routing.py - Only 6 tests
  • tools/approval.py - only 2 test files for complex approval logic
  • tools/skills_guard.py - Security-critical, needs more coverage

2.3 GOOD COVERAGE (Green Zone)

  • agent/anthropic_adapter.py - 97 tests (comprehensive)
  • agent/prompt_builder.py - 108 tests (excellent)
  • tools/mcp_tool.py - 147 tests (very comprehensive)
  • tools/file_tools.py - Multiple test files
  • gateway/discord.py - 11 test files covering various aspects
  • gateway/telegram.py - 10 test files
  • gateway/session.py - 15 test files

3. Test Patterns Analysis

3.1 Fixtures Architecture

Global Fixtures (conftest.py):

  • _isolate_hermes_home - Isolates HERMES_HOME to temp directory (autouse)
  • _ensure_current_event_loop - Event loop management for sync tests (autouse)
  • _enforce_test_timeout - 30-second timeout per test (autouse)
  • tmp_dir - Temporary directory fixture
  • mock_config - Minimal hermes config for unit tests

Common Patterns:

import pytest
from unittest.mock import patch

# Isolation pattern
@pytest.fixture(autouse=True)
def isolate_env(tmp_path, monkeypatch):
    monkeypatch.setenv("HERMES_HOME", str(tmp_path))

# Mock client pattern
@pytest.fixture
def mock_agent():
    with patch("run_agent.OpenAI") as mock:
        yield mock

3.2 Mock Usage Statistics

  • ~12,468 mock/patch usages across the test suite
  • Heavy use of unittest.mock.patch and MagicMock
  • AsyncMock used for async function mocking
  • SimpleNamespace for creating mock API response objects
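As a concrete illustration of these patterns, an AsyncMock can stand in for an async client while SimpleNamespace mimics an attribute-style response object (the ping/pong names here are illustrative, not from the codebase):

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

# SimpleNamespace gives attribute access like a real API response object;
# AsyncMock makes client.send awaitable and records the call.
fake_response = SimpleNamespace(status=200, text="pong")
client = AsyncMock()
client.send.return_value = fake_response

async def ping(c):
    resp = await c.send("ping")
    return resp.text

result = asyncio.run(ping(client))
print(result)  # pong
```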

3.3 Test Organization Patterns

Class-Based Organization:

  • 1,532 test classes identified
  • Grouped by functionality: Test<Feature><Scenario>
  • Example: TestSanitizeApiMessages, TestContextPressureFlags

Function-Based Organization:

  • Used for simpler test files
  • Naming: test_<feature>_<scenario>

3.4 Async Test Patterns

import pytest  # the asyncio marker requires the pytest-asyncio plugin

@pytest.mark.asyncio
async def test_async_function():
    result = await async_function()
    assert result == expected

4. 20 New Test Recommendations (Priority Order)

Critical Priority (Security/Risk)

  1. Browser Tool Security Tests (tools/browser_tool.py)

    • Test sandbox escape prevention
    • Test malicious script blocking
    • Test content security policy enforcement
  2. Code Execution Sandbox Tests (tools/code_execution_tool.py)

    • Test resource limits (CPU, memory)
    • Test dangerous import blocking
    • Test timeout enforcement
    • Test filesystem access restrictions
  3. Terminal Tool Safety Tests (tools/terminal_tool.py)

    • Test dangerous command blocking
    • Test command injection prevention
    • Test environment variable sanitization
  4. OpenRouter Client Tests (tools/openrouter_client.py)

    • Test API key handling
    • Test rate limit handling
    • Test error response parsing
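The list-based execution pattern these security tests would verify can be sketched as follows (run_safely is a hypothetical stand-in, not the project's actual terminal-tool API):

```python
import shlex
import subprocess
import sys

# Hedged sketch: with shlex.split() + shell=False, shell metacharacters in
# the command string become literal argv entries and are never interpreted.
def run_safely(command: str) -> subprocess.CompletedProcess:
    argv = shlex.split(command)  # tokenize without invoking a shell
    return subprocess.run(argv, capture_output=True, text=True, shell=False)

# The trailing ";rm -rf /" is delivered as a literal argument, not executed.
result = run_safely(f"{sys.executable} -c 'print(1)' ';rm -rf /'")
print(result.stdout.strip())  # 1
```

A security test would assert that such payloads produce no side effects and that the process exits cleanly.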

High Priority (Core Functionality)

  1. Gemini Adapter Tests (agent/gemini_adapter.py)

    • Test message format conversion
    • Test tool call normalization
    • Test streaming response handling
  2. Copilot ACP Client Tests (agent/copilot_acp_client.py)

    • Test authentication flow
    • Test session management
    • Test message passing
  3. Knowledge Ingester Tests (agent/knowledge_ingester.py)

    • Test document parsing
    • Test embedding generation
    • Test knowledge retrieval
  4. Stream Consumer Tests (gateway/stream_consumer.py)

    • Test backpressure handling
    • Test reconnection logic
    • Test message ordering guarantees
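The backpressure and ordering guarantees above can be sketched with a bounded asyncio.Queue (stand-in coroutines, not the real stream consumer):

```python
import asyncio

# A bounded queue makes a fast producer suspend until the consumer drains,
# and items are delivered in order: the two properties such tests assert.
async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    received = []

    async def producer():
        for i in range(5):
            await queue.put(i)  # suspends here whenever the queue is full
        await queue.put(None)   # sentinel: end of stream

    async def consumer():
        while (item := await queue.get()) is not None:
            received.append(item)

    await asyncio.gather(producer(), consumer())
    return received

items = asyncio.run(demo())
print(items)  # [0, 1, 2, 3, 4]
```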

Medium Priority (Integration/Features)

  1. Web Tools Core Tests (tools/web_tools.py)

    • Test search result parsing
    • Test content extraction
    • Test error handling for unavailable services
  2. Image Generation Tool Tests (tools/image_generation_tool.py)

    • Test prompt filtering
    • Test image format handling
    • Test provider failover
  3. Gitea Client Tests (tools/gitea_client.py)

    • Test repository operations
    • Test webhook handling
    • Test authentication
  4. Session Search Tool Tests (tools/session_search_tool.py)

    • Test query parsing
    • Test result ranking
    • Test pagination
  5. Meta Reasoning Tests (agent/meta_reasoning.py)

    • Test strategy selection
    • Test reflection generation
    • Test learning from failures
  6. TTS Tool Tests (tools/tts_tool.py)

    • Test voice selection
    • Test audio format conversion
    • Test streaming playback
  7. Neural TTS Tests (tools/neutts_synth.py)

    • Test voice cloning safety
    • Test audio quality validation
    • Test resource cleanup

Lower Priority (Utilities)

  1. Hermes Constants Tests (hermes_constants.py)

    • Test constant values
    • Test environment-specific overrides
  2. Time Utilities Tests (hermes_time.py)

    • Test timezone handling
    • Test formatting functions
  3. Utils Module Tests (utils.py)

    • Test helper functions
    • Test validation utilities
  4. Mini SWE Runner Tests (mini_swe_runner.py)

    • Test repository setup
    • Test test execution
    • Test result parsing
  5. RL CLI Tests (rl_cli.py)

    • Test training command parsing
    • Test configuration validation
    • Test checkpoint handling

5. Test Optimization Opportunities

5.1 Performance Issues Identified

Large Test Files (Split Recommended):

  • tests/test_run_agent.py (3,329 lines) → Split into multiple files
  • tests/tools/test_mcp_tool.py (2,902 lines) → Split by MCP feature
  • tests/test_anthropic_adapter.py (1,219 lines) → Consider splitting

Potential Slow Tests:

  • Integration tests with real API calls
  • Tests with file I/O operations
  • Tests with subprocess spawning

5.2 Optimization Recommendations

  1. Parallel Execution Already Configured

    • pytest-xdist with -n auto in CI
    • Maintains isolation through fixtures
  2. Fixture Scope Optimization

    • Review autouse=True fixtures for necessity
    • Consider session-scoped fixtures for expensive setup
  3. Mock External Services

    • Some integration tests still hit real APIs
    • Create more fakes like fake_ha_server.py
  4. Test Data Management

    • Use factory pattern for test data generation
    • Share test fixtures across related tests
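A minimal in-process fake server, in the spirit of fake_ha_server.py (whose actual interface is not shown here), could look like:

```python
import http.server
import json
import threading
import urllib.request

# Serves a canned JSON payload on an ephemeral port so tests never
# touch the real external service.
class FakeHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), FakeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)
print(payload)  # {'status': 'ok'}
server.shutdown()
```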

5.3 CI/CD Optimizations

Current CI (.github/workflows/tests.yml):

  • Uses uv for fast dependency installation
  • Runs with -n auto for parallelization
  • Ignores integration tests by default
  • 10-minute timeout

Recommended Improvements:

  1. Add test duration reporting (--durations=10)
  2. Add coverage reporting
  3. Separate fast unit tests from slower integration tests
  4. Add flaky test retry mechanism

6. Missing Integration Test Scenarios

6.1 Cross-Component Integration

  1. End-to-End Agent Flow

    • User message → Gateway → Agent → Tools → Response
    • Test with real (mocked) LLM responses
  2. Multi-Platform Gateway

    • Message routing between platforms
    • Session persistence across platforms
  3. Tool + Environment Integration

    • Terminal tool with different backends (local, docker, modal)
    • File operations with permission checks
  4. Skill Lifecycle Integration

    • Skill installation → Registration → Execution → Update → Removal
  5. Memory + Honcho Integration

    • Memory storage → Retrieval → Context injection

6.2 Failure Scenario Integration Tests

  1. LLM Provider Failover

    • Primary provider down → Fallback provider
    • Rate limiting handling
  2. Gateway Reconnection

    • Platform disconnect → Reconnect → Resume session
  3. Tool Execution Failures

    • Tool timeout → Retry → Fallback
    • Tool error → Error handling → User notification
  4. Checkpoint Recovery

    • Crash during batch → Resume from checkpoint
    • Corrupted checkpoint handling
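The timeout -> retry -> fallback flow above can be sketched with stand-in coroutines (the real tool interfaces are not assumed):

```python
import asyncio

# Retry the primary a fixed number of times on timeout, then fall back.
async def call_with_fallback(primary, fallback, retries: int = 2):
    for _ in range(retries):
        try:
            return await primary()
        except TimeoutError:
            continue  # retry the primary tool
    return await fallback()  # all retries exhausted

async def demo() -> str:
    async def flaky_primary():
        raise TimeoutError("tool timed out")

    async def fallback():
        return "fallback result"

    return await call_with_fallback(flaky_primary, fallback)

outcome = asyncio.run(demo())
print(outcome)  # fallback result
```

An integration test would additionally assert that the user is notified of the degraded path.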

6.3 Security Integration Tests

  1. Prompt Injection Across Stack

    • Gateway input → Agent processing → Tool execution
  2. Permission Escalation Prevention

    • User permissions → Tool allowlist → Execution
  3. Data Leak Prevention

    • Memory storage → Context building → Response generation

7. Performance Test Strategy

7.1 Load Testing Requirements

  1. Gateway Load Tests

    • Concurrent session handling
    • Message throughput per platform
    • Memory usage under load
  2. Agent Response Time Tests

    • End-to-end latency benchmarks
    • Tool execution time budgets
    • Context building performance
  3. Resource Utilization Tests

    • Memory leaks in long-running sessions
    • File descriptor limits
    • CPU usage patterns

7.2 Benchmark Framework

# Proposed performance test structure
# (the `benchmark` fixture comes from the pytest-benchmark plugin,
#  which would need to be added to the dev dependencies)
class TestGatewayPerformance:
    @pytest.mark.benchmark
    def test_message_throughput(self, benchmark):
        # Measure messages processed per second
        pass

    @pytest.mark.benchmark
    def test_session_creation_latency(self, benchmark):
        # Measure session setup time
        pass

7.3 Performance Regression Detection

  1. Baseline Establishment

    • Record baseline metrics for critical paths
    • Store in version control
  2. Automated Comparison

    • Compare PR performance against baseline
    • Fail if degradation > 10%
  3. Metrics to Track

    • Test suite execution time
    • Memory peak usage
    • Individual test durations
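The >10% gate described above reduces to a simple comparison; the helper below is illustrative (metric names and storage format would come from the baseline files):

```python
# Flag a regression when the current metric exceeds the stored baseline
# by more than the degradation budget (10% by default).
def is_regression(baseline: float, current: float, budget: float = 0.10) -> bool:
    return current > baseline * (1.0 + budget)

print(is_regression(100.0, 115.0))  # True  (15% slower)
print(is_regression(100.0, 105.0))  # False (within the 10% budget)
```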

8. Test Infrastructure Improvements

8.1 Coverage Tooling

Missing: code coverage reporting
Recommendation: add pytest-cov to the dev dependencies

[project.optional-dependencies]
dev = [
    "pytest>=9.0.2,<10",
    "pytest-asyncio>=1.3.0,<2",
    "pytest-xdist>=3.0,<4",
    "pytest-cov>=5.0,<6",  # Add this
    "mcp>=1.2.0,<2"
]
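With pytest-cov installed, coverage could be enabled for every run via addopts; this is a sketch (`--cov=.` avoids assuming a package name, so adjust it to the real source layout):

```toml
[tool.pytest.ini_options]
addopts = "--cov=. --cov-report=term-missing"
```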

8.2 Test Categories

Add more pytest markers for selective test running:

# In pytest.ini or pyproject.toml
markers = [
    "integration: marks tests requiring external services",
    "slow: marks slow tests (>5s)",
    "security: marks security-focused tests",
    "benchmark: marks performance benchmark tests",
    "flakey: marks tests that may be unstable",
]

8.3 Test Data Factory

Create centralized test data factories:

# tests/factories.py
from types import SimpleNamespace
from unittest.mock import MagicMock

class AgentFactory:
    @staticmethod
    def create_mock_agent(tools=None):
        # Return a configured mock agent with a default empty toolset
        agent = MagicMock()
        agent.tools = tools or []
        return agent

class MessageFactory:
    @staticmethod
    def create_user_message(content):
        # Return a user message in the chat-completions format
        return {"role": "user", "content": content}

9. Summary & Action Items

Immediate Actions (High Impact)

  1. Add coverage reporting to CI pipeline
  2. Create tests for uncovered security-critical modules:
    • tools/code_execution_tool.py
    • tools/browser_tool.py
    • tools/terminal_tool.py
  3. Split oversized test files for better maintainability
  4. Add Gemini adapter tests (increasingly important provider)

Short-term (1-2 Sprints)

  1. Create integration tests for cross-component flows
  2. Add performance benchmarks for critical paths
  3. Expand OpenRouter client test coverage
  4. Add knowledge ingester tests

Long-term (Quarter)

  1. Achieve 80% code coverage across all modules
  2. Implement performance regression testing
  3. Create comprehensive security test suite
  4. Document testing patterns and best practices

Appendix: Test File Size Distribution

Lines      Count  Category
0-100      ~50    Simple unit tests
100-500    ~200   Standard test files
500-1000   ~80    Complex feature tests
1000-2000  ~30    Large test suites
2000+      ~13    Monolithic test files (need splitting)

Analysis generated: March 30, 2026
Total test files analyzed: 373
Estimated test functions: ~4,311