Hermes Agent - Testing Infrastructure Deep Analysis

Executive Summary

The hermes-agent project has a comprehensive test suite of 373 test files containing roughly 4,300 test functions. The tests are organized into 10 subdirectories covering all major components.


1. Test Suite Structure & Statistics

1.1 Directory Breakdown

Directory                  Test Files  Focus Area
tests/tools/                86         Tool implementations, file operations, environments
tests/gateway/              96         Platform integrations (Discord, Telegram, Slack, etc.)
tests/hermes_cli/           48         CLI commands, configuration, setup flows
tests/agent/                16         Core agent logic, prompt building, model adapters
tests/integration/           8         End-to-end integration tests
tests/acp/                   8         Agent Communication Protocol
tests/cron/                  3         Cron job scheduling
tests/skills/                5         Skill management
tests/honcho_integration/    5         Honcho memory integration
tests/fakes/                 2         Test fixtures and fake servers
Total                      373         ~4,311 test functions

(The total exceeds the subdirectory sum because it also counts test files at the tests/ root, e.g. tests/test_run_agent.py.)

1.2 Test Classification

Unit Tests: ~95% (3,600+)
Integration Tests: ~5% (marked with @pytest.mark.integration)
Async Tests: ~679 tests use @pytest.mark.asyncio

1.3 Largest Test Files (by line count)

  1. tests/test_run_agent.py - 3,329 lines (212 tests) - Core agent logic
  2. tests/tools/test_mcp_tool.py - 2,902 lines (147 tests) - MCP protocol
  3. tests/gateway/test_voice_command.py - 2,632 lines - Voice features
  4. tests/gateway/test_feishu.py - 2,580 lines - Feishu platform
  5. tests/gateway/test_api_server.py - 1,503 lines - API server

2. Coverage Heat Map - Critical Gaps Identified

2.1 NO TEST COVERAGE (Red Zone)

Agent Module Gaps:

  • agent/copilot_acp_client.py - Copilot integration (0 tests)
  • agent/gemini_adapter.py - Google Gemini model support (0 tests)
  • agent/knowledge_ingester.py - Knowledge ingestion (0 tests)
  • agent/meta_reasoning.py - Meta-reasoning capabilities (0 tests)
  • agent/skill_utils.py - Skill utilities (0 tests)
  • agent/trajectory.py - Trajectory management (0 tests)

Tools Module Gaps:

  • tools/browser_tool.py - Browser automation (0 tests)
  • tools/code_execution_tool.py - Code execution (0 tests)
  • tools/gitea_client.py - Gitea integration (0 tests)
  • tools/image_generation_tool.py - Image generation (0 tests)
  • tools/neutts_synth.py - Neural TTS (0 tests)
  • tools/openrouter_client.py - OpenRouter API (0 tests)
  • tools/session_search_tool.py - Session search (0 tests)
  • tools/terminal_tool.py - Terminal operations (0 tests)
  • tools/tts_tool.py - Text-to-speech (0 tests)
  • tools/web_tools.py - Web tools core (0 tests)

Gateway Module Gaps:

  • gateway/run.py - Gateway runner (0 tests)
  • gateway/stream_consumer.py - Stream consumption (0 tests)

Root-Level Gaps:

  • hermes_constants.py - Constants (0 tests)
  • hermes_time.py - Time utilities (0 tests)
  • mini_swe_runner.py - SWE runner (0 tests)
  • rl_cli.py - RL CLI (0 tests)
  • utils.py - Utilities (0 tests)

2.2 LIMITED COVERAGE (Yellow Zone)

  • agent/models_dev.py - Only 19 tests for complex model routing
  • agent/smart_model_routing.py - Only 6 tests
  • tools/approval.py - only 2 test files for complex approval logic
  • tools/skills_guard.py - Security-critical, needs more coverage

2.3 GOOD COVERAGE (Green Zone)

  • agent/anthropic_adapter.py - 97 tests (comprehensive)
  • agent/prompt_builder.py - 108 tests (excellent)
  • tools/mcp_tool.py - 147 tests (very comprehensive)
  • tools/file_tools.py - Multiple test files
  • gateway/discord.py - 11 test files covering various aspects
  • gateway/telegram.py - 10 test files
  • gateway/session.py - 15 test files

3. Test Patterns Analysis

3.1 Fixtures Architecture

Global Fixtures (conftest.py):

  • _isolate_hermes_home - Isolates HERMES_HOME to temp directory (autouse)
  • _ensure_current_event_loop - Event loop management for sync tests (autouse)
  • _enforce_test_timeout - 30-second timeout per test (autouse)
  • tmp_dir - Temporary directory fixture
  • mock_config - Minimal hermes config for unit tests

Common Patterns:

import pytest
from unittest.mock import patch

# Isolation pattern
@pytest.fixture(autouse=True)
def isolate_env(tmp_path, monkeypatch):
    monkeypatch.setenv("HERMES_HOME", str(tmp_path))

# Mock client pattern
@pytest.fixture
def mock_agent():
    with patch("run_agent.OpenAI") as mock:
        yield mock

3.2 Mock Usage Statistics

  • ~12,468 mock/patch usages across the test suite
  • Heavy use of unittest.mock.patch and MagicMock
  • AsyncMock used for async function mocking
  • SimpleNamespace for creating mock API response objects
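As a concrete illustration of these patterns, an AsyncMock can stand in for an async client while SimpleNamespace mimics an attribute-style response object (the ping/pong names here are illustrative, not from the codebase):

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

# SimpleNamespace gives attribute access like a real API response object;
# AsyncMock makes client.send awaitable and records the call.
fake_response = SimpleNamespace(status=200, text="pong")
client = AsyncMock()
client.send.return_value = fake_response

async def ping(c):
    resp = await c.send("ping")
    return resp.text

result = asyncio.run(ping(client))
print(result)  # pong
```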

3.3 Test Organization Patterns

Class-Based Organization:

  • 1,532 test classes identified
  • Grouped by functionality: Test<Feature><Scenario>
  • Example: TestSanitizeApiMessages, TestContextPressureFlags

Function-Based Organization:

  • Used for simpler test files
  • Naming: test_<feature>_<scenario>

3.4 Async Test Patterns

import pytest  # the asyncio marker requires the pytest-asyncio plugin

@pytest.mark.asyncio
async def test_async_function():
    result = await async_function()
    assert result == expected

4. 20 New Test Recommendations (Priority Order)

Critical Priority (Security/Risk)

  1. Browser Tool Security Tests (tools/browser_tool.py)

    • Test sandbox escape prevention
    • Test malicious script blocking
    • Test content security policy enforcement
  2. Code Execution Sandbox Tests (tools/code_execution_tool.py)

    • Test resource limits (CPU, memory)
    • Test dangerous import blocking
    • Test timeout enforcement
    • Test filesystem access restrictions
  3. Terminal Tool Safety Tests (tools/terminal_tool.py)

    • Test dangerous command blocking
    • Test command injection prevention
    • Test environment variable sanitization
  4. OpenRouter Client Tests (tools/openrouter_client.py)

    • Test API key handling
    • Test rate limit handling
    • Test error response parsing
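The list-based execution pattern these security tests would verify can be sketched as follows (run_safely is a hypothetical stand-in, not the project's actual terminal-tool API):

```python
import shlex
import subprocess
import sys

# Hedged sketch: with shlex.split() + shell=False, shell metacharacters in
# the command string become literal argv entries and are never interpreted.
def run_safely(command: str) -> subprocess.CompletedProcess:
    argv = shlex.split(command)  # tokenize without invoking a shell
    return subprocess.run(argv, capture_output=True, text=True, shell=False)

# The trailing ";rm -rf /" is delivered as a literal argument, not executed.
result = run_safely(f"{sys.executable} -c 'print(1)' ';rm -rf /'")
print(result.stdout.strip())  # 1
```

A security test would assert that such payloads produce no side effects and that the process exits cleanly.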

High Priority (Core Functionality)

  1. Gemini Adapter Tests (agent/gemini_adapter.py)

    • Test message format conversion
    • Test tool call normalization
    • Test streaming response handling
  2. Copilot ACP Client Tests (agent/copilot_acp_client.py)

    • Test authentication flow
    • Test session management
    • Test message passing
  3. Knowledge Ingester Tests (agent/knowledge_ingester.py)

    • Test document parsing
    • Test embedding generation
    • Test knowledge retrieval
  4. Stream Consumer Tests (gateway/stream_consumer.py)

    • Test backpressure handling
    • Test reconnection logic
    • Test message ordering guarantees
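The backpressure and ordering guarantees above can be sketched with a bounded asyncio.Queue (stand-in coroutines, not the real stream consumer):

```python
import asyncio

# A bounded queue makes a fast producer suspend until the consumer drains,
# and items are delivered in order: the two properties such tests assert.
async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    received = []

    async def producer():
        for i in range(5):
            await queue.put(i)  # suspends here whenever the queue is full
        await queue.put(None)   # sentinel: end of stream

    async def consumer():
        while (item := await queue.get()) is not None:
            received.append(item)

    await asyncio.gather(producer(), consumer())
    return received

items = asyncio.run(demo())
print(items)  # [0, 1, 2, 3, 4]
```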

Medium Priority (Integration/Features)

  1. Web Tools Core Tests (tools/web_tools.py)

    • Test search result parsing
    • Test content extraction
    • Test error handling for unavailable services
  2. Image Generation Tool Tests (tools/image_generation_tool.py)

    • Test prompt filtering
    • Test image format handling
    • Test provider failover
  3. Gitea Client Tests (tools/gitea_client.py)

    • Test repository operations
    • Test webhook handling
    • Test authentication
  4. Session Search Tool Tests (tools/session_search_tool.py)

    • Test query parsing
    • Test result ranking
    • Test pagination
  5. Meta Reasoning Tests (agent/meta_reasoning.py)

    • Test strategy selection
    • Test reflection generation
    • Test learning from failures
  6. TTS Tool Tests (tools/tts_tool.py)

    • Test voice selection
    • Test audio format conversion
    • Test streaming playback
  7. Neural TTS Tests (tools/neutts_synth.py)

    • Test voice cloning safety
    • Test audio quality validation
    • Test resource cleanup

Lower Priority (Utilities)

  1. Hermes Constants Tests (hermes_constants.py)

    • Test constant values
    • Test environment-specific overrides
  2. Time Utilities Tests (hermes_time.py)

    • Test timezone handling
    • Test formatting functions
  3. Utils Module Tests (utils.py)

    • Test helper functions
    • Test validation utilities
  4. Mini SWE Runner Tests (mini_swe_runner.py)

    • Test repository setup
    • Test test execution
    • Test result parsing
  5. RL CLI Tests (rl_cli.py)

    • Test training command parsing
    • Test configuration validation
    • Test checkpoint handling

5. Test Optimization Opportunities

5.1 Performance Issues Identified

Large Test Files (Split Recommended):

  • tests/test_run_agent.py (3,329 lines) → Split into multiple files
  • tests/tools/test_mcp_tool.py (2,902 lines) → Split by MCP feature
  • tests/test_anthropic_adapter.py (1,219 lines) → Consider splitting

Potential Slow Tests:

  • Integration tests with real API calls
  • Tests with file I/O operations
  • Tests with subprocess spawning

5.2 Optimization Recommendations

  1. Parallel Execution Already Configured

    • pytest-xdist with -n auto in CI
    • Maintains isolation through fixtures
  2. Fixture Scope Optimization

    • Review autouse=True fixtures for necessity
    • Consider session-scoped fixtures for expensive setup
  3. Mock External Services

    • Some integration tests still hit real APIs
    • Create more fakes like fake_ha_server.py
  4. Test Data Management

    • Use factory pattern for test data generation
    • Share test fixtures across related tests
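A minimal in-process fake server, in the spirit of fake_ha_server.py (whose actual interface is not shown here), could look like:

```python
import http.server
import json
import threading
import urllib.request

# Serves a canned JSON payload on an ephemeral port so tests never
# touch the real external service.
class FakeHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), FakeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)
print(payload)  # {'status': 'ok'}
server.shutdown()
```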

5.3 CI/CD Optimizations

Current CI (.github/workflows/tests.yml):

  • Uses uv for fast dependency installation
  • Runs with -n auto for parallelization
  • Ignores integration tests by default
  • 10-minute timeout

Recommended Improvements:

  1. Add test duration reporting (--durations=10)
  2. Add coverage reporting
  3. Separate fast unit tests from slower integration tests
  4. Add flaky test retry mechanism

6. Missing Integration Test Scenarios

6.1 Cross-Component Integration

  1. End-to-End Agent Flow

    • User message → Gateway → Agent → Tools → Response
    • Test with real (mocked) LLM responses
  2. Multi-Platform Gateway

    • Message routing between platforms
    • Session persistence across platforms
  3. Tool + Environment Integration

    • Terminal tool with different backends (local, docker, modal)
    • File operations with permission checks
  4. Skill Lifecycle Integration

    • Skill installation → Registration → Execution → Update → Removal
  5. Memory + Honcho Integration

    • Memory storage → Retrieval → Context injection

6.2 Failure Scenario Integration Tests

  1. LLM Provider Failover

    • Primary provider down → Fallback provider
    • Rate limiting handling
  2. Gateway Reconnection

    • Platform disconnect → Reconnect → Resume session
  3. Tool Execution Failures

    • Tool timeout → Retry → Fallback
    • Tool error → Error handling → User notification
  4. Checkpoint Recovery

    • Crash during batch → Resume from checkpoint
    • Corrupted checkpoint handling
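The timeout -> retry -> fallback flow above can be sketched with stand-in coroutines (the real tool interfaces are not assumed):

```python
import asyncio

# Retry the primary a fixed number of times on timeout, then fall back.
async def call_with_fallback(primary, fallback, retries: int = 2):
    for _ in range(retries):
        try:
            return await primary()
        except TimeoutError:
            continue  # retry the primary tool
    return await fallback()  # all retries exhausted

async def demo() -> str:
    async def flaky_primary():
        raise TimeoutError("tool timed out")

    async def fallback():
        return "fallback result"

    return await call_with_fallback(flaky_primary, fallback)

outcome = asyncio.run(demo())
print(outcome)  # fallback result
```

An integration test would additionally assert that the user is notified of the degraded path.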

6.3 Security Integration Tests

  1. Prompt Injection Across Stack

    • Gateway input → Agent processing → Tool execution
  2. Permission Escalation Prevention

    • User permissions → Tool allowlist → Execution
  3. Data Leak Prevention

    • Memory storage → Context building → Response generation

7. Performance Test Strategy

7.1 Load Testing Requirements

  1. Gateway Load Tests

    • Concurrent session handling
    • Message throughput per platform
    • Memory usage under load
  2. Agent Response Time Tests

    • End-to-end latency benchmarks
    • Tool execution time budgets
    • Context building performance
  3. Resource Utilization Tests

    • Memory leaks in long-running sessions
    • File descriptor limits
    • CPU usage patterns

7.2 Benchmark Framework

# Proposed performance test structure
# (the `benchmark` fixture comes from the pytest-benchmark plugin,
#  which would need to be added to the dev dependencies)
class TestGatewayPerformance:
    @pytest.mark.benchmark
    def test_message_throughput(self, benchmark):
        # Measure messages processed per second
        pass

    @pytest.mark.benchmark
    def test_session_creation_latency(self, benchmark):
        # Measure session setup time
        pass

7.3 Performance Regression Detection

  1. Baseline Establishment

    • Record baseline metrics for critical paths
    • Store in version control
  2. Automated Comparison

    • Compare PR performance against baseline
    • Fail if degradation > 10%
  3. Metrics to Track

    • Test suite execution time
    • Memory peak usage
    • Individual test durations
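The >10% gate described above reduces to a simple comparison; the helper below is illustrative (metric names and storage format would come from the baseline files):

```python
# Flag a regression when the current metric exceeds the stored baseline
# by more than the degradation budget (10% by default).
def is_regression(baseline: float, current: float, budget: float = 0.10) -> bool:
    return current > baseline * (1.0 + budget)

print(is_regression(100.0, 115.0))  # True  (15% slower)
print(is_regression(100.0, 105.0))  # False (within the 10% budget)
```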

8. Test Infrastructure Improvements

8.1 Coverage Tooling

Missing: code coverage reporting
Recommendation: add pytest-cov to the dev dependencies

[project.optional-dependencies]
dev = [
    "pytest>=9.0.2,<10",
    "pytest-asyncio>=1.3.0,<2",
    "pytest-xdist>=3.0,<4",
    "pytest-cov>=5.0,<6",  # Add this
    "mcp>=1.2.0,<2"
]
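With pytest-cov installed, coverage could be enabled for every run via addopts; this is a sketch (`--cov=.` avoids assuming a package name, so adjust it to the real source layout):

```toml
[tool.pytest.ini_options]
addopts = "--cov=. --cov-report=term-missing"
```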

8.2 Test Categories

Add more pytest markers for selective test running:

# In pytest.ini or pyproject.toml
markers = [
    "integration: marks tests requiring external services",
    "slow: marks slow tests (>5s)",
    "security: marks security-focused tests",
    "benchmark: marks performance benchmark tests",
    "flakey: marks tests that may be unstable",
]

8.3 Test Data Factory

Create centralized test data factories:

# tests/factories.py
from types import SimpleNamespace
from unittest.mock import MagicMock

class AgentFactory:
    @staticmethod
    def create_mock_agent(tools=None):
        # Return a configured mock agent with a default empty toolset
        agent = MagicMock()
        agent.tools = tools or []
        return agent

class MessageFactory:
    @staticmethod
    def create_user_message(content):
        # Return a user message in the chat-completions format
        return {"role": "user", "content": content}

9. Summary & Action Items

Immediate Actions (High Impact)

  1. Add coverage reporting to CI pipeline
  2. Create tests for uncovered security-critical modules:
    • tools/code_execution_tool.py
    • tools/browser_tool.py
    • tools/terminal_tool.py
  3. Split oversized test files for better maintainability
  4. Add Gemini adapter tests (increasingly important provider)

Short-term (1-2 Sprints)

  1. Create integration tests for cross-component flows
  2. Add performance benchmarks for critical paths
  3. Expand OpenRouter client test coverage
  4. Add knowledge ingester tests

Long-term (Quarter)

  1. Achieve 80% code coverage across all modules
  2. Implement performance regression testing
  3. Create comprehensive security test suite
  4. Document testing patterns and best practices

Appendix: Test File Size Distribution

Lines      Count  Category
0-100      ~50    Simple unit tests
100-500    ~200   Standard test files
500-1000   ~80    Complex feature tests
1000-2000  ~30    Large test suites
2000+      ~13    Monolithic test files (need splitting)

Analysis generated: March 30, 2026
Total test files analyzed: 373
Estimated test functions: ~4,311