evaluate() was calling _llm_judge twice per item (once via
compute_reward, once directly), doubling the API cost for no benefit.
It now reads the correctness score from the buffer that compute_reward
already populates.
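A rough sketch of the de-duplicated path; the buffer name and helper
signature are illustrative, not the environment's actual API:

```python
# Illustrative only: assumes compute_reward appends its judge score to a
# list-typed metric buffer on the env (the buffer name is hypothetical).
async def _eval_item(self, item, agent_result):
    mark = len(self.correctness_buffer)
    reward = await self.compute_reward(item, agent_result)
    # Reuse the score compute_reward just appended instead of calling
    # _llm_judge a second time.
    correctness = self.correctness_buffer[mark]
    return correctness, reward
```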
Also: compute_reward appends to the training metric buffers during
eval, which would pollute the wandb training charts. evaluate() now
rolls back any buffer entries added during eval so the training
metrics stay clean.
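A minimal sketch of the rollback, assuming the metric buffers are
plain lists on the env (the buffer names here are hypothetical):

```python
import contextlib

@contextlib.contextmanager
def rollback_metric_buffers(env, names=("correctness_buffer", "tool_calls_buffer")):
    # Remember each buffer's length, then truncate back to it afterwards so
    # entries appended during eval never reach the wandb training charts.
    marks = {name: len(getattr(env, name)) for name in names}
    try:
        yield
    finally:
        for name, mark in marks.items():
            del getattr(env, name)[mark:]
```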
The evaluate method was doing a single-turn chat_completion with no
tools, which defeats the purpose of an agentic research benchmark.
Fixed to run the full HermesAgentLoop with the web_search and
web_extract tools.
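Roughly the new shape of the eval path. HermesAgentLoop's real
constructor and entry point may differ; treat every name below as an
assumption:

```python
# Import of HermesAgentLoop omitted; the path depends on the repo layout.
async def _evaluate_one(self, item):
    loop = HermesAgentLoop(              # hypothetical constructor args
        server=self.server,
        tools=["web_search", "web_extract"],
    )
    result = await loop.run(item["question"])  # assumed entry point
    # result is an AgentResult; score it through the same reward path as
    # training (with the buffer rollback described above).
    return await self.compute_reward(item, result)
```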
Results comparison (Claude Sonnet 4.5, FRAMES benchmark):
Without tools (broken): 0.56 mean correctness
With agent loop + tools: 1.00 mean correctness, 0.994 reward
New eval metrics: mean_correctness, mean_reward, mean_tool_calls,
tool_usage_rate — all logged via evaluate_log() in lighteval format.
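The aggregation itself is simple; a sketch of how the four metrics
could be computed from per-item stats (the helper name is illustrative,
and evaluate_log()'s exact signature is assumed):

```python
def summarize_eval(correctness, rewards, tool_calls):
    n = len(rewards)
    return {
        "mean_correctness": sum(correctness) / n,
        "mean_reward": sum(rewards) / n,
        "mean_tool_calls": sum(tool_calls) / n,
        # fraction of items where the agent made at least one tool call
        "tool_usage_rate": sum(1 for c in tool_calls if c > 0) / n,
    }
```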
AgentResult has .messages (list of dicts), not .final_response or
.tool_calls. Fixed compute_reward to extract the final response
and tool names from the message history.
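A sketch of the extraction, assuming OpenAI-style message dicts in
AgentResult.messages (the tool_calls layout in particular is an
assumption):

```python
def extract_final_response_and_tools(messages):
    # The last assistant message wins as the final response.
    final_response = next(
        (m.get("content") or "" for m in reversed(messages)
         if m.get("role") == "assistant"),
        "",
    )
    # Collect tool names from any assistant tool_calls entries.
    tool_names = [
        call["function"]["name"]
        for m in messages
        for call in (m.get("tool_calls") or [])
    ]
    return final_response, tool_names
```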
Verified with a live process-mode test:
- Agent used 7 tool calls (web_search, web_extract)
- Produced a 1106-char researched response about Winter Olympics
- Reward: 0.384 (partial correctness via LLM judge)
- JSONL output contains valid tokens, masks, scores, messages
The environment was merged while still missing several standard
components. Updated it to match the patterns established by the 82
existing Atropos environments and by our own HermesAgentBaseEnv
contract.
Added:
- WebResearchEnvConfig — custom Pydantic config with reward weights,
efficiency thresholds, eval settings, and dataset config, all tunable
via CLI/YAML without code changes (see the sketch after this list)
- config_init() classmethod — default server config (OpenRouter +
Claude) so the env works out of the box
- wandb_log() override — logs reward breakdown metrics (correctness,
tool_usage, efficiency, diversity, correct_rate, tool_usage_rate)
with proper buffer management and a super() call
- evaluate() — uses server.chat_completion instead of the broken
_run_agent_on_item() stub; logs via evaluate_log() for
lighteval-compatible output.
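For the config piece, a minimal sketch of the fields implied by the
flags at the end of this note. The real class extends the env's base
config (omitted here), and the eval/dataset fields are illustrative:

```python
from pydantic import BaseModel, Field

class WebResearchEnvConfig(BaseModel):  # real class extends the base env config
    correctness_weight: float = Field(0.6, description="weight on LLM-judge correctness")
    tool_usage_weight: float = Field(0.2, description="weight on tool-usage term")
    efficiency_weight: float = Field(0.2, description="weight on efficiency term")
    diversity_bonus: float = Field(0.1, description="bonus for using distinct tools")
    efficient_max_calls: int = Field(5, description="tool-call budget still counted as efficient")
    # Illustrative eval/dataset knobs; the actual field names may differ.
    eval_dataset: str = Field("frames", description="eval dataset identifier")
    eval_max_items: int = Field(100, description="cap on eval items per run")
```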
Fixed:
- Removed broken _run_agent_on_item() stub that returned empty results
- evaluate() now uses server.chat_completion (same pattern as
TerminalTestEnv) for actual model evaluation
- compute_reward reads tool calls from AgentResult properly
- LLM judge uses self.server.chat_completion instead of ctx
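The judge call now goes through the env's server handle. A sketch,
assuming the OpenAI-style chat_completion wrapper these environments
typically expose; the prompt and parsing are illustrative:

```python
async def _llm_judge(self, question: str, reference: str, answer: str) -> float:
    prompt = (
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Score correctness from 0 to 1. Reply with only the number."
    )
    completion = await self.server.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        n=1,
        max_tokens=8,
    )
    try:
        return float(completion.choices[0].message.content.strip())
    except (TypeError, ValueError):
        return 0.0  # unparseable judge output scores as incorrect
```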
Reward config is now tunable without code changes:
--env.correctness_weight 0.6
--env.tool_usage_weight 0.2
--env.efficiency_weight 0.2
--env.diversity_bonus 0.1
--env.efficient_max_calls 5
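For reference, one plausible way these knobs combine into the scalar
reward. The merged formula isn't reproduced in this description, so
the sub-term shapes below are guesses consistent with the flag names:

```python
def combine_reward(correctness, tool_calls, distinct_tools, cfg):
    # cfg is a WebResearchEnvConfig as sketched above.
    tool_usage = 1.0 if tool_calls > 0 else 0.0
    efficiency = (
        1.0 if tool_calls <= cfg.efficient_max_calls
        else cfg.efficient_max_calls / tool_calls
    )
    diversity = cfg.diversity_bonus if distinct_tools > 1 else 0.0
    return (
        cfg.correctness_weight * correctness
        + cfg.tool_usage_weight * tool_usage
        + cfg.efficiency_weight * efficiency
        + diversity
    )
```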