feat(prompting): self-optimized GPT/Codex tool-use guidance via automated behavioral benchmarking (#6120)

Hermes Agent identified and patched its own prompting blind spots through automated self-evaluation — running 64+ tool-use benchmarks across GPT-5.4 and Codex-5.3, diagnosing 5 failure modes, writing targeted prompt patches, and verifying the fix in a closed loop. Failure modes discovered and fixed: - Mental arithmetic (wrong answers: 39,152,053 vs correct 39,151,253) - User profile hallucination ('Windows 11' when running on Linux) - Time guessing without verification - Clarification-seeking instead of acting ('open where?' for port checks) - Hash computation from memory (SHA-256, encodings) - Confusing system RAM with agent's own persistent memory store Two new XML sections added to OPENAI_MODEL_EXECUTION_GUIDANCE: - <mandatory_tool_use>: explicit categories that must always use tools - <act_dont_ask>: default to action on obvious interpretations Results: gpt-5.4: 68.8% → 100% tool compliance (+31.2pp) gpt-5.3-codex: 62.5% → 100% tool compliance (+37.5pp) Regression: 0/8 conversational prompts over-tooled
2026-04-08 04:06:42 -07:00
parent 1368caf66f
commit c8a5e36be8
1 changed files with 24 additions and 0 deletions
--- a/agent/prompt_builder.py
+++ b/agent/prompt_builder.py
@@ -204,6 +204,30 @@ OPENAI_MODEL_EXECUTION_GUIDANCE = (
    "the result.\n"
    "</tool_persistence>\n"
    "\n"
+    "<mandatory_tool_use>\n"
+    "NEVER answer these from memory or mental computation — ALWAYS use a tool:\n"
+    "- Arithmetic, math, calculations → use terminal or execute_code\n"
+    "- Hashes, encodings, checksums → use terminal (e.g. sha256sum, base64)\n"
+    "- Current time, date, timezone → use terminal (e.g. date)\n"
+    "- System state: OS, CPU, memory, disk, ports, processes → use terminal\n"
+    "- File contents, sizes, line counts → use read_file, search_files, or terminal\n"
+    "- Git history, branches, diffs → use terminal\n"
+    "- Current facts (weather, news, versions) → use web_search\n"
+    "Your memory and user profile describe the USER, not the system you are "
+    "running on. The execution environment may differ from what the user profile "
+    "says about their personal setup.\n"
+    "</mandatory_tool_use>\n"
+    "\n"
+    "<act_dont_ask>\n"
+    "When a question has an obvious default interpretation, act on it immediately "
+    "instead of asking for clarification. Examples:\n"
+    "- 'Is port 443 open?' → check THIS machine (don't ask 'open where?')\n"
+    "- 'What OS am I running?' → check the live system (don't use user profile)\n"
+    "- 'What time is it?' → run `date` (don't guess)\n"
+    "Only ask for clarification when the ambiguity genuinely changes what tool "
+    "you would call.\n"
+    "</act_dont_ask>\n"
+    "\n"
    "<prerequisite_checks>\n"
    "- Before taking an action, check whether prerequisite discovery, lookup, or "
    "context-gathering steps are needed.\n"