From c8a5e36be8f59eba491d9b319a5842fc389a528b Mon Sep 17 00:00:00 2001 From: Teknium <127238744+teknium1@users.noreply.github.com> Date: Wed, 8 Apr 2026 04:06:42 -0700 Subject: [PATCH] feat(prompting): self-optimized GPT/Codex tool-use guidance via automated behavioral benchmarking (#6120) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Hermes Agent identified and patched its own prompting blind spots through automated self-evaluation — running 64+ tool-use benchmarks across GPT-5.4 and Codex-5.3, diagnosing 5 failure modes, writing targeted prompt patches, and verifying the fix in a closed loop. Failure modes discovered and fixed: - Mental arithmetic (wrong answers: 39,152,053 vs correct 39,151,253) - User profile hallucination ('Windows 11' when running on Linux) - Time guessing without verification - Clarification-seeking instead of acting ('open where?' for port checks) - Hash computation from memory (SHA-256, encodings) - Confusing system RAM with agent's own persistent memory store Two new XML sections added to OPENAI_MODEL_EXECUTION_GUIDANCE: - : explicit categories that must always use tools - : default to action on obvious interpretations Results: gpt-5.4: 68.8% → 100% tool compliance (+31.2pp) gpt-5.3-codex: 62.5% → 100% tool compliance (+37.5pp) Regression: 0/8 conversational prompts over-tooled --- agent/prompt_builder.py | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/agent/prompt_builder.py b/agent/prompt_builder.py index df5532e12..b1b0891f5 100644 --- a/agent/prompt_builder.py +++ b/agent/prompt_builder.py @@ -204,6 +204,30 @@ OPENAI_MODEL_EXECUTION_GUIDANCE = ( "the result.\n" "\n" "\n" + "\n" + "NEVER answer these from memory or mental computation — ALWAYS use a tool:\n" + "- Arithmetic, math, calculations → use terminal or execute_code\n" + "- Hashes, encodings, checksums → use terminal (e.g. sha256sum, base64)\n" + "- Current time, date, timezone → use terminal (e.g. date)\n" + "- System state: OS, CPU, memory, disk, ports, processes → use terminal\n" + "- File contents, sizes, line counts → use read_file, search_files, or terminal\n" + "- Git history, branches, diffs → use terminal\n" + "- Current facts (weather, news, versions) → use web_search\n" + "Your memory and user profile describe the USER, not the system you are " + "running on. The execution environment may differ from what the user profile " + "says about their personal setup.\n" + "\n" + "\n" + "\n" + "When a question has an obvious default interpretation, act on it immediately " + "instead of asking for clarification. Examples:\n" + "- 'Is port 443 open?' → check THIS machine (don't ask 'open where?')\n" + "- 'What OS am I running?' → check the live system (don't use user profile)\n" + "- 'What time is it?' → run `date` (don't guess)\n" + "Only ask for clarification when the ambiguity genuinely changes what tool " + "you would call.\n" + "\n" + "\n" "\n" "- Before taking an action, check whether prerequisite discovery, lookup, or " "context-gathering steps are needed.\n"