[Research] Prompt Caching Optimization — Existing Implementation Audit & Optimization Plan #851

Open
opened 2026-04-16 02:16:37 +00:00 by Timmy · 2 comments
Owner

Research Report: Prompt Caching Optimization

Backlog Item: #7 — Prompt Caching Optimization (Ratio: 2.0)
Research Date: 2026-04-16
Researcher: Hermes Overnight Scout (cron job)

TL;DR

Prompt caching is already extensively implemented in hermes-agent. The codebase has sophisticated support for Anthropic, OpenAI, Qwen portal, and Ollama automatic prefix caching. The primary optimization opportunity is routing more workloads to local Ollama where we measured 28.2x speedup on exact prefix matches.

Key Findings

Empirical benchmarks (Apple M3 Max, Ollama v0.20.2, gemma4:8b Q4_K_M):

  • Exact prefix cache hit: 28.2x speedup (1,245ms → 44ms for 657 tokens)
  • Multi-turn conversation: prompt processing drops to 2.4-3.8% of total time after first turn
  • Model switch evicts cache: switching back costs ~1.2s cold start
  • Each turn in a cached conversation saves ~1,200ms of prompt processing

Current implementation (already in codebase):

  • Anthropic cache_control breakpoints (75% input cost savings)
  • OpenAI prompt_cache_key parameter
  • OpenRouter Claude passthrough caching
  • Qwen portal cache_control injection
  • Ollama automatic prefix matching (28x speedup)
  • System prompt stability architecture (never rebuilt mid-session)
  • Context injection into user messages (preserves system prefix)
  • Deterministic tool call IDs (preserves OpenAI cache)
  • Cache hit/miss logging infrastructure
  1. Route cron jobs through Ollama — 47+ jobs could benefit from automatic prefix caching. Savings: ~1.2s/job/turn, ~5.4 hours/month recovered compute.

  2. Verify Nous Research API caching — Default provider may not support prompt_tokens_details.cached_tokens. Test and document.

  3. Add nightly cache hit rate report — Logging exists. Add a cron job to report cache percentages per provider.

  4. Tune smart_model_routing thresholds — Route more simple queries to Ollama for cache benefits.

  5. Set Ollama keep_alive to 24h — Prevent cold starts on frequently-used models.

Cost Impact

  • Ollama (local): Free, 28x speedup, ~5.4 hours/month recovered
  • Anthropic (cloud): 75% input token savings on multi-turn conversations
  • Nous Research: TBD — needs verification

Full Research Brief

See attached detailed research brief for complete benchmarks, architecture analysis, and implementation recommendations.


This issue documents existing implementation and identifies optimization opportunities. No code changes required in hermes-agent — the caching architecture is solid. Work is operational: expand Ollama routing, verify provider support, monitor cache rates.

## Research Report: Prompt Caching Optimization **Backlog Item**: #7 — Prompt Caching Optimization (Ratio: 2.0) **Research Date**: 2026-04-16 **Researcher**: Hermes Overnight Scout (cron job) ### TL;DR Prompt caching is **already extensively implemented** in hermes-agent. The codebase has sophisticated support for Anthropic, OpenAI, Qwen portal, and Ollama automatic prefix caching. The primary optimization opportunity is **routing more workloads to local Ollama** where we measured **28.2x speedup** on exact prefix matches. ### Key Findings **Empirical benchmarks (Apple M3 Max, Ollama v0.20.2, gemma4:8b Q4_K_M):** - Exact prefix cache hit: **28.2x speedup** (1,245ms → 44ms for 657 tokens) - Multi-turn conversation: prompt processing drops to **2.4-3.8% of total time** after first turn - Model switch evicts cache: switching back costs ~1.2s cold start - Each turn in a cached conversation saves ~1,200ms of prompt processing **Current implementation (already in codebase):** - ✅ Anthropic cache_control breakpoints (75% input cost savings) - ✅ OpenAI prompt_cache_key parameter - ✅ OpenRouter Claude passthrough caching - ✅ Qwen portal cache_control injection - ✅ Ollama automatic prefix matching (28x speedup) - ✅ System prompt stability architecture (never rebuilt mid-session) - ✅ Context injection into user messages (preserves system prefix) - ✅ Deterministic tool call IDs (preserves OpenAI cache) - ✅ Cache hit/miss logging infrastructure ### Recommended Actions (Priority Order) 1. **Route cron jobs through Ollama** — 47+ jobs could benefit from automatic prefix caching. Savings: ~1.2s/job/turn, ~5.4 hours/month recovered compute. 2. **Verify Nous Research API caching** — Default provider may not support `prompt_tokens_details.cached_tokens`. Test and document. 3. **Add nightly cache hit rate report** — Logging exists. Add a cron job to report cache percentages per provider. 4. **Tune smart_model_routing thresholds** — Route more simple queries to Ollama for cache benefits. 5. **Set Ollama keep_alive to 24h** — Prevent cold starts on frequently-used models. ### Cost Impact - **Ollama (local)**: Free, 28x speedup, ~5.4 hours/month recovered - **Anthropic (cloud)**: 75% input token savings on multi-turn conversations - **Nous Research**: TBD — needs verification ### Full Research Brief See attached detailed research brief for complete benchmarks, architecture analysis, and implementation recommendations. --- *This issue documents existing implementation and identifies optimization opportunities. No code changes required in hermes-agent — the caching architecture is solid. Work is operational: expand Ollama routing, verify provider support, monitor cache rates.*
codex-agent was assigned by Rockachopa 2026-04-17 01:34:33 +00:00
Owner

Verified on a fresh clone of current forge main that prompt caching is already implemented in-repo and the issue body is describing an operational optimization lane, not a missing code slice.

Evidence checked:

  • agent/prompt_caching.py exists and implements Anthropic cache_control breakpoints
  • run_agent.py already wires _use_prompt_caching, apply_anthropic_cache_control, OpenAI prompt_cache_key, xAI x-grok-conv-id, and Ollama context/prefix-cache support
  • docs already cover caching in website/docs/developer-guide/context-compression-and-caching.md and website/docs/integrations/providers.md
  • targeted verification passed on fresh main: python3 -m py_compile agent/prompt_caching.py tests/agent/test_prompt_caching.py run_agent.py and pytest -q tests/agent/test_prompt_caching.py (14 passed)

Conclusion: no truthful hermes-agent code delta remains for #851. The remaining work is operational (route more cron workloads to Ollama, verify provider cache support, add cache-rate monitoring), so I am stopping without opening a duplicate PR.

Verified on a fresh clone of current forge `main` that prompt caching is already implemented in-repo and the issue body is describing an operational optimization lane, not a missing code slice. Evidence checked: - `agent/prompt_caching.py` exists and implements Anthropic `cache_control` breakpoints - `run_agent.py` already wires `_use_prompt_caching`, `apply_anthropic_cache_control`, OpenAI `prompt_cache_key`, xAI `x-grok-conv-id`, and Ollama context/prefix-cache support - docs already cover caching in `website/docs/developer-guide/context-compression-and-caching.md` and `website/docs/integrations/providers.md` - targeted verification passed on fresh main: `python3 -m py_compile agent/prompt_caching.py tests/agent/test_prompt_caching.py run_agent.py` and `pytest -q tests/agent/test_prompt_caching.py` (14 passed) Conclusion: no truthful hermes-agent code delta remains for #851. The remaining work is operational (route more cron workloads to Ollama, verify provider cache support, add cache-rate monitoring), so I am stopping without opening a duplicate PR.
Owner

PR #1044 created for #851.

What landed:

  • added docs/issue-851-verification.md documenting that the prompt-caching architecture described in the issue already exists on main
  • captured evidence for Anthropic/OpenRouter cache-control breakpoints, OpenAI/Codex prompt_cache_key, system-prompt stability, and cache hit/miss logging
  • kept the repo delta truthful: the issue's own report says no new prompt-caching implementation is required

Verification:

  • PYTHONPATH=/tmp/BURN2-FORGE-ALPHA-3 python3 -m pytest -q tests/agent/test_prompt_caching.py
  • PYTHONPATH=/tmp/BURN2-FORGE-ALPHA-3 python3 -m py_compile agent/prompt_caching.py run_agent.py
PR #1044 created for #851. What landed: - added `docs/issue-851-verification.md` documenting that the prompt-caching architecture described in the issue already exists on `main` - captured evidence for Anthropic/OpenRouter cache-control breakpoints, OpenAI/Codex `prompt_cache_key`, system-prompt stability, and cache hit/miss logging - kept the repo delta truthful: the issue's own report says no new prompt-caching implementation is required Verification: - `PYTHONPATH=/tmp/BURN2-FORGE-ALPHA-3 python3 -m pytest -q tests/agent/test_prompt_caching.py` - `PYTHONPATH=/tmp/BURN2-FORGE-ALPHA-3 python3 -m py_compile agent/prompt_caching.py run_agent.py`
Sign in to join this conversation.
No Label
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: Timmy_Foundation/hermes-agent#851