[Research] Prompt Caching Optimization — Existing Implementation Audit & Optimization Plan #851

New Issue

Timmy · 2026-04-16T02:16:37Z

Timmy commented

2026-04-16 02:16:37 +00:00

Research Report: Prompt Caching Optimization

Backlog Item: #7 — Prompt Caching Optimization (Ratio: 2.0)
Research Date: 2026-04-16
Researcher: Hermes Overnight Scout (cron job)

TL;DR

Prompt caching is already extensively implemented in hermes-agent. The codebase has sophisticated support for Anthropic, OpenAI, Qwen portal, and Ollama automatic prefix caching. The primary optimization opportunity is routing more workloads to local Ollama where we measured 28.2x speedup on exact prefix matches.

Key Findings

Empirical benchmarks (Apple M3 Max, Ollama v0.20.2, gemma4:8b Q4_K_M):

Exact prefix cache hit: 28.2x speedup (1,245ms → 44ms for 657 tokens)
Multi-turn conversation: prompt processing drops to 2.4-3.8% of total time after first turn
Model switch evicts cache: switching back costs ~1.2s cold start
Each turn in a cached conversation saves ~1,200ms of prompt processing

Current implementation (already in codebase):

✅ Anthropic cache_control breakpoints (75% input cost savings)
✅ OpenAI prompt_cache_key parameter
✅ OpenRouter Claude passthrough caching
✅ Qwen portal cache_control injection
✅ Ollama automatic prefix matching (28x speedup)
✅ System prompt stability architecture (never rebuilt mid-session)
✅ Context injection into user messages (preserves system prefix)
✅ Deterministic tool call IDs (preserves OpenAI cache)
✅ Cache hit/miss logging infrastructure

Recommended Actions (Priority Order)

Route cron jobs through Ollama — 47+ jobs could benefit from automatic prefix caching. Savings: ~1.2s/job/turn, ~5.4 hours/month recovered compute.
Verify Nous Research API caching — Default provider may not support prompt_tokens_details.cached_tokens. Test and document.
Add nightly cache hit rate report — Logging exists. Add a cron job to report cache percentages per provider.
Tune smart_model_routing thresholds — Route more simple queries to Ollama for cache benefits.
Set Ollama keep_alive to 24h — Prevent cold starts on frequently-used models.

Cost Impact

Ollama (local): Free, 28x speedup, ~5.4 hours/month recovered
Anthropic (cloud): 75% input token savings on multi-turn conversations
Nous Research: TBD — needs verification

Full Research Brief

See attached detailed research brief for complete benchmarks, architecture analysis, and implementation recommendations.

This issue documents existing implementation and identifies optimization opportunities. No code changes required in hermes-agent — the caching architecture is solid. Work is operational: expand Ollama routing, verify provider support, monitor cache rates.

## Research Report: Prompt Caching Optimization **Backlog Item**: #7 — Prompt Caching Optimization (Ratio: 2.0) **Research Date**: 2026-04-16 **Researcher**: Hermes Overnight Scout (cron job) ### TL;DR Prompt caching is **already extensively implemented** in hermes-agent. The codebase has sophisticated support for Anthropic, OpenAI, Qwen portal, and Ollama automatic prefix caching. The primary optimization opportunity is **routing more workloads to local Ollama** where we measured **28.2x speedup** on exact prefix matches. ### Key Findings **Empirical benchmarks (Apple M3 Max, Ollama v0.20.2, gemma4:8b Q4_K_M):** - Exact prefix cache hit: **28.2x speedup** (1,245ms → 44ms for 657 tokens) - Multi-turn conversation: prompt processing drops to **2.4-3.8% of total time** after first turn - Model switch evicts cache: switching back costs ~1.2s cold start - Each turn in a cached conversation saves ~1,200ms of prompt processing **Current implementation (already in codebase):** - ✅ Anthropic cache_control breakpoints (75% input cost savings) - ✅ OpenAI prompt_cache_key parameter - ✅ OpenRouter Claude passthrough caching - ✅ Qwen portal cache_control injection - ✅ Ollama automatic prefix matching (28x speedup) - ✅ System prompt stability architecture (never rebuilt mid-session) - ✅ Context injection into user messages (preserves system prefix) - ✅ Deterministic tool call IDs (preserves OpenAI cache) - ✅ Cache hit/miss logging infrastructure ### Recommended Actions (Priority Order) 1. **Route cron jobs through Ollama** — 47+ jobs could benefit from automatic prefix caching. Savings: ~1.2s/job/turn, ~5.4 hours/month recovered compute. 2. **Verify Nous Research API caching** — Default provider may not support `prompt_tokens_details.cached_tokens`. Test and document. 3. **Add nightly cache hit rate report** — Logging exists. Add a cron job to report cache percentages per provider. 4. **Tune smart_model_routing thresholds** — Route more simple queries to Ollama for cache benefits. 5. **Set Ollama keep_alive to 24h** — Prevent cold starts on frequently-used models. ### Cost Impact - **Ollama (local)**: Free, 28x speedup, ~5.4 hours/month recovered - **Anthropic (cloud)**: 75% input token savings on multi-turn conversations - **Nous Research**: TBD — needs verification ### Full Research Brief See attached detailed research brief for complete benchmarks, architecture analysis, and implementation recommendations. --- *This issue documents existing implementation and identifies optimization opportunities. No code changes required in hermes-agent — the caching architecture is solid. Work is operational: expand Ollama routing, verify provider support, monitor cache rates.*

codex-agent was assigned by Rockachopa

2026-04-17 01:34:33 +00:00

Rockachopa commented

2026-04-22 15:49:01 +00:00

Verified on a fresh clone of current forge main that prompt caching is already implemented in-repo and the issue body is describing an operational optimization lane, not a missing code slice.

Evidence checked:

agent/prompt_caching.py exists and implements Anthropic cache_control breakpoints
run_agent.py already wires _use_prompt_caching, apply_anthropic_cache_control, OpenAI prompt_cache_key, xAI x-grok-conv-id, and Ollama context/prefix-cache support
docs already cover caching in website/docs/developer-guide/context-compression-and-caching.md and website/docs/integrations/providers.md
targeted verification passed on fresh main: python3 -m py_compile agent/prompt_caching.py tests/agent/test_prompt_caching.py run_agent.py and pytest -q tests/agent/test_prompt_caching.py (14 passed)

Conclusion: no truthful hermes-agent code delta remains for #851. The remaining work is operational (route more cron workloads to Ollama, verify provider cache support, add cache-rate monitoring), so I am stopping without opening a duplicate PR.

Verified on a fresh clone of current forge `main` that prompt caching is already implemented in-repo and the issue body is describing an operational optimization lane, not a missing code slice. Evidence checked: - `agent/prompt_caching.py` exists and implements Anthropic `cache_control` breakpoints - `run_agent.py` already wires `_use_prompt_caching`, `apply_anthropic_cache_control`, OpenAI `prompt_cache_key`, xAI `x-grok-conv-id`, and Ollama context/prefix-cache support - docs already cover caching in `website/docs/developer-guide/context-compression-and-caching.md` and `website/docs/integrations/providers.md` - targeted verification passed on fresh main: `python3 -m py_compile agent/prompt_caching.py tests/agent/test_prompt_caching.py run_agent.py` and `pytest -q tests/agent/test_prompt_caching.py` (14 passed) Conclusion: no truthful hermes-agent code delta remains for #851. The remaining work is operational (route more cron workloads to Ollama, verify provider cache support, add cache-rate monitoring), so I am stopping without opening a duplicate PR.

sonnet referenced this issue from a commit

2026-04-22 15:50:59 +00:00

docs: verify prompt caching optimization is already implemented (#851)

Rockachopa referenced a pull request that will close this issue

2026-04-22 15:51:27 +00:00

docs: verify prompt caching optimization is already implemented (#851) #1044

Rockachopa commented

2026-04-22 15:52:07 +00:00

PR #1044 created for #851.

What landed:

added docs/issue-851-verification.md documenting that the prompt-caching architecture described in the issue already exists on main
captured evidence for Anthropic/OpenRouter cache-control breakpoints, OpenAI/Codex prompt_cache_key, system-prompt stability, and cache hit/miss logging
kept the repo delta truthful: the issue's own report says no new prompt-caching implementation is required

Verification:

PYTHONPATH=/tmp/BURN2-FORGE-ALPHA-3 python3 -m pytest -q tests/agent/test_prompt_caching.py
PYTHONPATH=/tmp/BURN2-FORGE-ALPHA-3 python3 -m py_compile agent/prompt_caching.py run_agent.py

PR #1044 created for #851. What landed: - added `docs/issue-851-verification.md` documenting that the prompt-caching architecture described in the issue already exists on `main` - captured evidence for Anthropic/OpenRouter cache-control breakpoints, OpenAI/Codex `prompt_cache_key`, system-prompt stability, and cache hit/miss logging - kept the repo delta truthful: the issue's own report says no new prompt-caching implementation is required Verification: - `PYTHONPATH=/tmp/BURN2-FORGE-ALPHA-3 python3 -m pytest -q tests/agent/test_prompt_caching.py` - `PYTHONPATH=/tmp/BURN2-FORGE-ALPHA-3 python3 -m py_compile agent/prompt_caching.py run_agent.py`

Sign in to join this conversation.

Branches Tags

main

burn/1148-1777638825

step35/1123-adopt-lightrag-telegram-skil

step35/744-fix-gateway-reconnect-drops

step35/711-bug-crisis-hook-logs-warning

step35/858-feat-n8n-workflow-orchestrat

step35/955-qa-verify-kittentts-local-pr

step35/706-p2-crisis-detection-needs-mu

step35/745-feat-provider-cost-estimator

step35/749-feat-batch-tool-execution-wi

step35/754-feat-skill-dependency-resolv

step35/734-mcp-add-pid-file-lock-to-pre

step35/980-triage-live-memory-provider

step35/713-bug-llama-server-down-on-por

step35/712-anthropic-cybersecurity-skil

step35/708-eval-test-atlas-inference-en

step35/702-improvement-crisis-detection

step35/1120-hackathon-creative-submissio

step35/1063-release-proof-104-cluster-a

step35/1065-release-proof-104-cluster-c

step35/1124-integrate-self-improving-her

step35/990-bug-mempalace-plugin-still-f

step35/1064-release-proof-104-cluster-b

step35/837-poka-yoke-auto-revert-incomp

step35/1055-release-proof-cluster-c-comp

fix/cluster-b-oneshot-proof

step35/1054-release-proof-cluster-b-gate

step35/695-p0-crisis-detection-resource

fix/1071-oneshot-empty-response-v2

step35/1056-release-proof-cluster-d-api

step35/1060-safety-regression-packet

step35/1058-release-proof-cluster-f-runt

step35/857-feat-lightrag-integration-fo

step35/779-tests-acp-test-collection-fa

step35/740-feat-wire-crisis-detection-i

step35/707-test-crisis-hook-needs-integ

fix/1071-oneshot-empty-response

step35/981-bug-holographic-prefetch-mis

step35/1053-release-proof-cluster-a-disc

step35/1059-release-proof-cluster-g-mess

step35/781-structured-output-enforcemen

fix/817

claude/issue-1040

fix/851

fix/796

fix/862

fix/950

fix/868

fix/860

fix/952

fix/877

fix/878

claude/issue-1013

fix/951

fix/801

fix/804

fix/840

fix/1010

fix/1011

fix/799

fix/1013

fix/966

fix/949

fix/955

claude/issue-990

fix/1012

fix/954

fix/953

fix/797

fix/956

fix/960

fix/958

fix/957

fix/959

claude/issue-985

step35/1051-browser-use-audit

feat/sovereign-accountant-agent-1776866068545

feat/budgetary-router-1776864510362

feat/sovereign-teleport-1776864503956

fix/842

burn/hermes-agent-334

fix/837

fix/836

fix/982

fix/800

fix/835

fix/853

fix/859

fix/857

burn/hermes-agent-132

burn/hermes-agent-323

fix/876

burn/hermes-agent-327

burn/983-context-snapshots

fix/798

burn/858-1776835055

burn/805-1776835051

claude/issue-962

claude/issue-961

feat/context-rag-decision-framework-rebased

fix/967

feat/autonomous-scavenger-fix-1776827712502

feat/sovereign-scavenger-1776827259631

feat/greenlight-protocol-1776827237329

feat/static-analyzer-gofai-1776826921747

feat/local-inference-bridge-1776826896029

feat/symbolic-verify-gofai-1776826842170

feat/impact-analysis-tool-1776826592325

feat/fleet-concurrency-guard-1776826501792

feat/robust-local-completion-1776824630428

claude/issue-816

claude/issue-806

claude/issue-923

burn/924-1776770102

burn/922-1776770102

burn/925-1776770102

fix/879

fix/938

fix/880

fix/865-ci-path-linter

burn/921-poka-yoke-hardcoded-paths

burn/923-1776770102

claude/issue-926

feat/web-console-394

fix/922

fix/923

fix/924

fix/926

fix/925

feat/security-deep-dive-1776732106631

fix/921-hardcoded-path-guard

fix/format-error-missing

fix/885-circuit-breaker

fix/delete-stale-prioritized-issues

fix/916

feat/session-management-1776700585635

feat/shield-multilingual-1776700482647

fix/913-syntax-validation

claude/issue-892

fix/886

fix/888

feat/queue-health-check

fix/887-path-validation-read-file

fix/839-1776403070

fix/889

fix/902-cron-audit

fix/891-profile-isolation

fix/time-aware-routing-889

fix/838-1776402240

feat/673

fix/892

fix/890-dead-cron-jobs

feat/888-ast-parse-syntax-validation

fix/816

claude/issue-834

fix/813

fix/803

fix/701

fix/read_file-not-found-hint

fix/806

fix/673

fix/681

fix/679

burn/838-1776304432

fix/865

fix/agent-card-discovery-1776311100051

burn/842-1776304432

burn/817-1776304433

fix/707-crisis-integration-test

claude/issue-809

burn/835-1776304433

burn/809-1776304433

burn/838-context-budget-tracker

fix/748-session-compaction

fix/749-batch-tool-execution

burn/836-1776304433

burn/798-1776304433

burn/834-1776304433

fix/839

fix/795

fix/749-v2

fix/834-1776303915

fix/834

fix/819

feat/822-a2a-health

feat/vitalik-secure-llm-1776303263

fix/711-crisis-hook-log-level

fix/712

claude/issue-705

fix/141-crisis-tool

fix/706

fix/702-multilingual-crisis-detection

burn/804-1776264500

fix/713

fix/796-tool-call-benchmark

fix/708

feat/a2a-health-check-805

burn/796-tool-call-benchmark

feat/802-a2a-agent-card

feat/crisis-protocol-1776270957872

burn/800-gemma4-multimodal

feat/806-a2a-mtls

fix/resolve-397-conflict-1776303120

feat/robust-tool-orchestration-1776268138150

fix/658

burn/781-1776263880

fix/779

fix/781-json-repair

fix/659

fix/660

fix/662

fix/664

fix/665

feat/43-context-rag-decision-framework

fix/667

fix/672

fix/742

claude/issue-695

fix/666

fix/744-gateway-cron-notification-drop

fix/663

fix/670

fix/740

fix/741

fix/752

fix/753

fix/755

fix/754

fix/222

fix/746

fix/cron-delivery-retry-744

fix/743

fix/756

fix/747

fix/744

fix/705

fix/745

fix/668-api

fix/748

fix/749

fix/695

fix/725

fix/669

fix/issue-734

fix/734

burn/714-1776218235

burn/713-1776218256

burn/714-1776218190

fix/693-crisis-notification-integration

fix/issue-642-1

fix/issue-702-7

fix/issue-701-8

fix/issue-692-3

fix/issue-706-6

fix/issue-694-1

fix/issue-645-6

fix/issue-707-5

fix/issue-708-4

fix/issue-643-8

fix/issue-711-3

fix/issue-714-1

fix/issue-713-2

fix/issue-644-7

fix/cron-schedule-parse-error

feat/670-approval-tiers

feat/679-crisis-wiring

feat/674-atlas-inference-engine

feat/atlas-provider

feat/671-hybrid-search

fix/677-crisis-hook

feat/673-988-crisis-escalation

feat/672-soul-crisis-protocol

fix/673-crisis-hook-integration

fix/677-crisis-hook-integration

fix/672-crisis-protocol

feat/681-path-aware-risk

feat/671-hybrid-search-router

feat/667-context-faithful-prompting

fix/670-approval-tiers

feat/672-crisis-protocol

feat/673-988-lifeline

burn/validate-action-pokayoke

fix/693-test-branch

perf/lazy-session-creation

fix/624-error-context

fix/626-validate-feedback

fix/614-multilingual-shield

claude/issue-628

claude/issue-613

dispatch/350-1776180746

dispatch/295-1776180746

dispatch/329-1776180746

dispatch/372-1776180746

dispatch/327-1776180746

dispatch/326-1776180746

dispatch/296-1776180746

dispatch/375-1776180746

dispatch/321-1776180746

dispatch/324-1776180746

claude/issue-592

fix/582-shield-tool-args

claude/issue-579

claude/issue-565

dawn/295-1776130053

triage/295-1776129677

claude/issue-556

q/295-1776129480

dawn/326-1776130053

fix/538-context-pressure-threshold

fix/561-ssh-dispatch

dawn/322-1776130053

dawn/372-1776130053

triage/326-1776129677

dawn/350-1776130053

dawn/378-1776130053

dawn/329-1776130053

triage/322-1776129677

triage/372-1776129677

dawn/327-1776130053

feat/505-session-templates

triage/378-1776129677

triage/350-1776129677

q/372-1776129480

triage/329-1776129677

triage/327-1776129677

q/322-1776129480

q/378-1776129480

dawn/296-1776130053

fix/500-cloud-context-warning

queue/372-1776129201

dawn/324-1776130053

q/327-1776129480

q/329-1776129480

queue/378-1776129201

dawn/288-1776130053

q/350-1776129480

triage/296-1776129677

triage/324-1776129677

fix/499-hardcoded-paths

q/324-1776129480

q/296-1776129480

queue/322-1776129201

queue/327-1776129201

queue/324-1776129201

q/326-1776129480

queue/296-1776129201

queue/350-1776129201

fix/478-tilde-expand

fix/478-tilde-expansion

queue/329-1776129201

fix/478-hermes-home-tilde-expand

fix/468-cron-cloud-context

fix/479-optional-skills-hardcoded-paths

fix/479-hardcoded-paths

q/316-1776129677

q/288-1776129480

feat/334-profile-scoped-cron

whip/326-1776128804

dawn/375-1776130053

fix/375-deploy-crons-model-provider-comparison

whip/324-1776128804

burn/456-1776129600

triage/375-1776129677

fix/468-1776128804

q/375-1776129480

whip/372-1776128804

queue/375-1776129201

queue/288-1776129201

fix/457-ssh-dispatch-validation

whip/350-1776128804

whip/251-1776128804

fix/456-cloud-context-warning

queue/321-1776129201

whip/378-1776128804

whip/322-1776128804

whip/327-1776128804

whip/329-1776128804

whip/288-1776128804

whip/327-1776127281

whip/375-1776128804

whip/321-1776128804

dispatch/288-1776180746

triage/288-1776129677

whip/296-1776128804

whip/293-1776127532

whip/350-1776127532

whip/372-1776127532

whip/378-1776127532

whip/329-1776127532

whip/322-1776127532

whip/316-1776127532

whip/321-1776127532

whip/326-1776127532

whip/327-1776127532

whip/314-1776127532

whip/288-1776127532

whip/375-1776127532

burn/379-1776125702

burn/373-1776125702

queue/326-1776129201

burn/373-1776120221

burn/378-1776117791

burn/375-1776117778

burn/379-1776117790

burn/376-1776117777

burn/377-1776117775

burn/acp-272-1776117838

burn/web-console-325

burn/profile-cron-334

burn/prompt-injection-324

burn/skill-revert-295

burn/context-overflow-296

burn/honcho-eval-322

burn/privacy-filter-283

burn/model-benchmark-287

feat/marathon-session-limits-326

fix/poka-yoke-hardcoded-paths

feat/315-session-gc

feature/time-aware-model-routing-317

fix/gateway-config-debt-328

feat/lazy-session-creation

burn/378-1776120221

fix/weak-credential-guard

fix/tool-return-type-validation

fix/memory-no-match-not-error

feat/temporal-decay-holographic-memory

fix/syntax-preflight-execute-code

fix/cron-script-failure-detection

fix/empty-model-preflight

fix/cron-sync-guard-v2

fix/cron-interpreter-shutdown-352

feat/error-circuit-breaker

fix/circuit-breaker-error-cascade

feat/cron-run-now

ci/fix-mempalace-syntax

claude/issue-351

fix/cron-tick-backlog

feat/deploy-sync-guard

feat/20260413-cron-agent-kwargs

feat/profile-scoped-cron

fix/cron-ticker-startup

fix/empirical-audit-hardening

feat/skills-index-workflow

fix/credential-guard

feat/research-paper-scaffolder

feat/cron-tool-choice-propagation

perplexity/provider-allowlist

fix/json-repair-for-tool-calls

feat/context-rag-decision-framework

census/feature-inventory

fix/ci-stability

keymaxx/mimoomni/243

feature/improve-sovereignty-justification

claude/issue-1135

feat/mempalace-portal-1775695506634

feat/ci-no-duplicate-models

feat/mempalace-tool-1775642243437

fix/kimi-fallback-model

perplexity/mempalace-tests

upstream-sync

claude/issue-192

claude/issue-190

rescue/ollama-provider

rescue/v011-obfuscation-fix

claw-code/issue-151

claw-code/issue-126

groq/issue-168

timmy/issue-169-ollama-provider

gemini/issue-24

claude/issue-128

claude/issue-142

claude/issue-133

claude/issue-143

claude/issue-146

claude/issue-155

claude/issue-147

claude/issue-148

claude/issue-149

epic-999-phase-ii-forge

timmy/issue-123-process-resilience

timmy/issue-116-config-validation

epic-999-phase-i

security/v-011-skills-guard-bypass

gemini/security-hardening

gemini/sovereign-gitea-client

timmy-custom

security/fix-oauth-session-fixation

security/fix-skills-path-traversal

security/fix-file-toctou

security/fix-error-disclosure

security/add-rate-limiting

security/fix-browser-cdp

security/fix-docker-privilege

security/fix-auth-bypass

fix/sqlite-contention

tests/security-coverage

security/fix-race-condition

security/fix-ssrf

security/fix-secret-leakage

feat/gen-ai-evolution-phases-19-21

feat/gen-ai-evolution-phases-16-18

feat/gen-ai-evolution-phases-13-15

security/fix-path-traversal

security/fix-command-injection

feat/gen-ai-evolution-phases-10-12

feat/gen-ai-evolution-phases-7-9

feat/gen-ai-evolution-phases-4-6

feat/gen-ai-evolution-phases-1-3

feat/sovereign-evolution-redistribution

feat/apparatus-verification

feat/sovereign-intersymbolic-ai

feat/sovereign-learning-system

feat/sovereign-reasoning-engine

2 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: Timmy_Foundation/hermes-agent#851