Implement prompt caching and KV cache reuse for faster inference #85
Objective
Timmy's biggest bottleneck is inference speed. The system prompt + tool definitions are ~2000 tokens that get re-processed on every single request. Implement prompt caching so the KV cache for static content is reused.
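To make the cost concrete, a back-of-envelope estimate. Only the ~2000-token prefix comes from this ticket; the request volume and prefill throughput below are illustrative assumptions, not measurements:

```python
# Back-of-envelope: prefill work wasted re-processing the static prefix.
PREFIX_TOKENS = 2000        # system prompt + tool definitions (from the ticket)
REQUESTS_PER_DAY = 500      # assumed request volume
PREFILL_TOK_PER_SEC = 400   # assumed prefill throughput of the local server

wasted_tokens = PREFIX_TOKENS * REQUESTS_PER_DAY
wasted_seconds = wasted_tokens / PREFILL_TOK_PER_SEC

print(f"{wasted_tokens} prefix tokens re-processed per day")
print(f"~{wasted_seconds / 60:.0f} minutes of pure prefill wasted per day")
```

Under these assumptions that is a million redundant prefix tokens a day; caching the prefix makes that cost one-time instead of per-request.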
Approach
1. System Prompt Caching in llama-server
llama.cpp supports `--slot-save-path` to persist KV cache state. Strategy: save the slot's KV state for the static system prompt once, then restore it for each subsequent request.

2. Request Batching

Group the system prompt as a fixed prefix; only the user message varies. Configure llama-server's `--cache-reuse` (or equivalent) flags.

3. Stripped Prompt Variants
Create multiple system prompt tiers:

- `minimal`: just identity + current task (for simple tool calls)
- `standard`: identity + tools + brief context
- `full`: everything (for complex reasoning)

Route tasks to the appropriate tier based on complexity.
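Tier routing could start as a simple heuristic. A sketch, where the complexity signal (message length + tool usage) and the file layout `configs/prompt-tiers/<tier>.md` are assumptions of this example, not decided in the ticket:

```python
from pathlib import Path

# Assumed layout: configs/prompt-tiers/{minimal,standard,full}.md
TIER_DIR = Path("configs/prompt-tiers")

def choose_tier(task: str, needs_tools: bool) -> str:
    """Pick the smallest prompt tier that fits the task.

    The thresholds here are placeholders; a real router might also
    consider task type or past failure rates at each tier.
    """
    if not needs_tools and len(task) < 200:
        return "minimal"   # short, tool-free request
    if len(task) < 1000:
        return "standard"  # typical tool-calling request
    return "full"          # long/complex task: send everything

def load_system_prompt(tier: str) -> str:
    return (TIER_DIR / f"{tier}.md").read_text()

print(choose_tier("what time is it?", needs_tools=False))            # minimal
print(choose_tier("read foo.py and fix the bug", needs_tools=True))  # standard
```

Keeping each tier as a fixed file also keeps each tier's prefix byte-identical across requests, which is what makes the KV cache reuse in items 1–2 actually hit.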
Expected Impact
Deliverables
- `configs/prompt-tiers/`: minimal, standard, full system prompts
- `scripts/warmup_cache.py`: pre-warms the KV cache on startup

Acceptance Criteria
Role Transition
Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.
Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.
Timmy — implement KV cache reuse for your own llama-server. Create prompt tier templates (minimal/standard/full) and a warmup script. You're optimizing yourself.
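A minimal sketch of what the warmup script could look like, assuming llama-server's `/completion` endpoint; the server address, slot id, and prompt path are placeholders for this sketch:

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8080"   # placeholder llama-server address

def build_warmup_request(system_prompt: str, slot_id: int = 0) -> dict:
    """Payload that decodes the static prefix into a slot's KV cache.

    n_predict=0 requests no generated tokens (prefill only), and
    cache_prompt=True tells llama-server to keep the tokens cached.
    """
    return {
        "prompt": system_prompt,
        "n_predict": 0,        # prefill only, generate nothing
        "cache_prompt": True,  # keep the prompt's KV state in the slot
        "id_slot": slot_id,
    }

def warm_up(system_prompt: str) -> None:
    payload = json.dumps(build_warmup_request(system_prompt)).encode()
    req = urllib.request.Request(
        f"{SERVER}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()  # response body is ignored

# Usage (at startup, once the server is up):
#   warm_up(open("configs/prompt-tiers/full.md").read())
```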
🟠 KimiClaw picking up this task via heartbeat.
Backend: kimi/kimi-code (Moonshot AI)
Timestamp: 2026-03-30T21:55:05Z
🟠 KimiClaw picking up this task via heartbeat.
Backend: kimi/kimi-code (Moonshot AI)
Mode: Planning first (task is complex)
Timestamp: 2026-03-30T22:58:58Z
Rerouting this issue from the Kimi heartbeat to the Gemini code loop.
Reason: this is implementation-heavy work that should end in a pushed branch and PR, not heartbeat analysis-only output.
Actions taken: