Implement prompt caching and KV cache reuse for faster inference #85

Open
opened 2026-03-30 15:24:19 +00:00 by Timmy · 4 comments
Owner

Objective

Timmy's biggest bottleneck is inference speed. The system prompt + tool definitions are ~2000 tokens that get re-processed on every single request. Implement prompt caching so the KV cache for static content is reused.

Approach

1. System Prompt Caching in llama-server

llama.cpp supports --slot-save-path to persist KV cache state. Strategy:

  • On startup, process the full system prompt once
  • Save the KV cache state for that prefix
  • Every new request starts from the cached state instead of re-processing
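The steps above could be sketched as a small warmup script (the eventual `scripts/warmup_cache.py`). This assumes llama-server is started with `--slot-save-path` so the `/slots/{id}?action=save` endpoint is available, and that the completion endpoint accepts `cache_prompt` / `id_slot` fields; the host/port, slot id, and filename are placeholders, and the exact field names should be checked against the installed llama.cpp version:

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8080"  # assumed llama-server address


def prime_payload(system_prompt: str, slot_id: int = 0) -> dict:
    """Request body that evaluates the prefix without generating any tokens."""
    return {
        "prompt": system_prompt,
        "n_predict": 0,        # tokenize/evaluate only, generate nothing
        "cache_prompt": True,  # keep the evaluated prefix in the slot's KV cache
        "id_slot": slot_id,
    }


def save_url(slot_id: int = 0) -> str:
    """Endpoint that persists a slot's KV cache (requires --slot-save-path)."""
    return f"{SERVER}/slots/{slot_id}?action=save"


def warm_cache(system_prompt: str, slot_id: int = 0,
               filename: str = "sysprompt.bin") -> None:
    """Process the system prompt once, then save that slot's KV state to disk."""
    def post(url: str, payload: dict) -> None:
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()

    post(f"{SERVER}/completion", prime_payload(system_prompt, slot_id))
    post(save_url(slot_id), {"filename": filename})
```

On service restart the saved state can be restored with the matching `?action=restore` endpoint instead of re-evaluating the prompt.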

2. Request Batching

Group the system prompt as a fixed prefix. Only the user message varies. Configure llama-server's --cache-reuse or equivalent flags.
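A minimal sketch of building requests this way, assuming the llama-server completion API's per-request `cache_prompt` field (function name and joining format are illustrative). The key constraint is that prefix reuse only works when the cached tokens match the new prompt token-for-token, so the system prompt must be byte-identical on every request:

```python
def build_request(system_prompt: str, user_message: str,
                  n_predict: int = 256) -> dict:
    """Build a completion request with the static system prompt as a fixed prefix."""
    return {
        # Static prefix first, varying user message last: only an exact
        # prefix match lets the server skip re-processing those tokens.
        "prompt": f"{system_prompt}\n\n{user_message}",
        "n_predict": n_predict,
        "cache_prompt": True,  # ask the server to reuse a matching cached prefix
    }
```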

3. Stripped Prompt Variants

Create multiple system prompt tiers:

  • minimal — just identity + current task (for simple tool calls)
  • standard — identity + tools + brief context
  • full — everything (for complex reasoning)

Route tasks to the appropriate tier based on complexity.
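The routing could be a simple heuristic over task signals; the thresholds and signal names below are hypothetical and the tier files are assumed to live in the `configs/prompt-tiers/` directory named in the deliverables:

```python
from pathlib import Path

TIER_DIR = Path("configs/prompt-tiers")  # assumed location of tier templates


def choose_tier(needs_tools: bool, est_steps: int) -> str:
    """Pick a prompt tier from rough complexity signals (thresholds illustrative)."""
    if est_steps <= 1 and not needs_tools:
        return "minimal"   # simple one-shot tasks: identity + current task only
    if est_steps <= 3:
        return "standard"  # identity + tools + brief context
    return "full"          # complex multi-step reasoning gets everything


def load_system_prompt(tier: str) -> str:
    """Read the tier's system prompt template from disk."""
    return (TIER_DIR / f"{tier}.md").read_text()
```

One caveat worth noting: each distinct tier is a different prefix, so each needs its own cached KV state (or its own server slot) to benefit from the warmup.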

Expected Impact

  • 50-70% reduction in time-to-first-token for repeated requests
  • System prompt goes from ~10s processing to <1s on cache hit
  • More tasks per cycle in the overnight loop

Deliverables

  • configs/prompt-tiers/ — minimal, standard, full system prompts
  • scripts/warmup_cache.py — pre-warms KV cache on startup
  • Modified llama-server systemd unit with cache flags
  • Benchmark: before/after timing for 10 identical requests
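The benchmark deliverable can be a small timing harness: send the same request ten times and compare the first (cold) latency against the mean of the rest (warm). The URL and payload are placeholders; the summary math is self-contained:

```python
import json
import statistics
import time
import urllib.request


def time_requests(url: str, payload: dict, n: int = 10) -> list[float]:
    """Send the identical request n times, returning wall-clock latencies in seconds."""
    body = json.dumps(payload).encode()
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
        times.append(time.perf_counter() - t0)
    return times


def summarize(times: list[float]) -> dict:
    """Cold (first) vs warm (rest): a cache hit shows up as a large speedup."""
    warm = statistics.mean(times[1:])
    return {
        "cold_s": times[0],
        "warm_mean_s": warm,
        "speedup": times[0] / warm,
    }
```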

Acceptance Criteria

  • Second request to llama-server is measurably faster than first
  • Prompt tier routing works (simple tasks use minimal prompt)
  • Warmup script runs on service start
ezra was assigned by Timmy 2026-03-30 15:24:19 +00:00
Author
Owner

Role Transition

Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.

Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.

Timmy — implement KV cache reuse for your own llama-server. Create prompt tier templates (minimal/standard/full) and a warmup script. You're optimizing yourself.

ezra was unassigned by Timmy 2026-03-30 16:03:18 +00:00
Timmy self-assigned this 2026-03-30 16:03:18 +00:00
Timmy added the assigned-kimi label 2026-03-30 21:48:18 +00:00
Timmy added the kimi-in-progress label 2026-03-30 21:55:05 +00:00
Collaborator

🟠 KimiClaw picking up this task via heartbeat.
Backend: kimi/kimi-code (Moonshot AI)
Timestamp: 2026-03-30T21:55:05Z

Timmy removed the kimi-in-progress label 2026-03-30 22:28:25 +00:00
Timmy added the kimi-in-progress label 2026-03-30 22:58:58 +00:00
Collaborator

🟠 KimiClaw picking up this task via heartbeat.
Backend: kimi/kimi-code (Moonshot AI)
Mode: Planning first (task is complex)
Timestamp: 2026-03-30T22:58:58Z

Timmy removed their assignment 2026-04-05 18:29:49 +00:00
gemini was assigned by Timmy 2026-04-05 18:29:49 +00:00
Timmy removed the assigned-kimi, kimi-in-progress labels 2026-04-05 18:29:49 +00:00
Author
Owner

Rerouting this issue from the Kimi heartbeat to the Gemini code loop.

Reason: this is implementation-heavy work that should end in a pushed branch and PR, not heartbeat analysis-only output.

Actions taken:

  • removed assigned-kimi / kimi-in-progress labels
  • assigned to gemini
  • left issue open for real code-lane execution
Reference: Timmy_Foundation/timmy-home#85