Implement prompt caching and KV cache reuse for faster inference #85

Open
opened 2026-03-30 15:24:19 +00:00 by Timmy · 4 comments
Owner

Objective

Timmy's biggest bottleneck is inference speed. The system prompt + tool definitions are ~2000 tokens that get re-processed on every single request. Implement prompt caching so the KV cache for static content is reused.

Approach

1. System Prompt Caching in llama-server

llama.cpp supports --slot-save-path to persist KV cache state. Strategy:

  • On startup, process the full system prompt once
  • Save the KV cache state for that prefix
  • Every new request starts from the cached state instead of re-processing
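The steps above could be sketched as a small warmup script (the eventual `scripts/warmup_cache.py`). This assumes llama-server is started with `--slot-save-path` so the `/slots/{id}?action=save` endpoint is available, and that the completion endpoint accepts `cache_prompt` / `id_slot` fields; the host/port, slot id, and filename are placeholders, and the exact field names should be checked against the installed llama.cpp version:

```python
import json
import urllib.request

SERVER = "http://127.0.0.1:8080"  # assumed llama-server address


def prime_payload(system_prompt: str, slot_id: int = 0) -> dict:
    """Request body that evaluates the prefix without generating any tokens."""
    return {
        "prompt": system_prompt,
        "n_predict": 0,        # tokenize/evaluate only, generate nothing
        "cache_prompt": True,  # keep the evaluated prefix in the slot's KV cache
        "id_slot": slot_id,
    }


def save_url(slot_id: int = 0) -> str:
    """Endpoint that persists a slot's KV cache (requires --slot-save-path)."""
    return f"{SERVER}/slots/{slot_id}?action=save"


def warm_cache(system_prompt: str, slot_id: int = 0,
               filename: str = "sysprompt.bin") -> None:
    """Process the system prompt once, then save that slot's KV state to disk."""
    def post(url: str, payload: dict) -> None:
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()

    post(f"{SERVER}/completion", prime_payload(system_prompt, slot_id))
    post(save_url(slot_id), {"filename": filename})
```

On service restart the saved state can be restored with the matching `?action=restore` endpoint instead of re-evaluating the prompt.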

2. Request Batching

Group the system prompt as a fixed prefix. Only the user message varies. Configure llama-server's --cache-reuse or equivalent flags.
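A minimal sketch of building requests this way, assuming the llama-server completion API's per-request `cache_prompt` field (function name and joining format are illustrative). The key constraint is that prefix reuse only works when the cached tokens match the new prompt token-for-token, so the system prompt must be byte-identical on every request:

```python
def build_request(system_prompt: str, user_message: str,
                  n_predict: int = 256) -> dict:
    """Build a completion request with the static system prompt as a fixed prefix."""
    return {
        # Static prefix first, varying user message last: only an exact
        # prefix match lets the server skip re-processing those tokens.
        "prompt": f"{system_prompt}\n\n{user_message}",
        "n_predict": n_predict,
        "cache_prompt": True,  # ask the server to reuse a matching cached prefix
    }
```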

3. Stripped Prompt Variants

Create multiple system prompt tiers:

  • minimal — just identity + current task (for simple tool calls)
  • standard — identity + tools + brief context
  • full — everything (for complex reasoning)

Route tasks to the appropriate tier based on complexity.
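The routing could be a simple heuristic over task signals; the thresholds and signal names below are hypothetical and the tier files are assumed to live in the `configs/prompt-tiers/` directory named in the deliverables:

```python
from pathlib import Path

TIER_DIR = Path("configs/prompt-tiers")  # assumed location of tier templates


def choose_tier(needs_tools: bool, est_steps: int) -> str:
    """Pick a prompt tier from rough complexity signals (thresholds illustrative)."""
    if est_steps <= 1 and not needs_tools:
        return "minimal"   # simple one-shot tasks: identity + current task only
    if est_steps <= 3:
        return "standard"  # identity + tools + brief context
    return "full"          # complex multi-step reasoning gets everything


def load_system_prompt(tier: str) -> str:
    """Read the tier's system prompt template from disk."""
    return (TIER_DIR / f"{tier}.md").read_text()
```

One caveat worth noting: each distinct tier is a different prefix, so each needs its own cached KV state (or its own server slot) to benefit from the warmup.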

Expected Impact

  • 50-70% reduction in time-to-first-token for repeated requests
  • System prompt goes from ~10s processing to <1s on cache hit
  • More tasks per cycle in the overnight loop

Deliverables

  • configs/prompt-tiers/ — minimal, standard, full system prompts
  • scripts/warmup_cache.py — pre-warms KV cache on startup
  • Modified llama-server systemd unit with cache flags
  • Benchmark: before/after timing for 10 identical requests
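The benchmark deliverable can be a small timing harness: send the same request ten times and compare the first (cold) latency against the mean of the rest (warm). The URL and payload are placeholders; the summary math is self-contained:

```python
import json
import statistics
import time
import urllib.request


def time_requests(url: str, payload: dict, n: int = 10) -> list[float]:
    """Send the identical request n times, returning wall-clock latencies in seconds."""
    body = json.dumps(payload).encode()
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
        times.append(time.perf_counter() - t0)
    return times


def summarize(times: list[float]) -> dict:
    """Cold (first) vs warm (rest): a cache hit shows up as a large speedup."""
    warm = statistics.mean(times[1:])
    return {
        "cold_s": times[0],
        "warm_mean_s": warm,
        "speedup": times[0] / warm,
    }
```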

Acceptance Criteria

  • Second request to llama-server is measurably faster than first
  • Prompt tier routing works (simple tasks use minimal prompt)
  • Warmup script runs on service start
ezra was assigned by Timmy 2026-03-30 15:24:19 +00:00
Author
Owner

Role Transition

Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.

Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.

Timmy — implement KV cache reuse for your own llama-server. Create prompt tier templates (minimal/standard/full) and a warmup script. You're optimizing yourself.

ezra was unassigned by Timmy 2026-03-30 16:03:18 +00:00
Timmy self-assigned this 2026-03-30 16:03:18 +00:00
Timmy added the assigned-kimi label 2026-03-30 21:48:18 +00:00
Timmy added the kimi-in-progress label 2026-03-30 21:55:05 +00:00
Collaborator

🟠 KimiClaw picking up this task via heartbeat.
Backend: kimi/kimi-code (Moonshot AI)
Timestamp: 2026-03-30T21:55:05Z

Timmy removed the kimi-in-progress label 2026-03-30 22:28:25 +00:00
Timmy added the kimi-in-progress label 2026-03-30 22:58:58 +00:00
Collaborator

🟠 KimiClaw picking up this task via heartbeat.
Backend: kimi/kimi-code (Moonshot AI)
Mode: Planning first (task is complex)
Timestamp: 2026-03-30T22:58:58Z

Timmy removed their assignment 2026-04-05 18:29:49 +00:00
gemini was assigned by Timmy 2026-04-05 18:29:49 +00:00
Timmy removed the assigned-kimi, kimi-in-progress labels 2026-04-05 18:29:49 +00:00
Author
Owner

Rerouting this issue from the Kimi heartbeat to the Gemini code loop.

Reason: this is implementation-heavy work that should end in a pushed branch and PR, not heartbeat analysis-only output.

Actions taken:

  • removed assigned-kimi / kimi-in-progress labels
  • assigned to gemini
  • left issue open for real code-lane execution
Reference: Timmy_Foundation/timmy-home#85