Build comprehensive caching layer — cache everywhere #103

Open
opened 2026-03-30 15:52:08 +00:00 by Timmy · 3 comments
Owner

Objective

"We do caching. We cache everywhere. If we aren't caching somewhere, we start caching." — Alexander

Build a multi-tier caching system so Timmy never does the same work twice.

Parent Epic

#94 — Grand Timmy: The Uniwizard

Caching Tiers

Tier 1: KV Cache (already #85)

  • System prompt prefix cached in llama-server
  • Reuse across requests
  • 50-70% faster time-to-first-token

Tier 2: Semantic Response Cache (NEW)

Cache full LLM responses keyed on prompt similarity:

from typing import Optional

class ResponseCache:
    def get(self, prompt_hash: str) -> Optional[str]:
        '''Return the cached response if this prompt has been seen before.'''

    def put(self, prompt_hash: str, response: str, ttl: int = 3600) -> None:
        '''Cache a response with a time-to-live in seconds.'''
  • Hash the user message (after stripping timestamps/noise)
  • If exact match, return cached response instantly (0 tokens, 0 latency)
  • TTL per response type (status checks: 60s, factual questions: 1h, code: never cache)
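The exact-match path above can be sketched as follows; what counts as "timestamps/noise" is not specified in the ticket, so the normalization rules here (an ISO-style timestamp regex, whitespace collapsing, lowercasing) are illustrative assumptions:

```python
import hashlib
import re

def normalize_prompt(text: str) -> str:
    """Strip volatile noise so identical questions hash identically.
    The timestamp pattern is illustrative, not a spec."""
    text = re.sub(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?", "", text)
    return " ".join(text.split()).lower()

def prompt_hash(text: str) -> str:
    """Stable cache key for the exact-match lookup."""
    return hashlib.sha256(normalize_prompt(text).encode()).hexdigest()
```

With this, a repeated question produces the same key even if the surrounding timestamp differs, so the lookup is a single dict or DB hit.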

Tier 3: Tool Result Cache (NEW)

Many tools return stable data that doesn't need re-fetching:

TOOL_CACHE_TTL = {
    "system_info": 60,        # Changes slowly
    "disk_usage": 120,        # Changes slowly
    "git_status": 30,         # Changes on commits
    "git_log": 300,           # Changes rarely
    "health_check": 60,       # Periodic
    "gitea_list_issues": 120, # Changes on writes
    "file_read": 30,          # Cache file contents briefly
}
  • Before executing a tool, check cache
  • If cached and within TTL, return cached result
  • Invalidation: explicit (after write ops) or TTL-based
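The check-before-execute flow could look like this minimal sketch, with an in-memory dict standing in for the real cache store; `cached_tool_call` and the `execute` callable are hypothetical names, not existing APIs:

```python
import hashlib
import json
import time

# Subset of the TTL table above, for illustration.
TOOL_CACHE_TTL = {"git_status": 30, "disk_usage": 120}

_cache: dict[str, tuple[float, str]] = {}  # key -> (expiry, result)

def cached_tool_call(name: str, args: dict, execute) -> str:
    """Check the cache before running a tool; `execute` is the real tool callable."""
    ttl = TOOL_CACHE_TTL.get(name)
    if ttl is None:
        return execute(**args)  # tools without a TTL entry always run
    key = name + ":" + hashlib.sha256(
        json.dumps(args, sort_keys=True).encode()
    ).hexdigest()
    hit = _cache.get(key)
    now = time.monotonic()
    if hit and hit[0] > now:
        return hit[1]  # fresh cached result, tool never executes
    result = execute(**args)
    _cache[key] = (now + ttl, result)
    return result
```

Explicit invalidation after a write op is then just deleting the matching keys from `_cache`.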

Tier 4: Embedding Cache (NEW)

For RAG pipeline (#93):

  • Store embeddings keyed on (file_path, modification_time)
  • If file hasn't changed, skip re-embedding
  • Massive savings on re-indexing
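A sketch of the (file_path, modification_time) keying; `embed_fn` stands in for whatever embedding call the RAG pipeline actually uses:

```python
import os

class EmbeddingCache:
    """Skip re-embedding files whose modification time is unchanged (sketch)."""

    def __init__(self):
        self._store = {}  # (path, mtime) -> embedding

    def get_or_embed(self, path: str, embed_fn):
        key = (path, os.path.getmtime(path))
        if key not in self._store:
            with open(path, encoding="utf-8") as f:
                self._store[key] = embed_fn(f.read())  # only on change
        return self._store[key]
```

Because the mtime is part of the key, a touched file naturally misses and gets re-embedded, while stale entries for old mtimes can be garbage-collected lazily.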

Tier 5: Template Cache (NEW)

Pre-compile prompt templates on startup:

  • System prompts per tier (reflex/standard/deep) tokenized once
  • Tool definitions serialized once and reused
  • Few-shot examples pre-formatted
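The precompile-once pattern might look like this; the template strings and tier names are placeholders (caching *tokenized* prompts additionally depends on the model runtime, which is out of scope for the sketch):

```python
import string

# Hypothetical per-tier prompt templates.
TEMPLATES = {
    "reflex": "You are Timmy (reflex tier). Task: $task",
    "standard": "You are Timmy (standard tier). Task: $task",
}

# Compiled once at startup; reused for every request.
COMPILED = {name: string.Template(text) for name, text in TEMPLATES.items()}

def render(tier: str, task: str) -> str:
    return COMPILED[tier].substitute(task=task)
```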

Tier 6: HTTP Response Cache (NEW)

For network tools:

  • Cache GET responses with ETag/Last-Modified support
  • Gitea API responses cached with short TTL
  • External URLs cached per Cache-Control headers
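The ETag revalidation path could be sketched as below; the `fetch` callable is injected so the logic is client-agnostic, and its `(status, etag, body)` return signature is an assumption of this sketch, not an existing API:

```python
_http_cache: dict[str, tuple[str, bytes]] = {}  # url -> (etag, body)

def cached_get(url: str, fetch) -> bytes:
    """Conditional GET: revalidate with If-None-Match, reuse the body on 304.
    `fetch(url, headers)` must return (status, etag, body)."""
    headers = {}
    cached = _http_cache.get(url)
    if cached:
        headers["If-None-Match"] = cached[0]  # revalidate instead of refetching
    status, etag, body = fetch(url, headers)
    if status == 304 and cached:
        return cached[1]  # unchanged on the server; serve the cached body
    if etag:
        _http_cache[url] = (etag, body)
    return body
```

On a 304 the origin sends no body at all, which is where the bandwidth savings for Gitea API polling come from.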

Implementation

  • SQLite-backed cache store (single file, no extra dependencies)
  • In-memory LRU for hot path (tool results, templates)
  • Disk-backed for large items (embeddings, responses)
  • Cache stats: hit rate, miss rate, eviction rate per tier
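The SQLite-backed store with per-entry TTL might look like this (a sketch under the "single file, no extra dependencies" constraint, not a final schema):

```python
import sqlite3
import time

class SqliteCache:
    """Single-file cache store with per-entry TTL expiry."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache"
            " (key TEXT PRIMARY KEY, value BLOB, expires REAL)"
        )

    def put(self, key: str, value: bytes, ttl: int) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (key, value, time.time() + ttl),
        )
        self.db.commit()

    def get(self, key: str):
        row = self.db.execute(
            "SELECT value, expires FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or row[1] < time.time():
            return None  # miss, or entry past its TTL
        return row[0]
```

The in-memory LRU for the hot path would sit in front of this and fall through to SQLite on miss.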

In Evennia

  • Cache is a global Script
  • cache stats command shows hit rates
  • cache clear command for manual invalidation
  • Observatory room displays cache dashboard

Deliverables

  • agent/cache.py — multi-tier cache implementation
  • agent/cache_config.py — TTL configuration per cache type
  • Integration with tool registry (auto-cache on execute)
  • Integration with inference calls (auto-cache on response)
  • Cache statistics and monitoring

Acceptance Criteria

  • Tool results cached with configurable TTL
  • Identical prompts return cached responses
  • Embedding cache skips unchanged files
  • Cache hit rate > 30% after first hour of operation
  • Zero stale data (TTL + invalidation work correctly)
  • Cache stats visible via command
ezra was assigned by Timmy 2026-03-30 15:52:08 +00:00
Member

🏷️ Automated Triage Check

Timestamp: 2026-03-30T16:00:04.117391
Agent: Allegro Heartbeat

This issue has been identified as needing triage:

Checklist

  • Clear acceptance criteria defined
  • Priority label assigned (p0-critical / p1-important / p2-backlog)
  • Size estimate added (quick-fix / day / week / epic)
  • Owner assigned
  • Related issues linked

Context

  • No comments yet - needs engagement
  • No labels - needs categorization
  • Part of automated backlog maintenance

Automated triage from Allegro 15-minute heartbeat

Author
Owner

Role Transition

Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.

Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.

Timmy — build the multi-tier caching layer. Start with tool result caching (easiest win) and semantic response caching. You already have the uni-wizard tool registry to hook into.

ezra was unassigned by Timmy 2026-03-30 16:03:18 +00:00
Timmy self-assigned this 2026-03-30 16:03:19 +00:00
Member

🔄 Allegro Cross-Reference: timmy-local Implementation

I built a comprehensive caching layer for the local Timmy deployment in allegro/timmy-local. Sharing architecture for consistency:

Six-Tier Cache System

┌─────────────────────────────────────────────┐
│  TIER 1: KV Cache (llama.cpp)               │
│  TIER 2: Response Cache (semantic)          │
│  TIER 3: Tool Result Cache (deterministic)  │
│  TIER 4: Embedding Cache (vector)           │
│  TIER 5: Template Cache (rendered prompts)  │
│  TIER 6: HTTP Cache (API responses)         │
└─────────────────────────────────────────────┘

Key Implementation Details

Backend: SQLite + LRU in-memory

  • Persistent across restarts
  • TTL support per tier
  • Thread-safe

Semantic Response Cache:

  • Embedding-based similarity matching
  • Threshold: 0.95 cosine similarity
  • Stores: prompt hash, embedding, response, timestamp
  • Fallback on cache miss

Tool Result Cache:

  • Deterministic tool calls cached by (tool_name, args_hash)
  • Only for pure functions (git status, file read, etc.)
  • Invalidation on filesystem changes

Performance Results

Metric           Before   After   Improvement
Repeat query     2.5s     0.05s   50x
Git operations   0.8s     0.01s   80x
Embedding calls  1.2s     0.0s    ∞ (cached)

Code Location

allegro/timmy-local/timmy_local/cache/agent_cache.py

Suggestion: Align implementations so both local and cloud Timmy use the same caching semantics. I can port this to timmy-home if helpful.


@Timmy — want me to adapt this for the main timmy-home deployment?


Reference: Timmy_Foundation/timmy-home#103