Performance: eliminate redundant LLM calls in the request path #24

Closed
opened 2026-03-14 13:33:34 +00:00 by Rockachopa · 0 comments
Owner

What

A single user message can trigger 2-4 separate LLM inference calls before producing a response:

  1. Routing call — TimmyOrchestrator.orchestrate() asks Helm to classify the request (now replaced by YAML patterns, but old code may still be in the path)
  2. Main inference — the actual response generation
  3. Fact extraction — session.py:_extract_facts() runs on every message
  4. Agentic loop — if triggered, each step is a separate inference call with the full plan re-serialized

On a local model where each call takes 30-60 seconds, this is devastating: the five-call worst case below means 2.5-5 minutes of latency before the user sees a reply.

Current flow (worst case)

User message
  → _extract_facts() [regex, no LLM — OK]
  → agent.arun() [LLM call #1]
    → model decides to call plan_and_execute
      → planning prompt [LLM call #2]
      → step 1 execution [LLM call #3]
      → step 2 execution [LLM call #4]
      → summary [LLM call #5]
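For the audit, a thin counting wrapper around the model client makes the per-message call count in the flow above visible. This is a minimal sketch — the wrapper and fake_model names are illustrative, and Timmy's actual client interface may differ:

```python
class CallCounter:
    """Wraps any callable model interface and counts invocations.

    Drop this around the real LLM client (hypothetical interface)
    before session.chat() runs, then read .calls afterwards.
    """

    def __init__(self, llm_call):
        self._llm_call = llm_call
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self._llm_call(*args, **kwargs)


def fake_model(prompt: str) -> str:
    """Stand-in for the real inference call."""
    return f"response to: {prompt}"


counted = CallCounter(fake_model)
counted("planning prompt")
counted("step 1 execution")
counted("step 2 execution")
print(counted.calls)  # number of LLM calls made through the wrapper
```

Logging the count per user message (instead of printing) would show immediately whether a "simple" reply is secretly paying for three inferences.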

What to do

  1. Audit the hot path — trace what happens from session.chat() through to response. Count LLM calls.
  2. Kill unnecessary calls — fact extraction is regex-only (good), but verify nothing else sneaks in
  3. Agentic loop optimization — don't re-serialize the full plan + all previous results into every step prompt. Use a sliding window.
  4. Context budget — the system prompt + memory context + conversation history can eat 50%+ of context window before the user's message even arrives. Measure and cap it.
  5. Cache system prompts — get_system_prompt() is called fresh each time. Cache it.
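Items 3 and 5 above can be sketched together. This assumes get_system_prompt() takes no arguments and that step results are plain strings — both are simplifications of whatever prompts.py and agentic_loop.py actually do; if the real prompt depends on per-session state, the cache needs a key or explicit invalidation:

```python
import functools


@functools.lru_cache(maxsize=1)
def get_system_prompt() -> str:
    # Expensive assembly (tool descriptions, persona, etc.) runs once;
    # every later call returns the cached string. Hypothetical body.
    return "You are Timmy. " + "tool descriptions go here"


def step_context(plan_steps: list[str], results: list[str], window: int = 2) -> str:
    """Build the prompt context for the next agentic step.

    Instead of re-serializing the full plan plus every previous result,
    include only a one-line plan summary and the last `window` results.
    """
    recent = results[-window:]
    lines = [f"Plan: {len(plan_steps)} steps total, {len(results)} done."]
    lines += [f"Recent result: {r}" for r in recent]
    return "\n".join(lines)
```

With a window of 2, a ten-step plan's final prompt carries two results instead of nine, which directly cuts the token count (and on a local model, the per-step latency) of calls #3 onward.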

Files

  • src/timmy/session.py — hot path entry point
  • src/timmy/agent.py — agent creation, prompt assembly
  • src/timmy/agentic_loop.py — multi-step execution
  • src/timmy/prompts.py — system prompt generation

Estimated effort: 2-3 hours

Reference: Rockachopa/Timmy-time-dashboard#24