[Infra] Local inference fallback — degraded but never dark #914

Closed
opened 2026-03-22 13:24:01 +00:00 by Timmy · 0 comments
Owner

Problem

Even with Hermes running on the VPS, Timmy depends on external API calls (Anthropic) to think. If the API key expires, quota runs out, or Anthropic has an outage, Timmy goes dark again.

Solution

Run a small local model on the VPS as a fallback. When the primary model (Claude) is unreachable, Timmy drops to a smaller local model. Degraded intelligence, but never silent.

Options

| Model | RAM | Quality | Notes |
|-------|-----|---------|-------|
| Qwen 2.5 1.5B (GGUF Q4) | ~2GB | Basic | Fits on VPS, fast |
| Phi-3 Mini 3.8B (GGUF Q4) | ~3GB | Decent | May be tight on VPS RAM |
| TinyLlama 1.1B | ~1GB | Minimal | Very fast, limited reasoning |

Behavior

  • Primary: Claude via Anthropic API (full capability)
  • Fallback: Local model via llama.cpp (limited but present)
  • Timmy should tell the user when running in fallback mode: "I'm running on a smaller brain right now — Alexander's API connection is down. I can still chat but I'm not as sharp."
  • Auto-switch back when primary is available
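The routing described above can be sketched as a small wrapper that tries the primary model first and drops to the local one on any failure. This is a minimal sketch, not the actual Hermes config: the `ModelRouter` class and the callables are hypothetical stand-ins for the Anthropic client and the llama.cpp backend.

```python
# Minimal sketch of primary/fallback routing. ModelRouter is a hypothetical
# name; the primary/fallback callables stand in for the Anthropic API client
# and the local llama.cpp model.
class ModelRouter:
    def __init__(self, primary, fallback):
        self.primary = primary      # callable: prompt -> reply (Claude via API)
        self.fallback = fallback    # callable: prompt -> reply (local model)
        self.degraded = False       # True while running on the local model

    def ask(self, prompt):
        try:
            reply = self.primary(prompt)
            self.degraded = False   # primary answered: auto-switch back
            return reply
        except Exception:
            # Primary unreachable (expired key, quota, outage): go local.
            self.degraded = True
            return self.fallback(prompt)

# Simulate an Anthropic outage: primary always raises.
def broken_primary(prompt):
    raise ConnectionError("API unreachable")

router = ModelRouter(broken_primary, lambda p: "local: " + p)
print(router.ask("hello"))   # → local: hello
print(router.degraded)       # → True
```

The `degraded` flag is what the Workshop status indicator and the "smaller brain" user notice would read from.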

Requirements

  • Install llama-cpp-python on VPS
  • Download a small GGUF model
  • Add fallback routing in Hermes API server config
  • Add status indicator in Workshop panel (which model is active)
  • Test failover and failback
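The failback half of the last requirement can be exercised with a simple probe: periodically ping the primary and clear degraded mode once it answers. A sketch, with hypothetical names (`probe_primary`, the retry count and delay are assumptions, not Hermes settings):

```python
# Sketch of a failback probe: retry a trivial ping against the primary and
# report whether it has recovered. Names and retry parameters are assumptions.
import time

def probe_primary(ping, retries=3, delay=0.0):
    """Return True as soon as the primary answers a trivial ping."""
    for _ in range(retries):
        try:
            ping("ping")
            return True
        except Exception:
            time.sleep(delay)
    return False

# Simulated outage that recovers on the third attempt.
attempts = {"n": 0}
def flaky_ping(prompt):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("still down")
    return "pong"

print(probe_primary(flaky_ping))  # → True
```

In practice this would run on a timer in the Hermes API server, flipping the router back to Claude when the probe succeeds.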

Priority

Medium — nice to have after VPS migration is done. The VPS + Anthropic API is already much more reliable than the Mac dependency. This is the belt-and-suspenders layer.

References

  • Memory note: Timmy runs on Hermes harness, SOUL.md says "If I have four [GB], I think with a smaller one. I never refuse to work because my resources are limited."
gemini was assigned by Rockachopa 2026-03-22 23:31:57 +00:00
claude added the harness, inference, p0-critical labels 2026-03-23 13:52:56 +00:00
gemini was unassigned by Timmy 2026-03-24 19:33:52 +00:00
Timmy closed this issue 2026-03-24 21:54:47 +00:00
Reference: Rockachopa/Timmy-time-dashboard#914