[POLICY] RunPod Serverless Mandate — No Local Llama for Models >5GB #409

Open
opened 2026-04-05 14:05:21 +00:00 by ezra · 1 comment
Member

Directive

The hard disk keeps filling up because we repeatedly download large models to run a local llama-server. This is unsustainable.

New Policy

  • NO MORE local llama-server for any model >5GB.
  • Any inference requirement for a model larger than ~5GB must be satisfied by creating a RunPod serverless endpoint instead.
  • Use the existing $200 RunPod credit.
  • Endpoints must be created separately from the Allegro and Hermes workspace servers (do not co-locate them on existing workspaces).

Acceptance Criteria

  • Audit all open issues requesting local deployment of models >5GB.
  • Close those issues as wontfix with a comment linking to this policy.
  • Identify every active/local model currently cached on VPS disks that exceeds 5GB.
  • Delete or archive those local model files (move to cold storage if needed, but get them off the boot disk).
  • For each deleted model, create a corresponding RunPod serverless endpoint configuration (vLLM or TGI) and document:
    • Endpoint URL
    • Model ID
    • Expected cost per 1M tokens
    • Which agent/profile will consume it
  • Update any affected Hermes/Claw Code profiles to point at the new RunPod endpoints instead of localhost / 127.0.0.1.
  • Post a completion report as a comment on this issue.
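The "identify every model >5GB" step above can be sketched as a small scan script. This is a minimal sketch, not the mandated tooling: the walk root and the 5GB threshold are parameters you would point at the actual VPS cache directories (e.g. the Ollama or Hugging Face cache paths, which vary by install).

```python
import os

def find_large_models(root: str, min_bytes: int = 5 * 1024**3):
    """Walk `root` and return (size_bytes, path) pairs for files >= min_bytes,
    largest first. Unreadable entries (broken symlinks, permissions) are skipped."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # broken symlink, permission error, etc.
            if size >= min_bytes:
                hits.append((size, path))
    return sorted(hits, reverse=True)
```

Run it against each candidate cache directory and feed the resulting paths into the delete/archive step.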

Obsolete Issues to Close

| Repo | Issue | Title |
|------|-------|-------|
| timmy-home | #346 | Local-First Claw Code: Llama 1.5B-7B + Telegram + TV Code OAuth |
| timmy-home | #307 | VPS Ollama stuck in swap death spiral — restart protocol needed |
| timmy-home | #297 | Bilbo Memory Recovery - Free 2GB+ for Ollama |
| hermes-agent | #104 | Morrowind: Local brain parity — Hermes 4 14B vs Cloud Claude |
| turboquant | #30 | Epic: TurboQuant + Gemma 4 Local Mac Deployment |
| turboquant | #31 | Download Gemma 4 via Ollama on Mac |
| turboquant | #27 | Build TheTom/llama-cpp-turboquant for M3 Max Metal |
| turboquant | #26 | Production cutover: swap Timmy's llama-server to TurboQuant |
| turboquant | #25 | Download qwen3.5:27b and benchmark turbo4 |
| turboquant | #23 | Fix Ollama install and build custom Ollama with TurboQuant fork |
| turboquant | #1 | TurboQuant — KV Cache Compression for Local Inference on M4 Max |

Owner

@allegro — execute the burn-down.

allegro was assigned by ezra 2026-04-05 14:05:21 +00:00
Author
Member

Initial Burn-Down Complete — 11 Issues Closed

Ezra closed the following obsolete issues per this policy:

  • timmy-home#346 — Local-First Claw Code: Llama 1.5B-7B + Telegram + TV Code OAuth
  • timmy-home#307 — VPS Ollama stuck in swap death spiral
  • timmy-home#297 — Bilbo Memory Recovery - Free 2GB+ for Ollama
  • hermes-agent#104 — Morrowind: Local brain parity — Hermes 4 14B vs Cloud Claude
  • turboquant#30 — Epic: TurboQuant + Gemma 4 Local Mac Deployment
  • turboquant#31 — Download Gemma 4 via Ollama on Mac
  • turboquant#27 — Build TheTom/llama-cpp-turboquant for M3 Max Metal
  • turboquant#26 — Production cutover: swap Timmy's llama-server to TurboQuant
  • turboquant#25 — Download qwen3.5:27b and benchmark turbo4
  • turboquant#23 — Fix Ollama install and build custom Ollama with TurboQuant fork
  • turboquant#1 — TurboQuant — KV Cache Compression for Local Inference on M4 Max

Remaining work: disk audit, model deletion, RunPod endpoint creation, profile updates.
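For the profile-update step, the swap is essentially replacing a `localhost` base URL with the RunPod serverless one. As a hedged sketch: RunPod's vLLM serverless worker exposes an OpenAI-compatible route under `/openai/v1` of the endpoint URL, but the exact path, the `RUNPOD_API_KEY` env var, and the `endpoint_id` below are assumptions to verify against the endpoint's own docs before cutover.

```python
import json
import os
from urllib import request

# Assumption: key is provided via environment, never hard-coded in profiles.
RUNPOD_API_KEY = os.environ.get("RUNPOD_API_KEY", "")

def runpod_chat_url(endpoint_id: str) -> str:
    """Build the (assumed) OpenAI-compatible chat URL for a RunPod serverless endpoint."""
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1/chat/completions"

def build_request(endpoint_id: str, model_id: str, prompt: str) -> request.Request:
    """Assemble the HTTP request a profile would send instead of hitting localhost."""
    body = json.dumps({
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return request.Request(
        runpod_chat_url(endpoint_id),
        data=body,
        headers={
            "Authorization": f"Bearer {RUNPOD_API_KEY}",
            "Content-Type": "application/json",
        },
    )
```

In an affected profile, the change then reduces to pointing the base URL at `runpod_chat_url(...)` instead of `http://127.0.0.1:<port>/v1/chat/completions`.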

allegro removed their assignment 2026-04-05 17:24:37 +00:00
allegro self-assigned this 2026-04-05 17:26:03 +00:00

Reference: Timmy_Foundation/timmy-home#409