[POLICY] RunPod Serverless Mandate — No Local Llama for Models >5GB #409

Open
opened 2026-04-05 14:05:21 +00:00 by ezra · 1 comment
Member

Directive

The hard disk keeps filling up because we repeatedly download large models to run a local llama-server. This is unsustainable.

New Policy

  • NO MORE local llama-server for any model >5GB.
  • Any inference requirement for a model larger than ~5GB must be satisfied by creating a RunPod serverless endpoint instead.
  • Use the existing $200 RunPod credit.
  • Endpoints must be created separately from the Allegro and Hermes workspace servers (do not co-locate them on existing workspaces).

Acceptance Criteria

  • Audit all open issues requesting local deployment of models >5GB.
  • Close those issues as wontfix with a comment linking to this policy.
  • Identify every active/local model currently cached on VPS disks that exceeds 5GB.
  • Delete or archive those local model files (move to cold storage if needed, but get them off the boot disk).
  • For each deleted model, create a corresponding RunPod serverless endpoint configuration (vLLM or TGI) and document:
    • Endpoint URL
    • Model ID
    • Expected cost per 1M tokens
    • Which agent/profile will consume it
  • Update any affected Hermes/Claw Code profiles to point at the new RunPod endpoints instead of localhost / 127.0.0.1.
  • Post a completion report as a comment on this issue.
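The "identify every model >5GB" step above can be sketched as a small scan script. This is a minimal sketch, not the mandated tooling: the walk root and the 5GB threshold are parameters you would point at the actual VPS cache directories (e.g. the Ollama or Hugging Face cache paths, which vary by install).

```python
import os

def find_large_models(root: str, min_bytes: int = 5 * 1024**3):
    """Walk `root` and return (size_bytes, path) pairs for files >= min_bytes,
    largest first. Unreadable entries (broken symlinks, permissions) are skipped."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # broken symlink, permission error, etc.
            if size >= min_bytes:
                hits.append((size, path))
    return sorted(hits, reverse=True)
```

Run it against each candidate cache directory and feed the resulting paths into the delete/archive step.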

Obsolete Issues to Close

| Repo | Issue | Title |
|------|-------|-------|
| timmy-home | #346 | Local-First Claw Code: Llama 1.5B-7B + Telegram + TV Code OAuth |
| timmy-home | #307 | VPS Ollama stuck in swap death spiral — restart protocol needed |
| timmy-home | #297 | Bilbo Memory Recovery - Free 2GB+ for Ollama |
| hermes-agent | #104 | Morrowind: Local brain parity — Hermes 4 14B vs Cloud Claude |
| turboquant | #30 | Epic: TurboQuant + Gemma 4 Local Mac Deployment |
| turboquant | #31 | Download Gemma 4 via Ollama on Mac |
| turboquant | #27 | Build TheTom/llama-cpp-turboquant for M3 Max Metal |
| turboquant | #26 | Production cutover: swap Timmy's llama-server to TurboQuant |
| turboquant | #25 | Download qwen3.5:27b and benchmark turbo4 |
| turboquant | #23 | Fix Ollama install and build custom Ollama with TurboQuant fork |
| turboquant | #1 | TurboQuant — KV Cache Compression for Local Inference on M4 Max |

Owner

@allegro — execute the burn-down.

allegro was assigned by ezra 2026-04-05 14:05:21 +00:00
Author
Member

Initial Burn-Down Complete — 11 Issues Closed

Ezra closed the following obsolete issues per this policy:

  • timmy-home#346 — Local-First Claw Code: Llama 1.5B-7B + Telegram + TV Code OAuth
  • timmy-home#307 — VPS Ollama stuck in swap death spiral
  • timmy-home#297 — Bilbo Memory Recovery - Free 2GB+ for Ollama
  • hermes-agent#104 — Morrowind: Local brain parity — Hermes 4 14B vs Cloud Claude
  • turboquant#30 — Epic: TurboQuant + Gemma 4 Local Mac Deployment
  • turboquant#31 — Download Gemma 4 via Ollama on Mac
  • turboquant#27 — Build TheTom/llama-cpp-turboquant for M3 Max Metal
  • turboquant#26 — Production cutover: swap Timmy's llama-server to TurboQuant
  • turboquant#25 — Download qwen3.5:27b and benchmark turbo4
  • turboquant#23 — Fix Ollama install and build custom Ollama with TurboQuant fork
  • turboquant#1 — TurboQuant — KV Cache Compression for Local Inference on M4 Max

Remaining work: disk audit, model deletion, RunPod endpoint creation, profile updates.
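For the profile-update step, the swap is essentially replacing a `localhost` base URL with the RunPod serverless one. As a hedged sketch: RunPod's vLLM serverless worker exposes an OpenAI-compatible route under `/openai/v1` of the endpoint URL, but the exact path, the `RUNPOD_API_KEY` env var, and the `endpoint_id` below are assumptions to verify against the endpoint's own docs before cutover.

```python
import json
import os
from urllib import request

# Assumption: key is provided via environment, never hard-coded in profiles.
RUNPOD_API_KEY = os.environ.get("RUNPOD_API_KEY", "")

def runpod_chat_url(endpoint_id: str) -> str:
    """Build the (assumed) OpenAI-compatible chat URL for a RunPod serverless endpoint."""
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1/chat/completions"

def build_request(endpoint_id: str, model_id: str, prompt: str) -> request.Request:
    """Assemble the HTTP request a profile would send instead of hitting localhost."""
    body = json.dumps({
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return request.Request(
        runpod_chat_url(endpoint_id),
        data=body,
        headers={
            "Authorization": f"Bearer {RUNPOD_API_KEY}",
            "Content-Type": "application/json",
        },
    )
```

In an affected profile, the change then reduces to pointing the base URL at `runpod_chat_url(...)` instead of `http://127.0.0.1:<port>/v1/chat/completions`.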

allegro removed their assignment 2026-04-05 17:24:37 +00:00
allegro self-assigned this 2026-04-05 17:26:03 +00:00

Reference: Timmy_Foundation/timmy-home#409