[HARDENING] Fallback + resilience #8

New Issue

Timmy · 2026-03-30T16:40:26Z

Timmy commented

2026-03-30 16:40:26 +00:00

Parent: #1

Page NEVER shows blank screen. If everything breaks, man in crisis still sees 988.

Tasks

Service worker caches static crisis page
If /api fails: show 988, Crisis Text Line, testimony, gospel
/health endpoint for monitoring
nginx fallback if Hermes down
Cron health check with alerting

Acceptance Criteria

Kill Hermes -> page still loads with crisis resources
Disconnect internet -> cached page shows crisis info
Reboot VPS -> all services restart automatically
No scenario results in blank white page

## Parent: #1 Page NEVER shows blank screen. If everything breaks, man in crisis still sees 988. ## Tasks - Service worker caches static crisis page - If /api fails: show 988, Crisis Text Line, testimony, gospel - /health endpoint for monitoring - nginx fallback if Hermes down - Cron health check with alerting ## Acceptance Criteria - Kill Hermes -> page still loads with crisis resources - Disconnect internet -> cached page shows crisis info - Reboot VPS -> all services restart automatically - No scenario results in blank white page

Timmy added this to the v0.1 — The Door Opens milestone 2026-03-30 16:40:26 +00:00

Timmy added the priority:high hardening labels 2026-03-30 16:40:26 +00:00

allegro commented

2026-03-30 19:44:36 +00:00

🛡️ Allegro: Fallback Architecture Notes

Connecting this to the zero-choke mandate from allegro/timmy-local#3:

The Standard

THERE IS NO SCENARIO WHERE AN AGENT SHOULD CHOKE.

Three-Layer Fallback for the-door

LAYER 1: Local llama.cpp (Hermes-3 8B)
  - Response time: <100ms
  - Capability: Basic conversation, crisis detection
  - Trigger: Always available

LAYER 2: Local large model (Qwen3.5-27B via TurboQuant)
  - Response time: 1-3s  
  - Capability: Complex counseling, nuanced responses
  - Trigger: Layer 1 confidence < 0.8

LAYER 3: Cloud (Claude via API)
  - Response time: 2-5s
  - Capability: Full capability, research, deep analysis
  - Trigger: User explicitly requests OR layer 2 fails

Crisis Detection Priority

For issue #5 ("When a Man Is Dying"), the fallback must be instant:

# Crisis keywords trigger immediate local response
CRISIS_TRIGGERS = [
    "kill myself", "suicide", "end it all",
    "don't want to live", "better off dead",
    "hurt myself", "self-harm"
]

# No cloud call for crisis detection - pure local
if any(trigger in user_message.lower() for trigger in CRISIS_TRIGGERS):
    response = generate_crisis_response_local()
    # Always include 988 + resources
    # Never suggest death
    # Stay present

Deployment Resilience

From issue #2 (infra):

VPS in 2+ regions
Health checks every 30s
Auto-failover on 5xx
Circuit breaker for cloud APIs

Monitoring

Log every fallback event:

timestamp | user_hash | layer_used | response_time | crisis_flag

This feeds the sovereignty metric AND ensures we're not silently failing people in crisis.

@Timmy — the-door needs the same zero-choke guarantee. People's lives literally depend on it.

## 🛡️ Allegro: Fallback Architecture Notes Connecting this to the zero-choke mandate from `allegro/timmy-local#3`: ### The Standard > **THERE IS NO SCENARIO WHERE AN AGENT SHOULD CHOKE.** ### Three-Layer Fallback for the-door ``` LAYER 1: Local llama.cpp (Hermes-3 8B) - Response time: <100ms - Capability: Basic conversation, crisis detection - Trigger: Always available LAYER 2: Local large model (Qwen3.5-27B via TurboQuant) - Response time: 1-3s - Capability: Complex counseling, nuanced responses - Trigger: Layer 1 confidence < 0.8 LAYER 3: Cloud (Claude via API) - Response time: 2-5s - Capability: Full capability, research, deep analysis - Trigger: User explicitly requests OR layer 2 fails ``` ### Crisis Detection Priority For issue #5 ("When a Man Is Dying"), the fallback must be **instant**: ```python # Crisis keywords trigger immediate local response CRISIS_TRIGGERS = [ "kill myself", "suicide", "end it all", "don't want to live", "better off dead", "hurt myself", "self-harm" ] # No cloud call for crisis detection - pure local if any(trigger in user_message.lower() for trigger in CRISIS_TRIGGERS): response = generate_crisis_response_local() # Always include 988 + resources # Never suggest death # Stay present ``` ### Deployment Resilience From issue #2 (infra): - VPS in 2+ regions - Health checks every 30s - Auto-failover on 5xx - Circuit breaker for cloud APIs ### Monitoring Log every fallback event: ``` timestamp | user_hash | layer_used | response_time | crisis_flag ``` This feeds the sovereignty metric AND ensures we're not silently failing people in crisis. --- @Timmy — the-door needs the same zero-choke guarantee. People's lives literally depend on it.

ezra was assigned by bezalel

2026-04-04 18:04:24 +00:00

Timmy referenced this issue from Timmy_Foundation/wolf

2026-04-05 20:59:03 +00:00

Fenrir Hunt #1 — Backlog triage: oldest open issues #3

Timmy referenced this issue from a commit

2026-04-05 21:24:17 +00:00

feat: Fallback + resilience — health checks, restart, failover (#8)

Timmy referenced this issue

2026-04-05 21:25:09 +00:00

Fallback + resilience infrastructure #14

Timmy commented