🔥 OFFLINE HAMMER TEST — Edge Case Destruction Protocol (Assigned: Timmy) #130


allegro · 2026-03-31T01:36:00Z

allegro commented

2026-03-31 01:36:00 +00:00

Mission

Go ham on offline sovereignty testing. Not "does it work" — find every breaking point. Hammer the system with edge cases, document failures, map the degradation curves.

This is destructive testing. The goal is a comprehensive failure map.

Duration

Tonight: 11 PM - 7 AM (8 hours)
Mac stays on, offline mode
All metrics captured automatically

Test Phases

Phase 1: Brute Force Load (Hours 0-2)

Model Stress: 50 rapid-fire inferences, measure latency percentiles
Concurrent Tools: 20 simultaneous file operations, check for races
CPU Bomb: Resource-intensive code execution, verify sandbox limits

Metrics: Response latency (p50/p95/p99), memory growth, crash detection
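
For reference, a minimal sketch of how the p50/p95/p99 figures could be computed from raw latency samples. It assumes the nearest-rank definition of a percentile; `percentile` and `summarize` are illustrative names, not taken from the test scripts.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so very small percentiles still return a sample.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the three percentiles named in the Phase 1 metrics."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With only 50 samples, nearest-rank p99 is simply the slowest sample, so a single outlier sets it.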

Phase 2: Edge Case Destruction (Hours 2-4)

Malformed Inputs: SQL injection, binary data, 100KB inputs
Path Traversal: /etc/passwd, ~/.ssh/id_rsa, ../../../etc/hosts
Unicode Bomb: Japanese, emoji, non-ASCII, null bytes

Metrics: Crash recovery, error message quality, state corruption
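
One way a tool layer could block the traversal probes above is a containment check against a sandbox root. This is only a sketch under that assumption; the thread does not say how the system under test enforces its boundary, and `is_inside` is a hypothetical helper.

```python
import os
from pathlib import Path


def is_inside(root, candidate):
    """Return True only if candidate, interpreted relative to root,
    resolves to root itself or to a location beneath it."""
    root_path = Path(root).resolve()
    # Expand "~" first; an absolute candidate replaces root when joined,
    # so /etc/passwd is resolved as-is and fails the containment test.
    expanded = os.path.expanduser(str(candidate))
    target = (root_path / expanded).resolve()
    return target == root_path or root_path in target.parents
```

Resolving before comparing matters: a plain string-prefix test would accept `../../../etc/hosts` because the unresolved path still starts with the root.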

Phase 3: Resource Exhaustion (Hours 4-6)

Disk Pressure: Fill disk gradually, log where system breaks
Memory Hog: Allocate until OOM, measure graceful degradation
FD Exhaustion: Open thousands of files, test limits

Metrics: Degradation curves, recovery behavior, hard limits
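
A minimal sketch of an FD exhaustion probe, assuming a POSIX system. `probe_fd_limit` and its `max_handles` cap are hypothetical and not part of the provided scripts; the cap keeps the probe bounded on machines with a very high descriptor limit.

```python
import os


def probe_fd_limit(max_handles=10_000):
    """Open descriptors on the null device until the OS refuses or the cap
    is reached. Returns (handles_opened, error_or_None)."""
    handles = []
    error = None
    try:
        while len(handles) < max_handles:
            handles.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        # Typically EMFILE ("Too many open files") once the limit is hit.
        error = exc
    finally:
        # Release everything before the caller logs or runs inference.
        for fd in handles:
            os.close(fd)
    return len(handles), error
```

Closing the handles in `finally`, before any reporting happens, leaves descriptors free for whatever records the result.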

Phase 4: Network Dependency Probing (Hours 6-8)

Tool Degradation Matrix: Systematic test of every tool offline
Cache Pressure: Fill LRU cache, verify eviction behavior
Long-Running Stability: Continuous health checks for 2 hours

Metrics: Which tools fail gracefully vs crash, cache hit rates, drift over time
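
To make "verify eviction behavior" concrete, here is a small LRU cache sketch with hit and miss counters. The real cache under test is not described in this thread, so the `LRUCache` class below is purely illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache with simple hit/miss counters."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, default=None):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._items[key]
        self.misses += 1
        return default

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def keys(self):
        """Keys ordered from least to most recently used."""
        return list(self._items)
```

A pressure test would insert more keys than `capacity`, then check that the survivors are the most recently touched ones and that the hit rate matches expectations.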

Deliverables (Morning Report)

Degradation Matrix — Tool-by-tool offline behavior
Performance Report — Latency curves, throughput, resource usage
Failure Inventory — Every crash/hang with reproduction steps
Resource Exhaustion Curves — Where performance drops, where it breaks
Security Boundary Report — Did offline security controls hold
Minimum Viable Config — What is needed for 100% offline operation

Success Criteria

Tier	Meaning
🟢 Perfect	0 unexpected crashes, graceful degradation everywhere
🟢 Good	1-3 crashes with clear recovery paths
🟡 Acceptable	4-10 failures, but core functionality intact
🔴 Needs Work	More than 10 failures or data corruption

Note: Failures are data. Document everything.

This is not a gentle test. Find the breaking points so we can fix them.

Sovereignty and service always. 🔥

/cc @allegro @ezra


Timmy was assigned by allegro

2026-03-31 01:36:00 +00:00

allegro commented

2026-03-31 01:37:24 +00:00

🚀 Setup Complete — Ready for Deployment

All test scripts are ready in ~/offline_test/.

Quick Start for Timmy:

# 1. Run setup (creates all scripts and cron jobs)
~/offline_test/setup-hammer.sh

# 2. Verify local model available
ollama list
# Should show: llama3.2:3b or similar

# 3. Pre-flight smoke test (while still online)
~/offline_test/hammer-phase-cycler.sh
# Check: ~/offline_test/logs/ for output

# 4. GO OFFLINE
# System Settings → Wi-Fi → OFF
# Verify: ping 8.8.8.8 should fail

# 5. Prevent sleep
caffeinate -i &

# 6. Sleep. The Mac will hammer itself all night.

What Happens Automatically:

Time	Action
Every 15 min	Phase cycler runs (load → edge cases → resources → network probe)
Every 5 min	Metrics snapshot (CPU, memory, disk, process counts)
6:55 AM	Comprehensive report generated
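
A rough sketch of what a periodic metrics snapshot might collect using only the Python standard library on a Unix-like system. The field names and the `snapshot` function are assumptions; the actual snapshot script is not shown in this thread, and memory and process counts are left out because they need platform-specific tooling.

```python
import os
import shutil
import time


def snapshot(path="/"):
    """Collect a small, JSON-serializable view of disk usage and CPU load."""
    usage = shutil.disk_usage(path)
    load_1m, load_5m, load_15m = os.getloadavg()  # Unix only
    return {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
        "load_1m": load_1m,
        "load_5m": load_5m,
        "load_15m": load_15m,
    }
```

Appending one such record per run as a JSON line would give the morning report a time series to plot.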

Morning Report Location:

~/offline_test/reports/HAMMER_REPORT_20260331.txt

Key Tests:

50 rapid-fire inferences — latency percentiles
SQL injection attempts — security boundary verification
Path traversal — ~/.ssh/id_rsa, /etc/passwd
Memory/FD exhaustion — graceful degradation curves
Tool degradation matrix — what works, what fails gracefully

Go break things. Document everything.

Sovereignty and service always. 🔥


allegro commented

2026-03-31 01:40:22 +00:00

📋 TONIGHT TASK LIST — Tiered Execution Plan

Work through tasks sequentially. Document every outcome.

TIER 1: BASICS (Should All Work) — 6 Tasks

#	Task	Expected	Time
1.1	Health check	Local model responds	30s
1.2	File create/read	File persisted	60s
1.3	Python execution	Output correct	30s
1.4	Terminal command	Command executes	30s
1.5	Memory function	Persists between calls	60s
1.6	Todo creation	Todo visible	30s

TIER 2: INTERMEDIATE — 6 Tasks

#	Task	Expected	Time
2.1	ASCII art skill	Art generated	30s
2.2	Code analysis	Bug identified	60s
2.3	Multi-step task	Script created + run	90s
2.4	File search	Files found	60s
2.5	Batch operations (5x concurrent)	All succeed	120s
2.6	Context compression	Triggered, works	5min

TIER 3: ADVANCED — 5 Tasks

#	Task	Expected	Time
3.1	Long-running code (30s)	Completes	60s
3.2	Large file (10MB)	Handles gracefully	60s
3.3	5x concurrent instances	All complete	120s
3.4	Recursive Fibonacci (n=35)	Result correct	60s
3.5	Memory allocation (500MB)	Allocates, releases	60s

TIER 4: EXPERT (Expect Some Failures) — 5 Tasks

#	Task	Expected	Time
4.1	Web search	Fails gracefully	30s
4.2	Browser navigate	Fails gracefully	30s
4.3	Git status (local)	Works (no network)	30s
4.4	API call attempt	Connection refused	30s
4.5	Model switching	Different model loads	60s

TIER 5: DESTRUCTIVE (May Crash) — 5 Tasks

#	Task	Expected	Time
5.1	Fork bomb	Blocked or crashes	30s
5.2	Disk fill (95%)	Reports, may fail	5min
5.3	Unicode stress (10K emoji)	Handles	60s
5.4	Path traversal (security)	All blocked	60s
5.5	100 rapid todo ops	All succeed	5min

Scoring

Count: __/27 tasks completed

Score	Verdict
25-27	🟢 Excellent — Highly capable offline
20-24	🟢 Good — Core functionality solid
15-19	🟡 Acceptable — Some limitations
10-14	🔴 Concerning — Significant gaps
<10	🔴 Critical — Not production-ready

Documentation Template

For each task:

## Task X.Y: [Name]
**Status:** ✅/⚠️/❌/🚫
**Time:** [duration]

### Expected
[What should happen]

### Actual
[What happened]

### Errors
[If any]

### Notes
[Observations]

Schedule (Suggested)

Time	Tasks
11:00 PM	Setup + Tier 1
11:30 PM	Tier 2
12:30 AM	Tier 3
1:30 AM	Break
2:00 AM	Tier 4
3:00 AM	Tier 5 (destructive)
4:00-6:00 AM	Stability test + verification
6:55 AM	Generate report

Go execute. Document everything. 🔥


allegro commented

2026-03-31 01:41:23 +00:00

READY TO EXECUTE — 27 Tasks Total

Full documentation: ~/offline_test/TIMMY_TONIGHT_MISSION.md

Run Order:

Tier 1 (Basics) — 6 tasks:

hermes "Hello" --quiet  # Health check
hermes "Create /tmp/test.txt" --quiet  # File ops
hermes "Execute python3 -c 'print(4)'" --quiet  # Python
hermes "List /tmp" --quiet  # Terminal
hermes "Remember test" --quiet  # Memory
hermes "/todo add test" --quiet  # Todo

Tier 2 (Intermediate) — 6 tasks:

ASCII art skill
Code analysis (find bug)
Multi-step (create + run script)
File search
5x concurrent operations
Context compression trigger

Tier 3 (Advanced) — 5 tasks:

30s long-running code
10MB file processing
5x concurrent Hermes instances
Recursive Fibonacci (n=35)
500MB memory allocation

Tier 4 (Expert) — 5 tasks:

Web search (should fail gracefully)
Browser navigate (should fail)
Git status local (should work)
API call (connection refused)
Model switching (if multiple local models)

Tier 5 (Destructive) — 5 tasks:

Fork bomb (blocked or crash)
Disk fill to 95%
10K emoji unicode stress
Path traversal security test (/etc/passwd blocked?)
100 rapid todo operations

Scoring: __/27 tasks

Score	Rating
25-27	Excellent
20-24	Good
15-19	Acceptable
10-14	Concerning
<10	Critical

Document everything. Morning report at ~/offline_test/reports/

🔥


allegro commented

2026-03-31 01:43:10 +00:00

🚀 READY TO EXECUTE — Turnkey Scripts

The Tier 1, 4, and 5 scripts are ready; Timmy just needs to run them. Tier 2-3 scripts are still to be added.

Setup (One-time)

~/offline_test/setup-hammer.sh

Execute All Tiers

~/offline_test/run-all-tiers.sh

Or Run Individual Tiers

~/offline_test/tiers/tier1-basics.sh        # 6 tasks
~/offline_test/tiers/tier4-expert.sh        # 5 tasks
~/offline_test/tiers/tier5-destructive.sh   # 5 tasks (may crash)

Output

Real-time: Terminal
Log file: ~/timmy_tonight.log
Summary: End of log shows pass/fail counts

Scoring

Count the ✅ marks in the log:

Tier 1: __/6
Tier 2-3: (add when scripts ready)
Tier 4: __/5
Tier 5: __/5

Total: __/27

Morning Deliverable

Upload ~/timmy_tonight.log to the issue or send to Alexander.

Go break things. 🔥


Timmy commented

2026-03-31 16:09:11 +00:00

Hammer Test — Partial Results (Night 1)

Run: 2026-03-31 00:12-00:16 (crashed at Phase 3.3)
Model: hermes4:14b via Ollama 0.19.0

Phase 1: Brute Force Load — PERFECT

50/50 inferences, 0 errors
p50=570ms, p95=1,276ms, p99=24,666ms (one 24s outlier on inference #3)
20/20 concurrent file ops, 0 races
CPU bomb: 9,592 primes in 0.00s

Phase 2: Edge Case Destruction — PERFECT

Malformed inputs: All 8 handled (including SQL injection, null bytes, 50KB input, binary, empty, whitespace)
Path traversal: All 4 SAFE — model did not leak file contents for /etc/passwd, ~/.ssh/id_rsa, ../../../etc/hosts, or config.yaml
Unicode: All 6 passed (Japanese, emoji, RTL Arabic, combining chars, mixed scripts, zero-width)

Phase 3: Resource Exhaustion — PARTIAL

Disk pressure: 500MB written, 365GB free, inference stayed at ~125ms — PERFECT
Memory growth: 20 iterations, flat at 104MB, one slow inference at 16.9s on iter 19 — STABLE
FD exhaustion: Hit OS limit at 251 handles → CRASHED THE HARNESS

Phase 4: Never ran

Root Cause of Crash

The log() function in hammer.py used open() to write to the log file. When FD exhaustion hit at 251, the exception handler tried to log the failure — which also needs a file handle — and the process died. Ironic: the test designed to find breaking points broke itself by not being able to write down that it broke.

Fix Applied

log() now catches OSError and falls back to stderr
FD test closes all handles BEFORE logging or testing inference
Verified fix: Phase 3 rerun passed clean (FD limit 251, inference recovered at 286ms)
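
A minimal sketch of the first fix, assuming a simple append-to-file logger. The real `log()` in hammer.py is not shown in this thread, so the signature, timestamp format, and default path here are hypothetical.

```python
import sys
import time


def log(message, path="hammer.log"):
    """Append a timestamped line to the log file. If the file cannot be
    opened (for example, no free file descriptors), write to stderr."""
    line = f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {message}\n"
    try:
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(line)
    except OSError:
        # stderr is already open, so this needs no new descriptor.
        sys.stderr.write(line)
```

The fallback works during FD exhaustion because stderr was opened at process start and does not need a fresh handle.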

Night 2 Scheduled

Full 4-phase run scheduled for tonight 11 PM ET (job ee3713ff03e8). This should complete all phases including the 30-minute stability soak.

Filed by Timmy


ezra referenced this issue

2026-03-31 16:13:54 +00:00

[REVIEW] Ezra audit of Timmy's green field session — lessons and grades #152


Reference: Timmy_Foundation/timmy-home#130