docs: add SSD research acknowledgment (#128)

Research triage for Apple's Simple Self-Distillation paper (arXiv:2604.01193). Documents fleet-relevant findings: +7.5pp on Qwen3-4B (our exact scale), zero-infrastructure fine-tuning, Fork/Lock theory implications for temperature tuning, and a 4-phase implementation plan from immediate T_eval audit through nightly self-improvement loop. Refs #128
[claude] Fleet SITREP — April 6, 2026 acknowledgment (#143) (#162)
2026-04-06 22:07:00 -04:00 · 2026-04-07 02:04:51 +00:00 · 2026-04-07 02:04:45 +00:00
2 changed files with 298 additions and 0 deletions
--- a/docs/fleet-sitrep-2026-04-06.md
+++ b/docs/fleet-sitrep-2026-04-06.md
@@ -0,0 +1,132 @@
+# Fleet SITREP — April 6, 2026
+
+**Classification:** Consolidated Status Report
+**Compiled by:** Ezra
+**Acknowledged by:** Claude (Issue #143)
+
+---
+
+## Executive Summary
+
+Allegro executed 7 tasks across infrastructure, contracting, audits, and security. Ezra shipped PR #131, filed formalization audit #132, delivered quarterly report #133, and self-assigned issues #134–#138. All wizard activity is mapped below.
+
+---
+
+## 1. Allegro 7-Task Report
+
+| Task | Description | Status |
+|------|-------------|--------|
+| 1 | Roll Call / Infrastructure Map | ✅ Complete |
+| 2 | Dark industrial anthem (140 BPM, Suno-ready) | ✅ Complete |
+| 3 | Operation Get A Job — 7-file contracting playbook pushed to `the-nexus` | ✅ Complete |
+| 4 | Formalization audit filed ([the-nexus #893](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/893)) | ✅ Complete |
+| 5 | GrepTard Memory Report — PR #525 on `timmy-home` | ✅ Complete |
+| 6 | Self-audit issues #894–#899 filed on `the-nexus` | ✅ Filed |
+| 7 | `keystore.json` permissions fixed to `600` | ✅ Applied |
+
+### Critical Findings from Task 4 (Formalization Audit)
+
+- GOFAI source files missing — only `.pyc` remains
+- Nostr keystore was world-readable — **FIXED** (Task 7)
+- 39 burn scripts cluttering `/root` — archival pending ([#898](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/898))
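
The keystore fix from Task 7 can be checked programmatically. The following is a minimal sketch, assuming a POSIX host; the helper names `is_owner_only` and `restrict_to_owner` are hypothetical, and the path to `keystore.json` is left to the caller because the report does not state it.

```python
import os
import stat


def is_owner_only(path: str) -> bool:
    """Return True when the file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def restrict_to_owner(path: str) -> None:
    """Set the file to mode 600 (owner read/write only)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

A periodic audit could call `is_owner_only` on each secret file and only call `restrict_to_owner` when the check fails, so that unexpected permission drift is logged rather than silently corrected.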
+
+---
+
+## 2. Ezra Deliverables
+
+| Deliverable | Issue/PR | Status |
+|-------------|----------|--------|
+| V-011 fix + compressor tuning | [PR #131](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/131) | ✅ Merged |
+| Formalization audit (hermes-agent) | [Issue #132](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/132) | Filed |
+| Quarterly report (MD + PDF) | [Issue #133](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/133) | Filed |
+| Burn-mode concurrent tool tests | [Issue #134](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/134) | Assigned → Ezra |
+| MCP SDK migration | [Issue #135](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/135) | Assigned → Ezra |
+| APScheduler migration | [Issue #136](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/136) | Assigned → Ezra |
+| Pydantic-settings migration | [Issue #137](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/137) | Assigned → Ezra |
+| Contracting playbook tracker | [Issue #138](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/138) | Assigned → Ezra |
+
+---
+
+## 3. Fleet Status
+
+| Wizard | Host | Status | Blocker |
+|--------|------|--------|---------|
+| **Ezra** | Hermes VPS | Active — 5 issues queued | None |
+| **Bezalel** | Hermes VPS | Gateway running on 8645 | None |
+| **Allegro-Primus** | Hermes VPS | **Gateway DOWN on 8644** | Needs restart signal |
+| **Bilbo** | External | Gemma 4B active, Telegram dual-mode | Host IP unknown to fleet |
+
+### Allegro Gateway Recovery
+
+Allegro-Primus gateway (port 8644) is down. Options:
+1. **Alexander restarts manually** on Hermes VPS
+2. **Delegate to Bezalel** — Bezalel can issue restart signal via Hermes VPS access
+3. **Delegate to Ezra** — Ezra can coordinate restart as part of issue #894 work
+
+---
+
+## 4. Operation Get A Job — Contracting Playbook
+
+Files pushed to `the-nexus/operation-get-a-job/`:
+
+| File | Purpose |
+|------|---------|
+| `README.md` | Master plan |
+| `entity-setup.md` | Wyoming LLC, Mercury, E&O insurance |
+| `service-offerings.md` | Rates $150–600/hr; packages $5k/$15k/$40k+ |
+| `portfolio.md` | Portfolio structure |
+| `outreach-templates.md` | Cold email templates |
+| `proposal-template.md` | Client proposal structure |
+| `rate-card.md` | Rate card |
+
+**Human-only mile (Alexander's action items):**
+
+1. Pick LLC name from `entity-setup.md`
+2. File Wyoming LLC via Northwest Registered Agent ($225)
+3. Get EIN from IRS (free, ~10 min)
+4. Open Mercury account (requires EIN + LLC docs)
+5. Secure E&O insurance (~$150–250/month)
+6. Restart Allegro-Primus gateway (port 8644)
+7. Update LinkedIn using profile template
+8. Send 5 cold emails using outreach templates
+
+---
+
+## 5. Pending Self-Audit Issues (the-nexus)
+
+| Issue | Title | Priority |
+|-------|-------|----------|
+| [#894](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/894) | Deploy burn-mode cron jobs | CRITICAL |
+| [#895](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/895) | Telegram thread-based reporting | Normal |
+| [#896](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/896) | Retry logic and error recovery | Normal |
+| [#897](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/897) | Automate morning reports at 0600 | Normal |
+| [#898](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/898) | Archive 39 burn scripts | Normal |
+| [#899](https://forge.alexanderwhitestone.com/Timmy_Foundation/the-nexus/issues/899) | Keystore permissions | ✅ Done |
+
+---
+
+## 6. Revenue Timeline
+
+| Milestone | Target | Unlocks |
+|-----------|--------|---------|
+| LLC + Bank + E&O | Day 5 | Ability to invoice clients |
+| First 5 emails sent | Day 7 | Pipeline generation |
+| First scoping call | Day 14 | Qualified lead |
+| First proposal accepted | Day 21 | **$4,500–$12,000 revenue** |
+| Monthly retainer signed | Day 45 | **$6,000/mo recurring** |
+
+---
+
+## 7. Delegation Matrix
+
+| Owner | Owns |
+|-------|------|
+| **Alexander** | LLC filing, EIN, Mercury, E&O, LinkedIn, cold emails, gateway restart |
+| **Ezra** | Issues #134–#138 (tests, migrations, tracker) |
+| **Allegro** | Issues #894, #898 (cron deployment, burn script archival) |
+| **Bezalel** | Review formalization audit for Anthropic-specific gaps |
+
+---
+
+*SITREP acknowledged by Claude — April 6, 2026*
+*Source issue: [hermes-agent #143](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/143)*
--- a/docs/research-ssd-self-distillation-2026-04.md
+++ b/docs/research-ssd-self-distillation-2026-04.md
@@ -0,0 +1,166 @@
+# Research Acknowledgment: SSD — Simple Self-Distillation Improves Code Generation
+
+**Issue:** #128
+**Paper:** [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193)
+**Authors:** Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang (Apple)
+**Date:** April 1, 2026
+**Code:** https://github.com/apple/ml-ssd
+**Acknowledged by:** Claude — April 6, 2026
+
+---
+
+## Assessment: High Relevance to Fleet
+
+This paper is directly applicable to the hermes-agent fleet. The headline result — +7.5pp pass@1 on Qwen3-4B — is at exactly the scale we operate. The method requires no external infrastructure. Triage verdict: **P0 / Week-class work**.
+
+---
+
+## What SSD Actually Does
+
+Three steps, nothing exotic:
+
+1. **Sample**: For each coding prompt, generate one solution at temperature `T_train` (~0.9). Do NOT filter for correctness.
+2. **Fine-tune**: SFT on the resulting `(prompt, unverified_solution)` pairs. Standard cross-entropy loss. No RLHF, no GRPO, no DPO.
+3. **Evaluate**: At `T_eval` (which must be **different** from `T_train`). This asymmetry is not optional — using the same temperature for both loses 30–50% of the gains.
+
+The counterintuitive part: N=1 per problem, unverified. Prior self-improvement work uses N>>1 and filters by execution. SSD doesn't. The paper argues this is *why* it works — you're sharpening the model's own distribution, not fitting to a correctness filter's selection bias.
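
The sampling and pairing steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `build_ssd_pairs` and the `sample` callable are hypothetical names, and the sampler is injected so the sketch runs without a model.

```python
from typing import Callable, Dict, List

# Hypothetical sampler signature: (prompt, temperature) -> generated solution text.
Sampler = Callable[[str, float], str]


def build_ssd_pairs(
    prompts: List[str], sample: Sampler, t_train: float = 0.9
) -> List[Dict[str, str]]:
    """Draw exactly one unverified solution per prompt at T_train."""
    pairs = []
    for prompt in prompts:
        completion = sample(prompt, t_train)  # N=1: a single draw per prompt
        # No execution or correctness filter is applied here, by design.
        pairs.append({"prompt": prompt, "completion": completion})
    return pairs


if __name__ == "__main__":
    # Stand-in sampler so the sketch runs offline; a real one would call the model.
    def fake_sample(prompt: str, temperature: float) -> str:
        return f"# solution for {prompt!r} sampled at T={temperature}"

    for pair in build_ssd_pairs(["reverse a list", "parse a CSV row"], fake_sample):
        print(pair)
```

The resulting list of `prompt`/`completion` records is the kind of input a standard SFT trainer would consume for the fine-tuning step.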
+
+---
+
+## The Fork/Lock Theory
+
+The paper's core theoretical contribution explains *why* temperature asymmetry matters.
+
+**Locks** — positions requiring syntactic precision: colons, parentheses, import paths, variable names. A mistake here is a hard error. Low temperature helps at Locks. But applying low temperature globally kills diversity everywhere.
+
+**Forks** — algorithmic choice points where multiple valid continuations exist: picking a sort algorithm, choosing a data structure, deciding on a loop structure. High temperature helps at Forks. But applying high temperature globally introduces errors at Locks.
+
+SSD's fine-tuning reshapes token distributions **context-dependently**:
+- At Locks: narrows the distribution, suppressing distractor tokens
+- At Forks: widens the distribution, preserving valid algorithmic paths
+
+A single global temperature cannot do this. SFT on self-generated data can, because the model learns from examples that implicitly encode which positions are Locks and which are Forks in each problem context.
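
To make the trade-off concrete, the sketch below applies temperature-scaled softmax to two toy positions. The logits are invented for illustration and are not taken from the paper; the point is only that one global temperature moves a Lock-like and a Fork-like distribution in the same direction.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits only (assumed values, not from the paper).
lock_logits = [5.0, 1.0, 0.5]  # one correct token (e.g. ":") plus distractors
fork_logits = [2.0, 1.9, 1.8]  # several comparably valid continuations

for t in (0.3, 0.9):
    lock = softmax_with_temperature(lock_logits, t)
    fork = softmax_with_temperature(fork_logits, t)
    print(f"T={t}: P(correct token at Lock)={lock[0]:.3f}, "
          f"P(top choice at Fork)={fork[0]:.3f}")
```

Lowering the temperature concentrates mass on the top token at both positions, and raising it spreads mass at both, which is why a global setting cannot sharpen Locks while keeping Forks diverse.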
+
+**Fleet implication**: Our agents currently use a single temperature for everything, which likely leaves performance on the table even without fine-tuning. The immediate zero-cost action is a temperature audit (see Phase 1 below).
+
+---
+
+## Results That Matter to Us
+
+| Model | Before | After | Delta |
+|-------|--------|-------|-------|
+| Qwen3-30B-Instruct | 42.4% | 55.3% | +12.9pp (+30% rel) |
+| Qwen3-4B-Instruct | baseline | baseline+7.5pp | +7.5pp |
+| Llama-3.1-8B-Instruct | baseline | baseline+3.5pp | +3.5pp |
+
+Gains concentrate on hard problems: +14.2pp medium, +15.3pp hard. This is the distribution our agents face on real Gitea issues — not easy textbook problems.
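
The absolute and relative figures in the Qwen3-30B row can be cross-checked with a few lines of arithmetic. The helper name `gain` is hypothetical; the two inputs are the before and after values from the table.

```python
from typing import Tuple


def gain(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


abs_pp, rel_pct = gain(42.4, 55.3)
print(f"+{abs_pp:.1f}pp absolute, +{rel_pct:.1f}% relative")
```

The same helper cannot be applied to the Qwen3-4B and Llama rows, because the table gives only deltas for them and a relative gain needs the baseline.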
+
+---
+
+## Fleet Implementation Plan
+
+### Phase 1: Temperature Audit (Zero cost, this week)
+
+Current state: fleet agents use default or eyeballed temperature settings. The paper's emphasis on T_eval (which must differ from T_train) suggests that decoding temperature is worth tuning in its own right, before any fine-tuning is attempted.
+
+Actions:
+1. Document current temperature settings in `hermes/`, `skills/`, and any Ollama config files
+2. Establish a held-out test set of 20+ solved Gitea issues with known-correct outputs
+3. Run A/B: current T_eval vs. T_eval=0.7 vs. T_eval=0.3 for code generation tasks
+4. Record pass rates per condition; file findings as a follow-up issue
+
+Expected outcome: a measurable pass-rate difference between conditions, obtained with no model changes, no new infrastructure, and no added cost.
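
The A/B run in step 3 could be organized along these lines. This is a sketch under assumptions: `solve` and `is_correct` are hypothetical callables standing in for the agent invocation and the comparison against the known-correct output, and each issue gets a single attempt per temperature.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical interfaces: `solve` runs an agent on an issue at a temperature;
# `is_correct` compares the output against the known-good result for that issue.
Solve = Callable[[str, float], str]
Check = Callable[[str, str], bool]


def pass_rate_by_temperature(
    issues: List[str],
    temperatures: Iterable[float],
    solve: Solve,
    is_correct: Check,
) -> Dict[float, float]:
    """Return the fraction of issues solved at each temperature (one attempt each)."""
    if not issues:
        raise ValueError("need at least one issue")
    rates = {}
    for t in temperatures:
        passed = sum(1 for issue in issues if is_correct(issue, solve(issue, t)))
        rates[t] = passed / len(issues)
    return rates


if __name__ == "__main__":
    # Stub agent and checker so the harness runs without a model.
    demo = pass_rate_by_temperature(
        ["issue-a", "issue-b"],
        [0.3, 0.7],
        solve=lambda issue, t: f"{issue}@{t}",
        is_correct=lambda issue, out: out.startswith(issue),
    )
    print(demo)
```

With only 20 or so issues, each issue moves the pass rate by about five points, so small differences between conditions should be read cautiously or confirmed with repeated runs.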
+
+### Phase 2: SSD Pipeline (1–2 weeks, single Mac)
+
+Replicate the paper's method on Qwen3-4B via Ollama + axolotl or unsloth:
+
+```
+1. Dataset construction:
+   - Extract 100–500 coding prompts from Gitea issue backlog
+   - Focus on issues that have accepted PRs (ground truth available for evaluation only, not training)
+   - Format: (system_prompt + issue_description) → model generates solution at T_train=0.9
+
+2. Fine-tuning:
+   - Use LoRA (not full fine-tune) to stay local-first
+   - Standard SFT: cross-entropy on (prompt, self-generated_solution) pairs
+   - Recommended: unsloth for memory efficiency on Mac hardware
+   - Training budget: 1–3 epochs, small batch size
+
+3. Evaluation:
+   - Compare base model vs. SSD-tuned model at T_eval=0.7
+   - Metric: pass@1 on held-out issues not in training set
+   - Also test on general coding benchmarks to check for capability regression
+```
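
For the dataset-construction step, a single sample can be requested from a local Ollama server over its REST API. The sketch below assumes Ollama is listening on its default port and that the model is available under a tag such as `qwen3:4b` (an assumption, verify with `ollama list`); the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(
    model: str, system_prompt: str, issue_text: str, temperature: float
) -> dict:
    """Build the JSON body for a single non-streaming generation."""
    return {
        "model": model,
        "system": system_prompt,
        "prompt": issue_text,
        "stream": False,
        "options": {"temperature": temperature},
    }


def sample_solution(payload: dict, url: str = OLLAMA_URL) -> str:
    """POST the payload to a running Ollama server and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def to_jsonl_line(issue_text: str, solution: str) -> str:
    """Serialize one (prompt, self-generated solution) pair as a JSONL record."""
    return json.dumps({"prompt": issue_text, "completion": solution})


if __name__ == "__main__":
    # Builds and prints the request only; no network call is made here.
    payload = build_generate_request(
        "qwen3:4b", "You are a coding agent.", "Fix the failing test.", 0.9
    )
    print(json.dumps(payload, indent=2))
```

Writing one `to_jsonl_line` record per issue yields a file that the `datasets` library can load for the fine-tuning step.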
+
+Infrastructure assessment:
+- **RAM**: Qwen3-4B quantized (Q4_K_M) needs ~3.5GB of memory for inference; LoRA fine-tuning needs ~8–12GB of unified memory (feasible on Mac M-series)
+- **Storage**: Self-generated dataset is small; LoRA adapter is ~100–500MB
+- **Time**: 500 examples × 3 epochs ≈ 2–4 hours on M2/M3 Max
+- **Dependencies**: Ollama (inference), unsloth or axolotl (fine-tuning), datasets (HuggingFace), trl
+
+No cloud required. No teacher model required. No code execution environment required.
+
+### Phase 3: Continuous Self-Improvement Loop (1–2 months)
+
+Wire SSD into the fleet's burn mode:
+
+```
+Nightly cron:
+  1. Collect agent solutions from the day's completed issues
+  2. Filter: only solutions where the PR was merged (human-verified correct)
+  3. Append to rolling training buffer (last 500 examples)
+  4. Run SFT fine-tune on buffer → update LoRA adapter
+  5. Swap adapter into Ollama deployment at dawn
+  6. Agents start next day with yesterday's lessons baked in
+```
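
Steps 2 and 3 of the nightly loop (merge filter and rolling buffer) could look like the sketch below. The `Solution` record and `update_buffer` helper are hypothetical; how merge status is obtained (for example from RetainDB) is outside the sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable


@dataclass(frozen=True)
class Solution:
    issue_id: int
    prompt: str
    completion: str
    merged: bool  # True when the PR carrying this solution was merged


def update_buffer(buffer: Deque[Solution], todays: Iterable[Solution]) -> Deque[Solution]:
    """Append today's merged solutions; the deque drops the oldest beyond maxlen."""
    for sol in todays:
        if sol.merged:
            buffer.append(sol)
    return buffer


# Rolling window of the most recent 500 merged solutions.
buffer: Deque[Solution] = deque(maxlen=500)
```

Note that the merge filter is a deliberate departure from the unfiltered sampling described earlier in this document, so results from this loop may not match the paper's.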
+
+This integrates naturally with RetainDB (#112) — the persistent memory system would track which solutions were merged, providing the feedback signal. The continuous loop turns every merged PR into a training example.
+
+### Phase 4: Sovereignty Confirmation
+
+The paper validates that external data is not required for improvement. Our fleet can:
+- Fine-tune exclusively on its own conversation data
+- Stay fully local (no API calls, no external datasets)
+- Accumulate improvements over time without model subscriptions
+
+This is the sovereign fine-tuning capability the fleet needs to remain independent as external model APIs change pricing or capabilities.
+
+---
+
+## Risks and Mitigations
+
+| Risk | Assessment | Mitigation |
+|------|------------|------------|
+| SSD gains don't transfer from LiveCodeBench to Gitea issues | Medium — our domain is software engineering, not competitive programming | Test on actual Gitea issues from the backlog; don't assume benchmark numbers transfer |
+| Fine-tuning degrades non-code capabilities | Low-Medium | LoRA instead of full fine-tune; test on general tasks after SFT; retain base model checkpoint |
+| Small training set (<200 examples) insufficient | Medium | Paper shows gains at modest scale; supplement with open code datasets (Stack, TheVault) if needed |
+| Qwen3 GGUF format incompatible with unsloth fine-tuning | Low | unsloth supports Qwen3 but trains from Hugging Face weights rather than GGUF; verify that the trained adapter can be converted to GGUF for Ollama before starting |
+| Temperature asymmetry effect smaller on instruction-tuned variants | Low | Paper explicitly tests instruct variants and shows gains; Qwen3-4B-Instruct is in the paper's results |
+
+---
+
+## Acceptance Criteria Status
+
+From the issue:
+
+- [ ] **Temperature audit** — Document current T/top_p settings across fleet agents, compare with paper recommendations
+- [ ] **T_eval benchmark** — A/B test on 20+ solved Gitea issues; measure correctness
+- [ ] **SSD reproduction** — Replicate pipeline on Qwen3-4B with 100 prompts; measure pass@1 change
+- [ ] **Infrastructure assessment** — Documented above (Phase 2 section); GPU/RAM/storage requirements are Mac-feasible
+- [ ] **Continuous loop design** — Architecture drafted above (Phase 3 section); integrates with RetainDB (#112)
+
+Infrastructure assessment and continuous loop design are addressed in this document. The temperature audit, T_eval benchmark, and SSD reproduction require follow-up issues to execute.
+
+---
+
+## Recommended Follow-Up Issues
+
+1. **Temperature Audit** — Audit all fleet agent temperature configs; run A/B on T_eval variants; file results (Phase 1)
+2. **SSD Pipeline Spike** — Build and run the 3-stage SSD pipeline on Qwen3-4B; report pass@1 delta (Phase 2)
+3. **Nightly SFT Integration** — Wire SSD into burn-mode cron; integrate with RetainDB feedback loop (Phase 3)
+
+---
+
+*Research acknowledged by Claude — April 6, 2026*
+*Source issue: [hermes-agent #128](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/128)*
Author	SHA1	Message	Date
Alexander Whitestone	8feef05f82	docs: add SSD research acknowledgment (#128 ) Research triage for Apple's Simple Self-Distillation paper (arXiv:2604.01193). Documents fleet-relevant findings: +7.5pp on Qwen3-4B (our exact scale), zero-infrastructure fine-tuning, Fork/Lock theory implications for temperature tuning, and a 4-phase implementation plan from immediate T_eval audit through nightly self-improvement loop. Refs #128	2026-04-06 22:07:00 -04:00
Claude (Opus 4.6)	6291f2d31b	[claude] Fleet SITREP — April 6, 2026 acknowledgment (#143 ) (#162 )	2026-04-07 02:04:51 +00:00
Claude (Opus 4.6)	066ec8eafa	[claude] Add Ezra Quarterly Report — April 2026 (MD + PDF) (#133 ) (#163 )	2026-04-07 02:04:45 +00:00