# Research Acknowledgment: SSD — Simple Self-Distillation Improves Code Generation
**Issue:** #128
**Paper:** [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193)
**Authors:** Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang (Apple)
**Date:** April 1, 2026
**Code:** https://github.com/apple/ml-ssd
**Acknowledged by:** Claude — April 6, 2026
---
## Assessment: High Relevance to Fleet
This paper is directly applicable to the hermes-agent fleet. The headline result — +7.5pp pass@1 on Qwen3-4B — is at exactly the scale we operate. The method requires no external infrastructure. Triage verdict: **P0 / Week-class work**.
---
## What SSD Actually Does
Three steps, nothing exotic:
1. **Sample**: For each coding prompt, generate one solution at temperature `T_train` (~0.9). Do NOT filter for correctness.
2. **Fine-tune**: SFT on the resulting `(prompt, unverified_solution)` pairs. Standard cross-entropy loss. No RLHF, no GRPO, no DPO.
3. **Evaluate**: At `T_eval` (which must be **different** from `T_train`). This asymmetry is not optional: using the same temperature for both loses 30-50% of the gains.
The counterintuitive part: N=1 per problem, unverified. Prior self-improvement work uses N>>1 and filters by execution. SSD doesn't. The paper argues this is *why* it works — you're sharpening the model's own distribution, not fitting to a correctness filter's selection bias.
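The sampling step described above can be sketched as follows. This is a minimal illustration, not the paper's code: `build_ssd_dataset` and `fake_generate` are hypothetical names, and the stub generator stands in for a real model call.

```python
from typing import Callable, Dict, List

T_TRAIN = 0.9  # sampling temperature quoted in the steps above


def build_ssd_dataset(
    prompts: List[str],
    generate: Callable[[str, float], str],
    t_train: float = T_TRAIN,
) -> List[Dict[str, str]]:
    """Sample exactly one solution per prompt and keep it unconditionally."""
    dataset = []
    for prompt in prompts:
        completion = generate(prompt, t_train)  # N=1, no retries
        # Deliberately no correctness check: SSD trains on unverified samples.
        dataset.append({"prompt": prompt, "completion": completion})
    return dataset


def fake_generate(prompt: str, temperature: float) -> str:
    """Stand-in for a real model call (hypothetical, for illustration only)."""
    return f"# solution for: {prompt} (T={temperature})"


pairs = build_ssd_dataset(["reverse a list", "parse a date"], fake_generate)
print(len(pairs), "training pairs")
```

The resulting `(prompt, completion)` records are what the SFT step would consume; the absence of any filter in the loop is the point of the sketch.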
---
## The Fork/Lock Theory
The paper's core theoretical contribution explains *why* temperature asymmetry matters.
**Locks** — positions requiring syntactic precision: colons, parentheses, import paths, variable names. A mistake here is a hard error. Low temperature helps at Locks. But applying low temperature globally kills diversity everywhere.
**Forks** — algorithmic choice points where multiple valid continuations exist: picking a sort algorithm, choosing a data structure, deciding on a loop structure. High temperature helps at Forks. But applying high temperature globally introduces errors at Locks.
SSD's fine-tuning reshapes token distributions **context-dependently**:
- At Locks: narrows the distribution, suppressing distractor tokens
- At Forks: widens the distribution, preserving valid algorithmic paths
A single global temperature cannot do this. SFT on self-generated data can, because the model learns from examples that implicitly encode which positions are Locks and which are Forks in each problem context.
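To make the limitation of a global temperature concrete, the sketch below applies the same temperature to two made-up logit vectors, one Lock-like (a single dominant token) and one Fork-like (several near-equal options). The logit values are illustrative assumptions, not taken from the paper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


lock_logits = [6.0, 1.0, 0.5, 0.0]  # hypothetical: one correct token, a few distractors
fork_logits = [2.0, 1.9, 1.8, 0.0]  # hypothetical: several plausible continuations

for t in (0.3, 0.9):
    lock_h = entropy(softmax_with_temperature(lock_logits, t))
    fork_h = entropy(softmax_with_temperature(fork_logits, t))
    print(f"T={t}: lock entropy={lock_h:.3f}, fork entropy={fork_h:.3f}")
```

Lowering the temperature reduces entropy at both positions and raising it increases both, so a single knob cannot narrow Locks while widening Forks.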
**Fleet implication**: Our agents are currently using a single temperature for everything. This is leaving performance on the table even without fine-tuning. The immediate zero-cost action is temperature auditing (see Phase 1 below).
---
## Results That Matter to Us
| Model | Before | After | Delta |
|-------|--------|-------|-------|
| Qwen3-30B-Instruct | 42.4% | 55.3% | +12.9pp (+30% rel) |
| Qwen3-4B-Instruct | baseline | baseline+7.5pp | +7.5pp |
| Llama-3.1-8B-Instruct | baseline | baseline+3.5pp | +3.5pp |
Gains concentrate on hard problems: +14.2pp medium, +15.3pp hard. This is the distribution our agents face on real Gitea issues — not easy textbook problems.
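The relative figure in the Qwen3-30B row follows directly from the two absolute numbers in the table; a quick check:

```python
before, after = 42.4, 55.3  # pass@1 percentages for Qwen3-30B-Instruct, from the table

absolute_gain_pp = after - before
relative_gain_pct = absolute_gain_pp / before * 100

print(f"absolute: +{absolute_gain_pp:.1f}pp, relative: +{relative_gain_pct:.1f}%")
# prints "absolute: +12.9pp, relative: +30.4%"
```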
---
## Fleet Implementation Plan
### Phase 1: Temperature Audit (Zero cost, this week)
Current state: fleet agents use default or eyeballed temperature settings. The paper's temperature-asymmetry result indicates that decoding temperature is a first-order setting, so it is worth measuring on our own tasks before any fine-tuning.
Actions:
1. Document current temperature settings in `hermes/`, `skills/`, and any Ollama config files
2. Establish a held-out test set of 20+ solved Gitea issues with known-correct outputs
3. Run A/B: current T_eval vs. T_eval=0.7 vs. T_eval=0.3 for code generation tasks
4. Record pass rates per condition; file findings as a follow-up issue
Expected outcome: measurable improvement with no model changes, no infrastructure, no cost.
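One possible shape for the A/B comparison is sketched below. The `solve` and `check` callables, the issue record layout, and the stub implementations are all hypothetical; a real run would plug in the agent call and whatever correctness check the held-out set supports.

```python
from typing import Callable, Dict, List

Issue = Dict[str, str]  # hypothetical shape: {"prompt": ..., "expected": ...}


def pass_rates(
    issues: List[Issue],
    solve: Callable[[str, float], str],
    check: Callable[[Issue, str], bool],
    temperatures: List[float],
) -> Dict[float, float]:
    """Return the fraction of issues passed at each temperature condition."""
    if not issues:
        raise ValueError("need at least one held-out issue")
    rates = {}
    for t in temperatures:
        passed = sum(1 for issue in issues if check(issue, solve(issue["prompt"], t)))
        rates[t] = passed / len(issues)
    return rates


def stub_solve(prompt: str, temperature: float) -> str:
    """Placeholder for the agent call; ignores temperature on purpose."""
    return prompt.upper()


def stub_check(issue: Issue, output: str) -> bool:
    """Placeholder check: exact match against the known-correct output."""
    return output == issue["expected"]


held_out = [
    {"prompt": "a", "expected": "A"},
    {"prompt": "b", "expected": "x"},
]
rates = pass_rates(held_out, stub_solve, stub_check, [0.3, 0.7])
print(rates)
```

Because the stub ignores temperature, both conditions report the same rate here; the structure, not the numbers, is what carries over to a real audit.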
### Phase 2: SSD Pipeline (1-2 weeks, single Mac)
Replicate the paper's method on Qwen3-4B via Ollama + axolotl or unsloth:
```
1. Dataset construction:
- Extract 100-500 coding prompts from Gitea issue backlog
- Focus on issues that have accepted PRs (ground truth available for evaluation only, not training)
- Format: (system_prompt + issue_description) → model generates solution at T_train=0.9
2. Fine-tuning:
- Use LoRA (not full fine-tune) to stay local-first
- Standard SFT: cross-entropy on (prompt, self-generated_solution) pairs
- Recommended: unsloth for memory efficiency on Mac hardware
- Training budget: 1-3 epochs, small batch size
3. Evaluation:
- Compare base model vs. SSD-tuned model at T_eval=0.7
- Metric: pass@1 on held-out issues not in training set
- Also test on general coding benchmarks to check for capability regression
```
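A minimal sketch of the dataset-construction step against a local Ollama server, using only the standard library. It assumes Ollama's `/api/generate` endpoint on the default port; the model tag `qwen3:4b` and the helper names are assumptions to adapt to the local setup.

```python
import json
import urllib.request
from typing import Dict, Iterable

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen3:4b"  # assumed model tag; use whatever `ollama list` reports


def build_payload(prompt: str, temperature: float, model: str = MODEL) -> Dict:
    """Request body for one non-streaming generation at a given temperature."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }


def sample_once(prompt: str, temperature: float = 0.9) -> str:
    """Send one generation request to the local server and return the text."""
    body = json.dumps(build_payload(prompt, temperature)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def to_jsonl(pairs: Iterable[Dict[str, str]]) -> str:
    """Serialize (prompt, completion) records as one JSON object per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)


example = to_jsonl([{"prompt": "fix the off-by-one", "completion": "..."}])
print(example)
```

Calling `sample_once(issue_text)` for each extracted prompt and passing the results through `to_jsonl` would yield a file that most SFT tooling can read; the exact record format expected by the chosen fine-tuning library should be checked first.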
Infrastructure assessment:
- **RAM**: Qwen3-4B quantized (Q4_K_M) needs ~3.5GB VRAM for inference; LoRA fine-tuning needs ~8-12GB unified memory (Mac M-series feasible)
- **Storage**: Self-generated dataset is small; LoRA adapter is ~100-500MB
- **Time**: 500 examples × 3 epochs ≈ 2-4 hours on M2/M3 Max
- **Dependencies**: Ollama (inference), unsloth or axolotl (fine-tuning), datasets (HuggingFace), trl
No cloud required. No teacher model required. No code execution environment required.
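The inference memory figure above can be sanity-checked with back-of-the-envelope arithmetic. The sketch counts weights only; the 4.5 bits-per-weight value is an illustrative assumption for a 4-bit quantization, not a measured number for Q4_K_M.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 4e9  # nominal parameter count for a "4B" model

weights_q4 = weight_memory_gb(N_PARAMS, 4.5)   # assumed average bits per weight
weights_fp16 = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights for comparison

print(f"4-bit weights: ~{weights_q4:.2f} GB, fp16 weights: ~{weights_fp16:.1f} GB")
```

Under these assumptions the quantized weights come to roughly 2.25 GB, which leaves room within the quoted inference figure for the KV cache and runtime overhead. Actual usage depends on context length and the runtime, so it should be measured on the target machine.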
### Phase 3: Continuous Self-Improvement Loop (1-2 months)
Wire SSD into the fleet's burn mode:
```
Nightly cron:
1. Collect agent solutions from the day's completed issues
2. Filter: only solutions where the PR was merged (human-verified correct)
3. Append to rolling training buffer (last 500 examples)
4. Run SFT fine-tune on buffer → update LoRA adapter
5. Swap adapter into Ollama deployment at dawn
6. Agents start next day with yesterday's lessons baked in
```
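Steps 2 and 3 of the nightly loop amount to a filtered, fixed-size rolling buffer. A minimal sketch, assuming each solution record carries hypothetical `prompt`, `completion`, and `merged` fields:

```python
from collections import deque
from typing import Deque, Dict, Iterable

BUFFER_SIZE = 500  # rolling window size from the plan above


def update_buffer(
    buffer: Deque[Dict[str, str]], todays_solutions: Iterable[Dict]
) -> Deque[Dict[str, str]]:
    """Append merged solutions; the deque drops the oldest entries when full."""
    for solution in todays_solutions:
        if solution.get("merged"):  # only human-verified (merged) work is kept
            buffer.append(
                {"prompt": solution["prompt"], "completion": solution["completion"]}
            )
    return buffer


buffer: Deque[Dict[str, str]] = deque(maxlen=BUFFER_SIZE)
update_buffer(
    buffer,
    [
        {"prompt": "issue 1", "completion": "patch 1", "merged": True},
        {"prompt": "issue 2", "completion": "patch 2", "merged": False},
        {"prompt": "issue 3", "completion": "patch 3", "merged": True},
    ],
)
print(len(buffer), "examples in buffer")
```

Using `deque(maxlen=...)` keeps eviction implicit, so the cron job only has to append. Persisting the buffer between runs (for example as JSONL) is left out of the sketch.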
This integrates naturally with RetainDB (#112) — the persistent memory system would track which solutions were merged, providing the feedback signal. The continuous loop turns every merged PR into a training example.
### Phase 4: Sovereignty Confirmation
The paper validates that external data is not required for improvement. Our fleet can:
- Fine-tune exclusively on its own conversation data
- Stay fully local (no API calls, no external datasets)
- Accumulate improvements over time without model subscriptions
This is the sovereign fine-tuning capability the fleet needs to remain independent as external model APIs change pricing or capabilities.
---
## Risks and Mitigations
| Risk | Assessment | Mitigation |
|------|------------|------------|
| SSD gains don't transfer from LiveCodeBench to Gitea issues | Medium — our domain is software engineering, not competitive programming | Test on actual Gitea issues from the backlog; don't assume benchmark numbers transfer |
| Fine-tuning degrades non-code capabilities | Low-Medium | LoRA instead of full fine-tune; test on general tasks after SFT; retain base model checkpoint |
| Small training set (<200 examples) insufficient | Medium | Paper shows gains at modest scale; supplement with open code datasets (Stack, TheVault) if needed |
| Qwen3 GGUF format incompatible with unsloth fine-tuning | Low | unsloth supports Qwen3; verify exact GGUF variant compatibility before starting |
| Temperature asymmetry effect smaller on instruction-tuned variants | Low | Paper explicitly tests instruct variants and shows gains; Qwen3-4B-Instruct is in the paper's results |
---
## Acceptance Criteria Status
From the issue:
- [ ] **Temperature audit** — Document current T/top_p settings across fleet agents, compare with paper recommendations
- [ ] **T_eval benchmark** — A/B test on 20+ solved Gitea issues; measure correctness
- [ ] **SSD reproduction** — Replicate pipeline on Qwen3-4B with 100 prompts; measure pass@1 change
- [ ] **Infrastructure assessment** — Documented above (Phase 2 section); GPU/RAM/storage requirements are Mac-feasible
- [ ] **Continuous loop design** — Architecture drafted above (Phase 3 section); integrates with RetainDB (#112)
Infrastructure assessment and continuous loop design are addressed in this document. Temperature audit, T_eval benchmark, and SSD reproduction require follow-up issues with execution.
---
## Recommended Follow-Up Issues
1. **Temperature Audit** — Audit all fleet agent temperature configs; run A/B on T_eval variants; file results (Phase 1)
2. **SSD Pipeline Spike** — Build and run the 3-stage SSD pipeline on Qwen3-4B; report pass@1 delta (Phase 2)
3. **Nightly SFT Integration** — Wire SSD into burn-mode cron; integrate with RetainDB feedback loop (Phase 3)
---
*Research acknowledged by Claude — April 6, 2026*
*Source issue: [hermes-agent #128](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/128)*