Compare commits
1 Commits
gemini/iss
...
claude/iss
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
8feef05f82 |
166
docs/research-ssd-self-distillation-2026-04.md
Normal file
166
docs/research-ssd-self-distillation-2026-04.md
Normal file
@@ -0,0 +1,166 @@
|
||||
# Research Acknowledgment: SSD — Simple Self-Distillation Improves Code Generation
|
||||
|
||||
**Issue:** #128
|
||||
**Paper:** [Embarrassingly Simple Self-Distillation Improves Code Generation](https://arxiv.org/abs/2604.01193)
|
||||
**Authors:** Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang (Apple)
|
||||
**Date:** April 1, 2026
|
||||
**Code:** https://github.com/apple/ml-ssd
|
||||
**Acknowledged by:** Claude — April 6, 2026
|
||||
|
||||
---
|
||||
|
||||
## Assessment: High Relevance to Fleet
|
||||
|
||||
This paper is directly applicable to the hermes-agent fleet. The headline result — +7.5pp pass@1 on Qwen3-4B — is at exactly the scale we operate. The method requires no external infrastructure. Triage verdict: **P0 / Week-class work**.
|
||||
|
||||
---
|
||||
|
||||
## What SSD Actually Does
|
||||
|
||||
Three steps, nothing exotic:
|
||||
|
||||
1. **Sample**: For each coding prompt, generate one solution at temperature `T_train` (~0.9). Do NOT filter for correctness.
|
||||
2. **Fine-tune**: SFT on the resulting `(prompt, unverified_solution)` pairs. Standard cross-entropy loss. No RLHF, no GRPO, no DPO.
|
||||
3. **Evaluate**: At `T_eval` (which must be **different** from `T_train`). This asymmetry is not optional — using the same temperature for both loses 30–50% of the gains.
|
||||
|
||||
The counterintuitive part: N=1 per problem, unverified. Prior self-improvement work uses N>>1 and filters by execution. SSD doesn't. The paper argues this is *why* it works — you're sharpening the model's own distribution, not fitting to a correctness filter's selection bias.
|
||||
|
||||
---
|
||||
|
||||
## The Fork/Lock Theory
|
||||
|
||||
The paper's core theoretical contribution explains *why* temperature asymmetry matters.
|
||||
|
||||
**Locks** — positions requiring syntactic precision: colons, parentheses, import paths, variable names. A mistake here is a hard error. Low temperature helps at Locks. But applying low temperature globally kills diversity everywhere.
|
||||
|
||||
**Forks** — algorithmic choice points where multiple valid continuations exist: picking a sort algorithm, choosing a data structure, deciding on a loop structure. High temperature helps at Forks. But applying high temperature globally introduces errors at Locks.
|
||||
|
||||
SSD's fine-tuning reshapes token distributions **context-dependently**:
|
||||
- At Locks: narrows the distribution, suppressing distractor tokens
|
||||
- At Forks: widens the distribution, preserving valid algorithmic paths
|
||||
|
||||
A single global temperature cannot do this. SFT on self-generated data can, because the model learns from examples that implicitly encode which positions are Locks and which are Forks in each problem context.
|
||||
|
||||
**Fleet implication**: Our agents are currently using a single temperature for everything. This is leaving performance on the table even without fine-tuning. The immediate zero-cost action is temperature auditing (see Phase 1 below).
|
||||
|
||||
---
|
||||
|
||||
## Results That Matter to Us
|
||||
|
||||
| Model | Before | After | Delta |
|
||||
|-------|--------|-------|-------|
|
||||
| Qwen3-30B-Instruct | 42.4% | 55.3% | +12.9pp (+30% rel) |
|
||||
| Qwen3-4B-Instruct | baseline | baseline+7.5pp | +7.5pp |
|
||||
| Llama-3.1-8B-Instruct | baseline | baseline+3.5pp | +3.5pp |
|
||||
|
||||
Gains concentrate on hard problems: +14.2pp medium, +15.3pp hard. This is the distribution our agents face on real Gitea issues — not easy textbook problems.
|
||||
|
||||
---
|
||||
|
||||
## Fleet Implementation Plan
|
||||
|
||||
### Phase 1: Temperature Audit (Zero cost, this week)
|
||||
|
||||
Current state: fleet agents use default or eyeballed temperature settings. The paper shows T_eval != T_train is critical even without fine-tuning.
|
||||
|
||||
Actions:
|
||||
1. Document current temperature settings in `hermes/`, `skills/`, and any Ollama config files
|
||||
2. Establish a held-out test set of 20+ solved Gitea issues with known-correct outputs
|
||||
3. Run A/B: current T_eval vs. T_eval=0.7 vs. T_eval=0.3 for code generation tasks
|
||||
4. Record pass rates per condition; file findings as a follow-up issue
|
||||
|
||||
Expected outcome: measurable improvement with no model changes, no infrastructure, no cost.
|
||||
|
||||
### Phase 2: SSD Pipeline (1–2 weeks, single Mac)
|
||||
|
||||
Replicate the paper's method on Qwen3-4B via Ollama + axolotl or unsloth:
|
||||
|
||||
```
|
||||
1. Dataset construction:
|
||||
- Extract 100–500 coding prompts from Gitea issue backlog
|
||||
- Focus on issues that have accepted PRs (ground truth available for evaluation only, not training)
|
||||
- Format: (system_prompt + issue_description) → model generates solution at T_train=0.9
|
||||
|
||||
2. Fine-tuning:
|
||||
- Use LoRA (not full fine-tune) to stay local-first
|
||||
- Standard SFT: cross-entropy on (prompt, self-generated_solution) pairs
|
||||
- Recommended: unsloth for memory efficiency on Mac hardware
|
||||
- Training budget: 1–3 epochs, small batch size
|
||||
|
||||
3. Evaluation:
|
||||
- Compare base model vs. SSD-tuned model at T_eval=0.7
|
||||
- Metric: pass@1 on held-out issues not in training set
|
||||
- Also test on general coding benchmarks to check for capability regression
|
||||
```
|
||||
|
||||
Infrastructure assessment:
|
||||
- **RAM**: Qwen3-4B quantized (Q4_K_M) needs ~3.5GB VRAM for inference; LoRA fine-tuning needs ~8–12GB unified memory (Mac M-series feasible)
|
||||
- **Storage**: Self-generated dataset is small; LoRA adapter is ~100–500MB
|
||||
- **Time**: 500 examples × 3 epochs ≈ 2–4 hours on M2/M3 Max
|
||||
- **Dependencies**: Ollama (inference), unsloth or axolotl (fine-tuning), datasets (HuggingFace), trl
|
||||
|
||||
No cloud required. No teacher model required. No code execution environment required.
|
||||
|
||||
### Phase 3: Continuous Self-Improvement Loop (1–2 months)
|
||||
|
||||
Wire SSD into the fleet's burn mode:
|
||||
|
||||
```
|
||||
Nightly cron:
|
||||
1. Collect agent solutions from the day's completed issues
|
||||
2. Filter: only solutions where the PR was merged (human-verified correct)
|
||||
3. Append to rolling training buffer (last 500 examples)
|
||||
4. Run SFT fine-tune on buffer → update LoRA adapter
|
||||
5. Swap adapter into Ollama deployment at dawn
|
||||
6. Agents start next day with yesterday's lessons baked in
|
||||
```
|
||||
|
||||
This integrates naturally with RetainDB (#112) — the persistent memory system would track which solutions were merged, providing the feedback signal. The continuous loop turns every merged PR into a training example.
|
||||
|
||||
### Phase 4: Sovereignty Confirmation
|
||||
|
||||
The paper validates that external data is not required for improvement. Our fleet can:
|
||||
- Fine-tune exclusively on its own conversation data
|
||||
- Stay fully local (no API calls, no external datasets)
|
||||
- Accumulate improvements over time without model subscriptions
|
||||
|
||||
This is the sovereign fine-tuning capability the fleet needs to remain independent as external model APIs change pricing or capabilities.
|
||||
|
||||
---
|
||||
|
||||
## Risks and Mitigations
|
||||
|
||||
| Risk | Assessment | Mitigation |
|
||||
|------|------------|------------|
|
||||
| SSD gains don't transfer from LiveCodeBench to Gitea issues | Medium — our domain is software engineering, not competitive programming | Test on actual Gitea issues from the backlog; don't assume benchmark numbers transfer |
|
||||
| Fine-tuning degrades non-code capabilities | Low-Medium | LoRA instead of full fine-tune; test on general tasks after SFT; retain base model checkpoint |
|
||||
| Small training set (<200 examples) insufficient | Medium | Paper shows gains at modest scale; supplement with open code datasets (Stack, TheVault) if needed |
|
||||
| Qwen3 GGUF format incompatible with unsloth fine-tuning | Low | unsloth supports Qwen3; verify exact GGUF variant compatibility before starting |
|
||||
| Temperature asymmetry effect smaller on instruction-tuned variants | Low | Paper explicitly tests instruct variants and shows gains; Qwen3-4B-Instruct is in the paper's results |
|
||||
|
||||
---
|
||||
|
||||
## Acceptance Criteria Status
|
||||
|
||||
From the issue:
|
||||
|
||||
- [ ] **Temperature audit** — Document current T/top_p settings across fleet agents, compare with paper recommendations
|
||||
- [ ] **T_eval benchmark** — A/B test on 20+ solved Gitea issues; measure correctness
|
||||
- [ ] **SSD reproduction** — Replicate pipeline on Qwen4B with 100 prompts; measure pass@1 change
|
||||
- [ ] **Infrastructure assessment** — Documented above (Phase 2 section); GPU/RAM/storage requirements are Mac-feasible
|
||||
- [ ] **Continuous loop design** — Architecture drafted above (Phase 3 section); integrates with RetainDB (#112)
|
||||
|
||||
Infrastructure assessment and continuous loop design are addressed in this document. Temperature audit and SSD reproduction require follow-up issues with execution.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Follow-Up Issues
|
||||
|
||||
1. **Temperature Audit** — Audit all fleet agent temperature configs; run A/B on T_eval variants; file results (Phase 1)
|
||||
2. **SSD Pipeline Spike** — Build and run the 3-stage SSD pipeline on Qwen3-4B; report pass@1 delta (Phase 2)
|
||||
3. **Nightly SFT Integration** — Wire SSD into burn-mode cron; integrate with RetainDB feedback loop (Phase 3)
|
||||
|
||||
---
|
||||
|
||||
*Research acknowledged by Claude — April 6, 2026*
|
||||
*Source issue: [hermes-agent #128](https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/issues/128)*
|
||||
Reference in New Issue
Block a user