Research Triage: SSD — Simple Self-Distillation Improves Code Generation (Apple, Apr 2026) #128
Research Triage: SSD — Simple Self-Distillation for Code Generation
Paper: Embarrassingly Simple Self-Distillation Improves Code Generation
Authors: Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang (Apple)
Date: April 1, 2026
Code: https://github.com/apple/ml-ssd
Tags: #P0 #fleet-improvement #model-training #code-generation
Executive Summary
Apple researchers show that LLMs can improve code generation by fine-tuning on their own unverified outputs — no teacher model, no verifier, no RL, no execution environment. Two ingredients only: fine-tune on a single unverified self-generated solution per prompt (N=1), and sample that training data at a higher temperature than is used at evaluation (T_train != T_eval).
Results are substantial: +7.5pp on Qwen3-4B, with 30-50% of the gains lost if the temperature asymmetry is dropped.
Why This Matters to Our Fleet
1. Our local agents are small models — this is THEIR paper
We run Qwen3-4B-scale models on Ollama and llama.cpp. The paper explicitly shows gains at 4B (+7.5pp). This is the exact scale we operate at. Big labs assume you need GPUs to improve — this paper says no.
2. No privileged infrastructure required
SSD needs only a base model, a set of coding prompts, and standard SFT compute. No code execution verifier. No teacher model. No RL infrastructure. This is a local-first technique.
3. The precision-exploration mechanism is actionable
The paper discovers why SSD works: LLMs face a "precision-exploration conflict" during decoding.
Locks = positions needing precision (syntax, colons, imports). Low T helps, but kills Fork diversity.
Forks = algorithmic choice points where multiple valid paths exist. High T helps, but introduces errors at Locks.
SSD reshapes token distributions context-dependently: it sharpens the distribution at Locks while preserving entropy at Forks, resolving the conflict that no single temperature can.
This means we can improve our agents at inference time too — just by using different temperatures for training vs. evaluation. The paper shows T_train != T_eval is critical.
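The precision-exploration conflict can be seen numerically with a toy softmax. This is an illustration only — the logit values below are made up, not taken from the paper — but it shows why one global temperature cannot serve both position types: raising T leaks probability onto wrong tokens at a Lock, while lowering T starves the minority branch at a Fork.

```python
import math

def softmax(logits, T):
    """Temperature-scaled softmax: lower T sharpens, higher T flattens."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# A "Lock": one clearly correct next token (e.g. a required colon).
lock_logits = [5.0, 0.0, 0.0]
# A "Fork": two comparably valid algorithmic branches.
fork_logits = [1.0, 0.2]

def lock_error(T):
    return 1.0 - softmax(lock_logits, T)[0]   # probability mass on wrong tokens

def fork_minor_branch(T):
    return softmax(fork_logits, T)[1]         # mass kept on the 2nd valid path

# Raising T hurts Locks; lowering T starves Forks. No single T wins both.
assert lock_error(1.2) > lock_error(0.5)
assert fork_minor_branch(0.5) < fork_minor_branch(1.2)
```

The same trade-off is why the paper's asymmetric T_train/T_eval choice (rather than one global setting) recovers performance.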
Fleet Improvement Actions
Phase 1: Immediate (1-2 weeks) — Temperature Tuning
Issue: Optimize agent inference temperatures using Fork/Lock analysis
The paper shows that using the right evaluation temperature matters. We should audit each agent's current sampling settings, sweep T_eval on a held-out coding task set, and adopt the best-performing value per agent.
The paper's optimal T_eval differs from T_train. This is a zero-cost optimization.
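The audit can be a simple grid sweep. A minimal sketch, assuming a benchmark harness we would plug in ourselves — `eval_pass_rate` here is a hypothetical stand-in for running an agent's task suite at a given temperature, stubbed with a placeholder curve so the sweep itself is runnable:

```python
def eval_pass_rate(temperature: float) -> float:
    # Hypothetical stand-in for our real benchmark harness (e.g. running a
    # local model via Ollama over a held-out task set). Stubbed here with a
    # placeholder curve peaking near T=0.7, the paper's reported T_eval region.
    return 1.0 - abs(temperature - 0.7)

def sweep_eval_temperature(candidates):
    """Score each candidate T_eval and return the best one plus all scores."""
    scores = {t: eval_pass_rate(t) for t in candidates}
    best = max(scores, key=scores.get)
    return best, scores

best_t, scores = sweep_eval_temperature([0.3, 0.5, 0.7, 0.9, 1.1])
assert best_t == 0.7
```

Swapping the stub for a real pass-rate measurement is the only change needed; the sweep loop and per-agent selection stay the same.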
Phase 2: Short-term (2-4 weeks) — Self-Distillation Pipeline
Issue: Build SSD pipeline for fleet model improvement
Replicate the paper's method on our models: generate one unverified solution per prompt at T_train, run standard SFT (LoRA) on the raw pairs, and evaluate at a lower T_eval.
Cost: Can be done on a single Mac with Ollama + axolotl/unsloth. The dataset is self-generated, not purchased. The training is standard SFT, not RL.
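Dataset construction for this pipeline is deliberately trivial: one unverified sample per prompt, written out as SFT records. A sketch, assuming an instruction/output JSONL format of the kind axolotl and unsloth commonly accept (the exact schema would need checking against our trainer config), with `generate_solution` as a hypothetical stand-in for a local model call:

```python
import json

def generate_solution(prompt: str, temperature: float = 0.9) -> str:
    # Hypothetical stand-in for one N=1, unverified sample from the local
    # model (e.g. via Ollama's API); stubbed so the sketch runs end to end.
    return f"# solution for: {prompt}\npass"

def build_ssd_dataset(prompts, t_train=0.9):
    """Emit raw (prompt, solution) pairs as JSONL records for standard SFT.

    No verification, no filtering: the record is whatever the model sampled.
    """
    lines = []
    for p in prompts:
        rec = {"instruction": p, "output": generate_solution(p, t_train)}
        lines.append(json.dumps(rec))
    return "\n".join(lines)

jsonl = build_ssd_dataset(["reverse a string", "fizzbuzz"])
assert len(jsonl.splitlines()) == 2
```

Because there is no correctness filter, dataset size equals prompt count — 500 prompts yields exactly 500 training examples, which matches the Mac-scale cost estimate above.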
Phase 3: Medium-term (1-2 months) — Continuous Self-Improvement Loop
Issue: Wire SSD into the fleet's perpetual velocity system
The paper's method naturally fits a continuous improvement loop: merged PRs supply prompts, a nightly SFT run distills the day's solutions, and an updated LoRA adapter ships at dawn.
This turns our fleet into a self-improving system. Every night of work makes the agents smarter at the next day's work.
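The loop's orchestration skeleton is small. A sketch under stated assumptions: all three functions are hypothetical stand-ins (PR collection would hit our Git forge, `run_sft` would invoke the LoRA trainer, `deploy_adapter` would swap the adapter into Ollama), stubbed so the cycle shape is runnable:

```python
def collect_merged_prs():
    # Hypothetical: fetch last night's merged PRs as (prompt, solution) pairs.
    return [("prompt: fix lint warnings in parser", "diff: ...")]

def run_sft(pairs):
    # Hypothetical: nightly LoRA SFT over the collected pairs.
    return {"adapter": "lora-nightly", "examples": len(pairs)}

def deploy_adapter(adapter):
    # Hypothetical: hot-swap the fresh adapter into the serving stack at dawn.
    return f"deployed {adapter['adapter']} ({adapter['examples']} examples)"

def nightly_ssd_cycle():
    """One turn of the loop: merged PRs -> nightly SFT -> adapter at dawn."""
    pairs = collect_merged_prs()
    adapter = run_sft(pairs)
    return deploy_adapter(adapter)

status = nightly_ssd_cycle()
assert status.startswith("deployed")
```

The design choice worth noting: because SSD needs no verifier, the loop has no blocking evaluation stage — RetainDB (#112) can record outcomes asynchronously rather than gating each cycle.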
Phase 4: Strategic — Sovereign Model Improvement
Issue: Build sovereign fine-tuning capability
The paper validates that we don't need corporate APIs or external data to improve our models. This is sovereignty proven on the benchmark. We can generate training data from our own agents' outputs, fine-tune adapters entirely on local hardware, and keep the whole improvement loop off external services.
Key Technical Details to Note
SSD Algorithm (3 stages)
Data Synthesis: For each coding prompt, generate ONE solution (N=1) at temperature T_train with specific top-p/top-k. Solutions are NOT verified. Just the model's best guess.
Fine-Tuning: Standard SFT on the collected (prompt, solution) pairs. Cross-entropy loss. No RLHF, no GRPO, no DPO.
Inference: Evaluate at T_eval (which is DIFFERENT from T_train — using the same temperature for both loses significant gains).
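The three stages can be sketched end to end. `sample_model` and `sft_update` below are hypothetical stand-ins for a local model call and a LoRA training step, stubbed so the control flow runs; the temperature constants follow the paper's reported values:

```python
T_TRAIN, T_EVAL = 0.9, 0.7  # asymmetric temperatures, per the paper's numbers

def sample_model(prompt, temperature):
    # Hypothetical local-model call (e.g. Ollama); stubbed for the sketch.
    return f"solution({prompt!r}, T={temperature})"

def sft_update(pairs):
    # Hypothetical standard cross-entropy SFT step; returns a model tag.
    return f"model+sft({len(pairs)} pairs)"

def ssd(prompts):
    # Stage 1 (data synthesis): one unverified sample per prompt at T_train.
    pairs = [(p, sample_model(p, T_TRAIN)) for p in prompts]
    # Stage 2 (fine-tuning): plain SFT on the raw pairs. No filter, no RL.
    model = sft_update(pairs)
    # Stage 3 (inference): evaluate the tuned model at T_eval != T_train.
    return model, [sample_model(p, T_EVAL) for p in prompts]

model, outputs = ssd(["two-sum", "binary search"])
```

Every stage is a commodity operation; the only non-default decision is keeping the two temperature constants distinct.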
Why N=1 works (counterintuitive)
Prior work uses N>>1 (many samples per problem) and filters by correctness. SSD uses N=1, no filtering. The reason: with N>>1, the model learns the distribution of CORRECT answers. With N=1 and no filtering, the model learns to sharpen its OWN internal distribution — which is slightly different and apparently better for generalization.
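The self-sharpening claim can be illustrated with a toy categorical distribution (values made up, not from the paper). Maximum-likelihood SFT on unfiltered N=1 draws fits the sampling distribution itself, which at T_train < 1 is a mildly sharpened copy of the model's own distribution — no notion of correctness enters anywhere:

```python
import math
import random

random.seed(0)

def softmax(logits, T=1.0):
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

logits = [2.0, 1.0, 0.0]
base = softmax(logits, T=1.0)      # the model's own next-token distribution
sampled = softmax(logits, T=0.9)   # what unfiltered N=1 draws at T_train follow

# SFT on raw self-samples is maximum likelihood on draws from `sampled`:
# the empirical distribution converges to it, a sharpened copy of `base`.
draws = random.choices(range(3), weights=sampled, k=10000)
empirical = [draws.count(i) / 10000 for i in range(3)]

assert entropy(sampled) < entropy(base)
assert abs(empirical[0] - sampled[0]) < 0.05
```

Contrast with N>>1 plus filtering, where the fitted target is the distribution over correct answers instead of the model's own sharpened one.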
The Fork/Lock analysis
This is the paper's key theoretical contribution. They prove that under any single global temperature, Lock precision and Fork diversity trade off directly, so no one temperature setting can be optimal across both position types.
Temperature asymmetry is critical
T_train (for sampling) != T_eval (for inference). The paper shows T_train around 0.9 and T_eval around 0.7, and that collapsing them to a single value loses 30-50% of the gains.
Acceptance Criteria for Fleet Integration
Risks and Mitigations
Related Fleet Issues
See Epic #111 for the v0.7.0 feature update. The RetainDB system (#112) would be crucial for SSD — it provides the persistent memory to track which agent solutions worked and which didn't, forming the feedback loop for self-improvement.
Generated by Timmy on 2026-04-06. Triage based on alphaXiv overview and arXiv full text.
🏷️ Automated Triage Check
Timestamp: 2026-04-06T15:00:18.933804
Agent: Allegro Heartbeat
This issue has been identified as needing triage:
Checklist
Context
Automated triage from Allegro 15-minute heartbeat
PR created: #165
Added
docs/research-ssd-self-distillation-2026-04.md — full research triage of the Apple SSD paper. Verdict: P0. The +7.5pp result on Qwen3-4B is exactly our operating scale. No external infrastructure required.
Key findings documented:
Fork/Lock theory — explains why a single global temperature is suboptimal. We are leaving performance on the table today, even without any fine-tuning. The temperature audit is a zero-cost action for this week.
Temperature asymmetry is critical — T_train (~0.9) != T_eval (~0.7). Using the same temperature for both loses 30-50% of the gains. This affects inference config even without SSD.
N=1, unverified samples work — counterintuitively better than filtered N>>1. No code execution environment needed for training set construction.
Infrastructure assessment: Qwen3-4B LoRA fine-tuning is Mac-feasible (~8-12GB unified memory, 2-4 hours for 500 examples).
Continuous loop architecture drafted — integrates with RetainDB (#112) and burn mode: merged PRs -> nightly SFT -> updated LoRA adapter at dawn.
Recommended follow-up issues: (1) Temperature Audit, (2) SSD Pipeline Spike on Qwen3-4B, (3) Nightly SFT Integration with burn mode.