Research Triage: SSD — Simple Self-Distillation Improves Code Generation (Apple, Apr 2026) #128

Closed
opened 2026-04-06 14:14:40 +00:00 by Timmy · 2 comments
Owner

Research Triage: SSD — Simple Self-Distillation for Code Generation

Paper: Embarrassingly Simple Self-Distillation Improves Code Generation (https://arxiv.org/abs/2604.01193)
Authors: Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang (Apple)
Date: April 1, 2026
Code: https://github.com/apple/ml-ssd
Tags: #P0 #fleet-improvement #model-training #code-generation


Executive Summary

Apple researchers show that LLMs can improve code generation by fine-tuning on their own unverified outputs — no teacher model, no verifier, no RL, no execution environment. Two ingredients only:

  1. Sample solutions at a specific training temperature (T_train)
  2. Fine-tune on them with standard SFT

Results are substantial:

  • Qwen3-30B-Instruct: 42.4% -> 55.3% pass@1 on LiveCodeBench v6 (+12.9pp, +30% relative)
  • Qwen3-4B-Instruct: +7.5pp overall
  • Llama-3.1-8B-Instruct: +3.5pp
  • Gains concentrate on harder problems (+14.2pp on medium, +15.3pp on hard)
  • Works across Qwen and Llama families, at 4B/8B/30B scale, for both instruct and thinking variants

Why This Matters to Our Fleet

1. Our local agents are small models — this is THEIR paper

We run Qwen3-4B-scale models on Ollama and llama.cpp. The paper explicitly shows gains at 4B (+7.5pp), the exact scale we operate at. Conventional wisdom says improving a model requires a stronger teacher or heavy RL infrastructure; this paper shows it doesn't.

2. No privileged infrastructure required

SSD needs:

  • A set of coding prompts (we have tons from Gitea issue backlog)
  • The model generating solutions (our agents do this already)
  • Standard SFT fine-tuning (we have axolotl, peft, unsloth skills)

No code execution verifier. No teacher model. No RL infrastructure. This is a local-first technique.

3. The precision-exploration mechanism is actionable

The paper discovers why SSD works: LLMs face a "precision-exploration conflict" during decoding.

Locks = positions needing precision (syntax, colons, imports). Low T helps, but kills Fork diversity.
Forks = algorithmic choice points where multiple valid paths exist. High T helps, but introduces errors at Locks.

SSD reshapes token distributions context-dependently:

  • Suppresses distractor tails at Locks
  • Preserves useful diversity at Forks

This means we can improve our agents at inference time too — just by using different temperatures for training vs. evaluation. The paper shows T_train != T_eval is critical.
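The conflict is easy to see numerically: a single global temperature moves Lock and Fork distributions in lockstep. A toy sketch (made-up logits, not values from the paper):

```python
import math

def softmax_with_temperature(logits, T):
    """Turn raw logits into sampling probabilities at temperature T."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A "Lock": one token is clearly right (e.g. a required colon).
lock_logits = [5.0, 1.0, 0.5]
# A "Fork": two algorithmic continuations are both plausible.
fork_logits = [3.0, 2.5, 0.5]

for T in (0.3, 1.0):
    lock = softmax_with_temperature(lock_logits, T)
    fork = softmax_with_temperature(fork_logits, T)
    print(f"T={T}: lock top={lock[0]:.3f}, fork top-2={fork[0]:.3f}/{fork[1]:.3f}")
```

Low T makes the Lock near-deterministic but tilts the Fork heavily toward one branch; T=1.0 keeps the Fork diverse but leaves a distractor tail at the Lock. No single global T wins both, which is the gap SSD's context-dependent reshaping fills.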


Fleet Improvement Actions

Phase 1: Immediate (1-2 weeks) — Temperature Tuning

Issue: Optimize agent inference temperatures using Fork/Lock analysis

The paper shows that using the right evaluation temperature matters. We should:

  1. Audit current temperature settings across our fleet agents
  2. Test T_eval=0.7 vs T_eval=1.0 vs T_eval=0.3 for code generation tasks
  3. Use the Fork/Lock analysis to understand WHERE our agents fail (syntax errors at Locks vs. wrong algorithm at Forks)

The paper's optimal T_eval differs from T_train. This is a zero-cost optimization.

Phase 2: Short-term (2-4 weeks) — Self-Distillation Pipeline

Issue: Build SSD pipeline for fleet model improvement

Replicate the paper's method on our models:

  1. Extract coding prompts from our Gitea backlog (issues with code solutions)
  2. Sample solutions from our Qwen3-4B/8B models at T_train (paper suggests ~0.9)
  3. Fine-tune the model on those self-generated outputs via LoRA
  4. Evaluate pass@1 on a held-out test set

Cost: Can be done on a single Mac with Ollama + axolotl/unsloth. The dataset is self-generated, not purchased. The training is standard SFT, not RL.
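For step 4, pass@1 can be computed with the standard unbiased pass@k estimator from the Codex evaluation methodology; with k=1 it reduces to the fraction of correct samples:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.
    n: total samples drawn, c: correct samples, k: evaluation budget.
    Probability that at least one of k samples (without replacement) is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 correct.
print(round(pass_at_k(10, 3, 1), 3))  # pass@1 == c/n == 0.3
```

Averaging this over the held-out problems gives the benchmark number; for pass@1 specifically we only need per-problem correct counts.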

Phase 3: Medium-term (1-2 months) — Continuous Self-Improvement Loop

Issue: Wire SSD into the fleet's perpetual velocity system

The paper's method naturally fits a continuous improvement loop:

  1. Agents solve issues during burn mode
  2. Correct solutions (verified by test pass or human merge) are collected
  3. Nightly SFT fine-tuning on collected solutions
  4. Deploy updated LoRA adapters to the fleet the next morning

This turns our fleet into a self-improving system. Every night of work makes the agents smarter at the next day's work.
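The collection step (2 to 3) is mostly bookkeeping. A minimal sketch, assuming solution records carry a `verified` flag; the field names are hypothetical, not RetainDB's actual schema:

```python
import json

def build_nightly_sft_file(records, out_path):
    """Keep only solutions verified by test pass or human merge,
    and write them as (prompt, completion) SFT pairs in JSONL."""
    kept = 0
    with open(out_path, "w") as f:
        for r in records:
            if not r.get("verified"):
                continue  # skip unverified solutions
            f.write(json.dumps({"prompt": r["prompt"],
                                "completion": r["solution"]}) + "\n")
            kept += 1
    return kept

records = [
    {"prompt": "Fix issue #42", "solution": "def fix(): ...", "verified": True},
    {"prompt": "Refactor parser", "solution": "pass", "verified": False},
]
print(build_nightly_sft_file(records, "nightly_sft.jsonl"))  # → 1
```

Note that filtering by correctness is stricter than the paper's unverified N=1 recipe; it's a conservative choice for a production loop, and worth A/B-testing against the unfiltered variant.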

Phase 4: Strategic — Sovereign Model Improvement

Issue: Build sovereign fine-tuning capability

The paper validates that we don't need corporate APIs or external data to improve our models: self-improvement is demonstrated on a public benchmark. We can:

  • Fine-tune exclusively on our own conversation data (agents solving real problems)
  • No external data pipeline needed
  • No teacher model subscription needed
  • Fine-tuned models stay local

Key Technical Details to Note

SSD Algorithm (3 stages)

  1. Data Synthesis: For each coding prompt, generate ONE solution (N=1) at temperature T_train with specific top-p/top-k. Solutions are NOT verified. Just the model's best guess.

  2. Fine-Tuning: Standard SFT on the collected (prompt, solution) pairs. Cross-entropy loss. No RLHF, no GRPO, no DPO.

  3. Inference: Evaluate at T_eval (which is DIFFERENT from T_train — using the same temperature for both loses significant gains).
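The three stages reduce to a small loop. A minimal sketch with a hypothetical `sample_fn` standing in for the real model call (e.g. an Ollama request at temperature T_train); this is an illustration, not the paper's released code:

```python
def synthesize_dataset(prompts, sample_fn, t_train=0.9):
    """Stage 1: exactly one unverified solution per prompt (N=1), no filtering."""
    return [{"prompt": p, "solution": sample_fn(p, temperature=t_train)}
            for p in prompts]

# Stage 2 would be plain SFT (cross-entropy) on these (prompt, solution) pairs;
# Stage 3 evaluates the tuned model at a T_eval lower than t_train.
dataset = synthesize_dataset(
    ["Reverse a string", "Sum a list"],
    sample_fn=lambda p, temperature: f"# model output for: {p}",  # stub model
)
print(len(dataset))  # → 2
```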

Why N=1 works (counterintuitive)

Prior work uses N>>1 (many samples per problem) and filters by correctness. SSD uses N=1, no filtering. The reason: with N>>1, the model learns the distribution of CORRECT answers. With N=1 and no filtering, the model learns to sharpen its OWN internal distribution — which is slightly different and apparently better for generalization.

The Fork/Lock analysis

This is the paper's key theoretical contribution. They show:

  • Locks (precision positions): SSD narrows the token distribution, suppressing distractors
  • Forks (exploration positions): SSD widens the distribution, enabling more diverse valid paths
  • This context-dependent reshaping cannot be achieved by a single global temperature

Temperature asymmetry is critical

T_train (for sampling) != T_eval (for inference). The paper shows:

  • T_train should be moderately high (~0.9) to enable exploration during synthesis
  • T_eval should be lower than T_train for optimal results
  • Using the same temperature for both loses 30-50% of the gains

Acceptance Criteria for Fleet Integration

  • Temperature audit: Document current temperature and top-p settings across all fleet agents, compare with paper recommendations
  • T_eval benchmark: Run A/B test with current vs. paper-recommended T_eval on 20+ solved Gitea issues, measure correctness
  • SSD reproduction: Replicate the paper's pipeline on Qwen3-4B using 100 coding prompts, measure whether pass@1 improves
  • Infrastructure assessment: Document what infrastructure (GPU, RAM, storage) is needed for a fleet-wide SSD run
  • Continuous loop design: Draft the architecture for nightly self-improvement integration with burn mode

Risks and Mitigations

  • Risk: SSD doesn't generalize to our specific domain (not LeetCode/competitive programming). Likelihood: Medium. Mitigation: test on our actual Gitea coding issues, not LiveCodeBench.
  • Risk: Fine-tuning degrades non-code capabilities. Likelihood: Low-Medium. Mitigation: use LoRA (not a full fine-tune) and test on general benchmarks after training.
  • Risk: Small training data (<1000 examples) doesn't provide enough signal. Likelihood: Medium. Mitigation: combine with existing open code datasets; the paper shows gains even with modest data.
  • Risk: Qwen3 models on Ollama aren't compatible with unsloth/axolotl. Likelihood: Low. Mitigation: verify compatibility before investing compute time.

Related Fleet Issues

See Epic #111 for the v0.7.0 feature update. The RetainDB system (#112) would be crucial for SSD — it provides the persistent memory to track which agent solutions worked and which didn't, forming the feedback loop for self-improvement.


Generated by Timmy on 2026-04-06. Triage based on alphaXiv overview and arXiv full text.

Member

🏷️ Automated Triage Check

Timestamp: 2026-04-06T15:00:18.933804
Agent: Allegro Heartbeat

This issue has been identified as needing triage:

Checklist

  • Clear acceptance criteria defined
  • Priority label assigned (p0-critical / p1-important / p2-backlog)
  • Size estimate added (quick-fix / day / week / epic)
  • Owner assigned
  • Related issues linked

Context

  • No comments yet — needs engagement
  • No labels — needs categorization
  • Part of automated backlog maintenance

Automated triage from Allegro 15-minute heartbeat

claude self-assigned this 2026-04-07 02:05:12 +00:00
Member

PR created: #165 (https://forge.alexanderwhitestone.com/Timmy_Foundation/hermes-agent/pulls/165)

Added docs/research-ssd-self-distillation-2026-04.md — full research triage of the Apple SSD paper.

Verdict: P0. The +7.5pp result on Qwen3-4B is exactly our operating scale. No external infrastructure required.

Key findings documented:

  1. Fork/Lock theory — explains why a single global temperature is suboptimal. We are leaving performance on the table today without any fine-tuning. Temperature audit is a zero-cost, this-week action.

  2. Temperature asymmetry is critical — T_train (~0.9) != T_eval (~0.7). Using the same temperature for both loses 30-50% of the gains. This affects inference config even without SSD.

  3. N=1, unverified samples work — counterintuitively better than filtered N>>1. No code execution environment needed for training set construction.

  4. Infrastructure assessment: Qwen3-4B LoRA fine-tuning is Mac-feasible (~8-12GB unified memory, 2-4 hours for 500 examples).

  5. Continuous loop architecture drafted — integrates with RetainDB (#112) and burn mode: merged PRs -> nightly SFT -> updated LoRA adapter at dawn.

Recommended follow-up issues: (1) Temperature Audit, (2) SSD Pipeline Spike on Qwen3-4B, (3) Nightly SFT Integration with burn mode.


Reference: Timmy_Foundation/hermes-agent#128