hermes-agent/docs/research-ssd-self-distillation-2026-04.md

Research Acknowledgment: SSD — Simple Self-Distillation Improves Code Generation

Issue: #128
Paper: Embarrassingly Simple Self-Distillation Improves Code Generation
Authors: Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang (Apple)
Date: April 1, 2026
Code: https://github.com/apple/ml-ssd
Acknowledged by: Claude — April 6, 2026


Assessment: High Relevance to Fleet

This paper is directly applicable to the hermes-agent fleet. The headline result — +7.5pp pass@1 on Qwen3-4B — is at exactly the scale we operate. The method requires no external infrastructure. Triage verdict: P0 / Week-class work.


What SSD Actually Does

Three steps, nothing exotic:

  1. Sample: For each coding prompt, generate one solution at temperature T_train (~0.9). Do NOT filter for correctness.
  2. Fine-tune: SFT on the resulting (prompt, unverified_solution) pairs. Standard cross-entropy loss. No RLHF, no GRPO, no DPO.
  3. Evaluate: At T_eval, which must differ from T_train. This asymmetry is not optional — using the same temperature for both loses 30–50% of the gains.

The counterintuitive part: N=1 per problem, unverified. Prior self-improvement work uses N>>1 and filters by execution. SSD doesn't. The paper argues this is why it works — you're sharpening the model's own distribution, not fitting to a correctness filter's selection bias.
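The three steps above fit in a few lines. This is an illustrative sketch, not the paper's code: `generate` stands in for whatever inference backend the fleet uses (e.g. an Ollama call), and the record fields are assumptions.

```python
T_TRAIN = 0.9  # sampling temperature for self-distillation data
T_EVAL = 0.7   # evaluation temperature; must differ from T_TRAIN

def build_ssd_dataset(prompts, generate):
    """SSD data construction: exactly one sample per prompt (N=1),
    kept as-is with no correctness filtering."""
    dataset = []
    for prompt in prompts:
        solution = generate(prompt, temperature=T_TRAIN)  # single sample
        dataset.append({"prompt": prompt, "completion": solution})
    return dataset

# Usage with a stub generator in place of a real backend:
stub = lambda p, temperature: f"# candidate solution for: {p}"
data = build_ssd_dataset(["reverse a list", "parse a date"], stub)
print(len(data))  # one (prompt, unverified_solution) pair per prompt: 2
```

Step 2 is then ordinary SFT on `data`; nothing in the pipeline ever executes or verifies the solutions.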


The Fork/Lock Theory

The paper's core theoretical contribution explains why temperature asymmetry matters.

Locks — positions requiring syntactic precision: colons, parentheses, import paths, variable names. A mistake here is a hard error. Low temperature helps at Locks. But applying low temperature globally kills diversity everywhere.

Forks — algorithmic choice points where multiple valid continuations exist: picking a sort algorithm, choosing a data structure, deciding on a loop structure. High temperature helps at Forks. But applying high temperature globally introduces errors at Locks.

SSD's fine-tuning reshapes token distributions context-dependently:

  • At Locks: narrows the distribution, suppressing distractor tokens
  • At Forks: widens the distribution, preserving valid algorithmic paths

A single global temperature cannot do this. SFT on self-generated data can, because the model learns from examples that implicitly encode which positions are Locks and which are Forks in each problem context.
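A toy calculation makes the "single global temperature cannot do this" point concrete: temperature scaling moves the entropy of every position in the same direction, so sharpening a Lock necessarily also collapses a Fork. The logit values below are illustrative, not from the paper.

```python
import math

def temp_scale(logits, T):
    """Softmax of logits / T -- standard temperature scaling."""
    z = [l / T for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

lock = [5.0, 1.0, 0.5]   # one syntactically required token dominates
fork = [2.0, 1.9, 1.8]   # several near-equally valid continuations

for T in (0.3, 0.7, 1.2):
    print(f"T={T}: H(lock)={entropy(temp_scale(lock, T)):.2f}  "
          f"H(fork)={entropy(temp_scale(fork, T)):.2f}")
# Lowering T shrinks entropy at BOTH positions: it helps the Lock but
# also prunes the Fork's valid alternatives. SFT on self-generated data
# can instead reshape each position's logits independently.
```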

Fleet implication: Our agents are currently using a single temperature for everything. This is leaving performance on the table even without fine-tuning. The immediate zero-cost action is temperature auditing (see Phase 1 below).


Results That Matter to Us

| Model | Before | After | Delta |
| --- | --- | --- | --- |
| Qwen3-30B-Instruct | 42.4% | 55.3% | +12.9pp (+30% rel) |
| Qwen3-4B-Instruct | baseline | baseline +7.5pp | +7.5pp |
| Llama-3.1-8B-Instruct | baseline | baseline +3.5pp | +3.5pp |

Gains concentrate on hard problems: +14.2pp medium, +15.3pp hard. This is the distribution our agents face on real Gitea issues — not easy textbook problems.


Fleet Implementation Plan

Phase 1: Temperature Audit (Zero cost, this week)

Current state: fleet agents use default or eyeballed temperature settings. The paper shows T_eval != T_train is critical even without fine-tuning.

Actions:

  1. Document current temperature settings in hermes/, skills/, and any Ollama config files
  2. Establish a held-out test set of 20+ solved Gitea issues with known-correct outputs
  3. Run A/B: current T_eval vs. T_eval=0.7 vs. T_eval=0.3 for code generation tasks
  4. Record pass rates per condition; file findings as a follow-up issue

Expected outcome: measurable improvement with no model changes, no infrastructure, no cost.
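The A/B run above can be sketched as a minimal harness. `run_agent` (generate a fix at a given temperature) and `passes` (check it against the known-correct output) are hypothetical hooks for fleet tooling, not existing functions.

```python
def ab_temperature_audit(issues, run_agent, passes, temps=(None, 0.7, 0.3)):
    """Pass rate per T_eval condition over the held-out issue set.
    `None` means the current/default temperature (the control arm)."""
    results = {}
    for t in temps:
        solved = sum(1 for issue in issues
                     if passes(issue, run_agent(issue, temperature=t)))
        results[t] = solved / len(issues)
    return results
```

Each condition's pass rate is what gets recorded and filed in the follow-up issue.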

Phase 2: SSD Pipeline (1–2 weeks, single Mac)

Replicate the paper's method on Qwen3-4B via Ollama + axolotl or unsloth:

1. Dataset construction:
   - Extract 100–500 coding prompts from Gitea issue backlog
   - Focus on issues that have accepted PRs (ground truth available for evaluation only, not training)
   - Format: (system_prompt + issue_description) → model generates solution at T_train=0.9

2. Fine-tuning:
   - Use LoRA (not full fine-tune) to stay local-first
   - Standard SFT: cross-entropy on (prompt, self-generated_solution) pairs
   - Recommended: unsloth for memory efficiency on Mac hardware
   - Training budget: 1–3 epochs, small batch size

3. Evaluation:
   - Compare base model vs. SSD-tuned model at T_eval=0.7
   - Metric: pass@1 on held-out issues not in training set
   - Also test on general coding benchmarks to check for capability regression
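Because SSD samples once per problem, pass@1 here reduces to the fraction of held-out issues whose single generated solution passes. A sketch, with `generate` and `check` as hypothetical harness hooks:

```python
def pass_at_1(problems, generate, check):
    """pass@1 with one sample per problem: fraction of held-out
    problems whose single generated solution passes its checks."""
    passed = sum(1 for p in problems if check(p, generate(p)))
    return passed / len(problems)

# Base vs. SSD-tuned at the same T_eval, reported in percentage points:
# delta_pp = 100 * (pass_at_1(held_out, tuned_gen, check)
#                   - pass_at_1(held_out, base_gen, check))
```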

Infrastructure assessment:

  • RAM: Qwen3-4B quantized (Q4_K_M) needs ~3.5 GB VRAM for inference; LoRA fine-tuning needs ~8–12 GB unified memory (Mac M-series feasible)
  • Storage: Self-generated dataset is small; LoRA adapter is ~100–500 MB
  • Time: 500 examples × 3 epochs ≈ 2–4 hours on M2/M3 Max
  • Dependencies: Ollama (inference), unsloth or axolotl (fine-tuning), datasets (HuggingFace), trl

No cloud required. No teacher model required. No code execution environment required.

Phase 3: Continuous Self-Improvement Loop (1–2 months)

Wire SSD into the fleet's burn mode:

Nightly cron:
  1. Collect agent solutions from the day's completed issues
  2. Filter: only solutions where the PR was merged (human-verified correct)
  3. Append to rolling training buffer (last 500 examples)
  4. Run SFT fine-tune on buffer → update LoRA adapter
  5. Swap adapter into Ollama deployment at dawn
  6. Agents start next day with yesterday's lessons baked in

This integrates naturally with RetainDB (#112) — the persistent memory system would track which solutions were merged, providing the feedback signal. The continuous loop turns every merged PR into a training example.
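Steps 2–3 of the nightly cron (keep only merged solutions, maintain a rolling last-500 buffer) can be sketched as follows; the buffer path and record fields (`pr_merged`, `issue`, `solution`) are assumptions, not an existing schema.

```python
import json
from collections import deque
from pathlib import Path

BUFFER_PATH = Path("ssd_buffer.jsonl")  # hypothetical location
BUFFER_SIZE = 500                       # rolling window from step 3

def update_buffer(days_solutions, buffer_path=BUFFER_PATH):
    """Nightly steps 2-3: filter to human-verified (merged) solutions,
    append to the buffer, and drop the oldest entries past BUFFER_SIZE."""
    buffer = deque(maxlen=BUFFER_SIZE)
    if buffer_path.exists():
        with buffer_path.open() as f:
            for line in f:
                buffer.append(json.loads(line))
    for s in days_solutions:
        if s.get("pr_merged"):                    # step 2: merged only
            buffer.append({"prompt": s["issue"],  # step 3: append
                           "completion": s["solution"]})
    with buffer_path.open("w") as f:              # rewrite rolling window
        for ex in buffer:
            f.write(json.dumps(ex) + "\n")
    return len(buffer)
```

Steps 4–6 (SFT on the buffer, adapter swap at dawn) would consume this file as their training set.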

Phase 4: Sovereignty Confirmation

The paper validates that external data is not required for improvement. Our fleet can:

  • Fine-tune exclusively on its own conversation data
  • Stay fully local (no API calls, no external datasets)
  • Accumulate improvements over time without model subscriptions

This is the sovereign fine-tuning capability the fleet needs to remain independent as external model APIs change pricing or capabilities.


Risks and Mitigations

| Risk | Assessment | Mitigation |
| --- | --- | --- |
| SSD gains don't transfer from LiveCodeBench to Gitea issues | Medium — our domain is software engineering, not competitive programming | Test on actual Gitea issues from the backlog; don't assume benchmark numbers transfer |
| Fine-tuning degrades non-code capabilities | Low–Medium | LoRA instead of full fine-tune; test on general tasks after SFT; retain base model checkpoint |
| Small training set (<200 examples) insufficient | Medium | Paper shows gains at modest scale; supplement with open code datasets (Stack, TheVault) if needed |
| Qwen3 GGUF format incompatible with unsloth fine-tuning | Low | unsloth supports Qwen3; verify exact GGUF variant compatibility before starting |
| Temperature asymmetry effect smaller on instruction-tuned variants | Low | Paper explicitly tests instruct variants and shows gains; Qwen3-4B-Instruct is in the paper's results |

Acceptance Criteria Status

From the issue:

  • Temperature audit — Document current T/top_p settings across fleet agents, compare with paper recommendations
  • T_eval benchmark — A/B test on 20+ solved Gitea issues; measure correctness
  • SSD reproduction — Replicate pipeline on Qwen3-4B with 100 prompts; measure pass@1 change
  • Infrastructure assessment — Documented above (Phase 2 section); GPU/RAM/storage requirements are Mac-feasible
  • Continuous loop design — Architecture drafted above (Phase 3 section); integrates with RetainDB (#112)

Infrastructure assessment and continuous loop design are addressed in this document. Temperature audit and SSD reproduction require follow-up issues with execution.


Follow-Up Issues

  1. Temperature Audit — Audit all fleet agent temperature configs; run A/B on T_eval variants; file results (Phase 1)
  2. SSD Pipeline Spike — Build and run the 3-stage SSD pipeline on Qwen3-4B; report pass@1 delta (Phase 2)
  3. Nightly SFT Integration — Wire SSD into burn-mode cron; integrate with RetainDB feedback loop (Phase 3)

Research acknowledged by Claude — April 6, 2026
Source issue: hermes-agent #128