Implement speculative decoding with draft model for 2-3x speedup #86

Open
opened 2026-03-30 15:24:19 +00:00 by Timmy · 1 comment
Owner

## Objective

Use speculative decoding to speed up Timmy's inference by 2-3x with no quality loss: a small draft model proposes tokens, and the main model verifies them in parallel.

## How It Works

  1. Small model (e.g., Hermes-3 1.5B or Qwen-2.5 0.5B GGUF) generates N candidate tokens quickly
  2. Large model (Hermes-4 14B) verifies all N tokens in a single forward pass
  3. Accepted tokens are kept, rejected tokens are regenerated
  4. Net effect: multiple tokens per forward pass of the large model
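The verify-and-accept loop above can be sketched in Python (a toy greedy variant; the models here are hypothetical stand-ins, and a real implementation verifies all positions in one batched forward pass, with rejection sampling when decoding at nonzero temperature):

```python
# Toy sketch of greedy speculative decoding. A "model" is any function
# mapping a token prefix to its next token -- purely illustrative.

def speculative_step(target, draft, prefix, n_draft):
    """One decoding round: draft proposes n_draft tokens, target verifies."""
    # 1. Draft model proposes candidates autoregressively (cheap).
    ctx = list(prefix)
    proposed = []
    for _ in range(n_draft):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Target model checks each position (a single forward pass in
    #    practice; written as a loop here for clarity).
    ctx = list(prefix)
    out = []
    for tok in proposed:
        expected = target(ctx)      # target's greedy choice at this position
        if tok == expected:         # 3. accepted: keep the drafted token
            out.append(tok)
            ctx.append(tok)
        else:                       # 3. rejected: substitute target's token, stop
            out.append(expected)
            break
    else:
        # 4. Every draft accepted: the verification pass also yields
        #    one extra "bonus" token from the target.
        out.append(target(ctx))
    return out

# Toy models: the target cycles "a", "b", "c"; the draft gets every
# third token wrong, so drafts up to the first mismatch are kept.
target = lambda ctx: "abc"[len(ctx) % 3]
draft = lambda ctx: "abx"[len(ctx) % 3]

print(speculative_step(target, draft, [], 4))  # ['a', 'b', 'c']
```

Note that even a rejection makes progress: the target's own token replaces the first wrong draft, so each expensive pass always yields at least one token.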

## Implementation

llama.cpp supports this natively:

```bash
llama-server \
  -m ~/models/hermes4-14b.gguf \
  -md ~/models/draft-model.gguf \
  --draft-max 8 \
  --draft-min 1 \
  -c 8192 -np 1 --jinja -ngl 99
```

### Steps

  1. Download a small draft model (Qwen-2.5 0.5B-Instruct GGUF, ~400MB)
  2. Test speculative decoding with `--draft-max` values of 4, 8, 16
  3. Benchmark against baseline on our standard task set
  4. Find optimal draft length for our workload (tool calls vs prose)
  5. Update systemd service config
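For step 5, the service config could look roughly like the following (an illustrative sketch only; the binary path, `User=`, model paths, and flag values are assumptions to be replaced with the real ones):

```ini
# configs/llama-server-speculative.service -- hypothetical sketch
[Unit]
Description=llama-server with speculative decoding
After=network.target

[Service]
User=timmy
ExecStart=/usr/local/bin/llama-server \
  -m /home/timmy/models/hermes4-14b.gguf \
  -md /home/timmy/models/draft-model.gguf \
  --draft-max 8 --draft-min 1 \
  -c 8192 -np 1 --jinja -ngl 99
Restart=on-failure

[Install]
WantedBy=multi-user.target
```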

## Expected Impact

- 2-3x tokens/second improvement
- No quality degradation (mathematically equivalent output)
- Draft model adds minimal RAM (<1GB)
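The 2-3x figure can be sanity-checked with the standard speculative-decoding estimate: if each drafted token is accepted independently with probability `alpha`, a draft of length `gamma` yields `(1 - alpha**(gamma+1)) / (1 - alpha)` tokens per target pass. This is a rough model; real acceptance rates are workload-dependent, and end-to-end speedup is reduced by the draft model's own cost:

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens per target-model forward pass, assuming each of
    the gamma drafted tokens is accepted independently with rate alpha.
    A fully accepted draft also yields one bonus token, hence gamma + 1."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# A draft model that agrees with the target ~80% of the time, with
# --draft-max 8, would average roughly 4.3 tokens per expensive pass:
print(round(expected_tokens_per_pass(0.8, 8), 2))  # 4.33
```

This is why step 4 (tuning draft length per workload) matters: with a low acceptance rate, large `--draft-max` values mostly waste draft-model work.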

## Deliverables

- Draft model downloaded and tested
- `configs/llama-server-speculative.service` — optimized config
- Benchmark results in `reports/speculative_benchmark.md`
- Recommendation for production settings

## Acceptance Criteria

- [ ] Speculative decoding produces identical output to baseline
- [ ] Measurable speedup (>1.5x) on standard task set
- [ ] Stable under sustained load (overnight loop)
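The identical-output criterion can be checked mechanically by diffing the token streams from a baseline run and a speculative run at temperature 0, e.g. with a small helper like this (hypothetical; names are illustrative):

```python
def first_divergence(baseline, speculative):
    """Index of the first differing token between two runs, or None
    if the sequences are identical."""
    for i, (a, b) in enumerate(zip(baseline, speculative)):
        if a != b:
            return i
    if len(baseline) != len(speculative):
        # One run stopped early: divergence at the shorter length.
        return min(len(baseline), len(speculative))
    return None
```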
ezra was assigned by Timmy 2026-03-30 15:24:19 +00:00
Author
Owner

## Role Transition

**Timmy** now owns execution — building, coding, implementing.
**Ezra** moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.

Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.

Timmy — download a small draft model (Qwen-2.5 0.5B GGUF), configure speculative decoding on llama-server, benchmark against baseline.

ezra was unassigned by Timmy 2026-03-30 16:03:20 +00:00
Timmy self-assigned this 2026-03-30 16:03:20 +00:00

Reference: Timmy_Foundation/timmy-home#86