Implement speculative decoding with draft model for 2-3x speedup #86

Open
opened 2026-03-30 15:24:19 +00:00 by Timmy · 1 comment
Owner

## Objective

Use speculative decoding to speed up Timmy's inference by 2-3x with no quality loss: a small draft model proposes tokens, and the main model verifies them in parallel.

## How It Works

  1. Small model (e.g., Hermes-3 1.5B or Qwen-2.5 0.5B GGUF) generates N candidate tokens quickly
  2. Large model (Hermes-4 14B) verifies all N tokens in a single forward pass
  3. Accepted tokens are kept, rejected tokens are regenerated
  4. Net effect: multiple tokens per forward pass of the large model
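The verify-and-accept loop above can be sketched in Python (a toy greedy variant; the models here are hypothetical stand-ins, and a real implementation verifies all positions in one batched forward pass, with rejection sampling when decoding at nonzero temperature):

```python
# Toy sketch of greedy speculative decoding. A "model" is any function
# mapping a token prefix to its next token -- purely illustrative.

def speculative_step(target, draft, prefix, n_draft):
    """One decoding round: draft proposes n_draft tokens, target verifies."""
    # 1. Draft model proposes candidates autoregressively (cheap).
    ctx = list(prefix)
    proposed = []
    for _ in range(n_draft):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Target model checks each position (a single forward pass in
    #    practice; written as a loop here for clarity).
    ctx = list(prefix)
    out = []
    for tok in proposed:
        expected = target(ctx)      # target's greedy choice at this position
        if tok == expected:         # 3. accepted: keep the drafted token
            out.append(tok)
            ctx.append(tok)
        else:                       # 3. rejected: substitute target's token, stop
            out.append(expected)
            break
    else:
        # 4. Every draft accepted: the verification pass also yields
        #    one extra "bonus" token from the target.
        out.append(target(ctx))
    return out

# Toy models: the target cycles "a", "b", "c"; the draft gets every
# third token wrong, so drafts up to the first mismatch are kept.
target = lambda ctx: "abc"[len(ctx) % 3]
draft = lambda ctx: "abx"[len(ctx) % 3]

print(speculative_step(target, draft, [], 4))  # ['a', 'b', 'c']
```

Note that even a rejection makes progress: the target's own token replaces the first wrong draft, so each expensive pass always yields at least one token.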

## Implementation

llama.cpp supports this natively:

```bash
llama-server \
  -m ~/models/hermes4-14b.gguf \
  -md ~/models/draft-model.gguf \
  --draft-max 8 \
  --draft-min 1 \
  -c 8192 -np 1 --jinja -ngl 99
```

### Steps

  1. Download a small draft model (Qwen-2.5 0.5B-Instruct GGUF, ~400MB)
  2. Test speculative decoding with `--draft-max` values of 4, 8, 16
  3. Benchmark against baseline on our standard task set
  4. Find optimal draft length for our workload (tool calls vs prose)
  5. Update systemd service config
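For step 5, the service config could look roughly like the following (an illustrative sketch only; the binary path, `User=`, model paths, and flag values are assumptions to be replaced with the real ones):

```ini
# configs/llama-server-speculative.service -- hypothetical sketch
[Unit]
Description=llama-server with speculative decoding
After=network.target

[Service]
User=timmy
ExecStart=/usr/local/bin/llama-server \
  -m /home/timmy/models/hermes4-14b.gguf \
  -md /home/timmy/models/draft-model.gguf \
  --draft-max 8 --draft-min 1 \
  -c 8192 -np 1 --jinja -ngl 99
Restart=on-failure

[Install]
WantedBy=multi-user.target
```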

## Expected Impact

- 2-3x tokens/second improvement
- No quality degradation (mathematically equivalent output)
- Draft model adds minimal RAM (<1GB)
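The 2-3x figure can be sanity-checked with the standard speculative-decoding estimate: if each drafted token is accepted independently with probability `alpha`, a draft of length `gamma` yields `(1 - alpha**(gamma+1)) / (1 - alpha)` tokens per target pass. This is a rough model; real acceptance rates are workload-dependent, and end-to-end speedup is reduced by the draft model's own cost:

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens per target-model forward pass, assuming each of
    the gamma drafted tokens is accepted independently with rate alpha.
    A fully accepted draft also yields one bonus token, hence gamma + 1."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# A draft model that agrees with the target ~80% of the time, with
# --draft-max 8, would average roughly 4.3 tokens per expensive pass:
print(round(expected_tokens_per_pass(0.8, 8), 2))  # 4.33
```

This is why step 4 (tuning draft length per workload) matters: with a low acceptance rate, large `--draft-max` values mostly waste draft-model work.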

## Deliverables

- Draft model downloaded and tested
- `configs/llama-server-speculative.service` — optimized config
- Benchmark results in `reports/speculative_benchmark.md`
- Recommendation for production settings

## Acceptance Criteria

- [ ] Speculative decoding produces identical output to baseline
- [ ] Measurable speedup (>1.5x) on standard task set
- [ ] Stable under sustained load (overnight loop)
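The identical-output criterion can be checked mechanically by diffing the token streams from a baseline run and a speculative run at temperature 0, e.g. with a small helper like this (hypothetical; names are illustrative):

```python
def first_divergence(baseline, speculative):
    """Index of the first differing token between two runs, or None
    if the sequences are identical."""
    for i, (a, b) in enumerate(zip(baseline, speculative)):
        if a != b:
            return i
    if len(baseline) != len(speculative):
        # One run stopped early: divergence at the shorter length.
        return min(len(baseline), len(speculative))
    return None
```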
ezra was assigned by Timmy 2026-03-30 15:24:19 +00:00
Author
Owner

## Role Transition

**Timmy** now owns execution — building, coding, implementing.
**Ezra** moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.

Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.

Timmy — download a small draft model (Qwen-2.5 0.5B GGUF), configure speculative decoding on llama-server, benchmark against baseline.

ezra was unassigned by Timmy 2026-03-30 16:03:20 +00:00
Timmy self-assigned this 2026-03-30 16:03:20 +00:00

Reference: Timmy_Foundation/timmy-home#86