Implement speculative decoding with draft model for 2-3x speedup #86
Objective
Use speculative decoding to speed up Timmy's inference by 2-3x without any quality loss: a small draft model proposes tokens, and the main model verifies them in parallel.
How It Works
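The propose-and-verify loop can be sketched in Python. This is a toy greedy-decoding illustration, not llama.cpp's actual implementation; `target` and `draft` are hypothetical stand-ins for the two models (callables returning the next token):

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding sketch (illustration only).

    The draft model cheaply proposes up to `k` tokens; the target model
    verifies them and keeps the longest agreeing prefix. With greedy
    decoding the output is identical to decoding with the target alone,
    which is why speculation is lossless.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft model proposes k tokens cheaply.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target verifies: accept tokens while it agrees with the draft.
        for t in proposal:
            if len(out) - len(prompt) >= n_tokens:
                break
            if target(out) == t:
                out.append(t)
            else:
                break
        # On a mismatch (or after a full accept) emit one target token,
        # so the result always matches pure target-model greedy decoding.
        if len(out) - len(prompt) < n_tokens:
            out.append(target(out))
    return out[len(prompt):]
```

In the real implementation the verification of all `k` drafted tokens happens in a single batched forward pass of the main model, which is where the speedup comes from.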
Implementation
llama.cpp supports this natively:
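A possible launch command is sketched below. Flag names (`--model-draft`, `--draft-max`) assume a recent llama-server build; the model paths and values are placeholders, not a tested config:

```shell
# Serve the main model with a small draft model attached.
# --draft-max caps how many tokens the draft proposes per
# verification step (paths and values illustrative).
llama-server \
  -m models/main-model.gguf \
  --model-draft models/qwen2.5-0.5b-instruct-q8_0.gguf \
  --draft-max 8 \
  --port 8080
```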
Steps
- Try `--draft-max` values of 4, 8, 16

Expected Impact
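A rough way to reason about the expected impact: under the standard idealized analysis, if each drafted token is accepted independently with probability `alpha`, the expected number of tokens emitted per main-model forward pass with draft length `k` is a geometric series. This is a back-of-the-envelope model, not a measurement:

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target-model forward pass,
    assuming each drafted token is accepted independently with
    probability alpha (idealized; real acceptance is correlated).
    Equals 1 + alpha + alpha**2 + ... + alpha**k.
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

For example, an 80% acceptance rate with `k=8` gives roughly 4.3 tokens per main-model pass, which is where 2-3x wall-clock speedups (after draft-model overhead) come from.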
Deliverables
- configs/llama-server-speculative.service — optimized config
- reports/speculative_benchmark.md

Acceptance Criteria
Role Transition
Timmy now owns execution — building, coding, implementing.
Ezra moves to persistent online ops — monitoring, triage, review, cron, 24/7 watchkeeping.
Timmy: this is yours. Read the ticket, build it, PR it. Ezra reviews.
Timmy — download a small draft model (Qwen-2.5 0.5B GGUF), configure speculative decoding on llama-server, benchmark against baseline.