[P2.5] Per-layer quantization profiles #13

Closed
opened 2026-03-30 17:11:17 +00:00 by Timmy · 1 comment

Parent: #1 | Depends on: Phase 2 stable

Not all layers are equally sensitive to KV cache quantization. Early (first 3) and late (last 3) layers tend to be more sensitive.

Profile

  • Sensitive layers (first 3 + last 3): K at Q8_0, V at turbo4 (or FP16)
  • Middle layers: K and V both at turbo4 (or turbo3)
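The proposed profile can be sketched as a simple layer-index to type mapping. This is a hypothetical illustration only: the enum and function names (`KvType`, `profile_for_layer`) are not the fork's API, and the sensitive-layer cutoff of 3 comes straight from the profile above.

```cpp
// Hypothetical sketch of the proposed profile; names are illustrative,
// not the fork's identifiers.
enum class KvType { F16, Q8_0, TURBO4 };

struct LayerKvTypes {
    KvType type_k;
    KvType type_v;
};

LayerKvTypes profile_for_layer(int il, int n_layers) {
    // Sensitive layers: first 3 + last 3.
    const bool sensitive = il < 3 || il >= n_layers - 3;
    if (sensitive) {
        return { KvType::Q8_0, KvType::TURBO4 };  // K at Q8_0, V at turbo4
    }
    return { KvType::TURBO4, KvType::TURBO4 };    // middle layers: both turbo4
}
```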

When to Pursue

Only after Phase 2 is stable and baseline quality confirmed. This is tuning, not architecture.

Pre-check (during Phase 1)

Does the fork expose per-layer KV type config? Note for later.

Acceptance Criteria

  • Per-layer config capability confirmed
  • Profile implemented and benchmarked
  • PPL improvement over uniform turbo4 measured
Timmy added this to the Phase 3+ — Optimization & QJL milestone 2026-03-30 17:11:17 +00:00
Timmy added the phase-2.5 and owner:cid labels 2026-03-30 17:11:17 +00:00

Per-Layer Quantization — ALREADY IMPLEMENTED

Found in the fork; no additional work needed.

Mechanism

Environment variable TURBO_LAYER_ADAPTIVE controls per-layer strategy.
Location: src/llama-kv-cache.cpp, lines 165-217

Available Modes

| Mode | Strategy |
|------|----------|
| 0 | Uniform (default) — all layers same type |
| 1 | q8_0 K+V for first 4 + last 4 layers |
| 2 | q8_0 K+V for last 8 layers |
| 5 | Boundary V: first2+last2 V=turbo4, rest V=turbo2 |
| 6 | V-only: last 8 V=turbo4, rest V=turbo2 |
| **7** | **Recommended: first2+last2 V=q8_0, rest V=turbo2** |
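As a rough illustration, the mode-7 branch reduces to a boundary check per layer. This is a sketch, not the fork's code: the actual logic lives in `src/llama-kv-cache.cpp` (lines 165-217) and may differ, and the names here (`VType`, `mode7_v_type`) are placeholders.

```cpp
// Illustrative sketch of mode 7 only; not the fork's implementation.
enum class VType { TURBO2, Q8_0 };

// Mode 7: first 2 + last 2 layers store V at q8_0; all others at turbo2.
VType mode7_v_type(int il, int n_layers) {
    const bool boundary = il < 2 || il >= n_layers - 2;
    return boundary ? VType::Q8_0 : VType::TURBO2;
}
```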

Usage

```bash
export TURBO_LAYER_ADAPTIVE=7
llama-server -m model.gguf -ctk turbo4 -ctv turbo4 --port 11434
```

Notes

  • Env-var based, no CLI flag
  • Already on feature/turboquant-kv-cache branch (no separate branch)
  • Each layer gets independent type_k and type_v at allocation
  • Benchmarks pending (modes 1, 5, 7 queued)
Timmy closed this issue 2026-03-30 21:04:06 +00:00