[P2.5] Per-layer quantization profiles #13

Closed
opened 2026-03-30 17:11:17 +00:00 by Timmy · 1 comment

Parent: #1 | Depends on: Phase 2 stable

Not all layers are equally sensitive to KV cache quantization. Early (first 3) and late (last 3) layers tend to be more sensitive.

Profile

  • Sensitive layers (first 3 + last 3): K at Q8_0, V at turbo4 (or FP16)
  • Middle layers: K and V both at turbo4 (or turbo3)
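The proposed profile can be sketched as a simple layer-index to type mapping. This is a hypothetical illustration only: the enum and function names (`KvType`, `profile_for_layer`) are not the fork's API, and the sensitive-layer cutoff of 3 comes straight from the profile above.

```cpp
// Hypothetical sketch of the proposed profile; names are illustrative,
// not the fork's identifiers.
enum class KvType { F16, Q8_0, TURBO4 };

struct LayerKvTypes {
    KvType type_k;
    KvType type_v;
};

LayerKvTypes profile_for_layer(int il, int n_layers) {
    // Sensitive layers: first 3 + last 3.
    const bool sensitive = il < 3 || il >= n_layers - 3;
    if (sensitive) {
        return { KvType::Q8_0, KvType::TURBO4 };  // K at Q8_0, V at turbo4
    }
    return { KvType::TURBO4, KvType::TURBO4 };    // middle layers: both turbo4
}
```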

When to Pursue

Only after Phase 2 is stable and baseline quality confirmed. This is tuning, not architecture.

Pre-check (during Phase 1)

Does the fork expose per-layer KV type config? Note for later.

Acceptance Criteria

  • Per-layer config capability confirmed
  • Profile implemented and benchmarked
  • PPL improvement over uniform turbo4 measured
Timmy added this to the Phase 3+ — Optimization & QJL milestone 2026-03-30 17:11:17 +00:00
Timmy added the phase-2.5 and owner:cid labels 2026-03-30 17:11:17 +00:00

Per-Layer Quantization — ALREADY IMPLEMENTED

Found in the fork; no additional work needed.

Mechanism

Environment variable TURBO_LAYER_ADAPTIVE controls per-layer strategy.
Location: src/llama-kv-cache.cpp, lines 165-217

Available Modes

| Mode | Strategy |
|------|----------|
| 0 | Uniform (default) — all layers same type |
| 1 | q8_0 K+V for first 4 + last 4 layers |
| 2 | q8_0 K+V for last 8 layers |
| 5 | Boundary V: first2+last2 V=turbo4, rest V=turbo2 |
| 6 | V-only: last 8 V=turbo4, rest V=turbo2 |
| **7** | **Recommended: first2+last2 V=q8_0, rest V=turbo2** |
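As a rough illustration, the mode-7 branch reduces to a boundary check per layer. This is a sketch, not the fork's code: the actual logic lives in `src/llama-kv-cache.cpp` (lines 165-217) and may differ, and the names here (`VType`, `mode7_v_type`) are placeholders.

```cpp
// Illustrative sketch of mode 7 only; not the fork's implementation.
enum class VType { TURBO2, Q8_0 };

// Mode 7: first 2 + last 2 layers store V at q8_0; all others at turbo2.
VType mode7_v_type(int il, int n_layers) {
    const bool boundary = il < 2 || il >= n_layers - 2;
    return boundary ? VType::Q8_0 : VType::TURBO2;
}
```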

Usage

```bash
export TURBO_LAYER_ADAPTIVE=7
llama-server -m model.gguf -ctk turbo4 -ctv turbo4 --port 11434
```

Notes

  • Env-var based, no CLI flag
  • Already on feature/turboquant-kv-cache branch (no separate branch)
  • Each layer gets independent type_k and type_v at allocation
  • Benchmarks pending (modes 1, 5, 7 queued)
Timmy closed this issue 2026-03-30 21:04:06 +00:00