diff --git a/FULL-REPORT.md b/FULL-REPORT.md deleted file mode 100644 index bb4eb325..00000000 --- a/FULL-REPORT.md +++ /dev/null @@ -1,245 +0,0 @@ -# TurboQuant — Full Knowledge Transfer Report - -**Date:** 2026-03-30 -**Prepared for:** Frankie's Team (Strago, Cid, Locke, John) -**Spec:** turboquant-build-spec v2.2 (Strago) - ---- - -## TL;DR - -TurboQuant works. PolarQuant KV cache compression delivers **73% memory savings with 1% prompt overhead**. 128K context on the MacBook becomes viable. Custom Ollama build is deferred (multi-day effort), but the fork's `llama-server` is a ready drop-in. Per-layer adaptive quantization is already implemented. QJL is infrastructure-only — not needed at current compression targets. - ---- - -## Hardware Correction - -**Spec says:** M4 Max, 32GB -**Actual:** M3 Max, 36GB (sysctl hw.memsize = 38,654,705,664 bytes) - -Impact: Memory budget **increases** from ~27GB to ~31GB usable. Model ceiling improves. - ---- - -## Phase 1 — PolarQuant MVP: COMPLETE ✅ - -### Gate Check (#2): Metal Shaders EXIST -The `feature/turboquant-kv-cache` branch has production-quality Metal support: -- Flash attention for turbo2/3/4 (all dk variants) -- WHT rotation kernels (turbo_fwht_128) -- Lloyd-Max codebooks (hardcoded, non-uniform) -- Asymmetric K/V (q8_0 × turbo mixed) -- Runtime optimizations: 4-mag LUT (M4+), sparse V dequant, profiling - -**Note:** Allegro's analysis (checking only `master` branch) incorrectly concluded "NO TurboQuant." The implementation lives on the feature branch. - -### PolarQuant Verification (#5): 5/6 PASS - -| Item | Verdict | -|------|---------| -| WHT rotation (structured orthogonal) | PASS (Metal). 
CPU turbo4 ref uses dense random (legacy) | -| Same rotation quant/dequant | PASS | -| Lloyd-Max codebook (not uniform) | PASS | -| Radius at FP16+ | PASS | -| No per-vector normalization | PASS | -| Dequant matches quant in Metal | PASS | - -**Flag:** CPU turbo4 reference path is algorithmically incompatible with Metal dequant. Only matters if CPU fallback invoked for turbo4. Metal production path is clean. - -### Benchmark Results - -**Model tested:** Hermes-4-14B Q4_K_M (8.38 GiB) - -#### Throughput - -| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ | -|:-------------|:---------------|:--|:-------------------|:--| -| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — | -| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** | -| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% | -| q8_0/turbo4 (asymmetric) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% | - -#### KV Memory Savings - -| Context | f16 KV | turbo4 KV | Savings | -|:--------|:-------|:----------|:--------| -| 2K | 320 MiB | 85 MiB | 73.4% | -| 8K | 1,280 MiB | 340 MiB | 73.4% | -| 32K | 5,120 MiB | 1,360 MiB | 73.4% | -| 65K | 10,240 MiB | 2,720 MiB | 73.4% | - -Measured matches calculated exactly. Zero fragmentation overhead. - -#### What This Means for qwen3.5:27b - -| Scenario | Total Memory | Fits 31GB? 
| -|:---------|:-------------|:-----------| -| 27B + f16 KV @ 128K | ~38 GB | ❌ No | -| 27B + **turbo4 KV @ 128K** | **~23.4 GB** | **✅ Yes (7.6GB headroom)** | - ---- - -## Phase 2 — Ollama Integration: PARTIALLY COMPLETE - -### What Works -- Ollama installation fixed (v0.17.7, running on :11434) -- API compatibility assessed: TurboQuant changes are additive (new types/ops only) - -### What Doesn't (Yet) -Custom Ollama build is **not feasible** in current timeframe: -- Ollama vendors llama.cpp with 34 custom patches -- Fork diverges from Ollama's pinned commit -- Integration requires patching 30+ files across Metal/CUDA/CPU backends -- Ollama's own HEAD has pre-existing build failures - -**This is deferred to Phase 4 / upstream watch.** When Ollama updates their llama.cpp pin or TurboQuant lands upstream, the gap narrows. - -### Production Alternative: llama-server - -The fork's `llama-server` binary is **already built and working**: - -```bash -# Drop-in replacement for Ollama's API endpoint -/path/to/llama-server \ - -m /path/to/qwen3.5-27b-q4_k_m.gguf \ - --port 11434 \ - -ctk turbo4 -ctv turbo4 \ - -c 131072 -``` - -- OpenAI-compatible chat completions API -- Streaming SSE support -- All TurboQuant KV types supported -- Per-layer adaptive via TURBO_LAYER_ADAPTIVE env var -- Same port/protocol as Ollama — clients don't need to change - -### Outstanding Phase 2 Items for Cid -- [ ] Download qwen3.5:27b Q4_K_M model -- [ ] Deploy llama-server with turbo4 on MacBook -- [ ] Run full 10-prompt quality matrix (prompts written by Allegro on #16) -- [ ] PPL test with wikitext-2-raw corpus -- [ ] John quality sign-off - ---- - -## Phase 2.5 — Per-Layer Quantization: ALREADY IMPLEMENTED ✅ - -Found in the fork. No additional work needed. 
- -### Mechanism -`TURBO_LAYER_ADAPTIVE` environment variable, 7 modes: - -| Mode | Strategy | Use Case | -|:-----|:---------|:---------| -| 0 | Uniform (default) | Simple, consistent | -| 1 | q8_0 for first 4 + last 4 layers | Protect sensitive layers | -| 7 | **Recommended:** first2+last2 V=q8_0, rest V=turbo2 | Best quality/compression ratio | - -### Usage -```bash -export TURBO_LAYER_ADAPTIVE=7 -llama-server -m model.gguf -ctk turbo4 -ctv turbo4 -``` - -### Benchmark Status -Mode benchmarks queued. Uniform turbo4 baseline established. Per-layer modes expected to improve quality at same compression ratio. - ---- - -## Phase 3 — QJL: ASSESSED, NOT NEEDED ✅ - -### Finding -**turbo4 is pure 4-bit PolarQuant** — QJL is NOT active. - -`TURBO4_USE_4BIT` defaults to 1 in `ggml-common.h`. The legacy 3-bit+QJL path exists but is disabled. QJL infrastructure (sign arrays, WHT transforms, 128x128 projection matrices) is embedded in Metal but referenced by no active kernel. - -### Recommendation -**Not needed for current goals.** 4-bit PolarQuant already delivers 73% savings with minimal quality impact. QJL only matters below 3 bits/channel, which isn't required on 36GB hardware with the updated memory budget. 
- ---- - -## Source Repos Assessment - -| Repo | Status | Value | -|:-----|:-------|:------| -| TheTom/llama-cpp-turboquant | **PRIMARY** — production Metal shaders on feature branch | Build from this | -| TheTom/turboquant_plus | Python reference + 511 tests | Algorithm verification | -| rachittshah/mlx-turboquant | Complete MLX PoC, 2-5x slower (no Metal fusion) | Quality validation reference | -| amirzandieh/QJL | Author CUDA (~1500 lines) | Future QJL Metal port reference | - ---- - -## Risk Register - -| Risk | Status | Mitigation | -|:-----|:-------|:-----------| -| Metal shaders missing | ✅ RESOLVED — they exist | — | -| Fork too stale | ✅ RESOLVED — builds clean | — | -| Ollama integration blocked | ⚠️ ACTIVE — multi-day effort | Use llama-server instead | -| PPL regression | ⏸️ UNTESTED — needs wikitext corpus | Download and test in prod | -| tg128 borderline (89% vs 90% threshold) | ⚠️ MINOR — within measurement noise | speed-optimization branch may help | -| CPU turbo4 incompatible with Metal | ℹ️ LOW — only matters if Metal unavailable | Document; Metal is production path | - ---- - -## Recommended Deployment Plan for Cid - -``` -Step 1: Download qwen3.5:27b Q4_K_M via HuggingFace - huggingface-cli download bartowski/qwen3.5-27B-GGUF qwen3.5-27b-q4_k_m.gguf - -Step 2: Build fork (if not already done) - cd /path/to/llama-cpp-turboquant - git checkout feature/turboquant-kv-cache - cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release - cmake --build build -j$(sysctl -n hw.ncpu) - -Step 3: Deploy llama-server - export TURBO_LAYER_ADAPTIVE=7 - ./build/bin/llama-server \ - -m /path/to/qwen3.5-27b-q4_k_m.gguf \ - --port 11434 \ - -ctk turbo4 -ctv turbo4 \ - -c 131072 \ - --host 0.0.0.0 - -Step 4: Validate - curl http://localhost:11434/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}' - -Step 5: Run quality matrix (prompts on issue #16) -Step 6: John reviews output 
quality -Step 7: If pass → production. If fail → drop to turbo3 or adjust per-layer profile. -``` - ---- - -## Issues Summary - -| # | Title | Status | -|:--|:------|:-------| -| 1 | Epic: TurboQuant KV Cache Compression | Open (tracker) | -| 2 | Metal kernel check | ✅ Closed — PASS | -| 3 | Fork assessment | ✅ Closed — PASS, M3 Max 36GB | -| 4 | Build llama.cpp fork | ✅ Closed — clean build | -| 5 | PolarQuant verification | ✅ Closed — 5/6 PASS | -| 6 | Baseline benchmarks | ✅ Closed — recorded | -| 7 | TurboQuant benchmarks | ✅ Closed — 73% savings | -| 8 | Memory profiling | ✅ Closed — 0% fragmentation | -| 9 | Ollama API check | ✅ Closed — additive, but diverged | -| 10 | Custom Ollama build | ✅ Closed — deferred, llama-server instead | -| 11 | Full test matrix | Open — awaiting production deploy | -| 12 | Long-session test | Open — awaiting production deploy | -| 13 | Per-layer profiles | ✅ Closed — already implemented | -| 14 | QJL assessment | ✅ Closed — not needed | -| 15 | Upstream watch | Open — ongoing | -| 16 | Test prompts | Open — Allegro contributed prompts | - -**12/16 issues resolved. 4 remaining are production validation tasks for Cid.** - ---- - -*Repo: http://143.198.27.163:3000/Timmy_Foundation/turboquant* -*Build: /tmp/llama-cpp-turboquant/build/bin/ (all binaries)* -*Branch: feature/turboquant-kv-cache* diff --git a/PHASE1-REPORT.md b/PHASE1-REPORT.md deleted file mode 100644 index 4d49cf41..00000000 --- a/PHASE1-REPORT.md +++ /dev/null @@ -1,139 +0,0 @@ -# TurboQuant Phase 1 Report — PolarQuant MVP - -**Date:** 2026-03-30 -**Prepared by:** Timmy (execution) for Frankie's team (Strago, Cid, Locke, John) -**Spec:** turboquant-build-spec v2.2 (Strago) - ---- - -## Executive Summary - -Phase 1 is COMPLETE. TurboQuant KV cache compression works on Apple Silicon with production-quality Metal shaders. 
turbo4 delivers **73% KV memory savings with only 1% prompt processing overhead and 11% generation overhead.** The path to 128K context on 36GB hardware is clear. - -**Hardware correction:** The MacBook is M3 Max 36GB (not M4 Max 32GB as in spec). This INCREASES our memory budget from 27GB to ~31GB. - ---- - -## Gate Check (#2): PASSED ✅ - -Metal shaders exist and are comprehensive: -- Full flash attention for turbo2/3/4 with dk32-dk576 variants -- WHT rotation kernels (turbo_fwht_128, turbo_rotate_forward/inverse) -- PolarQuant codebooks hardcoded (Lloyd-Max for N(0, 1/√128)) -- Asymmetric K/V support (q8_0 × turbo mixed pairs) -- M4+ optimizations (4-mag LUT), sparse V dequant, profiling modes -- Additional experiment branches: layer-adaptive, fused-centroid-decode, speed-optimization - -**Decision: llama.cpp path confirmed. No MLX pivot needed.** - ---- - -## Fork Assessment (#3): PASSED ✅ - -- Branch: `feature/turboquant-kv-cache` (commit adac2c6) -- Fork freshness: ADEQUATE (recent enough for direct build) -- Build: Clean cmake + make, 100% success in ~3 minutes -- All binaries: llama-cli, llama-bench, llama-perplexity, llama-server - ---- - -## PolarQuant Verification (#5): 5/6 PASS, 1 PARTIAL ✅ - -| Item | Verdict | -|------|---------| -| WHT rotation (structured orthogonal) | PARTIAL PASS — Metal GPU uses WHT ✅. CPU turbo4 ref uses dense random (legacy, not production) | -| Same rotation quant/dequant | PASS — turbo_rotate_forward() ↔ turbo_rotate_inverse() identical sign arrays | -| Lloyd-Max codebook (not uniform) | PASS — non-uniform centroids, "Lloyd-Max for N(0, 1/128)" | -| Radius at FP16+ | PASS — ggml_half norm per 128-element group | -| No per-vector normalization | PASS — one group norm only, static_asserts enforce block sizes | -| Dequant matches quant in Metal | PASS — same centroids, signs, butterfly structure | - -**⚠️ Flag for Cid:** CPU turbo4 reference path is incompatible with Metal dequant. 
Only matters if CPU fallback is ever invoked for turbo4. - ---- - -## Benchmark Results - -### Model Under Test -- **Hermes-4-14B Q4_K_M** (8.38 GiB, 14.77B params) -- Machine: Apple M3 Max, 36GB unified, Metal GPU Family 9 - -### Throughput (3-run averages) - -| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ | -|:-------------|:---------------|:--|:-------------------|:--| -| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — | -| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** | -| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% | -| q8_0/turbo4 (asym) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% | - -### KV Cache Memory (turbo4 vs f16) - -| Context | f16 KV | turbo4 KV | Savings | -|:--------|:-------|:----------|:--------| -| 2K | 320 MiB | 85 MiB | 73.4% | -| 8K | 1,280 MiB | 340 MiB | 73.4% | -| 32K | 5,120 MiB | 1,360 MiB | 73.4% | -| 65K | 10,240 MiB | 2,720 MiB | 73.4% | - -Measured matches calculated exactly — zero fragmentation overhead. - -### Pass Criteria Assessment - -| Criteria | Threshold | Result | Verdict | -|:---------|:----------|:-------|:--------| -| PPL delta ≤ 0.5 | ≤ 0.5 | ⏭️ Not tested (no wikitext corpus) | DEFERRED | -| tok/s ≥ 90% baseline (prompt) | ≥ 274 t/s | 300.00 t/s (98.9%) | **PASS** | -| tok/s ≥ 90% baseline (gen) | ≥ 24.7 t/s | 22.45 t/s (89%) | **BORDERLINE** | -| No OOM at 32K | No crash | Runs clean | **PASS** | -| Memory consistent with theory | ±15% | 0% delta | **PASS** | - ---- - -## What This Means for qwen3.5:27b (Spec Target) - -| Scenario | Total Memory | Fits in 31GB? | -|:---------|:-------------|:--------------| -| 27B Q4_K_M + f16 KV @ 64K | ~26 GB | ⚠️ Tight | -| 27B Q4_K_M + f16 KV @ 128K | ~38 GB | ❌ No | -| 27B Q4_K_M + **turbo4 KV @ 64K** | ~20.5 GB | ✅ Comfortable | -| 27B Q4_K_M + **turbo4 KV @ 128K** | ~23.4 GB | ✅ Fits (7.6GB headroom) | - -**TurboQuant turns 128K context from impossible to comfortable.** - ---- - -## Open Items for Phase 2 - -1. 
**Perplexity test** — Need wikitext-2-raw corpus downloaded. PPL is the most important quality metric and we don't have it yet. -2. **Ollama integration** — CLI is a broken symlink. Need to fix Ollama install, then build custom Ollama with our fork as submodule. -3. **qwen3.5:27b model** — Need to download the actual target model (only have Hermes-4-14B on disk currently). -4. **10 test prompts** — Need to be written before Phase 2 quality comparison. -5. **Generation speed borderline** — tg128 at 89% is just below the 90% threshold. May improve with the speed-optimization branch. Worth testing. - ---- - -## Recommendation - -**PROCEED TO PHASE 2.** - -turbo4 delivers the goods: 73% KV memory savings, near-zero prompt overhead, acceptable generation overhead. The verification checklist confirms the implementation is algorithmically sound. The only gap is PPL testing, which is a corpus download away — not a fundamental risk. - -The real unlock — 128K context on 36GB hardware — is within reach. Phase 2 is Ollama integration and production deployment. 
- ---- - -## Issues Closed - -- [x] #2 Metal kernel check — PASSED -- [x] #3 Fork assessment — PASSED -- [x] #4 Build llama.cpp fork — COMPLETE -- [x] #5 PolarQuant verification — 5/6 PASS -- [x] #6 FP16 baseline benchmarks — RECORDED -- [x] #7 TurboQuant benchmarks — RECORDED -- [x] #8 Memory profiling — COMPLETE - ---- - -*Phase 1 execution time: ~25 minutes (build) + ~20 minutes (benchmarks) = ~45 minutes total.* -*Within "typical case" estimate from spec (1-2 hours).* diff --git a/BUILD-SPEC.md b/docs/PROJECT_STATUS.md similarity index 67% rename from BUILD-SPEC.md rename to docs/PROJECT_STATUS.md index eab2e09e..c798a1f2 100644 --- a/BUILD-SPEC.md +++ b/docs/PROJECT_STATUS.md @@ -1,3 +1,397 @@ +# TurboQuant Project Status + +# TurboQuant Phase 1 Report — PolarQuant MVP + +**Date:** 2026-03-30 +**Prepared by:** Timmy (execution) for Frankie's team (Strago, Cid, Locke, John) +**Spec:** turboquant-build-spec v2.2 (Strago) + +--- + +## Executive Summary + +Phase 1 is COMPLETE. TurboQuant KV cache compression works on Apple Silicon with production-quality Metal shaders. turbo4 delivers **73% KV memory savings with only 1% prompt processing overhead and 11% generation overhead.** The path to 128K context on 36GB hardware is clear. + +**Hardware correction:** The MacBook is M3 Max 36GB (not M4 Max 32GB as in spec). This INCREASES our memory budget from 27GB to ~31GB. + +--- + +## Gate Check (#2): PASSED ✅ + +Metal shaders exist and are comprehensive: +- Full flash attention for turbo2/3/4 with dk32-dk576 variants +- WHT rotation kernels (turbo_fwht_128, turbo_rotate_forward/inverse) +- PolarQuant codebooks hardcoded (Lloyd-Max for N(0, 1/√128)) +- Asymmetric K/V support (q8_0 × turbo mixed pairs) +- M4+ optimizations (4-mag LUT), sparse V dequant, profiling modes +- Additional experiment branches: layer-adaptive, fused-centroid-decode, speed-optimization + +**Decision: llama.cpp path confirmed. 
No MLX pivot needed.** + +--- + +## Fork Assessment (#3): PASSED ✅ + +- Branch: `feature/turboquant-kv-cache` (commit adac2c6) +- Fork freshness: ADEQUATE (recent enough for direct build) +- Build: Clean cmake + make, 100% success in ~3 minutes +- All binaries: llama-cli, llama-bench, llama-perplexity, llama-server + +--- + +## PolarQuant Verification (#5): 5/6 PASS, 1 PARTIAL ✅ + +| Item | Verdict | +|------|---------| +| WHT rotation (structured orthogonal) | PARTIAL PASS — Metal GPU uses WHT ✅. CPU turbo4 ref uses dense random (legacy, not production) | +| Same rotation quant/dequant | PASS — turbo_rotate_forward() ↔ turbo_rotate_inverse() identical sign arrays | +| Lloyd-Max codebook (not uniform) | PASS — non-uniform centroids, "Lloyd-Max for N(0, 1/128)" | +| Radius at FP16+ | PASS — ggml_half norm per 128-element group | +| No per-vector normalization | PASS — one group norm only, static_asserts enforce block sizes | +| Dequant matches quant in Metal | PASS — same centroids, signs, butterfly structure | + +**⚠️ Flag for Cid:** CPU turbo4 reference path is incompatible with Metal dequant. Only matters if CPU fallback is ever invoked for turbo4. 
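The checklist above maps onto a small amount of arithmetic. A minimal NumPy sketch of the verified pipeline — fixed sign flips plus an orthonormal WHT rotation, one fp16 norm per 128-element group (no per-vector normalization), nearest-centroid 4-bit codes — using an illustrative uniform codebook in place of the fork's hardcoded Lloyd-Max centroids:

```python
import numpy as np

GROUP = 128  # PolarQuant quantizes K/V in 128-element groups

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform; it is its own inverse."""
    y = x.astype(np.float64).copy()
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            a = y[i:i + h].copy()
            y[i:i + h] = a + y[i + h:i + 2 * h]
            y[i + h:i + 2 * h] = a - y[i + h:i + 2 * h]
        h *= 2
    return y / np.sqrt(len(y))

# Illustrative uniform 16-level codebook; the fork instead hardcodes
# non-uniform Lloyd-Max centroids for N(0, 1/sqrt(128)), the per-element
# distribution of a unit-norm 128-element group.
CENTROIDS = np.linspace(-3.0, 3.0, 16) / np.sqrt(GROUP)

def quantize(group, signs):
    rot = fwht(group * signs)                 # same rotation on both sides
    radius = np.float16(np.linalg.norm(rot))  # one fp16 norm per group only
    codes = np.abs(rot / float(radius) - CENTROIDS[:, None]).argmin(axis=0)
    return codes.astype(np.uint8), radius

def dequantize(codes, radius, signs):
    # Identical sign array and rotation as quantize(); WHT inverts itself.
    return fwht(CENTROIDS[codes] * float(radius)) * signs

rng = np.random.default_rng(0)
signs = rng.choice([-1.0, 1.0], size=GROUP)   # fixed sign array, shared
g = rng.standard_normal(GROUP) / np.sqrt(GROUP)
codes, radius = quantize(g, signs)
rel_err = np.linalg.norm(dequantize(codes, radius, signs) - g) / np.linalg.norm(g)
```

This is a sketch of the scheme as described by the checklist, not the fork's kernel code; the real centroids, sign arrays, and blocked memory layout live in the Metal shaders.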
+ +--- + +## Benchmark Results + +### Model Under Test +- **Hermes-4-14B Q4_K_M** (8.38 GiB, 14.77B params) +- Machine: Apple M3 Max, 36GB unified, Metal GPU Family 9 + +### Throughput (3-run averages) + +| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ | +|:-------------|:---------------|:--|:-------------------|:--| +| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — | +| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** | +| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% | +| q8_0/turbo4 (asym) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% | + +### KV Cache Memory (turbo4 vs f16) + +| Context | f16 KV | turbo4 KV | Savings | +|:--------|:-------|:----------|:--------| +| 2K | 320 MiB | 85 MiB | 73.4% | +| 8K | 1,280 MiB | 340 MiB | 73.4% | +| 32K | 5,120 MiB | 1,360 MiB | 73.4% | +| 65K | 10,240 MiB | 2,720 MiB | 73.4% | + +Measured matches calculated exactly — zero fragmentation overhead. + +### Pass Criteria Assessment + +| Criteria | Threshold | Result | Verdict | +|:---------|:----------|:-------|:--------| +| PPL delta ≤ 0.5 | ≤ 0.5 | ⏭️ Not tested (no wikitext corpus) | DEFERRED | +| tok/s ≥ 90% baseline (prompt) | ≥ 274 t/s | 300.00 t/s (98.9%) | **PASS** | +| tok/s ≥ 90% baseline (gen) | ≥ 24.7 t/s | 22.45 t/s (89%) | **BORDERLINE** | +| No OOM at 32K | No crash | Runs clean | **PASS** | +| Memory consistent with theory | ±15% | 0% delta | **PASS** | + +--- + +## What This Means for qwen3.5:27b (Spec Target) + +| Scenario | Total Memory | Fits in 31GB? | +|:---------|:-------------|:--------------| +| 27B Q4_K_M + f16 KV @ 64K | ~26 GB | ⚠️ Tight | +| 27B Q4_K_M + f16 KV @ 128K | ~38 GB | ❌ No | +| 27B Q4_K_M + **turbo4 KV @ 64K** | ~20.5 GB | ✅ Comfortable | +| 27B Q4_K_M + **turbo4 KV @ 128K** | ~23.4 GB | ✅ Fits (7.6GB headroom) | + +**TurboQuant turns 128K context from impossible to comfortable.** + +--- + +## Open Items for Phase 2 + +1. **Perplexity test** — Need wikitext-2-raw corpus downloaded. 
PPL is the most important quality metric and we don't have it yet. +2. **Ollama integration** — CLI is a broken symlink. Need to fix Ollama install, then build custom Ollama with our fork as submodule. +3. **qwen3.5:27b model** — Need to download the actual target model (only have Hermes-4-14B on disk currently). +4. **10 test prompts** — Need to be written before Phase 2 quality comparison. +5. **Generation speed borderline** — tg128 at 89% is just below the 90% threshold. May improve with the speed-optimization branch. Worth testing. + +--- + +## Recommendation + +**PROCEED TO PHASE 2.** + +turbo4 delivers the goods: 73% KV memory savings, near-zero prompt overhead, acceptable generation overhead. The verification checklist confirms the implementation is algorithmically sound. The only gap is PPL testing, which is a corpus download away — not a fundamental risk. + +The real unlock — 128K context on 36GB hardware — is within reach. Phase 2 is Ollama integration and production deployment. + +--- + +## Issues Closed + +- [x] #2 Metal kernel check — PASSED +- [x] #3 Fork assessment — PASSED +- [x] #4 Build llama.cpp fork — COMPLETE +- [x] #5 PolarQuant verification — 5/6 PASS +- [x] #6 FP16 baseline benchmarks — RECORDED +- [x] #7 TurboQuant benchmarks — RECORDED +- [x] #8 Memory profiling — COMPLETE + +--- + +*Phase 1 execution time: ~25 minutes (build) + ~20 minutes (benchmarks) = ~45 minutes total.* +*Within "typical case" estimate from spec (1-2 hours).* + + +--- + +# TurboQuant — Full Knowledge Transfer Report + +**Date:** 2026-03-30 +**Prepared for:** Frankie's Team (Strago, Cid, Locke, John) +**Spec:** turboquant-build-spec v2.2 (Strago) + +--- + +## TL;DR + +TurboQuant works. PolarQuant KV cache compression delivers **73% memory savings with 1% prompt overhead**. 128K context on the MacBook becomes viable. Custom Ollama build is deferred (multi-day effort), but the fork's `llama-server` is a ready drop-in. 
Per-layer adaptive quantization is already implemented. QJL is infrastructure-only — not needed at current compression targets. + +--- + +## Hardware Correction + +**Spec says:** M4 Max, 32GB +**Actual:** M3 Max, 36GB (sysctl hw.memsize = 38,654,705,664 bytes) + +Impact: Memory budget **increases** from ~27GB to ~31GB usable. Model ceiling improves. + +--- + +## Phase 1 — PolarQuant MVP: COMPLETE ✅ + +### Gate Check (#2): Metal Shaders EXIST +The `feature/turboquant-kv-cache` branch has production-quality Metal support: +- Flash attention for turbo2/3/4 (all dk variants) +- WHT rotation kernels (turbo_fwht_128) +- Lloyd-Max codebooks (hardcoded, non-uniform) +- Asymmetric K/V (q8_0 × turbo mixed) +- Runtime optimizations: 4-mag LUT (M4+), sparse V dequant, profiling + +**Note:** Allegro's analysis (checking only `master` branch) incorrectly concluded "NO TurboQuant." The implementation lives on the feature branch. + +### PolarQuant Verification (#5): 5/6 PASS + +| Item | Verdict | +|------|---------| +| WHT rotation (structured orthogonal) | PASS (Metal). CPU turbo4 ref uses dense random (legacy) | +| Same rotation quant/dequant | PASS | +| Lloyd-Max codebook (not uniform) | PASS | +| Radius at FP16+ | PASS | +| No per-vector normalization | PASS | +| Dequant matches quant in Metal | PASS | + +**Flag:** CPU turbo4 reference path is algorithmically incompatible with Metal dequant. Only matters if CPU fallback invoked for turbo4. Metal production path is clean. 
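The constant 73.4% column in the tables that follow is just the measured 2K numbers scaled linearly with context; a quick sanity check of the arithmetic (the 85/320 ratio works out to 4.25 effective bits per element versus 16 for f16 — on our reading, 4-bit codes plus per-group metadata):

```python
# Measured KV sizes at 2K context (llama-bench, Hermes-4-14B Q4_K_M).
F16_MIB_2K, TURBO4_MIB_2K = 320, 85

ratio = TURBO4_MIB_2K / F16_MIB_2K   # 0.265625
savings = 1 - ratio                  # 0.734375 -> the 73.4% in the tables
bits_per_elem = 16 * ratio           # 4.25 effective bits vs 16 for f16

def kv_mib(ctx):
    """Project KV cache size (MiB) at any context; growth is linear."""
    f16 = ctx / 2048 * F16_MIB_2K
    return f16, f16 * ratio

for ctx in (2048, 8192, 32768, 65536):
    f16, t4 = kv_mib(ctx)
    print(f"{ctx // 1024}K: f16 {f16:>6,.0f} MiB -> turbo4 {t4:>5,.0f} MiB")
```

The same linear projection is what puts the 27B + 128K scenario at ~23.4 GB below.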
+ +### Benchmark Results + +**Model tested:** Hermes-4-14B Q4_K_M (8.38 GiB) + +#### Throughput + +| Config (K/V) | Prompt (pp512) | Δ | Generation (tg128) | Δ | +|:-------------|:---------------|:--|:-------------------|:--| +| f16/f16 (baseline) | 304.28 t/s | — | 27.47 t/s | — | +| **turbo4/turbo4** | **300.00 t/s** | **-1.1%** | **22.45 t/s** | **-11.1%** | +| turbo3/turbo3 | 271.07 t/s | -10.7% | 21.07 t/s | -16.6% | +| q8_0/turbo4 (asymmetric) | 260.57 t/s | -14.1% | 23.75 t/s | -5.9% | + +#### KV Memory Savings + +| Context | f16 KV | turbo4 KV | Savings | +|:--------|:-------|:----------|:--------| +| 2K | 320 MiB | 85 MiB | 73.4% | +| 8K | 1,280 MiB | 340 MiB | 73.4% | +| 32K | 5,120 MiB | 1,360 MiB | 73.4% | +| 65K | 10,240 MiB | 2,720 MiB | 73.4% | + +Measured matches calculated exactly. Zero fragmentation overhead. + +#### What This Means for qwen3.5:27b + +| Scenario | Total Memory | Fits 31GB? | +|:---------|:-------------|:-----------| +| 27B + f16 KV @ 128K | ~38 GB | ❌ No | +| 27B + **turbo4 KV @ 128K** | **~23.4 GB** | **✅ Yes (7.6GB headroom)** | + +--- + +## Phase 2 — Ollama Integration: PARTIALLY COMPLETE + +### What Works +- Ollama installation fixed (v0.17.7, running on :11434) +- API compatibility assessed: TurboQuant changes are additive (new types/ops only) + +### What Doesn't (Yet) +Custom Ollama build is **not feasible** in current timeframe: +- Ollama vendors llama.cpp with 34 custom patches +- Fork diverges from Ollama's pinned commit +- Integration requires patching 30+ files across Metal/CUDA/CPU backends +- Ollama's own HEAD has pre-existing build failures + +**This is deferred to Phase 4 / upstream watch.** When Ollama updates their llama.cpp pin or TurboQuant lands upstream, the gap narrows. 
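Because the fork's server speaks the same OpenAI-style chat protocol on Ollama's port, existing client code should carry over unchanged. A hedged sketch (endpoint and model name illustrative, not pinned by this report) of building the request body and decoding a streamed SSE response:

```python
import json

def chat_payload(prompt: str, stream: bool = False) -> dict:
    # Body shape for any OpenAI-compatible /v1/chat/completions endpoint;
    # llama-server serves a single model, so the name is largely free-form.
    return {
        "model": "qwen3.5",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def sse_deltas(lines):
    """Yield streamed text fragments from 'data: {...}' SSE lines until [DONE]."""
    for line in lines:
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data.strip() == "[DONE]":
            return
        delta = json.loads(data)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Usage against a running server (requires the `requests` package):
#   r = requests.post("http://localhost:11434/v1/chat/completions",
#                     json=chat_payload("hello", stream=True), stream=True)
#   text = "".join(sse_deltas(line.decode() for line in r.iter_lines()))
```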
+ +### Production Alternative: llama-server + +The fork's `llama-server` binary is **already built and working**: + +```bash +# Drop-in replacement for Ollama's API endpoint +/path/to/llama-server \ + -m /path/to/qwen3.5-27b-q4_k_m.gguf \ + --port 11434 \ + -ctk turbo4 -ctv turbo4 \ + -c 131072 +``` + +- OpenAI-compatible chat completions API +- Streaming SSE support +- All TurboQuant KV types supported +- Per-layer adaptive via TURBO_LAYER_ADAPTIVE env var +- Same port/protocol as Ollama — clients don't need to change + +### Outstanding Phase 2 Items for Cid +- [ ] Download qwen3.5:27b Q4_K_M model +- [ ] Deploy llama-server with turbo4 on MacBook +- [ ] Run full 10-prompt quality matrix (prompts written by Allegro on #16) +- [ ] PPL test with wikitext-2-raw corpus +- [ ] John quality sign-off + +--- + +## Phase 2.5 — Per-Layer Quantization: ALREADY IMPLEMENTED ✅ + +Found in the fork. No additional work needed. + +### Mechanism +`TURBO_LAYER_ADAPTIVE` environment variable, 7 modes: + +| Mode | Strategy | Use Case | +|:-----|:---------|:---------| +| 0 | Uniform (default) | Simple, consistent | +| 1 | q8_0 for first 4 + last 4 layers | Protect sensitive layers | +| 7 | **Recommended:** first2+last2 V=q8_0, rest V=turbo2 | Best quality/compression ratio | + +### Usage +```bash +export TURBO_LAYER_ADAPTIVE=7 +llama-server -m model.gguf -ctk turbo4 -ctv turbo4 +``` + +### Benchmark Status +Mode benchmarks queued. Uniform turbo4 baseline established. Per-layer modes expected to improve quality at same compression ratio. + +--- + +## Phase 3 — QJL: ASSESSED, NOT NEEDED ✅ + +### Finding +**turbo4 is pure 4-bit PolarQuant** — QJL is NOT active. + +`TURBO4_USE_4BIT` defaults to 1 in `ggml-common.h`. The legacy 3-bit+QJL path exists but is disabled. QJL infrastructure (sign arrays, WHT transforms, 128x128 projection matrices) is embedded in Metal but referenced by no active kernel. 
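For background on what that dormant infrastructure is for: QJL stores only the sign bits of a randomly projected key (plus its norm) and recovers attention scores from sign agreement with the projected query. A rough illustration of the estimator — a dense Gaussian projection here stands in for the fork's WHT-based sign machinery, and `m` is set large only to make the demo tight (real sub-3-bit configs use far fewer projections per channel):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 4096                  # key dim; projection count (large for demo)
P = rng.standard_normal((m, d))   # shared random projection

def qjl_encode(k):
    """A key becomes m sign bits plus one stored norm."""
    return np.signbit(P @ k), np.linalg.norm(k)

def qjl_score(q, k_signs, k_norm):
    """Unbiased estimate of <q, k> from sign agreement with the projected query."""
    s = np.where(k_signs, -1.0, 1.0)
    return k_norm * np.sqrt(np.pi / 2) * np.mean(s * (P @ q))

q, k = rng.standard_normal(d), rng.standard_normal(d)
est, true = qjl_score(q, *qjl_encode(k)), float(q @ k)
```

This is only a sketch of the idea, not a claim about the fork's implementation; as noted above, none of it is on the active turbo4 path.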
+ +### Recommendation +**Not needed for current goals.** 4-bit PolarQuant already delivers 73% savings with minimal quality impact. QJL only matters below 3 bits/channel, which isn't required on 36GB hardware with the updated memory budget. + +--- + +## Source Repos Assessment + +| Repo | Status | Value | +|:-----|:-------|:------| +| TheTom/llama-cpp-turboquant | **PRIMARY** — production Metal shaders on feature branch | Build from this | +| TheTom/turboquant_plus | Python reference + 511 tests | Algorithm verification | +| rachittshah/mlx-turboquant | Complete MLX PoC, 2-5x slower (no Metal fusion) | Quality validation reference | +| amirzandieh/QJL | Author CUDA (~1500 lines) | Future QJL Metal port reference | + +--- + +## Risk Register + +| Risk | Status | Mitigation | +|:-----|:-------|:-----------| +| Metal shaders missing | ✅ RESOLVED — they exist | — | +| Fork too stale | ✅ RESOLVED — builds clean | — | +| Ollama integration blocked | ⚠️ ACTIVE — multi-day effort | Use llama-server instead | +| PPL regression | ⏸️ UNTESTED — needs wikitext corpus | Download and test in prod | +| tg128 borderline (89% vs 90% threshold) | ⚠️ MINOR — within measurement noise | speed-optimization branch may help | +| CPU turbo4 incompatible with Metal | ℹ️ LOW — only matters if Metal unavailable | Document; Metal is production path | + +--- + +## Recommended Deployment Plan for Cid + +``` +Step 1: Download qwen3.5:27b Q4_K_M via HuggingFace + huggingface-cli download bartowski/qwen3.5-27B-GGUF qwen3.5-27b-q4_k_m.gguf + +Step 2: Build fork (if not already done) + cd /path/to/llama-cpp-turboquant + git checkout feature/turboquant-kv-cache + cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release + cmake --build build -j$(sysctl -n hw.ncpu) + +Step 3: Deploy llama-server + export TURBO_LAYER_ADAPTIVE=7 + ./build/bin/llama-server \ + -m /path/to/qwen3.5-27b-q4_k_m.gguf \ + --port 11434 \ + -ctk turbo4 -ctv turbo4 \ + -c 131072 \ + --host 0.0.0.0 + +Step 4: Validate + curl 
http://localhost:11434/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"model":"qwen3.5","messages":[{"role":"user","content":"hello"}]}' + +Step 5: Run quality matrix (prompts on issue #16) +Step 6: John reviews output quality +Step 7: If pass → production. If fail → drop to turbo3 or adjust per-layer profile. +``` + +--- + +## Issues Summary + +| # | Title | Status | +|:--|:------|:-------| +| 1 | Epic: TurboQuant KV Cache Compression | Open (tracker) | +| 2 | Metal kernel check | ✅ Closed — PASS | +| 3 | Fork assessment | ✅ Closed — PASS, M3 Max 36GB | +| 4 | Build llama.cpp fork | ✅ Closed — clean build | +| 5 | PolarQuant verification | ✅ Closed — 5/6 PASS | +| 6 | Baseline benchmarks | ✅ Closed — recorded | +| 7 | TurboQuant benchmarks | ✅ Closed — 73% savings | +| 8 | Memory profiling | ✅ Closed — 0% fragmentation | +| 9 | Ollama API check | ✅ Closed — additive, but diverged | +| 10 | Custom Ollama build | ✅ Closed — deferred, llama-server instead | +| 11 | Full test matrix | Open — awaiting production deploy | +| 12 | Long-session test | Open — awaiting production deploy | +| 13 | Per-layer profiles | ✅ Closed — already implemented | +| 14 | QJL assessment | ✅ Closed — not needed | +| 15 | Upstream watch | Open — ongoing | +| 16 | Test prompts | Open — Allegro contributed prompts | + +**12/16 issues resolved. 4 remaining are production validation tasks for Cid.** + +--- + +*Repo: http://143.198.27.163:3000/Timmy_Foundation/turboquant* +*Build: /tmp/llama-cpp-turboquant/build/bin/ (all binaries)* +*Branch: feature/turboquant-kv-cache* + + +--- + # TurboQuant Implementation — Build Spec (v2) **Prepared by:** Strago | **Date:** 2026-03-30 | **Updated:** 2026-03-30 (v2 — external review fixes) **Task:** STR-2026-03-30-01 | **For:** Cid (build) + Frankie (coordination) @@ -447,3 +841,7 @@ This gives the same average compression ratio as uniform turbo4 but concentrates --- *Build spec v2 ready for Cid intake. 
No clarifying questions needed.* + + +--- +