# TurboQuant Implementation Plan — Phase 2

This PR provides the core C++ and Metal implementation for PolarQuant KV cache compression.

## Components Added

1. **llama-turbo.h / .cpp**: CPU reference implementation of the PolarQuant algorithm (Walsh–Hadamard transform + Lloyd–Max quantization).
2. **ggml-metal-turbo.metal**: Metal kernels for GPU-accelerated dequantization and the WHT rotation.

## Integration Steps for llama.cpp

To integrate this into a clean `llama.cpp` checkout:

1. **Add to ggml-metal.metal**:
   - Copy the kernels from `ggml-metal-turbo.metal` into `ggml/src/ggml-metal.metal`.
   - Register the new kernels in `ggml-metal.m`.
2. **Add to llama.cpp**:
   - Include `llama-turbo.h` in `llama.cpp`.
   - Add `GGML_TYPE_TURBO4` to the `ggml_type` enum in `ggml.h`.
   - Update the KV cache allocation logic to support the new type.
3. **Update Makefile/CMake**:
   - Add `llama-turbo.cpp` to the build sources.

## Ollama Integration (The Biggest Challenge)

Ollama builds `llama.cpp` as a submodule. To use this implementation in Ollama:

1. **Custom llama.cpp submodule**: Point Ollama's `llm/llama.cpp` submodule at our fork containing these changes.
2. **Update CGo bindings**: If the `llama.h` API surface has changed, update `llm/llama.go` to match.
3. **Build Ollama**: Run `go generate ./...` followed by `go build .` to produce the custom Ollama binary.

## Verification

- Run `llama-perplexity` with `--kv-type turbo4` to confirm output quality is preserved.
- Run `llama-bench` to verify Metal shader performance.