PR-IMPLEMENTATION-PLAN.md


# TurboQuant Implementation Plan — Phase 2

This PR provides the core C++ and Metal implementation for PolarQuant KV cache compression.

## Components Added
1. **llama-turbo.h / .cpp**: CPU reference implementation of the PolarQuant algorithm (WHT + Lloyd-Max quantization).
2. **ggml-metal-turbo.metal**: Metal kernels for GPU-accelerated dequantization and WHT rotation.
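
The WHT rotation named above can be illustrated with a minimal in-place fast Walsh-Hadamard transform. This is a sketch only: the function name `fwht_normalized` is hypothetical and not taken from `llama-turbo.cpp`, and the actual implementation may add a sign flip or permutation step, use a different scaling, or operate on fixed-size blocks.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not the function shipped in llama-turbo.cpp.
// In-place fast Walsh-Hadamard transform with orthonormal scaling (1/sqrt(n)).
// With this scaling the transform is its own inverse.
inline void fwht_normalized(std::vector<float> & v) {
    const std::size_t n = v.size();
    assert(n > 0 && (n & (n - 1)) == 0 && "length must be a power of two");

    // log2(n) butterfly passes; each pass combines elements that are h apart.
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = v[j];
                const float b = v[j + h];
                v[j]     = a + b;
                v[j + h] = a - b;
            }
        }
    }

    // Scale so that the transform preserves the vector's L2 norm.
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float & x : v) {
        x *= scale;
    }
}
```

Because the scaled transform is orthonormal and symmetric, applying it twice returns the original vector, so under these assumptions the dequantization path can reuse the same routine to undo the rotation.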

## Integration Steps for llama.cpp
To integrate this into a clean `llama.cpp` checkout:

1. **Add to ggml-metal.metal**:
   - Copy the kernels from `ggml-metal-turbo.metal` into `ggml/src/ggml-metal.metal`.
   - Register the new kernels in `ggml-metal.m`.

2. **Add to llama.cpp**:
   - Include `llama-turbo.h` in `llama.cpp`.
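   - Add `GGML_TYPE_TURBO4` to the `ggml_type` enum in `ggml.h` (before `GGML_TYPE_COUNT`), and register its block size and type size in the type-traits table in `ggml.c`.
   - Update the KV cache allocation logic to support the new type, so that buffers are sized from the new type's block layout.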
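
To make the allocation change in step 2 concrete, the following sizing sketch compares an f16 KV cache with a 4-bit one. The block layout is an assumption made for illustration (32 values at 4 bits each plus one 16-bit scale, the same footprint as ggml's `Q4_0`); the real `GGML_TYPE_TURBO4` layout may differ, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// ASSUMED layout, for illustration only: 32 values per block,
// 4 bits per value, plus one 16-bit scale per block.
constexpr std::size_t kBlockValues = 32;
constexpr std::size_t kBlockBytes  = kBlockValues / 2 + sizeof(std::uint16_t);
static_assert(kBlockBytes == 18, "16 bytes of packed nibbles + 2 bytes of scale");

// Bytes for the K and V caches together when stored as f16 (2 bytes per value).
constexpr std::size_t kv_bytes_f16(std::size_t n_layer, std::size_t n_ctx, std::size_t n_embd_kv) {
    return 2 * n_layer * n_ctx * n_embd_kv * 2;
}

// Bytes for the K and V caches together under the assumed 4-bit block layout.
// n_embd_kv is expected to be a multiple of kBlockValues.
constexpr std::size_t kv_bytes_turbo4(std::size_t n_layer, std::size_t n_ctx, std::size_t n_embd_kv) {
    return 2 * n_layer * n_ctx * (n_embd_kv / kBlockValues) * kBlockBytes;
}
```

For an illustrative model with 32 layers, a 4096-token context, and 1024 KV values per token per layer, this gives 512 MiB for f16 and 144 MiB for the assumed 4-bit layout, a reduction of roughly 3.6x.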

3. **Update Makefile/CMake**:
   - Add `llama-turbo.cpp` to the build sources.

## Ollama Integration (The Biggest Challenge)
Ollama builds `llama.cpp` as a submodule. To use this implementation in Ollama:

1. **Custom llama.cpp Submodule**:
   - Point Ollama's `llm/llama.cpp` submodule to our fork containing these changes.
2. **Update CGo Bindings**:
   - If the `llama.h` API surface changed, update `llm/llama.go` to match.
3. **Build Ollama**:
   - Run `go generate ./...` and then `go build .` to produce the custom Ollama binary.

## Verification
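- Run `llama-perplexity` with `--kv-type turbo4` and compare the result against an f16 KV cache baseline on the same model and dataset to verify quality. Note that upstream `llama.cpp` selects KV cache types with `--cache-type-k` and `--cache-type-v`, so `--kv-type` assumes the fork adds that option.
- Run `llama-bench` on a Metal-capable machine and compare throughput against the same f16 baseline to verify Metal shader performance.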
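
Before the end-to-end runs above, a unit-level round-trip check can catch gross quantization errors early. The sketch below quantizes values to the nearest entry of a 16-level codebook and reports the reconstruction RMSE. The codebook here is a placeholder with evenly spaced levels in [-1, 1]; real Lloyd-Max centroids are fitted to the data distribution and would differ, and all names are hypothetical rather than taken from `llama-turbo.h`.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// PLACEHOLDER codebook: 16 evenly spaced levels in [-1, 1].
// A real Lloyd-Max codebook would be fitted to the value distribution.
inline std::array<float, 16> make_placeholder_codebook() {
    std::array<float, 16> cb{};
    for (std::size_t i = 0; i < cb.size(); ++i) {
        cb[i] = -1.0f + 2.0f * static_cast<float>(i) / 15.0f;
    }
    return cb;
}

// Index (0..15) of the codebook entry closest to x.
inline std::uint8_t nearest_index(float x, const std::array<float, 16> & cb) {
    std::uint8_t best = 0;
    float best_dist = std::fabs(x - cb[0]);
    for (std::uint8_t i = 1; i < 16; ++i) {
        const float dist = std::fabs(x - cb[i]);
        if (dist < best_dist) {
            best_dist = dist;
            best = i;
        }
    }
    return best;
}

// Quantize each value to its nearest centroid, dequantize, and return the RMSE.
inline float roundtrip_rmse(const std::vector<float> & x, const std::array<float, 16> & cb) {
    assert(!x.empty());
    double sum_sq = 0.0;
    for (float v : x) {
        const double err = static_cast<double>(v) - static_cast<double>(cb[nearest_index(v, cb)]);
        sum_sq += err * err;
    }
    return static_cast<float>(std::sqrt(sum_sq / static_cast<double>(x.size())));
}
```

With the placeholder codebook, any input in [-1, 1] lands within half a step (1/15) of a level, which gives a simple upper bound to assert against. The same harness could be pointed at the real quantize and dequantize functions once their signatures are known.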
Add implementation plan 2026-03-30 21:06:51 +00:00