diff --git a/PR-IMPLEMENTATION-PLAN.md b/PR-IMPLEMENTATION-PLAN.md
new file mode 100644
index 00000000..b695efb3
--- /dev/null
+++ b/PR-IMPLEMENTATION-PLAN.md
@@ -0,0 +1,38 @@
+
+# TurboQuant Implementation Plan — Phase 2
+
+This PR provides the core C++ and Metal implementation for PolarQuant KV cache compression.
+
+## Components Added
+1. **llama-turbo.h / .cpp**: CPU reference implementation of the PolarQuant algorithm (WHT + Lloyd-Max quantization; see the sketch after this diff).
+2. **ggml-metal-turbo.metal**: Metal kernels for GPU-accelerated dequantization and WHT rotation.
+
+## Integration Steps for llama.cpp
+To integrate this into a clean `llama.cpp` checkout:
+
+1. **Add to ggml-metal.metal**:
+   - Copy the kernels from `ggml-metal-turbo.metal` into `ggml/src/ggml-metal.metal`.
+   - Register the new kernels in `ggml-metal.m` (registration pattern sketched below).
+
+2. **Add to llama.cpp**:
+   - Include `llama-turbo.h` in `llama.cpp`.
+   - Add `GGML_TYPE_TURBO4` to the `ggml_type` enum in `ggml.h` (sketched below).
+   - Update the KV cache allocation logic to support the new type.
+
+3. **Update Makefile/CMake**:
+   - Add `llama-turbo.cpp` to the build sources.
+
+## Ollama Integration (The Biggest Challenge)
+Ollama builds `llama.cpp` as a submodule. To use this implementation in Ollama:
+
+1. **Custom llama.cpp Submodule**:
+   - Point Ollama's `llm/llama.cpp` submodule to our fork containing these changes.
+2. **Update CGo Bindings**:
+   - If the `llama.h` API surface changed, update `llm/llama.go` to match.
+3. **Build Ollama**:
+   - Run `go generate ./...` and then `go build .` to produce the custom Ollama binary.
+
+## Verification
+- Run `llama-perplexity` with `--kv-type turbo4` to verify quality (a standalone round-trip check is sketched below).
+- Run `llama-bench` to verify Metal shader performance.
\ No newline at end of file
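
## Appendix: Reference Sketches

The sketches below illustrate the integration points named in the plan. They are hedged sketches, not the PR's actual code: any function, enum value, or kernel name that does not appear in the diff above is a hypothetical placeholder.

**PolarQuant CPU reference (WHT + Lloyd-Max).** A minimal sketch of the two building blocks the plan assigns to `llama-turbo.cpp`: an in-place fast Walsh-Hadamard transform and nearest-level (Lloyd-Max style) 4-bit quantization. The helper names are illustrative.

```cpp
#include <cstdint>
#include <cstddef>
#include <cmath>

// In-place fast Walsh-Hadamard transform (unnormalized); n must be a power of two.
static void fwht_inplace(float * x, size_t n) {
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += h << 1) {
            for (size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
}

// Map each rotated value to the nearest of 16 codebook levels (4-bit codes).
// A real Lloyd-Max codebook would be trained on the rotated KV distribution;
// here the levels are simply passed in.
static void quantize_row_4bit(const float * src, uint8_t * codes,
                              const float levels[16], size_t n) {
    for (size_t i = 0; i < n; ++i) {
        uint8_t best   = 0;
        float   best_d = std::fabs(src[i] - levels[0]);
        for (uint8_t k = 1; k < 16; ++k) {
            const float d = std::fabs(src[i] - levels[k]);
            if (d < best_d) { best_d = d; best = k; }
        }
        codes[i] = best;
    }
}
```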
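**Metal kernel registration.** Registering kernels in `ggml-metal.m` follows the backend's existing macro pattern; the exact macro and enum names vary across `llama.cpp` versions, and the two TurboQuant entries shown are hypothetical names for this PR's kernels.

```cpp
// In the ggml_metal_kernel_type enum (ggml-metal.m), before the COUNT entry:
//     GGML_METAL_KERNEL_TYPE_DEQUANT_TURBO4,   // new: TURBO4 dequantization
//     GGML_METAL_KERNEL_TYPE_WHT_TURBO,        // new: WHT rotation
//
// Then, alongside the existing registrations in the backend init path
// (fragment; assumes the GGML_METAL_ADD_KERNEL pattern of recent versions):
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_DEQUANT_TURBO4, dequant_turbo4, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_WHT_TURBO,      wht_turbo,      true);
```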
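**New `ggml_type` entry.** A sketch of where `GGML_TYPE_TURBO4` would slot into `ggml.h`. New values must be appended without renumbering existing entries, and the matching type-traits table in the ggml source needs an entry (type name, block size, to-float/from-float hooks) before the type is usable for KV cache tensors.

```cpp
// ggml.h (sketch; existing entries elided, numeric position illustrative)
enum ggml_type {
    GGML_TYPE_F32 = 0,
    GGML_TYPE_F16 = 1,
    // ... existing quantized types ...
    GGML_TYPE_TURBO4,   // new: PolarQuant 4-bit KV cache type
    GGML_TYPE_COUNT,
};
```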
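**Round-trip sanity check.** Before reaching for `llama-perplexity`, a standalone round-trip (WHT, quantize, dequantize, inverse WHT) gives a quick reconstruction-error check. This compiles together with the first sketch above; the uniform codebook is a stand-in for a trained Lloyd-Max one.

```cpp
#include <cstdio>
#include <vector>
#include <cmath>

int main() {
    const size_t n = 64;                       // power of two, as the WHT requires
    std::vector<float> x(n), ref(n);
    for (size_t i = 0; i < n; ++i) ref[i] = x[i] = std::sin(0.1f * (float) i);

    // Uniform stand-in codebook over the rotated range.
    float levels[16];
    for (int k = 0; k < 16; ++k) levels[k] = -8.0f + (float) k;

    fwht_inplace(x.data(), n);                 // rotate

    std::vector<uint8_t> codes(n);
    quantize_row_4bit(x.data(), codes.data(), levels, n);
    for (size_t i = 0; i < n; ++i) x[i] = levels[codes[i]];   // dequantize

    fwht_inplace(x.data(), n);                 // unnormalized WHT is self-inverse...
    for (size_t i = 0; i < n; ++i) x[i] /= (float) n;         // ...up to a factor of n

    double mse = 0.0;
    for (size_t i = 0; i < n; ++i) mse += (x[i] - ref[i]) * (x[i] - ref[i]);
    std::printf("round-trip MSE: %g\n", mse / (double) n);
    return 0;
}
```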