# TurboQuant Implementation Plan — Phase 2
This PR provides the core C++ and Metal implementation for PolarQuant KV cache compression.
## Components Added
1. **llama-turbo.h / .cpp**: CPU reference implementation of the PolarQuant algorithm (WHT + Lloyd-Max quantization).
2. **ggml-metal-turbo.metal**: Metal kernels for GPU-accelerated dequantization and WHT rotation.
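
To make the two building blocks named above concrete, here is a minimal C++ sketch of a fast Walsh-Hadamard transform (WHT) and the nearest-centroid assignment that a Lloyd-Max quantizer performs. It is an illustration under assumptions, not the code in `llama-turbo.cpp`: the function names `fwht` and `nearest_centroid` are hypothetical, and the codebook is passed in rather than trained.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so the
// transform is orthonormal and therefore its own inverse.
// Assumption: x.size() is a power of two.
inline void fwht(std::vector<float>& x) {
    const std::size_t n = x.size();
    for (std::size_t h = 1; h < n; h <<= 1) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;  // butterfly: sum
                x[j + h] = a - b;  // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& v : x) v *= scale;
}

// Index of the codebook entry closest to v. This is the assignment step of
// a Lloyd-Max scalar quantizer; the codebook here is a hypothetical input.
inline std::uint8_t nearest_centroid(float v, const std::vector<float>& codebook) {
    std::size_t best = 0;
    float best_dist = std::fabs(v - codebook[0]);
    for (std::size_t k = 1; k < codebook.size(); ++k) {
        const float d = std::fabs(v - codebook[k]);
        if (d < best_dist) {
            best_dist = d;
            best = k;
        }
    }
    return static_cast<std::uint8_t>(best);
}
```

Because the scaled transform is its own inverse, calling `fwht` twice on the same vector should return the original values up to floating-point error, which makes a convenient sanity check for a CPU reference path.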
## Integration Steps for llama.cpp
To integrate this into a clean `llama.cpp` checkout:
1. **Add to ggml-metal.metal**:
   - Copy the kernels from `ggml-metal-turbo.metal` into `ggml/src/ggml-metal.metal`.
   - Register the new kernels in `ggml-metal.m`.
2. **Add to llama.cpp**:
   - Include `llama-turbo.h` in `llama.cpp`.
   - Add `GGML_TYPE_TURBO4` to the `ggml_type` enum in `ggml.h`.
   - Update the KV cache allocation logic to support the new type.
3. **Update Makefile/CMake**:
   - Add `llama-turbo.cpp` to the build sources.
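
The KV cache allocation step in the list above comes down to knowing how many bytes one row of the new type occupies. The sketch below shows one way that could look, assuming `GGML_TYPE_TURBO4` stores a 4-bit codebook index per value in blocks of 32 with one 16-bit scale per block. That layout is a guess based on the type name (it has the same shape as ggml's existing `block_q4_0`), and `block_turbo4_sketch`, `QK_TURBO4`, `turbo4_row_size`, `pack_nibbles`, and `unpack_nibbles` are all hypothetical names, not identifiers from this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical block size: number of values quantized together.
constexpr std::size_t QK_TURBO4 = 32;

// Hypothetical on-disk/in-cache block layout (an assumption, not the PR's).
struct block_turbo4_sketch {
    std::uint16_t d;                  // per-block scale, stored as raw fp16 bits
    std::uint8_t  qs[QK_TURBO4 / 2];  // two 4-bit codebook indices per byte
};

// Bytes needed for n_elements values.
// Assumption: n_elements is a multiple of QK_TURBO4.
constexpr std::size_t turbo4_row_size(std::size_t n_elements) {
    return (n_elements / QK_TURBO4) * sizeof(block_turbo4_sketch);
}

// Pack n 4-bit indices (n even) into n/2 bytes: even index in the low nibble,
// odd index in the high nibble.
inline void pack_nibbles(const std::uint8_t* idx, std::uint8_t* qs, std::size_t n) {
    for (std::size_t i = 0; i < n / 2; ++i) {
        qs[i] = static_cast<std::uint8_t>((idx[2 * i] & 0x0F) |
                                          ((idx[2 * i + 1] & 0x0F) << 4));
    }
}

// Inverse of pack_nibbles.
inline void unpack_nibbles(const std::uint8_t* qs, std::uint8_t* idx, std::size_t n) {
    for (std::size_t i = 0; i < n / 2; ++i) {
        idx[2 * i]     = static_cast<std::uint8_t>(qs[i] & 0x0F);
        idx[2 * i + 1] = static_cast<std::uint8_t>(qs[i] >> 4);
    }
}
```

Under these assumptions a 128-element head dimension would take 4 blocks per row, so the allocation logic would size K and V buffers from `turbo4_row_size` rather than from `n_elements * sizeof(float)`.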
## Ollama Integration (The Biggest Challenge)
Ollama builds `llama.cpp` as a submodule. To use this implementation in Ollama:
1. **Custom llama.cpp Submodule**:
   - Point Ollama's `llm/llama.cpp` submodule to our fork containing these changes.
2. **Update CGo Bindings**:
   - If the `llama.h` API surface changed, update `llm/llama.go` to match.
3. **Build Ollama**:
   - Run `go generate ./...` and then `go build .` to produce the custom Ollama binary.
## Verification
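- Run `llama-perplexity` with `--cache-type-k turbo4 --cache-type-v turbo4` and compare perplexity against an `f16` KV cache baseline to verify quality.
- Run `llama-bench` on a Metal build, with and without the `turbo4` cache type, to verify Metal shader performance.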
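
For the quality check, the number to watch is the change in perplexity relative to an unquantized cache. Perplexity is conventionally the exponential of the mean negative log-likelihood per token, so two runs can be compared as in the sketch below. The helper names `perplexity` and `relative_ppl_increase` are hypothetical, and no acceptance threshold is implied.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Perplexity from per-token natural-log probabilities: exp(-mean(log p)).
// Assumption: token_logprobs is non-empty.
inline double perplexity(const std::vector<double>& token_logprobs) {
    double sum = 0.0;
    for (double lp : token_logprobs) sum += lp;
    return std::exp(-sum / static_cast<double>(token_logprobs.size()));
}

// Relative perplexity change of a quantized-cache run versus a baseline run.
// A value of 0.02 means the quantized run is 2% worse than the baseline.
inline double relative_ppl_increase(double ppl_baseline, double ppl_quantized) {
    return (ppl_quantized - ppl_baseline) / ppl_baseline;
}
```

Running both configurations on the same evaluation text and context length keeps the comparison meaningful, since perplexity depends on both.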