# Hermes Profiles for TurboQuant

This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.

## Available Profiles

### gemma4-turboquant.yaml

**Profile for the Gemma 4 model with TurboQuant KV cache compression.**

- **Primary Provider:** Local llama.cpp server with TurboQuant enabled
- **Endpoint:** http://localhost:8081
- **KV Compression:** turbo4 (4-bit PolarQuant)
- **Context Length:** 128K tokens
- **Memory Savings:** ~73% KV cache reduction
- **Fallback Providers:** Ollama, OpenAI-compatible API

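The ~73% figure can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming illustrative (not confirmed) layer/head dimensions and 4 bits per cached element for turbo4:

```python
# KV cache sizing sketch. The layer/head counts below are ILLUSTRATIVE
# assumptions, not confirmed Gemma 4 dimensions.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_elem):
    # K and V each store n_layers x n_kv_heads x head_dim x n_ctx elements.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bits_per_elem / 8

dims = dict(n_layers=46, n_kv_heads=8, head_dim=128, n_ctx=131072)
fp16 = kv_cache_bytes(**dims, bits_per_elem=16)
turbo4 = kv_cache_bytes(**dims, bits_per_elem=4)
print(f"raw savings: {1 - turbo4 / fp16:.0%}")  # → raw savings: 75%
```

Raw 4-bit storage gives 75%; a measured ~73% is consistent with that once per-block quantization scales are stored alongside the packed values.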
## Quick Start

### 1. Build TurboQuant-enabled llama.cpp

```bash
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
```

### 2. Download Gemma 4 Model

```bash
# Download Gemma 4 Q4_K_M quantized model
huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf
```

### 3. Start llama-server with TurboQuant

```bash
export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
    -m /path/to/gemma-4-q4_k_m.gguf \
    --port 8081 \
    -ctk turbo4 -ctv turbo4 \
    -c 131072 \
    --host 0.0.0.0
```

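Once the server is up, it can be smoke-tested through its OpenAI-compatible endpoint. A stdlib-only sketch; the `/v1/chat/completions` route follows the standard llama-server API, but treat the exact payload fields as assumptions:

```python
# Minimal smoke test for the TurboQuant-backed server started above.
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8081"):
    # Standard OpenAI-style chat payload; field names are assumptions.
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_request("Say hello.")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```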
### 4. Install Profile

```bash
# Copy profile to Hermes directory
cp gemma4-turboquant.yaml ~/.hermes/profiles/

# Or create symlink
ln -sf $(pwd)/gemma4-turboquant.yaml ~/.hermes/profiles/
```

### 5. Use with Hermes

```bash
# Start Hermes with the profile
hermes --profile gemma4-turboquant

# Or specify profile in Hermes config
echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
```

## Profile Configuration

The profile includes:

- **Primary Provider:** Local llama.cpp server with TurboQuant
- **Fallback Providers:** Ollama (local), OpenAI (cloud)
- **TurboQuant Settings:**
  - `kv_type`: turbo4 (4-bit compression)
  - `layer_adaptive_mode`: 7 (best quality/compression ratio)
  - `max_context`: 128K tokens

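The settings above can be pictured as plain data with a small validity check. A hedged sketch: key names mirror the bullet list, and the real Hermes YAML schema may differ.

```python
# Profile settings as a dict. Key names are ASSUMPTIONS drawn from the
# bullet list above, not the actual Hermes schema.
profile = {
    "provider": "llamacpp",
    "endpoint": "http://localhost:8081",
    "fallbacks": ["ollama", "openai"],
    "turboquant": {
        "kv_type": "turbo4",
        "layer_adaptive_mode": 7,
        "max_context": 131072,
    },
}

def validate(p):
    # Sanity-check the TurboQuant settings before handing them to a server.
    tq = p["turboquant"]
    assert tq["kv_type"] in {"turbo2", "turbo3", "turbo4"}
    assert isinstance(tq["layer_adaptive_mode"], int)
    assert tq["max_context"] > 0
    return True

print(validate(profile))  # → True
```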
## Performance Expectations

| Metric | Value | Notes |
|--------|-------|-------|
| KV Memory Savings | 73% | Measured on M3 Max |
| Prompt Processing | ~1% overhead | vs FP16 baseline |
| Generation Speed | ~11% overhead | vs FP16 baseline |
| Max Context (36GB) | 128K | Comfortable with 7.6GB headroom |

## Customization

### Adjust Compression Level

```yaml
turboquant:
  kv_type: "turbo3"  # Lower compression, faster
  # or
  kv_type: "turbo2"  # Minimal compression, fastest
```

### Disable Per-Layer Adaptive

```yaml
turboquant:
  layer_adaptive_mode: 0  # Uniform quantization
```

### Use Asymmetric K/V

For better quality on sensitive models:

```bash
# Start server with asymmetric K/V
llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
```

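The memory trade-off of mixing precisions is easy to quantify. A sketch, assuming q8_0 stores roughly 8 bits and turbo4 roughly 4 bits per element (ignoring per-block scale overhead):

```python
# Savings vs. a 16-bit FP16 baseline, averaged over the K and V caches.
def kv_savings(k_bits, v_bits, baseline_bits=16.0):
    return 1 - (k_bits + v_bits) / (2 * baseline_bits)

print(f"{kv_savings(8, 4):.1%}")  # q8_0 K / turbo4 V → 62.5%
print(f"{kv_savings(4, 4):.1%}")  # symmetric turbo4  → 75.0%
```

So the asymmetric configuration gives up roughly a third of the symmetric savings in exchange for a full-precision-quantized K cache.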
## Troubleshooting

### Server Won't Start

1. Check if port 8081 is available: `lsof -i :8081`
2. Verify the model path is correct
3. Ensure the TurboQuant branch is checked out

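The port check in step 1 can also be done programmatically. A small stdlib sketch, equivalent in spirit to `lsof -i :8081`:

```python
# Returns True if the port can be bound, i.e. nothing else holds it.
import socket

def port_is_free(port, host="127.0.0.1"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print(port_is_free(8081))  # True if no server is bound to 8081
```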
### Poor Generation Quality

1. Try `turbo3` instead of `turbo4`
2. Disable per-layer adaptive (mode 0)
3. Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`

### High Memory Usage

1. Reduce context length: `-c 65536` (64K)
2. Check `TURBO_LAYER_ADAPTIVE` is set
3. Monitor with: `vmmap --summary $(pgrep llama-server)`

## References

- [TurboQuant Build Spec](../BUILD-SPEC.md)
- [Phase 1 Report](../PHASE1-REPORT.md)
- [Full Knowledge Transfer](../FULL-REPORT.md)
- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)