# Hermes Profiles for TurboQuant

This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.

## Available Profiles

### gemma4-turboquant.yaml

**Profile for Gemma 4 model with TurboQuant KV cache compression.**

- **Primary Provider:** Local llama.cpp server with TurboQuant enabled
- **Endpoint:** http://localhost:8081
- **KV Compression:** turbo4 (4-bit PolarQuant)
- **Context Length:** 128K tokens
- **Memory Savings:** ~73% KV cache reduction
- **Fallback Providers:** Ollama, OpenAI-compatible API

## Quick Start

### 1. Build TurboQuant-enabled llama.cpp

```bash
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
```

### 2. Download Gemma 4 Model

```bash
# Download Gemma 4 Q4_K_M quantized model
huggingface-cli download gemma-4-q4_k_m.gguf
```

### 3. Start llama-server with TurboQuant

```bash
export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
  -m /path/to/gemma-4-q4_k_m.gguf \
  --port 8081 \
  -ctk turbo4 -ctv turbo4 \
  -c 131072 \
  --host 0.0.0.0
```

### 4. Install Profile

```bash
# Copy profile to Hermes directory
cp gemma4-turboquant.yaml ~/.hermes/profiles/

# Or create symlink
ln -sf $(pwd)/gemma4-turboquant.yaml ~/.hermes/profiles/
```

### 5. Use with Hermes

```bash
# Start Hermes with the profile
hermes --profile gemma4-turboquant

# Or specify profile in Hermes config
echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
```

## Profile Configuration

The profile includes:

- **Primary Provider:** Local llama.cpp server with TurboQuant
- **Fallback Providers:** Ollama (local), OpenAI (cloud)
- **TurboQuant Settings:**
  - `kv_type`: turbo4 (4-bit compression)
  - `layer_adaptive_mode`: 7 (best quality/compression ratio)
  - `max_context`: 128K tokens

## Performance Expectations

| Metric | Value | Notes |
|--------|-------|-------|
| KV Memory Savings | 73% | Measured on M3 Max |
| Prompt Processing | ~1% overhead | vs FP16 baseline |
| Generation Speed | ~11% overhead | vs FP16 baseline |
| Max Context (36GB) | 128K | Comfortable with 7.6GB headroom |

## Customization

### Adjust Compression Level

```yaml
turboquant:
  kv_type: "turbo3"  # Lower compression, faster
  # or
  kv_type: "turbo2"  # Minimal compression, fastest
```

### Disable Per-Layer Adaptive

```yaml
turboquant:
  layer_adaptive_mode: 0  # Uniform quantization
```

### Use Asymmetric K/V

For better quality on sensitive models:

```bash
# Start server with asymmetric K/V
llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
```

## Troubleshooting

### Server Won't Start

1. Check if port 8081 is available: `lsof -i :8081`
2. Verify the model path is correct
3. Ensure the TurboQuant branch is checked out

### Poor Generation Quality

1. Try `turbo3` instead of `turbo4`
2. Disable per-layer adaptive (mode 0)
3. Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`

### High Memory Usage

1. Reduce context length: `-c 65536` (64K)
2. Check that `TURBO_LAYER_ADAPTIVE` is set
3. Monitor with: `vmmap --summary $(pgrep llama-server)`

## References

- [TurboQuant Build Spec](../BUILD-SPEC.md)
- [Phase 1 Report](../PHASE1-REPORT.md)
- [Full Knowledge Transfer](../FULL-REPORT.md)
- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)
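
## Verifying the Setup

Before pointing Hermes at the profile, it can help to smoke-test the server started in step 3. The commands below are a minimal sketch assuming the default port 8081 from this profile: standard llama-server builds expose a `/health` endpoint and an OpenAI-compatible `/v1/chat/completions` endpoint, and the `model` field is omitted here because the server serves whichever model it was started with.

```bash
# Confirm the server is up and the model has finished loading
curl -s http://localhost:8081/health

# Send a minimal chat completion to the OpenAI-compatible endpoint
curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```

If both calls return successfully, the profile's primary provider should work from Hermes without further changes.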
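
## Estimating KV Cache Memory

To sanity-check the savings in the table above against your own hardware, the FP16 KV cache can be estimated as 2 (K and V) × layers × KV heads × head dim × 2 bytes × context length, with turbo4 taking roughly 27% of that (matching the ~73% reduction). The snippet below is only an illustration: the layer, head, and head-dim values are placeholders, not actual Gemma 4 dimensions.

```bash
# Placeholder architecture values -- substitute the real ones for your model
LAYERS=46 KV_HEADS=8 HEAD_DIM=128 CTX=131072

# FP16 KV cache bytes: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes * context
FP16_BYTES=$((2 * LAYERS * KV_HEADS * HEAD_DIM * 2 * CTX))
echo "FP16 KV cache:   $((FP16_BYTES / 1024 / 1024)) MiB"

# turbo4 stores ~4 bits per element plus scales, roughly 27% of FP16
echo "turbo4 KV cache: $((FP16_BYTES * 27 / 100 / 1024 / 1024)) MiB"
```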