feat: Add Hermes profile for Gemma 4 + TurboQuant (Issue #28)

- Add gemma4-turboquant.yaml profile for Hermes
- Configure local llama.cpp server with TurboQuant KV compression
- Set turbo4 (4-bit) compression with per-layer adaptive mode 7
- Support 128K context with 73% KV memory savings
- Include fallback providers (Ollama, OpenAI)
- Add profiles/README.md with setup and usage instructions
- Document performance expectations and troubleshooting

Closes #28
2026-04-09 21:15:57 -04:00
parent dea59c04d7
commit aa0e76c1ab
2 changed files with 310 additions and 0 deletions
--- a/profiles/README.md
+++ b/profiles/README.md
@@ -0,0 +1,141 @@
+# Hermes Profiles for TurboQuant
+
+This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.
+
+## Available Profiles
+
+### gemma4-turboquant.yaml
+
+**Profile for the Gemma 4 model with TurboQuant KV cache compression.**
+
+- **Primary Provider:** Local llama.cpp server with TurboQuant enabled
+- **Endpoint:** http://localhost:8081
+- **KV Compression:** turbo4 (4-bit PolarQuant)
+- **Context Length:** 128K tokens
+- **Memory Savings:** ~73% KV cache reduction
+- **Fallback Providers:** Ollama, OpenAI-compatible API
+
+## Quick Start
+
+### 1. Build TurboQuant-enabled llama.cpp
+
+```bash
+git clone https://github.com/TheTom/llama-cpp-turboquant.git
+cd llama-cpp-turboquant
+git checkout feature/turboquant-kv-cache
+cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j$(sysctl -n hw.ncpu)
+```
+
+### 2. Download Gemma 4 Model
+
+```bash
+# Download Gemma 4 Q4_K_M quantized model
+huggingface-cli download <model-repo> gemma-4-q4_k_m.gguf
+```
+
+### 3. Start llama-server with TurboQuant
+
+```bash
+export TURBO_LAYER_ADAPTIVE=7
+./build/bin/llama-server \
+  -m /path/to/gemma-4-q4_k_m.gguf \
+  --port 8081 \
+  -ctk turbo4 -ctv turbo4 \
+  -c 131072 \
+  --host 0.0.0.0
+```
+
+### 4. Install Profile
+
+```bash
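+# Create the Hermes profile directory if it does not exist yet
+mkdir -p ~/.hermes/profiles
+
+# Copy profile to Hermes directory
+cp gemma4-turboquant.yaml ~/.hermes/profiles/
+
+# Or create a symlink (quoted so paths containing spaces still work)
+ln -sf "$(pwd)/gemma4-turboquant.yaml" ~/.hermes/profiles/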
+```
+
+### 5. Use with Hermes
+
+```bash
+# Start Hermes with the profile
+hermes --profile gemma4-turboquant
+
+# Or make it the default profile in the Hermes config
+# (this appends a line; check first that default_profile is not already set)
+echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
+```
+
+## Profile Configuration
+
+The profile includes:
+
+- **Primary Provider:** Local llama.cpp server with TurboQuant
+- **Fallback Providers:** Ollama (local), OpenAI (cloud)
+- **TurboQuant Settings:**
+  - `kv_type`: turbo4 (4-bit compression)
+  - `layer_adaptive_mode`: 7 (best trade-off between quality and compression)
+  - `max_context`: 128K tokens
+
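+As a rough sketch, these settings might appear in `gemma4-turboquant.yaml` as shown below. The `turboquant` key and the three setting names come from this README; the exact nesting and the numeric form of `max_context` are assumptions, so treat the shipped profile as authoritative.
+
+```yaml
+# Illustrative fragment only; check the shipped profile for the real layout.
+turboquant:
+  kv_type: "turbo4"          # 4-bit KV cache compression
+  layer_adaptive_mode: 7     # per-layer adaptive mode used by this profile
+  max_context: 131072        # 128K tokens (numeric form is an assumption)
+```
+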
+## Performance Expectations
+
+| Metric | Value | Notes |
+|--------|-------|-------|
+| KV Memory Savings | 73% | Measured on M3 Max |
+| Prompt Processing | ~1% overhead | vs FP16 baseline |
+| Generation Speed | ~11% overhead | vs FP16 baseline |
+| Max Context (36GB) | 128K | Comfortable with 7.6GB headroom |
+
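+To make the savings figure concrete, the compressed KV cache size follows directly from the 73% reduction in the table. The 10 GB baseline is a hypothetical figure chosen for illustration, not a measurement:
+
+```math
+M_{\text{turbo4}} \approx (1 - 0.73)\, M_{\text{FP16}} = 0.27\, M_{\text{FP16}}, \qquad \text{e.g. } 0.27 \times 10\ \text{GB} = 2.7\ \text{GB}
+```
+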
+## Customization
+
+### Adjust Compression Level
+
+```yaml
+turboquant:
+  kv_type: "turbo3"  # Lower compression, faster
+  # Or, instead of the line above (keep only one kv_type key):
+  # kv_type: "turbo2"  # Minimal compression, fastest
+```
+
+### Disable Per-Layer Adaptive
+
+```yaml
+turboquant:
+  layer_adaptive_mode: 0  # Uniform quantization
+```
+
+### Use Asymmetric K/V
+
+For better quality on sensitive models:
+
+```bash
+# Start server with asymmetric K/V
+./build/bin/llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
+```
+
+## Troubleshooting
+
+### Server Won't Start
+
+1. Check if port 8081 is available: `lsof -i :8081`
+2. Verify that the model path passed to `-m` is correct
+3. Ensure the `feature/turboquant-kv-cache` branch is checked out and built
+
+### Poor Generation Quality
+
+1. Try `turbo3` instead of `turbo4`
+2. Disable per-layer adaptive (mode 0)
+3. Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`
+
+### High Memory Usage
+
+1. Reduce context length: `-c 65536` (64K)
+2. Check that `TURBO_LAYER_ADAPTIVE` is exported in the shell that launches the server: `echo $TURBO_LAYER_ADAPTIVE`
+3. Monitor with: `vmmap --summary $(pgrep llama-server)`
+
+## References
+
+- [TurboQuant Build Spec](../BUILD-SPEC.md)
+- [Phase 1 Report](../PHASE1-REPORT.md)
+- [Full Knowledge Transfer](../FULL-REPORT.md)
+- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)