diff --git a/profiles/README.md b/profiles/README.md
new file mode 100644
index 00000000..33044acd
--- /dev/null
+++ b/profiles/README.md
@@ -0,0 +1,141 @@

# Hermes Profiles for TurboQuant

This directory contains Hermes configuration profiles for running models with TurboQuant KV cache compression.

## Available Profiles

### gemma4-turboquant.yaml

**Profile for the Gemma 4 model with TurboQuant KV cache compression.**

- **Primary Provider:** Local llama.cpp server with TurboQuant enabled
- **Endpoint:** http://localhost:8081
- **KV Compression:** turbo4 (4-bit PolarQuant)
- **Context Length:** 128K tokens
- **Memory Savings:** ~73% KV cache reduction
- **Fallback Providers:** Ollama, OpenAI-compatible API

## Quick Start

### 1. Build TurboQuant-enabled llama.cpp

```bash
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
```

### 2. Download the Gemma 4 Model

```bash
# Download a Gemma 4 Q4_K_M quantized model
# (substitute the Hugging Face repository that hosts the GGUF)
huggingface-cli download <org>/<gemma-4-gguf-repo> gemma-4-q4_k_m.gguf --local-dir ./models
```

### 3. Start llama-server with TurboQuant

```bash
export TURBO_LAYER_ADAPTIVE=7
./build/bin/llama-server \
  -m /path/to/gemma-4-q4_k_m.gguf \
  --port 8081 \
  -ctk turbo4 -ctv turbo4 \
  -c 131072 \
  --host 0.0.0.0
```

### 4. Install the Profile

```bash
# Copy the profile into the Hermes directory
cp gemma4-turboquant.yaml ~/.hermes/profiles/

# Or create a symlink
ln -sf $(pwd)/gemma4-turboquant.yaml ~/.hermes/profiles/
```

### 5. Use with Hermes

```bash
# Start Hermes with the profile
hermes --profile gemma4-turboquant

# Or set it as the default profile in the Hermes config
echo "default_profile: gemma4-turboquant" >> ~/.hermes/config.yaml
```

## Profile Configuration

The profile includes:

- **Primary Provider:** Local llama.cpp server with TurboQuant
- **Fallback Providers:** Ollama (local), OpenAI (cloud)
- **TurboQuant Settings:**
  - `kv_type`: turbo4 (4-bit compression)
  - `layer_adaptive_mode`: 7 (best quality/compression ratio)
  - `max_context`: 128K tokens

## Performance Expectations

| Metric | Value | Notes |
|--------|-------|-------|
| KV Memory Savings | 73% | Measured on M3 Max |
| Prompt Processing | ~1% overhead | vs FP16 baseline |
| Generation Speed | ~11% overhead | vs FP16 baseline |
| Max Context (36GB) | 128K | Comfortable, with ~7.6GB headroom |
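As a rough cross-check on these numbers, an FP16 KV cache costs 2 (K and V) × layers × KV heads × head dim × context tokens × 2 bytes. The Gemma 4 dimensions below are illustrative assumptions, not published values; substitute the real ones from the GGUF metadata:

```bash
# Back-of-envelope KV cache sizing (illustrative Gemma 4 dimensions -- adjust to the actual model)
LAYERS=36 KV_HEADS=4 HEAD_DIM=256 CTX=131072
FP16_BYTES=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * 2 ))   # K+V tensors at 2 bytes/element
TURBO4_BYTES=$(( FP16_BYTES * 27 / 100 ))                      # ~73% savings with turbo4
echo "FP16 KV cache:   $(( FP16_BYTES / 1000000000 )) GB"
echo "turbo4 KV cache: $(( TURBO4_BYTES / 1000000000 )) GB"
```

With these placeholder dimensions the FP16 cache lands near the ~20GB figure quoted in the deployment notes, and turbo4 brings it down to roughly 5GB.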
## Customization

### Adjust Compression Level

```yaml
turboquant:
  kv_type: "turbo3"   # Lower compression, faster
  # or
  kv_type: "turbo2"   # Minimal compression, fastest
```

### Disable Per-Layer Adaptive

```yaml
turboquant:
  layer_adaptive_mode: 0   # Uniform quantization
```

### Use Asymmetric K/V

For better quality on sensitive models:

```bash
# Start the server with asymmetric K/V
llama-server -m model.gguf --port 8081 -ctk q8_0 -ctv turbo4 -c 131072
```

## Troubleshooting

### Server Won't Start

1. Check that port 8081 is available: `lsof -i :8081`
2. Verify the model path is correct
3. Ensure the TurboQuant branch is checked out

### Poor Generation Quality

1. Try `turbo3` instead of `turbo4`
2. Disable per-layer adaptive quantization (mode 0)
3. Use asymmetric K/V: `-ctk q8_0 -ctv turbo4`

### High Memory Usage

1. Reduce context length: `-c 65536` (64K)
2. Check that `TURBO_LAYER_ADAPTIVE` is set
3. Monitor with: `vmmap --summary $(pgrep llama-server)`
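### Checking the Endpoint Directly

If a problem persists, querying the server's OpenAI-compatible API directly takes the Hermes profile out of the loop. llama-server exposes `/health` and `/v1/chat/completions`, and for a single-model server the `model` field is typically ignored, so something like the following should work once step 3 of the Quick Start is running:

```bash
# Quick liveness check
curl -s http://localhost:8081/health

# Minimal completion request straight against the server
curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-4", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 16}'
```

If this returns a sensible completion but Hermes does not, the issue is in the profile (endpoint, `api_path`, or timeouts) rather than in TurboQuant.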
## References

- [TurboQuant Build Spec](../BUILD-SPEC.md)
- [Phase 1 Report](../PHASE1-REPORT.md)
- [Full Knowledge Transfer](../FULL-REPORT.md)
- [llama.cpp TurboQuant Fork](https://github.com/TheTom/llama-cpp-turboquant)

diff --git a/profiles/hermes-profile-gemma4-turboquant.yaml b/profiles/hermes-profile-gemma4-turboquant.yaml
new file mode 100644
index 00000000..20994bd4
--- /dev/null
+++ b/profiles/hermes-profile-gemma4-turboquant.yaml
@@ -0,0 +1,169 @@

# Hermes Profile: Gemma 4 + TurboQuant KV Cache Compression
# For use with a local llama.cpp server running TurboQuant-enabled inference
# Drop into ~/.hermes/profiles/gemma4-turboquant.yaml

profile:
  name: "gemma4-turboquant"
  version: "1.0.0"
  description: "Gemma 4 model with TurboQuant KV cache compression for extended context on Apple Silicon"

# Primary provider: local llama.cpp server with TurboQuant
providers:
  primary:
    type: "llama.cpp"
    name: "local-turboquant"
    endpoint: "http://localhost:8081"
    api_path: "/v1/chat/completions"
    timeout_ms: 120000

    # Model configuration
    model:
      name: "gemma-4"
      path: "/path/to/gemma-4-q4_k_m.gguf"   # Update with the actual model path

    # TurboQuant KV cache compression settings
    turboquant:
      enabled: true
      kv_type: "turbo4"        # Options: turbo2, turbo3, turbo4 (4-bit, recommended)
      layer_adaptive_mode: 7   # Per-layer adaptive quantization (0-7, 7 = best quality/ratio)

    # Context and memory settings
    context:
      max_tokens: 131072   # 128K context with TurboQuant compression
      batch_size: 512

    # Generation parameters
    generation:
      temperature: 0.7
      top_p: 0.9
      top_k: 40
      repeat_penalty: 1.1
      frequency_penalty: 0.0
      presence_penalty: 0.0

    # Server startup command (for reference)
    server_command: |
      export TURBO_LAYER_ADAPTIVE=7
      llama-server \
        -m /path/to/gemma-4-q4_k_m.gguf \
        --port 8081 \
        -ctk turbo4 -ctv turbo4 \
        -c 131072 \
        --host 0.0.0.0

  # Fallback provider 1: Ollama (standard, no TurboQuant)
  fallback_1:
    type: "ollama"
    name: "ollama-gemma4"
    endpoint: "http://localhost:11434"
    api_path: "/api/chat"
    timeout_ms: 120000

    model:
      name: "gemma4:latest"

    generation:
      temperature: 0.7
      top_p: 0.9
      top_k: 40

  # Fallback provider 2: OpenAI-compatible API (cloud backup)
  fallback_2:
    type: "openai"
    name: "openai-backup"
    endpoint: "https://api.openai.com"
    api_path: "/v1/chat/completions"
    timeout_ms: 60000

    model:
      name: "gpt-4"

    generation:
      temperature: 0.7
      max_tokens: 4096

# Performance and monitoring
performance:
  # Memory management for TurboQuant
  memory:
    max_gpu_memory_gb: 28            # Leave headroom on a 36GB M3 Max
    kv_cache_compression: "turbo4"
    estimated_savings: "73%"         # TurboQuant delivers ~73% KV memory savings

  # Benchmarking integration
  benchmarks:
    enabled: true
    metrics:
      - "tokens_per_second"
      - "time_to_first_token"
      - "peak_memory_usage"
      - "perplexity"

# Quality validation
quality:
  # Test prompts for quality comparison
  test_prompts:
    enabled: true
    prompt_file: "benchmarks/prompts.json"

  # Perplexity testing
  perplexity:
    enabled: true
    corpus: "wikitext-2-raw"
    context_lengths: [8192, 32768, 65536, 131072]

# Environment variables (applied when using this profile)
environment:
  TURBO_LAYER_ADAPTIVE: "7"   # Per-layer adaptive quantization mode
  GGML_METAL_DEBUG: "0"       # Disable Metal debug in production
  OMP_NUM_THREADS: "8"        # Optimize for M3 Max performance cores

# Logging and diagnostics
logging:
  level: "info"
  metrics_interval_seconds: 60
  log_token_speed: true
  log_memory_usage: true

# Notes for deployment
notes:
  deployment: |
    1. Ensure the llama.cpp fork with TurboQuant is built:
       cd /path/to/llama-cpp-turboquant
       git checkout feature/turboquant-kv-cache
       cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
       cmake --build build -j$(sysctl -n hw.ncpu)

    2. Start the server:
       export TURBO_LAYER_ADAPTIVE=7
       ./build/bin/llama-server \
         -m /path/to/gemma-4-q4_k_m.gguf \
         --port 8081 \
         -ctk turbo4 -ctv turbo4 \
         -c 131072 \
         --host 0.0.0.0

    3. Verify the server is running:
       curl http://localhost:8081/v1/models

    4. Copy this profile to Hermes:
       cp hermes-profile-gemma4-turboquant.yaml ~/.hermes/profiles/

  performance_notes: |
    TurboQuant delivers:
    - ~73% KV cache memory savings
    - ~1% prompt processing overhead
    - ~11% generation overhead
    - 128K context on 36GB hardware

    With TurboQuant on Gemma 4 (estimated):
    - Model weights: ~16GB at Q4_K_M
    - KV cache at 128K: ~5GB (vs ~20GB without compression)
    - Total memory: ~23GB (fits comfortably in a 31GB budget)

  troubleshooting: |
    - If generation speed is slow, try turbo3 instead of turbo4
    - If quality degrades, disable per-layer adaptive quantization (set mode to 0)
    - For maximum quality on sensitive layers, use asymmetric K/V:
      -ctk q8_0 -ctv turbo4
    - Monitor memory with: vmmap --summary $(pgrep llama-server)