feat: Add Hermes profile for Gemma 4 + TurboQuant (Issue #28) #33

Merged
Rockachopa merged 1 commit from burn/20260409-2113-hermes-profile into main 2026-04-10 03:43:49 +00:00
Owner

Summary

This PR adds a Hermes profile for running the Gemma 4 model with TurboQuant KV cache compression.

Changes

  • Added profiles/hermes-profile-gemma4-turboquant.yaml - Hermes profile configuration
  • Added profiles/README.md - Documentation and setup instructions

Features

  • Primary Provider: Local llama.cpp server with TurboQuant enabled
  • Endpoint: http://localhost:8081
  • KV Compression: turbo4 (4-bit PolarQuant) with 73% memory savings
  • Context Length: 128K tokens (enabled by TurboQuant compression)
  • Per-Layer Adaptive: Mode 7 for best quality/compression ratio
  • Fallback Providers: Ollama (local), OpenAI (cloud backup)
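
The features above suggest a profile shaped roughly like the sketch below. The field names and schema are illustrative assumptions, not the actual Hermes format; see profiles/hermes-profile-gemma4-turboquant.yaml in this PR for the real file.

```yaml
# Illustrative sketch only -- key names are assumptions, not the real schema.
name: gemma4-turboquant
providers:
  - name: llamacpp-local          # primary: TurboQuant-enabled llama.cpp
    type: openai-compatible
    base_url: http://localhost:8081
    model: gemma-4
  - name: ollama-fallback          # local fallback
    type: ollama
  - name: openai-fallback          # cloud backup
    type: openai
context_length: 131072             # 128K, enabled by turbo4 KV compression
```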

Performance

  • KV cache memory savings: ~73%
  • Prompt processing overhead: ~1%
  • Generation overhead: ~11%
  • Enables 128K context on 36GB hardware (vs 32K without TurboQuant)
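
The ~73% figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a 4-bit payload plus one fp16 scale per 64-element block and an illustrative model shape; the actual turbo4 layout and Gemma 4 dimensions are not specified in this PR.

```python
# Back-of-the-envelope KV cache sizing. Block size, scale format, and
# model shape are assumptions; the real turbo4 layout may differ.

FP16_BITS = 16           # baseline KV element width
QUANT_BITS = 4           # turbo4 payload bits per element
BLOCK = 64               # assumed elements per quantization block
SCALE_BITS = 16          # assumed one fp16 scale per block

bits_per_elem = QUANT_BITS + SCALE_BITS / BLOCK   # 4.25 effective bits
savings = 1 - bits_per_elem / FP16_BITS           # ~0.734, near the quoted ~73%

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bits):
    """Total K+V cache size in bytes for a decoder-only model."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # K and V
    return elems * bits / 8

# Hypothetical model shape for illustration (not Gemma 4's real config).
fp16 = kv_cache_bytes(46, 8, 256, 131072, FP16_BITS)
q4   = kv_cache_bytes(46, 8, 256, 131072, bits_per_elem)

print(f"savings: {savings:.1%}")                    # 73.4%
print(f"fp16 KV @128K:   {fp16 / 2**30:.1f} GiB")   # 46.0 GiB
print(f"turbo4 KV @128K: {q4 / 2**30:.1f} GiB")     # 12.2 GiB
```

Under these assumptions the fp16 cache at 128K would not fit alongside model weights on a 36GB machine, while the compressed cache does, which is consistent with the 32K-vs-128K claim above.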

Setup Instructions

  1. Build TurboQuant-enabled llama.cpp from feature/turboquant-kv-cache branch
  2. Start llama-server with TurboQuant flags: -ctk turbo4 -ctv turbo4 -c 131072
  3. Copy profile to ~/.hermes/profiles/
  4. Start Hermes with --profile gemma4-turboquant
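
The steps above can be condensed into a command sketch. The branch name and server flags come from this PR; the clone URL and model filename are placeholders, since neither is given here.

```shell
# 1. Build llama.cpp from the TurboQuant branch (clone URL is a placeholder).
git clone --branch feature/turboquant-kv-cache https://example.com/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build -j

# 2. Start llama-server with TurboQuant flags (model path is illustrative).
./build/bin/llama-server -m models/gemma-4.gguf \
    -ctk turbo4 -ctv turbo4 -c 131072 --port 8081

# 3. Install the profile.
mkdir -p ~/.hermes/profiles
cp profiles/hermes-profile-gemma4-turboquant.yaml ~/.hermes/profiles/

# 4. Launch Hermes against it.
hermes --profile gemma4-turboquant
```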

Testing

  • Verified profile structure and configuration
  • Tested server command syntax
  • Documented troubleshooting steps
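
The "verified profile structure" step can be sketched as a small programmatic check. The key names below mirror the feature list in this PR and are assumptions, not the real Hermes schema.

```python
# Minimal structural check for a profile. The schema (key names) is an
# assumption based on this PR's feature list, not the actual Hermes format.

profile = {
    "name": "gemma4-turboquant",
    "providers": [
        {"name": "llamacpp-local", "base_url": "http://localhost:8081"},
        {"name": "ollama-fallback"},
        {"name": "openai-fallback"},
    ],
    "context_length": 131072,
}

def validate_profile(p):
    """Return a list of problems; an empty list means the profile looks sane."""
    problems = []
    if not p.get("name"):
        problems.append("missing profile name")
    providers = p.get("providers", [])
    if not providers:
        problems.append("no providers configured")
    elif len(providers) < 2:
        problems.append("no fallback provider")
    return problems

print(validate_profile(profile))  # → []
```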

Closes #28

Timmy added 1 commit 2026-04-10 01:16:41 +00:00
- Add gemma4-turboquant.yaml profile for Hermes
- Configure local llama.cpp server with TurboQuant KV compression
- Set turbo4 (4-bit) compression with per-layer adaptive mode 7
- Support 128K context with 73% KV memory savings
- Include fallback providers (Ollama, OpenAI)
- Add profiles/README.md with setup and usage instructions
- Document performance expectations and troubleshooting

Closes #28
Rockachopa reviewed 2026-04-10 03:41:32 +00:00
Rockachopa left a comment
Owner

Auto-approved: clean diff, no conflicts, mergeable.

Rockachopa scheduled this pull request to auto merge when all checks succeed 2026-04-10 03:41:33 +00:00
Rockachopa merged commit f13287dc58 into main 2026-04-10 03:43:49 +00:00
Rockachopa referenced this issue from a commit 2026-04-10 03:43:52 +00:00