feat: Add Hermes profile for Gemma 4 + TurboQuant (Issue #28) #33

Merged
Rockachopa merged 1 commit from burn/20260409-2113-hermes-profile into main 2026-04-10 03:43:49 +00:00
Owner

Summary

This PR adds a Hermes profile for running the Gemma 4 model with TurboQuant KV cache compression.

Changes

  • Added profiles/hermes-profile-gemma4-turboquant.yaml - Hermes profile configuration
  • Added profiles/README.md - Documentation and setup instructions

Features

  • Primary Provider: Local llama.cpp server with TurboQuant enabled
  • Endpoint: http://localhost:8081
  • KV Compression: turbo4 (4-bit PolarQuant) with 73% memory savings
  • Context Length: 128K tokens (enabled by TurboQuant compression)
  • Per-Layer Adaptive: Mode 7 for best quality/compression ratio
  • Fallback Providers: Ollama (local), OpenAI (cloud backup)
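
The features above suggest a profile shaped roughly like the sketch below. The field names and schema are illustrative assumptions, not the actual Hermes format; see profiles/hermes-profile-gemma4-turboquant.yaml in this PR for the real file.

```yaml
# Illustrative sketch only -- key names are assumptions, not the real schema.
name: gemma4-turboquant
providers:
  - name: llamacpp-local          # primary: TurboQuant-enabled llama.cpp
    type: openai-compatible
    base_url: http://localhost:8081
    model: gemma-4
  - name: ollama-fallback          # local fallback
    type: ollama
  - name: openai-fallback          # cloud backup
    type: openai
context_length: 131072             # 128K, enabled by turbo4 KV compression
```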

Performance

  • KV cache memory savings: ~73%
  • Prompt processing overhead: ~1%
  • Generation overhead: ~11%
  • Enables 128K context on 36GB hardware (vs 32K without TurboQuant)
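
The ~73% figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a 4-bit payload plus one fp16 scale per 64-element block and an illustrative model shape; the actual turbo4 layout and Gemma 4 dimensions are not specified in this PR.

```python
# Back-of-the-envelope KV cache sizing. Block size, scale format, and
# model shape are assumptions; the real turbo4 layout may differ.

FP16_BITS = 16           # baseline KV element width
QUANT_BITS = 4           # turbo4 payload bits per element
BLOCK = 64               # assumed elements per quantization block
SCALE_BITS = 16          # assumed one fp16 scale per block

bits_per_elem = QUANT_BITS + SCALE_BITS / BLOCK   # 4.25 effective bits
savings = 1 - bits_per_elem / FP16_BITS           # ~0.734, near the quoted ~73%

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bits):
    """Total K+V cache size in bytes for a decoder-only model."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # K and V
    return elems * bits / 8

# Hypothetical model shape for illustration (not Gemma 4's real config).
fp16 = kv_cache_bytes(46, 8, 256, 131072, FP16_BITS)
q4   = kv_cache_bytes(46, 8, 256, 131072, bits_per_elem)

print(f"savings: {savings:.1%}")                    # 73.4%
print(f"fp16 KV @128K:   {fp16 / 2**30:.1f} GiB")   # 46.0 GiB
print(f"turbo4 KV @128K: {q4 / 2**30:.1f} GiB")     # 12.2 GiB
```

Under these assumptions the fp16 cache at 128K would not fit alongside model weights on a 36GB machine, while the compressed cache does, which is consistent with the 32K-vs-128K claim above.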

Setup Instructions

  1. Build TurboQuant-enabled llama.cpp from feature/turboquant-kv-cache branch
  2. Start llama-server with TurboQuant flags: -ctk turbo4 -ctv turbo4 -c 131072
  3. Copy profile to ~/.hermes/profiles/
  4. Start Hermes with --profile gemma4-turboquant
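
The steps above can be condensed into a command sketch. The branch name and server flags come from this PR; the clone URL and model filename are placeholders, since neither is given here.

```shell
# 1. Build llama.cpp from the TurboQuant branch (clone URL is a placeholder).
git clone --branch feature/turboquant-kv-cache https://example.com/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build -j

# 2. Start llama-server with TurboQuant flags (model path is illustrative).
./build/bin/llama-server -m models/gemma-4.gguf \
    -ctk turbo4 -ctv turbo4 -c 131072 --port 8081

# 3. Install the profile.
mkdir -p ~/.hermes/profiles
cp profiles/hermes-profile-gemma4-turboquant.yaml ~/.hermes/profiles/

# 4. Launch Hermes against it.
hermes --profile gemma4-turboquant
```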

Testing

  • Verified profile structure and configuration
  • Tested server command syntax
  • Documented troubleshooting steps
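
The "verified profile structure" step can be sketched as a small programmatic check. The key names below mirror the feature list in this PR and are assumptions, not the real Hermes schema.

```python
# Minimal structural check for a profile. The schema (key names) is an
# assumption based on this PR's feature list, not the actual Hermes format.

profile = {
    "name": "gemma4-turboquant",
    "providers": [
        {"name": "llamacpp-local", "base_url": "http://localhost:8081"},
        {"name": "ollama-fallback"},
        {"name": "openai-fallback"},
    ],
    "context_length": 131072,
}

def validate_profile(p):
    """Return a list of problems; an empty list means the profile looks sane."""
    problems = []
    if not p.get("name"):
        problems.append("missing profile name")
    providers = p.get("providers", [])
    if not providers:
        problems.append("no providers configured")
    elif len(providers) < 2:
        problems.append("no fallback provider")
    return problems

print(validate_profile(profile))  # → []
```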

Closes #28

Timmy added 1 commit 2026-04-10 01:16:41 +00:00
- Add gemma4-turboquant.yaml profile for Hermes
- Configure local llama.cpp server with TurboQuant KV compression
- Set turbo4 (4-bit) compression with per-layer adaptive mode 7
- Support 128K context with 73% KV memory savings
- Include fallback providers (Ollama, OpenAI)
- Add profiles/README.md with setup and usage instructions
- Document performance expectations and troubleshooting

Closes #28
Rockachopa reviewed 2026-04-10 03:41:32 +00:00
Rockachopa left a comment
Owner

Auto-approved: clean diff, no conflicts, mergeable.

Rockachopa scheduled this pull request to auto merge when all checks succeed 2026-04-10 03:41:33 +00:00
Rockachopa merged commit f13287dc58 into main 2026-04-10 03:43:49 +00:00
Rockachopa referenced this issue from a commit 2026-04-10 03:43:52 +00:00