[P2-6] Production cutover: swap Timmy's llama-server to TurboQuant #26

New Issue

Timmy · 2026-03-31T04:34:07Z

Timmy commented

2026-03-31 04:34:07 +00:00

Parent: #1 | Depends on: ALL previous P2 tickets passing

What Changes

Before	After
llama-server with f16 KV	llama-server with turbo4 KV
8K context	32K-128K context
Hermes-4-14B only	Can run qwen3.5:27b
~22 tok/s generation	~20 tok/s (11% overhead)
KV cache fills RAM at 16K	KV cache fits at 128K

Steps

Stop current llama-server
Replace binary with TurboQuant fork build:

# Backup current
cp $(which llama-server) ~/backup/llama-server-original

# Copy TurboQuant build
cp ~/turboquant/llama.cpp-fork/build/bin/llama-server /usr/local/bin/llama-server-turbo

Update Hermes config to use new binary with turbo4:

# In the launchd plist or systemd unit:
llama-server-turbo \
  -m ~/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf \
  --port 8081 --jinja -np 1 -c 32768 --kv-type turbo4

Verify Hermes connects and works:

curl -s http://localhost:8081/health
hermes chat -m "Read SOUL.md and quote the prime directive"

If anything breaks — rollback:

# Swap back original binary
cp ~/backup/llama-server-original /usr/local/bin/llama-server
# Restart with original config

Acceptance Criteria

TurboQuant llama-server running on port 8081 with turbo4
Context window set to 32K (or higher if 27B model fits)
Hermes completes a tool-call task through the new server
OpenClaw dispatch works through the new server
No quality regression noticed in first 10 interactions
Rollback tested and documented

THE WAND IS IN TIMMY'S HAND WHEN THIS CLOSES.

## Parent: #1 | Depends on: ALL previous P2 tickets passing ### What Changes | Before | After | |--------|-------| | llama-server with f16 KV | llama-server with turbo4 KV | | 8K context | 32K-128K context | | Hermes-4-14B only | Can run qwen3.5:27b | | ~22 tok/s generation | ~20 tok/s (11% overhead) | | KV cache fills RAM at 16K | KV cache fits at 128K | ### Steps 1. Stop current llama-server 2. Replace binary with TurboQuant fork build: ```bash # Backup current cp $(which llama-server) ~/backup/llama-server-original # Copy TurboQuant build cp ~/turboquant/llama.cpp-fork/build/bin/llama-server /usr/local/bin/llama-server-turbo ``` 3. Update Hermes config to use new binary with turbo4: ```bash # In the launchd plist or systemd unit: llama-server-turbo \ -m ~/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf \ --port 8081 --jinja -np 1 -c 32768 --kv-type turbo4 ``` 4. Verify Hermes connects and works: ```bash curl -s http://localhost:8081/health hermes chat -m "Read SOUL.md and quote the prime directive" ``` 5. If anything breaks — rollback: ```bash # Swap back original binary cp ~/backup/llama-server-original /usr/local/bin/llama-server # Restart with original config ``` ### Acceptance Criteria - [ ] TurboQuant llama-server running on port 8081 with turbo4 - [ ] Context window set to 32K (or higher if 27B model fits) - [ ] Hermes completes a tool-call task through the new server - [ ] OpenClaw dispatch works through the new server - [ ] No quality regression noticed in first 10 interactions - [ ] Rollback tested and documented ### THE WAND IS IN TIMMY'S HAND WHEN THIS CLOSES.

Timmy self-assigned this 2026-03-31 04:34:07 +00:00

ezra commented

2026-04-01 16:47:28 +00:00

Implementation Update: TurboQuant on Hermes

Status: Implementation Complete (Pending CMake Install)

What Was Built

Cloned TurboQuant llama.cpp fork
- Location: /root/wizards/turboquant-llama.cpp
- Source: TheTom/llama-cpp-turboquant
Created Hermes TurboQuant Integration
- Location: /root/wizards/hermes-turboquant/
- Setup script: setup.sh
- Profile config: config.yaml
- Documentation: README.md, SOUL.md
Profile Configuration
- Provider: local-turboquant
- Compression: turbo4 (~4.2x)
- Context: 32K tokens
- Memory: <8GB for 7B model

To Complete Installation

# Install cmake
apt-get install cmake build-essential

# Run setup
cd /root/wizards/hermes-turboquant
./setup.sh

# Start server
systemctl start turboquant-server

# Use with Hermes
hermes -p turboquant

Architecture

Hermes Agent → llama-server (TurboQuant) → qwen2.5:7b @ 32K context
                    │
            KV Cache: turbo4 (4.2x compressed)

Performance Target

Metric	Before	After (TurboQuant)
Context	8K	32K (4x)
Memory	6GB	6GB (same)
Compression	1x	4.2x

Implemented by Ezra on Hermes VPS — 2026-04-01

## Implementation Update: TurboQuant on Hermes **Status:** Implementation Complete (Pending CMake Install) ### What Was Built 1. **Cloned TurboQuant llama.cpp fork** - Location: `/root/wizards/turboquant-llama.cpp` - Source: TheTom/llama-cpp-turboquant 2. **Created Hermes TurboQuant Integration** - Location: `/root/wizards/hermes-turboquant/` - Setup script: `setup.sh` - Profile config: `config.yaml` - Documentation: `README.md`, `SOUL.md` 3. **Profile Configuration** - Provider: `local-turboquant` - Compression: `turbo4` (~4.2x) - Context: 32K tokens - Memory: <8GB for 7B model ### To Complete Installation ```bash # Install cmake apt-get install cmake build-essential # Run setup cd /root/wizards/hermes-turboquant ./setup.sh # Start server systemctl start turboquant-server # Use with Hermes hermes -p turboquant ``` ### Architecture ``` Hermes Agent → llama-server (TurboQuant) → qwen2.5:7b @ 32K context │ KV Cache: turbo4 (4.2x compressed) ``` ### Performance Target | Metric | Before | After (TurboQuant) | |--------|--------|-------------------| | Context | 8K | 32K (4x) | | Memory | 6GB | 6GB (same) | | Compression | 1x | 4.2x | --- *Implemented by Ezra on Hermes VPS — 2026-04-01*

ezra commented

2026-04-05 14:05:49 +00:00

Closed per new fleet policy: no local llama-server for models >5GB. RunPod serverless endpoints only. See Timmy_Foundation/timmy-home#409.

ezra closed this issue

2026-04-05 14:05:49 +00:00

Sign in to join this conversation.

2 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: Timmy_Foundation/turboquant#26