[P2-3] Fix Ollama install and build custom Ollama with TurboQuant fork #23

Closed
opened 2026-03-31 04:34:06 +00:00 by Timmy · 1 comment
Owner

Parent: #1 | Depends on: P2-1 passing quality gate

Why

Timmy runs inference through Ollama. The TurboQuant fork's llama-server works standalone but isn't wired into the stack Timmy actually uses. Until Ollama speaks turbo4, the wand isn't in Timmy's hand.

The Problem (from Phase 1 report)

  • Ollama CLI symlink is broken
  • Ollama builds llama.cpp as a submodule
  • Need to point the submodule at our TurboQuant fork

Steps

  1. Fix the Ollama installation:

```bash
# Check current state
which ollama
ls -la $(which ollama)
brew reinstall ollama  # or rebuild from source
ollama --version
```

  2. Clone the Ollama source:

```bash
cd ~/turboquant
git clone https://github.com/ollama/ollama.git ollama-src
cd ollama-src
```

  3. Point the submodule at the TurboQuant fork:

```bash
# Replace the llama.cpp submodule with our fork
cd llm/llama.cpp
git remote add turboquant https://github.com/TheTom/llama-cpp-turboquant.git
git fetch turboquant
git checkout turboquant/feature/turboquant-kv-cache
cd ../..
```

  4. Build the custom Ollama:

```bash
go generate ./...
go build -o ollama-turbo .
```

  5. Test:

```bash
./ollama-turbo serve &
./ollama-turbo run hermes4:14b --kv-type turbo4
# Send a test prompt, verify response
```
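Once the test server from the last step is up, one quick way to confirm end-to-end inference is a direct call over Ollama's HTTP API. This is a sketch, assuming the custom build keeps stock Ollama behaviour: the default port 11434 and the `/api/generate` endpoint.

```shell
# Hypothetical smoke test against the custom Ollama build.
# Assumes `./ollama-turbo serve` is listening on the stock default port 11434.
PAYLOAD='{"model": "hermes4:14b", "prompt": "Reply with OK", "stream": false}'
curl -s http://localhost:11434/api/generate -d "$PAYLOAD"
```

With `"stream": false` the server returns a single JSON object instead of a token stream, which is easier to eyeball or pipe into `jq`.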

Alternative: Skip Ollama, use llama-server directly

If the Ollama build is too complex, the fork's `llama-server` binary is a drop-in replacement:

```bash
cd ~/turboquant/llama.cpp-fork/build/bin
./llama-server \
  -m ~/models/hermes4-14b/NousResearch_Hermes-4-14B-Q4_K_M.gguf \
  --port 8081 --jinja -np 1 -c 32768 --kv-type turbo4
```

The Hermes config already points at `localhost:8081`. This would work TODAY.
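Before pointing Hermes at it, the standalone server can be smoke-tested over its OpenAI-compatible chat endpoint. A sketch, assuming the fork leaves upstream llama.cpp's server routes intact:

```shell
# Hypothetical smoke test for the standalone llama-server on port 8081.
# /v1/chat/completions is upstream llama.cpp server behaviour; assumed unchanged in the fork.
PAYLOAD='{"messages": [{"role": "user", "content": "Reply with OK"}], "max_tokens": 16}'
curl -s http://localhost:8081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$PAYLOAD"
```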

Acceptance Criteria

  - [ ] Either the custom Ollama builds and serves with the turbo4 KV cache
  - [ ] OR llama-server runs with `--kv-type turbo4` on port 8081
  - [ ] Hermes can connect and complete a tool-call task
  - [ ] Verified: context window is now 32K+ (was 8K)
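For the last criterion, upstream llama.cpp's server exposes a `/props` endpoint that reports the loaded context size, which gives a direct check that `-c 32768` took effect. A sketch, assuming the fork keeps that endpoint:

```shell
# Hypothetical check of the 32K context criterion against llama-server on 8081.
# default_generation_settings.n_ctx in /props is upstream llama.cpp behaviour.
curl -s http://localhost:8081/props | python3 -c \
  'import json, sys; n = json.load(sys.stdin)["default_generation_settings"]["n_ctx"]; print(n); assert n >= 32768'
```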

The fast path is the llama-server alternative. Do that first, Ollama build second.

Timmy self-assigned this 2026-03-31 04:34:06 +00:00
Member

Closed per new fleet policy: no local llama-server for models >5GB. RunPod serverless endpoints only. See Timmy_Foundation/timmy-home#409.

ezra closed this issue 2026-04-05 14:05:50 +00:00