[P2] Custom Ollama build + MacBook deployment #10

Closed
opened 2026-03-30 17:11:13 +00:00 by Timmy · 1 comment
Owner

Parent: #1 | Depends on: #9 (API check)

Build a custom Ollama using our llama.cpp fork as a submodule. Deploy to the MacBook.

Steps

  1. Build custom Ollama binary with our llama.cpp fork
  2. Deploy to MacBook as replacement Ollama
  3. Verify existing endpoint (10.0.0.133:11434) works identically
  4. Test with qwen3.5:27b — basic generation
  5. If 128K context works: update Ollama model config to advertise larger context
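The build/deploy/verify sequence above can be sketched in shell. This is a hedged sketch: the fork repository URL and install path are assumptions, the exact build commands vary by Ollama version (older trees use `go generate`, newer ones use CMake presets), and the endpoints are Ollama's standard REST API.

```shell
# 1. Build a custom Ollama binary against the llama.cpp fork
#    (repo URL and build steps are assumptions; they differ across Ollama versions)
git clone --recurse-submodules https://example.com/our/ollama-fork.git
cd ollama-fork
go generate ./...    # regenerate bindings from the vendored/submoduled llama.cpp
go build -o ollama .

# 2. Deploy to the MacBook as a drop-in replacement (install path is an assumption)
scp ollama user@10.0.0.133:/usr/local/bin/ollama

# 3. Verify the existing endpoint responds identically
curl -s http://10.0.0.133:11434/api/tags

# 4. Basic generation test with qwen3.5:27b
curl -s http://10.0.0.133:11434/api/generate \
  -d '{"model": "qwen3.5:27b", "prompt": "Say hello.", "stream": false}'
```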

Estimated Time: 15-25 min (once llama.cpp fork is validated)

Acceptance Criteria

  • Custom Ollama binary built
  • Deployed to MacBook
  • 10.0.0.133:11434 responds correctly
  • Existing models load and generate
  • Context length config updated if applicable
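For the last criterion, Ollama advertises context length through the model's `num_ctx` parameter. A minimal sketch of the step-5 config update, assuming 128K (131072 tokens) verifies and using the standard Modelfile workflow:

```shell
# Export the current model config, raise the advertised context, re-create
ollama show qwen3.5:27b --modelfile > Modelfile
printf 'PARAMETER num_ctx 131072\n' >> Modelfile
ollama create qwen3.5:27b-128k -f Modelfile
```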
Timmy added this to the Phase 2 — Ollama Integration + Production milestone 2026-03-30 17:11:13 +00:00
Timmy added the deploy, phase-2, build, owner:cid labels 2026-03-30 17:11:13 +00:00
Author · Owner

Custom Ollama Build — DEFERRED

Three approaches attempted, all failed:

  1. Full fork replacement: Fails — Ollama's 34 custom patches don't apply
  2. Patch application: 22/34 patches fail (sha1 mismatch from version drift)
  3. Incremental injection: Partially compiles but Ollama HEAD itself has pre-existing build failures

Root Cause

Ollama vendors llama.cpp with deep modifications. The TurboQuant fork spans 30+ files (Metal shaders, CUDA kernels, CPU ops, KV cache code). Clean integration requires rebasing onto Ollama's exact pinned commit — estimated multi-day effort.
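The rebase that clean integration would require looks roughly like this; the commit placeholders are deliberate, since the pinned commit must first be read out of Ollama's vendor metadata (its location varies by Ollama version):

```shell
# Replay the TurboQuant changes onto Ollama's pinned llama.cpp commit.
# <pinned-commit> and <fork-base> are placeholders, not known values.
cd llama-cpp-turboquant
git fetch origin                    # ensure upstream history is available locally
git rebase --onto <pinned-commit> <fork-base> turboquant
# Each of the 30+ touched files (Metal shaders, CUDA kernels, CPU ops,
# KV cache) can conflict here, which is why this is a multi-day effort.
```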

Recommended Path: llama-server

The fork's llama-server at /tmp/llama-cpp-turboquant/build/bin/llama-server is:

  • Already built and working
  • Speaks OpenAI chat completions API
  • Supports all TurboQuant KV types (-ctk/-ctv flags)
  • Supports per-layer adaptive (TURBO_LAYER_ADAPTIVE env)
  • Can serve on same port as Ollama (11434)
  • Deploy: llama-server -m <model.gguf> --port 11434 -ctk turbo4 -ctv turbo4
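The bullets above amount to the following launch-and-verify sequence (the model path is a placeholder; the binary path and `-ctk`/`-ctv` flags come from this issue, `TURBO_LAYER_ADAPTIVE=1` is an assumed value for the adaptive mode, and `/v1/chat/completions` is llama-server's standard OpenAI-compatible endpoint):

```shell
# Serve the fork's llama-server on the port Ollama previously used
TURBO_LAYER_ADAPTIVE=1 /tmp/llama-cpp-turboquant/build/bin/llama-server \
  -m /path/to/model.gguf --port 11434 -ctk turbo4 -ctv turbo4 &

# Verify via the OpenAI-compatible chat completions API
curl -s http://10.0.0.133:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "ping"}]}'
```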

Deferred Work

The custom Ollama build is recorded as a future task. Each time Ollama updates its llama.cpp pin, the gap narrows; the Phase 4 upstream watch (#15) covers this.

Timmy closed this issue 2026-03-30 21:04:03 +00:00