[P2] Custom Ollama build + MacBook deployment #10

Closed
opened 2026-03-30 17:11:13 +00:00 by Timmy · 1 comment
Owner

Parent: #1 | Depends on: #9 (API check)

Build a custom Ollama using our llama.cpp fork as a submodule. Deploy to the MacBook.

Steps

  1. Build custom Ollama binary with our llama.cpp fork
  2. Deploy to MacBook as replacement Ollama
  3. Verify existing endpoint (10.0.0.133:11434) works identically
  4. Test with qwen3.5:27b — basic generation
  5. If 128K context works: update Ollama model config to advertise larger context
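The build/deploy/verify sequence above can be sketched in shell. This is a hedged sketch: the fork repository URL and install path are assumptions, the exact build commands vary by Ollama version (older trees use `go generate`, newer ones use CMake presets), and the endpoints are Ollama's standard REST API.

```shell
# 1. Build a custom Ollama binary against the llama.cpp fork
#    (repo URL and build steps are assumptions; they differ across Ollama versions)
git clone --recurse-submodules https://example.com/our/ollama-fork.git
cd ollama-fork
go generate ./...    # regenerate bindings from the vendored/submoduled llama.cpp
go build -o ollama .

# 2. Deploy to the MacBook as a drop-in replacement (install path is an assumption)
scp ollama user@10.0.0.133:/usr/local/bin/ollama

# 3. Verify the existing endpoint responds identically
curl -s http://10.0.0.133:11434/api/tags

# 4. Basic generation test with qwen3.5:27b
curl -s http://10.0.0.133:11434/api/generate \
  -d '{"model": "qwen3.5:27b", "prompt": "Say hello.", "stream": false}'
```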

Estimated Time: 15-25 min (once llama.cpp fork is validated)

Acceptance Criteria

  • Custom Ollama binary built
  • Deployed to MacBook
  • 10.0.0.133:11434 responds correctly
  • Existing models load and generate
  • Context length config updated if applicable
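For the last criterion, Ollama advertises context length through the model's `num_ctx` parameter. A minimal sketch of the step-5 config update, assuming 128K (131072 tokens) verifies and using the standard Modelfile workflow:

```shell
# Export the current model config, raise the advertised context, re-create
ollama show qwen3.5:27b --modelfile > Modelfile
printf 'PARAMETER num_ctx 131072\n' >> Modelfile
ollama create qwen3.5:27b-128k -f Modelfile
```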
Timmy added this to the Phase 2 — Ollama Integration + Production milestone 2026-03-30 17:11:13 +00:00
Timmy added the deploy, phase-2, build, owner:cid labels 2026-03-30 17:11:13 +00:00
Author · Owner

Custom Ollama Build — DEFERRED

Three approaches attempted, all failed:

  1. Full fork replacement: Fails — Ollama's 34 custom patches don't apply
  2. Patch application: 22/34 patches fail (sha1 mismatch from version drift)
  3. Incremental injection: Partially compiles but Ollama HEAD itself has pre-existing build failures

Root Cause

Ollama vendors llama.cpp with deep modifications. The TurboQuant fork spans 30+ files (Metal shaders, CUDA kernels, CPU ops, KV cache code). Clean integration requires rebasing onto Ollama's exact pinned commit — estimated multi-day effort.
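The rebase that clean integration would require looks roughly like this; the commit placeholders are deliberate, since the pinned commit must first be read out of Ollama's vendor metadata (its location varies by Ollama version):

```shell
# Replay the TurboQuant changes onto Ollama's pinned llama.cpp commit.
# <pinned-commit> and <fork-base> are placeholders, not known values.
cd llama-cpp-turboquant
git fetch origin                    # ensure upstream history is available locally
git rebase --onto <pinned-commit> <fork-base> turboquant
# Each of the 30+ touched files (Metal shaders, CUDA kernels, CPU ops,
# KV cache) can conflict here, which is why this is a multi-day effort.
```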

Recommended Path: llama-server

The fork's llama-server at /tmp/llama-cpp-turboquant/build/bin/llama-server is:

  • Already built and working
  • Speaks OpenAI chat completions API
  • Supports all TurboQuant KV types (-ctk/-ctv flags)
  • Supports per-layer adaptive (TURBO_LAYER_ADAPTIVE env)
  • Can serve on same port as Ollama (11434)
  • Deploy: llama-server -m <model.gguf> --port 11434 -ctk turbo4 -ctv turbo4
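The bullets above amount to the following launch-and-verify sequence (the model path is a placeholder; the binary path and `-ctk`/`-ctv` flags come from this issue, `TURBO_LAYER_ADAPTIVE=1` is an assumed value for the adaptive mode, and `/v1/chat/completions` is llama-server's standard OpenAI-compatible endpoint):

```shell
# Serve the fork's llama-server on the port Ollama previously used
TURBO_LAYER_ADAPTIVE=1 /tmp/llama-cpp-turboquant/build/bin/llama-server \
  -m /path/to/model.gguf --port 11434 -ctk turbo4 -ctv turbo4 &

# Verify via the OpenAI-compatible chat completions API
curl -s http://10.0.0.133:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "ping"}]}'
```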

Deferred Work

The custom Ollama build is recorded as a future task. Each time Ollama updates its llama.cpp pin, the gap narrows; the Phase 4 upstream watch (#15) covers this.

Timmy closed this issue 2026-03-30 21:04:03 +00:00