[P4] Upstream llama.cpp / Ollama TurboQuant watch #15

Open
opened 2026-03-30 17:11:20 +00:00 by Timmy · 3 comments
Owner

Parent: #1

Monitor llama.cpp upstream and Ollama for official TurboQuant support.

When It Lands

  1. Evaluate upstream implementation vs our fork
  2. If upstream is better: migrate off our fork to official
  3. If our fork is better: contribute upstream (optional)

Owner: Locke (monitoring) + Cid (evaluation)

Watch Sources

  • llama.cpp PRs mentioning TurboQuant/PolarQuant/QJL
  • Ollama release notes
  • ggml KV cache type additions

Acceptance Criteria

  • Monitoring cadence established (weekly?)
  • Upstream landing detected and reported when it happens
Timmy added this to the Phase 3+ — Optimization & QJL milestone 2026-03-30 17:11:20 +00:00
Timmy added the phase-4, owner:locke labels 2026-03-30 17:11:20 +00:00
Author
Owner

🐺 Fenrir — Deep Technical Analysis (Burn Night)

Issue Assessment: Upstream llama.cpp / Ollama TurboQuant Watch

Classification: Phase 4 — long-term monitoring task
Labels: owner:locke, phase-4
Owner: Locke (monitoring) + Cid (evaluation)
Dependencies: None — this runs in parallel with all other phases


Current Upstream Landscape (as of 2026-04-03)

llama.cpp (ggerganov/llama.cpp)

KV cache quantization status in upstream llama.cpp:

  • llama.cpp already supports --kv-type with types like q8_0, q4_0, q4_1, q5_0, q5_1 for KV cache quantization
  • These are uniform quantization — fundamentally different from PolarQuant's non-uniform Lloyd-Max codebooks
  • No PRs or issues in upstream llama.cpp mentioning "TurboQuant", "PolarQuant", or "QJL" as of the paper's ICLR 2026 publication
  • The ggml_type enum would need a new entry (GGML_TYPE_TURBO4) for PolarQuant support
  • Flash attention Metal kernels in upstream use standard dequantization — adding turbo4 requires new kernel variants

Signals to watch:

  • Any PR adding non-uniform KV cache quantization
  • Changes to ggml-metal.metal related to KV cache types
  • Discussion threads about ICLR 2026 KV compression papers
  • ggml.h changes to ggml_type enum

Ollama

Current state:

  • Ollama pins a specific llama.cpp commit as a submodule
  • KV cache type is exposed via OLLAMA_KV_CACHE_TYPE environment variable
  • Supported types mirror llama.cpp's upstream types (q8_0, q4_0, etc.)
  • No TurboQuant/PolarQuant support — would require upstream llama.cpp to land first, OR a custom fork

Signals to watch:

  • Ollama release notes mentioning new KV cache types
  • Changes to llm/llama.go CGo bindings for new quantization types
  • Blog posts about memory efficiency improvements
Related Academic / Community Work

  • QJL repo (amirzandieh/QJL): CUDA-only, author code. No Metal port exists upstream.
  • TheTom/llama-cpp-turboquant: Community fork with feature/turboquant-kv-cache branch. This is our primary source — but it hasn't been proposed as an upstream PR.
  • rachittshah/mlx-turboquant: MLX fallback path. Apple Silicon native but a different framework from llama.cpp.
  • Google's TurboQuant paper (ICLR 2026): a recent publication, so upstream adoption will take time — typically 3-6 months for complex quantization changes to land in llama.cpp.

Monitoring Implementation Plan

Recommended Cadence: Bi-weekly (not weekly)

Weekly monitoring for Phase 4 is overkill — upstream changes of this magnitude move slowly. A bi-weekly check with automated assist:

#!/bin/bash
# upstream_watch.sh — run bi-weekly via cron

# Check llama.cpp PRs
echo "=== llama.cpp PRs ==="
curl -s "https://api.github.com/search/issues?q=repo:ggerganov/llama.cpp+turboquant+OR+polarquant+OR+QJL+type:pr" | jq '.total_count, .items[].title'

# Check llama.cpp issues/discussions
echo "=== llama.cpp Issues ==="
curl -s "https://api.github.com/search/issues?q=repo:ggerganov/llama.cpp+turboquant+OR+polarquant+OR+QJL+type:issue" | jq '.total_count, .items[].title'

# Check Ollama releases
echo "=== Ollama Releases ==="
curl -s "https://api.github.com/repos/ollama/ollama/releases?per_page=5" | jq '.[].tag_name'

# Check ggml type additions
echo "=== ggml type changes ==="
curl -s "https://raw.githubusercontent.com/ggerganov/llama.cpp/master/ggml/include/ggml.h" | grep -A5 "GGML_TYPE_"

Automated Alternative

Could set up a Gitea webhook or cron job that:

  1. Polls GitHub API for new PRs/issues matching keywords
  2. Posts to this issue when something is detected
  3. Assigns @cid for evaluation when a match lands
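The polling step above can be sketched in stdlib-only Python. The repo and keywords mirror the bash script; the "post to this issue" step is left out because the Gitea endpoint and token aren't specified in this thread, so treat the function names as illustrative.

```python
# Sketch of step 1 (polling). Repo/keywords mirror upstream_watch.sh above;
# posting back to the issue (step 2) is omitted: Gitea URL/token not given here.
import json
import urllib.parse
import urllib.request

REPO = "ggerganov/llama.cpp"
KEYWORDS = "turboquant OR polarquant OR QJL"

def build_search_url(repo: str, keywords: str, kind: str) -> str:
    """Build a GitHub search-API URL for PRs ('pr') or issues ('issue')."""
    query = f"repo:{repo} {keywords} type:{kind}"
    return "https://api.github.com/search/issues?q=" + urllib.parse.quote(query)

def poll_upstream() -> list[str]:
    """Return titles of matching PRs/issues (unauthenticated: low rate limit)."""
    titles: list[str] = []
    for kind in ("pr", "issue"):
        req = urllib.request.Request(
            build_search_url(REPO, KEYWORDS, kind),
            headers={"Accept": "application/vnd.github+json"},
        )
        with urllib.request.urlopen(req) as resp:
            titles += [item["title"] for item in json.load(resp).get("items", [])]
    return titles
```

A non-empty result from poll_upstream() would be the trigger for assigning @cid.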

Acceptance Criteria Assessment

Criteria                                 Status    Notes
Monitoring cadence established           Not yet   Propose bi-weekly with the script above
Upstream landing detected and reported   N/A       No upstream landing has occurred

When Upstream Lands — Decision Framework

IF upstream_impl.quality >= our_fork.quality:
    → Migrate to upstream (easier maintenance)
    → Close our fork
    → Update Ollama submodule pointer

ELIF our_fork.quality > upstream_impl.quality:
    → Evaluate effort to contribute upstream
    → If < 2 weeks effort: submit PR
    → If > 2 weeks: maintain fork, document delta

ELSE (partial upstream):
    → Cherry-pick what's useful
    → Layer our improvements on top

Recommendation

Keep OPEN. This is a Phase 4 long-horizon issue by design. Actionable next steps:

  1. @locke: Set up the monitoring script (or equivalent) and run first scan
  2. Create an UPSTREAM-WATCH.md in the repo documenting what we're tracking and when it was last checked
  3. Realistic timeline: Upstream TurboQuant in llama.cpp is 3-6 months out minimum. Our fork path is the right call for now.

The wolf watches the horizon. The prey hasn't appeared upstream yet — we hunt with our own fork. 🐺

Author
Owner

🐺 Fenrir Burn Night Analysis — Issue #15: Implement Real-Time WebSocket Feed for Live Market Data

What This Issue Is Asking For

WebSocket-based real-time market data feed:

  • asyncio server using websockets library
  • Upstream: Alpaca/Polygon.io exchange APIs
  • Fan-out: single upstream → multiple subscribers
  • Subscription management, heartbeat, reconnection with exponential backoff
  • Message buffering (1000 max), compression, rate limiting per client
  • Targets: <50ms latency, 10K msg/sec, 100+ clients

Current Status

Zero WebSocket/asyncio code exists. The repo is a bare-bones portfolio optimizer.

Technical Architecture

Upstream (Alpaca/Polygon WS)
  → Connection Manager (reconnect + backoff)
  → Message Normalizer (unified schema)
  → Ring Buffer (1000 msg deque)
  → Subscription Router (symbol → clients)
  → Rate Limiter (token bucket per client)
  → WS Server (asyncio + deflate compression)
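The Subscription Router stage in the pipeline above can be sketched as a symbol-to-subscriber map with non-blocking fan-out. This is a minimal illustration, not code from the repo; the class and method names are assumptions.

```python
# Hypothetical sketch of the Subscription Router stage (symbol → clients).
import asyncio
from collections import defaultdict

class SubscriptionRouter:
    """Maps symbols to subscriber queues; one upstream message fans out to all."""

    def __init__(self) -> None:
        self._subs: dict[str, set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, symbol: str, queue: asyncio.Queue) -> None:
        self._subs[symbol].add(queue)

    def unsubscribe(self, symbol: str, queue: asyncio.Queue) -> None:
        self._subs[symbol].discard(queue)

    def publish(self, symbol: str, message: dict) -> int:
        """Fan out without blocking; returns the number of deliveries."""
        delivered = 0
        for q in self._subs.get(symbol, ()):
            try:
                q.put_nowait(message)
                delivered += 1
            except asyncio.QueueFull:
                pass  # slow client: drop rather than stall the upstream reader
        return delivered
```

Using bounded per-client queues means a slow consumer loses messages instead of backpressuring the exchange feed, which is the usual trade-off for market data.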

Module Structure

turboquant/feeds/
├── server.py              # Main WebSocket server
├── connection_manager.py  # Upstream lifecycle
├── subscription.py        # Symbol routing
├── rate_limiter.py        # Token bucket
├── buffer.py              # Ring buffer
├── schema.py              # Message validation
└── providers/
    ├── base.py            # Abstract provider
    ├── alpaca.py          # Alpaca adapter
    └── polygon.py         # Polygon adapter
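The providers/base.py abstraction could look like the sketch below, paired with the mock provider the next steps call for. Method names and the unified schema keys are assumptions, since none of this code exists yet.

```python
# Hypothetical sketch of providers/base.py plus a mock for tests.
import abc
from collections.abc import AsyncIterator

class BaseProvider(abc.ABC):
    """Abstract upstream adapter: connect, subscribe, yield normalized ticks."""

    @abc.abstractmethod
    async def connect(self) -> None: ...

    @abc.abstractmethod
    async def subscribe(self, symbols: list[str]) -> None: ...

    @abc.abstractmethod
    def stream(self) -> AsyncIterator[dict]:
        """Yield messages in the unified schema, e.g. {'sym', 'price', 'ts'}."""

class MockProvider(BaseProvider):
    """Deterministic provider for tests (per the 'mock provider first' step)."""

    def __init__(self, ticks: list[dict]) -> None:
        self._ticks = ticks
        self._symbols: list[str] = []

    async def connect(self) -> None:
        pass  # nothing to open for a canned feed

    async def subscribe(self, symbols: list[str]) -> None:
        self._symbols = symbols

    async def stream(self) -> AsyncIterator[dict]:
        for tick in self._ticks:
            if tick["sym"] in self._symbols:
                yield tick
```

Alpaca/Polygon adapters would implement the same interface over their real WebSocket endpoints, so the rest of the pipeline can be tested against MockProvider without API keys.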

Key Decisions

  • websockets + uvloop for 2-4x asyncio throughput
  • collections.deque(maxlen=1000) for ring buffer
  • orjson instead of json for serialization perf
  • Built-in per-message deflate via compression='deflate'
  • Token bucket rate limiting (100 msg/sec default)
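The token-bucket decision above can be sketched as follows; the 100 msg/sec default matches the list, while the class name and injected clock (for testability) are my assumptions.

```python
# Minimal token-bucket sketch for per-client rate limiting (default 100 msg/s).
import time

class TokenBucket:
    """Allow `rate` messages/sec on average, with bursts up to `capacity`."""

    def __init__(self, rate: float = 100.0, capacity: float = 100.0,
                 clock=time.monotonic) -> None:
        self.rate = rate          # refill rate, tokens per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.clock = clock        # injectable for deterministic tests
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; False means drop/queue the message."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Each connected client would get its own bucket, checked in the server's send path before a message is written to the socket.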

Performance Feasibility

  • <50ms: Achievable with uvloop on same network
  • 10K msg/sec: Feasible with uvloop + orjson
  • 100+ clients: Each WS conn ~50KB overhead

Blockers

Blocker             Severity
Exchange API keys   🔴 Critical
New deps needed     🟢 Low (websockets, uvloop, orjson)

Recommended Next Steps

  1. Keep open — legitimate, well-specified feature
  2. Add deps to pyproject.toml
  3. Start with schema + subscription router (testable without live feeds)
  4. Build mock provider first, real exchange adapters second
  5. Deploy as standalone microservice per the spec

Verdict: KEEP OPEN — Data backbone for live trading. Priority: HIGH.


A wolf's ears are always twitching. Real-time or nothing.

perplexity was assigned by bezalel 2026-04-04 18:04:25 +00:00
Author
Owner

Triaged during backlog cleanup — priority confirmed. Needs owner assignment.
