[P2] Long-session quality test — 50-turn conversation #12

Open
opened 2026-03-30 17:11:16 +00:00 by Timmy · 3 comments
Owner

Parent: #1 | Depends on: #10 (Ollama deploy)

Short-context PPL can miss quality degradation that surfaces during sustained context pressure.

Test

50-turn multi-step reasoning conversation:
code gen -> debug -> refactor -> test -> iterate

Run on both TurboQuant and FP16 KV baseline.

Watch For

  • Coherence drift after turn 30+
  • Hallucinated references to earlier context
  • Attention score softmax concentration (if measurable)
  • Codebook boundary errors accumulating over many KV cache writes

Acceptance Criteria

  • 50-turn conversation run on both configs
  • Quality comparison documented
  • Any degradation flagged with turn number where it appears
## Parent: #1 | Depends on: #10 (Ollama deploy) Short-context PPL can miss quality degradation that surfaces during sustained context pressure. ## Test 50-turn multi-step reasoning conversation: code gen -> debug -> refactor -> test -> iterate Run on both TurboQuant and FP16 KV baseline. ## Watch For - Coherence drift after turn 30+ - Hallucinated references to earlier context - Attention score softmax concentration (if measurable) - Codebook boundary errors accumulating over many KV cache writes ## Acceptance Criteria - [ ] 50-turn conversation run on both configs - [ ] Quality comparison documented - [ ] Any degradation flagged with turn number where it appears
Timmy added this to the Phase 2 — Ollama Integration + Production milestone 2026-03-30 17:11:16 +00:00
Timmy added the benchmarkphase-2owner:cid labels 2026-03-30 17:11:16 +00:00
Author
Owner

🐺 Fenrir Burn Night Analysis — Issue #12: Optimize Portfolio Risk Calculation with NumPy Vectorization

What This Issue Is Asking For

Replace naive O(n²·T) nested Python loops for portfolio variance with NumPy vectorized ops:

  • Before: Triple-nested loop computing covariance element-by-element
  • After: np.cov(returns) + weights @ cov_matrix @ weights
  • Targets: 100 assets <10ms, 500 assets <50ms, 1000 assets <200ms
  • Extras: Numba JIT, chunked processing for 10K+, covariance caching

Current Status Assessment

The slow code shown in this issue does not exist in the codebase. The current optimizer.py already uses NumPy and takes a pre-computed covariance matrix as input — it never computes one internally with nested loops.

This should be reframed as "Create a risk calculation module" rather than "optimize" an existing one.

Proposed Implementation: turboquant/risk_engine.py

import numpy as np
from typing import Optional

class RiskEngine:
    def __init__(self, returns: np.ndarray):
        self.returns = np.asarray(returns, dtype=np.float64)
        self._cov_matrix = None
    
    @property
    def covariance_matrix(self) -> np.ndarray:
        if self._cov_matrix is None:
            self._cov_matrix = np.cov(self.returns)
        return self._cov_matrix
    
    def portfolio_variance(self, weights): 
        w = np.asarray(weights, dtype=np.float64)
        return float(w @ self.covariance_matrix @ w)
    
    def portfolio_volatility(self, weights):
        return np.sqrt(self.portfolio_variance(weights))
    
    def sharpe_ratio(self, weights, risk_free_rate=0.0):
        ret = weights @ self.returns.mean(axis=1)
        vol = self.portfolio_volatility(weights)
        return (ret - risk_free_rate) / vol if vol > 0 else 0.0
    
    def max_drawdown(self, weights):
        cum = np.cumprod(1 + weights @ self.returns)
        running_max = np.maximum.accumulate(cum)
        return float(np.min((cum - running_max) / running_max))
    
    def value_at_risk(self, weights, confidence=0.95):
        return float(-np.percentile(weights @ self.returns, (1-confidence)*100))

Chunked Processing for 10K+ Assets

Block-wise matrix multiply for covariance to control memory: process in 2000-asset chunks.

Expected Performance (NumPy vectorized)

  • 100 assets: ~1-3ms (target <10ms)
  • 500 assets: ~15-30ms (target <50ms)
  • 1000 assets: ~80-150ms (target <200ms)
  • Numba JIT is overkill — pure NumPy hits all targets

Blockers

None — NumPy already in deps. This is a creation task.

  1. Keep open — retitle to "Create vectorized risk calculation module"
  2. Create turboquant/risk_engine.py + tests/test_risk_engine.py
  3. Wire into PortfolioOptimizer for risk metrics
  4. Add Sharpe/drawdown/VaR (needed by #16's integration tests)

Verdict: KEEP OPEN — Foundational module. Issues #11 and #16 depend on this. Priority: HIGH.


The wolf optimizes its hunt path instinctively. No nested loops in nature.

# 🐺 Fenrir Burn Night Analysis — Issue #12: Optimize Portfolio Risk Calculation with NumPy Vectorization ## What This Issue Is Asking For Replace naive O(n²·T) nested Python loops for portfolio variance with NumPy vectorized ops: - Before: Triple-nested loop computing covariance element-by-element - After: `np.cov(returns)` + `weights @ cov_matrix @ weights` - Targets: 100 assets <10ms, 500 assets <50ms, 1000 assets <200ms - Extras: Numba JIT, chunked processing for 10K+, covariance caching ## Current Status Assessment **The slow code shown in this issue does not exist in the codebase.** The current `optimizer.py` already uses NumPy and takes a pre-computed covariance matrix as input — it never computes one internally with nested loops. This should be reframed as **"Create a risk calculation module"** rather than "optimize" an existing one. ## Proposed Implementation: `turboquant/risk_engine.py` ```python import numpy as np from typing import Optional class RiskEngine: def __init__(self, returns: np.ndarray): self.returns = np.asarray(returns, dtype=np.float64) self._cov_matrix = None @property def covariance_matrix(self) -> np.ndarray: if self._cov_matrix is None: self._cov_matrix = np.cov(self.returns) return self._cov_matrix def portfolio_variance(self, weights): w = np.asarray(weights, dtype=np.float64) return float(w @ self.covariance_matrix @ w) def portfolio_volatility(self, weights): return np.sqrt(self.portfolio_variance(weights)) def sharpe_ratio(self, weights, risk_free_rate=0.0): ret = weights @ self.returns.mean(axis=1) vol = self.portfolio_volatility(weights) return (ret - risk_free_rate) / vol if vol > 0 else 0.0 def max_drawdown(self, weights): cum = np.cumprod(1 + weights @ self.returns) running_max = np.maximum.accumulate(cum) return float(np.min((cum - running_max) / running_max)) def value_at_risk(self, weights, confidence=0.95): return float(-np.percentile(weights @ self.returns, (1-confidence)*100)) ``` ### Chunked Processing for 10K+ Assets Block-wise matrix multiply for covariance to control memory: process in 2000-asset chunks. ### Expected Performance (NumPy vectorized) - 100 assets: ~1-3ms (target <10ms) - 500 assets: ~15-30ms (target <50ms) - 1000 assets: ~80-150ms (target <200ms) - Numba JIT is overkill — pure NumPy hits all targets ## Blockers None — NumPy already in deps. This is a creation task. ## Recommended Next Steps 1. **Keep open** — retitle to "Create vectorized risk calculation module" 2. Create `turboquant/risk_engine.py` + `tests/test_risk_engine.py` 3. Wire into PortfolioOptimizer for risk metrics 4. Add Sharpe/drawdown/VaR (needed by #16's integration tests) ## Verdict: **KEEP OPEN** — Foundational module. Issues #11 and #16 depend on this. Priority: **HIGH**. --- *The wolf optimizes its hunt path instinctively. No nested loops in nature.*
Author
Owner

🐺 Fenrir — Deep Technical Analysis (Burn Night)

Issue Assessment: Long-Session Quality Test — 50-Turn Conversation

Classification: Phase 2 benchmark — critical quality gate
Labels: benchmark, owner:cid, phase-2
Dependencies: #10 (Ollama deploy) — BLOCKING
Blocked by: Custom Ollama build with TurboQuant KV type not yet deployed


Why This Test Matters — The Deepest Cut

This is arguably the most important quality test in the entire TurboQuant initiative. Here's why:

PPL (perplexity) tests like WikiText-2 measure average quality across short contexts. Needle-in-haystack tests check retrieval at specific context lengths. But neither catches the failure mode that matters most for production use:

Cumulative quantization error drift during extended sessions.

When TurboQuant compresses KV cache entries, each quantization introduces a small error. Over 50 turns of conversation, the model is reading from hundreds of quantized KV cache entries simultaneously. The errors can:

  1. Compound non-linearly — attention score computation involves softmax over all cached keys. Small per-key errors accumulate into attention distribution shifts.
  2. Create attention collapse — if quantization noise shifts attention scores enough, the softmax concentrates on wrong tokens (the "concentration" effect mentioned in the issue).
  3. Manifest as context hallucination — the model "remembers" something from earlier turns that it actually garbled due to KV cache quantization noise.

The spec identifies this correctly: "Codebook boundary errors accumulating over many KV cache writes."

Technical Architecture of the Test

Test Design: Code Development Conversation

The issue specifies: code gen → debug → refactor → test → iterate

This is an excellent choice because it requires:

  • Precise factual recall (variable names, function signatures from 20+ turns ago)
  • Logical consistency (the refactored code must be consistent with the original spec)
  • Multi-hop reasoning (debugging requires remembering what the code does AND what the bug report says)
  • Cumulative state tracking (each iteration builds on the last)
Turns 1-5:    Spec a non-trivial system (e.g., "distributed rate limiter with Redis")
Turns 6-15:   Implement module by module, building on previous code
Turns 16-20:  Introduce bugs, ask model to find them (tests recall of implementation)
Turns 21-30:  Refactor — extract interfaces, add error handling
Turns 31-40:  Write tests — must reference correct function signatures from turns 6-15
Turns 41-45:  Performance optimization — requires understanding the full codebase
Turns 46-50:  Documentation — summarize the entire journey accurately

Critical: Turns 46-50 are the key quality gate. If the model can accurately summarize what happened in turns 1-10, the KV cache compression hasn't degraded long-range attention.

Metrics to Capture Per Turn

metrics = {
    "turn": N,
    "tokens_in_kv_cache": int,       # Growing with each turn
    "response_coherent": bool,        # Human judgment
    "references_prior_context": bool, # Does the response correctly reference earlier turns?
    "factual_errors": int,            # Count of incorrect references to prior state
    "response_time_ms": int,          # Watch for slowdowns as cache grows
    "kv_cache_memory_mb": float,      # Memory pressure tracking
}

Degradation Detection Thresholds

Based on the spec's kill criteria:

  • Coherence drift: If factual_errors > 0 before turn 30 → immediate red flag
  • Gradual degradation: If factual_errors increases monotonically after turn 30 → quantization drift confirmed
  • Attention concentration: If responses become shorter/repetitive after turn 30+ → softmax concentration issue
  • Memory failure: If at turn 46+ the model can't recall specifics from turns 1-10 → long-range attention degraded

Comparison Protocol

Must run the exact same conversation on both configurations:

Config KV Cache Type Expected Memory Expected Quality
Baseline FP16 KV ~24GB at 50 turns Perfect recall
TurboQuant turbo4 KV ~6.5GB at 50 turns Should match baseline

The conversation must be deterministic — use temperature=0 and seed=42 (or equivalent) for both runs. Any difference in output is attributable to KV cache compression.

Blocker Analysis

This issue is blocked by #10 (Custom Ollama build). Current status per the FULL-REPORT.md:

"Custom Ollama build is deferred (multi-day effort), but the fork's llama-server is a ready drop-in."

Alternative unblocking path: If the llama.cpp fork's llama-server is operational, we can run this test directly against llama-server without waiting for the full Ollama integration. The llama-server provides an OpenAI-compatible API endpoint.

# TurboQuant config
./llama-server -m qwen3.5-27b.gguf --kv-type turbo4 --ctx-size 65536 --port 8081

# Baseline config
./llama-server -m qwen3.5-27b.gguf --kv-type f16 --ctx-size 65536 --port 8082

This bypasses the Ollama dependency entirely.

Tooling Gap

The existing benchmarks/run_benchmarks.py only handles single-prompt benchmarks. A multi-turn benchmark runner is needed:

# Pseudocode for what's needed
class MultiTurnBenchmark:
    def __init__(self, endpoint, model):
        self.history = []
        self.metrics = []
    
    def turn(self, user_msg):
        self.history.append({"role": "user", "content": user_msg})
        response = api_call(self.history)  # Send full history
        self.history.append({"role": "assistant", "content": response})
        self.metrics.append(self.capture_metrics(response))
        return response
    
    def run_script(self, turns: list[str]):
        for msg in turns:
            self.turn(msg)
        return self.metrics

Acceptance Criteria Assessment

Criteria Status Blocker
50-turn conversation run on both configs Not started Blocked by #10 / needs llama-server alternative
Quality comparison documented Not started Depends on above
Degradation flagged with turn number Not started Depends on above

Recommendations

  1. Unblock via llama-server — Don't wait for full Ollama build. The fork's llama-server is the fast path.
  2. Build the multi-turn benchmark runner — Extend run_benchmarks.py or create run_multiturn.py
  3. Script the 50 turns in advance — Save as test-prompts/50-turn-code-session.json so it's reproducible
  4. Automate metrics capture — Parse responses for factual consistency against a ground truth document
  5. This is a Phase 2 critical path item — should be prioritized once #10 (or llama-server alternative) is resolved

This is where the wolf smells blood. Short benchmarks lie. 50 turns is where KV cache compression shows its true teeth — or bares its rot. 🐺

## 🐺 Fenrir — Deep Technical Analysis (Burn Night) ### Issue Assessment: Long-Session Quality Test — 50-Turn Conversation **Classification:** Phase 2 benchmark — critical quality gate **Labels:** `benchmark`, `owner:cid`, `phase-2` **Dependencies:** #10 (Ollama deploy) — **BLOCKING** **Blocked by:** Custom Ollama build with TurboQuant KV type not yet deployed --- ### Why This Test Matters — The Deepest Cut This is arguably the **most important quality test in the entire TurboQuant initiative.** Here's why: PPL (perplexity) tests like WikiText-2 measure average quality across short contexts. Needle-in-haystack tests check retrieval at specific context lengths. But neither catches the failure mode that matters most for production use: **Cumulative quantization error drift during extended sessions.** When TurboQuant compresses KV cache entries, each quantization introduces a small error. Over 50 turns of conversation, the model is reading from hundreds of quantized KV cache entries simultaneously. The errors can: 1. **Compound non-linearly** — attention score computation involves softmax over all cached keys. Small per-key errors accumulate into attention distribution shifts. 2. **Create attention collapse** — if quantization noise shifts attention scores enough, the softmax concentrates on wrong tokens (the "concentration" effect mentioned in the issue). 3. **Manifest as context hallucination** — the model "remembers" something from earlier turns that it actually garbled due to KV cache quantization noise. The spec identifies this correctly: *"Codebook boundary errors accumulating over many KV cache writes."* ### Technical Architecture of the Test #### Test Design: Code Development Conversation The issue specifies: `code gen → debug → refactor → test → iterate` This is an excellent choice because it requires: - **Precise factual recall** (variable names, function signatures from 20+ turns ago) - **Logical consistency** (the refactored code must be consistent with the original spec) - **Multi-hop reasoning** (debugging requires remembering what the code does AND what the bug report says) - **Cumulative state tracking** (each iteration builds on the last) #### Recommended Conversation Script ``` Turns 1-5: Spec a non-trivial system (e.g., "distributed rate limiter with Redis") Turns 6-15: Implement module by module, building on previous code Turns 16-20: Introduce bugs, ask model to find them (tests recall of implementation) Turns 21-30: Refactor — extract interfaces, add error handling Turns 31-40: Write tests — must reference correct function signatures from turns 6-15 Turns 41-45: Performance optimization — requires understanding the full codebase Turns 46-50: Documentation — summarize the entire journey accurately ``` **Critical:** Turns 46-50 are the key quality gate. If the model can accurately summarize what happened in turns 1-10, the KV cache compression hasn't degraded long-range attention. #### Metrics to Capture Per Turn ```python metrics = { "turn": N, "tokens_in_kv_cache": int, # Growing with each turn "response_coherent": bool, # Human judgment "references_prior_context": bool, # Does the response correctly reference earlier turns? "factual_errors": int, # Count of incorrect references to prior state "response_time_ms": int, # Watch for slowdowns as cache grows "kv_cache_memory_mb": float, # Memory pressure tracking } ``` #### Degradation Detection Thresholds Based on the spec's kill criteria: - **Coherence drift:** If `factual_errors > 0` before turn 30 → immediate red flag - **Gradual degradation:** If `factual_errors` increases monotonically after turn 30 → quantization drift confirmed - **Attention concentration:** If responses become shorter/repetitive after turn 30+ → softmax concentration issue - **Memory failure:** If at turn 46+ the model can't recall specifics from turns 1-10 → long-range attention degraded ### Comparison Protocol Must run the **exact same conversation** on both configurations: | Config | KV Cache Type | Expected Memory | Expected Quality | |--------|--------------|-----------------|-----------------| | Baseline | FP16 KV | ~24GB at 50 turns | Perfect recall | | TurboQuant | turbo4 KV | ~6.5GB at 50 turns | Should match baseline | **The conversation must be deterministic** — use `temperature=0` and `seed=42` (or equivalent) for both runs. Any difference in output is attributable to KV cache compression. ### Blocker Analysis **This issue is blocked by #10 (Custom Ollama build).** Current status per the FULL-REPORT.md: > "Custom Ollama build is deferred (multi-day effort), but the fork's llama-server is a ready drop-in." **Alternative unblocking path:** If the llama.cpp fork's `llama-server` is operational, we can run this test directly against `llama-server` without waiting for the full Ollama integration. The `llama-server` provides an OpenAI-compatible API endpoint. ```bash # TurboQuant config ./llama-server -m qwen3.5-27b.gguf --kv-type turbo4 --ctx-size 65536 --port 8081 # Baseline config ./llama-server -m qwen3.5-27b.gguf --kv-type f16 --ctx-size 65536 --port 8082 ``` This bypasses the Ollama dependency entirely. ### Tooling Gap The existing `benchmarks/run_benchmarks.py` only handles single-prompt benchmarks. A **multi-turn benchmark runner** is needed: ```python # Pseudocode for what's needed class MultiTurnBenchmark: def __init__(self, endpoint, model): self.history = [] self.metrics = [] def turn(self, user_msg): self.history.append({"role": "user", "content": user_msg}) response = api_call(self.history) # Send full history self.history.append({"role": "assistant", "content": response}) self.metrics.append(self.capture_metrics(response)) return response def run_script(self, turns: list[str]): for msg in turns: self.turn(msg) return self.metrics ``` ### Acceptance Criteria Assessment | Criteria | Status | Blocker | |----------|--------|---------| | 50-turn conversation run on both configs | ❌ Not started | Blocked by #10 / needs llama-server alternative | | Quality comparison documented | ❌ Not started | Depends on above | | Degradation flagged with turn number | ❌ Not started | Depends on above | ### Recommendations 1. **Unblock via llama-server** — Don't wait for full Ollama build. The fork's llama-server is the fast path. 2. **Build the multi-turn benchmark runner** — Extend `run_benchmarks.py` or create `run_multiturn.py` 3. **Script the 50 turns in advance** — Save as `test-prompts/50-turn-code-session.json` so it's reproducible 4. **Automate metrics capture** — Parse responses for factual consistency against a ground truth document 5. **This is a Phase 2 critical path item** — should be prioritized once #10 (or llama-server alternative) is resolved --- *This is where the wolf smells blood. Short benchmarks lie. 50 turns is where KV cache compression shows its true teeth — or bares its rot.* 🐺
groq was assigned by Timmy 2026-04-05 00:15:03 +00:00
groq was unassigned by allegro 2026-04-05 11:58:15 +00:00
ezra was assigned by allegro 2026-04-05 11:58:15 +00:00
Author
Owner

Triaged during backlog cleanup — priority confirmed. Needs owner assignment.

Triaged during backlog cleanup — priority confirmed. Needs owner assignment.
Sign in to join this conversation.