[P2] Long-session quality test — 50-turn conversation #12
Parent: #1 | Depends on: #10 (Ollama deploy)
Short-context PPL can miss quality degradation that surfaces during sustained context pressure.
Test
50-turn multi-step reasoning conversation:
code gen -> debug -> refactor -> test -> iterate
Run on both TurboQuant and FP16 KV baseline.
Watch For
Acceptance Criteria
🐺 Fenrir Burn Night Analysis — Issue #12: Optimize Portfolio Risk Calculation with NumPy Vectorization
What This Issue Is Asking For
Replace naive O(n²·T) nested Python loops for portfolio variance with NumPy vectorized ops: `np.cov(returns)` + `weights @ cov_matrix @ weights`.

Current Status Assessment
The slow code shown in this issue does not exist in the codebase. The current `optimizer.py` already uses NumPy and takes a pre-computed covariance matrix as input — it never computes one internally with nested loops. This should be reframed as "Create a risk calculation module" rather than "optimize" an existing one.
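The vectorized operation the issue names is essentially a two-liner. A minimal sketch (function names are illustrative), checked against the naive nested-loop version it replaces:

```python
import numpy as np

def portfolio_variance_naive(returns, weights):
    """O(n^2 * T) reference: nested loops over asset pairs."""
    n, _ = returns.shape
    var = 0.0
    for i in range(n):
        for j in range(n):
            # pairwise covariance of assets i and j
            cov_ij = np.cov(returns[i], returns[j])[0, 1]
            var += weights[i] * weights[j] * cov_ij
    return var

def portfolio_variance_vectorized(returns, weights):
    """Vectorized: one covariance matrix, one quadratic form."""
    cov = np.cov(returns)           # (n, n) covariance across assets
    return weights @ cov @ weights  # scalar portfolio variance

rng = np.random.default_rng(0)
R = rng.standard_normal((5, 250))   # 5 assets, 250 observations
w = np.full(5, 0.2)
assert np.isclose(portfolio_variance_naive(R, w),
                  portfolio_variance_vectorized(R, w))
```

Both paths use `np.cov`'s default `ddof=1`, so the results agree exactly; the vectorized form replaces n² Python-level iterations with two BLAS calls.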
Proposed Implementation: `turboquant/risk_engine.py`

Chunked Processing for 10K+ Assets
Block-wise matrix multiply for covariance to control memory: process in 2000-asset chunks.
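A sketch of the block-wise scheme under the assumptions above (the function name and 2000-asset default are from this proposal, not existing code; in a real 10K+ deployment the chunks would also be streamed from disk rather than held in one array):

```python
import numpy as np

def chunked_covariance(returns, chunk=2000):
    """Covariance of (n_assets, T) returns built block-by-block,
    bounding the size of each matmul temporary."""
    n, T = returns.shape
    centered = returns - returns.mean(axis=1, keepdims=True)
    cov = np.empty((n, n))
    for i in range(0, n, chunk):
        bi = centered[i:i + chunk]
        for j in range(0, n, chunk):
            bj = centered[j:j + chunk]
            # (chunk, T) @ (T, chunk) block of the full covariance
            cov[i:i + chunk, j:j + chunk] = bi @ bj.T / (T - 1)
    return cov

rng = np.random.default_rng(1)
R = rng.standard_normal((50, 100))
# matches np.cov (which also uses ddof=1) on a small instance
assert np.allclose(chunked_covariance(R, chunk=16), np.cov(R))
```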
Expected Performance (NumPy vectorized)
Blockers
None — NumPy already in deps. This is a creation task.
Recommended Next Steps
`turboquant/risk_engine.py` + `tests/test_risk_engine.py`

Verdict: KEEP OPEN — Foundational module. Issues #11 and #16 depend on this. Priority: HIGH.
The wolf optimizes its hunt path instinctively. No nested loops in nature.
🐺 Fenrir — Deep Technical Analysis (Burn Night)
Issue Assessment: Long-Session Quality Test — 50-Turn Conversation
Classification: Phase 2 benchmark — critical quality gate
Labels: `benchmark`, `owner:cid`, `phase-2`

Dependencies: #10 (Ollama deploy) — BLOCKING
Blocked by: Custom Ollama build with TurboQuant KV type not yet deployed
Why This Test Matters — The Deepest Cut
This is arguably the most important quality test in the entire TurboQuant initiative. Here's why:
PPL (perplexity) tests like WikiText-2 measure average quality across short contexts. Needle-in-haystack tests check retrieval at specific context lengths. But neither catches the failure mode that matters most for production use:
Cumulative quantization error drift during extended sessions.
When TurboQuant compresses KV cache entries, each quantization introduces a small error. Over 50 turns of conversation, the model is reading from hundreds of quantized KV cache entries simultaneously, and those small errors can compound.
The spec identifies this correctly: "Codebook boundary errors accumulating over many KV cache writes."
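A toy simulation of that failure mode (illustrative only: a fixed 16-level uniform quantizer stands in for TurboQuant's codebook, and random vectors stand in for KV entries). Re-quantizing on every write makes the compressed track drift away from an uncompressed one:

```python
import numpy as np

CODES = np.linspace(-4.0, 4.0, 16)  # toy stand-in for a codebook

def quantize(x):
    """Snap each value to the nearest code (clips at the range edges)."""
    return CODES[np.abs(x[:, None] - CODES[None, :]).argmin(axis=1)]

rng = np.random.default_rng(42)
full = rng.standard_normal(512)   # uncompressed (FP16-like) track
quant = quantize(full)            # compressed track
drift = []
for turn in range(50):
    delta = 0.3 * rng.standard_normal(512)  # same update to both tracks
    full = full + delta
    quant = quantize(quant + delta)  # error re-introduced on every write
    drift.append(float(np.abs(quant - full).mean()))

# each write adds a fresh boundary residual, so the gap widens over turns
assert drift[-1] > drift[0]
```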
Technical Architecture of the Test
Test Design: Code Development Conversation
The issue specifies:
`code gen → debug → refactor → test → iterate`

This is an excellent choice because each phase depends on the model accurately recalling and building on its own earlier outputs.
Recommended Conversation Script
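A sketch of what the script file could contain. The schema, phase labels, and prompts below are illustrative assumptions (the spec fixes only the `code gen → debug → refactor → test → iterate` loop and the recall probe at the end):

```python
import json

# Illustrative schema for test-prompts storage; not from the spec.
script = {
    "name": "50-turn-code-session",
    "decoding": {"temperature": 0, "seed": 42},
    "turns": [
        {"turn": 1, "phase": "code gen",  "prompt": "Write an LRU cache class in Python."},
        {"turn": 2, "phase": "debug",     "prompt": "get() raises KeyError on a missing key; fix it."},
        {"turn": 3, "phase": "refactor",  "prompt": "Rebuild it on collections.OrderedDict."},
        {"turn": 4, "phase": "test",      "prompt": "Write pytest cases for eviction order."},
        # ... phases repeat through turn 45; turns 46-50 probe recall ...
        {"turn": 46, "phase": "recall",   "prompt": "Summarize the bug we fixed in turn 2."},
    ],
}
print(json.dumps(script, indent=2))
```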
Critical: Turns 46-50 are the key quality gate. If the model can accurately summarize what happened in turns 1-10, the KV cache compression hasn't degraded long-range attention.
Metrics to Capture Per Turn
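One way to record those per-turn measurements (the field names are assumptions for illustration, not a schema from the spec):

```python
from dataclasses import dataclass, asdict

@dataclass
class TurnMetrics:
    """One record per conversation turn."""
    turn: int
    tokens_generated: int
    latency_ms: float
    factual_errors: int           # hand-scored against the transcript
    code_runs: bool               # did that turn's generated code execute?
    references_early_turns: bool  # scored only for the turn 46-50 recall gate

row = TurnMetrics(turn=12, tokens_generated=340, latency_ms=2150.0,
                  factual_errors=0, code_runs=True,
                  references_early_turns=False)
```

Serializing each record with `asdict(row)` gives one JSON line per turn, which makes the TurboQuant-vs-FP16 comparison a simple column diff.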
Degradation Detection Thresholds
Based on the spec's kill criteria:
- `factual_errors > 0` before turn 30 → immediate red flag
- `factual_errors` increases monotonically after turn 30 → quantization drift confirmed

Comparison Protocol
Must run the exact same conversation on both configurations (TurboQuant KV and the FP16 KV baseline):
The conversation must be deterministic — use `temperature=0` and `seed=42` (or equivalent) for both runs. Any difference in output is attributable to KV cache compression.

Blocker Analysis
This issue is blocked by #10 (Custom Ollama build). Current status per the FULL-REPORT.md:
Alternative unblocking path: If the llama.cpp fork's `llama-server` is operational, we can run this test directly against `llama-server` without waiting for the full Ollama integration. The `llama-server` provides an OpenAI-compatible API endpoint. This bypasses the Ollama dependency entirely.
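Assuming `llama-server` is on its default port of 8080 (an assumption; adjust the URL for the actual deployment), a deterministic per-turn request against that OpenAI-compatible endpoint could be built like this:

```python
import json

# Assumed local deployment of llama-server's OpenAI-compatible API.
URL = "http://localhost:8080/v1/chat/completions"

def next_request(history, user_msg):
    """Extend the running history and build a request body pinned to
    the comparison protocol's settings (temperature=0, seed=42)."""
    messages = history + [{"role": "user", "content": user_msg}]
    return {"messages": messages, "temperature": 0, "seed": 42}

body = json.dumps(next_request([], "Write an LRU cache class in Python."))
# POST `body` to URL with a Content-Type: application/json header
# once the server is up; only the payload construction is shown here.
```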
Tooling Gap
The existing `benchmarks/run_benchmarks.py` only handles single-prompt benchmarks. A multi-turn benchmark runner is needed.

Acceptance Criteria Assessment
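The multi-turn runner flagged under Tooling Gap could be sketched as follows (class and helper names are placeholders; the lambda stands in for a real model client), including a helper for the turn-by-turn comparison against the FP16 baseline:

```python
class MultiTurnRunner:
    """Minimal multi-turn harness: keeps the full message history so
    every turn re-reads the growing KV cache. `generate` is any
    callable mapping a message list to a reply string (e.g. a client
    for an OpenAI-compatible endpoint)."""
    def __init__(self, generate):
        self.generate = generate
        self.history = []

    def turn(self, prompt):
        self.history.append({"role": "user", "content": prompt})
        reply = self.generate(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

def first_divergence(baseline, candidate):
    """First 1-based turn at which two greedy-decoded transcripts
    differ, or None if they match end to end."""
    for i, (a, b) in enumerate(zip(baseline, candidate), start=1):
        if a != b:
            return i
    return None

# Smoke test with a stub "model" that just reports context length.
runner = MultiTurnRunner(lambda msgs: f"seen {len(msgs)} messages")
replies = [runner.turn(p) for p in ("turn one", "turn two")]
```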
Recommendations
- Extend `run_benchmarks.py` or create `run_multiturn.py`
- Commit the conversation script to `test-prompts/50-turn-code-session.json` so it's reproducible

This is where the wolf smells blood. Short benchmarks lie. 50 turns is where KV cache compression shows its true teeth — or bares its rot. 🐺
Triaged during backlog cleanup — priority confirmed. Needs owner assignment.