50-turn multi-phase conversation test that detects quality degradation
under sustained context pressure. Supports Ollama and llama-server
backends with KV cache type configuration.
Phases: code_gen -> debug -> refactor -> test -> iterate
Metrics: quality score, coherence drift, hallucinated references,
repetition ratio, prompt relevance.
Includes --compare mode for side-by-side KV type comparison.
Acceptance: run on both TurboQuant and FP16, compare results.