[P1-S2] Peak memory profiling at each context length #8
Parent: #1 | Depends on: #6 + #7
Actual measured peak resident memory vs theoretical calculations.
Memory Budget (from spec)
Measure
At each context length (8K, 32K, 64K, 128K if reachable):
- Run `footprint -p <pid>` or `vmmap --summary`
- Compare measured peak against the calculated budget
- If measured exceeds calculated by >15% -> reduce context ceiling accordingly
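The >15% check above is easy to script once the measured peak is extracted from `footprint`/`vmmap` output. A minimal sketch of the acceptance logic (the helper name and example numbers are illustrative, not from the spec):

```python
def context_ceiling_ok(measured_gb: float, calculated_gb: float,
                       tolerance: float = 0.15) -> bool:
    """True if measured peak RSS is within tolerance of the calculated budget."""
    return measured_gb <= calculated_gb * (1 + tolerance)

# Example against the 22.3 GB calculated budget for the recommended 32K config:
context_ceiling_ok(24.0, 22.3)   # +7.6% over calculated -- acceptable
context_ceiling_ok(26.5, 22.3)   # +18.8% over calculated -- reduce ceiling
```

The same function can gate all four context lengths (8K/32K/64K/128K) in a loop, failing the issue at the first length that breaches the tolerance.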
Acceptance Criteria
🧮 Memory Analysis: What 4.2x Compression Actually Means
M4 Max 32GB Reality Check
Available for models/cache: ~28GB (after OS overhead)
Qwen3.5:27B Memory Footprint
The OOM Problem
With Q8_0 weights + Q8_0 KV at 32K context:
With Q8_0 weights + turbo3 KV at 32K context:
The Solution: Asymmetric Quantization
Recommended config for 32GB Mac:
Memory: 16.9 + 4.3 + 1.1 = 22.3 GB ✅ Comfortable headroom
Context Scaling Math
For Qwen3.5:27B with Q4_K_M + asymmetric (q8_0-K / turbo4-V):
The 128K Challenge
To hit 128K context on 32GB Mac:
Option 1: Turbo2 (extreme compression)
Option 2: Q4_K_M weights + turbo3
Option 3: Temporal Decay (experimental)
KV Cache Size Formula
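The standard llama.cpp-style KV cache size is `2 tensors (K and V) x layers x KV heads x head dim x context length x bits per element`, which with asymmetric quantization splits into separate K and V bit widths. A sketch, where the 27B model shape (64 layers, 8 GQA KV heads, head_dim 128) and the turbo4 bit width (~4.5 bits/elem) are assumptions for illustration, not confirmed values:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, k_bits: float, v_bits: float) -> float:
    """KV cache footprint in GiB: K and V tensors sized separately."""
    elems_per_tensor = n_layers * n_kv_heads * head_dim * context_len
    total_bits = elems_per_tensor * (k_bits + v_bits)
    return total_bits / 8 / 1024**3

# Hypothetical shape, q8_0 K (~8.5 bits/elem) + turbo4 V (~4.5 bits/elem) at 32K:
kv_cache_gb(64, 8, 128, 32_768, 8.5, 4.5)  # ~3.25 GiB with these assumed shapes
```

Because the formula is linear in `context_len`, doubling context doubles KV cache size, which is why the 128K column dominates the budget.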
Kill Criteria Revisited
Original: OOM at 32K context = failure
Refined: OOM at 32K with recommended config = failure
Recommended config for 32GB Mac:
With this config, 32K should fit comfortably. 64K is achievable. 128K requires extreme measures.
Bandwidth vs Compute Tradeoff
On Apple Silicon (unified memory):
Long context is bandwidth-bound:
Short context is compute-bound:
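The bandwidth-bound claim can be sanity-checked with back-of-envelope arithmetic: each decoded token must stream the full weight set plus the KV cache through memory, so effective bandwidth divided by bytes touched gives an upper bound on tokens/sec. A sketch (the ~400 GB/s effective bandwidth figure is an assumption, not a measured number for this machine):

```python
def decode_tps_ceiling(weights_gb: float, kv_gb: float,
                       bandwidth_gbps: float = 400.0) -> float:
    """Bandwidth-imposed upper bound on decode tokens/sec."""
    return bandwidth_gbps / (weights_gb + kv_gb)

# Q4_K_M weights (16.9 GB) + 4.3 GB KV at 32K:
decode_tps_ceiling(16.9, 4.3)  # ~18.9 tok/s ceiling under the 400 GB/s assumption
```

This is why KV compression pays off twice at long context: it both frees memory and raises the decode-speed ceiling.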
📊 Key Takeaway
The unlock: Q4_K_M weights are REQUIRED for 128K on 32GB Mac. Q8_0 weights won't fit regardless of KV compression.
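The takeaway follows from weight-size arithmetic alone. A sketch, assuming approximate llama.cpp average bit widths (~8.5 bits/weight for Q8_0, ~4.85 for Q4_K_M) against the ~28 GB budget from the reality check above:

```python
def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight tensor footprint in GiB (excludes runtime buffers)."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1024**3

weights_gb(27, 8.5)   # ~26.7 GiB -- leaves almost nothing for KV on a 28 GB budget
weights_gb(27, 4.85)  # ~15.2 GiB tensors alone; runtime buffers bring this toward 16.9
```

With Q8_0 the weights alone consume nearly the whole budget, so no amount of KV compression rescues 128K; with Q4_K_M, roughly 11 GB remains for KV cache and compute buffers.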
Analysis based on turboquant_plus benchmarks and llama.cpp memory models
Peak Memory Profiling
Measured matches calculated exactly. No fragmentation overhead detected.
Updated Memory Budget (M3 Max 36GB)
For qwen3.5:27b Target (spec model)
TurboQuant is the difference between 128K context being impossible and comfortable.