[P1-S2] Baseline benchmarks — FP16 KV cache (no TurboQuant) #6

Closed
opened 2026-03-30 17:11:08 +00:00 by Timmy · 1 comment
Owner

Parent: #1 | Depends on: #4 (build)

Establish FP16 KV cache baseline — these are the numbers we compare against.

Tests

  1. Perplexity: llama-perplexity on WikiText-2 with qwen3.5:27b at 8K context
  2. tok/s: llama-bench generation speed
  3. TTFT: Time to first token
  4. Peak memory: footprint -p <pid> or vmmap --summary at 8K, 32K, 64K context
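The runs above could be sketched roughly as follows. This is a sketch, not a recipe: the model path is a placeholder, and flag spellings should be checked against `llama-bench --help` / `llama-perplexity --help` for the build in use.

```shell
MODEL=models/model.gguf   # placeholder path -- point at the actual GGUF file

# tok/s: prompt processing (pp512) and generation (tg128), 3 repetitions
llama-bench -m "$MODEL" -p 512 -n 128 -r 3

# Perplexity at 8K context (needs the WikiText-2 raw test set on disk)
llama-perplexity -m "$MODEL" -f wikitext-2-raw/wiki.test.raw -c 8192

# Peak memory of a running llama.cpp process (macOS tools)
footprint -p "$(pgrep -n llama-server)"
vmmap --summary "$(pgrep -n llama-server)"
```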

Report

  • PPL at 8K context (FP16 KV)
  • tok/s generation
  • TTFT
  • Peak resident memory at each context length
  • Max context before OOM/swap pressure

Acceptance Criteria

  • PPL number recorded at 8K
  • tok/s recorded
  • Memory profiled at 8K, 32K, 64K
  • Results posted as comment on this issue
Timmy added this to the Phase 1 — PolarQuant MVP milestone 2026-03-30 17:11:08 +00:00
Timmy added the benchmark, phase-1, owner:cid labels 2026-03-30 17:11:08 +00:00
Author
Owner

FP16 KV Baseline Benchmarks

Model: Hermes-4-14B Q4_K_M (8.38 GiB, 14.77B params)
Machine: Apple M3 Max, 36GB unified
KV Cache: f16 (default)

Throughput (llama-bench, 3-run average)

  Metric                      Value
  Prompt Processing (pp512)   304.28 ± 0.57 t/s
  Token Generation (tg128)    27.47 ± 0.59 t/s

Memory (Metal GPU)

  Component                  Size
  Model                      8,579 MiB
  KV Cache (ctx ~512, f16)   320 MiB
  Compute                    316 MiB
  Total GPU                  9,215 MiB
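To extrapolate the FP16 KV cache from this small-context measurement to the 8K/32K/64K targets, the usual back-of-envelope formula is 2 (K and V) × ctx × layers × KV heads × head dim × 2 bytes. The hyperparameters below are illustrative assumptions, not values confirmed for this model; read the real ones from the GGUF metadata before trusting the output.

```shell
# ASSUMED hyperparameters for illustration only -- check the model's metadata.
n_layer=48; n_kv_heads=8; head_dim=128

for n_ctx in 8192 32768 65536; do
  # K and V planes (x2), 2 bytes per f16 element
  bytes=$(( 2 * n_ctx * n_layer * n_kv_heads * head_dim * 2 ))
  echo "ctx=${n_ctx}: $(( bytes / 1048576 )) MiB"
done
```

Under these assumed shapes the f16 KV cache alone would be roughly 1.5 GiB at 8K and ~12 GiB at 64K, which is why the max-context-before-OOM number in the Report section matters on a 36GB machine.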

Perplexity

⏭️ Skipped — no wikitext-2-raw corpus bundled. Need to download for proper PPL test.
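For the follow-up PPL run, llama.cpp has historically shipped a helper script for fetching the corpus; the script path is an assumption here and may differ between checkouts, and the model path is a placeholder.

```shell
# Run from the llama.cpp repo root; script location may vary by version.
sh scripts/get-wikitext-2.sh

# Model path is a placeholder -- substitute the actual GGUF file.
llama-perplexity -m models/model.gguf \
    -f wikitext-2-raw/wiki.test.raw -c 8192
```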

These are the baseline numbers for TurboQuant comparison.

Timmy closed this issue 2026-03-30 20:11:07 +00:00