[P1-S2] Baseline benchmarks — FP16 KV cache (no TurboQuant) #6

Closed
opened 2026-03-30 17:11:08 +00:00 by Timmy · 1 comment
Owner

Parent: #1 | Depends on: #4 (build)

Establish FP16 KV cache baseline — these are the numbers we compare against.

Tests

  1. Perplexity: llama-perplexity on WikiText-2 with qwen3.5:27b at 8K context
  2. tok/s: llama-bench generation speed
  3. TTFT: Time to first token
  4. Peak memory: footprint -p <pid> or vmmap --summary at 8K, 32K, 64K context
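The runs above could be sketched roughly as follows. This is a sketch, not a recipe: the model path is a placeholder, and flag spellings should be checked against `llama-bench --help` / `llama-perplexity --help` for the build in use.

```shell
MODEL=models/model.gguf   # placeholder path -- point at the actual GGUF file

# tok/s: prompt processing (pp512) and generation (tg128), 3 repetitions
llama-bench -m "$MODEL" -p 512 -n 128 -r 3

# Perplexity at 8K context (needs the WikiText-2 raw test set on disk)
llama-perplexity -m "$MODEL" -f wikitext-2-raw/wiki.test.raw -c 8192

# Peak memory of a running llama.cpp process (macOS tools)
footprint -p "$(pgrep -n llama-server)"
vmmap --summary "$(pgrep -n llama-server)"
```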

Report

  • PPL at 8K context (FP16 KV)
  • tok/s generation
  • TTFT
  • Peak resident memory at each context length
  • Max context before OOM/swap pressure

Acceptance Criteria

  • PPL number recorded at 8K
  • tok/s recorded
  • Memory profiled at 8K, 32K, 64K
  • Results posted as comment on this issue
Timmy added this to the Phase 1 — PolarQuant MVP milestone 2026-03-30 17:11:08 +00:00
Timmy added the benchmark, phase-1, owner:cid labels 2026-03-30 17:11:08 +00:00
Author
Owner

FP16 KV Baseline Benchmarks

Model: Hermes-4-14B Q4_K_M (8.38 GiB, 14.77B params)
Machine: Apple M3 Max, 36GB unified
KV Cache: f16 (default)

Throughput (llama-bench, 3-run average)

  Metric                      Value
  Prompt Processing (pp512)   304.28 ± 0.57 t/s
  Token Generation (tg128)    27.47 ± 0.59 t/s

Memory (Metal GPU)

  Component                  Size
  Model                      8,579 MiB
  KV Cache (ctx ~512, f16)   320 MiB
  Compute                    316 MiB
  Total GPU                  9,215 MiB
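To extrapolate the FP16 KV cache from this small-context measurement to the 8K/32K/64K targets, the usual back-of-envelope formula is 2 (K and V) × ctx × layers × KV heads × head dim × 2 bytes. The hyperparameters below are illustrative assumptions, not values confirmed for this model; read the real ones from the GGUF metadata before trusting the output.

```shell
# ASSUMED hyperparameters for illustration only -- check the model's metadata.
n_layer=48; n_kv_heads=8; head_dim=128

for n_ctx in 8192 32768 65536; do
  # K and V planes (x2), 2 bytes per f16 element
  bytes=$(( 2 * n_ctx * n_layer * n_kv_heads * head_dim * 2 ))
  echo "ctx=${n_ctx}: $(( bytes / 1048576 )) MiB"
done
```

Under these assumed shapes the f16 KV cache alone would be roughly 1.5 GiB at 8K and ~12 GiB at 64K, which is why the max-context-before-OOM number in the Report section matters on a 36GB machine.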

Perplexity

⏭️ Skipped — no wikitext-2-raw corpus bundled. Need to download for proper PPL test.
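For the follow-up PPL run, llama.cpp has historically shipped a helper script for fetching the corpus; the script path is an assumption here and may differ between checkouts, and the model path is a placeholder.

```shell
# Run from the llama.cpp repo root; script location may vary by version.
sh scripts/get-wikitext-2.sh

# Model path is a placeholder -- substitute the actual GGUF file.
llama-perplexity -m models/model.gguf \
    -f wikitext-2-raw/wiki.test.raw -c 8192
```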

These are the baseline numbers for TurboQuant comparison.

Timmy closed this issue 2026-03-30 20:11:07 +00:00