[P2-5] Download qwen3.5:27b and benchmark turbo4 at 64K/128K context #25

Closed
opened 2026-03-31 04:34:06 +00:00 by Timmy · 1 comment
Owner

Parent: #1 | Depends on: P2-3 (server working), P2-1 (quality validated on 14B)

Why

Hermes-4-14B was only the test model. The spec target is qwen3.5:27b at 128K context. We need to prove it actually fits in memory and runs.

Steps

  1. Download the model:
```bash
# Via Ollama (if custom build works):
ollama-turbo pull qwen3.5:27b

# Or direct GGUF download:
huggingface-cli download Qwen/Qwen3.5-27B-Q4_K_M-GGUF --local-dir ~/models/qwen3.5-27b/
```
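Before benchmarking, it can be worth confirming the GGUF actually landed on disk. A minimal sketch (the directory matches the download command above; the helper name is ours, not part of any tool):

```python
from pathlib import Path

def gguf_files(model_dir: str) -> dict:
    """Map each downloaded *.gguf file to its size in GB -- a quick sanity
    check that the pull completed before pointing llama-server at it."""
    d = Path(model_dir).expanduser()
    return {p.name: round(p.stat().st_size / 1e9, 2)
            for p in sorted(d.glob("*.gguf"))}
```

A 27B Q4_K_M file should be on the order of tens of GB; an empty dict or a tiny file means the download failed partway.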
  2. Benchmark at 64K context:
```bash
./llama-server -m ~/models/qwen3.5-27b/*.gguf \
  --port 8081 -np 1 -c 65536 --kv-type turbo4 --jinja
```

Send test prompts. Record: memory usage, tok/s, response quality.
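One way to record tok/s is to time a chat completion against the server. This is a sketch under the assumption that the custom build exposes llama-server's standard OpenAI-compatible `/v1/chat/completions` route on port 8081; the response field names should be verified against the actual build before relying on the numbers:

```python
import json
import time
import urllib.request

def tok_per_sec(n_tokens: int, elapsed_s: float) -> float:
    """Completion tokens divided by wall-clock generation time."""
    return n_tokens / elapsed_s

def bench_prompt(prompt: str, base_url: str = "http://localhost:8081") -> float:
    """Send one chat completion and return measured tok/s.
    Assumes an OpenAI-compatible endpoint; adjust the route if the
    custom build differs."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    resp = json.load(urllib.request.urlopen(req))
    elapsed = time.monotonic() - t0
    return tok_per_sec(resp["usage"]["completion_tokens"], elapsed)
```

Wall-clock timing includes prompt processing, so for long prompts the server's own timing output (if exposed) is the better number for generation speed.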

  3. Benchmark at 128K context:
```bash
./llama-server -m ~/models/qwen3.5-27b/*.gguf \
  --port 8081 -np 1 -c 131072 --kv-type turbo4 --jinja
```

Record the same metrics. Check: does it OOM?
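A short prompt won't actually exercise a 128K KV cache. A hedged sketch of a filler-plus-needle prompt builder (the ~4 characters/token ratio is a rough heuristic, not the model's real tokenizer; the "needle" wording is ours):

```python
def long_context_prompt(target_tokens: int,
                        needle: str = "The code word is turbo4.") -> str:
    """Build a prompt of roughly target_tokens (~4 chars/token) with a
    'needle' fact buried mid-context, followed by a retrieval question,
    so coherence at depth can be checked, not just that the server
    stays up."""
    filler = "The quick brown fox jumps over the lazy dog. "
    chars = target_tokens * 4
    body = (filler * (chars // len(filler) + 1))[:chars]
    mid = len(body) // 2
    return body[:mid] + needle + body[mid:] + "\nQ: What is the code word?"
```

Sending `long_context_prompt(120000)` fills most of the 131072-token window; a coherent answer to the retrieval question is a stronger pass than a non-crashing server.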

  4. Memory validation:
```bash
# While server is running with 128K context:
memory_pressure  # macOS
top -l 1 | grep PhysMem
```

Expected from the Phase 1 report: ~23.4 GB total, leaving ~7.6 GB headroom within the 31 GB budget.
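The 15% tolerance in the acceptance criteria can be checked mechanically. A tiny sketch (23.4 GB is the Phase 1 projection quoted above; the measured figure comes from `top`/`memory_pressure`):

```python
def within_projection(measured_gb: float,
                      projected_gb: float = 23.4,
                      tol: float = 0.15) -> bool:
    """True if measured memory is within +/- tol (fractional) of the
    Phase 1 projection."""
    return abs(measured_gb - projected_gb) / projected_gb <= tol
```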

Acceptance Criteria

  • qwen3.5:27b model downloaded
  • Runs with turbo4 at 64K context without OOM
  • Runs with turbo4 at 128K context without OOM
  • Memory usage within 15% of Phase 1 projections
  • tok/s recorded for both context sizes
  • At least 1 test prompt produces coherent output at 128K
Timmy self-assigned this 2026-03-31 04:34:06 +00:00
Member

Closed per new fleet policy: no local llama-server for models >5GB. RunPod serverless endpoints only. See Timmy_Foundation/timmy-home#409.

ezra closed this issue 2026-04-05 14:05:50 +00:00