[P2-5] Download qwen3.5:27b and benchmark turbo4 at 64K/128K context #25

Closed
opened 2026-03-31 04:34:06 +00:00 by Timmy · 1 comment
Owner

Parent: #1 | Depends on: P2-3 (server working), P2-1 (quality validated on 14B)

Why

Hermes-4-14B was only the test model. The spec target is qwen3.5:27b at 128K context. We need to prove it actually fits in memory and runs.

Steps

  1. Download the model:
```bash
# Via Ollama (if custom build works):
ollama-turbo pull qwen3.5:27b

# Or direct GGUF download:
huggingface-cli download Qwen/Qwen3.5-27B-Q4_K_M-GGUF --local-dir ~/models/qwen3.5-27b/
```
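Before benchmarking, it can be worth confirming the GGUF actually landed on disk. A minimal sketch (the directory matches the download command above; the helper name is ours, not part of any tool):

```python
from pathlib import Path

def gguf_files(model_dir: str) -> dict:
    """Map each downloaded *.gguf file to its size in GB -- a quick sanity
    check that the pull completed before pointing llama-server at it."""
    d = Path(model_dir).expanduser()
    return {p.name: round(p.stat().st_size / 1e9, 2)
            for p in sorted(d.glob("*.gguf"))}
```

A 27B Q4_K_M file should be on the order of tens of GB; an empty dict or a tiny file means the download failed partway.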
  2. Benchmark at 64K context:
```bash
./llama-server -m ~/models/qwen3.5-27b/*.gguf \
  --port 8081 -np 1 -c 65536 --kv-type turbo4 --jinja
```

Send test prompts. Record: memory usage, tok/s, response quality.
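One way to record tok/s is to time a chat completion against the server. This is a sketch under the assumption that the custom build exposes llama-server's standard OpenAI-compatible `/v1/chat/completions` route on port 8081; the response field names should be verified against the actual build before relying on the numbers:

```python
import json
import time
import urllib.request

def tok_per_sec(n_tokens: int, elapsed_s: float) -> float:
    """Completion tokens divided by wall-clock generation time."""
    return n_tokens / elapsed_s

def bench_prompt(prompt: str, base_url: str = "http://localhost:8081") -> float:
    """Send one chat completion and return measured tok/s.
    Assumes an OpenAI-compatible endpoint; adjust the route if the
    custom build differs."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    t0 = time.monotonic()
    resp = json.load(urllib.request.urlopen(req))
    elapsed = time.monotonic() - t0
    return tok_per_sec(resp["usage"]["completion_tokens"], elapsed)
```

Wall-clock timing includes prompt processing, so for long prompts the server's own timing output (if exposed) is the better number for generation speed.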

  3. Benchmark at 128K context:
```bash
./llama-server -m ~/models/qwen3.5-27b/*.gguf \
  --port 8081 -np 1 -c 131072 --kv-type turbo4 --jinja
```

Record the same metrics. Check: does it OOM?
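A short prompt won't actually exercise a 128K KV cache. A hedged sketch of a filler-plus-needle prompt builder (the ~4 characters/token ratio is a rough heuristic, not the model's real tokenizer; the "needle" wording is ours):

```python
def long_context_prompt(target_tokens: int,
                        needle: str = "The code word is turbo4.") -> str:
    """Build a prompt of roughly target_tokens (~4 chars/token) with a
    'needle' fact buried mid-context, followed by a retrieval question,
    so coherence at depth can be checked, not just that the server
    stays up."""
    filler = "The quick brown fox jumps over the lazy dog. "
    chars = target_tokens * 4
    body = (filler * (chars // len(filler) + 1))[:chars]
    mid = len(body) // 2
    return body[:mid] + needle + body[mid:] + "\nQ: What is the code word?"
```

Sending `long_context_prompt(120000)` fills most of the 131072-token window; a coherent answer to the retrieval question is a stronger pass than a non-crashing server.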

  4. Memory validation:
```bash
# While server is running with 128K context:
memory_pressure  # macOS
top -l 1 | grep PhysMem
```

Expected from the Phase 1 report: ~23.4 GB total, leaving ~7.6 GB headroom within the 31 GB budget.
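The 15% tolerance in the acceptance criteria can be checked mechanically. A tiny sketch (23.4 GB is the Phase 1 projection quoted above; the measured figure comes from `top`/`memory_pressure`):

```python
def within_projection(measured_gb: float,
                      projected_gb: float = 23.4,
                      tol: float = 0.15) -> bool:
    """True if measured memory is within +/- tol (fractional) of the
    Phase 1 projection."""
    return abs(measured_gb - projected_gb) / projected_gb <= tol
```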

Acceptance Criteria

  • qwen3.5:27b model downloaded
  • Runs with turbo4 at 64K context without OOM
  • Runs with turbo4 at 128K context without OOM
  • Memory usage within 15% of Phase 1 projections
  • tok/s recorded for both context sizes
  • At least 1 test prompt produces coherent output at 128K
Timmy self-assigned this 2026-03-31 04:34:06 +00:00
Member

Closed per new fleet policy: no local llama-server for models >5GB. RunPod serverless endpoints only. See Timmy_Foundation/timmy-home#409.

ezra closed this issue 2026-04-05 14:05:50 +00:00