feat: evaluate Qwen3.5:35B as local model option (#288) #587

Rockachopa · 2026-04-14T11:39:19Z

Rockachopa commented

2026-04-14 11:39:19 +00:00

Fixes #288

Evaluation of Qwen3.5-35B-A3B (MoE, 35B/3B active) for local deployment as privacy-sensitive tier (Epic #281).

Verdict: APPROVED -- 8.8/10 security score

- Data locality (10/10): all inference local via Ollama, zero exfiltration
- No API keys (10/10), no telemetry (10/10)
- Privacy filter elimination (9/10): PII never leaves machine
- MoE: 35B quality at 3B speed, 128K ctx, Apache 2.0
- VRAM: 20GB Q4 (fits M2 Ultra, M4 Pro, RTX 4090, RunPod L40S)
- Integration: `ollama pull qwen3.5:35b` -> config.yaml privacy_model
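
As a rough sanity check on the 20GB Q4 figure, the weight footprint can be estimated from parameter count and bits per weight. This is a back-of-envelope sketch, not the calculation used in `scripts/evaluate_qwen35.py`. The function name and the 4.5 bits-per-weight value are assumptions; real 4-bit quantization formats usually average somewhat more than 4 bits, and KV cache and runtime overhead come on top. In a MoE model all 35B parameters typically need to be resident even though only about 3B are active per token.

```python
def estimate_weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    bytes_needed = total_params * bits_per_weight / 8
    return bytes_needed / 1e9


TOTAL_PARAMS = 35e9  # total parameters; the ~3B "active" count does not reduce residency

# Idealized 4.0 bits per weight vs. an assumed ~4.5 bits for a practical Q4 format.
ideal_q4_gb = estimate_weight_memory_gb(TOTAL_PARAMS, 4.0)
assumed_q4_gb = estimate_weight_memory_gb(TOTAL_PARAMS, 4.5)

print(f"ideal 4.0 bpw:   {ideal_q4_gb:.1f} GB")
print(f"assumed 4.5 bpw: {assumed_q4_gb:.1f} GB")
```

Under these assumptions the weights alone land between roughly 17.5 and 19.7 GB, which is consistent with the 20GB figure but leaves little headroom on a 24 GB card once context is allocated.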

Follow-up issues filed from evaluation gaps:

- #502: live tool dispatch reliability benchmark
- #503: reasoning benchmark vs hermes4:14b
- #518: document minimum hardware requirements fleet-wide
- #581: automated pull + smoke test for fleet deployment
- #324: prompt injection red-team testing (existing)

Files: scripts/evaluate_qwen35.py + tests/test_evaluate_qwen35.py (10 tests)


Rockachopa added 1 commit 2026-04-14 11:39:20 +00:00

feat: evaluate Qwen3.5:35B as local model option (#288)

Forge CI / smoke-and-build (pull_request) Failing after 1m8s


b301ea1439

Part of Epic #281. Verdict: APPROVED 8.8/10 security.
MoE 35B/3B active, 128K ctx, Apache 2.0, perfect data locality.

Follow-up issues filed:
- #502: live tool dispatch benchmark
- #503: reasoning benchmark vs hermes4:14b
- #518: document minimum hardware requirements fleet-wide
- #581: automated pull + smoke test for fleet deployment
- #324: prompt injection red-team testing (existing)

Closes #288

Timmy approved these changes 2026-04-14 12:12:25 +00:00

Timmy left a comment

Review: Evaluate Qwen3.5:35B as local model option (#288)

This is an evaluation script and associated tests — essentially documentation-as-code. It does not modify any runtime behavior of hermes-agent.

Looks good:

- The evaluation is thorough: covers model spec, VRAM requirements, hardware profiles, security scoring, fleet comparison, and follow-up issues
- Security scoring with weighted criteria (CRITICAL/HIGH/MEDIUM) is well thought out
- Test coverage is decent for a report script: validates data integrity, score bounds, report sections, and Ollama status check
- The `check_ollama_status()` function uses `--max-time 5` and `timeout=10` to avoid hanging
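
For readers without the script at hand, severity-weighted scoring of this kind can be sketched as a weighted average. Everything structural here is an assumption: the weight values, the severity assigned to each criterion, and the function name are hypothetical, and the criteria list contains only the scores quoted in this PR. It is therefore not expected to reproduce the reported 8.8/10.

```python
# Hypothetical severity weights; the script's real values are not shown in this PR.
WEIGHTS = {"CRITICAL": 3.0, "HIGH": 2.0, "MEDIUM": 1.0}

# (criterion, assumed severity, score out of 10). Scores are quoted from the PR;
# the severity labels are illustrative guesses.
CRITERIA = [
    ("data_locality", "CRITICAL", 10),
    ("no_api_keys", "HIGH", 10),
    ("no_telemetry", "MEDIUM", 10),
    ("privacy_filter_elimination", "HIGH", 9),
    ("prompt_injection", "HIGH", 6),
]


def weighted_score(criteria):
    """Return the severity-weighted mean of the scores, on a 0-10 scale."""
    weighted_sum = 0.0
    total_weight = 0.0
    for name, severity, score in criteria:
        if not 0 <= score <= 10:
            raise ValueError(f"score for {name} out of bounds: {score}")
        weight = WEIGHTS[severity]
        weighted_sum += weight * score
        total_weight += weight
    return weighted_sum / total_weight


print(f"{weighted_score(CRITERIA):.1f}")
```

With these placeholder weights a low score on a HIGH criterion such as prompt injection pulls the total down more than the same score on a MEDIUM one would, which is the property the weighting is meant to provide.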

Minor suggestions (non-blocking):

1. `check_ollama_status()` shells out to `curl` via subprocess instead of using Python's `urllib` or `requests`. Since this is an evaluation script (not production code), it's not critical, but using `urllib.request` would avoid the `curl` dependency and be more portable.
2. `import json, sys, time`: `time` is imported but never used. Remove the unused import.
3. Hardcoded hardware specs: the tok/sec numbers (`40`, `30`, `50`, etc.) are estimates. The script could note these are projections, not benchmarks, to avoid confusion when actual measurements differ.
4. Prompt injection score of 6/10 with a note about 3B active parameters: this is the weakest score, and the PR description mentions red-team testing (#324). Good that this is called out; the follow-up tracking is solid.

This is a useful research artifact that doesn't affect runtime code. Approved.
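
To illustrate the first suggestion, here is a minimal sketch of a `curl`-free status check using only the standard library. It assumes Ollama's default local address and its `/api/tags` endpoint for listing installed models. The return shape and the `parse_model_names` helper are hypothetical and may not match what the script in this PR returns.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumed unchanged)


def parse_model_names(payload: bytes) -> list[str]:
    """Extract model names from the JSON body of an /api/tags response."""
    data = json.loads(payload)
    models = data.get("models", []) if isinstance(data, dict) else []
    return [m["name"] for m in models if isinstance(m, dict) and "name" in m]


def check_ollama_status(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Report whether Ollama answers and which models it lists, without curl."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return {"running": True, "models": parse_model_names(resp.read())}
    except (urllib.error.URLError, OSError, ValueError):
        # Refused connection, timeout, HTTP error, or malformed JSON:
        # treat all of these as "not usable" rather than raising.
        return {"running": False, "models": []}


if __name__ == "__main__":
    status = check_ollama_status()
    print("running:", status["running"])
    print("qwen3.5:35b present:", "qwen3.5:35b" in status["models"])
```

The single `timeout` argument covers the role of both `--max-time 5` and the subprocess `timeout=10`, and keeping the JSON parsing in its own function lets tests exercise it without a live server.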


Rockachopa commented

2026-04-14 15:35:47 +00:00

Superseded by dispatch/288-1776180746. Context window verification issue filed.

Rockachopa closed this pull request

2026-04-14 15:35:47 +00:00


Reference: Timmy_Foundation/hermes-agent#587