feat: evaluate Qwen3.5:35B as local model option (#288) #587

Closed
Rockachopa wants to merge 245 commits from am/288-1776166469 into main
Owner

Fixes #288

Evaluation of Qwen3.5-35B-A3B (MoE, 35B/3B active) for local deployment as privacy-sensitive tier (Epic #281).

Verdict: APPROVED -- 8.8/10 security score

  • Data locality (10/10): all inference local via Ollama, zero exfiltration
  • No API keys (10/10), no telemetry (10/10)
  • Privacy filter elimination (9/10): PII never leaves machine
  • MoE: 35B quality at 3B speed, 128K ctx, Apache 2.0
  • VRAM: 20GB Q4 (fits M2 Ultra, M4 Pro, RTX 4090, RunPod L40S)
  • Integration: ollama pull qwen3.5:35b -> config.yaml privacy_model
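As a rough sanity check on the 20GB Q4 figure above, the weights-only footprint can be estimated from parameter count and bits per weight. This is a back-of-envelope sketch, not output from scripts/evaluate_qwen35.py: the helper name `weight_memory_gb` and the 4.5 bits/weight effective rate are assumptions, and KV cache and runtime overhead are excluded.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Weights-only memory in GB (10^9 bytes); excludes KV cache and runtime overhead."""
    return total_params * bits_per_weight / 8 / 1e9


# Assumption: all 35B parameters are resident in memory, even though only
# about 3B are active per token in the MoE forward pass.
TOTAL_PARAMS = 35e9

# 4.0 is the nominal 4-bit rate; 4.5 is an assumed effective rate once
# per-block quantization scales are counted.
for bits in (4.0, 4.5):
    print(f"{bits} bits/weight -> {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the estimate lands at 17.5 GB to 19.7 GB, which is consistent with the 20GB figure before context-dependent KV cache is added.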

Follow-up issues filed from evaluation gaps:

  • #502: live tool dispatch reliability benchmark
  • #503: reasoning benchmark vs hermes4:14b
  • #518: document minimum hardware requirements fleet-wide
  • #581: automated pull + smoke test for fleet deployment
  • #324: prompt injection red-team testing (existing)

Files: scripts/evaluate_qwen35.py + tests/test_evaluate_qwen35.py (10 tests)

Rockachopa added 1 commit 2026-04-14 11:39:20 +00:00
feat: evaluate Qwen3.5:35B as local model option (#288)
Some checks failed
Forge CI / smoke-and-build (pull_request) Failing after 1m8s
b301ea1439
Part of Epic #281. Verdict: APPROVED 8.8/10 security.
MoE 35B/3B active, 128K ctx, Apache 2.0, perfect data locality.

Follow-up issues filed:
- #502: live tool dispatch benchmark
- #503: reasoning benchmark vs hermes4:14b
- #518: document minimum hardware requirements fleet-wide
- #581: automated pull + smoke test for fleet deployment
- #324: prompt injection red-team testing (existing)

Closes #288
Timmy approved these changes 2026-04-14 12:12:25 +00:00
Timmy left a comment
Owner

Review: Evaluate Qwen3.5:35B as local model option (#288)

This is an evaluation script and associated tests — essentially documentation-as-code. It does not modify any runtime behavior of hermes-agent.

Looks good:

  • The evaluation is thorough: covers model spec, VRAM requirements, hardware profiles, security scoring, fleet comparison, and follow-up issues
  • Security scoring with weighted criteria (CRITICAL/HIGH/MEDIUM) is well thought out
  • Test coverage is decent for a report script — validates data integrity, score bounds, report sections, and Ollama status check
  • The check_ollama_status() function passes --max-time 5 to curl and sets timeout=10 on the call, so it cannot hang on an unresponsive server
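For readers unfamiliar with severity-weighted scoring, a minimal sketch of the general idea follows. The weights, criterion names, and example scores are hypothetical and are not taken from scripts/evaluate_qwen35.py, so the result is not expected to reproduce the 8.8/10 verdict.

```python
# Hypothetical severity weights; the PR does not list the script's actual values.
SEVERITY_WEIGHTS = {"CRITICAL": 3.0, "HIGH": 2.0, "MEDIUM": 1.0}


def weighted_score(criteria):
    """Weighted mean of (name, severity, score) tuples, with score on a 0-10 scale."""
    total_weight = 0.0
    weighted_sum = 0.0
    for name, severity, score in criteria:
        if not 0 <= score <= 10:
            raise ValueError(f"{name}: score {score} is outside 0-10")
        weight = SEVERITY_WEIGHTS[severity]
        total_weight += weight
        weighted_sum += weight * score
    if total_weight == 0:
        raise ValueError("no criteria supplied")
    return weighted_sum / total_weight


# Illustrative placeholder criteria only.
example = [
    ("example_critical", "CRITICAL", 10),
    ("example_high", "HIGH", 6),
    ("example_medium", "MEDIUM", 8),
]
print(round(weighted_score(example), 2))
```

The bounds check mirrors the kind of score-bounds validation the tests are described as performing; a weighted mean keeps the overall score on the same 0-10 scale as the inputs.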

Minor suggestions (non-blocking):

  1. check_ollama_status() shells out to curl via subprocess instead of using Python's urllib or requests. Since this is an evaluation script (not production code), it's not critical, but using urllib.request would avoid the curl dependency and be more portable.

  2. import json, sys, time: time is imported but never used. Remove the unused import.

  3. Hardcoded hardware specs — The tok/sec numbers (40, 30, 50, etc.) are estimates. The script could note these are projections, not benchmarks, to avoid confusion when actual measurements differ.

  4. Prompt injection score of 6/10 with a note about 3B active parameters — this is the weakest score and the PR description mentions red-team testing (#324). Good that this is called out; the follow-up tracking is solid.
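To illustrate suggestion 1, here is a minimal sketch of what a urllib.request version of the status check might look like. The signature, return value, and default base URL are assumptions, since the script's actual check_ollama_status() is not shown in this PR; the sketch queries Ollama's /api/tags endpoint, which lists locally installed models.

```python
import http.client
import json
import urllib.error
import urllib.request


def check_ollama_status(base_url="http://localhost:11434", timeout=5.0):
    """Return installed model names, or None if Ollama is unreachable.

    Assumed behavior for illustration; the real function may return
    something different.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Covers connection refused, timeouts, HTTP errors, and malformed JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [model.get("name", "") for model in payload.get("models", [])]
```

The single timeout argument takes the place of both the curl --max-time flag and the subprocess timeout, and the check no longer depends on curl being installed.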

This is a useful research artifact that doesn't affect runtime code. Approved.

Author
Owner

Superseded by dispatch/288-1776180746. Context window verification issue filed.

Rockachopa closed this pull request 2026-04-14 15:35:47 +00:00

Pull request closed


Reference: Timmy_Foundation/hermes-agent#587