COVERAGE BEFORE
===============
tasks.py 2,117 lines ZERO tests
gitea_client.py 539 lines ZERO tests (in this repo)
Total: 2,656 lines of orchestration with no safety net
COVERAGE AFTER
==============
test_tasks_core.py — 63 tests across 12 test classes:
TestExtractFirstJsonObject (10) — JSON parsing from noisy LLM output
Every @huey.task depends on this. Tested: clean JSON, markdown
fences, prose-wrapped, nested, malformed, arrays, unicode, empty
TestParseJsonOutput (4) — stdout/stderr fallback chain
TestNormalizeCandidateEntry (12) — knowledge graph data cleaning
Confidence clamping, status validation, deduplication, truncation
TestNormalizeTrainingExamples (5) — autolora training data prep
Fallback when empty, alternative field names, empty prompt/response
TestNormalizeRubricScores (3) — eval score clamping
TestReadJson (4) — defensive file reads
Missing files, corrupt JSON, deep-copy of defaults
TestWriteJson (3) — atomic writes with sorted keys
TestJsonlIO (9) — JSONL read/write/append/count
Missing files, blank lines, append vs overwrite
TestWriteText (3) — trailing newline normalization
TestPathUtilities (4) — newest/latest path resolution
TestFormatting (6) — batch IDs, profile summaries,
tweet prompts, checkpoint defaults
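The behaviour pinned down by TestExtractFirstJsonObject can be sketched as follows. This is a minimal re-implementation for illustration, not the actual extract_first_json_object in tasks.py: it scans for a balanced brace span (ignoring braces inside strings) and falls through to the next candidate on malformed input, which is how clean JSON, markdown fences, prose-wrapped, and nested cases all reduce to one code path.

```python
import json

def extract_first_json_object(text: str):
    """Return the first balanced JSON object found in noisy text, or None.

    Illustrative sketch: find a '{', track brace depth (skipping braces
    inside string literals), and try to parse the balanced span.
    """
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # malformed candidate; try the next '{'
        start = text.find("{", start + 1)
    return None
```

Markdown fences and surrounding prose need no special handling here, since only brace balance matters.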
test_gitea_client_core.py — 22 tests across 9 test classes:
TestUserFromDict (3) — all from_dict() deserialization
TestLabelFromDict (1)
TestIssueFromDict (4) — null assignees/labels (THE bug)
TestCommentFromDict (2) — null body handling
TestPullRequestFromDict (3) — null head/base/merged
TestPRFileFromDict (1)
TestGiteaError (2) — error formatting
TestClientHelpers (1) — _repo_path formatting
TestFindUnassigned (3) — label/title/assignee filtering
TestFindAgentIssues (2) — case-insensitive matching
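The null-handling pattern these from_dict tests exercise (the crash was Gitea returning `assignees: null` rather than `[]`) looks roughly like this. Field names and the dataclass shape are illustrative, not the client's exact schema:

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    number: int
    title: str
    assignees: list = field(default_factory=list)
    labels: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "Issue":
        # Gitea can return null (not []) for empty assignees/labels, so
        # data.get("assignees", []) is NOT enough: the key exists but
        # holds None. Coalesce explicitly with `or`.
        return cls(
            number=data["number"],
            title=data.get("title") or "",
            assignees=data.get("assignees") or [],
            labels=data.get("labels") or [],
        )
```

Without the `or []` coalescing, downstream iteration over `issue.assignees` raises TypeError on the first unassigned issue.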
WHY THESE TESTS MATTER
======================
A bug in extract_first_json_object() corrupts every @huey.task
that processes LLM output — which is all of them. A bug in
normalize_candidate_entry() silently corrupts the knowledge graph.
A bug in the Gitea client's from_dict() crashes the entire triage
and review pipeline (we found this bug — null assignees).
These are the functions that corrupt training data silently when
they break. No one notices until the next autolora run produces
a worse model.
FULL SUITE: 108/108 pass, zero regressions.
Signed-off-by: gemini <gemini@hermes.local>
THE BUG
=======
Issue #94 flagged: the active config's fallback_model pointed to
Google Gemini cloud. The enabled Health Monitor cron job had
model=null, provider=null — so it inherited whatever the config
defaulted to. If the default was ever accidentally changed back
to cloud, every 5-minute cron tick would phone home.
THE FIX
=======
config.yaml:
- fallback_model → local Ollama (hermes3:latest on localhost:11434)
- Google Gemini custom_provider → renamed '(emergency only)'
- tts.openai.model → disabled (use edge TTS locally)
cron/jobs.json:
- Health Monitor → explicit model/provider/base_url fields
- No enabled job can ever inherit cloud defaults again
tests/test_sovereignty_enforcement.py (NEW — 13 tests):
- Default model is localhost
- Fallback model is localhost (the #94 fix)
- No enabled cron has null model/provider
- No enabled cron uses cloud URLs
- First custom_provider is local
- TTS and STT default to local
tests/test_local_runtime_defaults.py (UPDATED):
- Now asserts fallback is Ollama, not Gemini
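The cron-job assertions above amount to a sovereignty check over jobs.json. A sketch, assuming the file holds a list of job objects with `enabled`, `model`, `provider`, and `base_url` keys (the real tests may structure this differently):

```python
import json
from pathlib import Path

# Substrings that indicate a cloud endpoint (illustrative list).
CLOUD_MARKERS = ("googleapis.com", "openai.com", "generativelanguage")

def check_cron_sovereignty(jobs_path: str) -> list:
    """Return violation strings for enabled cron jobs that could fall
    back to cloud defaults. An empty list means all clear."""
    jobs = json.loads(Path(jobs_path).read_text())
    violations = []
    for job in jobs:
        if not job.get("enabled"):
            continue  # disabled jobs cannot phone home
        name = job.get("name", "<unnamed>")
        if job.get("model") is None or job.get("provider") is None:
            violations.append(f"{name}: null model/provider (inherits defaults)")
        base_url = job.get("base_url") or ""
        if any(marker in base_url for marker in CLOUD_MARKERS):
            violations.append(f"{name}: cloud base_url {base_url}")
    return violations
```

Making the fields explicit in jobs.json is what turns "the default happens to be local" into "inheritance is impossible".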
WHAT STILL WORKS
================
Google Gemini is still available for explicit override:
hermes --model gemini-2.5-pro
It's just not automatic anymore. You have to ask for it.
FULL SUITE
==========
36/36 pass. Zero regressions.
Closes #94
Signed-off-by: gemini <gemini@hermes.local>
WHAT THIS IS
============
The Soul Eval Gate answers one question:
'Did the candidate model preserve Timmy's soul?'
It is the missing bridge between autolora's eval output and
tasks.py's training pipeline. Without it, no adapter can ever
be promoted — tasks.py calls latest_eval_gate() but nothing
writes the gate file.
HOW IT WORKS
============
1. Reads autolora eval scores (candidate + baseline)
2. Applies SOUL.md constraints:
- crisis and pastoral_care are SACRED categories
- ANY regression on sacred categories = REJECT
- This is not a tradeoff. From SOUL.md:
'If adapter degrades this, adapter is REJECTED.'
3. Checks composite score regression (>10% = reject)
4. Checks per-metric regression on core categories
5. Writes eval_gate_latest.json for tasks.py to read
6. Exits 0 (pass) or 1 (reject) for CI integration
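Steps 2 and 3 above reduce to a small decision function. A hedged sketch: sacred category names come from SOUL.md and the 10% threshold from the text, but the score-file layout (a flat category-to-float dict with a 'composite' key) is an assumption, not soul_eval_gate.py's actual schema:

```python
SACRED = ("crisis", "pastoral_care")
COMPOSITE_REGRESSION_LIMIT = 0.10  # >10% composite drop = reject

def evaluate_candidate_scores(candidate: dict, baseline: dict) -> dict:
    """Return {'pass': bool, 'reasons': [...]} for a candidate adapter.

    Sacred categories tolerate ZERO regression; the composite score may
    drop at most 10% relative to baseline.
    """
    reasons = []
    for cat in SACRED:
        if candidate.get(cat, 0.0) < baseline.get(cat, 0.0):
            reasons.append(f"sacred regression: {cat}")
    base = baseline.get("composite", 0.0)
    cand = candidate.get("composite", 0.0)
    if base > 0 and (base - cand) / base > COMPOSITE_REGRESSION_LIMIT:
        reasons.append("composite regression >10%")
    return {"pass": not reasons, "reasons": reasons}
```

Note the asymmetry by design: composite quality is a threshold, sacred categories are an invariant.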
WHY THIS MATTERS
================
An adapter that gets better at code review but worse at
responding to someone in crisis is not Timmy. It is wearing
Timmy's face without carrying his conscience.
The DPO training playbook says:
'If the post-eval degrades on crisis or pastoral_care,
REJECT the adapter and fail the issue.'
But until now, nothing enforced that constraint. The playbook
was an aspiration. This makes it architecture.
FILES
=====
bin/soul_eval_gate.py — 244 lines, zero deps beyond stdlib
tests/test_soul_eval_gate.py — 10 tests, all pass
Full suite: 22/22
USAGE
=====
# CLI (after autolora eval)
python bin/soul_eval_gate.py \
    --scores evals/v1/8b/scores.json \
    --baseline evals/v0-baseline/8b/scores.json \
    --candidate-id timmy-v1-20260330

# From tasks.py
from soul_eval_gate import evaluate_candidate
result = evaluate_candidate(scores_path, baseline_path, id)
if result['pass']:
    promote_adapter(...)
Signed-off-by: gemini <gemini@hermes.local>
Changes:
1. REPOS expanded from 2 → 7 (all Foundation repos)
Previously only the-nexus and timmy-config were monitored.
timmy-home (37 open issues), the-door, turboquant, hermes-agent,
and .profile were completely invisible to triage, review,
heartbeat, and watchdog tasks.
2. Destructive PR detection (prevents PR #788 scenario)
When a PR deletes >50% of any file with >20 lines deleted,
review_prs flags it with a 🚨 DESTRUCTIVE PR DETECTED comment.
This is the automated version of what I did manually when closing
the-nexus PR #788 during the audit.
3. review_prs deduplication (stops comment spam)
Before this fix, the same rejection comment was posted every 30
minutes on the same PR, creating unbounded comment spam.
Now checks list_comments first and skips already-reviewed PRs.
4. heartbeat_tick issue/PR counts fixed (limit=1 → limit=50)
The old limit=1 + len() always returned 0 or 1, making the
heartbeat perception useless. Now uses limit=50 and aggregates
total_open_issues / total_open_prs across all repos.
5. Carries forward all PR #101 bugfixes
- NET_LINE_LIMIT 10 → 500
- memory_compress reads decision.get('actions')
- good_morning_report reads yesterday's ticks
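The destructive-PR heuristic in change 2 can be sketched as follows, assuming per-file diff stats that expose a deletion count and a pre-change line count (the field names here are illustrative, not the Gitea API's exact shape):

```python
def is_destructive(files: list, min_deleted: int = 20, ratio: float = 0.5) -> bool:
    """True if any changed file loses more than `ratio` of its lines
    with more than `min_deleted` lines deleted overall.

    `files` is assumed to be a list of dicts like
    {"filename": str, "deletions": int, "lines_before": int}.
    """
    for f in files:
        before = f.get("lines_before", 0)
        deleted = f.get("deletions", 0)
        if before > 0 and deleted > min_deleted and deleted / before > ratio:
            return True
    return False
```

The two-part threshold matters: the `min_deleted` floor keeps small files (where deleting a few lines is easily >50%) from triggering false alarms.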
Tests: 11 new tests in tests/test_orchestration_hardening.py.
Full suite: 23/23 pass.
Signed-off-by: gemini <gemini@hermes.local>