Changed message param from str to list[str] in chat() and route() commands.
Words are joined with spaces, so 'timmy chat hello how are you' works without
quoting. Single-word messages still work as before.
- chat(): message: list[str], joined to full_message
- route(): message: list[str], joined to full_message
- 7 new tests in test_cli_multiword.py
Closes#26
Timmy can now introspect which session he's running in (cli, dashboard, loop).
- Add {session_id} placeholder to both lite and full system prompts
- get_system_prompt() accepts session_id param (default: 'unknown')
- create_timmy() accepts session_id param, forwards to prompt
- CLI chat/think/status pass their session_id to create_timmy()
- session.py passes _DEFAULT_SESSION_ID to create_timmy()
- 7 new tests in test_session_identity.py
- Updated 2 existing CLI test mocks
Closes#64
- Use sys.executable instead of hardcoded "python" in tests
- Fixes test_run_python_expression and test_run_nonzero_exit
- Passes allowed_prefixes for both python and python3
- Replace substring matching with word-boundary regex in route_request()
- "fix the bug" now correctly routes to coder
- Multi-word patterns match if all words appear (any order)
- Add "timmy route" CLI command for debugging routing
- Add route_request_with_match() for pattern visibility
- Expand routing keywords in agents.yaml
- 22 new routing tests, all passing
Comprehensive test coverage for the semantic memory module:
- _simple_hash_embedding determinism and normalization
- cosine_similarity including zero vectors
- SemanticMemory: init, index_file, index_vault, search, stats
- _split_into_chunks with various sizes
- memory_search, memory_read, memory_write, memory_forget tools
- MemorySearcher class
- Edge cases: empty DB, unicode, very long text, special chars
- All tests use tmp_path for isolation, no sentence-transformers needed
86 tests, all passing. 1393 total tests passing.
- Add ollama_num_ctx setting (default 4096) to config.py
- Pass num_ctx option to Ollama in agent.py and agents/base.py
- Add OLLAMA_NUM_CTX to .env.example with usage docs
- Add context_window note in providers.yaml
- Fix mock_settings in test_agent.py for new attribute
- qwen3:30b with 4096 ctx uses ~19GB vs 45GB default
- BaseAgent.run(): catch httpx.ConnectError/ReadError/ConnectionError,
log 'Ollama disconnected: <error>' at ERROR level, then re-raise
- session.py: distinguish Ollama disconnects from other errors in
chat(), chat_with_tools(), continue_chat() — return specific message
'Ollama appears to be disconnected' instead of generic error
- 11 new tests covering all disconnect paths
Wrap numpy and voice_loop imports in try/except with pytestmark skipif.
Tests skip cleanly instead of ImportError when numpy not in dev deps.
Closes#48
Remove DuckDuckGoTools import, all web_search registrations across 4 toolkit
factories, catalog entry, safety classification, prompt references, and
session regex. Total: -41 lines of dead code.
consult_grok is functional (grok_enabled=True, API key set) and opt-in,
so it stays — but Timmy never calls it autonomously, which is correct
sovereign behavior (no cloud calls unless user permits).
Closes#87
_get_ollama_model() used prefix match (startswith) on /api/tags,
causing qwen3:30b to match qwen3.5:latest. Now:
1. Queries /api/ps (loaded models) first — most accurate
2. Falls back to /api/tags with exact name match
3. Reports actual running model, not just configured one
Updated test_get_system_info_contains_model to not assume model==config.
Fixes#77. 5 regression tests added.
Add config/allowlist.yaml — YAML-driven gate that auto-approves bounded
tool calls when no human is present.
When Timmy runs with --autonomous or stdin is not a terminal, tool calls
are checked against allowlist: matched → auto-approved, else → rejected.
Changes:
- config/allowlist.yaml: shell prefixes, deny patterns, path rules
- tool_safety.py: is_allowlisted() checks tools against YAML rules
- cli.py: --autonomous flag, _is_interactive() detection
- 44 new allowlist tests, 8 updated CLI tests
Closes#69
Move hardcoded model fallback lists from module-level constants into
settings.fallback_models and settings.vision_fallback_models (pydantic
Settings fields). Can now be overridden via env vars
FALLBACK_MODELS / VISION_FALLBACK_MODELS or config/providers.yaml.
Removed:
- OLLAMA_MODEL_PRIMARY / OLLAMA_MODEL_FALLBACK from config.py
- DEFAULT_MODEL_FALLBACKS / VISION_MODEL_FALLBACKS from agent.py
get_effective_ollama_model() and _resolve_model_with_fallback() now
walk the configurable chains instead of hardcoded constants.
5 new tests guard the configurable behavior and prevent regression
to hardcoded constants.
Closes#71: Timmy was responding with elaborate markdown formatting
(tables, headers, emoji, bullet lists) for simple questions.
Root causes fixed:
1. Agno Agent markdown=True flag explicitly told the model to format
responses as markdown. Set to False in both agent.py and agents/base.py.
2. SYSTEM_PROMPT_FULL used ## and ### markdown headers, bold (**), and
numbered lists — teaching by example that markdown is expected.
Rewritten to plain text with labeled sections.
3. Brevity instructions were buried at the bottom of the full prompt.
Moved to immediately after the opening line as 'VOICE AND BREVITY'
with explicit override priority.
4. Orchestrator prompt in agents.yaml was silent on response style.
Added 'Voice: brief, plain, direct' with concrete examples.
The full prompt is now 41 lines shorter (124 → 83). The prompt itself
practices the brevity it preaches.
SOUL.md alignment:
- 'Brevity is a kindness' — now front-loaded in both base and agent prompt
- 'I do not fill silence with noise' — explicit in both tiers
- 'I speak plainly. I prefer short sentences.' — structural enforcement
4 new tests guard against regression:
- test_full_prompt_brevity_first: brevity section before tools/memory
- test_full_prompt_no_markdown_headers: no ## or ### in prompt text
- test_full_prompt_plain_text_brevity: 'plain text' instruction present
- test_lite_prompt_brevity: lite tier also instructs brevity
Add post-generation similarity check to ThinkingEngine.think_once().
Problem: Timmy's thinking engine generates repetitive thoughts because
small local models ignore 'don't repeat' instructions in the prompt.
The same observation ('still no chat messages', 'Alexander's name is in
profile') would appear 14+ times in a single day's journal.
Fix: After generating a thought, compare it against the last 5 thoughts
using SequenceMatcher. If similarity >= 0.6, retry with a new seed up to
2 times. If all retries produce repetitive content, discard rather than
store. Uses stdlib difflib — no new dependencies.
Changes:
- thinking.py: Add _is_too_similar() method with SequenceMatcher
- thinking.py: Wrap generation in retry loop with dedup check
- test_thinking.py: 7 new tests covering exact match, near match,
different thoughts, retry behavior, and max-retry discard
+96/-20 lines in thinking.py, +87 lines in tests.
Replace in-memory MessageLog with SQLite-backed implementation.
Same API surface (append/all/clear/len) so zero caller changes needed.
- data/chat.db stores messages with role, content, timestamp, source
- Lazy DB connection (opened on first use, not at import time)
- Retention policy: oldest messages pruned when count > 500
- New .recent(limit) method for efficient last-N queries
- Thread-safe with explicit locking
- WAL mode for concurrent read performance
- Test isolation: conftest redirects DB to tmp_path per test
- 8 new tests: persistence, retention, concurrency, source field
Closes#46
Fixes#52
- Replace eval() in calculator() with _safe_eval() that walks the AST
and only permits: numeric constants, arithmetic ops (+,-,*,/,//,%,**),
unary +/-, math module access, and whitelisted builtins (abs, round,
min, max)
- Reject all other syntax: imports, attribute access on non-math objects,
lambdas, comprehensions, string literals, etc.
- Add 39 tests covering arithmetic, precedence, math functions,
allowed builtins, error handling, and 14 injection prevention cases
Three fixes from real-world testing:
1. Event loop: replaced asyncio.run() with a persistent loop so
Agno's MCP sessions survive across conversation turns. No more
'Event loop is closed' errors on turn 2+.
2. Markdown stripping: voice preamble tells Timmy to respond in
natural spoken language, plus _strip_markdown() as a safety net
removes **bold**, *italic*, bullets, headers, code fences, etc.
TTS no longer reads 'asterisk asterisk'.
3. MCP noise: _suppress_mcp_noise() quiets mcp/agno/httpx loggers
during voice mode so the terminal shows clean transcript only.
32 tests (12 new for markdown stripping + persistent loop).
The httpx AsyncClient was cached across asyncio.run() boundaries.
Each asyncio.run() creates and closes a new event loop, leaving the
cached client's connections on a dead loop. Second+ calls would fail
with "Event loop is closed".
Fix: create a fresh client per request and close it in a finally block.
No more cross-loop client reuse.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Give Timmy the ability to file Gitea issues when he notices bugs,
stale state, or improvement opportunities in his own codebase.
Components:
- GiteaHand async API client (infrastructure/hands/gitea.py)
- Token auth with ~/.config/gitea/token fallback
- Create/list/close issues, dedup by title similarity
- Graceful degradation when Gitea unreachable
- Tool functions (timmy/tools_gitea.py)
- create_gitea_issue: file issues with dedup + work order bridge
- list_gitea_issues: check existing backlog
- Classified as SAFE (no confirmation needed)
- Thinking post-hook (_maybe_file_issues in thinking.py)
- Every 20 thoughts, LLM classifies recent thoughts for actionable items
- Auto-files bugs/improvements to Gitea with dedup
- Bridges to local work order system for dashboard tracking
- Config: gitea_url, gitea_token, gitea_repo, gitea_enabled,
gitea_timeout, thinking_issue_every
All 1426 tests pass, 74.17% coverage.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidates 3 separate memory databases (semantic_memory.db, swarm.db
memory_entries, brain.db) into a single data/memory.db with facts,
chunks, and episodes tables.
Key changes:
- Add unified schema (timmy/memory/unified.py) with 3 core tables
- Redirect vector_store.py and semantic_memory.py to memory.db
- Add thought distillation: every Nth thought extracts lasting facts
- Enrich agent context with known facts in system prompt
- Add memory_forget tool for removing outdated memories
- Unify embeddings: vector_store delegates to semantic_memory.embed_text
- Bridge spark events to unified event log
- Add pruning for thoughts and events with configurable retention
- Add data migration script (timmy/memory_migrate.py)
- Deprecate brain.memory in favor of unified system
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite _THINKING_PROMPT with strict rules: 2-3 sentence limit,
anti-confabulation (only reference real data), anti-repetition.
- Add _pick_seed_type() with recent-type dedup (excludes last 3)
- Add _gather_system_snapshot() for real-time grounding (time, thought
count, chat activity, task queue)
- Improve _build_continuity_context() with anti-repetition header and
100-char truncation
- Fix journal + memory timestamps to include local timezone
- 12 new TDD tests covering all improvements
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add optional prompt argument to `timmy tick` so custom journal
prompts can be passed from the CLI (seed_type="prompted").
Fix extract_user_name() learning verbs as names (e.g. "Serving").
Now requires the candidate word to start with a capital letter in
the original message, rejects common verb suffixes (-ing, -tion,
etc.), and deduplicates the naive regex in TimmyWithMemory to use
the fixed ConversationManager.extract_user_name() instead.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire MEMORY.md + soul.md into the thinking loop so each heartbeat
is grounded in identity and recent context, breaking repetitive loops.
Pre-hook: _load_memory_context() reads hot memory first (changes each
cycle) then soul.md (stable identity), truncated to 1500 chars.
Post-hook: _update_memory() writes a "Last Reflection" section to
MEMORY.md after each thought so the next cycle has fresh context.
soul.md is read-only from the heartbeat — never modified by it.
All hooks degrade gracefully and never crash the heartbeat.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test data was bleeding into production tasks.db because
swarm.task_queue.models.DB_PATH (relative path) was never patched in
conftest.clean_database. Fixed by switching to absolute paths via
settings.repo_root and adding the missing module to the patching list.
Discord bot could leak orphaned clients on retry after ERROR state.
Added _cleanup_stale() to close stale client/task before each start()
attempt, with improved logging in the token watcher.
Rewrote test_paperclip_client.py to use httpx.MockTransport instead of
patching _get/_post/_delete — tests now exercise real HTTP status codes,
error handling, and JSON parsing. Added end-to-end test for
capture_error → create_task DB isolation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
load_token() was checking the state file before settings.discord_token,
so a stale fake token in discord_state.json would block the real token
from .env/DISCORD_TOKEN. Flipped the priority: env config first, state
file as fallback for tokens set via /discord/setup UI.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Model upgrade:
- qwen2.5:14b → qwen3.5:latest across config, tools, and docs
- Added qwen3.5 to multimodal model registry
Self-hosted Gitea CI:
- .gitea/workflows/tests.yml: lint + test jobs via act_runner
- Unified Dockerfile: pre-baked deps from poetry.lock for fast CI
- sitepackages=true in tox for ~2s dep resolution (was ~40s)
- OLLAMA_URL set to dead port in CI to prevent real LLM calls
Test isolation fixes:
- Smoke test fixture mocks create_timmy (was hitting real Ollama)
- WebSocket sends initial_state before joining broadcast pool (race fix)
- Tests use settings.ollama_model/url instead of hardcoded values
- skip_ci marker for Ollama-dependent tests, excluded in CI tox envs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: set qwen3.5:latest as default model
- Make qwen3.5:latest the primary default model for faster inference
- Move llama3.1:8b-instruct to fallback chain
- Update text fallback chain to prioritize qwen3.5:latest
Retains full backward compatibility via cascade fallback.
* test: remove ~55 brittle, duplicate, and useless tests
Audit of all 100 test files identified tests that provided no real
regression protection. Removed:
- 4 files deleted entirely: test_setup_script (always skipped),
test_csrf_bypass (tautological assertions), test_input_validation
(accepts 200-500 status codes), test_security_regression (fragile
source-pattern checks redundant with rendering tests)
- Duplicate test classes (TestToolTracking, TestCalculatorExtended)
- Mock-only tests that just verify mock wiring, not behavior
- Structurally broken tests (TestCreateToolFunctions patches after import)
- Empty/pass-body tests and meaningless assertions (len > 20)
- Flaky subprocess tests (aider tool calling real binary)
All 1328 remaining tests pass. Net: -699 lines, zero coverage loss.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: prevent test pollution from autoresearch_enabled mutation
test_autoresearch_perplexity.py was setting settings.autoresearch_enabled = True
but never restoring it in the finally block — polluting subsequent tests.
When pytest-randomly ordered it before test_experiments_page_shows_disabled_when_off,
the victim test saw enabled=True and failed to find "Disabled" in the page.
Fix both sides:
- Restore autoresearch_enabled in the finally block (root cause)
- Mock settings explicitly in the victim test (defense in depth)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The agno dependency was pinned to <2.0 but the code uses agno.db.sqlite
(a 2.x API), breaking all tests in CI. Also fix airllm provider tests
to patch importlib.util.find_spec (what the production code uses) instead
of builtins.__import__.
Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* polish: streamline nav, extract inline styles, improve tablet UX
- Restructure desktop nav from 8+ flat links + overflow dropdown into
5 grouped dropdowns (Core, Agents, Intel, System, More) matching
the mobile menu structure to reduce decision fatigue
- Extract all inline styles from mission_control.html and base.html
notification elements into mission-control.css with semantic classes
- Replace JS-built innerHTML with secure DOM construction in
notification loader and chat history
- Add CONNECTING state to connection indicator (amber) instead of
showing OFFLINE before WebSocket connects
- Add tablet breakpoint (1024px) with larger touch targets for
Apple Pencil / stylus use and safe-area padding for iPad toolbar
- Add active-link highlighting in desktop dropdown menus
- Rename "Mission Control" page title to "System Overview" to
disambiguate from the chat home page
- Add "Home — Timmy Time" page title to index.html
https://claude.ai/code/session_015uPUoKyYa8M2UAcyk5Gt6h
* fix(security): move auth-gate credentials to environment variables
Hardcoded username, password, and HMAC secret in auth-gate.py replaced
with os.environ lookups. Startup now refuses to run if any variable is
unset. Added AUTH_GATE_SECRET/USER/PASS to .env.example.
https://claude.ai/code/session_015uPUoKyYa8M2UAcyk5Gt6h
* refactor(tooling): migrate from black+isort+bandit to ruff
Replace three separate linting/formatting tools with a single ruff
invocation. Updates tox.ini (lint, format, pre-push, pre-commit envs),
.pre-commit-config.yaml, and CI workflow. Fixes all ruff errors
including unused imports, missing raise-from, and undefined names.
Ruff config maps existing bandit skips to equivalent S-rules.
https://claude.ai/code/session_015uPUoKyYa8M2UAcyk5Gt6h
---------
Co-authored-by: Claude <noreply@anthropic.com>
Major:
- Extract all inline <style> blocks from 22 Jinja2 templates into
static/css/mission-control.css — single cacheable stylesheet
- Add tox lint check that fails on inline <style> in templates
Minor:
1. Connection status indicator in topbar (green/amber/red dot) reflecting
WebSocket + Ollama reachability, with auto-reconnect
2. Jinja2 {% macro panel(title) %} in macros.html — eliminates repeated
.card.mc-panel markup; index.html converted as example
3. SVG favicon (purple T + orange dot)
4. 30-second TTL cache on _check_ollama() to avoid blocking the event loop
on every health poll (asyncio.to_thread was already in place)
5. Toast notification system (McToast.show) for transient status messages —
wired into connection status for Ollama/WebSocket state changes
Enforcement:
- CLAUDE.md updated with conventions 11-14 (no inline CSS, use panel macro,
use toasts, never block the event loop)
- tox lint + pre-push environments now fail on inline <style> blocks
https://claude.ai/code/session_014FQ785MQdyJQ4BAXrRSo9w
Co-authored-by: Claude <noreply@anthropic.com>
- Fix test_autoresearch_perplexity: patch target was
dashboard.routes.experiments.get_experiment_history but the function
is imported locally inside the route handler, so patch the source
module timmy.autoresearch.get_experiment_history instead.
- Add tests for src/timmy/interview.py (previously 0% coverage):
question structure, run_interview flow, error handling, formatting.
- Produce interview transcript document from structured Timmy interview.
https://claude.ai/code/session_01EXDzXqgsC2ohS8qreF1fBo
Co-authored-by: Claude <noreply@anthropic.com>
Replace the homebrew regex-based tool extraction and manual dispatch
(tool_executor.py) with Agno's built-in Human-In-The-Loop confirmation:
- Toolkit(requires_confirmation_tools=...) marks dangerous tools
- agent.run() returns RunOutput with status=paused when confirmation needed
- RunRequirement.confirm()/reject() + agent.continue_run() resumes execution
Dashboard and Discord vendor both use the native flow. DuckDuckGo import
isolated so its absence doesn't kill all tools. Test stubs cleaned up
(agno is a real dependency, only truly optional packages stubbed).
1384 tests pass in parallel (~14s).
Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: remove invalid show_tool_calls kwarg crashing Agent init (regression)
show_tool_calls was removed in f95c960 (Feb 26) because agno 2.5.x
doesn't accept it, then reintroduced in fd0ede0 (Mar 8) without
runtime testing — mocked tests hid the breakage.
Replace the bogus assertion with a regression guard and an allowlist
test that catches unknown kwargs before they reach production.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: auto-install git hooks, add black/isort to dev deps
- Add .githooks/ with portable pre-commit hook (macOS + Linux)
- make install now auto-activates hooks via core.hooksPath
- Add black and isort to poetry dev group (were only in CI via raw pip)
- Fix black formatting on 2 files flagged by CI
- Fix test_autoresearch_perplexity patching wrong module path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>