* feat: set qwen3.5:latest as default model
- Make qwen3.5:latest the primary default model for faster inference
- Move llama3.1:8b-instruct to fallback chain
- Update text fallback chain to prioritize qwen3.5:latest
Retains full backward compatibility via cascade fallback.
* test: remove ~55 brittle, duplicate, and useless tests
Audit of all 100 test files identified tests that provided no real
regression protection. Removed:
- 4 files deleted entirely: test_setup_script (always skipped),
test_csrf_bypass (tautological assertions), test_input_validation
(accepts 200-500 status codes), test_security_regression (fragile
source-pattern checks redundant with rendering tests)
- Duplicate test classes (TestToolTracking, TestCalculatorExtended)
- Mock-only tests that just verify mock wiring, not behavior
- Structurally broken tests (TestCreateToolFunctions patches after import)
- Empty/pass-body tests and meaningless assertions (len > 20)
- Flaky subprocess tests (aider tool calling real binary)
All 1328 remaining tests pass. Net: -699 lines, zero coverage loss.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: prevent test pollution from autoresearch_enabled mutation
test_autoresearch_perplexity.py was setting settings.autoresearch_enabled = True
but never restoring it in the finally block — polluting subsequent tests.
When pytest-randomly ordered it before test_experiments_page_shows_disabled_when_off,
the victim test saw enabled=True and failed to find "Disabled" in the page.
Fix both sides:
- Restore autoresearch_enabled in the finally block (root cause)
- Mock settings explicitly in the victim test (defense in depth)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* polish: streamline nav, extract inline styles, improve tablet UX
- Restructure desktop nav from 8+ flat links + overflow dropdown into
5 grouped dropdowns (Core, Agents, Intel, System, More) matching
the mobile menu structure to reduce decision fatigue
- Extract all inline styles from mission_control.html and base.html
notification elements into mission-control.css with semantic classes
- Replace JS-built innerHTML with secure DOM construction in
notification loader and chat history
- Add CONNECTING state to connection indicator (amber) instead of
showing OFFLINE before WebSocket connects
- Add tablet breakpoint (1024px) with larger touch targets for
Apple Pencil / stylus use and safe-area padding for iPad toolbar
- Add active-link highlighting in desktop dropdown menus
- Rename "Mission Control" page title to "System Overview" to
disambiguate from the chat home page
- Add "Home — Timmy Time" page title to index.html
https://claude.ai/code/session_015uPUoKyYa8M2UAcyk5Gt6h
* fix(security): move auth-gate credentials to environment variables
Hardcoded username, password, and HMAC secret in auth-gate.py replaced
with os.environ lookups. Startup now refuses to run if any variable is
unset. Added AUTH_GATE_SECRET/USER/PASS to .env.example.
https://claude.ai/code/session_015uPUoKyYa8M2UAcyk5Gt6h
* refactor(tooling): migrate from black+isort+bandit to ruff
Replace three separate linting/formatting tools with a single ruff
invocation. Updates tox.ini (lint, format, pre-push, pre-commit envs),
.pre-commit-config.yaml, and CI workflow. Fixes all ruff errors
including unused imports, missing raise-from, and undefined names.
Ruff config maps existing bandit skips to equivalent S-rules.
https://claude.ai/code/session_015uPUoKyYa8M2UAcyk5Gt6h
---------
Co-authored-by: Claude <noreply@anthropic.com>
Replace the homebrew regex-based tool extraction and manual dispatch
(tool_executor.py) with Agno's built-in Human-In-The-Loop confirmation:
- Toolkit(requires_confirmation_tools=...) marks dangerous tools
- agent.run() returns RunOutput with status=paused when confirmation needed
- RunRequirement.confirm()/reject() + agent.continue_run() resumes execution
Dashboard and Discord vendor both use the native flow. DuckDuckGo import
isolated so its absence doesn't kill all tools. Test stubs cleaned up
(agno is a real dependency, only truly optional packages stubbed).
1384 tests pass in parallel (~14s).
Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* CI/CD Optimization: Guard Rails, Black Linting, and Pre-commit Hooks
- Fixed all test collection errors (Selenium imports, fixture paths, syntax)
- Implemented pre-commit hooks with Black formatting and isort
- Created comprehensive Makefile with test targets (unit, integration, functional, e2e)
- Added pytest.ini with marker definitions for test categorization
- Established guard rails to prevent future collection errors
- Wrapped optional dependencies (Selenium, MoviePy) in try-except blocks
- Added conftest_markers for automatic test categorization
This ensures a smooth development stream with:
- Fast feedback loops (pre-commit checks before push)
- Consistent code formatting (Black)
- Reliable CI/CD (no collection errors, proper test isolation)
- Clear test organization (unit, integration, functional, E2E)
* Fix CI/CD test failures:
- Export templates from dashboard.app
- Fix model name assertion in test_agent.py
- Fix platform-agnostic path resolution in test_path_resolution.py
- Skip Docker tests in test_docker_deployment.py if docker not available
- Fix test_model_fallback_chain logic in test_ollama_integration.py
* Add preventative pre-commit checks and Docker test skipif decorators:
- Create pre_commit_checks.py script for common CI failures
- Add skipif decorators to Docker tests
- Improve test robustness for CI environments
The task queue was completely stuck: 82 tasks trapped in pending_approval,
4 zombie tasks frozen in running, and the worker loop unable to process
anything. This removes the approval gate as the default and adds startup
recovery for orphaned tasks.
- Auto-approve all tasks by default; only task_type="escalation" requires
human review (and escalations never block the processor)
- Add reconcile_zombie_tasks() to reset RUNNING→APPROVED on startup
- Use in-memory _current_task for concurrency check instead of DB status
so stale RUNNING rows from a crash can't block new work
- Update get_next_pending_task to only query APPROVED tasks
- Update all callsites (chat route, API, form) to match new defaults
Co-authored-by: Alexander Payne <apayne@MM.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
11 tests verify that the TaskProcessor drains all queued tasks to
completion — the core behavior needed for Timmy's stream of
consciousness. Tests cover: single/batch/burst processing, priority
ordering, mixed task types, failure recovery, timestamp tracking,
and a loop-based inbox zero assertion.
Adds an `isolated_task_db` fixture to functional conftest that gives
each test a fresh temporary SQLite database via pytest's tmp_path.
Co-authored-by: Alexander Payne <apayne@MM.local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Replace fixed time.sleep() calls with intelligent polling or WebDriverWait
- Add pytest-timeout dependency and --timeout=30 to prevent hangs
- Fixes test flakiness and improves test suite speed
Co-authored-by: Alexander Payne <apayne@MM.local>
Add fixtures required by functional test suite:
- docker_stack: Docker container test URL (skips if FUNCTIONAL_DOCKER != 1)
- serve_client: FastAPI TestClient for timmy-serve app
- tdd_runner: Alias for self_tdd_runner
Fixes CI errors in test_docker_swarm.py, test_l402_flow.py, test_cli.py
- Fix recall_personal_facts() call - remove unsupported limit parameter
- Replace 4 slow E2E test files with single fast test file
- All 6 E2E tests complete in ~9 seconds (was 60+ seconds)
- Reuse browser session across tests (module-scoped fixture)
- Combine related checks into single tests
- Add HTTP-only smoke test for speed
selenium depends on websocket-client which installs a top-level
`websocket` package that shadows our src/websocket/ module on CI.
Renaming to ws_manager eliminates the conflict entirely — no more
sys.path hacks needed in conftest or Selenium tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an ollama service (behind --profile ollama) to the test compose stack
and a new test suite that verifies real LLM inference end-to-end:
- docker-compose.test.yml: add ollama/ollama service with health check,
make OLLAMA_URL and OLLAMA_MODEL configurable via env vars
- tests/functional/test_ollama_chat.py: session-scoped fixture that
brings up Ollama + dashboard, pulls qwen2.5:0.5b (~400MB, CPU-only),
and runs chat/history/multi-turn tests against the live stack
- Makefile: add `make test-ollama` target
Run with: make test-ollama (or FUNCTIONAL_DOCKER=1 pytest tests/functional/test_ollama_chat.py -v)
https://claude.ai/code/session_01NTEzfRHSZQCfkfypxgyHKk
Three-tier functional test infrastructure:
- CLI tests via Typer CliRunner (timmy, timmy-serve, self-tdd)
- Dashboard integration tests with real TestClient, real SQLite, real
coordinator (no patch/mock — Ollama offline = graceful degradation)
- Docker compose container-level tests (gated by FUNCTIONAL_DOCKER=1)
- End-to-end L402 payment flow with real mock-lightning backend
42 new tests (8 Docker tests skipped without FUNCTIONAL_DOCKER=1).
All 849 tests pass.
https://claude.ai/code/session_01WU4h3cQQiouMwmgYmAgkMM