Commit Graph

167 Commits

Author SHA1 Message Date
d28e2f4a7e [loop-cycle-1] feat: tool allowlist for autonomous operation (#69)
Add config/allowlist.yaml — YAML-driven gate that auto-approves bounded
tool calls when no human is present.

When Timmy runs with --autonomous or stdin is not a terminal, tool calls
are checked against allowlist: matched → auto-approved, else → rejected.

Changes:
  - config/allowlist.yaml: shell prefixes, deny patterns, path rules
  - tool_safety.py: is_allowlisted() checks tools against YAML rules
  - cli.py: --autonomous flag, _is_interactive() detection
  - 44 new allowlist tests, 8 updated CLI tests

Closes #69
2026-03-14 17:39:48 -04:00
rockachopa
927e25cc40 Merge pull request 'fix: replace print() with proper logging (#29, #51)' (#59) from fix/print-to-logging into main 2026-03-14 16:50:04 -04:00
b30b5c6b57 [loop-cycle-6] Break thinking rumination loop — semantic dedup (#38)
Add post-generation similarity check to ThinkingEngine.think_once().

Problem: Timmy's thinking engine generates repetitive thoughts because
small local models ignore 'don't repeat' instructions in the prompt.
The same observation ('still no chat messages', 'Alexander's name is in
profile') would appear 14+ times in a single day's journal.

Fix: After generating a thought, compare it against the last 5 thoughts
using SequenceMatcher. If similarity >= 0.6, retry with a new seed up to
2 times. If all retries produce repetitive content, discard rather than
store. Uses stdlib difflib — no new dependencies.

Changes:
- thinking.py: Add _is_too_similar() method with SequenceMatcher
- thinking.py: Wrap generation in retry loop with dedup check
- test_thinking.py: 7 new tests covering exact match, near match,
  different thoughts, retry behavior, and max-retry discard

+96/-20 lines in thinking.py, +87 lines in tests.
2026-03-14 16:21:16 -04:00
79edfd1106 feat: persist chat history in SQLite — survives server restarts
Replace in-memory MessageLog with SQLite-backed implementation.
Same API surface (append/all/clear/len) so zero caller changes needed.

- data/chat.db stores messages with role, content, timestamp, source
- Lazy DB connection (opened on first use, not at import time)
- Retention policy: oldest messages pruned when count > 500
- New .recent(limit) method for efficient last-N queries
- Thread-safe with explicit locking
- WAL mode for concurrent read performance
- Test isolation: conftest redirects DB to tmp_path per test
- 8 new tests: persistence, retention, concurrency, source field

Closes #46
2026-03-14 16:09:26 -04:00
9535dd86de test: push event system coverage to ≥80% on all three modules
Add 3 targeted tests for infrastructure/error_capture.py:
- test_stale_entries_pruned: exercises dedup cache pruning (line 61)
- test_git_context_fallback_on_failure: exercises exception path (lines 90-91)
- test_returns_none_when_feedback_disabled: exercises early return (line 112)

Coverage results (63 tests, all passing):
- error_capture.py: 75.6% → 80.0%
- broadcaster.py: 93.9% (unchanged)
- bus.py: 92.9% (unchanged)
- Total: 88.1% → 89.4%

Closes #45
2026-03-14 16:01:05 -04:00
70d5dc5ce1 fix: replace eval() with AST-walking safe evaluator in calculator
Fixes #52

- Replace eval() in calculator() with _safe_eval() that walks the AST
  and only permits: numeric constants, arithmetic ops (+,-,*,/,//,%,**),
  unary +/-, math module access, and whitelisted builtins (abs, round,
  min, max)
- Reject all other syntax: imports, attribute access on non-math objects,
  lambdas, comprehensions, string literals, etc.
- Add 39 tests covering arithmetic, precedence, math functions,
  allowed builtins, error handling, and 14 injection prevention cases
2026-03-14 15:51:35 -04:00
782218aa2c fix: voice loop — persistent event loop, markdown stripping, MCP noise
Three fixes from real-world testing:

1. Event loop: replaced asyncio.run() with a persistent loop so
   Agno's MCP sessions survive across conversation turns. No more
   'Event loop is closed' errors on turn 2+.

2. Markdown stripping: voice preamble tells Timmy to respond in
   natural spoken language, plus _strip_markdown() as a safety net
   removes **bold**, *italic*, bullets, headers, code fences, etc.
   TTS no longer reads 'asterisk asterisk'.

3. MCP noise: _suppress_mcp_noise() quiets mcp/agno/httpx loggers
   during voice mode so the terminal shows clean transcript only.

32 tests (12 new for markdown stripping + persistent loop).
2026-03-14 14:05:24 -04:00
dbadfc425d feat: sovereign voice loop — timmy voice command
Adds fully local listen-think-speak voice interface.
STT: Whisper, LLM: Ollama, TTS: Piper. No cloud, no network.

- src/timmy/voice_loop.py: VoiceLoop with VAD, Whisper, Piper
- src/timmy/cli.py: new voice command
- pyproject.toml: voice extras updated
- 20 new tests
2026-03-14 13:58:56 -04:00
2f623826bd cleanup: delete dead modules — ~7,900 lines removed
Closes #22, Closes #23

Deleted: brain/, swarm/, openfang/, paperclip/, cascade_adapter,
memory_migrate, agents/timmy.py, dead routes + all corresponding tests.

Updated pyproject.toml, app.py, loop_qa.py for removed imports.
2026-03-14 09:49:24 -04:00
Trip T
78167675f2 feat: replace custom Gitea client with MCP servers
Replace the bespoke GiteaHand httpx client and tools_gitea.py wrappers
with official MCP tool servers (gitea-mcp + filesystem MCP), wired into
Agno via MCPTools. Switch all session functions to async (arun/acontinue_run)
so MCP tools auto-connect. Delete ~1070 lines of custom Gitea code.

- Create src/timmy/mcp_tools.py with MCP factories + standalone issue bridge
- Wire MCPTools into agent.py tool list (Gitea + filesystem)
- Switch session.py chat/chat_with_tools/continue_chat to async
- Update all callers (dashboard routes, Discord vendor, CLI, thinking engine)
- Add gitea_token fallback from ~/.config/gitea/token
- Add MCP session cleanup to app shutdown hook
- Update tool_safety.py for MCP tool names
- 11 new tests, all 1417 passing, coverage 74.2%

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 21:40:32 -04:00
Trip T
41d6ebaf6a feat: CLI session persistence + tool confirmation gate
- Chat sessions persist across `timmy chat` invocations via Agno SQLite
  (session_id="cli"), fixing context amnesia between turns
- Dangerous tools (shell, write_file, etc.) now prompt for approval in CLI
  instead of silently exiting — uses typer.confirm() + Agno continue_run
- --new flag starts a fresh conversation when needed
- Improved _maybe_file_issues prompt for engineer-quality issue bodies
  (what's happening, expected behavior, suggested fix, acceptance criteria)
- think/status commands also pass session_id for continuity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:55:56 -04:00
Trip T
350e6f54ff fix: prevent "Event loop is closed" on repeated Gitea API calls
The httpx AsyncClient was cached across asyncio.run() boundaries.
Each asyncio.run() creates and closes a new event loop, leaving the
cached client's connections on a dead loop.  Second+ calls would fail
with "Event loop is closed".

Fix: create a fresh client per request and close it in a finally block.
No more cross-loop client reuse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 20:40:39 -04:00
Trip T
7163b15300 feat: add Gitea issue creation — Timmy's self-improvement channel
Give Timmy the ability to file Gitea issues when he notices bugs,
stale state, or improvement opportunities in his own codebase.

Components:
- GiteaHand async API client (infrastructure/hands/gitea.py)
  - Token auth with ~/.config/gitea/token fallback
  - Create/list/close issues, dedup by title similarity
  - Graceful degradation when Gitea unreachable
- Tool functions (timmy/tools_gitea.py)
  - create_gitea_issue: file issues with dedup + work order bridge
  - list_gitea_issues: check existing backlog
  - Classified as SAFE (no confirmation needed)
- Thinking post-hook (_maybe_file_issues in thinking.py)
  - Every 20 thoughts, LLM classifies recent thoughts for actionable items
  - Auto-files bugs/improvements to Gitea with dedup
  - Bridges to local work order system for dashboard tracking
- Config: gitea_url, gitea_token, gitea_repo, gitea_enabled,
  gitea_timeout, thinking_issue_every

All 1426 tests pass, 74.17% coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 18:36:06 -04:00
Trip T
b2f12ca97c feat: consolidate memory into unified memory.db with 4-type model
Consolidates 3 separate memory databases (semantic_memory.db, swarm.db
memory_entries, brain.db) into a single data/memory.db with facts,
chunks, and episodes tables.

Key changes:
- Add unified schema (timmy/memory/unified.py) with 3 core tables
- Redirect vector_store.py and semantic_memory.py to memory.db
- Add thought distillation: every Nth thought extracts lasting facts
- Enrich agent context with known facts in system prompt
- Add memory_forget tool for removing outdated memories
- Unify embeddings: vector_store delegates to semantic_memory.embed_text
- Bridge spark events to unified event log
- Add pruning for thoughts and events with configurable retention
- Add data migration script (timmy/memory_migrate.py)
- Deprecate brain.memory in favor of unified system

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 11:23:18 -04:00
Trip T
d42c574d26 feat: add Loop QA self-testing framework
Structured self-test framework that probes 6 capabilities (tool use,
multistep planning, memory read/write, self-coding, lightning econ) in
round-robin. Reuses existing infra: event_log for persistence,
create_task() for upgrade proposals, capture_error() for crash handling,
and in-memory circuit breaker for failure tracking.

- src/timmy/loop_qa.py: Capability enum, 6 async probes, orchestrator
- src/dashboard/routes/loop_qa.py: JSON + HTMX health endpoints
- HTMX partial polls every 30s on the health panel
- Background scheduler in app.py lifespan
- 25 tests covering probes, orchestrator, health snapshot, routes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 22:33:16 -04:00
Trip T
f1e909b1e3 feat: enrich thinking engine — anti-loop, anti-confabulation, grounding
Rewrite _THINKING_PROMPT with strict rules: 2-3 sentence limit,
anti-confabulation (only reference real data), anti-repetition.

- Add _pick_seed_type() with recent-type dedup (excludes last 3)
- Add _gather_system_snapshot() for real-time grounding (time, thought
  count, chat activity, task queue)
- Improve _build_continuity_context() with anti-repetition header and
  100-char truncation
- Fix journal + memory timestamps to include local timezone
- 12 new TDD tests covering all improvements

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 21:47:28 -04:00
Trip T
f8dadeec59 feat: tick prompt arg + fix name extraction learning verbs as names
Add optional prompt argument to `timmy tick` so custom journal
prompts can be passed from the CLI (seed_type="prompted").

Fix extract_user_name() learning verbs as names (e.g. "Serving").
Now requires the candidate word to start with a capital letter in
the original message, rejects common verb suffixes (-ing, -tion,
etc.), and deduplicates the naive regex in TimmyWithMemory to use
the fixed ConversationManager.extract_user_name() instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 21:11:53 -04:00
Trip T
6a7875e05f feat: heartbeat memory hooks — pre-recall and post-update
Wire MEMORY.md + soul.md into the thinking loop so each heartbeat
is grounded in identity and recent context, breaking repetitive loops.

Pre-hook: _load_memory_context() reads hot memory first (changes each
cycle) then soul.md (stable identity), truncated to 1500 chars.

Post-hook: _update_memory() writes a "Last Reflection" section to
MEMORY.md after each thought so the next cycle has fresh context.

soul.md is read-only from the heartbeat — never modified by it.
All hooks degrade gracefully and never crash the heartbeat.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 20:54:13 -04:00
Trip T
ea2dbdb4b5 fix: test DB isolation, Discord recovery, and over-mocked tests
Test data was bleeding into production tasks.db because
swarm.task_queue.models.DB_PATH (relative path) was never patched in
conftest.clean_database. Fixed by switching to absolute paths via
settings.repo_root and adding the missing module to the patching list.

Discord bot could leak orphaned clients on retry after ERROR state.
Added _cleanup_stale() to close stale client/task before each start()
attempt, with improved logging in the token watcher.

Rewrote test_paperclip_client.py to use httpx.MockTransport instead of
patching _get/_post/_delete — tests now exercise real HTTP status codes,
error handling, and JSON parsing. Added end-to-end test for
capture_error → create_task DB isolation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 20:33:59 -04:00
Trip T
ffdfa53259 fix: Discord token priority — settings before state file
load_token() was checking the state file before settings.discord_token,
so a stale fake token in discord_state.json would block the real token
from .env/DISCORD_TOKEN. Flipped the priority: env config first, state
file as fallback for tokens set via /discord/setup UI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 19:03:24 -04:00
Trip T
f6a6c0f62e feat: upgrade to qwen3.5, self-hosted Gitea CI, optimize Docker image
Model upgrade:
- qwen2.5:14b → qwen3.5:latest across config, tools, and docs
- Added qwen3.5 to multimodal model registry

Self-hosted Gitea CI:
- .gitea/workflows/tests.yml: lint + test jobs via act_runner
- Unified Dockerfile: pre-baked deps from poetry.lock for fast CI
- sitepackages=true in tox for ~2s dep resolution (was ~40s)
- OLLAMA_URL set to dead port in CI to prevent real LLM calls

Test isolation fixes:
- Smoke test fixture mocks create_timmy (was hitting real Ollama)
- WebSocket sends initial_state before joining broadcast pool (race fix)
- Tests use settings.ollama_model/url instead of hardcoded values
- skip_ci marker for Ollama-dependent tests, excluded in CI tox envs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:36:42 -04:00
Alexander Whitestone
36fc10097f Claude/angry cerf (#173)
* feat: set qwen3.5:latest as default model

- Make qwen3.5:latest the primary default model for faster inference
- Move llama3.1:8b-instruct to fallback chain
- Update text fallback chain to prioritize qwen3.5:latest

Retains full backward compatibility via cascade fallback.

* test: remove ~55 brittle, duplicate, and useless tests

Audit of all 100 test files identified tests that provided no real
regression protection. Removed:

- 4 files deleted entirely: test_setup_script (always skipped),
  test_csrf_bypass (tautological assertions), test_input_validation
  (accepts 200-500 status codes), test_security_regression (fragile
  source-pattern checks redundant with rendering tests)
- Duplicate test classes (TestToolTracking, TestCalculatorExtended)
- Mock-only tests that just verify mock wiring, not behavior
- Structurally broken tests (TestCreateToolFunctions patches after import)
- Empty/pass-body tests and meaningless assertions (len > 20)
- Flaky subprocess tests (aider tool calling real binary)

All 1328 remaining tests pass. Net: -699 lines, zero coverage loss.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent test pollution from autoresearch_enabled mutation

test_autoresearch_perplexity.py was setting settings.autoresearch_enabled = True
but never restoring it in the finally block — polluting subsequent tests.
When pytest-randomly ordered it before test_experiments_page_shows_disabled_when_off,
the victim test saw enabled=True and failed to find "Disabled" in the page.

Fix both sides:
- Restore autoresearch_enabled in the finally block (root cause)
- Mock settings explicitly in the victim test (defense in depth)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 16:55:27 -04:00
Alexander Whitestone
68115fe477 fix: update agno to v2 and fix airllm availability tests (#170)
The agno dependency was pinned to <2.0 but the code uses agno.db.sqlite
(a 2.x API), breaking all tests in CI. Also fix airllm provider tests
to patch importlib.util.find_spec (what the production code uses) instead
of builtins.__import__.

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 12:40:45 -04:00
Alexander Whitestone
9d78eb31d1 ruff (#169)
* polish: streamline nav, extract inline styles, improve tablet UX

- Restructure desktop nav from 8+ flat links + overflow dropdown into
  5 grouped dropdowns (Core, Agents, Intel, System, More) matching
  the mobile menu structure to reduce decision fatigue
- Extract all inline styles from mission_control.html and base.html
  notification elements into mission-control.css with semantic classes
- Replace JS-built innerHTML with secure DOM construction in
  notification loader and chat history
- Add CONNECTING state to connection indicator (amber) instead of
  showing OFFLINE before WebSocket connects
- Add tablet breakpoint (1024px) with larger touch targets for
  Apple Pencil / stylus use and safe-area padding for iPad toolbar
- Add active-link highlighting in desktop dropdown menus
- Rename "Mission Control" page title to "System Overview" to
  disambiguate from the chat home page
- Add "Home — Timmy Time" page title to index.html

https://claude.ai/code/session_015uPUoKyYa8M2UAcyk5Gt6h

* fix(security): move auth-gate credentials to environment variables

Hardcoded username, password, and HMAC secret in auth-gate.py replaced
with os.environ lookups. Startup now refuses to run if any variable is
unset. Added AUTH_GATE_SECRET/USER/PASS to .env.example.

https://claude.ai/code/session_015uPUoKyYa8M2UAcyk5Gt6h

* refactor(tooling): migrate from black+isort+bandit to ruff

Replace three separate linting/formatting tools with a single ruff
invocation. Updates tox.ini (lint, format, pre-push, pre-commit envs),
.pre-commit-config.yaml, and CI workflow. Fixes all ruff errors
including unused imports, missing raise-from, and undefined names.
Ruff config maps existing bandit skips to equivalent S-rules.

https://claude.ai/code/session_015uPUoKyYa8M2UAcyk5Gt6h

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-03-11 12:23:35 -04:00
Alexander Whitestone
622a6a9204 polish: extract inline CSS, add connection status, panel macro, favicon, ollama cache, toast system (#164)
Major:
- Extract all inline <style> blocks from 22 Jinja2 templates into
  static/css/mission-control.css — single cacheable stylesheet
- Add tox lint check that fails on inline <style> in templates

Minor:
1. Connection status indicator in topbar (green/amber/red dot) reflecting
   WebSocket + Ollama reachability, with auto-reconnect
2. Jinja2 {% macro panel(title) %} in macros.html — eliminates repeated
   .card.mc-panel markup; index.html converted as example
3. SVG favicon (purple T + orange dot)
4. 30-second TTL cache on _check_ollama() to avoid blocking the event loop
   on every health poll (asyncio.to_thread was already in place)
5. Toast notification system (McToast.show) for transient status messages —
   wired into connection status for Ollama/WebSocket state changes

Enforcement:
- CLAUDE.md updated with conventions 11-14 (no inline CSS, use panel macro,
  use toasts, never block the event loop)
- tox lint + pre-push environments now fail on inline <style> blocks

https://claude.ai/code/session_014FQ785MQdyJQ4BAXrRSo9w

Co-authored-by: Claude <noreply@anthropic.com>
2026-03-11 09:52:57 -04:00
Alexander Whitestone
2a5f317a12 fix: implement @csrf_exempt decorator support in CSRFMiddleware (#159) 2026-03-10 15:26:40 -04:00
Alexander Whitestone
4a4c9be1eb fix: repair broken test patch target and add interview transcript (#156)
- Fix test_autoresearch_perplexity: patch target was
  dashboard.routes.experiments.get_experiment_history but the function
  is imported locally inside the route handler, so patch the source
  module timmy.autoresearch.get_experiment_history instead.
- Add tests for src/timmy/interview.py (previously 0% coverage):
  question structure, run_interview flow, error handling, formatting.
- Produce interview transcript document from structured Timmy interview.

https://claude.ai/code/session_01EXDzXqgsC2ohS8qreF1fBo

Co-authored-by: Claude <noreply@anthropic.com>
2026-03-10 15:26:29 -04:00
Alexander Whitestone
904a7c564e feat: migrate to Agno native HITL tool confirmation flow (#158)
Replace the homebrew regex-based tool extraction and manual dispatch
(tool_executor.py) with Agno's built-in Human-In-The-Loop confirmation:

- Toolkit(requires_confirmation_tools=...) marks dangerous tools
- agent.run() returns RunOutput with status=paused when confirmation needed
- RunRequirement.confirm()/reject() + agent.continue_run() resumes execution

Dashboard and Discord vendor both use the native flow. DuckDuckGo import
isolated so its absence doesn't kill all tools. Test stubs cleaned up
(agno is a real dependency, only truly optional packages stubbed).

1384 tests pass in parallel (~14s).

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 21:54:04 -04:00
Alexander Whitestone
574031a55c fix: remove invalid show_tool_calls kwarg crashing Agent init (#157)
* fix: remove invalid show_tool_calls kwarg crashing Agent init (regression)

show_tool_calls was removed in f95c960 (Feb 26) because agno 2.5.x
doesn't accept it, then reintroduced in fd0ede0 (Mar 8) without
runtime testing — mocked tests hid the breakage.

Replace the bogus assertion with a regression guard and an allowlist
test that catches unknown kwargs before they reach production.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: auto-install git hooks, add black/isort to dev deps

- Add .githooks/ with portable pre-commit hook (macOS + Linux)
- make install now auto-activates hooks via core.hooksPath
- Add black and isort to poetry dev group (were only in CI via raw pip)
- Fix black formatting on 2 files flagged by CI
- Fix test_autoresearch_perplexity patching wrong module path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 15:01:00 -04:00
Alexander Whitestone
21e2ae427a Add test plan for autoresearch with perplexity metric (#154) 2026-03-09 09:36:26 -04:00
Alexander Whitestone
fe484ad7b6 Fix input validation for chat and memory routes (#155) 2026-03-09 09:36:16 -04:00
Alexander Whitestone
82fb2417e3 feat: enable SQLite WAL mode for all databases (AGI ticket #1) (#153) 2026-03-08 16:07:02 -04:00
Alexander Whitestone
8dbce25183 fix: handle concurrent table creation race in SQLite (#151) 2026-03-08 13:27:11 -04:00
Alexander Whitestone
ae3bb1cc21 feat: code quality audit + autoresearch integration + infra hardening (#150) 2026-03-08 12:50:44 -04:00
Alexander Whitestone
fd0ede0d51 feat: auto-escalation system + agentic loop fixes (#149) (#149)
Wire up automatic error-to-task escalation and fix the agentic loop
stopping after the first tool call.

Auto-escalation:
- Add swarm.task_queue.models with create_task() bridge to existing
  task queue SQLite DB
- Add swarm.event_log with EventType enum, log_event(), and SQLite
  persistence + WebSocket broadcast
- Wire capture_error() into request logging middleware so unhandled
  HTTP exceptions auto-create [BUG] tasks with stack traces, git
  context, and push notifications (5-min dedup window)

Agentic loop (Round 11 Bug #1):
- Wrap agent_chat() in asyncio.to_thread() to stop blocking the
  event loop (fixes Discord heartbeat warnings)
- Enable Agno's native multi-turn tool chaining via show_tool_calls
  and tool_call_limit on the Agent config
- Strengthen multi-step continuation prompts with explicit examples

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 03:11:14 -04:00
Alexander Whitestone
7792ae745f feat: agentic loop for multi-step tasks + regression fixes (#148)
* fix: name extraction blocklist, memory preview escaping, and gitignore cleanup

- Add _NAME_BLOCKLIST to extract_user_name() to reject gerunds and UI-state
  words like "Sending" that were incorrectly captured as user names
- Collapse whitespace in get_memory_status() preview so newlines survive
  JSON serialization without showing raw \n escape sequences
- Broaden .gitignore from specific memory/self/user_profile.md to memory/self/
  and untrack memory/self/methodology.md (runtime-edited file)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: catch Ollama connection errors in session.py + add 71 smoke tests

- Wrap agent.run() in session.py with try/except so Ollama connection
  failures return a graceful fallback message instead of dumping raw
  tracebacks to Docker logs
- Add tests/test_smoke.py with 71 tests covering every GET route:
  core pages, feature pages, JSON APIs, and a parametrized no-500 sweep
  — catches import errors, template failures, and schema mismatches
  that unit tests miss

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: agentic loop for multi-step tasks + Round 10 regression fixes

Agentic loop (Parts 1-4):
- Add multi-step chaining instructions to system prompt
- New agentic_loop.py with plan→execute→adapt→summarize flow
- Register plan_and_execute tool for background task execution
- Add max_agent_steps config setting (default: 10)
- Discord fix: 300s timeout, typing indicator, send error handling
- 16 new unit + e2e tests for agentic loop

Round 10 regressions (R1-R5, P1):
- R1: Fix literal \n escape sequences in tool responses
- R2: Chat timeout/error feedback in agent panel
- R3: /hands infinite spinner → static empty states
- R4: /self-coding infinite spinner → static stats + journal
- R5: /grok/status raw JSON → HTML dashboard template
- P1: VETO confirmation dialog on task cards

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: briefing route 500 in CI when agno is MagicMock stub

_call_agent() returned a MagicMock instead of a string when agno is
stubbed in tests, causing SQLite "Error binding parameter 4" on save.
Ensure the return value is always an actual string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: briefing route 500 in CI — graceful degradation at route level

When agno is stubbed with MagicMock in CI, agent.run() returns a
MagicMock instead of raising — so the exception handler never fires
and a MagicMock propagates as the summary to SQLite, which can't
bind it.

Fix: catch at the route level and return a fallback Briefing object.
This follows the project's graceful degradation pattern — the briefing
page always renders, even when the backend is completely unavailable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 01:46:29 -05:00
Alexander Whitestone
b8e0f4539f fix: Discord memory bug — add session continuity + 6 memory system fixes (#147)
Discord created a new agent per message with no conversation history,
causing Timmy to lose context between messages (the "yes" bug). Now uses
a singleton agent with per-channel/thread session_id, matching the
dashboard's session.py pattern. Also applies _clean_response() to strip
hallucinated tool-call JSON from Discord output.

Additional fixes:
- get_system_context() no longer clears the handoff file (was destroying
  session context on every agent creation)
- Orchestrator uses HotMemory.read() to auto-create MEMORY.md if missing
- vector_store DB_PATH anchored to __file__ instead of relative CWD
- brain/schema.py: removed invalid .load dot-commands from INIT_SQL
- tools_intro: fixed wrong table name 'vectors' → 'chunks' in tier3 check

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 00:20:38 -05:00
Alexander Whitestone
4bc53a43f9 fix: Round 4 bug fixes — 8 dashboard bugs + git blocker + Discord regression (#146)
* chore: stop tracking runtime-generated self-modify reports

These 65 files in data/self_modify_reports/ are auto-generated at
runtime and already listed in .gitignore. Tracking them caused
conflicts when pulling from main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve 8 dashboard bugs from Round 4 testing report

- Fix Ollama timeout regression: request_timeout → timeout (agno API)
- Add Bootstrap JS to base.html (fixes creative UI tab switching)
- Send initial_state on Swarm Live WebSocket connect
- Add /api/queue/status endpoint (stops 404 log spam from chat panel)
- Populate agent tools from registry on /tools page
- Add notification bell dropdown with /api/notifications endpoint
- All 1157 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add 17 e2e tests covering all Round 4 bug fixes

Covers: /calm 200, /api/queue/status JSON, Bootstrap JS presence,
Swarm Live WebSocket initial_state, agent tools populated on /tools,
/api/notifications endpoint, Ollama timeout param, full task lifecycle,
and smoke test for all 15 dashboard pages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 23:48:20 -05:00
Alexander Whitestone
248af9ed03 fix: dashboard bugs and clean up build artifacts (#145)
* chore: stop tracking runtime-generated self-modify reports

These 65 files in data/self_modify_reports/ are auto-generated at
runtime and already listed in .gitignore. Tracking them caused
conflicts when pulling from main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve 8 dashboard bugs from Round 4 testing report

- Fix Ollama timeout regression: request_timeout → timeout (agno API)
- Add Bootstrap JS to base.html (fixes creative UI tab switching)
- Send initial_state on Swarm Live WebSocket connect
- Add /api/queue/status endpoint (stops 404 log spam from chat panel)
- Populate agent tools from registry on /tools page
- Add notification bell dropdown with /api/notifications endpoint
- All 1157 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 23:44:56 -05:00
Alexander Whitestone
e36a1dc939 fix: resolve 6 dashboard bugs and rebuild Task Queue + Work Orders (#144) (#144)
Round 2+3 bug fix batch:

1. Ollama timeout: Add request_timeout=300 to prevent socket read errors
   on complex 30-60s prompts (production crash fix)

2. Memory API: Create missing HTMX partial templates (memory_facts.html,
   memory_results.html) so Save/Search buttons work

3. CALM page: Add create_tables() call so SQLAlchemy tables exist on
   first request (was returning HTTP 500)

4. Task Queue: Full SQLite-backed rebuild with CRUD endpoints, HTMX
   partials, and action buttons (approve/veto/pause/cancel/retry)

5. Work Orders: Full SQLite-backed rebuild with submit/approve/reject/
   execute pipeline and HTMX polling partials

6. Memory READ tool: Add memory_read function so Timmy stops calling
   read_file when trying to recall stored facts

Also: Close GitHub issues #115, #114, #112, #110 as won't-fix.
Comment on #107 confirming prune_memories() already wired to startup.

Tests: 33 new tests across 4 test files, all passing.
Full suite: 1155 passed, 2 pre-existing failures (hands_shell).

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 23:21:30 -05:00
Alexander Whitestone
b8164e46b0 fix: remove dead swarm imports, add memory_write tool, and auto-prune on startup (#143)
- Replace dead `from swarm` imports in tools_delegation and tools_intro
  with working implementations sourced from _PERSONAS
- Add `memory_write` tool so the agent can actually persist memories
  when users ask it to remember something
- Enhance `memory_search` to search both vault files AND the runtime
  vector store for cross-channel recall (Discord/web/Telegram)
- Add memory management config: memory_prune_days, memory_prune_keep_facts,
  memory_vault_max_mb
- Auto-prune old vector store entries and warn on vault size at startup
- Update tests for new delegation agent list (mace removed)

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 22:34:30 -05:00
Alexander Whitestone
b615595100 refactor: centralize config & harden security (#141)
* feat: upgrade primary model from llama3.1:8b to qwen2.5:14b

- Swap OLLAMA_MODEL_PRIMARY to qwen2.5:14b for better reasoning
- llama3.1:8b-instruct becomes fallback
- Update .env default and README quick start
- Fix hardcoded model assertions in tests

qwen2.5:14b provides significantly better multi-step reasoning
and tool calling reliability while still running locally on
modest hardware. The 8B model remains as automatic fallback.

* security: centralize config, harden uploads, fix silent exceptions

- Add 9 pydantic Settings fields (skip_embeddings, disable_csrf,
  rqlite_url, brain_source, brain_db_path, csrf_cookie_secure,
  chat_api_max_body_bytes, timmy_test_mode) to centralize env-var access
- Migrate 8 os.environ.get() calls across 5 source files to use
  `from config import settings` per project convention
- Add path traversal defense-in-depth to file upload endpoint
- Add 1MB request body size limit to chat API
- Make CSRF cookie secure flag configurable via settings
- Replace 2 silent `except: pass` blocks with debug logging in session.py
- Remove unused `import os` from brain/memory.py and csrf.py
- Update 5 CSRF test fixtures to patch settings instead of os.environ

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Trip T <trip@local>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 18:49:37 -05:00
Alexander Whitestone
cdd3e1a90b feat: upgrade primary model from llama3.1:8b to qwen2.5:14b (#140)
- Swap OLLAMA_MODEL_PRIMARY to qwen2.5:14b for better reasoning
- llama3.1:8b-instruct becomes fallback
- Update .env default and README quick start
- Fix hardcoded model assertions in tests

qwen2.5:14b provides significantly better multi-step reasoning
and tool calling reliability while still running locally on
modest hardware. The 8B model remains as automatic fallback.

Co-authored-by: Trip T <trip@local>
2026-03-07 18:20:34 -05:00
Alexander Whitestone
480b8d324e security: fix CSRF bypass vulnerabilities via strict path matching and normalization (#138) 2026-03-07 06:45:32 -05:00
Alexander Whitestone
3f06e7231d Improve test coverage from 63.6% to 73.4% and fix test infrastructure (#137) 2026-03-06 13:21:05 -05:00
Alexander Whitestone
3b322d185c feat: add Shell and Git execution hands for Timmy (#136) 2026-03-06 09:01:24 -05:00
Alexander Whitestone
39461858a0 SEC: Fix CSRF bypass via path traversal in exempt routes (#135) 2026-03-06 09:00:56 -05:00
Alexander Whitestone
87dc5eadfe Wire orchestrator pipe into task runner + pipe-verifying integration tests (#134) 2026-03-06 01:20:14 -05:00
AlexanderWhitestone
bbe975ec54 feat: add TDD setup script and functional tests for Sovereign Agent Stack 2026-03-05 21:54:24 -05:00
Alexander Whitestone
fb97625404 Consolidate architecture: flatten agents, kill Redis/Celery, thin routes (#133) 2026-03-05 20:27:02 -05:00