- Add `requests` to conftest.py module stubs so patch("requests.post") works
in reward scoring tests without the package installed
- Use sys.executable instead of bare "python" in git safety tests so the
subprocess finds pytest from the venv rather than system python
https://claude.ai/code/session_012Ye9nyFEiw2QQfx4bZeDmn
Test fixtures that create temporary git repos now set commit.gpgsign=false
to avoid failures in environments with global commit signing configured.
The permission error test is skipped when running as root since file
permissions don't apply to the root user.
https://claude.ai/code/session_018u1fAx2GihSGctYS64tD4H
Inspired by OpenClaw-RL's multi-model orchestration, this adds four
features for custom model management:
1. Custom model registry (infrastructure/models/registry.py) — SQLite-backed
registry for GGUF, safetensors, HF checkpoint, and Ollama models with
role-based lookups (general, reward, teacher, judge).
2. Per-agent model assignment — each swarm persona can use a different model
instead of sharing the global default. Resolved via registry assignment >
persona default > global default.
3. Runtime model management API (/api/v1/models) — REST endpoints to register,
list, assign, enable/disable, and remove custom models without restart.
Includes a dashboard page at /models.
4. Reward model scoring (PRM-style) — majority-vote quality evaluation of
agent outputs using a configurable reward model. Scores persist in SQLite
and feed into the swarm learner.
New config settings: custom_weights_dir, reward_model_enabled,
reward_model_name, reward_model_votes.
54 new tests covering registry CRUD, API endpoints, agent assignments,
role lookups, and reward scoring.
https://claude.ai/code/session_01V4iTozMwcE2gjfnCJdCugC
- Add GrokBackend class in src/timmy/backends.py with full sync/async
support, health checks, usage stats, and cost estimation in sats
- Add consult_grok tool to Timmy's toolkit for proactive Grok queries
- Extend cascade router with Grok provider type for failover chain
- Add Grok Mode toggle card to Mission Control dashboard (HTMX live)
- Add "Ask Grok" button on chat input for direct Grok queries
- Add /grok/* routes: status, toggle, chat, stats endpoints
- Integrate Lightning invoice generation for Grok usage monetization
- Add GROK_ENABLED, XAI_API_KEY, GROK_DEFAULT_MODEL, GROK_MAX_SATS_PER_QUERY,
GROK_FREE config settings via pydantic-settings
- Update .env.example and docker-compose.yml with Grok env vars
- Add 21 tests covering backend, tools, and route endpoints (all green)
Local-first ethos preserved: Grok is premium augmentation only,
disabled by default, and Lightning-payable when enabled.
https://claude.ai/code/session_01FygwN8wS8J6WGZ8FPb7XGV
Enable Timmy to run directly on iPhone by loading a small LLM into
the browser via WebGPU (Safari 26+ / iOS 26+). No server connection
required — fully sovereign, fully offline.
New files:
- static/local_llm.js: WebLLM wrapper with model catalogue, WebGPU
detection, streaming chat, and progress callbacks
- templates/mobile_local.html: Mobile-optimized UI with model
selector, download progress, LOCAL/SERVER badge, and chat
- tests/dashboard/test_local_models.py: 31 tests covering routes,
config, template UX, JS asset, and XSS prevention
Changes:
- config.py: browser_model_enabled, browser_model_id,
browser_model_fallback settings
- routes/mobile.py: /mobile/local page, /mobile/local-models API
- base.html: LOCAL AI nav link
Supported models: SmolLM2-360M (~200MB), Qwen2.5-0.5B (~350MB),
SmolLM2-1.7B (~1GB), Llama-3.2-1B (~700MB). Falls back to
server-side Ollama when local model is unavailable.
https://claude.ai/code/session_01Cqkvr4sZbED7T3iDu1rwSD
Add comprehensive test suite for Hands framework:
TestHandRegistry:
- Load all Hands from directory
- Get Hand by name (with not-found handling)
- Get scheduled vs all Hands
- State management (status updates)
- Approval queue operations
TestHandScheduler:
- Scheduler initialization
- Schedule Hand with cron
- Get scheduled jobs list
- Manual trigger execution
TestHandRunner:
- Load system prompts from SYSTEM.md
- Load skills from skills/ directory
- Build execution prompts
TestHandConfig:
- HandConfig creation and validation
- Cron schedule validation
TestHandModels:
- HandStatus enum values
- HandState serialization to dict
17 tests total, all passing.
Resolved merge conflicts in agents.py and test_task_queue.py:
- Keep full chat-to-task pipeline (agent/priority extraction, question
filtering, context injection) over simpler main version
- Incorporate test_briefing_task_queue_summary from main
- All 64 task queue tests pass
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses 14 bugs from 3 rounds of deep chat evaluation:
- Add chat-to-task pipeline in agents.py with regex-based intent detection,
agent extraction, priority extraction, and title cleaning
- Filter meta-questions ("how do I create a task?") from task creation
- Inject real-time date/time context into every chat message
- Inject live queue state when user asks about tasks
- Ground system prompts with agent roster, honesty guardrails, self-knowledge,
math delegation template, anti-filler rules, values-conflict guidance
- Add CSS for markdown code blocks, inline code, lists, blockquotes in chat
- Add highlight.js CDN for syntax highlighting in chat responses
- Reduce small-model memory context budget (4000→2000) for expanded prompt
- Add 27 comprehensive tests covering the full chat-to-task pipeline
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Chat messages like "add X to the queue" or "create a task" are
intercepted and create a task_queue entry with pending_approval
status instead of going through to the LLM
- Briefing engine now gathers task queue stats (pending, running,
completed, failed) and includes them in the morning briefing prompt
- 7 new tests covering detection patterns, chat integration, and
briefing summary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Task Queue system:
- New /tasks page with three-column layout (Pending/Active/Completed)
- Full CRUD API at /api/tasks with approve/veto/modify/pause/cancel/retry
- SQLite persistence in task_queue table
- WebSocket live updates via ws_manager
- Create task modal with agent assignment and priority
- Auto-approve rules for low-risk tasks
- HTMX polling for real-time column updates
- HOME TASK buttons now link to task queue with agent pre-selected
- MARKET HIRE buttons link to task queue with agent pre-selected
Work Order system:
- External submission API for agents/users (POST /work-orders/submit)
- Risk scoring and configurable auto-execution thresholds
- Dashboard at /work-orders/queue with approve/reject/execute flow
- Integration with swarm task system for execution
UI & Dashboard bug fixes:
- EVENTS: add startup event so page is never empty
- LEDGER: fix empty filter params in URL
- MISSION CONTROL: LLM backend and model now read from /health
- MISSION CONTROL: agent count fallback to /swarm/agents
- SWARM: HTMX fallback loads initial data if WebSocket is slow
- MEMORY: add edit/delete buttons for personal facts
- UPGRADES: add empty state guidance with links
- BRIEFING: add regenerate button and POST /briefing/regenerate endpoint
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add sandboxed calculator tool to Timmy's toolkit so arithmetic questions
get exact answers instead of LLM hallucinations
- Update system prompts (lite + full) to instruct Timmy to always use the
calculator and never attempt multi-digit math in his head
- Add self-contradiction guard to both prompts ("commit to your facts")
- Render Timmy's chat responses as markdown via marked.js + DOMPurify
instead of raw escaped text
- Suppress empty briefing notification on startup when there are 0
pending approval items
- Add calculator to session response sanitizer regex
- 18 new calculator tests, 2 updated briefing notification tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove show_tool_calls kwarg (not in Agno 2.5.3), which crashed Agent.__init__
- Guard memory_search against top_k=None from model, return formatted string
- Skip Telegram/Discord startup silently when no token configured
- Replace placeholder MEMORY.md with proper structured hot memory document
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add fixtures required by functional test suite:
- docker_stack: Docker container test URL (skips if FUNCTIONAL_DOCKER != 1)
- serve_client: FastAPI TestClient for timmy-serve app
- tdd_runner: Alias for self_tdd_runner
Fixes CI errors in test_docker_swarm.py, test_l402_flow.py, test_cli.py
- Fix recall_personal_facts() call - remove unsupported limit parameter
- Replace 4 slow E2E test files with single fast test file
- All 6 E2E tests complete in ~9 seconds (was 60+ seconds)
- Reuse browser session across tests (module-scoped fixture)
- Combine related checks into single tests
- Add HTTP-only smoke test for speed
- YAML-based provider configuration (config/providers.yaml)
- Priority-ordered provider routing
- Circuit breaker pattern for failing providers
- Health check and availability monitoring
- Metrics tracking (latency, errors, success rates)
- Support for Ollama, OpenAI, Anthropic, AirLLM providers
- Automatic failover on rate limits or errors
- REST API endpoints for monitoring and control
- 41 comprehensive tests
API Endpoints:
- POST /api/v1/router/complete - Chat completion with failover
- GET /api/v1/router/status - Provider health status
- GET /api/v1/router/metrics - Detailed metrics
- GET /api/v1/router/providers - List all providers
- POST /api/v1/router/providers/{name}/control - Enable/disable/reset
- POST /api/v1/router/health-check - Run health checks
- GET /api/v1/router/config - View configuration
Timmy was exhibiting severe incoherence (no memory between messages, tool call
leakage, chain-of-thought narration, random tool invocations) due to creating
a brand new agent per HTTP request and giving a 3B model (llama3.2) a 73-line
system prompt with complex tool-calling instructions it couldn't follow.
Key changes:
- Add session.py singleton with stable session_id for conversation continuity
- Add _model_supports_tools() to strip tools from small models (< 7B)
- Add two-tier prompts: lite (12 lines) for small models, full for capable ones
- Add response sanitizer to strip leaked JSON tool calls and CoT narration
- Set show_tool_calls=False to prevent raw tool JSON in output
- Wire ConversationManager for user name extraction
- Deprecate orphaned memory_layers.py (unused 4-layer system)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit replaces the previous memory_layers.py with a proper three-tier
memory system as specified by the user:
## Tier 1 — Hot Memory (MEMORY.md)
- Single flat file always loaded into system context
- Contains: current status, standing rules, agent roster, key decisions
- ~300 lines max, pruned monthly
- Managed by HotMemory class
## Tier 2 — Structured Vault (memory/)
- Directory with three namespaces:
• self/ — identity.md, user_profile.md, methodology.md
• notes/ — session logs, AARs, research
• aar/ — post-task retrospectives
- Markdown format, Obsidian-compatible
- Append-only, date-stamped
- Managed by VaultMemory class
## Handoff Protocol
- last-session-handoff.md written at session end
- Contains: summary, key decisions, open items, next steps
- Auto-loaded at next session start
- Maintains continuity across resets
## Implementation
### New Files:
- src/timmy/memory_system.py — Core memory system
- MEMORY.md — Hot memory template
- memory/self/*.md — Identity, user profile, methodology
### Modified:
- src/timmy/agent.py — Integrated with memory system
- create_timmy() injects memory context
- TimmyWithMemory class with automatic fact extraction
- tests/test_agent.py — Updated for memory context
## Key Principles
- Hot memory = small and curated
- Vault = append-only, never delete
- Handoffs = continuity mechanism
- Flat files = human-readable, portable
## Usage
All 973 tests pass.
Keep Mission Control link from this branch alongside SWARM and SPARK
links from main. All 939 tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The POST /swarm/tasks endpoint now triggers an automatic auction via
asyncio.create_task. Tests must allow tasks to be in bidding, assigned,
or failed status since the background auction may resolve before the
follow-up GET query.
All 895 tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds SelfModifyLoop — an edit→validate→test→commit cycle that can read
its own failure reports, diagnose root causes, and restart autonomously.
Key capabilities:
- Multi-backend LLM: Anthropic Claude API, Ollama, or auto-detect
- Syntax validation via compile() before writing to disk
- Autonomous self-correction loop with configurable max cycles
- XML-based output format to avoid triple-quote delimiter conflicts
- Branch creation skipped by default to prevent container restarts
- CLI: self-modify run "instruction" --backend auto --autonomous
- 939 tests passing, 30 skipped
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
selenium depends on websocket-client which installs a top-level
`websocket` package that shadows our src/websocket/ module on CI.
Renaming to ws_manager eliminates the conflict entirely — no more
sys.path hacks needed in conftest or Selenium tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
selenium depends on websocket-client which installs a top-level
`websocket` package that shadows our src/websocket/ module. Ensure
src/ is inserted at the front of sys.path in conftest so the project
module wins the import race. Fixes collection errors for
test_websocket.py and test_websocket_extended.py on GitHub Actions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add an ollama service (behind --profile ollama) to the test compose stack
and a new test suite that verifies real LLM inference end-to-end:
- docker-compose.test.yml: add ollama/ollama service with health check,
make OLLAMA_URL and OLLAMA_MODEL configurable via env vars
- tests/functional/test_ollama_chat.py: session-scoped fixture that
brings up Ollama + dashboard, pulls qwen2.5:0.5b (~400MB, CPU-only),
and runs chat/history/multi-turn tests against the live stack
- Makefile: add `make test-ollama` target
Run with: make test-ollama (or FUNCTIONAL_DOCKER=1 pytest tests/functional/test_ollama_chat.py -v)
https://claude.ai/code/session_01NTEzfRHSZQCfkfypxgyHKk
Introduces a vendor-agnostic chat platform architecture:
- chat_bridge/base.py: ChatPlatform ABC, ChatMessage, ChatThread
- chat_bridge/registry.py: PlatformRegistry singleton
- chat_bridge/invite_parser.py: QR + Ollama vision invite extraction
- chat_bridge/vendors/discord.py: DiscordVendor with native threads
Workflow: paste a screenshot of a Discord invite or QR code at
POST /discord/join → Timmy extracts the invite automatically.
Every Discord conversation gets its own thread, keeping channels clean.
Bot responds to @mentions and DMs, routes through Timmy agent.
43 new tests (base classes, registry, invite parser, vendor, routes).
https://claude.ai/code/session_01WU4h3cQQiouMwmgYmAgkMM
Three-tier functional test infrastructure:
- CLI tests via Typer CliRunner (timmy, timmy-serve, self-tdd)
- Dashboard integration tests with real TestClient, real SQLite, real
coordinator (no patch/mock — Ollama offline = graceful degradation)
- Docker compose container-level tests (gated by FUNCTIONAL_DOCKER=1)
- End-to-end L402 payment flow with real mock-lightning backend
42 new tests (8 Docker tests skipped without FUNCTIONAL_DOCKER=1).
All 849 tests pass.
https://claude.ai/code/session_01WU4h3cQQiouMwmgYmAgkMM