# Task Classifier Design Document

## Overview

The **Task Classifier** is an enhanced prompt routing system for the Uniwizard agent harness. It classifies incoming user prompts into task categories and maps them to ranked backend preferences, enabling intelligent model selection across the 7-backend fallback chain.

## Goals

1. **Right-size every request**: Route simple queries to fast backends and complex tasks to capable ones
2. **Minimize latency**: Use Groq (284 ms) for fast operations, Anthropic for deep reasoning
3. **Maximize quality**: Match task type to backend strengths
4. **Provide transparency**: Return clear reasoning for routing decisions
5. **Enable fallback**: Support the full 7-backend chain with intelligent ordering

## Architecture

```
┌─────────────────┐
│   User Prompt   │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│ Feature Extraction  │
│ - Length metrics    │
│ - Code detection    │
│ - URL extraction    │
│ - Keyword tokenize  │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Complexity Assess   │
│ - Low/Medium/High   │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Task Classification │
│ - Code              │
│ - Reasoning         │
│ - Research          │
│ - Creative          │
│ - Fast Ops          │
│ - Tool Use          │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Backend Selection   │
│ - Ranked by task    │
│ - Complexity adj.   │
```
```
│ - Feature boosts   │
└────────┬────────────┘
         │
         ▼
┌──────────────────────┐
│ ClassificationResult │
│ - task_type          │
│ - preferred_backends │
│ - complexity         │
│ - reason             │
│ - confidence         │
└──────────────────────┘
```

## Task Types

| Task Type | Description | Primary Indicators |
|-----------|-------------|--------------------|
| `code` | Programming tasks, debugging, refactoring | Keywords: implement, debug, refactor, test, function, class, API |
| `reasoning` | Analysis, comparison, evaluation | Keywords: analyze, compare, evaluate, step by step, trade-offs |
| `research` | Information gathering, literature review | Keywords: research, find, paper, study; URLs present |
| `creative` | Writing, design, content creation | Keywords: write, create, design, story, poem, brainstorm |
| `fast_ops` | Quick status checks, simple queries | Short length (<20 words), simple keywords |
| `tool_use` | Tool invocations, commands, API calls | Keywords: run, execute, use tool, browser, delegate |
| `unknown` | No clear indicators | Fallback classification |

## Backend Strengths Mapping

### 1. Anthropic (Claude)
- **Strengths**: Deep reasoning, code review, complex analysis, tool use
- **Best for**: Reasoning, tool use, complex code review
- **Ranking**: #1 for reasoning, #2 for code, #1 for tool_use

### 2. OpenAI Codex
- **Strengths**: Code generation, feature implementation
- **Best for**: Code tasks, structured outputs
- **Ranking**: #1 for code generation

### 3. Gemini
- **Strengths**: Research, multimodal, long context
- **Best for**: Research tasks, document analysis
- **Ranking**: #1 for research, #2 for reasoning

### 4. Groq
- **Strengths**: Speed (284 ms latency)
- **Best for**: Fast operations, simple queries, triage
- **Ranking**: #1 for fast_ops

### 5. Grok
- **Strengths**: Broad knowledge, creative writing, drafting
- **Best for**: Creative writing, general knowledge
- **Ranking**: #1 for creative

### 6. Kimi (kimi-coding)
- **Strengths**: Long context (262K tokens), code refactoring
- **Best for**: Large codebase work, long document analysis
- **Ranking**: Boosted for high-complexity code/research

### 7. OpenRouter
- **Strengths**: Broadest model access, overflow handling
- **Best for**: Fallback, variety of model choices
- **Ranking**: #6 or #7 across all task types

## Backend Rankings by Task Type

```python
CODE = [
    "openai-codex",  # Best generation
    "anthropic",     # Review & analysis
    "kimi",          # Large codebases
    "gemini",        # Multimodal
    "groq",          # Fast simple tasks
    "openrouter",    # Overflow
    "grok",          # General backup
]

REASONING = [
    "anthropic",     # Deep reasoning
    "gemini",        # Analysis
    "kimi",          # Long chains
    "grok",          # Broad knowledge
    "openai-codex",  # Structured
    "openrouter",
    "groq",
]

RESEARCH = [
    "gemini",        # Research leader
    "kimi",          # 262K context
    "anthropic",     # Deep analysis
    "grok",          # Knowledge
    "openrouter",    # Broad access
    "openai-codex",
    "groq",          # Triage
]

CREATIVE = [
    "grok",          # Creative writing
    "anthropic",     # Nuanced
    "gemini",        # Multimodal
    "openai-codex",  # Creative coding
    "kimi",          # Long-form
    "openrouter",
    "groq",
]

FAST_OPS = [
    "groq",          # 284 ms champion
    "openrouter",    # Fast mini models
    "gemini",        # Flash
    "grok",          # Simple queries
    "anthropic",
    "openai-codex",
    "kimi",
]

TOOL_USE = [
    "anthropic",     # Tool use leader
    "openai-codex",  # Good integration
    "gemini",        # Multimodal
    "groq",          # Fast chaining
    "kimi",          # Long sessions
    "openrouter",
    "grok",
]
```

## Complexity Assessment

Complexity is determined by the following metrics:

| Metric | Low | Medium | High |
|--------|-----|--------|------|
| Characters | ≤200 | 201-800 | >800 |
| Words | ≤35 | 36-150 | >150 |
| Lines | ≤3 | 4-15 | >15 |
| URLs | 0 | 1 | ≥2 |
| Code Blocks | 0 | 1 | ≥2 |

**Rules:**

- 2+ high metrics → **HIGH** complexity
- 2+ medium metrics or 1 high → **MEDIUM** complexity
- Otherwise → **LOW** complexity

### Complexity Adjustments

- **HIGH complexity + RESEARCH/CODE**: Boost Kimi and Gemini in rankings
- **LOW complexity + FAST_OPS**: Ensure Groq is first
- **Code blocks present**: Boost OpenAI Codex in any task type
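The threshold table and bucketing rules above can be sketched as a small helper. This is a minimal illustration, not the shipped `task_classifier.py` API: the names `assess_complexity`, `bucket`, and `THRESHOLDS` are hypothetical, and the URL/code-block detection is deliberately naive.

```python
import re

# Illustrative sketch of the complexity rules above; names are
# hypothetical, not the real task_classifier.py API.
LOW, MEDIUM, HIGH = "low", "medium", "high"

# (low_max, medium_max) per metric; anything above medium_max is high,
# mirroring the threshold table in this section.
THRESHOLDS = {
    "chars": (200, 800),
    "words": (35, 150),
    "lines": (3, 15),
    "urls": (0, 1),
    "code_blocks": (0, 1),
}

def bucket(metric: str, value: int) -> str:
    """Map one metric value to low/medium/high per THRESHOLDS."""
    low_max, med_max = THRESHOLDS[metric]
    if value <= low_max:
        return LOW
    return MEDIUM if value <= med_max else HIGH

def assess_complexity(prompt: str) -> str:
    """Apply the 2+ high / 2+ medium-or-1-high rules from above."""
    metrics = {
        "chars": len(prompt),
        "words": len(prompt.split()),
        "lines": prompt.count("\n") + 1,
        "urls": len(re.findall(r"https?://\S+", prompt)),
        "code_blocks": prompt.count("```") // 2,  # naive fence pairing
    }
    buckets = [bucket(m, v) for m, v in metrics.items()]
    if buckets.count(HIGH) >= 2:
        return HIGH
    if buckets.count(MEDIUM) >= 2 or buckets.count(HIGH) == 1:
        return MEDIUM
    return LOW
```

For example, a short status query stays LOW, while a 50-word prompt crosses both the character and word thresholds into MEDIUM.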
## Keyword Dictionaries

The classifier uses curated keyword sets for each task type:

### Code Keywords (100+)
- Implementation: implement, code, function, class, module
- Debugging: debug, error, exception, traceback, bug, fix
- Testing: test, pytest, unittest, coverage
- Operations: deploy, docker, kubernetes, ci/cd, pipeline
- Concepts: api, endpoint, database, query, authentication

### Reasoning Keywords (50+)
- Analysis: analyze, evaluate, assess, critique, review
- Logic: reason, deduce, infer, logic, argument, evidence
- Process: compare, contrast, trade-off, strategy, plan
- Modifiers: step by step, chain of thought, think through

### Research Keywords (80+)
- Actions: research, find, search, explore, discover
- Sources: paper, publication, journal, arxiv, dataset
- Methods: study, survey, experiment, benchmark, evaluation
- Domains: machine learning, neural network, sota, literature

### Creative Keywords (100+)
- Visual: art, paint, draw, design, graphic, image
- Writing: write, story, novel, poem, essay, content
- Audio: music, song, compose, melody, sound
- Process: brainstorm, ideate, concept, imagine, inspire

### Fast Ops Keywords (60+)
- Simple: quick, fast, brief, simple, easy, status
- Actions: list, show, get, check, count, find
- Short queries: hi, hello, thanks, yes/no, what is

### Tool Use Keywords (70+)
- Actions: run, execute, call, use tool, invoke
- Systems: terminal, shell, docker, kubernetes, git
- Protocols: api, http, request, response, webhook
- Agents: delegate, subagent, spawn, mcp

## API

### Classify a Prompt

```python
from task_classifier import TaskClassifier, classify_prompt

# Method 1: Using the class
classifier = TaskClassifier()
result = classifier.classify("Implement a Python function")

print(result.task_type)          # TaskType.CODE
print(result.preferred_backends) # ["openai-codex", "anthropic", ...]
print(result.complexity)         # ComplexityLevel.LOW
print(result.reason)             # "Task: code; Complexity: low; ..."
```
```python
print(result.confidence)         # 0.75

# Method 2: Convenience function
output = classify_prompt("Research AI papers")
# Returns dict:
# {
#     "task_type": "research",
#     "preferred_backends": ["gemini", "kimi", ...],
#     "complexity": "low",
#     "reason": "...",
#     "confidence": 0.65,
#     "features": {...}
# }
```

### ClassificationResult Fields

| Field | Type | Description |
|-------|------|-------------|
| `task_type` | TaskType | Classified task category |
| `preferred_backends` | List[str] | Ranked list of backend identifiers |
| `complexity` | ComplexityLevel | Assessed complexity level |
| `reason` | str | Human-readable classification reasoning |
| `confidence` | float | 0.0-1.0 confidence score |
| `features` | Dict | Extracted features (lengths, code, URLs) |

## Integration with Hermes

### Usage in Smart Model Routing

The task classifier replaces or enhances the existing `smart_model_routing.py`:

```python
# In hermes-agent/agent/smart_model_routing.py
from uniwizard.task_classifier import TaskClassifier

classifier = TaskClassifier()

def resolve_turn_route(user_message, routing_config, primary, fallback_chain):
    # Classify the prompt
    result = classifier.classify(user_message)

    # Map preferred backends to actual models from fallback_chain
    for backend in result.preferred_backends:
        model_config = fallback_chain.get(backend)
        # is_available() is a backend health check defined elsewhere
        if model_config and is_available(backend):
            return {
                "model": model_config["model"],
                "provider": backend,
                "reason": result.reason,
                "complexity": result.complexity.value,
            }

    # Fall back to the primary route
    return primary
```

### Configuration

```yaml
# config.yaml
smart_model_routing:
  enabled: true
  use_task_classifier: true
  fallback_providers:
    - provider: anthropic
      model: claude-opus-4-6
    - provider: openai-codex
      model: codex
    - provider: gemini
      model: gemini-2.5-flash
    - provider: groq
      model: llama-3.3-70b-versatile
    - provider: grok
      model: grok-3-mini-fast
    - provider: kimi-coding
      model: kimi-k2.5
    - provider: openrouter
      model: openai/gpt-4.1-mini
```

## Testing

Run the test suite:

```bash
cd ~/.timmy/uniwizard
python -m pytest test_task_classifier.py -v
```

Coverage includes:

- Feature extraction (URLs, code blocks, length metrics)
- Complexity assessment (low/medium/high)
- Task type classification (all 6 types)
- Backend selection (rankings by task type)
- Complexity adjustments (boosts for Kimi/Gemini)
- Edge cases (empty, whitespace, very long prompts)
- Integration scenarios (realistic use cases)

## Future Enhancements

1. **Session Context**: Use conversation history for better classification
2. **Performance Feedback**: Learn from actual backend performance
3. **User Preferences**: Allow user-defined backend preferences
4. **Cost Optimization**: Factor in backend costs for routing
5. **Streaming Detection**: Identify streaming-suitable tasks
6. **Multi-Modal**: Better handling of image/audio inputs
7. **Confidence Thresholds**: Configurable confidence cutoffs

## Files

| File | Description |
|------|-------------|
| `task_classifier.py` | Main implementation (600+ lines) |
| `test_task_classifier.py` | Unit tests (400+ lines) |
| `task_classifier_design.md` | This design document |

## References

- Gitea Issue: timmy-home #88
- Existing: `~/.hermes/hermes-agent/agent/smart_model_routing.py`
- Config: `~/.hermes/config.yaml` (fallback_providers chain)