Task Classifier Design Document

Overview

The Task Classifier is an enhanced prompt routing system for the Uniwizard agent harness. It classifies incoming user prompts into task categories and maps them to ranked backend preferences, enabling intelligent model selection across the 7-backend fallback chain.

Goals

  1. Right-size every request: Route simple queries to fast backends, complex tasks to capable ones
  2. Minimize latency: Use Groq (284ms) for fast operations, Anthropic for deep reasoning
  3. Maximize quality: Match task type to backend strengths
  4. Provide transparency: Return clear reasoning for routing decisions
  5. Enable fallback: Support the full 7-backend chain with intelligent ordering

Architecture

┌─────────────────┐
│  User Prompt    │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│  Feature Extraction │
│  - Length metrics   │
│  - Code detection   │
│  - URL extraction   │
│  - Keyword tokenize │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Complexity Assess  │
│  - Low/Medium/High  │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Task Classification│
│  - Code             │
│  - Reasoning        │
│  - Research         │
│  - Creative         │
│  - Fast Ops         │
│  - Tool Use         │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Backend Selection  │
│  - Ranked by task   │
│  - Complexity adj.  │
│  - Feature boosts   │
└────────┬────────────┘
         │
         ▼
┌──────────────────────┐
│ ClassificationResult │
│  - task_type         │
│  - preferred_backends│
│  - complexity        │
│  - reason            │
│  - confidence        │
└──────────────────────┘

Task Types

| Task Type | Description | Primary Indicators |
|-----------|-------------|--------------------|
| code | Programming tasks, debugging, refactoring | Keywords: implement, debug, refactor, test, function, class, API |
| reasoning | Analysis, comparison, evaluation | Keywords: analyze, compare, evaluate, step by step, trade-offs |
| research | Information gathering, literature review | Keywords: research, find, paper, study; URLs present |
| creative | Writing, design, content creation | Keywords: write, create, design, story, poem, brainstorm |
| fast_ops | Quick status checks, simple queries | Short length (<20 words), simple keywords |
| tool_use | Tool invocations, commands, API calls | Keywords: run, execute, use tool, browser, delegate |
| unknown | No clear indicators | Fallback classification |
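These categories map naturally onto an enum. The sketch below is an assumed shape consistent with the table and the API section, not a copy of the actual task_classifier.py source:

```python
from enum import Enum

class TaskType(Enum):
    """Task categories from the table above (hypothetical mirror of task_classifier.py)."""
    CODE = "code"
    REASONING = "reasoning"
    RESEARCH = "research"
    CREATIVE = "creative"
    FAST_OPS = "fast_ops"
    TOOL_USE = "tool_use"
    UNKNOWN = "unknown"
```

Using string values keeps the enum directly serializable for the dict returned by classify_prompt (e.g. `TaskType("code") is TaskType.CODE`).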

Backend Strengths Mapping

1. Anthropic (Claude)

  • Strengths: Deep reasoning, code review, complex analysis, tool use
  • Best for: Reasoning, tool_use, complex code review
  • Ranking: #1 for reasoning, #2 for code, #1 for tool_use

2. OpenAI Codex

  • Strengths: Code generation, feature implementation
  • Best for: Code tasks, structured outputs
  • Ranking: #1 for code generation

3. Gemini

  • Strengths: Research, multimodal, long context
  • Best for: Research tasks, document analysis
  • Ranking: #1 for research, #2 for reasoning

4. Groq

  • Strengths: Speed (284ms latency)
  • Best for: Fast operations, simple queries, triage
  • Ranking: #1 for fast_ops

5. Grok

  • Strengths: Broad knowledge, creative, drafting
  • Best for: Creative writing, general knowledge
  • Ranking: #1 for creative

6. Kimi (kimi-coding)

  • Strengths: Long context (262K tokens), code refactoring
  • Best for: Large codebase work, long document analysis
  • Ranking: Boosted for high-complexity code/research

7. OpenRouter

  • Strengths: Broadest model access, overflow handling
  • Best for: Fallback, variety of model choices
  • Ranking: #6 or #7 for most task types (but #2 for fast_ops, thanks to fast mini models)

Backend Rankings by Task Type

```python
CODE = [
    "openai-codex",  # Best generation
    "anthropic",     # Review & analysis
    "kimi",          # Large codebases
    "gemini",        # Multimodal
    "groq",          # Fast simple tasks
    "openrouter",    # Overflow
    "grok",          # General backup
]

REASONING = [
    "anthropic",     # Deep reasoning
    "gemini",        # Analysis
    "kimi",          # Long chains
    "grok",          # Broad knowledge
    "openai-codex",  # Structured
    "openrouter",
    "groq",
]

RESEARCH = [
    "gemini",        # Research leader
    "kimi",          # 262K context
    "anthropic",     # Deep analysis
    "grok",          # Knowledge
    "openrouter",    # Broad access
    "openai-codex",
    "groq",          # Triage
]

CREATIVE = [
    "grok",          # Creative writing
    "anthropic",     # Nuanced
    "gemini",        # Multimodal
    "openai-codex",  # Creative coding
    "kimi",          # Long-form
    "openrouter",
    "groq",
]

FAST_OPS = [
    "groq",          # 284ms champion
    "openrouter",    # Fast mini models
    "gemini",        # Flash
    "grok",          # Simple queries
    "anthropic",
    "openai-codex",
    "kimi",
]

TOOL_USE = [
    "anthropic",     # Tool use leader
    "openai-codex",  # Good integration
    "gemini",        # Multimodal
    "groq",          # Fast chaining
    "kimi",          # Long sessions
    "openrouter",
    "grok",
]
```

Complexity Assessment

Complexity is determined by:

| Metric | Low | Medium | High |
|--------|-----|--------|------|
| Characters | ≤200 | 201-800 | >800 |
| Words | ≤35 | 36-150 | >150 |
| Lines | ≤3 | 4-15 | >15 |
| URLs | 0 | 1 | ≥2 |
| Code Blocks | 0 | 1 | ≥2 |

Rules:

  • 2+ high metrics → HIGH complexity
  • 2+ medium metrics or 1 high → MEDIUM complexity
  • Otherwise → LOW complexity
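The thresholds and rules above can be sketched as follows (a minimal illustration; the function name and return values are assumptions, not the actual implementation):

```python
def assess_complexity(chars: int, words: int, lines: int,
                      urls: int, code_blocks: int) -> str:
    """Apply the per-metric thresholds, then the 2+ high / 2+ medium-or-1-high rules."""
    # Per-metric (medium lower bound, high lower bound) from the table above.
    thresholds = {
        "chars": (201, 801),
        "words": (36, 151),
        "lines": (4, 16),
        "urls": (1, 2),
        "code_blocks": (1, 2),
    }
    values = {"chars": chars, "words": words, "lines": lines,
              "urls": urls, "code_blocks": code_blocks}
    highs = sum(values[m] >= hi for m, (_, hi) in thresholds.items())
    mediums = sum(med <= values[m] < hi for m, (med, hi) in thresholds.items())
    if highs >= 2:
        return "high"
    if mediums >= 2 or highs == 1:
        return "medium"
    return "low"
```

Note that a single medium metric still yields LOW: only two mediums (or one high) promote a prompt to MEDIUM.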

Complexity Adjustments

  • HIGH complexity + RESEARCH/CODE: Boost Kimi and Gemini in rankings
  • LOW complexity + FAST_OPS: Ensure Groq is first
  • Code blocks present: Boost OpenAI Codex in any task type
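One way to express these adjustments as list surgery on a ranking (a sketch; the real boost logic may weight rather than reorder):

```python
def adjust_ranking(ranking, task_type, complexity, has_code_blocks):
    """Reorder a copy of the ranking per the three adjustment rules above."""
    ranking = list(ranking)  # never mutate the shared base ranking

    def promote(backend):
        # Move the backend to the front if it is present in the ranking.
        if backend in ranking:
            ranking.remove(backend)
            ranking.insert(0, backend)

    if has_code_blocks:
        promote("openai-codex")   # code blocks boost Codex in any task type
    if complexity == "high" and task_type in ("research", "code"):
        promote("gemini")         # boost Gemini...
        promote("kimi")           # ...then Kimi, so Kimi lands first of the two
    if complexity == "low" and task_type == "fast_ops":
        promote("groq")           # Groq must lead for cheap fast ops
    return ranking
```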

Keyword Dictionaries

The classifier uses curated keyword sets for each task type:

Code Keywords (100+)

  • Implementation: implement, code, function, class, module
  • Debugging: debug, error, exception, traceback, bug, fix
  • Testing: test, pytest, unittest, coverage
  • Operations: deploy, docker, kubernetes, ci/cd, pipeline
  • Concepts: api, endpoint, database, query, authentication

Reasoning Keywords (50+)

  • Analysis: analyze, evaluate, assess, critique, review
  • Logic: reason, deduce, infer, logic, argument, evidence
  • Process: compare, contrast, trade-off, strategy, plan
  • Modifiers: step by step, chain of thought, think through

Research Keywords (80+)

  • Actions: research, find, search, explore, discover
  • Sources: paper, publication, journal, arxiv, dataset
  • Methods: study, survey, experiment, benchmark, evaluation
  • Domains: machine learning, neural network, sota, literature

Creative Keywords (100+)

  • Visual: art, paint, draw, design, graphic, image
  • Writing: write, story, novel, poem, essay, content
  • Audio: music, song, compose, melody, sound
  • Process: brainstorm, ideate, concept, imagine, inspire

Fast Ops Keywords (60+)

  • Simple: quick, fast, brief, simple, easy, status
  • Actions: list, show, get, check, count, find
  • Short queries: hi, hello, thanks, yes/no, what is

Tool Use Keywords (70+)

  • Actions: run, execute, call, use tool, invoke
  • Systems: terminal, shell, docker, kubernetes, git
  • Protocols: api, http, request, response, webhook
  • Agents: delegate, subagent, spawn, mcp
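These dictionaries feed a simple scoring pass. The sketch below uses toy subsets of the keyword sets (the real ones are far larger) and an assumed max-hits scoring rule, falling back to unknown when nothing matches:

```python
# Toy subsets of the keyword dictionaries above; illustrative only.
KEYWORDS = {
    "code": {"implement", "debug", "refactor", "function", "class"},
    "reasoning": {"analyze", "compare", "evaluate", "trade-off"},
    "research": {"research", "paper", "arxiv", "dataset"},
    "creative": {"write", "story", "poem", "brainstorm"},
    "fast_ops": {"quick", "status", "list", "check"},
    "tool_use": {"run", "execute", "invoke", "delegate"},
}

def classify_by_keywords(prompt: str) -> str:
    """Count keyword hits per category and pick the best-scoring one."""
    tokens = set(prompt.lower().split())
    scores = {task: len(tokens & words) for task, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

A real implementation would also handle multi-word keywords like "step by step", which a plain whitespace tokenizer misses.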

API

Classify a Prompt

```python
from task_classifier import TaskClassifier, classify_prompt

# Method 1: Using the class
classifier = TaskClassifier()
result = classifier.classify("Implement a Python function")

print(result.task_type)           # TaskType.CODE
print(result.preferred_backends)  # ["openai-codex", "anthropic", ...]
print(result.complexity)          # ComplexityLevel.LOW
print(result.reason)              # "Task: code; Complexity: low; ..."
print(result.confidence)          # 0.75

# Method 2: Convenience function
output = classify_prompt("Research AI papers")
# Returns dict: {
#   "task_type": "research",
#   "preferred_backends": ["gemini", "kimi", ...],
#   "complexity": "low",
#   "reason": "...",
#   "confidence": 0.65,
#   "features": {...}
# }
```

ClassificationResult Fields

| Field | Type | Description |
|-------|------|-------------|
| task_type | TaskType | Classified task category |
| preferred_backends | List[str] | Ranked list of backend identifiers |
| complexity | ComplexityLevel | Assessed complexity level |
| reason | str | Human-readable classification reasoning |
| confidence | float | 0.0-1.0 confidence score |
| features | Dict | Extracted features (lengths, code, URLs) |
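The fields map naturally onto a dataclass. This is an assumed shape consistent with the table (the TaskType and ComplexityLevel enums are replaced with plain strings here to keep the sketch self-contained):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ClassificationResult:
    task_type: str                 # e.g. "code" (TaskType value)
    preferred_backends: List[str]  # ranked backend identifiers
    complexity: str                # "low" | "medium" | "high"
    reason: str                    # human-readable reasoning
    confidence: float              # 0.0-1.0
    features: Dict[str, object] = field(default_factory=dict)
```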

Integration with Hermes

Usage in Smart Model Routing

The task classifier can replace or augment the existing smart_model_routing.py:

```python
# In hermes-agent/agent/smart_model_routing.py
from uniwizard.task_classifier import TaskClassifier

classifier = TaskClassifier()

def resolve_turn_route(user_message, routing_config, primary, fallback_chain):
    # Classify the prompt
    result = classifier.classify(user_message)

    # Map preferred backends to actual models from fallback_chain
    for backend in result.preferred_backends:
        model_config = fallback_chain.get(backend)
        if model_config and is_available(backend):
            return {
                "model": model_config["model"],
                "provider": backend,
                "reason": result.reason,
                "complexity": result.complexity.value,
            }

    # Fallback to primary
    return primary
```

Configuration

```yaml
# config.yaml
smart_model_routing:
  enabled: true
  use_task_classifier: true

fallback_providers:
  - provider: anthropic
    model: claude-opus-4-6
  - provider: openai-codex
    model: codex
  - provider: gemini
    model: gemini-2.5-flash
  - provider: groq
    model: llama-3.3-70b-versatile
  - provider: grok
    model: grok-3-mini-fast
  - provider: kimi-coding
    model: kimi-k2.5
  - provider: openrouter
    model: openai/gpt-4.1-mini
```
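For resolve_turn_route above, fallback_chain is expected to be keyed by provider. A minimal sketch of building that mapping (the loader function is an assumption; in practice the list would come from parsing config.yaml):

```python
# The fallback_providers list from config.yaml, mirrored as Python data
# (in practice this would be parsed from the YAML file, e.g. with PyYAML).
fallback_providers = [
    {"provider": "anthropic", "model": "claude-opus-4-6"},
    {"provider": "openai-codex", "model": "codex"},
    {"provider": "gemini", "model": "gemini-2.5-flash"},
    {"provider": "groq", "model": "llama-3.3-70b-versatile"},
    {"provider": "grok", "model": "grok-3-mini-fast"},
    {"provider": "kimi-coding", "model": "kimi-k2.5"},
    {"provider": "openrouter", "model": "openai/gpt-4.1-mini"},
]

def build_fallback_chain(providers):
    """Key the provider list by name so a ranked backend list can be resolved in order."""
    return {entry["provider"]: entry for entry in providers}
```

Note one wrinkle: the rankings refer to the backend as "kimi" while the config provider id is "kimi-coding", so the mapping layer must reconcile the two identifiers.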

Testing

Run the test suite:

```shell
cd ~/.timmy/uniwizard
python -m pytest test_task_classifier.py -v
```

Coverage includes:

  • Feature extraction (URLs, code blocks, length metrics)
  • Complexity assessment (low/medium/high)
  • Task type classification (all 6 types)
  • Backend selection (rankings by task type)
  • Complexity adjustments (boosts for Kimi/Gemini)
  • Edge cases (empty, whitespace, very long prompts)
  • Integration scenarios (realistic use cases)

Future Enhancements

  1. Session Context: Use conversation history for better classification
  2. Performance Feedback: Learn from actual backend performance
  3. User Preferences: Allow user-defined backend preferences
  4. Cost Optimization: Factor in backend costs for routing
  5. Streaming Detection: Identify streaming-suitable tasks
  6. Multi-Modal: Better handling of image/audio inputs
  7. Confidence Thresholds: Configurable confidence cutoffs

Files

| File | Description |
|------|-------------|
| task_classifier.py | Main implementation (600+ lines) |
| test_task_classifier.py | Unit tests (400+ lines) |
| task_classifier_design.md | This design document |

References

  • Gitea Issue: timmy-home #88
  • Existing: ~/.hermes/hermes-agent/agent/smart_model_routing.py
  • Config: ~/.hermes/config.yaml (fallback_providers chain)