Task Classifier Design Document
Overview
The Task Classifier is an enhanced prompt routing system for the Uniwizard agent harness. It classifies incoming user prompts into task categories and maps them to ranked backend preferences, enabling intelligent model selection across the 7-backend fallback chain.
Goals
- Right-size every request: Route simple queries to fast backends, complex tasks to capable ones
- Minimize latency: Use Groq (284ms) for fast operations, Anthropic for deep reasoning
- Maximize quality: Match task type to backend strengths
- Provide transparency: Return clear reasoning for routing decisions
- Enable fallback: Support the full 7-backend chain with intelligent ordering
Architecture
```
┌─────────────────┐
│   User Prompt   │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│ Feature Extraction  │
│ - Length metrics    │
│ - Code detection    │
│ - URL extraction    │
│ - Keyword tokenize  │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Complexity Assess   │
│ - Low/Medium/High   │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Task Classification │
│ - Code              │
│ - Reasoning         │
│ - Research          │
│ - Creative          │
│ - Fast Ops          │
│ - Tool Use          │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Backend Selection   │
│ - Ranked by task    │
│ - Complexity adj.   │
│ - Feature boosts    │
└────────┬────────────┘
         │
         ▼
┌──────────────────────┐
│ ClassificationResult │
│ - task_type          │
│ - preferred_backends │
│ - complexity         │
│ - reason             │
│ - confidence         │
└──────────────────────┘
```
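The first stage of the pipeline can be sketched with stdlib tools only (the feature names and regexes here are assumptions based on the diagram; the actual extractor may differ):

```python
import re

def extract_features(prompt: str) -> dict:
    """Stage 1: cheap lexical features extracted from the raw prompt."""
    return {
        "chars": len(prompt),
        "words": len(prompt.split()),
        "lines": prompt.count("\n") + 1,
        "urls": len(re.findall(r"https?://\S+", prompt)),
        # Fenced blocks come in open/close pairs, so halve the fence count
        "code_blocks": prompt.count("```") // 2,
    }
```

These features feed both the complexity assessment and the keyword-based classification stages that follow.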
Task Types
| Task Type | Description | Primary Indicators |
|---|---|---|
| `code` | Programming tasks, debugging, refactoring | Keywords: implement, debug, refactor, test, function, class, API |
| `reasoning` | Analysis, comparison, evaluation | Keywords: analyze, compare, evaluate, step by step, trade-offs |
| `research` | Information gathering, literature review | Keywords: research, find, paper, study; URLs present |
| `creative` | Writing, design, content creation | Keywords: write, create, design, story, poem, brainstorm |
| `fast_ops` | Quick status checks, simple queries | Short length (<20 words), simple keywords |
| `tool_use` | Tool invocations, commands, API calls | Keywords: run, execute, use tool, browser, delegate |
| `unknown` | No clear indicators | Fallback classification |
Backend Strengths Mapping
1. Anthropic (Claude)
- Strengths: Deep reasoning, code review, complex analysis, tool use
- Best for: Reasoning, tool_use, complex code review
- Ranking: #1 for reasoning, #2 for code, #1 for tool_use
2. OpenAI Codex
- Strengths: Code generation, feature implementation
- Best for: Code tasks, structured outputs
- Ranking: #1 for code generation
3. Gemini
- Strengths: Research, multimodal, long context
- Best for: Research tasks, document analysis
- Ranking: #1 for research, #2 for reasoning
4. Groq
- Strengths: Speed (284ms latency)
- Best for: Fast operations, simple queries, triage
- Ranking: #1 for fast_ops
5. Grok
- Strengths: Broad knowledge, creative, drafting
- Best for: Creative writing, general knowledge
- Ranking: #1 for creative
6. Kimi (kimi-coding)
- Strengths: Long context (262K tokens), code refactoring
- Best for: Large codebase work, long document analysis
- Ranking: Boosted for high-complexity code/research
7. OpenRouter
- Strengths: Broadest model access, overflow handling
- Best for: Fallback, variety of model choices
- Ranking: #6 or #7 across all task types
Backend Rankings by Task Type
```python
CODE = [
    "openai-codex",  # Best generation
    "anthropic",     # Review & analysis
    "kimi",          # Large codebases
    "gemini",        # Multimodal
    "groq",          # Fast simple tasks
    "openrouter",    # Overflow
    "grok",          # General backup
]

REASONING = [
    "anthropic",     # Deep reasoning
    "gemini",        # Analysis
    "kimi",          # Long chains
    "grok",          # Broad knowledge
    "openai-codex",  # Structured
    "openrouter",
    "groq",
]

RESEARCH = [
    "gemini",        # Research leader
    "kimi",          # 262K context
    "anthropic",     # Deep analysis
    "grok",          # Knowledge
    "openrouter",    # Broad access
    "openai-codex",
    "groq",          # Triage
]

CREATIVE = [
    "grok",          # Creative writing
    "anthropic",     # Nuanced
    "gemini",        # Multimodal
    "openai-codex",  # Creative coding
    "kimi",          # Long-form
    "openrouter",
    "groq",
]

FAST_OPS = [
    "groq",          # 284ms champion
    "openrouter",    # Fast mini models
    "gemini",        # Flash
    "grok",          # Simple queries
    "anthropic",
    "openai-codex",
    "kimi",
]

TOOL_USE = [
    "anthropic",     # Tool use leader
    "openai-codex",  # Good integration
    "gemini",        # Multimodal
    "groq",          # Fast chaining
    "kimi",          # Long sessions
    "openrouter",
    "grok",
]
```
Complexity Assessment
Complexity is determined by:
| Metric | Low | Medium | High |
|---|---|---|---|
| Characters | ≤200 | 201-800 | >800 |
| Words | ≤35 | 36-150 | >150 |
| Lines | ≤3 | 4-15 | >15 |
| URLs | 0 | 1 | ≥2 |
| Code Blocks | 0 | 1 | ≥2 |
Rules:
- 2+ high metrics → HIGH complexity
- 2+ medium metrics or 1 high → MEDIUM complexity
- Otherwise → LOW complexity
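The bucketing logic implied by the table and rules above can be sketched as follows (the `(low_max, medium_max)` thresholds are copied from the table; the exact implementation may differ):

```python
def assess_complexity(features: dict) -> str:
    """Bucket each metric, then apply the 2+ high / 2+ medium rules."""
    # (low_max, medium_max) per metric, taken from the complexity table
    thresholds = {
        "chars": (200, 800),
        "words": (35, 150),
        "lines": (3, 15),
        "urls": (0, 1),
        "code_blocks": (0, 1),
    }
    high = medium = 0
    for metric, (low_max, med_max) in thresholds.items():
        value = features.get(metric, 0)
        if value > med_max:
            high += 1
        elif value > low_max:
            medium += 1
    if high >= 2:
        return "high"
    if medium >= 2 or high == 1:
        return "medium"
    return "low"
```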
Complexity Adjustments
- HIGH complexity + RESEARCH/CODE: Boost Kimi and Gemini in rankings
- LOW complexity + FAST_OPS: Ensure Groq is first
- Code blocks present: Boost OpenAI Codex in any task type
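These adjustment rules amount to a small re-ranking pass over the base list. A sketch (`adjust_rankings` and `promote` are illustrative helpers, not the shipped implementation; backend names match the ranking lists above):

```python
def adjust_rankings(backends: list, task_type: str, complexity: str,
                    features: dict) -> list:
    """Re-order a ranked backend list per the adjustment rules."""
    ranked = list(backends)

    def promote(name: str) -> None:
        # Move an existing backend to the front of the ranking
        if name in ranked:
            ranked.insert(0, ranked.pop(ranked.index(name)))

    # Code blocks present: boost OpenAI Codex in any task type
    if features.get("code_blocks", 0) > 0:
        promote("openai-codex")
    # HIGH complexity + research/code: boost Kimi and Gemini
    if complexity == "high" and task_type in ("research", "code"):
        promote("gemini")
        promote("kimi")  # promoted last, so Kimi ends up first
    # LOW complexity + fast_ops: ensure Groq is first
    if complexity == "low" and task_type == "fast_ops":
        promote("groq")
    return ranked
```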
Keyword Dictionaries
The classifier uses curated keyword sets for each task type:
Code Keywords (100+)
- Implementation: implement, code, function, class, module
- Debugging: debug, error, exception, traceback, bug, fix
- Testing: test, pytest, unittest, coverage
- Operations: deploy, docker, kubernetes, ci/cd, pipeline
- Concepts: api, endpoint, database, query, authentication
Reasoning Keywords (50+)
- Analysis: analyze, evaluate, assess, critique, review
- Logic: reason, deduce, infer, logic, argument, evidence
- Process: compare, contrast, trade-off, strategy, plan
- Modifiers: step by step, chain of thought, think through
Research Keywords (80+)
- Actions: research, find, search, explore, discover
- Sources: paper, publication, journal, arxiv, dataset
- Methods: study, survey, experiment, benchmark, evaluation
- Domains: machine learning, neural network, sota, literature
Creative Keywords (100+)
- Visual: art, paint, draw, design, graphic, image
- Writing: write, story, novel, poem, essay, content
- Audio: music, song, compose, melody, sound
- Process: brainstorm, ideate, concept, imagine, inspire
Fast Ops Keywords (60+)
- Simple: quick, fast, brief, simple, easy, status
- Actions: list, show, get, check, count, find
- Short queries: hi, hello, thanks, yes/no, what is
Tool Use Keywords (70+)
- Actions: run, execute, call, use tool, invoke
- Systems: terminal, shell, docker, kubernetes, git
- Protocols: api, http, request, response, webhook
- Agents: delegate, subagent, spawn, mcp
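One plausible way to turn these dictionaries into a label is a count-and-argmax over token overlap. A sketch with heavily abbreviated keyword sets (the real dictionaries are much larger, and the real scorer may weight matches differently):

```python
# Abbreviated keyword sets; the shipped dictionaries hold 50-100+ entries each.
KEYWORDS = {
    "code": {"implement", "debug", "refactor", "function", "class", "api"},
    "reasoning": {"analyze", "compare", "evaluate", "assess"},
    "research": {"research", "paper", "study", "arxiv"},
    "creative": {"write", "story", "poem", "brainstorm"},
    "fast_ops": {"quick", "status", "list", "check"},
    "tool_use": {"run", "execute", "invoke", "delegate"},
}

def classify_task(prompt: str) -> str:
    """Score each task type by keyword hits; fall back to 'unknown'."""
    tokens = set(prompt.lower().split())
    scores = {task: len(tokens & kws) for task, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```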
API
Classify a Prompt
```python
from task_classifier import TaskClassifier, classify_prompt

# Method 1: Using the class
classifier = TaskClassifier()
result = classifier.classify("Implement a Python function")
print(result.task_type)          # TaskType.CODE
print(result.preferred_backends) # ["openai-codex", "anthropic", ...]
print(result.complexity)         # ComplexityLevel.LOW
print(result.reason)             # "Task: code; Complexity: low; ..."
print(result.confidence)         # 0.75

# Method 2: Convenience function
output = classify_prompt("Research AI papers")
# Returns dict: {
#     "task_type": "research",
#     "preferred_backends": ["gemini", "kimi", ...],
#     "complexity": "low",
#     "reason": "...",
#     "confidence": 0.65,
#     "features": {...}
# }
```
ClassificationResult Fields
| Field | Type | Description |
|---|---|---|
| `task_type` | `TaskType` | Classified task category |
| `preferred_backends` | `List[str]` | Ranked list of backend identifiers |
| `complexity` | `ComplexityLevel` | Assessed complexity level |
| `reason` | `str` | Human-readable classification reasoning |
| `confidence` | `float` | Confidence score in [0.0, 1.0] |
| `features` | `Dict` | Extracted features (lengths, code, URLs) |
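The fields above map naturally onto a Python dataclass. The following is a sketch consistent with the API examples in this document, not necessarily the exact definitions in `task_classifier.py`:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class TaskType(Enum):
    CODE = "code"
    REASONING = "reasoning"
    RESEARCH = "research"
    CREATIVE = "creative"
    FAST_OPS = "fast_ops"
    TOOL_USE = "tool_use"
    UNKNOWN = "unknown"

class ComplexityLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

@dataclass
class ClassificationResult:
    task_type: TaskType
    preferred_backends: List[str]
    complexity: ComplexityLevel
    reason: str
    confidence: float            # 0.0-1.0
    features: Dict = field(default_factory=dict)
```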
Integration with Hermes
Usage in Smart Model Routing
The task classifier replaces or augments the existing `smart_model_routing.py`:
```python
# In hermes-agent/agent/smart_model_routing.py
from uniwizard.task_classifier import TaskClassifier

classifier = TaskClassifier()

def resolve_turn_route(user_message, routing_config, primary, fallback_chain):
    # Classify the prompt
    result = classifier.classify(user_message)

    # Map preferred backends to actual models from fallback_chain
    for backend in result.preferred_backends:
        model_config = fallback_chain.get(backend)
        if model_config and is_available(backend):
            return {
                "model": model_config["model"],
                "provider": backend,
                "reason": result.reason,
                "complexity": result.complexity.value,
            }

    # Fallback to primary
    return primary
```
Configuration
```yaml
# config.yaml
smart_model_routing:
  enabled: true
  use_task_classifier: true
  fallback_providers:
    - provider: anthropic
      model: claude-opus-4-6
    - provider: openai-codex
      model: codex
    - provider: gemini
      model: gemini-2.5-flash
    - provider: groq
      model: llama-3.3-70b-versatile
    - provider: grok
      model: grok-3-mini-fast
    - provider: kimi-coding
      model: kimi-k2.5
    - provider: openrouter
      model: openai/gpt-4.1-mini
```
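Once the YAML is parsed (e.g. with PyYAML), the `fallback_providers` list can be indexed by provider name for constant-time lookup in `resolve_turn_route`. A sketch with an inline parsed config (`build_fallback_chain` is a hypothetical helper, and the dict below is an abbreviated stand-in for the parsed YAML):

```python
# Abbreviated stand-in for the parsed config.yaml
config = {
    "smart_model_routing": {
        "enabled": True,
        "use_task_classifier": True,
        "fallback_providers": [
            {"provider": "anthropic", "model": "claude-opus-4-6"},
            {"provider": "openai-codex", "model": "codex"},
            {"provider": "groq", "model": "llama-3.3-70b-versatile"},
        ],
    }
}

def build_fallback_chain(cfg: dict) -> dict:
    """Map each provider name to its model entry for O(1) lookup."""
    providers = cfg["smart_model_routing"]["fallback_providers"]
    return {p["provider"]: {"model": p["model"]} for p in providers}
```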
Testing
Run the test suite:
```shell
cd ~/.timmy/uniwizard
python -m pytest test_task_classifier.py -v
```
Coverage includes:
- Feature extraction (URLs, code blocks, length metrics)
- Complexity assessment (low/medium/high)
- Task type classification (all 6 types)
- Backend selection (rankings by task type)
- Complexity adjustments (boosts for Kimi/Gemini)
- Edge cases (empty, whitespace, very long prompts)
- Integration scenarios (realistic use cases)
Future Enhancements
- Session Context: Use conversation history for better classification
- Performance Feedback: Learn from actual backend performance
- User Preferences: Allow user-defined backend preferences
- Cost Optimization: Factor in backend costs for routing
- Streaming Detection: Identify streaming-suitable tasks
- Multi-Modal: Better handling of image/audio inputs
- Confidence Thresholds: Configurable confidence cutoffs
Files
| File | Description |
|---|---|
| `task_classifier.py` | Main implementation (600+ lines) |
| `test_task_classifier.py` | Unit tests (400+ lines) |
| `task_classifier_design.md` | This design document |
References
- Gitea Issue: timmy-home #88
- Existing routing: `~/.hermes/hermes-agent/agent/smart_model_routing.py`
- Config: `~/.hermes/config.yaml` (`fallback_providers` chain)