# Task Classifier Design Document
## Overview
The **Task Classifier** is an enhanced prompt routing system for the Uniwizard agent harness. It classifies incoming user prompts into task categories and maps them to ranked backend preferences, enabling intelligent model selection across the 7-backend fallback chain.
## Goals
1. **Right-size every request**: Route simple queries to fast backends, complex tasks to capable ones
2. **Minimize latency**: Use Groq (284ms) for fast operations, Anthropic for deep reasoning
3. **Maximize quality**: Match task type to backend strengths
4. **Provide transparency**: Return clear reasoning for routing decisions
5. **Enable fallback**: Support the full 7-backend chain with intelligent ordering
## Architecture
```
┌──────────────────────┐
│     User Prompt      │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Feature Extraction   │
│ - Length metrics     │
│ - Code detection     │
│ - URL extraction     │
│ - Keyword tokenize   │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Complexity Assess    │
│ - Low/Medium/High    │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Task Classification  │
│ - Code               │
│ - Reasoning          │
│ - Research           │
│ - Creative           │
│ - Fast Ops           │
│ - Tool Use           │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Backend Selection    │
│ - Ranked by task     │
│ - Complexity adj.    │
│ - Feature boosts     │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ ClassificationResult │
│ - task_type          │
│ - preferred_backends │
│ - complexity         │
│ - reason             │
│ - confidence         │
└──────────────────────┘
```
## Task Types
| Task Type | Description | Primary Indicators |
|-----------|-------------|-------------------|
| `code` | Programming tasks, debugging, refactoring | Keywords: implement, debug, refactor, test, function, class, API |
| `reasoning` | Analysis, comparison, evaluation | Keywords: analyze, compare, evaluate, step by step, trade-offs |
| `research` | Information gathering, literature review | Keywords: research, find, paper, study, URLs present |
| `creative` | Writing, design, content creation | Keywords: write, create, design, story, poem, brainstorm |
| `fast_ops` | Quick status checks, simple queries | Short length (<20 words), simple keywords |
| `tool_use` | Tool invocations, commands, API calls | Keywords: run, execute, use tool, browser, delegate |
| `unknown` | No clear indicators | Fallback classification |
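The categories above can be captured as an enum. This is a minimal sketch; the value strings mirror the table and the `TaskType.CODE` usage shown in the API section, but the exact definition in `task_classifier.py` may differ:

```python
from enum import Enum

class TaskType(Enum):
    """Task categories from the table above."""
    CODE = "code"
    REASONING = "reasoning"
    RESEARCH = "research"
    CREATIVE = "creative"
    FAST_OPS = "fast_ops"
    TOOL_USE = "tool_use"
    UNKNOWN = "unknown"  # fallback when no indicators match
```

Using string values keeps the enum trivially serializable for the dict output of `classify_prompt`.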
## Backend Strengths Mapping
### 1. Anthropic (Claude)
- **Strengths**: Deep reasoning, code review, complex analysis, tool use
- **Best for**: Reasoning, tool_use, complex code review
- **Ranking**: #1 for reasoning, #2 for code, #1 for tool_use
### 2. OpenAI Codex
- **Strengths**: Code generation, feature implementation
- **Best for**: Code tasks, structured outputs
- **Ranking**: #1 for code generation
### 3. Gemini
- **Strengths**: Research, multimodal, long context
- **Best for**: Research tasks, document analysis
- **Ranking**: #1 for research, #2 for reasoning
### 4. Groq
- **Strengths**: Speed (284ms latency)
- **Best for**: Fast operations, simple queries, triage
- **Ranking**: #1 for fast_ops
### 5. Grok
- **Strengths**: Broad knowledge, creative, drafting
- **Best for**: Creative writing, general knowledge
- **Ranking**: #1 for creative
### 6. Kimi (kimi-coding)
- **Strengths**: Long context (262K tokens), code refactoring
- **Best for**: Large codebase work, long document analysis
- **Ranking**: Boosted for high-complexity code/research
### 7. OpenRouter
- **Strengths**: Broadest model access, overflow handling
- **Best for**: Fallback, variety of model choices
- **Ranking**: #6 or #7 across all task types
## Backend Rankings by Task Type
```python
CODE = [
openai-codex, # Best generation
anthropic, # Review & analysis
kimi, # Large codebases
gemini, # Multimodal
groq, # Fast simple tasks
openrouter, # Overflow
grok, # General backup
]
REASONING = [
anthropic, # Deep reasoning
gemini, # Analysis
kimi, # Long chains
grok, # Broad knowledge
openai-codex, # Structured
openrouter,
groq,
]
RESEARCH = [
gemini, # Research leader
kimi, # 262K context
anthropic, # Deep analysis
grok, # Knowledge
openrouter, # Broad access
openai-codex,
groq, # Triage
]
CREATIVE = [
grok, # Creative writing
anthropic, # Nuanced
gemini, # Multimodal
openai-codex, # Creative coding
kimi, # Long-form
openrouter,
groq,
]
FAST_OPS = [
groq, # 284ms champion
openrouter, # Fast mini models
gemini, # Flash
grok, # Simple queries
anthropic,
openai-codex,
kimi,
]
TOOL_USE = [
anthropic, # Tool use leader
openai-codex, # Good integration
gemini, # Multimodal
groq, # Fast chaining
kimi, # Long sessions
openrouter,
grok,
]
```
## Complexity Assessment
Complexity is determined by:
| Metric | Low | Medium | High |
|--------|-----|--------|------|
| Characters | ≤200 | 201-800 | >800 |
| Words | ≤35 | 36-150 | >150 |
| Lines | ≤3 | 4-15 | >15 |
| URLs | 0 | 1 | ≥2 |
| Code Blocks | 0 | 1 | ≥2 |
**Rules:**
- 2+ high metrics → **HIGH** complexity
- 2+ medium metrics or 1 high → **MEDIUM** complexity
- Otherwise → **LOW** complexity
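The thresholds and rules above can be sketched as a small function. This is an illustrative implementation, not the exact code in `task_classifier.py`; the metric extraction (URL and code-block counting in particular) is simplified:

```python
from enum import Enum

class ComplexityLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

def assess_complexity(text: str) -> ComplexityLevel:
    """Apply the threshold table, then the 2+ high / 2+ medium rules."""
    chars = len(text)
    words = len(text.split())
    lines = text.count("\n") + 1
    urls = text.count("http://") + text.count("https://")
    code_blocks = text.count("```") // 2  # naive fence-pair count

    highs = sum([chars > 800, words > 150, lines > 15,
                 urls >= 2, code_blocks >= 2])
    mediums = sum([201 <= chars <= 800, 36 <= words <= 150,
                   4 <= lines <= 15, urls == 1, code_blocks == 1])

    if highs >= 2:
        return ComplexityLevel.HIGH
    if mediums >= 2 or highs == 1:
        return ComplexityLevel.MEDIUM
    return ComplexityLevel.LOW
```

A 50-word prompt, for example, trips both the character and word medium thresholds and lands on MEDIUM.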
### Complexity Adjustments
- **HIGH complexity + RESEARCH/CODE**: Boost Kimi and Gemini in rankings
- **LOW complexity + FAST_OPS**: Ensure Groq is first
- **Code blocks present**: Boost OpenAI Codex in any task type
## Keyword Dictionaries
The classifier uses curated keyword sets for each task type:
### Code Keywords (100+)
- Implementation: implement, code, function, class, module
- Debugging: debug, error, exception, traceback, bug, fix
- Testing: test, pytest, unittest, coverage
- Operations: deploy, docker, kubernetes, ci/cd, pipeline
- Concepts: api, endpoint, database, query, authentication
### Reasoning Keywords (50+)
- Analysis: analyze, evaluate, assess, critique, review
- Logic: reason, deduce, infer, logic, argument, evidence
- Process: compare, contrast, trade-off, strategy, plan
- Modifiers: step by step, chain of thought, think through
### Research Keywords (80+)
- Actions: research, find, search, explore, discover
- Sources: paper, publication, journal, arxiv, dataset
- Methods: study, survey, experiment, benchmark, evaluation
- Domains: machine learning, neural network, sota, literature
### Creative Keywords (100+)
- Visual: art, paint, draw, design, graphic, image
- Writing: write, story, novel, poem, essay, content
- Audio: music, song, compose, melody, sound
- Process: brainstorm, ideate, concept, imagine, inspire
### Fast Ops Keywords (60+)
- Simple: quick, fast, brief, simple, easy, status
- Actions: list, show, get, check, count, find
- Short queries: hi, hello, thanks, yes/no, what is
### Tool Use Keywords (70+)
- Actions: run, execute, call, use tool, invoke
- Systems: terminal, shell, docker, kubernetes, git
- Protocols: api, http, request, response, webhook
- Agents: delegate, subagent, spawn, mcp
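A minimal sketch of how keyword sets drive classification: score each task type by overlap with the prompt's tokens and take the best nonzero score. The sets below are small samples of the curated dictionaries above, and the scoring is deliberately simpler than the real classifier (no weighting, no phrase matching):

```python
import re

# Toy keyword sets; the real dictionaries are far larger (50-100+ entries each)
KEYWORDS = {
    "code": {"implement", "debug", "refactor", "function", "class", "api"},
    "reasoning": {"analyze", "compare", "evaluate", "trade-off"},
    "research": {"research", "paper", "study", "arxiv"},
    "creative": {"write", "story", "poem", "design", "brainstorm"},
    "fast_ops": {"quick", "status", "list", "check"},
    "tool_use": {"run", "execute", "invoke", "terminal", "delegate"},
}

def classify_by_keywords(prompt: str) -> str:
    """Return the task type with the most keyword hits, else 'unknown'."""
    tokens = set(re.findall(r"[a-z\-]+", prompt.lower()))
    scores = {task: len(tokens & kw) for task, kw in KEYWORDS.items()}
    best_task, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_task if best_score > 0 else "unknown"
```

Multi-word phrases like "step by step" need phrase-level matching on top of this token overlap, which is one reason the production dictionaries are larger and more structured.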
## API
### Classify a Prompt
```python
from task_classifier import TaskClassifier, classify_prompt
# Method 1: Using the class
classifier = TaskClassifier()
result = classifier.classify("Implement a Python function")
print(result.task_type) # TaskType.CODE
print(result.preferred_backends) # ["openai-codex", "anthropic", ...]
print(result.complexity) # ComplexityLevel.LOW
print(result.reason) # "Task: code; Complexity: low; ..."
print(result.confidence) # 0.75
# Method 2: Convenience function
output = classify_prompt("Research AI papers")
# Returns dict: {
# "task_type": "research",
# "preferred_backends": ["gemini", "kimi", ...],
# "complexity": "low",
# "reason": "...",
# "confidence": 0.65,
# "features": {...}
# }
```
### ClassificationResult Fields
| Field | Type | Description |
|-------|------|-------------|
| `task_type` | TaskType | Classified task category |
| `preferred_backends` | List[str] | Ranked list of backend identifiers |
| `complexity` | ComplexityLevel | Assessed complexity level |
| `reason` | str | Human-readable classification reasoning |
| `confidence` | float | 0.0-1.0 confidence score |
| `features` | Dict | Extracted features (lengths, code, URLs) |
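The fields above map naturally onto a dataclass. This sketch simplifies `task_type` and `complexity` to plain strings for brevity (the real implementation uses the `TaskType` and `ComplexityLevel` enums), and the `to_dict` shape mirrors the `classify_prompt` output shown in the API section:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ClassificationResult:
    """Result of classifying one prompt; field names match the table above."""
    task_type: str
    preferred_backends: List[str]
    complexity: str
    reason: str
    confidence: float  # 0.0-1.0
    features: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Serializable form, matching the classify_prompt() convenience output
        return {
            "task_type": self.task_type,
            "preferred_backends": self.preferred_backends,
            "complexity": self.complexity,
            "reason": self.reason,
            "confidence": self.confidence,
            "features": self.features,
        }
```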
## Integration with Hermes
### Usage in Smart Model Routing
The task classifier replaces or enhances the existing `smart_model_routing.py`:
```python
# In hermes-agent/agent/smart_model_routing.py
from uniwizard.task_classifier import TaskClassifier

classifier = TaskClassifier()

def resolve_turn_route(user_message, routing_config, primary, fallback_chain):
    # Classify the prompt
    result = classifier.classify(user_message)
    # Map preferred backends to actual models from fallback_chain
    # (is_available is the harness's existing health-check helper)
    for backend in result.preferred_backends:
        model_config = fallback_chain.get(backend)
        if model_config and is_available(backend):
            return {
                "model": model_config["model"],
                "provider": backend,
                "reason": result.reason,
                "complexity": result.complexity.value,
            }
    # Fall back to the configured primary
    return primary
```
### Configuration
```yaml
# config.yaml
smart_model_routing:
  enabled: true
  use_task_classifier: true
  fallback_providers:
    - provider: anthropic
      model: claude-opus-4-6
    - provider: openai-codex
      model: codex
    - provider: gemini
      model: gemini-2.5-flash
    - provider: groq
      model: llama-3.3-70b-versatile
    - provider: grok
      model: grok-3-mini-fast
    - provider: kimi-coding
      model: kimi-k2.5
    - provider: openrouter
      model: openai/gpt-4.1-mini
```
## Testing
Run the test suite:
```bash
cd ~/.timmy/uniwizard
python -m pytest test_task_classifier.py -v
```
Coverage includes:
- Feature extraction (URLs, code blocks, length metrics)
- Complexity assessment (low/medium/high)
- Task type classification (all 6 types)
- Backend selection (rankings by task type)
- Complexity adjustments (boosts for Kimi/Gemini)
- Edge cases (empty, whitespace, very long prompts)
- Integration scenarios (realistic use cases)
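A self-contained sketch of the kind of feature-extraction edge-case checks the suite covers. The `extract_features` helper here is a stand-in for illustration, not the real `task_classifier` API:

```python
import re

def extract_features(prompt: str) -> dict:
    """Stand-in extractor mirroring the documented feature set."""
    fence = "`" * 3  # code-fence marker, built to avoid a literal in this doc
    return {
        "chars": len(prompt),
        "words": len(prompt.split()),
        "lines": prompt.count("\n") + 1,
        "urls": len(re.findall(r"https?://\S+", prompt)),
        "code_blocks": prompt.count(fence) // 2,
    }

def test_empty_prompt():
    f = extract_features("")
    assert f["chars"] == 0 and f["words"] == 0 and f["urls"] == 0

def test_url_detection():
    f = extract_features("Summarize https://arxiv.org/abs/1234.5678 please")
    assert f["urls"] == 1
```

The real tests exercise the same boundaries (empty input, whitespace-only input, URL and fence counting) through the classifier's public API.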
## Future Enhancements
1. **Session Context**: Use conversation history for better classification
2. **Performance Feedback**: Learn from actual backend performance
3. **User Preferences**: Allow user-defined backend preferences
4. **Cost Optimization**: Factor in backend costs for routing
5. **Streaming Detection**: Identify streaming-suitable tasks
6. **Multi-Modal**: Better handling of image/audio inputs
7. **Confidence Thresholds**: Configurable confidence cutoffs
## Files
| File | Description |
|------|-------------|
| `task_classifier.py` | Main implementation (600+ lines) |
| `test_task_classifier.py` | Unit tests (400+ lines) |
| `task_classifier_design.md` | This design document |
## References
- Gitea Issue: timmy-home #88
- Existing: `~/.hermes/hermes-agent/agent/smart_model_routing.py`
- Config: `~/.hermes/config.yaml` (fallback_providers chain)