# Task Classifier Design Document

## Overview

The **Task Classifier** is an enhanced prompt-routing system for the Uniwizard agent harness. It classifies incoming user prompts into task categories and maps each category to a ranked list of backend preferences, enabling intelligent model selection across the 7-backend fallback chain.

## Goals

1. **Right-size every request**: Route simple queries to fast backends and complex tasks to capable ones
2. **Minimize latency**: Use Groq (~284 ms) for fast operations and Anthropic for deep reasoning
3. **Maximize quality**: Match task type to backend strengths
4. **Provide transparency**: Return clear reasoning for every routing decision
5. **Enable fallback**: Support the full 7-backend chain with intelligent ordering

## Architecture

```
┌─────────────────┐
│   User Prompt   │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│ Feature Extraction  │
│ - Length metrics    │
│ - Code detection    │
│ - URL extraction    │
│ - Keyword tokenize  │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Complexity Assess   │
│ - Low/Medium/High   │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Task Classification │
│ - Code              │
│ - Reasoning         │
│ - Research          │
│ - Creative          │
│ - Fast Ops          │
│ - Tool Use          │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Backend Selection   │
│ - Ranked by task    │
│ - Complexity adj.   │
│ - Feature boosts    │
└────────┬────────────┘
         │
         ▼
┌──────────────────────┐
│ ClassificationResult │
│ - task_type          │
│ - preferred_backends │
│ - complexity         │
│ - reason             │
│ - confidence         │
└──────────────────────┘
```

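
The Feature Extraction stage can be sketched as below. This is an illustrative assumption, not the real implementation: the function name and the exact regexes are hypothetical.

```python
import re

def extract_features(prompt: str) -> dict:
    """Collect the length, code, and URL signals consumed by the later stages.
    Hypothetical sketch; the real extractor lives in task_classifier.py."""
    return {
        "chars": len(prompt),
        "words": len(prompt.split()),
        "lines": prompt.count("\n") + 1,
        "urls": len(re.findall(r"https?://\S+", prompt)),
        "code_blocks": prompt.count("```") // 2,  # paired fences
    }
```

Each downstream stage (complexity assessment, classification, backend selection) reads only this feature dict, which keeps the stages independently testable.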
## Task Types

| Task Type | Description | Primary Indicators |
|-----------|-------------|--------------------|
| `code` | Programming tasks, debugging, refactoring | Keywords: implement, debug, refactor, test, function, class, API |
| `reasoning` | Analysis, comparison, evaluation | Keywords: analyze, compare, evaluate, step by step, trade-offs |
| `research` | Information gathering, literature review | Keywords: research, find, paper, study; URLs present |
| `creative` | Writing, design, content creation | Keywords: write, create, design, story, poem, brainstorm |
| `fast_ops` | Quick status checks, simple queries | Short length (<20 words), simple keywords |
| `tool_use` | Tool invocations, commands, API calls | Keywords: run, execute, use tool, browser, delegate |
| `unknown` | No clear indicators | Fallback classification |

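
One way the table's rules could resolve to a single decision, sketched under the assumption that per-type keyword scores are computed upstream (the function name and the 20-word threshold are illustrative):

```python
def classify_task(prompt: str, scores: dict) -> str:
    """Pick the top-scoring task type; short prompts with no keyword
    signal fall through to fast_ops, long ones to unknown. Sketch only."""
    top = max(scores.values(), default=0)
    if top == 0:
        # No clear indicators: short prompts are fast ops, the rest unknown.
        return "fast_ops" if len(prompt.split()) < 20 else "unknown"
    return max(scores, key=scores.get)
```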
## Backend Strengths Mapping

### 1. Anthropic (Claude)

- **Strengths**: Deep reasoning, code review, complex analysis, tool use
- **Best for**: Reasoning, tool use, complex code review
- **Ranking**: #1 for reasoning, #1 for tool_use, #2 for code

### 2. OpenAI Codex

- **Strengths**: Code generation, feature implementation
- **Best for**: Code tasks, structured outputs
- **Ranking**: #1 for code generation

### 3. Gemini

- **Strengths**: Research, multimodal, long context
- **Best for**: Research tasks, document analysis
- **Ranking**: #1 for research, #2 for reasoning

### 4. Groq

- **Strengths**: Speed (~284 ms latency)
- **Best for**: Fast operations, simple queries, triage
- **Ranking**: #1 for fast_ops

### 5. Grok

- **Strengths**: Broad knowledge, creative writing, drafting
- **Best for**: Creative writing, general knowledge
- **Ranking**: #1 for creative

### 6. Kimi (kimi-coding)

- **Strengths**: Long context (262K tokens), code refactoring
- **Best for**: Large-codebase work, long-document analysis
- **Ranking**: Boosted for high-complexity code/research

### 7. OpenRouter

- **Strengths**: Broadest model access, overflow handling
- **Best for**: Fallback, variety of model choices
- **Ranking**: #6 or #7 across all task types

## Backend Rankings by Task Type

The per-task rankings are ordered lists of backend identifiers:

```python
CODE = [
    "openai-codex",  # Best generation
    "anthropic",     # Review & analysis
    "kimi",          # Large codebases
    "gemini",        # Multimodal
    "groq",          # Fast simple tasks
    "openrouter",    # Overflow
    "grok",          # General backup
]

REASONING = [
    "anthropic",     # Deep reasoning
    "gemini",        # Analysis
    "kimi",          # Long chains
    "grok",          # Broad knowledge
    "openai-codex",  # Structured
    "openrouter",
    "groq",
]

RESEARCH = [
    "gemini",        # Research leader
    "kimi",          # 262K context
    "anthropic",     # Deep analysis
    "grok",          # Knowledge
    "openrouter",    # Broad access
    "openai-codex",
    "groq",          # Triage
]

CREATIVE = [
    "grok",          # Creative writing
    "anthropic",     # Nuanced prose
    "gemini",        # Multimodal
    "openai-codex",  # Creative coding
    "kimi",          # Long-form
    "openrouter",
    "groq",
]

FAST_OPS = [
    "groq",          # ~284 ms champion
    "openrouter",    # Fast mini models
    "gemini",        # Flash
    "grok",          # Simple queries
    "anthropic",
    "openai-codex",
    "kimi",
]

TOOL_USE = [
    "anthropic",     # Tool-use leader
    "openai-codex",  # Good integration
    "gemini",        # Multimodal
    "groq",          # Fast chaining
    "kimi",          # Long sessions
    "openrouter",
    "grok",
]
```

## Complexity Assessment

Complexity is determined by:

| Metric | Low | Medium | High |
|--------|-----|--------|------|
| Characters | ≤200 | 201-800 | >800 |
| Words | ≤35 | 36-150 | >150 |
| Lines | ≤3 | 4-15 | >15 |
| URLs | 0 | 1 | ≥2 |
| Code Blocks | 0 | 1 | ≥2 |

**Rules:**

- 2+ high metrics → **HIGH** complexity
- 1 high metric or 2+ medium metrics → **MEDIUM** complexity
- Otherwise → **LOW** complexity
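
The bucketing rules can be sketched directly; thresholds come from the table, while the function name is an illustrative assumption:

```python
def assess_complexity(chars: int, words: int, lines: int,
                      urls: int, code_blocks: int) -> str:
    """Bucket each metric per the table, then apply the promotion rules.
    Sketch only; the real assessor lives in task_classifier.py."""
    high = sum([chars > 800, words > 150, lines > 15,
                urls >= 2, code_blocks >= 2])
    medium = sum([200 < chars <= 800, 35 < words <= 150,
                  3 < lines <= 15, urls == 1, code_blocks == 1])
    if high >= 2:
        return "high"
    if high == 1 or medium >= 2:
        return "medium"
    return "low"
```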

### Complexity Adjustments

- **HIGH complexity + RESEARCH/CODE**: Boost Kimi and Gemini in the rankings
- **LOW complexity + FAST_OPS**: Ensure Groq is first
- **Code blocks present**: Boost OpenAI Codex for any task type

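
These adjustments amount to front-moving backends in a copy of the base ranking. A minimal sketch, assuming string task types and complexity levels (the function name and "boost = move to front" semantics are assumptions):

```python
def apply_adjustments(ranking, task_type, complexity, has_code_blocks):
    """Return a copy of the ranking with the adjustment boosts applied.
    A boost moves a backend to the front; later boosts win ties."""
    ranked = list(ranking)

    def boost(backend):
        if backend in ranked:
            ranked.insert(0, ranked.pop(ranked.index(backend)))

    if has_code_blocks:
        boost("openai-codex")
    if complexity == "high" and task_type in ("research", "code"):
        boost("gemini")
        boost("kimi")
    if complexity == "low" and task_type == "fast_ops":
        boost("groq")
    return ranked
```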
## Keyword Dictionaries

The classifier uses curated keyword sets for each task type:

### Code Keywords (100+)

- Implementation: implement, code, function, class, module
- Debugging: debug, error, exception, traceback, bug, fix
- Testing: test, pytest, unittest, coverage
- Operations: deploy, docker, kubernetes, ci/cd, pipeline
- Concepts: api, endpoint, database, query, authentication

### Reasoning Keywords (50+)

- Analysis: analyze, evaluate, assess, critique, review
- Logic: reason, deduce, infer, logic, argument, evidence
- Process: compare, contrast, trade-off, strategy, plan
- Modifiers: step by step, chain of thought, think through

### Research Keywords (80+)

- Actions: research, find, search, explore, discover
- Sources: paper, publication, journal, arxiv, dataset
- Methods: study, survey, experiment, benchmark, evaluation
- Domains: machine learning, neural network, sota, literature

### Creative Keywords (100+)

- Visual: art, paint, draw, design, graphic, image
- Writing: write, story, novel, poem, essay, content
- Audio: music, song, compose, melody, sound
- Process: brainstorm, ideate, concept, imagine, inspire

### Fast Ops Keywords (60+)

- Simple: quick, fast, brief, simple, easy, status
- Actions: list, show, get, check, count, find
- Short queries: hi, hello, thanks, yes/no, what is

### Tool Use Keywords (70+)

- Actions: run, execute, call, use tool, invoke
- Systems: terminal, shell, docker, kubernetes, git
- Protocols: api, http, request, response, webhook
- Agents: delegate, subagent, spawn, mcp
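
Single-word matching can be sketched as set intersection over lowercase tokens. The sets below are tiny illustrative excerpts, not the real dictionaries:

```python
import re

# Illustrative excerpts only; the real sets hold 50-100+ entries each.
KEYWORDS = {
    "code": {"implement", "debug", "function", "class", "api"},
    "research": {"research", "find", "paper", "study"},
    "creative": {"write", "story", "poem", "brainstorm"},
}

def score_task_types(prompt: str) -> dict:
    """Count keyword hits per task type over the prompt's token set."""
    tokens = set(re.findall(r"[a-z][a-z/-]*", prompt.lower()))
    return {task: len(words & tokens) for task, words in KEYWORDS.items()}
```

Note that multi-word phrases such as "step by step" would not match a token set; those would need separate substring checks.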

## API

### Classify a Prompt

```python
from task_classifier import TaskClassifier, classify_prompt

# Method 1: using the class
classifier = TaskClassifier()
result = classifier.classify("Implement a Python function")

print(result.task_type)           # TaskType.CODE
print(result.preferred_backends)  # ["openai-codex", "anthropic", ...]
print(result.complexity)          # ComplexityLevel.LOW
print(result.reason)              # "Task: code; Complexity: low; ..."
print(result.confidence)          # 0.75

# Method 2: convenience function
output = classify_prompt("Research AI papers")
# Returns a dict:
# {
#     "task_type": "research",
#     "preferred_backends": ["gemini", "kimi", ...],
#     "complexity": "low",
#     "reason": "...",
#     "confidence": 0.65,
#     "features": {...},
# }
```

### ClassificationResult Fields

| Field | Type | Description |
|-------|------|-------------|
| `task_type` | TaskType | Classified task category |
| `preferred_backends` | List[str] | Ranked list of backend identifiers |
| `complexity` | ComplexityLevel | Assessed complexity level |
| `reason` | str | Human-readable classification reasoning |
| `confidence` | float | Confidence score in [0.0, 1.0] |
| `features` | Dict | Extracted features (lengths, code, URLs) |
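
The result type can be modeled as a frozen dataclass. This is a sketch from the field table above; the exact enum members and defaults are assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class TaskType(Enum):
    CODE = "code"
    REASONING = "reasoning"
    RESEARCH = "research"
    CREATIVE = "creative"
    FAST_OPS = "fast_ops"
    TOOL_USE = "tool_use"
    UNKNOWN = "unknown"

class ComplexityLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

@dataclass(frozen=True)
class ClassificationResult:
    task_type: TaskType
    preferred_backends: List[str]
    complexity: ComplexityLevel
    reason: str
    confidence: float  # in [0.0, 1.0]
    features: Dict = field(default_factory=dict)
```

Freezing the dataclass keeps routing decisions immutable once returned, which makes them safe to log and cache.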

## Integration with Hermes

### Usage in Smart Model Routing

The task classifier replaces (or augments) the existing `smart_model_routing.py`:

```python
# In hermes-agent/agent/smart_model_routing.py
from uniwizard.task_classifier import TaskClassifier

classifier = TaskClassifier()

def resolve_turn_route(user_message, routing_config, primary, fallback_chain):
    # Classify the prompt
    result = classifier.classify(user_message)

    # Map preferred backends to actual models from fallback_chain
    # (fallback_chain: backend name -> model config; is_available is a
    # health check provided by the harness)
    for backend in result.preferred_backends:
        model_config = fallback_chain.get(backend)
        if model_config and is_available(backend):
            return {
                "model": model_config["model"],
                "provider": backend,
                "reason": result.reason,
                "complexity": result.complexity.value,
            }

    # Fall back to the primary route
    return primary
```

### Configuration

```yaml
# config.yaml
smart_model_routing:
  enabled: true
  use_task_classifier: true

fallback_providers:
  - provider: anthropic
    model: claude-opus-4-6
  - provider: openai-codex
    model: codex
  - provider: gemini
    model: gemini-2.5-flash
  - provider: groq
    model: llama-3.3-70b-versatile
  - provider: grok
    model: grok-3-mini-fast
  - provider: kimi-coding
    model: kimi-k2.5
  - provider: openrouter
    model: openai/gpt-4.1-mini
```

## Testing

Run the test suite:

```bash
cd ~/.timmy/uniwizard
python -m pytest test_task_classifier.py -v
```

Coverage includes:

- Feature extraction (URLs, code blocks, length metrics)
- Complexity assessment (low/medium/high)
- Task type classification (all six types)
- Backend selection (rankings by task type)
- Complexity adjustments (boosts for Kimi/Gemini)
- Edge cases (empty, whitespace-only, and very long prompts)
- Integration scenarios (realistic use cases)

## Future Enhancements

1. **Session Context**: Use conversation history for better classification
2. **Performance Feedback**: Learn from actual backend performance
3. **User Preferences**: Allow user-defined backend preferences
4. **Cost Optimization**: Factor backend costs into routing
5. **Streaming Detection**: Identify streaming-suitable tasks
6. **Multi-Modal**: Better handling of image/audio inputs
7. **Confidence Thresholds**: Configurable confidence cutoffs

## Files

| File | Description |
|------|-------------|
| `task_classifier.py` | Main implementation (600+ lines) |
| `test_task_classifier.py` | Unit tests (400+ lines) |
| `task_classifier_design.md` | This design document |

## References

- Gitea issue: timmy-home #88
- Existing router: `~/.hermes/hermes-agent/agent/smart_model_routing.py`
- Config: `~/.hermes/config.yaml` (fallback_providers chain)