# Task Classifier Design Document
## Overview
The **Task Classifier** is an enhanced prompt routing system for the Uniwizard agent harness. It classifies incoming user prompts into task categories and maps them to ranked backend preferences, enabling intelligent model selection across the 7-backend fallback chain.
## Goals
1. **Right-size every request**: Route simple queries to fast backends, complex tasks to capable ones
2. **Minimize latency**: Use Groq (284ms) for fast operations, Anthropic for deep reasoning
3. **Maximize quality**: Match task type to backend strengths
4. **Provide transparency**: Return clear reasoning for routing decisions
5. **Enable fallback**: Support the full 7-backend chain with intelligent ordering
## Architecture
```
┌──────────────────────┐
│     User Prompt      │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Feature Extraction   │
│ - Length metrics     │
│ - Code detection     │
│ - URL extraction     │
│ - Keyword tokenize   │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Complexity Assess    │
│ - Low/Medium/High    │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Task Classification  │
│ - Code               │
│ - Reasoning          │
│ - Research           │
│ - Creative           │
│ - Fast Ops           │
│ - Tool Use           │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Backend Selection    │
│ - Ranked by task     │
│ - Complexity adj.    │
│ - Feature boosts     │
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ ClassificationResult │
│ - task_type          │
│ - preferred_backends │
│ - complexity         │
│ - reason             │
│ - confidence         │
└──────────────────────┘
```
## Task Types
| Task Type | Description | Primary Indicators |
|-----------|-------------|-------------------|
| `code` | Programming tasks, debugging, refactoring | Keywords: implement, debug, refactor, test, function, class, API |
| `reasoning` | Analysis, comparison, evaluation | Keywords: analyze, compare, evaluate, step by step, trade-offs |
| `research` | Information gathering, literature review | Keywords: research, find, paper, study, URLs present |
| `creative` | Writing, design, content creation | Keywords: write, create, design, story, poem, brainstorm |
| `fast_ops` | Quick status checks, simple queries | Short length (<20 words), simple keywords |
| `tool_use` | Tool invocations, commands, API calls | Keywords: run, execute, use tool, browser, delegate |
| `unknown` | No clear indicators | Fallback classification |
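The categories above can be captured as an enum. This is a minimal sketch; the value strings mirror the table and the `TaskType.CODE` usage shown in the API section, but the exact definition in `task_classifier.py` may differ:

```python
from enum import Enum

class TaskType(Enum):
    """Task categories from the table above."""
    CODE = "code"
    REASONING = "reasoning"
    RESEARCH = "research"
    CREATIVE = "creative"
    FAST_OPS = "fast_ops"
    TOOL_USE = "tool_use"
    UNKNOWN = "unknown"  # fallback when no indicators match
```

Using string values keeps the enum trivially serializable for the dict output of `classify_prompt`.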
## Backend Strengths Mapping
### 1. Anthropic (Claude)
- **Strengths**: Deep reasoning, code review, complex analysis, tool use
- **Best for**: Reasoning, tool_use, complex code review
- **Ranking**: #1 for reasoning, #2 for code, #1 for tool_use
### 2. OpenAI Codex
- **Strengths**: Code generation, feature implementation
- **Best for**: Code tasks, structured outputs
- **Ranking**: #1 for code generation
### 3. Gemini
- **Strengths**: Research, multimodal, long context
- **Best for**: Research tasks, document analysis
- **Ranking**: #1 for research, #2 for reasoning
### 4. Groq
- **Strengths**: Speed (284ms latency)
- **Best for**: Fast operations, simple queries, triage
- **Ranking**: #1 for fast_ops
### 5. Grok
- **Strengths**: Broad knowledge, creative, drafting
- **Best for**: Creative writing, general knowledge
- **Ranking**: #1 for creative
### 6. Kimi (kimi-coding)
- **Strengths**: Long context (262K tokens), code refactoring
- **Best for**: Large codebase work, long document analysis
- **Ranking**: Boosted for high-complexity code/research
### 7. OpenRouter
- **Strengths**: Broadest model access, overflow handling
- **Best for**: Fallback, variety of model choices
- **Ranking**: #6 or #7 across all task types
## Backend Rankings by Task Type
```python
CODE = [
openai-codex, # Best generation
anthropic, # Review & analysis
kimi, # Large codebases
gemini, # Multimodal
groq, # Fast simple tasks
openrouter, # Overflow
grok, # General backup
]
REASONING = [
anthropic, # Deep reasoning
gemini, # Analysis
kimi, # Long chains
grok, # Broad knowledge
openai-codex, # Structured
openrouter,
groq,
]
RESEARCH = [
gemini, # Research leader
kimi, # 262K context
anthropic, # Deep analysis
grok, # Knowledge
openrouter, # Broad access
openai-codex,
groq, # Triage
]
CREATIVE = [
grok, # Creative writing
anthropic, # Nuanced
gemini, # Multimodal
openai-codex, # Creative coding
kimi, # Long-form
openrouter,
groq,
]
FAST_OPS = [
groq, # 284ms champion
openrouter, # Fast mini models
gemini, # Flash
grok, # Simple queries
anthropic,
openai-codex,
kimi,
]
TOOL_USE = [
anthropic, # Tool use leader
openai-codex, # Good integration
gemini, # Multimodal
groq, # Fast chaining
kimi, # Long sessions
openrouter,
grok,
]
```
## Complexity Assessment
Complexity is determined by:
| Metric | Low | Medium | High |
|--------|-----|--------|------|
| Characters | ≤200 | 201-800 | >800 |
| Words | ≤35 | 36-150 | >150 |
| Lines | ≤3 | 4-15 | >15 |
| URLs | 0 | 1 | ≥2 |
| Code Blocks | 0 | 1 | ≥2 |
**Rules:**
- 2+ high metrics → **HIGH** complexity
- 2+ medium metrics or 1 high → **MEDIUM** complexity
- Otherwise → **LOW** complexity
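The thresholds and rules above can be sketched as a small function. This is an illustrative implementation, not the exact code in `task_classifier.py`; the metric extraction (URL and code-block counting in particular) is simplified:

```python
from enum import Enum

class ComplexityLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

def assess_complexity(text: str) -> ComplexityLevel:
    """Apply the threshold table, then the 2+ high / 2+ medium rules."""
    chars = len(text)
    words = len(text.split())
    lines = text.count("\n") + 1
    urls = text.count("http://") + text.count("https://")
    code_blocks = text.count("```") // 2  # naive fence-pair count

    highs = sum([chars > 800, words > 150, lines > 15,
                 urls >= 2, code_blocks >= 2])
    mediums = sum([201 <= chars <= 800, 36 <= words <= 150,
                   4 <= lines <= 15, urls == 1, code_blocks == 1])

    if highs >= 2:
        return ComplexityLevel.HIGH
    if mediums >= 2 or highs == 1:
        return ComplexityLevel.MEDIUM
    return ComplexityLevel.LOW
```

A 50-word prompt, for example, trips both the character and word medium thresholds and lands on MEDIUM.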
### Complexity Adjustments
- **HIGH complexity + RESEARCH/CODE**: Boost Kimi and Gemini in rankings
- **LOW complexity + FAST_OPS**: Ensure Groq is first
- **Code blocks present**: Boost OpenAI Codex in any task type
## Keyword Dictionaries
The classifier uses curated keyword sets for each task type:
### Code Keywords (100+)
- Implementation: implement, code, function, class, module
- Debugging: debug, error, exception, traceback, bug, fix
- Testing: test, pytest, unittest, coverage
- Operations: deploy, docker, kubernetes, ci/cd, pipeline
- Concepts: api, endpoint, database, query, authentication
### Reasoning Keywords (50+)
- Analysis: analyze, evaluate, assess, critique, review
- Logic: reason, deduce, infer, logic, argument, evidence
- Process: compare, contrast, trade-off, strategy, plan
- Modifiers: step by step, chain of thought, think through
### Research Keywords (80+)
- Actions: research, find, search, explore, discover
- Sources: paper, publication, journal, arxiv, dataset
- Methods: study, survey, experiment, benchmark, evaluation
- Domains: machine learning, neural network, sota, literature
### Creative Keywords (100+)
- Visual: art, paint, draw, design, graphic, image
- Writing: write, story, novel, poem, essay, content
- Audio: music, song, compose, melody, sound
- Process: brainstorm, ideate, concept, imagine, inspire
### Fast Ops Keywords (60+)
- Simple: quick, fast, brief, simple, easy, status
- Actions: list, show, get, check, count, find
- Short queries: hi, hello, thanks, yes/no, what is
### Tool Use Keywords (70+)
- Actions: run, execute, call, use tool, invoke
- Systems: terminal, shell, docker, kubernetes, git
- Protocols: api, http, request, response, webhook
- Agents: delegate, subagent, spawn, mcp
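A minimal sketch of how keyword sets drive classification: score each task type by overlap with the prompt's tokens and take the best nonzero score. The sets below are small samples of the curated dictionaries above, and the scoring is deliberately simpler than the real classifier (no weighting, no phrase matching):

```python
import re

# Toy keyword sets; the real dictionaries are far larger (50-100+ entries each)
KEYWORDS = {
    "code": {"implement", "debug", "refactor", "function", "class", "api"},
    "reasoning": {"analyze", "compare", "evaluate", "trade-off"},
    "research": {"research", "paper", "study", "arxiv"},
    "creative": {"write", "story", "poem", "design", "brainstorm"},
    "fast_ops": {"quick", "status", "list", "check"},
    "tool_use": {"run", "execute", "invoke", "terminal", "delegate"},
}

def classify_by_keywords(prompt: str) -> str:
    """Return the task type with the most keyword hits, else 'unknown'."""
    tokens = set(re.findall(r"[a-z\-]+", prompt.lower()))
    scores = {task: len(tokens & kw) for task, kw in KEYWORDS.items()}
    best_task, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_task if best_score > 0 else "unknown"
```

Multi-word phrases like "step by step" need phrase-level matching on top of this token overlap, which is one reason the production dictionaries are larger and more structured.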
## API
### Classify a Prompt
```python
from task_classifier import TaskClassifier, classify_prompt
# Method 1: Using the class
classifier = TaskClassifier()
result = classifier.classify("Implement a Python function")
print(result.task_type) # TaskType.CODE
print(result.preferred_backends) # ["openai-codex", "anthropic", ...]
print(result.complexity) # ComplexityLevel.LOW
print(result.reason) # "Task: code; Complexity: low; ..."
print(result.confidence) # 0.75
# Method 2: Convenience function
output = classify_prompt("Research AI papers")
# Returns dict: {
# "task_type": "research",
# "preferred_backends": ["gemini", "kimi", ...],
# "complexity": "low",
# "reason": "...",
# "confidence": 0.65,
# "features": {...}
# }
```
### ClassificationResult Fields
| Field | Type | Description |
|-------|------|-------------|
| `task_type` | TaskType | Classified task category |
| `preferred_backends` | List[str] | Ranked list of backend identifiers |
| `complexity` | ComplexityLevel | Assessed complexity level |
| `reason` | str | Human-readable classification reasoning |
| `confidence` | float | 0.0-1.0 confidence score |
| `features` | Dict | Extracted features (lengths, code, URLs) |
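The fields above map naturally onto a dataclass. This sketch simplifies `task_type` and `complexity` to plain strings for brevity (the real implementation uses the `TaskType` and `ComplexityLevel` enums), and the `to_dict` shape mirrors the `classify_prompt` output shown in the API section:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ClassificationResult:
    """Result of classifying one prompt; field names match the table above."""
    task_type: str
    preferred_backends: List[str]
    complexity: str
    reason: str
    confidence: float  # 0.0-1.0
    features: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # Serializable form, matching the classify_prompt() convenience output
        return {
            "task_type": self.task_type,
            "preferred_backends": self.preferred_backends,
            "complexity": self.complexity,
            "reason": self.reason,
            "confidence": self.confidence,
            "features": self.features,
        }
```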
## Integration with Hermes
### Usage in Smart Model Routing
The task classifier replaces or enhances the existing `smart_model_routing.py`:
```python
# In hermes-agent/agent/smart_model_routing.py
from uniwizard.task_classifier import TaskClassifier

classifier = TaskClassifier()

def resolve_turn_route(user_message, routing_config, primary, fallback_chain):
    # Classify the prompt
    result = classifier.classify(user_message)
    # Map preferred backends to actual models from fallback_chain
    # (is_available is the harness's existing health-check helper)
    for backend in result.preferred_backends:
        model_config = fallback_chain.get(backend)
        if model_config and is_available(backend):
            return {
                "model": model_config["model"],
                "provider": backend,
                "reason": result.reason,
                "complexity": result.complexity.value,
            }
    # Fall back to the configured primary
    return primary
```
### Configuration
```yaml
# config.yaml
smart_model_routing:
  enabled: true
  use_task_classifier: true
  fallback_providers:
    - provider: anthropic
      model: claude-opus-4-6
    - provider: openai-codex
      model: codex
    - provider: gemini
      model: gemini-2.5-flash
    - provider: groq
      model: llama-3.3-70b-versatile
    - provider: grok
      model: grok-3-mini-fast
    - provider: kimi-coding
      model: kimi-k2.5
    - provider: openrouter
      model: openai/gpt-4.1-mini
```
## Testing
Run the test suite:
```bash
cd ~/.timmy/uniwizard
python -m pytest test_task_classifier.py -v
```
Coverage includes:
- Feature extraction (URLs, code blocks, length metrics)
- Complexity assessment (low/medium/high)
- Task type classification (all 6 types)
- Backend selection (rankings by task type)
- Complexity adjustments (boosts for Kimi/Gemini)
- Edge cases (empty, whitespace, very long prompts)
- Integration scenarios (realistic use cases)
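A self-contained sketch of the kind of feature-extraction edge-case checks the suite covers. The `extract_features` helper here is a stand-in for illustration, not the real `task_classifier` API:

```python
import re

def extract_features(prompt: str) -> dict:
    """Stand-in extractor mirroring the documented feature set."""
    fence = "`" * 3  # code-fence marker, built to avoid a literal in this doc
    return {
        "chars": len(prompt),
        "words": len(prompt.split()),
        "lines": prompt.count("\n") + 1,
        "urls": len(re.findall(r"https?://\S+", prompt)),
        "code_blocks": prompt.count(fence) // 2,
    }

def test_empty_prompt():
    f = extract_features("")
    assert f["chars"] == 0 and f["words"] == 0 and f["urls"] == 0

def test_url_detection():
    f = extract_features("Summarize https://arxiv.org/abs/1234.5678 please")
    assert f["urls"] == 1
```

The real tests exercise the same boundaries (empty input, whitespace-only input, URL and fence counting) through the classifier's public API.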
## Future Enhancements
1. **Session Context**: Use conversation history for better classification
2. **Performance Feedback**: Learn from actual backend performance
3. **User Preferences**: Allow user-defined backend preferences
4. **Cost Optimization**: Factor in backend costs for routing
5. **Streaming Detection**: Identify streaming-suitable tasks
6. **Multi-Modal**: Better handling of image/audio inputs
7. **Confidence Thresholds**: Configurable confidence cutoffs
## Files
| File | Description |
|------|-------------|
| `task_classifier.py` | Main implementation (600+ lines) |
| `test_task_classifier.py` | Unit tests (400+ lines) |
| `task_classifier_design.md` | This design document |
## References
- Gitea Issue: timmy-home #88
- Existing: `~/.hermes/hermes-agent/agent/smart_model_routing.py`
- Config: `~/.hermes/config.yaml` (fallback_providers chain)