# Task Classifier Design Document

## Overview

The **Task Classifier** is an enhanced prompt routing system for the Uniwizard agent harness. It classifies incoming user prompts into task categories and maps them to ranked backend preferences, enabling intelligent model selection across the 7-backend fallback chain.

## Goals

1. **Right-size every request**: Route simple queries to fast backends and complex tasks to capable ones
2. **Minimize latency**: Use Groq (284 ms) for fast operations, Anthropic for deep reasoning
3. **Maximize quality**: Match task type to backend strengths
4. **Provide transparency**: Return clear reasoning for routing decisions
5. **Enable fallback**: Support the full 7-backend chain with intelligent ordering

## Architecture

```
┌─────────────────┐
│   User Prompt   │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│ Feature Extraction  │
│ - Length metrics    │
│ - Code detection    │
│ - URL extraction    │
│ - Keyword tokenize  │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Complexity Assess   │
│ - Low/Medium/High   │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Task Classification │
│ - Code              │
│ - Reasoning         │
│ - Research          │
│ - Creative          │
│ - Fast Ops          │
│ - Tool Use          │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Backend Selection   │
│ - Ranked by task    │
│ - Complexity adj.   │
```
```
│ - Feature boosts   │
└────────┬────────────┘
         │
         ▼
┌──────────────────────┐
│ ClassificationResult │
│ - task_type          │
│ - preferred_backends │
│ - complexity         │
│ - reason             │
│ - confidence         │
└──────────────────────┘
```

## Task Types

| Task Type | Description | Primary Indicators |
|-----------|-------------|--------------------|
| `code` | Programming tasks, debugging, refactoring | Keywords: implement, debug, refactor, test, function, class, API |
| `reasoning` | Analysis, comparison, evaluation | Keywords: analyze, compare, evaluate, step by step, trade-offs |
| `research` | Information gathering, literature review | Keywords: research, find, paper, study; URLs present |
| `creative` | Writing, design, content creation | Keywords: write, create, design, story, poem, brainstorm |
| `fast_ops` | Quick status checks, simple queries | Short length (<20 words), simple keywords |
| `tool_use` | Tool invocations, commands, API calls | Keywords: run, execute, use tool, browser, delegate |
| `unknown` | No clear indicators | Fallback classification |

## Backend Strengths Mapping

### 1. Anthropic (Claude)
- **Strengths**: Deep reasoning, code review, complex analysis, tool use
- **Best for**: Reasoning, tool use, complex code review
- **Ranking**: #1 for reasoning, #2 for code, #1 for tool_use

### 2. OpenAI Codex
- **Strengths**: Code generation, feature implementation
- **Best for**: Code tasks, structured outputs
- **Ranking**: #1 for code generation

### 3. Gemini
- **Strengths**: Research, multimodal, long context
- **Best for**: Research tasks, document analysis
- **Ranking**: #1 for research, #2 for reasoning

### 4. Groq
- **Strengths**: Speed (284 ms latency)
- **Best for**: Fast operations, simple queries, triage
- **Ranking**: #1 for fast_ops

### 5. Grok
- **Strengths**: Broad knowledge, creative writing, drafting
- **Best for**: Creative writing, general knowledge
- **Ranking**: #1 for creative

### 6. Kimi (kimi-coding)
- **Strengths**: Long context (262K tokens), code refactoring
- **Best for**: Large codebase work, long document analysis
- **Ranking**: Boosted for high-complexity code/research

### 7. OpenRouter
- **Strengths**: Broadest model access, overflow handling
- **Best for**: Fallback, variety of model choices
- **Ranking**: #6 or #7 across all task types

## Backend Rankings by Task Type

```python
CODE = [
    "openai-codex",  # Best generation
    "anthropic",     # Review & analysis
    "kimi",          # Large codebases
    "gemini",        # Multimodal
    "groq",          # Fast simple tasks
    "openrouter",    # Overflow
    "grok",          # General backup
]

REASONING = [
    "anthropic",     # Deep reasoning
    "gemini",        # Analysis
    "kimi",          # Long chains
    "grok",          # Broad knowledge
    "openai-codex",  # Structured
    "openrouter",
    "groq",
]

RESEARCH = [
    "gemini",        # Research leader
    "kimi",          # 262K context
    "anthropic",     # Deep analysis
    "grok",          # Knowledge
    "openrouter",    # Broad access
    "openai-codex",
    "groq",          # Triage
]

CREATIVE = [
    "grok",          # Creative writing
    "anthropic",     # Nuanced
    "gemini",        # Multimodal
    "openai-codex",  # Creative coding
    "kimi",          # Long-form
    "openrouter",
    "groq",
]

FAST_OPS = [
    "groq",          # 284 ms champion
    "openrouter",    # Fast mini models
    "gemini",        # Flash
    "grok",          # Simple queries
    "anthropic",
    "openai-codex",
    "kimi",
]

TOOL_USE = [
    "anthropic",     # Tool use leader
    "openai-codex",  # Good integration
    "gemini",        # Multimodal
    "groq",          # Fast chaining
    "kimi",          # Long sessions
    "openrouter",
    "grok",
]
```

## Complexity Assessment

Complexity is determined by the following metrics:

| Metric | Low | Medium | High |
|--------|-----|--------|------|
| Characters | ≤200 | 201-800 | >800 |
| Words | ≤35 | 36-150 | >150 |
| Lines | ≤3 | 4-15 | >15 |
| URLs | 0 | 1 | ≥2 |
| Code Blocks | 0 | 1 | ≥2 |

**Rules:**

- 2+ high metrics → **HIGH** complexity
- 2+ medium metrics or 1 high → **MEDIUM** complexity
- Otherwise → **LOW** complexity

### Complexity Adjustments

- **HIGH complexity + RESEARCH/CODE**: Boost Kimi and Gemini in rankings
- **LOW complexity + FAST_OPS**: Ensure Groq is first
- **Code blocks present**: Boost OpenAI Codex in any task type
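The threshold table and bucketing rules above can be sketched as a small helper. This is a minimal illustration, not the shipped `task_classifier.py` API: the names `assess_complexity`, `bucket`, and `THRESHOLDS` are hypothetical, and the URL/code-block detection is deliberately naive.

```python
import re

# Illustrative sketch of the complexity rules above; names are
# hypothetical, not the real task_classifier.py API.
LOW, MEDIUM, HIGH = "low", "medium", "high"

# (low_max, medium_max) per metric; anything above medium_max is high,
# mirroring the threshold table in this section.
THRESHOLDS = {
    "chars": (200, 800),
    "words": (35, 150),
    "lines": (3, 15),
    "urls": (0, 1),
    "code_blocks": (0, 1),
}

def bucket(metric: str, value: int) -> str:
    """Map one metric value to low/medium/high per THRESHOLDS."""
    low_max, med_max = THRESHOLDS[metric]
    if value <= low_max:
        return LOW
    return MEDIUM if value <= med_max else HIGH

def assess_complexity(prompt: str) -> str:
    """Apply the 2+ high / 2+ medium-or-1-high rules from above."""
    metrics = {
        "chars": len(prompt),
        "words": len(prompt.split()),
        "lines": prompt.count("\n") + 1,
        "urls": len(re.findall(r"https?://\S+", prompt)),
        "code_blocks": prompt.count("```") // 2,  # naive fence pairing
    }
    buckets = [bucket(m, v) for m, v in metrics.items()]
    if buckets.count(HIGH) >= 2:
        return HIGH
    if buckets.count(MEDIUM) >= 2 or buckets.count(HIGH) == 1:
        return MEDIUM
    return LOW
```

For example, a short status query stays LOW, while a 50-word prompt crosses both the character and word thresholds into MEDIUM.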
## Keyword Dictionaries

The classifier uses curated keyword sets for each task type:

### Code Keywords (100+)
- Implementation: implement, code, function, class, module
- Debugging: debug, error, exception, traceback, bug, fix
- Testing: test, pytest, unittest, coverage
- Operations: deploy, docker, kubernetes, ci/cd, pipeline
- Concepts: api, endpoint, database, query, authentication

### Reasoning Keywords (50+)
- Analysis: analyze, evaluate, assess, critique, review
- Logic: reason, deduce, infer, logic, argument, evidence
- Process: compare, contrast, trade-off, strategy, plan
- Modifiers: step by step, chain of thought, think through

### Research Keywords (80+)
- Actions: research, find, search, explore, discover
- Sources: paper, publication, journal, arxiv, dataset
- Methods: study, survey, experiment, benchmark, evaluation
- Domains: machine learning, neural network, sota, literature

### Creative Keywords (100+)
- Visual: art, paint, draw, design, graphic, image
- Writing: write, story, novel, poem, essay, content
- Audio: music, song, compose, melody, sound
- Process: brainstorm, ideate, concept, imagine, inspire

### Fast Ops Keywords (60+)
- Simple: quick, fast, brief, simple, easy, status
- Actions: list, show, get, check, count, find
- Short queries: hi, hello, thanks, yes/no, what is

### Tool Use Keywords (70+)
- Actions: run, execute, call, use tool, invoke
- Systems: terminal, shell, docker, kubernetes, git
- Protocols: api, http, request, response, webhook
- Agents: delegate, subagent, spawn, mcp

## API

### Classify a Prompt

```python
from task_classifier import TaskClassifier, classify_prompt

# Method 1: Using the class
classifier = TaskClassifier()
result = classifier.classify("Implement a Python function")

print(result.task_type)          # TaskType.CODE
print(result.preferred_backends) # ["openai-codex", "anthropic", ...]
print(result.complexity)         # ComplexityLevel.LOW
print(result.reason)             # "Task: code; Complexity: low; ..."
```
```python
print(result.confidence)         # 0.75

# Method 2: Convenience function
output = classify_prompt("Research AI papers")
# Returns dict:
# {
#     "task_type": "research",
#     "preferred_backends": ["gemini", "kimi", ...],
#     "complexity": "low",
#     "reason": "...",
#     "confidence": 0.65,
#     "features": {...}
# }
```

### ClassificationResult Fields

| Field | Type | Description |
|-------|------|-------------|
| `task_type` | TaskType | Classified task category |
| `preferred_backends` | List[str] | Ranked list of backend identifiers |
| `complexity` | ComplexityLevel | Assessed complexity level |
| `reason` | str | Human-readable classification reasoning |
| `confidence` | float | 0.0-1.0 confidence score |
| `features` | Dict | Extracted features (lengths, code, URLs) |

## Integration with Hermes

### Usage in Smart Model Routing

The task classifier replaces or enhances the existing `smart_model_routing.py`:

```python
# In hermes-agent/agent/smart_model_routing.py
from uniwizard.task_classifier import TaskClassifier

classifier = TaskClassifier()

def resolve_turn_route(user_message, routing_config, primary, fallback_chain):
    # Classify the prompt
    result = classifier.classify(user_message)

    # Map preferred backends to actual models from fallback_chain
    for backend in result.preferred_backends:
        model_config = fallback_chain.get(backend)
        # is_available() is a backend health check defined elsewhere
        if model_config and is_available(backend):
            return {
                "model": model_config["model"],
                "provider": backend,
                "reason": result.reason,
                "complexity": result.complexity.value,
            }

    # Fall back to the primary route
    return primary
```

### Configuration

```yaml
# config.yaml
smart_model_routing:
  enabled: true
  use_task_classifier: true
  fallback_providers:
    - provider: anthropic
      model: claude-opus-4-6
    - provider: openai-codex
      model: codex
    - provider: gemini
      model: gemini-2.5-flash
    - provider: groq
      model: llama-3.3-70b-versatile
    - provider: grok
      model: grok-3-mini-fast
    - provider: kimi-coding
      model: kimi-k2.5
    - provider: openrouter
      model: openai/gpt-4.1-mini
```

## Testing

Run the test suite:

```bash
cd ~/.timmy/uniwizard
python -m pytest test_task_classifier.py -v
```

Coverage includes:

- Feature extraction (URLs, code blocks, length metrics)
- Complexity assessment (low/medium/high)
- Task type classification (all 6 types)
- Backend selection (rankings by task type)
- Complexity adjustments (boosts for Kimi/Gemini)
- Edge cases (empty, whitespace, very long prompts)
- Integration scenarios (realistic use cases)

## Future Enhancements

1. **Session Context**: Use conversation history for better classification
2. **Performance Feedback**: Learn from actual backend performance
3. **User Preferences**: Allow user-defined backend preferences
4. **Cost Optimization**: Factor in backend costs for routing
5. **Streaming Detection**: Identify streaming-suitable tasks
6. **Multi-Modal**: Better handling of image/audio inputs
7. **Confidence Thresholds**: Configurable confidence cutoffs

## Files

| File | Description |
|------|-------------|
| `task_classifier.py` | Main implementation (600+ lines) |
| `test_task_classifier.py` | Unit tests (400+ lines) |
| `task_classifier_design.md` | This design document |

## References

- Gitea Issue: timmy-home #88
- Existing: `~/.hermes/hermes-agent/agent/smart_model_routing.py`
- Config: `~/.hermes/config.yaml` (fallback_providers chain)