Task Classifier Design Document

Overview

The Task Classifier is an enhanced prompt routing system for the Uniwizard agent harness. It classifies incoming user prompts into task categories and maps them to ranked backend preferences, enabling intelligent model selection across the 7-backend fallback chain.

Goals

  1. Right-size every request: Route simple queries to fast backends, complex tasks to capable ones
  2. Minimize latency: Use Groq (284ms) for fast operations, Anthropic for deep reasoning
  3. Maximize quality: Match task type to backend strengths
  4. Provide transparency: Return clear reasoning for routing decisions
  5. Enable fallback: Support the full 7-backend chain with intelligent ordering

Architecture

┌─────────────────┐
│  User Prompt    │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│  Feature Extraction │
│  - Length metrics   │
│  - Code detection   │
│  - URL extraction   │
│  - Keyword tokenize │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Complexity Assess  │
│  - Low/Medium/High  │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Task Classification│
│  - Code             │
│  - Reasoning        │
│  - Research         │
│  - Creative         │
│  - Fast Ops         │
│  - Tool Use         │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Backend Selection  │
│  - Ranked by task   │
│  - Complexity adj.  │
│  - Feature boosts   │
└────────┬────────────┘
         │
         ▼
┌──────────────────────┐
│ ClassificationResult │
│  - task_type         │
│  - preferred_backends│
│  - complexity        │
│  - reason            │
│  - confidence        │
└──────────────────────┘

Task Types

| Task Type | Description | Primary Indicators |
|-----------|-------------|--------------------|
| code | Programming tasks, debugging, refactoring | Keywords: implement, debug, refactor, test, function, class, API |
| reasoning | Analysis, comparison, evaluation | Keywords: analyze, compare, evaluate, step by step, trade-offs |
| research | Information gathering, literature review | Keywords: research, find, paper, study; URLs present |
| creative | Writing, design, content creation | Keywords: write, create, design, story, poem, brainstorm |
| fast_ops | Quick status checks, simple queries | Short length (<20 words), simple keywords |
| tool_use | Tool invocations, commands, API calls | Keywords: run, execute, use tool, browser, delegate |
| unknown | No clear indicators | Fallback classification |
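These categories map naturally onto an enum. The sketch below is an assumed shape consistent with the table and the API section, not a copy of the actual task_classifier.py source:

```python
from enum import Enum

class TaskType(Enum):
    """Task categories from the table above (hypothetical mirror of task_classifier.py)."""
    CODE = "code"
    REASONING = "reasoning"
    RESEARCH = "research"
    CREATIVE = "creative"
    FAST_OPS = "fast_ops"
    TOOL_USE = "tool_use"
    UNKNOWN = "unknown"
```

Using string values keeps the enum directly serializable for the dict returned by classify_prompt (e.g. `TaskType("code") is TaskType.CODE`).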

Backend Strengths Mapping

1. Anthropic (Claude)

  • Strengths: Deep reasoning, code review, complex analysis, tool use
  • Best for: Reasoning, tool_use, complex code review
  • Ranking: #1 for reasoning, #2 for code, #1 for tool_use

2. OpenAI Codex

  • Strengths: Code generation, feature implementation
  • Best for: Code tasks, structured outputs
  • Ranking: #1 for code generation

3. Gemini

  • Strengths: Research, multimodal, long context
  • Best for: Research tasks, document analysis
  • Ranking: #1 for research, #2 for reasoning

4. Groq

  • Strengths: Speed (284ms latency)
  • Best for: Fast operations, simple queries, triage
  • Ranking: #1 for fast_ops

5. Grok

  • Strengths: Broad knowledge, creative, drafting
  • Best for: Creative writing, general knowledge
  • Ranking: #1 for creative

6. Kimi (kimi-coding)

  • Strengths: Long context (262K tokens), code refactoring
  • Best for: Large codebase work, long document analysis
  • Ranking: Boosted for high-complexity code/research

7. OpenRouter

  • Strengths: Broadest model access, overflow handling
  • Best for: Fallback, variety of model choices
  • Ranking: #6 or #7 for most task types (but #2 for fast_ops, thanks to fast mini models)

Backend Rankings by Task Type

```python
CODE = [
    "openai-codex",  # Best generation
    "anthropic",     # Review & analysis
    "kimi",          # Large codebases
    "gemini",        # Multimodal
    "groq",          # Fast simple tasks
    "openrouter",    # Overflow
    "grok",          # General backup
]

REASONING = [
    "anthropic",     # Deep reasoning
    "gemini",        # Analysis
    "kimi",          # Long chains
    "grok",          # Broad knowledge
    "openai-codex",  # Structured
    "openrouter",
    "groq",
]

RESEARCH = [
    "gemini",        # Research leader
    "kimi",          # 262K context
    "anthropic",     # Deep analysis
    "grok",          # Knowledge
    "openrouter",    # Broad access
    "openai-codex",
    "groq",          # Triage
]

CREATIVE = [
    "grok",          # Creative writing
    "anthropic",     # Nuanced
    "gemini",        # Multimodal
    "openai-codex",  # Creative coding
    "kimi",          # Long-form
    "openrouter",
    "groq",
]

FAST_OPS = [
    "groq",          # 284ms champion
    "openrouter",    # Fast mini models
    "gemini",        # Flash
    "grok",          # Simple queries
    "anthropic",
    "openai-codex",
    "kimi",
]

TOOL_USE = [
    "anthropic",     # Tool use leader
    "openai-codex",  # Good integration
    "gemini",        # Multimodal
    "groq",          # Fast chaining
    "kimi",          # Long sessions
    "openrouter",
    "grok",
]
```

Complexity Assessment

Complexity is determined by:

| Metric | Low | Medium | High |
|--------|-----|--------|------|
| Characters | ≤200 | 201-800 | >800 |
| Words | ≤35 | 36-150 | >150 |
| Lines | ≤3 | 4-15 | >15 |
| URLs | 0 | 1 | ≥2 |
| Code Blocks | 0 | 1 | ≥2 |

Rules:

  • 2+ high metrics → HIGH complexity
  • 2+ medium metrics or 1 high → MEDIUM complexity
  • Otherwise → LOW complexity
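The thresholds and rules above can be sketched as follows (a minimal illustration; the function name and return values are assumptions, not the actual implementation):

```python
def assess_complexity(chars: int, words: int, lines: int,
                      urls: int, code_blocks: int) -> str:
    """Apply the per-metric thresholds, then the 2+ high / 2+ medium-or-1-high rules."""
    # Per-metric (medium lower bound, high lower bound) from the table above.
    thresholds = {
        "chars": (201, 801),
        "words": (36, 151),
        "lines": (4, 16),
        "urls": (1, 2),
        "code_blocks": (1, 2),
    }
    values = {"chars": chars, "words": words, "lines": lines,
              "urls": urls, "code_blocks": code_blocks}
    highs = sum(values[m] >= hi for m, (_, hi) in thresholds.items())
    mediums = sum(med <= values[m] < hi for m, (med, hi) in thresholds.items())
    if highs >= 2:
        return "high"
    if mediums >= 2 or highs == 1:
        return "medium"
    return "low"
```

Note that a single medium metric still yields LOW: only two mediums (or one high) promote a prompt to MEDIUM.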

Complexity Adjustments

  • HIGH complexity + RESEARCH/CODE: Boost Kimi and Gemini in rankings
  • LOW complexity + FAST_OPS: Ensure Groq is first
  • Code blocks present: Boost OpenAI Codex in any task type
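One way to express these adjustments as list surgery on a ranking (a sketch; the real boost logic may weight rather than reorder):

```python
def adjust_ranking(ranking, task_type, complexity, has_code_blocks):
    """Reorder a copy of the ranking per the three adjustment rules above."""
    ranking = list(ranking)  # never mutate the shared base ranking

    def promote(backend):
        # Move the backend to the front if it is present in the ranking.
        if backend in ranking:
            ranking.remove(backend)
            ranking.insert(0, backend)

    if has_code_blocks:
        promote("openai-codex")   # code blocks boost Codex in any task type
    if complexity == "high" and task_type in ("research", "code"):
        promote("gemini")         # boost Gemini...
        promote("kimi")           # ...then Kimi, so Kimi lands first of the two
    if complexity == "low" and task_type == "fast_ops":
        promote("groq")           # Groq must lead for cheap fast ops
    return ranking
```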

Keyword Dictionaries

The classifier uses curated keyword sets for each task type:

Code Keywords (100+)

  • Implementation: implement, code, function, class, module
  • Debugging: debug, error, exception, traceback, bug, fix
  • Testing: test, pytest, unittest, coverage
  • Operations: deploy, docker, kubernetes, ci/cd, pipeline
  • Concepts: api, endpoint, database, query, authentication

Reasoning Keywords (50+)

  • Analysis: analyze, evaluate, assess, critique, review
  • Logic: reason, deduce, infer, logic, argument, evidence
  • Process: compare, contrast, trade-off, strategy, plan
  • Modifiers: step by step, chain of thought, think through

Research Keywords (80+)

  • Actions: research, find, search, explore, discover
  • Sources: paper, publication, journal, arxiv, dataset
  • Methods: study, survey, experiment, benchmark, evaluation
  • Domains: machine learning, neural network, sota, literature

Creative Keywords (100+)

  • Visual: art, paint, draw, design, graphic, image
  • Writing: write, story, novel, poem, essay, content
  • Audio: music, song, compose, melody, sound
  • Process: brainstorm, ideate, concept, imagine, inspire

Fast Ops Keywords (60+)

  • Simple: quick, fast, brief, simple, easy, status
  • Actions: list, show, get, check, count, find
  • Short queries: hi, hello, thanks, yes/no, what is

Tool Use Keywords (70+)

  • Actions: run, execute, call, use tool, invoke
  • Systems: terminal, shell, docker, kubernetes, git
  • Protocols: api, http, request, response, webhook
  • Agents: delegate, subagent, spawn, mcp
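These dictionaries feed a simple scoring pass. The sketch below uses toy subsets of the keyword sets (the real ones are far larger) and an assumed max-hits scoring rule, falling back to unknown when nothing matches:

```python
# Toy subsets of the keyword dictionaries above; illustrative only.
KEYWORDS = {
    "code": {"implement", "debug", "refactor", "function", "class"},
    "reasoning": {"analyze", "compare", "evaluate", "trade-off"},
    "research": {"research", "paper", "arxiv", "dataset"},
    "creative": {"write", "story", "poem", "brainstorm"},
    "fast_ops": {"quick", "status", "list", "check"},
    "tool_use": {"run", "execute", "invoke", "delegate"},
}

def classify_by_keywords(prompt: str) -> str:
    """Count keyword hits per category and pick the best-scoring one."""
    tokens = set(prompt.lower().split())
    scores = {task: len(tokens & words) for task, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

A real implementation would also handle multi-word keywords like "step by step", which a plain whitespace tokenizer misses.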

API

Classify a Prompt

```python
from task_classifier import TaskClassifier, classify_prompt

# Method 1: Using the class
classifier = TaskClassifier()
result = classifier.classify("Implement a Python function")

print(result.task_type)           # TaskType.CODE
print(result.preferred_backends)  # ["openai-codex", "anthropic", ...]
print(result.complexity)          # ComplexityLevel.LOW
print(result.reason)              # "Task: code; Complexity: low; ..."
print(result.confidence)          # 0.75

# Method 2: Convenience function
output = classify_prompt("Research AI papers")
# Returns dict: {
#   "task_type": "research",
#   "preferred_backends": ["gemini", "kimi", ...],
#   "complexity": "low",
#   "reason": "...",
#   "confidence": 0.65,
#   "features": {...}
# }
```

ClassificationResult Fields

| Field | Type | Description |
|-------|------|-------------|
| task_type | TaskType | Classified task category |
| preferred_backends | List[str] | Ranked list of backend identifiers |
| complexity | ComplexityLevel | Assessed complexity level |
| reason | str | Human-readable classification reasoning |
| confidence | float | 0.0-1.0 confidence score |
| features | Dict | Extracted features (lengths, code, URLs) |
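The fields map naturally onto a dataclass. This is an assumed shape consistent with the table (the TaskType and ComplexityLevel enums are replaced with plain strings here to keep the sketch self-contained):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ClassificationResult:
    task_type: str                 # e.g. "code" (TaskType value)
    preferred_backends: List[str]  # ranked backend identifiers
    complexity: str                # "low" | "medium" | "high"
    reason: str                    # human-readable reasoning
    confidence: float              # 0.0-1.0
    features: Dict[str, object] = field(default_factory=dict)
```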

Integration with Hermes

Usage in Smart Model Routing

The task classifier can replace or augment the existing smart_model_routing.py:

```python
# In hermes-agent/agent/smart_model_routing.py
from uniwizard.task_classifier import TaskClassifier

classifier = TaskClassifier()

def resolve_turn_route(user_message, routing_config, primary, fallback_chain):
    # Classify the prompt
    result = classifier.classify(user_message)

    # Map preferred backends to actual models from fallback_chain
    for backend in result.preferred_backends:
        model_config = fallback_chain.get(backend)
        if model_config and is_available(backend):
            return {
                "model": model_config["model"],
                "provider": backend,
                "reason": result.reason,
                "complexity": result.complexity.value,
            }

    # Fallback to primary
    return primary
```

Configuration

```yaml
# config.yaml
smart_model_routing:
  enabled: true
  use_task_classifier: true

fallback_providers:
  - provider: anthropic
    model: claude-opus-4-6
  - provider: openai-codex
    model: codex
  - provider: gemini
    model: gemini-2.5-flash
  - provider: groq
    model: llama-3.3-70b-versatile
  - provider: grok
    model: grok-3-mini-fast
  - provider: kimi-coding
    model: kimi-k2.5
  - provider: openrouter
    model: openai/gpt-4.1-mini
```
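For resolve_turn_route above, fallback_chain is expected to be keyed by provider. A minimal sketch of building that mapping (the loader function is an assumption; in practice the list would come from parsing config.yaml):

```python
# The fallback_providers list from config.yaml, mirrored as Python data
# (in practice this would be parsed from the YAML file, e.g. with PyYAML).
fallback_providers = [
    {"provider": "anthropic", "model": "claude-opus-4-6"},
    {"provider": "openai-codex", "model": "codex"},
    {"provider": "gemini", "model": "gemini-2.5-flash"},
    {"provider": "groq", "model": "llama-3.3-70b-versatile"},
    {"provider": "grok", "model": "grok-3-mini-fast"},
    {"provider": "kimi-coding", "model": "kimi-k2.5"},
    {"provider": "openrouter", "model": "openai/gpt-4.1-mini"},
]

def build_fallback_chain(providers):
    """Key the provider list by name so a ranked backend list can be resolved in order."""
    return {entry["provider"]: entry for entry in providers}
```

Note one wrinkle: the rankings refer to the backend as "kimi" while the config provider id is "kimi-coding", so the mapping layer must reconcile the two identifiers.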

Testing

Run the test suite:

```shell
cd ~/.timmy/uniwizard
python -m pytest test_task_classifier.py -v
```

Coverage includes:

  • Feature extraction (URLs, code blocks, length metrics)
  • Complexity assessment (low/medium/high)
  • Task type classification (all 6 types)
  • Backend selection (rankings by task type)
  • Complexity adjustments (boosts for Kimi/Gemini)
  • Edge cases (empty, whitespace, very long prompts)
  • Integration scenarios (realistic use cases)

Future Enhancements

  1. Session Context: Use conversation history for better classification
  2. Performance Feedback: Learn from actual backend performance
  3. User Preferences: Allow user-defined backend preferences
  4. Cost Optimization: Factor in backend costs for routing
  5. Streaming Detection: Identify streaming-suitable tasks
  6. Multi-Modal: Better handling of image/audio inputs
  7. Confidence Thresholds: Configurable confidence cutoffs

Files

| File | Description |
|------|-------------|
| task_classifier.py | Main implementation (600+ lines) |
| test_task_classifier.py | Unit tests (400+ lines) |
| task_classifier_design.md | This design document |

References

  • Gitea Issue: timmy-home #88
  • Existing: ~/.hermes/hermes-agent/agent/smart_model_routing.py
  • Config: ~/.hermes/config.yaml (fallback_providers chain)